Tabular Safe RL Baselines

This tutorial compares MASA’s tabular safe RL baselines on one constrained tabular environment. The point is to read the reward/safety tradeoff each method encodes, not to rank algorithms from a tiny run.

Runnable notebook: notebooks/tutorials/06_tabular_safe_rl_baselines.ipynb

Shared Setup

Apply a CPU-first setup before importing any MASA/JAX modules. setdefault leaves a value untouched if it is already set in the shell:

import os

os.environ.setdefault("JAX_PLATFORMS", "cpu")
os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")

All algorithms use the same constrained environment:

from masa.common.utils import make_env
from masa.envs.tabular.colour_grid_world import cost_fn, label_fn

def make_train_env():
    return make_env(
        "colour_grid_world",
        "cmdp",
        40,
        label_fn=label_fn,
        cost_fn=cost_fn,
        budget=0.0,
    )

colour_grid_world is small enough for tutorial execution, has discrete states/actions, and separates sparse reward from safety cost:

  • goal gives task reward,

  • blue gives safety cost,

  • the CMDP budget is 0.0.

Probe the Signals

Before training, run scripted rollouts to inspect the data that algorithms receive:

  • unsafe script: seed 1, actions [2, 2, 2, 2], reaches blue,

  • goal script: seed 4, actions [2] * 8 + [1] * 8, reaches goal.

The unsafe script shows cost=1.0, violation=1.0, cum_cost > 0, and satisfied=0.0. The goal script shows reward without violating the CMDP budget.
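The constraint fields above can be reproduced by hand. The following is a minimal sketch, assuming that cum_cost is the running sum of per-step cost and that satisfied is 1.0 while cum_cost stays within the budget. The helper name constraint_signals and the cost sequences are hypothetical and are not part of MASA:

```python
def constraint_signals(costs, budget=0.0):
    """Return (cum_cost, satisfied) after each step of one episode.

    Assumption: satisfied is 1.0 while the running cost is within budget.
    """
    signals = []
    cum_cost = 0.0
    for cost in costs:
        cum_cost += cost
        satisfied = 1.0 if cum_cost <= budget else 0.0
        signals.append((cum_cost, satisfied))
    return signals


# Hypothetical per-step costs: three cost-free steps, then one costly step.
unsafe_trace = constraint_signals([0.0, 0.0, 0.0, 1.0], budget=0.0)
# Hypothetical cost-free episode of 16 steps.
safe_trace = constraint_signals([0.0] * 16, budget=0.0)
print(unsafe_trace[-1])
print(safe_trace[-1])
```

With a budget of 0.0, a single unit of cost is enough to flip satisfied to 0.0 for the rest of the episode, which is why one blue cell is sufficient to mark the unsafe script as a violation.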

Algorithms Compared

| Algorithm | Class | Safety mechanism |
|---|---|---|
| q_learning | QL | Task Q table only; costs are logged but not penalized in the update target. |
| q_learning_lambda | QL_Lambda | Subtracts cost_lambda * cost from the reward target. |
| lcrl | LCRL | Maps violations to an absorbing-style value based on r_min / (1 - gamma). |
| sem | SEM | Learns task Q plus auxiliary D and C safety tables that alter action selection. |
| recreg | RECREG | Learns task Q, backup B, risk S, and can report override_rate when actions are replaced. |
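To make the first three mechanisms concrete, here is a rough sketch of how the one-step update target could differ between them. It follows only the descriptions above; the function names, signatures, and the exact handling of violations are assumptions for illustration and may not match MASA's implementation:

```python
def ql_target(reward, next_q, gamma, done):
    """Plain Q-learning target: the cost signal is ignored."""
    bootstrap = 0.0 if done else gamma * max(next_q)
    return reward + bootstrap


def ql_lambda_target(reward, cost, next_q, gamma, done, cost_lambda=1.0):
    """Penalized target: subtract cost_lambda * cost from the reward."""
    return ql_target(reward - cost_lambda * cost, next_q, gamma, done)


def lcrl_style_target(reward, violation, next_q, gamma, done, r_min=-1.0):
    """On a violation, use the value of receiving r_min at every future step."""
    if violation:
        return r_min / (1.0 - gamma)
    return ql_target(reward, next_q, gamma, done)


# Made-up next-state action values and discount factor.
next_q = [0.0, 0.5]
gamma = 0.9
plain = ql_target(0.0, next_q, gamma, done=False)
penalized = ql_lambda_target(0.0, 1.0, next_q, gamma, done=False)
absorbing = lcrl_style_target(0.0, True, next_q, gamma, done=False)
print(plain, penalized, absorbing)
```

For the same unsafe transition, the plain target stays positive (about 0.45), the penalized target drops by cost_lambda (to about -0.55), and the absorbing-style target jumps to roughly r_min / (1 - gamma), or about -10 here. That ordering is the sense in which the penalty is "soft" and the absorbing value is "strong".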

The tutorial uses this compact registry:

ALGORITHMS = {
    "q_learning": (QL, {}),
    "q_learning_lambda": (QL_Lambda, {"cost_lambda": 1.0}),
    "lcrl": (LCRL, {"r_min": -1.0}),
    "sem": (SEM, {"r_min": -1.0, "cost_coef": 1.0}),
    "recreg": (
        RECREG,
        {
            "mode": "model_free",
            "model_checking": "none",
            "horizon": 3,
            "safety_prob": 0.2,
            "cost_coef": 2.0,
        },
    ),
}

Tiny Smoke Runs

Each algorithm is trained with the same tiny configuration:

algo.train(
    num_frames=20,
    eval_freq=10,
    log_freq=10,
    num_eval_episodes=1,
    stats_window_size=5,
)

These runs are intentionally too small to support performance claims. They are smoke tests that check that:

  • all five algorithms can train on the same MASA wrapper stack,

  • train/eval logging includes reward and constraint metrics,

  • the learned objects expose the expected state tables.

Reading the Learned State

After training, inspect the object shapes:

  • QL, QL_Lambda, and LCRL expose Q,

  • SEM exposes Q, D, and C,

  • RECREG exposes Q, backup table B, and risk table S.
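One way to run this inspection uniformly across algorithms is a small helper that reports whichever tables an object exposes. This is a sketch only: it assumes the tables are attributes named exactly Q, D, C, B, and S, and that each is either an array with a shape attribute or a nested list. The helper names and the stand-in object are hypothetical:

```python
from types import SimpleNamespace

TABLE_NAMES = ("Q", "D", "C", "B", "S")


def shape_of(table):
    """Return the shape of an array-like, falling back to nested-list lengths."""
    shape = getattr(table, "shape", None)
    if shape is not None:
        return tuple(shape)
    return (len(table), len(table[0]))


def table_shapes(algo, names=TABLE_NAMES):
    """Map each table attribute present on `algo` to its shape."""
    shapes = {}
    for name in names:
        table = getattr(algo, name, None)
        if table is not None:
            shapes[name] = shape_of(table)
    return shapes


def zeros(n_states, n_actions):
    """Build a nested-list table of zeros."""
    return [[0.0] * n_actions for _ in range(n_states)]


# Stand-in for a trained SEM-like object; the 4 x 2 size is made up.
fake_sem = SimpleNamespace(Q=zeros(4, 2), D=zeros(4, 2), C=zeros(4, 2))
shapes = table_shapes(fake_sem)
print(shapes)
```

Applied to the real trained objects, the same call should list only Q for the first three algorithms and the additional tables for SEM and RECREG, provided the attribute names match the assumption above.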

Use the printed train/rollout, train/stats, and eval/rollout blocks to interpret reward and safety together. For example, RECREG may add override_rate to episode metrics when it replaces a risky task action with a backup action.
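The idea behind an override metric can be illustrated with a toy action-selection rule. This is not RECREG's actual rule: it assumes that S holds a per-action risk estimate, that B ranks backup actions, and that a hypothetical risk_threshold decides when to intervene. How MASA relates this to settings such as safety_prob is not specified here:

```python
def argmax(row):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def select_action(q_row, b_row, s_row, risk_threshold):
    """Pick the greedy task action unless its estimated risk is too high.

    Returns (action, overridden).
    """
    task_action = argmax(q_row)
    if s_row[task_action] > risk_threshold:
        backup_action = argmax(b_row)
        return backup_action, backup_action != task_action
    return task_action, False


# Made-up table rows for three states with two actions each.
Q = [[1.0, 0.0], [0.2, 0.8], [0.5, 0.1]]
B = [[0.0, 1.0], [0.9, 0.1], [0.3, 0.7]]
S = [[0.9, 0.0], [0.1, 0.1], [0.0, 0.6]]

decisions = [select_action(Q[s], B[s], S[s], risk_threshold=0.2) for s in range(3)]
override_rate = sum(overridden for _, overridden in decisions) / len(decisions)
print(decisions, override_rate)
```

In this toy example only the first state triggers a replacement, so the override rate is one in three. A high override rate in real logs would suggest the task policy keeps proposing actions the safety estimate rejects.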

What to Take Away

The algorithms differ less in the environment interaction loop than in how they turn cost and violation signals into learning or action-selection pressure:

  • QL is the task-only baseline,

  • QL_Lambda is a soft penalty baseline,

  • LCRL makes violations strongly unattractive,

  • SEM separates task and safety estimates,

  • RECREG introduces backup-policy intervention.

For real comparisons, increase the number of training frames, run multiple seeds, and report results with confidence intervals. Keep this tutorial run as a quick wiring and interpretation check.
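A multi-seed summary might look like the sketch below. It uses a normal-approximation 95% interval over per-seed evaluation returns; the helper name and the numbers are placeholders, not MASA results:

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    mean = statistics.fmean(values)
    if len(values) < 2:
        # A single seed gives no spread estimate.
        return mean, float("nan")
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean, z * std_err


# Placeholder per-seed evaluation returns for one algorithm.
returns_by_seed = [0.0, 1.0, 1.0, 0.0, 1.0]
mean, half_width = mean_with_ci(returns_by_seed)
print(f"{mean:.2f} +/- {half_width:.2f}")
```

With only a handful of seeds, the normal approximation understates the uncertainty; a t-distribution critical value or a bootstrap interval is the more cautious choice. The same summary should be computed for the cost or violation metric so that reward and safety are reported side by side.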