Tabular Safe RL Baselines¶
This tutorial compares MASA’s tabular safe RL baselines on one constrained tabular environment. The point is to read the reward/safety tradeoff each method encodes, not to rank algorithms from a tiny run.
Runnable notebook: notebooks/tutorials/06_tabular_safe_rl_baselines.ipynb
Probe the Signals¶
Before training, run scripted rollouts to inspect the data that algorithms receive:
unsafe script: seed 1, actions [2, 2, 2, 2], reaches blue,
goal script: seed 4, actions [2] * 8 + [1] * 8, reaches goal.
The unsafe script shows cost=1.0, violation=1.0, cum_cost > 0, and satisfied=0.0. The goal script shows reward without violating the CMDP budget.
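A scripted probe of this kind can be sketched as follows. This is a minimal illustration, not the tutorial's code: `ToyCostEnv` is a hypothetical stand-in for the real environment, and the sketch assumes a Gymnasium-style `reset(seed=...)` / `step(action)` API whose `info` dict carries the `cost`, `violation`, `cum_cost`, and `satisfied` fields quoted above. Where MASA actually reports those fields may differ.

```python
class ToyCostEnv:
    """Hypothetical stand-in for a constrained tabular env (not a MASA class).

    The agent walks along a line of 5 cells; cell 2 is unsafe, cell 4 is the goal.
    """

    UNSAFE, GOAL = 2, 4

    def reset(self, seed=None):
        # The toy dynamics are deterministic, so the seed is accepted but unused.
        self.pos, self.cum_cost = 0, 0.0
        return self.pos, {}

    def step(self, action):
        # Action 1 moves one cell to the right; any other action stays put.
        self.pos = min(self.pos + (1 if action == 1 else 0), self.GOAL)
        cost = 1.0 if self.pos == self.UNSAFE else 0.0
        self.cum_cost += cost
        info = {
            "cost": cost,
            "violation": cost,
            "cum_cost": self.cum_cost,
            "satisfied": 0.0 if self.cum_cost > 0 else 1.0,
        }
        reward = 1.0 if self.pos == self.GOAL else 0.0
        terminated = self.pos == self.GOAL
        return self.pos, reward, terminated, False, info


def run_script(env, seed, actions):
    """Replay a fixed action list and record the per-step signals."""
    env.reset(seed=seed)
    trace = []
    for action in actions:
        _, reward, terminated, truncated, info = env.step(action)
        trace.append({"reward": reward, **info})
        if terminated or truncated:
            break
    return trace


trace = run_script(ToyCostEnv(), seed=1, actions=[1, 1, 1, 1])
for row in trace:
    print(row)
```

Printing the full trace, rather than only the final step, makes it easy to see at which step the cost is incurred and that `cum_cost` stays positive afterwards.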
Algorithms Compared¶
| Algorithm | Class | Safety mechanism |
|---|---|---|
| q_learning | QL | Task … |
| q_learning_lambda | QL_Lambda | Subtracts … |
| lcrl | LCRL | Maps violations to an absorbing-style value based on … |
| sem | SEM | Learns task … |
| recreg | RECREG | Learns task … |
The tutorial uses this compact registry:
ALGORITHMS = {
    "q_learning": (QL, {}),
    "q_learning_lambda": (QL_Lambda, {"cost_lambda": 1.0}),
    "lcrl": (LCRL, {"r_min": -1.0}),
    "sem": (SEM, {"r_min": -1.0, "cost_coef": 1.0}),
    "recreg": (
        RECREG,
        {
            "mode": "model_free",
            "model_checking": "none",
            "horizon": 3,
            "safety_prob": 0.2,
            "cost_coef": 2.0,
        },
    ),
}
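Each registry entry pairs a class with its keyword arguments, so a comparison loop can build every algorithm the same way. The sketch below shows that unpacking pattern with stub classes. The constructor form `cls(env, **kwargs)` is an assumption made for illustration; the real MASA constructors may take other arguments.

```python
class StubAlgo:
    """Hypothetical placeholder for a MASA algorithm class."""

    def __init__(self, env, **hyperparams):
        self.env = env
        self.hyperparams = hyperparams


class StubQL(StubAlgo):
    pass


class StubQLLambda(StubAlgo):
    pass


# Same (class, kwargs) layout as the registry above, with stub classes.
STUB_ALGORITHMS = {
    "q_learning": (StubQL, {}),
    "q_learning_lambda": (StubQLLambda, {"cost_lambda": 1.0}),
}


def build_all(registry, env):
    """Instantiate every registered algorithm on the same environment."""
    return {name: cls(env, **kwargs) for name, (cls, kwargs) in registry.items()}


algos = build_all(STUB_ALGORITHMS, env=object())
for name, algo in algos.items():
    print(name, type(algo).__name__, algo.hyperparams)
```

Keeping hyperparameters in the registry rather than in the loop body means that adding a baseline only requires one new entry.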
Tiny Smoke Runs¶
Each algorithm is trained with the same tiny configuration:
algo.train(
    num_frames=20,
    eval_freq=10,
    log_freq=10,
    num_eval_episodes=1,
    stats_window_size=5,
)
These runs are intentionally too small to support performance claims. They are diagnostics that confirm:
all five algorithms can train on the same MASA wrapper stack,
train/eval logging includes reward and constraint metrics,
the learned objects expose the expected state tables.
Reading the Learned State¶
After training, inspect the object shapes:
QL, QL_Lambda, and LCRL expose Q,
SEM exposes Q, D, and C,
RECREG exposes Q, backup table B, and risk table S.
Use the printed train/rollout, train/stats, and eval/rollout blocks to interpret reward and safety together. For example, RECREG may add override_rate to episode metrics when it replaces a risky task action with a backup action.
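A small helper can report which of these tables a trained object exposes and what shape each has. This is a sketch under assumptions: it treats each table as a 2-D state-by-action structure stored in an attribute named `Q`, `D`, `C`, `B`, or `S`, and it uses a hypothetical `SimpleNamespace` object in place of a trained MASA algorithm.

```python
from types import SimpleNamespace


def table_shapes(algo, names=("Q", "D", "C", "B", "S")):
    """Return {table name: shape} for each table attribute the object exposes."""
    shapes = {}
    for name in names:
        table = getattr(algo, name, None)
        if table is None:
            continue
        # Array-like objects usually carry .shape; fall back to nested lists.
        shape = getattr(table, "shape", None)
        if shape is None:
            shape = (len(table), len(table[0]))
        shapes[name] = tuple(shape)
    return shapes


def zeros(num_states=3, num_actions=2):
    """Build a zero-filled state-by-action table as nested lists."""
    return [[0.0] * num_actions for _ in range(num_states)]


# Hypothetical SEM-like object with 3 states and 2 actions.
sem_like = SimpleNamespace(Q=zeros(), D=zeros(), C=zeros())
print(table_shapes(sem_like))
```

Tables that an algorithm does not define are simply skipped, so the same helper can be run over all five baselines.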
What to Take Away¶
The algorithms differ less in the environment interaction loop than in how they turn cost and violation signals into learning or action-selection pressure:
QL is the task-only baseline,
QL_Lambda is a soft penalty baseline,
LCRL makes violations strongly unattractive,
SEM separates task and safety estimates,
RECREG introduces backup-policy intervention.
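The difference between the task-only, soft-penalty, and violation-averse baselines can be illustrated as three one-step targets for a tabular Q update. This is a textbook-style sketch, not MASA's implementation. The function names are hypothetical, the penalty form `reward - cost_lambda * cost` is one common reading of "soft penalty", and `r_min / (1 - gamma)` (the value of receiving `r_min` forever) is one possible reading of an absorbing-style violation value.

```python
def td_target_task_only(reward, next_q_max, gamma):
    """Standard Q-learning target: cost and violation signals are ignored."""
    return reward + gamma * next_q_max


def td_target_penalized(reward, cost, next_q_max, gamma, cost_lambda=1.0):
    """Soft penalty: the cost is scaled and subtracted from the reward."""
    return (reward - cost_lambda * cost) + gamma * next_q_max


def td_target_absorbing(reward, violation, next_q_max, gamma, r_min=-1.0):
    """On a violation, use the value of an absorbing state paying r_min forever."""
    if violation > 0:
        return r_min / (1.0 - gamma)
    return reward + gamma * next_q_max


# One unsafe transition, scored three ways (illustrative numbers only).
step = {"reward": 0.5, "cost": 1.0, "violation": 1.0}
next_q_max, gamma = 2.0, 0.5

targets = {
    "task_only": td_target_task_only(step["reward"], next_q_max, gamma),
    "penalized": td_target_penalized(step["reward"], step["cost"], next_q_max, gamma),
    "absorbing": td_target_absorbing(step["reward"], step["violation"], next_q_max, gamma),
}
print(targets)
```

Under these assumptions the same transition yields a progressively lower target, which is the sense in which the methods apply different amounts of pressure against unsafe actions.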
For real comparisons, increase training frames, run multiple seeds, and summarize confidence intervals. Keep this tutorial run as a quick wiring and interpretation check.
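One simple way to summarize results across seeds is a mean with a normal-approximation interval. The sketch below uses only the standard library. The per-seed values are made up for illustration, and with only a handful of seeds the normal approximation is optimistic, so a t-based or bootstrap interval may be preferable.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Mean and normal-approximation 95% interval across per-seed results."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.fmean(values)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * statistics.stdev(values) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Made-up per-seed evaluation returns, for illustration only.
per_seed_return = [0.8, 1.0, 0.9, 1.1, 0.7]
mean, low, high = mean_with_ci(per_seed_return)
print(f"{mean:.2f} [{low:.2f}, {high:.2f}]")
```

The same function can be applied to a cost or violation metric, so that reward and safety are reported with matching uncertainty estimates.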