Continuous Safe RL Baselines¶
This page is a stub for the future continuous-action safe RL baselines tutorial. The current runnable baseline in this part of the codebase is PPO; CPO and PPO Lagrangian are important comparison points, but their docs pages are placeholders rather than tutorial-ready implementations.
Runnable notebook: notebooks/tutorials/07_continuous_safe_rl_baselines.ipynb
Current Scope¶
Baseline |
Status |
Tutorial state |
Safety role |
|---|---|---|---|
|
implemented |
runnable scaffold only |
uses MASA constraint wrappers for logging; no constrained objective in base PPO |
|
mentioned / placeholder docs |
not runnable here yet |
future constrained policy optimization baseline |
|
mentioned / placeholder docs |
not runnable here yet |
future Lagrangian penalty baseline for PPO-style training |
Continuous Cartpole Scaffold¶
cont_cartpole is the tiny continuous-control environment for this tutorial family. It uses:
observation space
Box(4,),action space
Box(1,),label
{"stable"}while cart position and pole angle remain in bounds,cost
0.0when stable and1.0otherwise.
The MASA wrapper setup is:
from masa.common.utils import make_env
from masa.envs.continuous.cartpole import cost_fn, label_fn
env = make_env(
"cont_cartpole",
"cmdp",
200,
label_fn=label_fn,
cost_fn=cost_fn,
budget=0.0,
)
This produces the same info["labels"] and info["constraint"] structure used throughout the earlier tutorials.
PPO Stub¶
The notebook includes a minimal PPO configuration sketch:
PPO_STUB_CONFIG = {
"learning_rate": 3e-4,
"n_steps": 32,
"batch_size": 32,
"n_epochs": 2,
"gamma": 0.99,
"gae_lambda": 0.95,
"clip_range": 0.2,
"device": "cpu",
}
It intentionally does not train. A full tutorial should add vectorization or normalization choices, an evaluation protocol, multiple seeds, and safety-specific comparisons.
Future Comparison Surface¶
The eventual continuous safe RL baselines tutorial should compare:
PPOas the unconstrained neural policy-gradient baseline with MASA safety logging,CPOas a constrained policy optimization baseline,PPO Lagrangianas a learned-penalty baseline.
Until those constrained on-policy baselines are implemented, this page remains a lightweight placeholder that verifies the continuous environment and records the intended tutorial shape.