Continuous Safe RL Baselines

This page is a stub for the future continuous-action safe RL baselines tutorial. The current runnable baseline in this part of the codebase is PPO; CPO and PPO Lagrangian are important comparison points, but their docs pages are placeholders rather than tutorial-ready implementations.

Runnable notebook: notebooks/tutorials/07_continuous_safe_rl_baselines.ipynb

Current Scope

Baseline

Status

Tutorial state

Safety role

PPO

implemented

runnable scaffold only

uses MASA constraint wrappers for logging; no constrained objective in base PPO

CPO

mentioned / placeholder docs

not runnable here yet

future constrained policy optimization baseline

PPO Lagrangian

mentioned / placeholder docs

not runnable here yet

future Lagrangian penalty baseline for PPO-style training

Continuous Cartpole Scaffold

cont_cartpole is the tiny continuous-control environment for this tutorial family. It uses:

  • observation space Box(4,),

  • action space Box(1,),

  • label {"stable"} while cart position and pole angle remain in bounds,

  • cost 0.0 when stable and 1.0 otherwise.

The MASA wrapper setup is:

from masa.common.utils import make_env
from masa.envs.continuous.cartpole import cost_fn, label_fn

env = make_env(
    "cont_cartpole",
    "cmdp",
    200,
    label_fn=label_fn,
    cost_fn=cost_fn,
    budget=0.0,
)

This produces the same info["labels"] and info["constraint"] structure used throughout the earlier tutorials.

PPO Stub

The notebook includes a minimal PPO configuration sketch:

PPO_STUB_CONFIG = {
    "learning_rate": 3e-4,
    "n_steps": 32,
    "batch_size": 32,
    "n_epochs": 2,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "clip_range": 0.2,
    "device": "cpu",
}

It intentionally does not train. A full tutorial should add vectorization or normalization choices, an evaluation protocol, multiple seeds, and safety-specific comparisons.

Future Comparison Surface

The eventual continuous safe RL baselines tutorial should compare:

  • PPO as the unconstrained neural policy-gradient baseline with MASA safety logging,

  • CPO as a constrained policy optimization baseline,

  • PPO Lagrangian as a learned-penalty baseline.

Until those constrained on-policy baselines are implemented, this page remains a lightweight placeholder that verifies the continuous environment and records the intended tutorial shape.