Probabilistic Shielding MiniPacman¶
This tutorial adapts masa/examples/prob_shield_example.py into a guided inspection of probabilistic shielding. Instead of running a long PPO training job, it builds the shielded MiniPacman environment and inspects what the wrapper adds.
Runnable notebook: notebooks/tutorials/09_probabilistic_shielding_minipacman.ipynb
Build MiniPacman with PCTL¶
MiniPacman is a discrete tabular environment with an exposed transition model. That makes it suitable for ProbShieldWrapperDisc, which needs known safety dynamics.
from masa.common.utils import make_env
from masa.envs.tabular.mini_pacman import cost_fn, label_fn
env = make_env(
"mini_pacman",
"pctl",
100,
label_fn=label_fn,
cost_fn=cost_fn,
alpha=0.01,
)
The pctl constraint uses cost_fn to mark ghost collisions as unsafe and tracks whether the episode satisfies the probabilistic safety threshold.
Add the Shield¶
The script example wraps that environment with ProbShieldWrapperDisc:
from masa.prob_shield.prob_shield_wrapper_v1 import ProbShieldWrapperDisc
shielded_env = ProbShieldWrapperDisc(
env,
init_safety_bound=0.01,
theta=1e-12,
max_vi_steps=2_000,
granularity=10,
)
During construction, the wrapper:
builds
successor_states_matrixandprobabilities,computes interval safety bounds in
safety_lb,records
max_successors,replaces the observation space with a dictionary containing
orig_obsandsafety_bound,replaces the action space with a
MultiDiscreteaugmented action format.
Inspect the Augmented Interface¶
After reset, the observation has two important pieces:
obs, info = shielded_env.reset(seed=0)
obs["orig_obs"] # original MiniPacman discrete state
obs["safety_bound"] # remaining safety budget
The augmented action format is:
[primary_action, fallback_action, beta_0, beta_1, ..., beta_K]
The first two entries select candidate MiniPacman actions. The remaining entries encode candidate successor safety-bound choices from 0 to granularity.
Projection Geometry on a Toy MDP¶
The projection step follows the geometry described in the probabilistic shielding paper: action choices live in a probability simplex, and the safety budget cuts out the unsafe part of that simplex. The tiny MDP below uses exact reach-unsafe values so the clipping operation is visible before we inspect the larger MiniPacman state.
For budget q=0.10, _project_act keeps action distributions inside the green safe half-space 0.02*pi0 + 0.08*pi1 + 0.20*pi2 <= 0.10. The red part of the simplex is the clipped-away region above that budget.¶
Inspect Projected Safe Actions¶
The notebook uses _parse_act and _project_act to show how candidate augmented actions are projected. These are internal inspection hooks, not a stable public API.
The projection returns:
safe_actions: a probability distribution over original MiniPacman actions,bounds: projected successor safety bounds,proj_penalty: how much the proposed bounds were changed,margin_penalty: a diagnostic margin term.
That inspection is the heart of the tutorial: it shows how the same MiniPacman action is filtered through the safety budget before the wrapped environment is stepped.
Step Once¶
When the shielded environment steps:
the wrapper parses the augmented action,
projects it to a safe distribution over original actions,
samples an original MiniPacman action,
steps the PCTL-constrained environment,
updates the next observation’s
safety_bound.
The returned info still contains MASA constraint data:
info["constraint"]
info["margin_penalty"]
info["proj_penalty"]
Where Training Fits¶
The original example trains PPO on the shielded environment. This tutorial intentionally stops before training so it stays fast and readable. A full training run would pass one shielded env to PPO for training and a separate shielded env for evaluation.