PPO

Source: masa/algorithms/ppo/ppo.py

PPO is MASA’s main general-purpose deep RL algorithm. It shares the same actor-critic backbone as A2C, but optimizes the clipped PPO surrogate objective over multiple minibatch epochs for each rollout.
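
To make the objective concrete, here is a minimal PyTorch-style sketch of the clipped surrogate loss. The tensor names (log_prob_new, log_prob_old, advantages) are illustrative assumptions, not variable names taken from ppo.py; only the role of clip_range comes from the description above.

```python
import torch

def clipped_surrogate_loss(log_prob_new: torch.Tensor,
                           log_prob_old: torch.Tensor,
                           advantages: torch.Tensor,
                           clip_range: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate objective, negated so it can be minimized."""
    # Probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(log_prob_new - log_prob_old)
    # Unclipped and clipped policy-gradient terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Take the pessimistic (element-wise minimum) term and average over the minibatch.
    return -torch.min(unclipped, clipped).mean()
```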

Key Details

  • collects on-policy rollouts with the shared OnPolicyAlgorithm machinery

  • uses generalized advantage estimation (GAE) to compute advantage targets (see the sketch after this list)

  • optimizes a clipped surrogate objective controlled by clip_range

  • trains over minibatches for multiple epochs per rollout
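
The sketch below illustrates the remaining mechanics from the list above: computing GAE advantages from one rollout and iterating over shuffled minibatches for several epochs. The function names, array conventions, and the gamma/gae_lambda defaults are assumptions for illustration; they are not taken from the shared OnPolicyAlgorithm code.

```python
import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized advantage estimation over one rollout (NumPy arrays of length T)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_non_terminal = 1.0 - dones[t]
        # One-step TD error, then the exponentially weighted (lambda) accumulation.
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        gae = delta + gamma * gae_lambda * next_non_terminal * gae
        advantages[t] = gae
    # Return advantages plus the value targets (returns) used for the critic loss.
    return advantages, advantages + values

def minibatch_indices(n_samples, batch_size, n_epochs, rng=None):
    """Yield shuffled minibatch index arrays for several passes over the rollout."""
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            yield order[start:start + batch_size]
```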

Implementation Notes

The implementation uses the shared policy family in masa/common/policies.py, so the main differences from A2C are in the loss and optimization schedule rather than the model structure.

Through the shared policy and action-formatting code, PPO supports several Gymnasium action-space types, including discrete and continuous (Box) control settings.
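
As a hedged illustration of how an actor-critic policy typically maps action spaces to action distributions (this is a generic sketch, not MASA's actual policy classes), consider:

```python
import gymnasium as gym
import torch
from torch.distributions import Categorical, Normal

def make_action_distribution(action_space: gym.Space,
                             policy_out: torch.Tensor,
                             log_std: torch.Tensor | None = None):
    """Map the policy head's output to an action distribution for the given space."""
    if isinstance(action_space, gym.spaces.Discrete):
        # Discrete control: policy_out is interpreted as action logits.
        return Categorical(logits=policy_out)
    if isinstance(action_space, gym.spaces.Box):
        # Continuous control: policy_out is the mean of a diagonal Gaussian with a
        # (typically state-independent) learned log standard deviation.
        assert log_std is not None, "Box action spaces need a log_std parameter"
        return Normal(policy_out, log_std.exp())
    raise NotImplementedError(f"Unsupported action space: {type(action_space)}")
```

However the distribution is built, the rest of the training loop only needs its sample() and log_prob() methods, which is why the loss and optimization schedule can stay independent of the action-space type.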

When To Use It

Use PPO when:

  • you want the main neural RL baseline in MASA

  • you need a robust on-policy actor-critic method

  • you want the algorithm used by the shielding examples before moving to shield-parameterized variants