Basic Usage

This page shows the minimal way to use MASA without masa.common.utils.make_env(), by manually constructing a Gymnasium environment and wrapping it in the recommended order:

TimeLimit → LabelledEnv → BaseConstraintEnv → ConstraintMonitor → RewardMonitor

This is the same order enforced by make_env() (notably, TimeLimit must come first).
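
To make the ordering concrete: the wrapper applied first ends up innermost, and the last one applied is the object you interact with. The sketch below does not import MASA or Gymnasium. It uses hypothetical stand-in classes that only borrow the wrapper names, plus a hypothetical wrapper_stack helper, and assumes only that each wrapper keeps the wrapped object in an env attribute, as Gymnasium wrappers do.

```python
class Base:
    """Stand-in for a base environment (hypothetical)."""


class Wrapper:
    """Stand-in wrapper that stores the wrapped object in `.env`."""

    def __init__(self, env):
        self.env = env


# Hypothetical layers named after the MASA wrappers, for illustration only.
class TimeLimit(Wrapper): ...
class LabelledEnv(Wrapper): ...
class ConstraintEnv(Wrapper): ...
class ConstraintMonitor(Wrapper): ...
class RewardMonitor(Wrapper): ...


def wrapper_stack(env):
    """Return class names from the outermost layer down to the base env."""
    names = []
    while True:
        names.append(type(env).__name__)
        if not hasattr(env, "env"):
            return names
        env = env.env


env = Base()
for layer in (TimeLimit, LabelledEnv, ConstraintEnv, ConstraintMonitor, RewardMonitor):
    env = layer(env)  # applied in the recommended order

stack = wrapper_stack(env)
print(" <- ".join(stack))
```

Read from left to right, the printed stack starts at the outermost layer (the last wrapper applied) and ends at the base environment, so TimeLimit sits directly around the base environment.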

Overview

MASA components reason over labels (atomic predicates) derived from observations. The wrapper masa.common.labelled_env.LabelledEnv computes these labels on every gymnasium.Env.reset() and gymnasium.Env.step() and stores them in info["labels"].

Constraints then consume these labels and expose consistent metrics, while the monitor wrappers attach step/episode summaries to the info dictionary for logging and debugging.
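
As a rough illustration of the labelling step, the sketch below shows a wrapper that calls a label function on every reset and step and writes the result to info["labels"]. It is not MASA's LabelledEnv: ToyEnv, ToyLabelWrapper, and toy_label_fn are hypothetical names, and the toy environment only mimics the Gymnasium reset/step return shapes.

```python
class ToyEnv:
    """Hypothetical two-state environment with a Gymnasium-style API."""

    def reset(self, seed=None):
        self.state = 1
        return self.state, {}

    def step(self, action):
        self.state = int(action)
        # obs, reward, terminated, truncated, info
        return self.state, 0.0, False, False, {}


class ToyLabelWrapper:
    """Illustrative stand-in for a labelling wrapper (not MASA's code)."""

    def __init__(self, env, label_fn):
        self.env = env
        self.label_fn = label_fn

    def reset(self, seed=None):
        obs, info = self.env.reset(seed=seed)
        info["labels"] = self.label_fn(obs)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        info["labels"] = self.label_fn(obs)
        return obs, reward, terminated, truncated, info


def toy_label_fn(obs):
    """Label state 0 as unsafe; every other state has no labels."""
    return {"unsafe"} if obs == 0 else set()


env = ToyLabelWrapper(ToyEnv(), toy_label_fn)
_, reset_info = env.reset()
_, _, _, _, step_info = env.step(0)
print(reset_info["labels"], step_info["labels"])
```

Because the labels travel in info rather than in the observation, downstream wrappers can read them without changing what the agent sees.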

Minimal environment construction

import gymnasium as gym

# Core MASA wrappers
from masa.common.wrappers import TimeLimit, ConstraintMonitor, RewardMonitor
from masa.common.labelled_env import LabelledEnv

# Simple Media Streaming environment
from masa.env.tabular.media_streaming import MediaStreaming

# Example constraint wrapper (a BaseConstraintEnv implementation)
from masa.common.constraints.cmdp import CumulativeCostEnv

# --- 1) Define label and cost functions ---

def label_fn(obs):
    """
    Example labelling function for MediaStreaming-like observations.

    Returns:
        set[str]: Atomic predicates holding in the current observation.
    """
    labels = set()
    # This check is illustrative; adapt it to your observation structure.
    try:
        if int(obs) == 0:
            labels.add("unsafe")
    except (TypeError, ValueError):
        # Observation cannot be converted to an integer: emit no labels.
        return set()
    return labels

def cost_fn(labels):
    """
    Example 0/1 cost: unsafe if the 'unsafe' predicate holds.
    """
    return 1.0 if "unsafe" in labels else 0.0

# --- 1.5) Or use the default label_fn and cost_fn supplied by the environment (recommended).
# Note: this import rebinds the names defined above, so keep only one of the two approaches.
from masa.env.tabular.media_streaming import label_fn, cost_fn

# --- 2) Build the environment and wrap in the correct order ---

env = MediaStreaming()

# Recommended: apply TimeLimit first (episode length is enforced before anything else).
env = TimeLimit(env, max_episode_steps=1_000)

# Attach labels to info["labels"] at every reset/step.
env = LabelledEnv(env, label_fn)

# Apply a constraint wrapper (example: cumulative cost/budget style constraint).
# Typical kwargs are shown; consult the constraint's docstring / Constraints API reference.
env = CumulativeCostEnv(env, cost_fn=cost_fn, budget=25.0)

# Finally, attach monitoring wrappers for constraints and reward logging.
env = ConstraintMonitor(env)
env = RewardMonitor(env)
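
This page does not show what the cumulative-cost wrapper does internally. As a hedged sketch of the general idea behind a cost_fn and budget pair, the hypothetical BudgetTracker below sums per-step costs over an episode and treats the constraint as satisfied while the running total stays at or below the budget. The class, its method names, and the returned keys are assumptions for illustration, not MASA's API.

```python
class BudgetTracker:
    """Hypothetical cumulative-cost tracker; not MASA's implementation."""

    def __init__(self, cost_fn, budget):
        self.cost_fn = cost_fn
        self.budget = budget
        self.cum_cost = 0.0

    def reset(self):
        # Start every episode with a fresh total.
        self.cum_cost = 0.0

    def update(self, labels):
        # Add this step's cost and report the running totals.
        cost = self.cost_fn(labels)
        self.cum_cost += cost
        return {
            "cost": cost,
            "cum_cost": self.cum_cost,
            "satisfied": self.cum_cost <= self.budget,
        }


def unsafe_cost(labels):
    """0/1 cost: 1.0 when the 'unsafe' predicate holds."""
    return 1.0 if "unsafe" in labels else 0.0


tracker = BudgetTracker(unsafe_cost, budget=1.0)
tracker.reset()
history = [tracker.update(labels) for labels in ({"unsafe"}, set(), {"unsafe"})]
print(history[-1])
```

With a budget of 1.0, the first unsafe step is tolerated and the second one pushes the running total over the budget.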

Random-agent interaction loop (Gymnasium-style)

num_episodes = 3

for ep in range(num_episodes):
    obs, info = env.reset(seed=ep)

    ep_return = 0.0
    ep_len = 0

    # Your monitors/constraints may attach additional fields; labels are always in info["labels"]
    labels = info.get("labels", set())
    print(f"[episode {ep}] reset labels={labels}")

    terminated = truncated = False
    while not (terminated or truncated):
        action = env.action_space.sample()

        obs, reward, terminated, truncated, info = env.step(action)
        ep_return += float(reward)
        ep_len += 1

        labels = info.get("labels", set())

        # Common pattern: monitors expose step metrics in info (names may vary by constraint).
        # Defaults keep the loop safe when the expected keys are absent.
        step_cost, violated = 0.0, False
        constraint = info.get("constraint", {})
        if isinstance(constraint, dict) and "step" in constraint:
            step_cost = constraint["step"].get("cost", 0.0)
            violated = constraint["step"].get("violated", False)

        if violated:
            print(f"  step={ep_len:04d} VIOLATION labels={labels} cost={step_cost}")

    # Episode-end metrics are often attached on the final transition by the monitors.
    # Again, keys vary; print what you care about.
    # Defaults avoid a NameError when no episode summary is present.
    ep_cost = ep_satisfied = None
    constraint = info.get("constraint", {})
    if isinstance(constraint, dict) and "episode" in constraint:
        ep_cost = constraint["episode"].get("cum_cost", None)
        ep_satisfied = constraint["episode"].get("satisfied", None)

    print(
        f"[episode {ep}] return={ep_return:.2f} len={ep_len} "
        f"episode_cost={ep_cost} episode_satisfied={ep_satisfied}"
    )
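
Since the metric keys vary by constraint, the nested lookups in the loop above can be factored into a small helper that returns a default whenever any level is missing. The get_metric function and the example info payload below are hypothetical and only illustrate the pattern; they are not part of MASA.

```python
def get_metric(info, *path, default=None):
    """Follow `path` through nested dicts, returning `default` on any miss."""
    node = info
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Hypothetical info payload; real key names depend on the constraint in use.
info = {"labels": {"unsafe"}, "constraint": {"step": {"cost": 1.0, "violated": True}}}

step_cost = get_metric(info, "constraint", "step", "cost", default=0.0)
ep_cost = get_metric(info, "constraint", "episode", "cum_cost")
print(step_cost, ep_cost)
```

A helper like this keeps the interaction loop free of unbound variables when a monitor omits a field on a given step.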

Training with PPO

Below is a minimal example showing how to initialize and train PPO (provided by MASA) using the wrapped environment. The specific PPO constructor and train API may include additional options (e.g., logging, eval env, saving); the snippet mirrors the general style used in MASA runs.

from masa.algorithms.ppo import PPO

# (Optional) create a separate evaluation environment with the same wrapper stack.
def make_eval_env():
    eval_env = MediaStreaming()
    eval_env = TimeLimit(eval_env, max_episode_steps=1_000)
    eval_env = LabelledEnv(eval_env, label_fn)
    eval_env = CumulativeCostEnv(eval_env, cost_fn=cost_fn, budget=25.0)
    eval_env = ConstraintMonitor(eval_env)
    eval_env = RewardMonitor(eval_env)
    return eval_env

eval_env = make_eval_env()

# Initialize PPO.
# Common kwargs (device, seed, logging) follow the same pattern as other MASA algorithms.
algo = PPO(
    env,
    seed=0,
    device="auto",
    verbose=1,
    eval_env=eval_env,            # optional
    tensorboard_logdir=None,      # optional
)

# Train PPO. MASA algorithms automatically support eval/log frequencies and windowed stats.
algo.train(
    total_timesteps=200_000,
    num_eval_episodes=10,         # optional
    eval_freq=10_000,             # optional
    log_freq=2_000,               # optional
    stats_window_size=100,        # optional
)

API Reference for make_env()

masa.common.utils.make_env(env_id: str, constraint: str, max_episode_steps: int, *, label_fn: LabelFn | None = None, **constraint_kwargs) -> gym.Env

Construct a fully wrapped MASA environment using the canonical wrapper order.

This helper creates a Gymnasium environment and applies MASA wrappers in the recommended and enforced order:

TimeLimit → LabelledEnv → BaseConstraintEnv → ConstraintMonitor → RewardMonitor

The resulting environment exposes labels, constraint metrics, and reward summaries exclusively via the Gymnasium info dictionary. Observations and rewards themselves are left unchanged.

Parameters:
  • env_id – Environment identifier registered in ENV_REGISTRY.

  • constraint – Constraint identifier registered in CONSTRAINT_REGISTRY.

  • max_episode_steps – Maximum number of steps per episode. Applied via TimeLimit, which is the first wrapper in the stack and therefore the innermost one.

  • label_fn – Optional function mapping observations to atomic predicate labels. If provided, labels are computed on every reset and step and stored under info["labels"].

  • **constraint_kwargs – Additional keyword arguments forwarded to the constraint wrapper constructor.

Returns:

A fully wrapped Gymnasium environment compatible with MASA algorithms, monitors, and logging utilities.

Notes

  • Wrapper order is fixed and enforced.

  • Constraints are reset automatically on environment reset.

  • All semantic metadata (labels, costs, violations, metrics) is communicated via the info dictionary.

Next Steps