Pendulum#

Pendulum is a single-joint swing-up and balance task. The goal is to swing the pendulum up and hold it inverted using a single motor torque.

Task Description#

A single-link pendulum with one hinge joint is driven by a single motor (configurable gear). The motor torque rotates the rod in a plane. The task covers swing-up from arbitrary initial angles followed by sustained balance at the inverted position. Because torque is limited by the actuator ctrlrange, the policy must modulate its magnitude and direction to accumulate energy for the swing-up, then stabilize near the inverted position while damping residual oscillations.

Action Space#

Item	Details
Type	`Box(-1.0, 1.0, (1,), float32)`
Dimension	1
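
As a rough illustration of how a normalized action relates to joint torque, the sketch below assumes that the actuator ctrlrange matches the action bounds of [-1, 1] and uses the gear of 5 noted in the environment configuration. In MuJoCo, a motor's joint torque is the clamped control scaled by the gear. The function name `action_to_torque` is hypothetical and not part of the codebase.

```python
def action_to_torque(action: float, gear: float = 5.0,
                     ctrl_low: float = -1.0, ctrl_high: float = 1.0) -> float:
    """Map a policy action to joint torque (illustrative sketch only)."""
    # Clamp to the assumed ctrlrange; out-of-range actions saturate.
    ctrl = min(max(action, ctrl_low), ctrl_high)
    # A motor actuator applies torque = gear * ctrl at the joint.
    return gear * ctrl


print(action_to_torque(0.5))   # half of the available torque
print(action_to_torque(2.0))   # saturates at the upper limit
```

Under these assumptions the policy can command at most 5 units of torque in either direction.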

Observation Space#

Item	Details
Type	`Box(-inf, inf, (3,), float32)`
Dimension	3

Order: cos(theta), sin(theta), angular velocity.
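
The observation layout above can be sketched as follows. Encoding the angle as a (cos, sin) pair avoids the discontinuity at ±π, so angles that differ by a full turn produce the same observation. The helper names are hypothetical, and the actual environment may build this vector differently.

```python
import math


def make_observation(theta: float, ang_vel: float) -> list[float]:
    """Build the 3-element observation: cos(theta), sin(theta), angular velocity."""
    return [math.cos(theta), math.sin(theta), ang_vel]


def angle_from_observation(obs: list[float]) -> float:
    """Recover the wrapped angle in [-pi, pi] from the first two entries."""
    return math.atan2(obs[1], obs[0])


# At the inverted position (theta = pi) the observation is close to (-1, 0, ang_vel).
print(make_observation(math.pi, 0.0))
```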

Reward Function Design#

Upright reward: encourages the angle to stay near π (inverted)
Energy shaping: encourages total energy close to its value at the inverted position
Penalties: ang_vel^2, ctrl^2, and (ctrl - prev_ctrl)^2, to reduce oscillation and aggressive actuation
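
One way the three terms above could be combined is sketched below. This is not the environment's actual reward: the functional forms, the weights, and the physical parameters (a unit point mass at the end of a unit-length rod) are all assumptions chosen for illustration. The sketch takes theta = 0 as hanging down, so that theta = π is inverted as described above.

```python
import math


def pendulum_reward(theta, ang_vel, ctrl, prev_ctrl,
                    mass=1.0, length=1.0, gravity=9.81,    # assumed physics
                    w_upright=1.0, w_energy=0.5,           # assumed weights
                    w_vel=0.01, w_ctrl=0.01, w_rate=0.01):
    """Illustrative reward: upright term + energy shaping - penalties."""
    # Upright term: 0 when hanging (theta = 0), 1 when inverted (theta = pi).
    upright = 0.5 * (1.0 - math.cos(theta))

    # Energy shaping: compare total energy with the energy at rest when inverted.
    inertia = mass * length ** 2  # point mass at the rod tip (assumption)
    energy = (0.5 * inertia * ang_vel ** 2
              + mass * gravity * length * (1.0 - math.cos(theta)))
    target = 2.0 * mass * gravity * length
    energy_term = 1.0 - min(abs(energy - target) / target, 1.0)

    # Penalties on speed, effort, and change in control between steps.
    penalty = (w_vel * ang_vel ** 2
               + w_ctrl * ctrl ** 2
               + w_rate * (ctrl - prev_ctrl) ** 2)

    return w_upright * upright + w_energy * energy_term - penalty


print(pendulum_reward(math.pi, 0.0, 0.0, 0.0))  # inverted and at rest
print(pendulum_reward(0.0, 0.0, 0.0, 0.0))      # hanging and at rest
```

With these assumed weights, the inverted rest state scores highest, and a control signal that flips sign between steps is penalized more than a steady one through the (ctrl - prev_ctrl)^2 term.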

Initial State#

Angle: randomized in [-π, π]
Angular velocity: small random noise (if configured)
Control history (prev_ctrl): reset to zero
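
The reset behavior listed above can be sketched as follows. The function name and the `vel_noise` parameter are hypothetical, and the uniform noise distribution is an assumption; the source only states that the velocity noise is small and optional.

```python
import math
import random


def sample_initial_state(rng: random.Random, vel_noise: float = 0.0):
    """Sample (theta, ang_vel, prev_ctrl) for a new episode (illustrative)."""
    theta = rng.uniform(-math.pi, math.pi)        # any starting angle
    ang_vel = rng.uniform(-vel_noise, vel_noise)  # zero when noise is disabled
    prev_ctrl = 0.0                               # control history cleared
    return theta, ang_vel, prev_ctrl


rng = random.Random(42)
print(sample_initial_state(rng, vel_noise=0.01))
```

Resetting `prev_ctrl` to zero means a control-change penalty on the first step of an episode measures the first action against a zero baseline.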

Episode Termination Conditions#

No fall- or angle-based termination; the only early-termination condition is a NaN check
Episode length is limited by max_episode_seconds
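
A minimal sketch of these two conditions, assuming the NaN check inspects the observation and the time limit is compared against elapsed simulated time. The function name and the terminated/truncated split are illustrative conventions, not taken from the codebase.

```python
import math


def episode_done(obs, elapsed_seconds: float, max_episode_seconds: float = 20.0):
    """Return (terminated, truncated) for the current step (illustrative)."""
    # Early termination only if the simulation produced a NaN.
    terminated = any(math.isnan(x) for x in obs)
    # Otherwise the episode ends when the time limit is reached.
    truncated = elapsed_seconds >= max_episode_seconds
    return terminated, truncated


print(episode_done([-1.0, 0.0, 0.0], elapsed_seconds=5.0))
```

Since no angle triggers termination, the pendulum may pass through any configuration during an episode, including falling back down after reaching the top.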

1. Environment Preview#

uv run scripts/view.py --env pendulum

2. Start Training#

# Train with default parameters
uv run scripts/train.py --env pendulum

# Customize parallel environments
uv run scripts/train.py --env pendulum --num-envs 1024

# Enable rendering during training
uv run scripts/train.py --env pendulum --render

3. View Training Progress#

uv run tensorboard --logdir runs/pendulum

4. Test Training Results#

# Auto-discover best policy (recommended)
uv run scripts/play.py --env pendulum

# Manually specify a policy file
uv run scripts/play.py --env pendulum --policy runs/pendulum/nn/best_policy.pickle

Tip: Policies are auto-selected from runs/pendulum/. You can override with --policy.

Configuration Parameters#

Environment Configuration#

@dataclass
class PendulumEnvCfg(EnvCfg):
    model_file: str = ".../pendulum.xml"  # MJCF model (gear=5)
    max_episode_seconds: float = 20.0
    sim_dt: float = 0.0125
    ctrl_dt: float = 0.025
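
The timing values above imply the following arithmetic, assuming the environment advances the physics by ctrl_dt / sim_dt simulation steps for each policy action (a common convention, but not confirmed by the source):

```python
sim_dt = 0.0125
ctrl_dt = 0.025
max_episode_seconds = 20.0

# Physics steps executed for each policy action (assumed decimation).
physics_steps_per_action = round(ctrl_dt / sim_dt)               # 2

# Policy steps in a full-length episode.
policy_steps_per_episode = round(max_episode_seconds / ctrl_dt)  # 800

# Control frequency in Hz (approximately 40).
control_rate_hz = 1.0 / ctrl_dt

print(physics_steps_per_action, policy_steps_per_episode, control_rate_hz)
```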

Training Configuration (PPO example)#

@rlcfg("pendulum")
@dataclass
class PendulumPPO(PPOCfg):
    seed: int = 42
    max_env_steps: int = 10_000_000
    num_envs: int = 1024
    learning_rate: float = 3e-4
    rollouts: int = 32
    learning_epochs: int = 5
    mini_batches: int = 4
    policy_hidden_layer_sizes: tuple[int, ...] = (64, 64)
    value_hidden_layer_sizes: tuple[int, ...] = (64, 64)
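
To get a feel for the scale of this configuration, the sketch below works out the batch sizes it implies. It assumes that `rollouts` is the number of steps collected per environment before each update, that `mini_batches` is the number of mini-batches per epoch, and that `max_env_steps` counts transitions summed over all environments. These interpretations are assumptions and should be checked against the PPOCfg implementation.

```python
num_envs = 1024
rollouts = 32
learning_epochs = 5
mini_batches = 4
max_env_steps = 10_000_000

# Transitions gathered before each policy update.
transitions_per_update = num_envs * rollouts                # 32768

# Samples in each mini-batch.
mini_batch_size = transitions_per_update // mini_batches    # 8192

# Gradient steps performed per update.
gradient_steps_per_update = learning_epochs * mini_batches  # 20

# Full policy updates that fit within the step budget.
policy_updates = max_env_steps // transitions_per_update    # 305

print(transitions_per_update, mini_batch_size,
      gradient_steps_per_update, policy_updates)
```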

Expected Training Results#

The pendulum swings up and stays near the inverted position
Oscillation around the upright position is reduced by the angular-velocity and control-change penalties