Linear Quadratic Regulator#
LQR (Linear Quadratic Regulator) is a classic continuous control and stabilization task. This repository currently provides two variants:
dm-lqr-2-1: two masses connected by a rope, with only the last mass actuated
dm-lqr-6-2: six masses connected as a chain, with only the last two masses actuated
The goal is to drive the whole system back to the center and keep it near equilibrium with minimal control effort.
Task Description#
Both tasks can be viewed as one-dimensional spring-damper chain stabilization problems. Each mass has a single translational degree of freedom along the x-axis. Neighboring masses are coupled by rope-like spring forces, and the system is affected by:
body damping on each mass
spring forces and relative damping between neighboring masses
a center-restoring force pulling the system toward the origin
control inputs applied only to the actuated terminal degrees of freedom
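The four effects listed above can be combined into one acceleration function. The sketch below is an illustrative model only, not the simulator's implementation: the function name `chain_accelerations`, the unit masses, the per-mass center-restoring force, and every coefficient value (`k_spring`, `c_rel`, `c_body`, `k_center`) are assumptions made for the example.

```python
def chain_accelerations(qpos, qvel, ctrl, n_actuated,
                        k_spring=1.0, c_rel=0.1, c_body=0.05, k_center=0.2):
    """Accelerations of a 1-D chain of unit masses (illustrative model only)."""
    n = len(qpos)
    acc = [0.0] * n
    for i in range(n):
        # Body damping and center-restoring force (assumed to act on every mass).
        force = -c_body * qvel[i] - k_center * qpos[i]
        # Spring force and relative damping from each existing neighbor.
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                force += k_spring * (qpos[j] - qpos[i]) + c_rel * (qvel[j] - qvel[i])
        acc[i] = force  # unit mass, so acceleration equals force
    # Control inputs act only on the last n_actuated masses.
    for k in range(n_actuated):
        acc[n - n_actuated + k] += ctrl[k]
    return acc
```

For the two-mass case, `chain_accelerations([0.0, 0.0], [0.0, 0.0], [1.0], n_actuated=1)` accelerates only the last mass, which then has to drag the first one through the coupling terms.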
In practice:
dm-lqr-2-1 is the simpler version and is useful for verifying whether the policy can learn a stable equilibrium
dm-lqr-6-2 is more difficult because the controller must propagate its effect through a longer chain
Action Space#
dm-lqr-2-1#
| Item | Details |
|---|---|
| Type | |
| Dimension | 1 |

| Index | Action Description | Min | Max | XML Joint |
|---|---|---|---|---|
| 0 | Control input applied to the last mass | -1.0 | 1.0 | |
dm-lqr-6-2#
| Item | Details |
|---|---|
| Type | |
| Dimension | 2 |

| Index | Action Description | Min | Max | XML Joint |
|---|---|---|---|---|
| 0 | Control input applied to the second-last mass | -1.0 | 1.0 | |
| 1 | Control input applied to the last mass | -1.0 | 1.0 | |
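Both action spaces share the same bounds and differ only in dimension. A caller that wants to validate a policy output before stepping the environment could do so as sketched below. The helper `clip_action` and the `ACTION_DIMS` mapping are hypothetical names for this example, and whether the environment itself clips out-of-range actions is not stated here.

```python
# Dimensions and bounds taken from the action space tables for each variant.
ACTION_DIMS = {"dm-lqr-2-1": 1, "dm-lqr-6-2": 2}
ACTION_MIN, ACTION_MAX = -1.0, 1.0

def clip_action(env_name, action):
    """Check the action length and clamp each component to [-1.0, 1.0]."""
    expected = ACTION_DIMS[env_name]
    if len(action) != expected:
        raise ValueError(f"{env_name} expects {expected} action value(s), got {len(action)}")
    return [min(ACTION_MAX, max(ACTION_MIN, float(a))) for a in action]
```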
Observation Space#
The observation is formed by concatenating all positions (qpos) followed by all velocities (qvel).
dm-lqr-2-1#
| Item | Details |
|---|---|
| Type | |
| Dimension | 4 |

| Index | Observation | Meaning |
|---|---|---|
| 0 | | Position of the first mass |
| 1 | | Position of the second mass |
| 2 | | Velocity of the first mass |
| 3 | | Velocity of the second mass |
dm-lqr-6-2#
| Item | Details |
|---|---|
| Type | |
| Dimension | 12 |
The first 6 dimensions are q0 ~ q5, and the last 6 dimensions are dq0 ~ dq5.
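The layout is the same for both variants: positions first, then velocities. A minimal sketch of that concatenation follows; `build_observation` is a hypothetical helper name, not a function from this repository.

```python
def build_observation(qpos, qvel):
    """Concatenate positions and velocities into one flat observation."""
    if len(qpos) != len(qvel):
        raise ValueError("qpos and qvel must have one entry per mass")
    # Indices 0..n-1 hold positions, indices n..2n-1 hold velocities.
    return list(qpos) + list(qvel)
```

With two masses this gives 4 entries, and with six masses it gives 12, matching the dimensions listed above.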
Reward Function Design#
The current reward is composed of state cost, velocity cost, control cost, success bonus, and out-of-bounds penalty:
state_cost = 0.5 * sum(qpos ** 2)
velocity_cost = 0.5 * velocity_cost_coef * sum(qvel ** 2)
control_cost = 0.5 * control_cost_coef * sum(action ** 2)
reward = 1.0 - (state_cost + velocity_cost + control_cost)
reward += success_bonus
reward -= out_of_bounds_penalty
Intuitively:
the farther the system is from the origin, the lower the reward
larger velocities reduce the reward
aggressive control inputs reduce the reward
entering a small stable region around the origin yields a success bonus
leaving the valid state boundary triggers an additional penalty
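The reward terms can be put together in a small runnable function. This is a sketch under stated assumptions: the coefficient, bonus, and penalty values below are placeholders rather than the configured ones, and the bonus and penalty are assumed to apply only on steps where the corresponding condition holds.

```python
def compute_reward(qpos, qvel, action, succeeded, out_of_bounds,
                   velocity_cost_coef=0.1, control_cost_coef=0.1,
                   success_bonus=1.0, out_of_bounds_penalty=1.0):
    """Quadratic-cost reward; all default values are illustrative placeholders."""
    state_cost = 0.5 * sum(q * q for q in qpos)
    velocity_cost = 0.5 * velocity_cost_coef * sum(v * v for v in qvel)
    control_cost = 0.5 * control_cost_coef * sum(a * a for a in action)
    reward = 1.0 - (state_cost + velocity_cost + control_cost)
    if succeeded:          # inside the small stable region around the origin
        reward += success_bonus
    if out_of_bounds:      # left the valid state boundary
        reward -= out_of_bounds_penalty
    return reward
```

At rest at the origin with zero control, all three costs vanish and the base reward reaches its maximum of 1.0 before any bonus is added.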
Initial State#
At reset:
the position vector is sampled in a random direction and normalized to a fixed norm
all initial velocities are set to zero
With the current configuration:
dm-lqr-2-1 starts with position norm around 0.8
dm-lqr-6-2 starts with position norm around 1.0
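One common way to draw a random direction is to sample a Gaussian vector and rescale it to the target norm. The sketch below does that. The sampling method, the helper name `sample_initial_state`, and the `rng` parameter are assumptions, since the text only says the direction is random and the norm is fixed.

```python
import math
import random

def sample_initial_state(n_masses, position_norm, rng=None):
    """Return (qpos, qvel): a random direction scaled to position_norm, zero velocity."""
    if rng is None:
        rng = random.Random()
    while True:
        direction = [rng.gauss(0.0, 1.0) for _ in range(n_masses)]
        length = math.sqrt(sum(d * d for d in direction))
        if length > 1e-12:  # resample in the unlikely case of a near-zero vector
            break
    qpos = [position_norm * d / length for d in direction]
    qvel = [0.0] * n_masses
    return qpos, qvel
```

For example, `sample_initial_state(2, 0.8)` corresponds to the dm-lqr-2-1 setting and `sample_initial_state(6, 1.0)` to dm-lqr-6-2.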
Episode Termination Conditions#
An episode terminates and resets when any of the following conditions is met:
success condition is reached: the position norm is below the success distance threshold and the velocity norm is below the success velocity threshold
out-of-bounds condition is reached: any position exceeds the position boundary or any velocity exceeds the velocity boundary
the full state is sufficiently close to zero
NaN appears in the observation or action
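The conditions above can be checked in a single function. This is a minimal sketch: the name `check_termination`, the reason labels, the order of the checks, and all threshold values are hypothetical, because the configured thresholds are not given here. With these placeholder values the near-zero check is subsumed by the success check and is kept only for completeness.

```python
import math

def check_termination(qpos, qvel, action,
                      success_dist=0.05, success_vel=0.05,
                      pos_bound=2.0, vel_bound=10.0, zero_tol=1e-3):
    """Return (done, reason) for one step; every threshold is a placeholder."""
    # Invalid numbers in the observation (qpos + qvel) or the action.
    if any(math.isnan(v) for v in [*qpos, *qvel, *action]):
        return True, "nan"
    # Any single coordinate leaving its boundary ends the episode.
    if any(abs(q) > pos_bound for q in qpos) or any(abs(v) > vel_bound for v in qvel):
        return True, "out_of_bounds"
    pos_norm = math.sqrt(sum(q * q for q in qpos))
    vel_norm = math.sqrt(sum(v * v for v in qvel))
    if pos_norm < success_dist and vel_norm < success_vel:
        return True, "success"
    # Full state (positions and velocities together) close to zero.
    if math.hypot(pos_norm, vel_norm) < zero_tol:
        return True, "near_zero"
    return False, "running"
```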
Usage Guide#
1. Environment Preview#
uv run scripts/view.py --env dm-lqr-2-1
uv run scripts/view.py --env dm-lqr-6-2
2. Start Training#
uv run scripts/train.py --env dm-lqr-2-1
uv run scripts/train.py --env dm-lqr-6-2
3. View Training Progress#
uv run tensorboard --logdir runs/dm-lqr-2-1
uv run tensorboard --logdir runs/dm-lqr-6-2
4. Test Training Results#
uv run scripts/play.py --env dm-lqr-2-1
uv run scripts/play.py --env dm-lqr-6-2
Expected Training Results#
dm-lqr-2-1#
The actuated mass pulls the unactuated mass back toward the center.
Both positions and velocities converge to a small neighborhood of zero.
The learned policy does not settle at a biased off-center equilibrium.
dm-lqr-6-2#
The last two actuated masses gradually pull the entire chain back toward the center.
The chain remains stable without obvious divergence or persistent oscillation.
Success rate increases during training while the out-of-bounds rate decreases.
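To follow the last point numerically, the two rates can be computed from a window of recent episode outcomes. The sketch below assumes each finished episode is recorded as a string label such as "success" or "out_of_bounds". Those labels and the helper `outcome_rates` are hypothetical and are not part of the training scripts.

```python
def outcome_rates(outcomes):
    """Fraction of episodes that ended in success and in out-of-bounds."""
    total = len(outcomes)
    if total == 0:
        return {"success_rate": 0.0, "out_of_bounds_rate": 0.0}
    return {
        "success_rate": outcomes.count("success") / total,
        "out_of_bounds_rate": outcomes.count("out_of_bounds") / total,
    }
```

Comparing these values between an early and a late window of episodes gives a quick check that the success rate is rising while the out-of-bounds rate is falling.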