Ping Pong Ball Bouncing#
Train a single-arm robotic manipulator to control a paddle for continuous ball bouncing, maintaining the ball at a target height and position.
Task Description#
Bounce Ball is a single-arm manipulation task in which a 6-DOF Peitian AIR4-560 industrial robotic arm positions a paddle mounted on its end effector. Each action specifies position changes for the arm’s 6 joints. The goal is to keep the ping pong ball bouncing continuously on the paddle while holding it as close as possible to the target height and target horizontal position.
Action Space#
| Item | Details |
|---|---|
| Type | |
| Dimension | 6 |
The joints correspond as follows:
| Index | Action Meaning (Joint Position Change) | Min Value | Max Value | Corresponding XML Name |
|---|---|---|---|---|
| 0 | Joint1 (Base Rotation) Position Change | -1 | 1 | |
| 1 | Joint2 (Upper Arm) Position Change | -1 | 1 | |
| 2 | Joint3 (Forearm) Position Change | -1 | 1 | |
| 3 | Joint4 (Wrist Rotation) Position Change | -1 | 1 | |
| 4 | Joint5 (Wrist Pitch) Position Change | -1 | 1 | |
| 5 | Joint6 (Wrist Rotation) Position Change | -1 | 1 | |
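As a minimal sketch of how a 6-dimensional action in [-1, 1] could be turned into joint position changes: the scale factor `ACTION_SCALE` and the helper `apply_action` are hypothetical names introduced for illustration, and the real scaling is defined by the environment configuration.

```python
import numpy as np

# Hypothetical scale: radians of joint-target change per unit of action.
# The actual value is set by the environment and is not given in this document.
ACTION_SCALE = 0.05


def apply_action(joint_targets, action):
    """Return new joint position targets after applying one 6-D action."""
    # Clip to the documented action range before scaling.
    action = np.clip(np.asarray(action, dtype=np.float64), -1.0, 1.0)
    return np.asarray(joint_targets, dtype=np.float64) + ACTION_SCALE * action


targets = np.zeros(6)
new_targets = apply_action(targets, [1.0, -1.0, 0.5, 0.0, 2.0, -3.0])
```

Out-of-range components (the last two here) are clipped to the documented bounds before they are applied.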
Observation Space#
| Item | Details |
|---|---|
| Type | |
| Dimension | 29 |
The observation space consists of the following parts (in order):
| Part | Content Description | Dimension | Remarks |
|---|---|---|---|
| dof_pos | Position information for each degree of freedom | 13 | First 6 are arm joints, last 7 are ball’s free joint (3 position + 4 quaternion) |
| dof_vel | Velocity information for each degree of freedom | 12 | Time derivative of position: 6 arm joint velocities, then 6 for the ball’s free joint (3 linear + 3 angular) |
| paddle_pos | Paddle position information | 3 | x, y, z coordinates of paddle center |
| target_height | Target height | 1 | Target height for current environment |
| Index | Observation | Min Value | Max Value | XML Name | Type (Unit) |
|---|---|---|---|---|---|
| 0-5 | Arm Joint Angles | -Inf | Inf | Joint1-6 | Angle (rad) |
| 6-8 | Ball Position [x, y, z] | -Inf | Inf | ball_link | Position (m) |
| 9-12 | Ball Orientation Quaternion [w,x,y,z] | -Inf | Inf | ball_link | Quaternion |
| 13-18 | Arm Joint Angular Velocities | -Inf | Inf | Joint1-6 | Angular Velocity (rad/s) |
| 19-24 | Ball Velocity [vx,vy,vz,wx,wy,wz] | -Inf | Inf | ball_link | Velocity (m/s, rad/s) |
| 25-27 | Paddle Position [x, y, z] | -Inf | Inf | blocker | Position (m) |
| 28 | Target Height | -Inf | Inf | - | Position (m) |
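The index layout above can be sketched as a small helper that slices a 29-dimensional observation into named parts. The function name `split_observation` and the dictionary keys are illustrative assumptions; only the index ranges come from the table.

```python
import numpy as np


def split_observation(obs):
    """Slice a (..., 29) observation into named parts using the documented indices."""
    obs = np.asarray(obs)
    if obs.shape[-1] != 29:
        raise ValueError(f"expected last dimension 29, got {obs.shape[-1]}")
    return {
        "arm_joint_pos": obs[..., 0:6],    # rad
        "ball_pos": obs[..., 6:9],         # m
        "ball_quat": obs[..., 9:13],       # quaternion
        "arm_joint_vel": obs[..., 13:19],  # rad/s
        "ball_lin_vel": obs[..., 19:22],   # m/s
        "ball_ang_vel": obs[..., 22:25],   # rad/s
        "paddle_pos": obs[..., 25:28],     # m
        "target_height": obs[..., 28],     # m
    }


# Dummy observation whose values equal their own indices, for easy checking.
parts = split_observation(np.arange(29, dtype=np.float32))
```

Slicing on the last axis lets the same helper work for a single observation or a batch from vectorized environments.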
Reward Function#
The reward function is a composite of multiple reward and penalty terms that guide the robot toward a stable ball-bouncing policy. All reward parameters can be adjusted through the configuration file.
Main Reward Terms#
1. Horizontal Position Reward#
Design Rationale: This is the core reward term; it keeps the ball directly above the paddle. The term is weighted by vertical distance: when the ball is close to the paddle (about to be hit), the horizontal position requirement is stricter, which guides the policy to align precisely at the critical moment.
Formula:
Weight: 2.0
2. Out of Position Penalty#
Design Rationale: Applies a strong penalty when the ball deviates severely from the target position, to prevent it from flying out of the controllable range. A sigmoid function gives a smooth transition and avoids a discontinuous reward.
Formula:
Weight: 1.0
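A minimal sketch of a sigmoid-shaped penalty of the kind the rationale describes. The `threshold` and `sharpness` values are hypothetical placeholders, not the environment’s actual parameters, and `out_of_position_penalty` is an illustrative name.

```python
import numpy as np


def out_of_position_penalty(horizontal_error, threshold=0.3, sharpness=20.0):
    """Smooth penalty in (0, 1) that rises as the error passes `threshold`.

    `threshold` (m) and `sharpness` (1/m) are assumed values for illustration.
    """
    x = sharpness * (np.asarray(horizontal_error, dtype=np.float64) - threshold)
    return 1.0 / (1.0 + np.exp(-x))


small = out_of_position_penalty(0.0)   # well inside the allowed region
edge = out_of_position_penalty(0.3)    # exactly at the threshold
large = out_of_position_penalty(1.0)   # far outside
```

Unlike a hard step at the threshold, the penalty changes continuously with the error, so small position changes never cause a jump in reward.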
3. Velocity Matching Reward#
Design Rationale: Based on projectile motion, this term encourages a trajectory on which the ball has the desired velocity (0.5 m/s) when it reaches the target height. The ball then passes through the target height neither too fast nor too slow, which makes stable control easier.
Formula:
Weight: 2.0
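The projectile-motion reasoning behind this term can be sketched as follows, assuming drag-free motion and standard gravity (9.81 m/s²). The helper `speed_at_height` is a hypothetical name; it predicts the vertical speed the ball would have at the target height, which can then be compared with the desired 0.5 m/s.

```python
import math

G = 9.81              # m/s^2, standard gravity (assumed)
DESIRED_SPEED = 0.5   # m/s, desired velocity at target height from the text


def speed_at_height(z, vz, target_height, g=G):
    """Vertical speed at `target_height` for a ball now at height z with
    vertical velocity vz, or None if the ball cannot reach that height."""
    # Energy conservation: v_t^2 = vz^2 - 2 g (h_target - z)
    v_sq = vz * vz - 2.0 * g * (target_height - z)
    return math.sqrt(v_sq) if v_sq >= 0.0 else None


# A ball 0.3 m below the target, launched so that it arrives at 0.5 m/s.
launch_vz = math.sqrt(DESIRED_SPEED**2 + 2.0 * G * 0.3)
predicted = speed_at_height(0.2, launch_vz, 0.5)
too_slow = speed_at_height(0.2, 1.0, 0.5)  # cannot reach the target height
```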
4. Height Reward#
Design Rationale: Directly encourages the ball to approach the target height, one of the core task objectives. The comparatively high weight (4.5) ensures that the policy prioritizes height control. The target height is randomly sampled (0.3-0.6 m) in each environment to improve policy generalization.
Formula:
Weight: 4.5
5. Height Progress Reward#
Design Rationale: Encourages the ball to reach higher positions, helping the strategy quickly learn to hit the ball upward in early training, avoiding the “no-hit” local optimum.
Formula:
Weight: 1.0
6. Controlled Upward Velocity Reward#
Design Rationale: Rewards upward velocity only when the ball’s horizontal position is good, which discourages “random hitting” behavior. The ideal velocity is computed from projectile physics so that the ball just reaches the target height. This term guides the policy toward a precise hitting force.
Formula:
Weight: 1.5
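A sketch of the physics this rationale refers to, assuming drag-free projectile motion and standard gravity \(g = 9.81\) m/s² (the symbols \(v_{ideal}\), \(h_{target}\), and \(z\) are introduced here for illustration): a ball at height \(z\) that should just reach \(h_{target}\) needs the upward velocity

\[
v_{ideal} = \sqrt{2 g \,(h_{target} - z)}
\]

For example, a ball leaving the paddle 0.3 m below the target height would need roughly \(\sqrt{2 \cdot 9.81 \cdot 0.3} \approx 2.43\) m/s.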
7. Consecutive Bounces Reward#
Design Rationale: Encourages multiple consecutive successful bounces, guiding the policy toward stable long-term control. A logarithmic function damps the reward growth as the bounce count rises, and the reward is given only when the ball position is good.
Formula:
Weight: 0.8
8. High Bounce Count Reward#
Design Rationale: Gives an extra reward for high bounce counts (≥3) to further encourage stable long-term control. A sigmoid activation function makes the reward grow smoothly.
Formula:
Weight: 0.3
9. Paddle-Ball Horizontal Alignment Reward#
Design Rationale: Encourages the paddle to move actively to a position directly below the ball rather than wait for the ball to fall. The smaller the vertical distance, the larger the weight, so alignment must be most precise at the moment of the hit. An extra reward (boost) is given at the moment of the bounce to reinforce correct hitting behavior.
Formula:
Weight: 0.6
10. Paddle Home Position Reward#
Design Rationale: Encourages the paddle to return to its home position when the ball is far away, so that it does not stay raised for long periods. A distance-dependent factor scales the reward: when the ball is far from the paddle (\(d_{vert} > 0.15\) m), the reward increases to encourage a quick return; when the ball is close, the reward decreases so that the paddle can move up to hit. This makes paddle motion more energy-efficient and natural.
Formula:
Weight: 1.5
Penalty Terms#
1. Excessive Upward Velocity Penalty#
Design Rationale: Keeps the ball’s upward velocity from becoming too high to control (>3.5 m/s), so that its motion stays within the controllable range.
Formula:
Weight: 1.0
2. Downward Velocity Penalty#
Design Rationale: Penalizes downward ball motion (\(v_z < -0.2\) m/s), which encourages the policy to hit the ball in time rather than let it fall freely.
Formula:
Weight: 1.0
3. Paddle Height Violation Penalty#
Design Rationale: Applies a strong penalty when the paddle deviates too far from its home position (>0.1 m), so that it does not stay raised for long periods.
Formula:
Weight: 1.0
4. Action Change Rate Penalty#
Design Rationale: Penalizes abrupt changes between consecutive actions, encouraging a smooth control policy.
Formula:
Weight: \(10^{-4}\)
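One common form of such a penalty is the squared difference between consecutive actions. The sketch below assumes that form (the environment’s exact expression may differ) and uses the hypothetical helper name `action_rate_penalty`; only the weight \(10^{-4}\) comes from the text.

```python
import numpy as np

ACTION_RATE_WEIGHT = 1e-4  # weight stated for this term


def action_rate_penalty(action, prev_action):
    """Weighted squared L2 change between consecutive actions (assumed form)."""
    diff = np.asarray(action, dtype=np.float64) - np.asarray(prev_action, dtype=np.float64)
    return ACTION_RATE_WEIGHT * float(np.sum(diff**2))


# Largest single-direction jump from rest: every joint goes from 0 to 1.
penalty = action_rate_penalty(np.ones(6), np.zeros(6))
unchanged = action_rate_penalty(np.ones(6), np.ones(6))
```

The returned magnitude would be subtracted from the total reward. With a weight this small the term acts as a gentle regularizer rather than a dominant objective.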
5. Joint Velocity Penalty#
Design Rationale: Penalizes excessive joint velocities, encouraging energy-efficient and smooth motion.
Formula:
Weight: \(10^{-4}\)
Total Reward Calculation#
Initial State#
Robot Initialization#
Joint Angles:
Default angles: [0°, 40°, 110°, 0°, -60°, 0°]
Random noise: Uniform random noise in [-0.1, 0.1] radians added to each joint
Joint Velocities:
Initialized to zero with small random noise
Ball Initialization#
Position:
Base position: ball_init_pos from config file (default [0.58856, 0, 0.45] m)
Random noise: Uniform random noise in [-0.01, 0.01] m
Velocity:
Initialized to zero
Orientation:
Quaternion: [0, 0, 0, 1] (identity quaternion, no rotation)
Target Height#
The target height for each environment is randomly sampled from the range [0.4, 0.6] m to improve policy generalization.
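The initialization described in this section can be sketched with NumPy as below. The function `sample_initial_state` is a hypothetical name, and two details are assumptions: the ball noise is applied independently to each coordinate, and the target height is drawn uniformly. The numeric ranges and defaults come from the text.

```python
import numpy as np

DEFAULT_JOINT_DEG = np.array([0.0, 40.0, 110.0, 0.0, -60.0, 0.0])  # default angles
BALL_INIT_POS = np.array([0.58856, 0.0, 0.45])  # default ball_init_pos, in metres


def sample_initial_state(rng):
    """Sample joint angles (rad), ball position (m), and target height (m)."""
    # Default angles converted to radians, plus uniform noise in [-0.1, 0.1] rad.
    joint_pos = np.deg2rad(DEFAULT_JOINT_DEG) + rng.uniform(-0.1, 0.1, size=6)
    # Assumption: noise in [-0.01, 0.01] m is applied to each coordinate.
    ball_pos = BALL_INIT_POS + rng.uniform(-0.01, 0.01, size=3)
    # Assumption: target height is uniform over the stated [0.4, 0.6] m range.
    target_height = float(rng.uniform(0.4, 0.6))
    return joint_pos, ball_pos, target_height


rng = np.random.default_rng(0)
joint_pos, ball_pos, target_height = sample_initial_state(rng)
```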
Episode Termination Conditions#
Ball Falls: Ball z-coordinate < 0.05 m (near ground)
Ball Too High: Ball z-coordinate > target height + 1.0 m (lost control)
Horizontal Deviation Too Far: Absolute value of ball x or y coordinate > 1.5 m
Joint Velocity Too High: Any joint angular velocity > 2π rad/s (360°/s)
Timeout: Episode duration exceeds maximum allowed time
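The conditions above can be sketched as a single check. The function name `is_terminated` and the `max_episode_s` argument are hypothetical (the maximum episode time is not stated here), and the joint velocity test is assumed to compare magnitudes.

```python
import numpy as np


def is_terminated(ball_pos, joint_vel, target_height, elapsed_s, max_episode_s):
    """Return (done, reasons) for the listed termination conditions."""
    x, y, z = ball_pos
    reasons = {
        "ball_fell": bool(z < 0.05),
        "ball_too_high": bool(z > target_height + 1.0),
        "horizontal_deviation": bool(abs(x) > 1.5 or abs(y) > 1.5),
        # Assumption: the limit applies to the magnitude of each joint velocity.
        "joint_velocity": bool(np.any(np.abs(joint_vel) > 2.0 * np.pi)),
        "timeout": bool(elapsed_s > max_episode_s),
    }
    return any(reasons.values()), reasons


# Nominal state: ball near its default position, arm at rest.
done, reasons = is_terminated([0.59, 0.0, 0.45], np.zeros(6), 0.5, 1.0, 20.0)
```

Returning the individual flags alongside the overall result makes it easy to log which condition ended an episode.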
Usage Guide#
1. Environment Preview#
uv run scripts/view.py --env bounce_ball
2. Start Training#
uv run scripts/train.py --env bounce_ball
3. View Training Progress#
uv run tensorboard --logdir runs/bounce_ball
4. Test Training Results#
uv run scripts/play.py --env bounce_ball
Expected Training Results#
Consecutive Bouncing: Capable of achieving 3 or more consecutive bounces
Position Control: Ball’s horizontal position (x-coordinate) stable within target position ± 0.05m range
Height Control: Ball’s height stable within target height 0.8 ± 0.1m range
Velocity Control: Ball’s upward velocity maintained within reasonable range (0.1-1.5 m/s)
Stable Control: Capable of maintaining stable bouncing for 20 seconds without dropping
Known Issues#
JAX Backend Training Performance: The JAX backend currently trains less effectively on this environment. The PyTorch backend is recommended for better results.