This repository is the starter code for the NYU Reinforcement Learning and Optimal Control project in which students train a Unitree Go2 walking policy in Isaac Lab starting from a minimal baseline and improve it via reward shaping and robustness strategies. Please read this README fully before starting and follow the exact workflow and naming rules below to ensure your runs integrate correctly with the cluster scripts and grading pipeline.
- Fork this repository and do not change the repository name in your fork.
- Your fork must be named rob6323_go2_project so cluster scripts and paths work without modification.
- GitHub Account: You must have a GitHub account to fork this repository and manage your code. If you do not have one, sign up at https://github.com/signup.
- Project Webpage: https://machines-in-motion.github.io/RL_class_go2_project/
- Project Tutorial: https://github.com/machines-in-motion/rob6323_go2_project/blob/master/tutorial/tutorial.md
- Connect to the NYU Greene HPC via SSH; if you are off-campus or not on NYU Wi‑Fi, you must connect through the NYU VPN before SSHing to Greene.
- The official instructions include example SSH config snippets and commands for greene.hpc.nyu.edu and dtn.hpc.nyu.edu as well as VPN and gateway options: https://sites.google.com/nyu.edu/nyu-hpc/accessing-hpc?authuser=0#h.7t97br4zzvip.
After logging into Greene, cd into your home directory (cd $HOME). You must clone your fork into $HOME only (not scratch or archive). This ensures subsequent scripts and paths resolve correctly on the cluster. Since this is a private repository, you need to authenticate with GitHub. You have two options:
The easiest way to avoid managing keys manually is to configure VS Code Remote SSH. If set up correctly, VS Code forwards your local credentials to the cluster.
- Follow the NYU HPC VS Code guide to set up the connection.
Tip: Once connected to Greene in VS Code, you can clone directly without using the terminal:
- Sign in to GitHub: Click the "Accounts" icon (user profile picture) in the bottom-left sidebar. If you aren't signed in, click "Sign in with GitHub" and follow the browser prompts to authorize VS Code.
- Clone the Repo: Open the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P`), type `Git: Clone`, and select it.
- Select Destination: When prompted, select your home directory (`/home/<netid>/`) as the clone location.

For more details, see the VS Code Version Control Documentation.
If you prefer using a standard terminal, you must generate a unique SSH key on the Greene cluster and add it to your GitHub account:
- Generate a key: Run the `ssh-keygen` command on Greene (follow the official GitHub documentation on generating a new SSH key).
- Add the key to GitHub: Copy the output of your public key (e.g., `cat ~/.ssh/id_ed25519.pub`) and add it to your account settings (follow the GitHub documentation on adding a new SSH key).
Once authenticated, run the following commands. Replace <your-git-ssh-url> with the SSH URL of your fork (e.g., git@github.com:YOUR_USERNAME/rob6323_go2_project.git).
cd $HOME
git clone <your-git-ssh-url> rob6323_go2_project
Note: You must ensure the target directory is named exactly rob6323_go2_project. This ensures subsequent scripts and paths resolve correctly on the cluster.
- Enter the project directory and run the installer to set up required dependencies and cluster-side tooling.
cd $HOME/rob6323_go2_project
./install.sh
Do not skip this step: it configures the environment expected by the training and evaluation scripts. It launches a job on burst that sets up dependencies and clones the IsaacLab repository inside your Greene storage. You must wait until this burst job completes before launching your first training. To check its progress, run `ssh burst "squeue -u $USER"`; the job disappears from the queue once it has completed. Setup takes around 30 minutes.
You should see something similar to the screenshot below (captured from Greene):
In this output, the ST (state) column indicates the job status:
- `PD` = pending in the queue (waiting for resources).
- `CF` = instance is being configured.
- `R` = job is running.
On burst, it is common for an instance to fail to configure; in that case, the provided scripts automatically relaunch the job when this happens, so you usually only need to wait until the job finishes successfully and no longer appears in squeue.
- In this project you'll only have to modify the two files below, which define the Isaac Lab task and its configuration (including PPO hyperparameters).
- source/rob6323_go2/rob6323_go2/tasks/direct/rob6323_go2/rob6323_go2_env.py
- source/rob6323_go2/rob6323_go2/tasks/direct/rob6323_go2/rob6323_go2_env_cfg.py

PPO hyperparameters are defined in source/rob6323_go2/rob6323_go2/tasks/direct/rob6323_go2/agents/rsl_rl_ppo_cfg.py, but you shouldn't need to modify them.
- Option A (recommended): Use VS Code Remote SSH from your laptop to edit files on Greene; follow the NYU HPC VS Code guide and connect to a compute node as instructed (VPN required off‑campus) (https://sites.google.com/nyu.edu/nyu-hpc/training-support/general-hpc-topics/vs-code). If set up correctly, it simplifies the login process and other tasks, such as cloning a private repo.
- Option B: Edit directly on Greene using a terminal editor such as nano.
nano source/rob6323_go2/rob6323_go2/tasks/direct/rob6323_go2/rob6323_go2_env.py
- Option C: Develop locally on your machine, push to your fork, then pull changes on Greene within your $HOME/rob6323_go2_project clone.
Tip: Don't forget to push your work to GitHub regularly.
- From $HOME/rob6323_go2_project on Greene, submit a training job via the provided script.
cd "$HOME/rob6323_go2_project"
./train.sh
- Check job status with SLURM using squeue on the burst head node as shown below.
ssh burst "squeue -u $USER"
Be aware that jobs can be canceled and requeued by the scheduler or underlying provider policies when higher-priority work preempts your resources, which is normal behavior on shared clusters using preemptible partitions.
- When a job completes, logs are written under logs in your project clone on Greene in the structure logs/[job_id]/rsl_rl/go2_flat_direct/[date_time]/.
- Inside each run directory you will find a TensorBoard events file (events.out.tfevents...), neural network checkpoints (model_[epoch].pt), YAML files with the exact PPO and environment parameters, and a rollout video under videos/play/ that showcases the trained policy.
Use rsync to copy results from the cluster to your local machine. It is faster than scp and can resume interrupted transfers. Run this on your machine (NOT on Greene):
rsync -avzP -e 'ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null' <netid>@dtn.hpc.nyu.edu:/home/<netid>/rob6323_go2_project/logs ./
Explanation of flags:
- `-a`: Archive mode (recursive; preserves permissions and timestamps).
- `-v`: Verbose output.
- `-z`: Compresses data during transfer (faster over the network).
- `-P`: Shows a progress bar and allows resuming partial transfers.
You can inspect training metrics (reward curves, loss values, episode lengths) using TensorBoard. This requires installing it on your local machine.
- Install TensorBoard: On your local computer (do NOT run this on Greene), install the package: `pip install tensorboard`
- Launch the Server: Navigate to the folder containing your downloaded `logs` directory and start the server: `tensorboard --logdir ./logs`
- View Metrics: Open your browser to the URL shown (usually http://localhost:6006/).
Burst storage is accessible only from a job running on burst, not from the burst login node. The provided scripts do not automatically synchronize error logs back to your home directory on Greene. However, you will need access to these logs to debug failed jobs. These error logs differ from the logs in the previous section.
The suggested way to inspect these logs is via the Open OnDemand web interface:
- Navigate to https://ood-burst-001.hpc.nyu.edu.
- Select Files > Home Directory from the top menu.
- You will see a list of files, including your `.err` log files.
- Click on any `.err` file to view its content directly in the browser.

Important: Do not modify anything inside the `rob6323_go2_project` folder on burst storage. This directory is managed by the job scripts, and manual changes may cause synchronization issues or job failures.
- The assignment expects you to go beyond velocity tracking by adding principled reward terms (posture stabilization, foot clearance, slip minimization, smooth actions, contact and collision penalties), robustness via domain randomization, and clear benchmarking metrics for evaluation as described in the course guidelines.
- Keep your repository organized, document your changes in the README, and ensure your scripts are reproducible, as these factors are part of grading alongside policy quality and the short demo video deliverable.
- Isaac Lab documentation — Everything you need to know about IsaacLab, and more!
- Isaac Lab ANYmal C environment — This targets ANYmal C (not Unitree Go2), so use it as a reference and adapt robot config, assets, and reward to Go2.
- DMO (IsaacGym) Go2 walking project page • Go2 walking environment used by the authors • Config file used by the authors — Look at the function `compute_reward_CaT` (beware that some reward terms have a weight of 0 and are thus deactivated; check the weights in the config file). This implementation includes strong reward shaping, domain randomization, and training disturbances for robust sim‑to‑real, but it is written for legacy IsaacGym, and the challenge is to re-implement it in Isaac Lab.
- API References:
  - ArticulationData (`robot.data`) — Contains `root_pos_w`, `joint_pos`, `projected_gravity_b`, etc.
  - ContactSensorData (`_contact_sensor.data`) — Contains `net_forces_w` (contact forces).
Students should only edit README.md below this line.
Author: Arshia Sangwan
Course: ROB-6323 Reinforcement Learning and Optimal Control
Institution: NYU Tandon School of Engineering
This project transforms a minimal two-reward baseline into a comprehensive quadruped locomotion system capable of robust velocity tracking, terrain traversal, and dynamic acrobatic maneuvers. Starting from the base repo, I developed two complete training environments:
-
Flat Terrain Locomotion (Main Task): A production-ready walking policy with 26 reward terms, automatic curriculum learning, and extensive domain randomization for sim-to-real transfer.
-
Controlled Backflip (Bonus): A dynamic aerial maneuver environment that teaches the robot to perform a complete backward somersault with soft landing and recovery.
The following sections detail every technical decision, implementation choice, and the reasoning behind each component.
- Project Architecture
- Main Task: Flat Terrain Locomotion
- Bonus Task : Controlled Backflip
- Training and Evaluation
- Results
- References
rob6323_go2_project/
source/rob6323_go2/rob6323_go2/tasks/direct/rob6323_go2/
rob6323_go2_env.py # Main locomotion environment (1380 lines)
rob6323_go2_env_cfg.py # Main configuration (455 lines)
rob6323_go2_backflip_env.py # Backflip environment (1046 lines)
rob6323_go2_backflip_env_cfg.py # Backflip configuration (262 lines)
agents/
rsl_rl_ppo_cfg.py # PPO hyperparameters
train.sh # Flat terrain training
train_backflip.sh # Backflip training
| File | Lines | Description |
|---|---|---|
| `rob6323_go2_env.py` | 1380 | Complete rewrite with 26 rewards, curriculum, randomization |
| `rob6323_go2_env_cfg.py` | 455 | Comprehensive configuration with documented parameters |
| `rob6323_go2_backflip_env.py` | 1046 | New file: phase-based backflip training |
| `rob6323_go2_backflip_env_cfg.py` | 262 | New file: backflip-specific configuration |
The baseline environment provided only two reward terms: linear velocity tracking and yaw rate tracking. While functional, this minimal approach produces policies that lack the robustness, efficiency, and natural motion quality required for real-world deployment.
My approach follows three core principles:
-
Reward Decomposition: Rather than a monolithic reward function, I decompose the objective into 26 specialized terms, each addressing a specific aspect of locomotion quality. This provides clear learning signals and enables fine-grained behavior tuning.
-
Progressive Difficulty: Training begins with simple conditions and gradually introduces complexity through automatic curriculum learning. This prevents the policy from being overwhelmed early in training.
-
Sim-to-Real Awareness: Every design decision considers the eventual transfer to physical hardware. Domain randomization, observation noise, and actuator modeling all serve this goal.
I implemented 26 reward terms organized into six functional categories. Each term was carefully designed based on established literature and tuned through extensive experimentation.
These rewards encourage the policy to follow velocity commands accurately.
| Term | Formula | Scale | Rationale |
|---|---|---|---|
| `track_lin_vel_xy` | exp(-error^2 / 0.25) | +1.5 | Exponential kernel provides smooth gradients near zero error and naturally bounds the reward. Inspired by ETH Zurich's legged_gym implementation. |
| `track_ang_vel_z` | exp(-error^2 / 0.25) | +0.8 | Same kernel for yaw rate. Lower scale because turning is secondary to forward motion in most tasks. |
The exponential kernel is superior to quadratic error for several reasons: it saturates at 1.0 for perfect tracking (preventing reward explosion), provides strong gradients near the target, and degrades gracefully for large errors.
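To make these properties concrete, here is a minimal sketch of the kernel (the helper name and signature are illustrative, not the repository's actual code):

```python
import math

def track_lin_vel_xy(cmd_vx, cmd_vy, vx, vy, sigma=0.25, scale=1.5):
    """Exponential tracking kernel: saturates at `scale` for zero error
    and decays smoothly as the squared velocity error grows."""
    err_sq = (cmd_vx - vx) ** 2 + (cmd_vy - vy) ** 2
    return scale * math.exp(-err_sq / sigma)

# Perfect tracking earns the full scale; a 1 m/s error earns almost nothing,
# but the gradient toward the target remains smooth everywhere.
```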
These penalties maintain a stable, natural body configuration.
| Term | Formula | Scale | Rationale |
|---|---|---|---|
| `orient` | gravity_xy^2 | -2.0 | Penalizes roll and pitch deviation using the projected gravity vector in the body frame. When the robot is level, gravity_x and gravity_y are zero. |
| `base_height` | (h - 0.34)^2 | -10.0 | Maintains the Go2's natural standing height of 0.34 m. Prevents both crouching (inefficient) and over-extension (unstable). |
| `lin_vel_z` | vel_z^2 | -2.0 | Penalizes vertical bouncing, which wastes energy and indicates poor gait quality. |
| `ang_vel_xy` | ang_vel_xy^2 | -0.05 | Dampens rotational oscillations in roll and pitch. Small scale because some rotation is natural during locomotion. |
These penalties reduce actuator wear and produce more natural motion.
| Term | Formula | Scale | Rationale |
|---|---|---|---|
| `action_rate` | (a_t - a_{t-1})^2 | -0.01 | Penalizes the first derivative of actions (velocity). Reduces high-frequency oscillations. |
| `action_smoothness` | (a_t - 2a_{t-1} + a_{t-2})^2 | -0.005 | Penalizes the second derivative (acceleration/jerk). Uses finite differences to compute action acceleration. |
The combination of first and second derivative penalties produces remarkably smooth motion while still allowing rapid responses when needed.
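A hedged sketch of the two finite-difference penalties (scales taken from the table above; the helper itself is hypothetical):

```python
def smoothness_penalties(a_t, a_prev, a_prev2, w_rate=0.01, w_smooth=0.005):
    """First-difference (action rate) and second-difference (action jerk)
    penalties over an action vector, returned as a single negative reward."""
    rate = sum((x - y) ** 2 for x, y in zip(a_t, a_prev))
    accel = sum((x - 2 * y + z) ** 2 for x, y, z in zip(a_t, a_prev, a_prev2))
    return -w_rate * rate - w_smooth * accel
```

A constant action history incurs no penalty, and a steady ramp incurs only the rate penalty, so the policy can still move quickly when the motion is smooth.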
These penalties protect the hardware from damage and encourage efficient motion.
| Term | Formula | Scale | Rationale |
|---|---|---|---|
| `dof_vel` | joint_vel^2 | -5e-4 | Limits joint speeds. The Go2 motors have a 30 rad/s velocity limit. |
| `dof_acc` | ((v_t - v_{t-1})/dt)^2 | -2.5e-7 | Limits joint accelerations. Protects gearboxes from shock loads. |
| `dof_pos_limits` | max(0, pos - upper) + max(0, lower - pos) | -10.0 | Penalizes approach to mechanical joint limits defined in the URDF. |
| `torque` | torque^2 | -1e-5 | Encourages energy-efficient, low-torque solutions. |
| `hip_regularization` | (hip_angle - default)^2 | -0.5 | Keeps hip abduction near the default pose. Inspired by legged_gym. Prevents splayed leg configurations. |
These rewards shape natural gait patterns and foot placement.
| Term | Formula | Scale | Rationale |
|---|---|---|---|
| `feet_clearance` | swing_mask * max(0, 0.08 - foot_z) | -2.0 | Penalizes low foot height during swing phase. Target clearance is 8 cm, essential for obstacle avoidance. |
| `tracking_contacts` | 1 - mean(abs(actual - desired)) | +0.3 | Rewards matching the clock-based gait schedule. Promotes a regular trotting rhythm. |
| `feet_air_time` | (air_time - 0.5) * first_contact | +0.25 | Rewards a swing duration of approximately 0.5 s. Encourages dynamic gaits over shuffling. |
| `feet_slip` | contact * max(0, foot_vel_xy - 0.1) | -0.2 | Penalizes foot sliding while in contact with the ground. Slip wastes energy and indicates poor placement. |
| `stumble` | count(thigh_contact) + count(calf_contact) | -2.0 | Penalizes non-foot body parts contacting the ground. These events indicate stumbling. |
| `feet_contact_force` | sum(max(0, force - 400 N)) | -1e-4 | Penalizes excessive impact forces above 400 N. Protects feet from damage. |
| `foot_impact_vel` | first_contact * (vel_z)^2 | -0.3 | Penalizes high downward velocity at touchdown. Encourages soft landings. |
These rewards address overall behavior and safety.
| Term | Formula | Scale | Rationale |
|---|---|---|---|
| `collision` | (base_force > 5 N) | -5.0 | Penalizes body collisions with the ground. Binary penalty when the base touches the ground. |
| `raibert_heuristic` | (foot_pos - target)^2 | -0.5 | Encourages Raibert-style footstep placement: the foot lands ahead when moving forward. |
| `stand_still` | joint_motion when cmd=0 | -0.2 | Penalizes motion when commanded to stand still. Prevents drifting at zero command. |
| `symmetry` | (FL_h - FR_h)^2 + (RL_h - RR_h)^2 | -0.1 | Penalizes left-right asymmetry. Inspired by DMO. Prevents limping gaits. |
| `termination` | is_terminated | -200.0 | Large penalty when an episode ends in failure. Strongly discourages falling. |
| `power` | sum(abs(torque * vel)) | -1e-5 | Penalizes mechanical power consumption. Encourages energy efficiency. |
| `alive` | 1.0 | +0.2 | Constant reward each timestep. Encourages survival without dominating other terms. |
Training difficulty increases automatically over 5 million environment steps through four phases:
| Progress | Phase | Features Enabled |
|---|---|---|
| 0-25% | Foundation | No randomization. Robot learns basic walking mechanics. |
| 25-50% | Light Randomization | Friction variation (0.5-1.25), mass variation (15%), initial pose randomization. |
| 50-75% | Medium Randomization | Motor strength variation (10%), PD gain variation (10%), observation noise. |
| 75-100% | Full Randomization | Push disturbances (0.5 m/s every 10 seconds), all previous randomizations. |
The curriculum factor is computed as:
progress = total_steps / curriculum_end_step
curriculum_factor = clamp(progress, 0.0, 1.0)

This factor gates all randomization intensities, observation noise magnitudes, and command range scaling.
To bridge the sim-to-real gap, I randomize six categories of parameters:
| Parameter | Range | Purpose |
|---|---|---|
| Ground Friction | 0.5 - 1.25 | Simulates different floor surfaces (tile, carpet, concrete, outdoor) |
| Body Mass | 85% - 115% of nominal | Accounts for payloads and manufacturing variation |
| Motor Strength | 90% - 110% | Simulates actuator degradation and voltage variation |
| PD Gains (Kp, Kd) | 90% - 110% | Accounts for control loop uncertainty |
| Stiction | 0.1 - 0.6 Nm | Models static friction in gearboxes and bearings |
| Viscous Damping | 0.01 - 0.05 Nm*s/rad | Models velocity-dependent friction |
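A sketch of how per-episode parameters might be sampled with the range gated by the curriculum factor (the helper and the interpolation scheme are assumptions on my part; the ranges come from the table, and the nominal values are assumed):

```python
import random

def sample_randomization(curriculum_factor, rng=random):
    """Sample per-episode physics parameters. At curriculum_factor = 0 every
    parameter sits at its nominal value; at 1.0 the full table range is used."""
    def widen(lo, hi, nominal):
        # Interpolate the sampling range between the nominal value and
        # the full randomization range.
        lo_c = nominal + curriculum_factor * (lo - nominal)
        hi_c = nominal + curriculum_factor * (hi - nominal)
        return rng.uniform(lo_c, hi_c)

    return {
        "friction": widen(0.5, 1.25, 1.0),
        "mass_scale": widen(0.85, 1.15, 1.0),
        "motor_strength": widen(0.90, 1.10, 1.0),
        "kp_scale": widen(0.90, 1.10, 1.0),
        "stiction": widen(0.1, 0.6, 0.0),   # assumed nominal: no stiction
    }
```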
Real motors exhibit internal friction that simulation typically ignores. I implemented a physics-based friction model:
tau_stiction = Fs * tanh(qdot / velocity_threshold) # Static friction
tau_viscous = mu_v * qdot # Viscous damping
tau_output = tau_PD - tau_stiction - tau_viscous

The hyperbolic tangent provides a smooth transition around zero velocity, avoiding numerical issues while capturing the "sticking" behavior of real actuators.
The policy receives a 52-dimensional observation vector:
| Dimensions | Content | Scaling | Noise Level |
|---|---|---|---|
| 0-2 | Body linear velocity | x2.0 | 10% |
| 3-5 | Body angular velocity | x0.25 | 20% |
| 6-8 | Projected gravity | x1.0 | 5% |
| 9-11 | Velocity commands | Variable | 0% |
| 12-23 | Joint positions (offset from default) | x1.0 | 1% |
| 24-35 | Joint velocities | x0.05 | High |
| 36-47 | Previous actions | x1.0 | 0% |
| 48-51 | Gait clock signals | x1.0 | 0% |
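A sketch of how scaling and curriculum-gated noise could be applied to one observation group (the helper is hypothetical, and I read the table's noise levels as multiplicative fractions, which is an assumption):

```python
import random

def scale_and_noise(values, scale, noise_frac, curriculum_factor, rng=random):
    """Scale raw sensor values, then perturb each by a uniform multiplicative
    noise whose magnitude is gated by the curriculum factor (so early in
    training the observations are noise-free)."""
    out = []
    for v in values:
        noise = rng.uniform(-1.0, 1.0) * noise_frac * curriculum_factor
        out.append(v * scale * (1.0 + noise))
    return out
```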
I added sinusoidal clock signals that encode the desired gait phase:
# Trot gait: diagonal legs move together
foot_indices = [phase + 0.5, phase, phase, phase + 0.5] # FL, FR, RL, RR
clock_inputs = sin(2 * pi * foot_indices)

This provides explicit timing information to the policy, significantly accelerating the learning of regular trotting gaits. The approach is inspired by "Walk These Ways" from MIT.
The baseline used Isaac Lab's built-in position controller. I replaced this with explicit PD torque control:
torques = Kp * (desired_pos - joint_pos) - Kd * joint_vel
torques = torques * motor_strength # Domain randomization
torques = clip(torques, -torque_limit, torque_limit)
robot.set_joint_effort_target(torques)

This enables randomization of Kp, Kd, and motor strength, which is essential for producing policies robust to actuator variations.
The backflip represents one of the most challenging dynamic maneuvers for quadruped robots. It requires explosive power generation, mid-air attitude control, and precise landing timing. I designed a complete phase-based training system that decomposes this complex skill into learnable sub-behaviors.
The backflip is divided into five distinct phases, each with specific objectives:
| Phase | Duration | Objective | Key Metrics |
|---|---|---|---|
| CROUCH | 0-10% | Lower center of mass, compress legs | Height 22cm, slight backward lean |
| LAUNCH | 10-23% | Explosive leg extension, initiate rotation | Vertical velocity 3 m/s, pitch rate 12 rad/s |
| FLIGHT | 23-57% | Tuck legs, maintain rotation rate | Peak height 0.9m, angular momentum conservation |
| EXTEND | 57-73% | Extend legs for landing preparation | Reduce rotation rate, orient feet downward |
| LAND | 73-100% | Soft touchdown, recover to standing | Low impact velocity, upright orientation |
PHASE_CROUCH = 0 # Prepare for jump
PHASE_LAUNCH = 1 # Explosive takeoff
PHASE_FLIGHT = 2 # Mid-air rotation
PHASE_EXTEND = 3 # Landing preparation
PHASE_LAND = 4 # Recovery to standing

To detect a complete backflip, I track cumulative pitch rotation:
current_pitch = atan2(-gravity_x, -gravity_z)
delta_pitch = current_pitch - previous_pitch
cumulative_pitch += delta_pitch # Full flip = -2*pi radians

The expected height follows projectile motion:
h(t) = h0 + v0*t - 0.5*g*t^2
height_reward = exp(-3.0 * |actual - expected|)
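One subtlety in the cumulative-pitch tracking above: `atan2` jumps at ±π, so the per-step delta should be wrapped before accumulation or a single crossing would register as a near-full turn. A hedged sketch of one way to do this (hypothetical helper, not the repo's code):

```python
import math

def accumulate_pitch(cum_pitch, prev_pitch, gravity_x, gravity_z):
    """Update cumulative pitch from the projected gravity vector, wrapping
    the per-step delta into (-pi, pi] so the running sum survives the
    atan2 discontinuity at +/- pi."""
    pitch = math.atan2(-gravity_x, -gravity_z)
    delta = pitch - prev_pitch
    # Map the delta to the nearest equivalent angle: a jump from +pi to
    # -pi then counts as a tiny rotation, not a full turn.
    delta = (delta + math.pi) % (2 * math.pi) - math.pi
    return cum_pitch + delta, pitch
```

Stepping the gravity vector through a full rotation accumulates ±2π as expected, even though the raw `atan2` output wraps twice along the way.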
The backflip policy receives 56-dimensional observations:
| Dimensions | Content |
|---|---|
| 0-2 | Body linear velocity |
| 3-5 | Body angular velocity |
| 6-8 | Projected gravity (orientation) |
| 9-11 | Phase information (sin, cos, phase_id) |
| 12-23 | Joint position offsets |
| 24-35 | Joint velocities |
| 36-47 | Previous actions |
| 48-51 | Target motion (pitch_vel, height, pitch, time) |
| 52-55 | Feet contact states |
| Phase | Reward | Scale | Description |
|---|---|---|---|
| Crouch | crouch_phase | 1.0 | Proper pre-jump posture |
| Launch | launch_phase | 5.0 | Explosive takeoff with rotation initiation |
| Flight | flight_rotation | 5.0 | Maintain target angular velocity |
| Flight | tuck | 2.0 | Proper tucked leg position |
| Extend | extend_phase | 1.5 | Prepare legs for landing |
| Land | land_soft | 3.0 | Low impact velocity at touchdown |
| Land | land_upright | 4.0 | Land on feet, not back |
| Term | Scale | Description |
|---|---|---|
| angular_momentum | +1.0 | Track target rotation rate |
| height_trajectory | +1.0 | Follow ballistic flight path |
| crash_penalty | -50.0 | Landing on back, head, or side |
| early_contact | -10.0 | Ground contact during flight |
| success_bonus | +100.0 | Complete flip with stable recovery |
Training starts with partial rotations and gradually increases:
rotation_curriculum = 0.5 # Start at 50% rotation
if success_rate > 0.5:
    rotation_curriculum += 0.01 # Increase toward 100%

Flat Terrain (Main Task):

cd $HOME/rob6323_go2_project
./train.sh

Backflip (Bonus):

./train_backflip.sh

Monitor jobs with `ssh burst "squeue -u $USER"`, download logs with `rsync -avzP YOUR_NETID@dtn.hpc.nyu.edu:/home/YOUR_NETID/rob6323_go2_project/logs ./`, and inspect metrics with `tensorboard --logdir ./logs`.

| Metric | Baseline | My Implementation |
|---|---|---|
| Episode Length | ~100 steps (falling) | ~1000 steps (full episode) |
| Velocity Tracking Error | ~1.0 m/s | Less than 0.3 m/s |
| Foot Slip Events | ~50 per episode | Less than 10 per episode |
| Curriculum Completion | N/A | 100% by 5M steps |
I implemented comprehensive TensorBoard logging for all 26 reward terms, plus:
- `vel_tracking_error`: Command-following accuracy
- `orientation_error`: Body stability measure
- `slip_count`: Total foot sliding events
- `energy_consumption`: Mechanical power integral
- `episode_length`: Survival duration
- `curriculum_factor`: Training progress (0.0 to 1.0)
| Iterations | Observed Behavior |
|---|---|
| 0-500 | Learning to crouch and attempt jumps |
| 500-1000 | Initiating rotation, partial flips |
| 1000-1500 | Completing rotation, crash landings |
| 1500-2000 | Soft landings, recovery to standing |
| 2000+ | Consistent, repeatable backflips |
