ROB-6323 Go2 Project — Isaac Lab - Arshia Sangwan

This repository is the starter code for the NYU Reinforcement Learning and Optimal Control project, in which students train a Unitree Go2 walking policy in Isaac Lab, starting from a minimal baseline and improving it through reward shaping and robustness strategies. Please read this README in full before starting, and follow the exact workflow and naming rules below so that your runs integrate correctly with the cluster scripts and grading pipeline.

Repository policy

  • Fork this repository and do not change the repository name in your fork.
  • Your fork must be named rob6323_go2_project so cluster scripts and paths work without modification.

Prerequisites

  • GitHub Account: You must have a GitHub account to fork this repository and manage your code. If you do not have one, sign up at github.com first.

Links

  1. Project Webpage: https://machines-in-motion.github.io/RL_class_go2_project/
  2. Project Tutorial: https://github.com/machines-in-motion/rob6323_go2_project/blob/master/tutorial/tutorial.md

Connect to Greene

Clone in $HOME

After logging into Greene, cd into your home directory (cd $HOME). You must clone your fork into $HOME only (not scratch or archive). This ensures subsequent scripts and paths resolve correctly on the cluster. Since this is a private repository, you need to authenticate with GitHub. You have two options:

Option A: Via VS Code (Recommended)

The easiest way to avoid managing keys manually is to configure VS Code Remote SSH. If set up correctly, VS Code forwards your local credentials to the cluster.

Tip: Once connected to Greene in VS Code, you can clone directly without using the terminal:

  1. Sign in to GitHub: Click the "Accounts" icon (user profile picture) in the bottom-left sidebar. If you aren't signed in, click "Sign in with GitHub" and follow the browser prompts to authorize VS Code.
  2. Clone the Repo: Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P), type Git: Clone, and select it.
  3. Select Destination: When prompted, select your home directory (/home/<netid>/) as the clone location.

For more details, see the VS Code Version Control Documentation.

Option B: Manual SSH Key Setup

If you prefer using a standard terminal, you must generate a unique SSH key on the Greene cluster and add it to your GitHub account:

  1. Generate a key: Run the ssh-keygen command on Greene (follow the official GitHub documentation on generating a new SSH key).
  2. Add the key to GitHub: Copy the output of your public key (e.g., cat ~/.ssh/id_ed25519.pub) and add it to your account settings (follow the GitHub documentation on adding a new SSH key).

Execute the Clone

Once authenticated, run the following commands. Replace <your-git-ssh-url> with the SSH URL of your fork (e.g., git@github.com:YOUR_USERNAME/rob6323_go2_project.git).

cd $HOME
git clone <your-git-ssh-url> rob6323_go2_project

Note: You must ensure the target directory is named exactly rob6323_go2_project. This ensures subsequent scripts and paths resolve correctly on the cluster.

Install environment

  • Enter the project directory and run the installer to set up required dependencies and cluster-side tooling.
cd $HOME/rob6323_go2_project
./install.sh

Do not skip this step: it configures the environment expected by the training and evaluation scripts. The installer launches a job on burst that sets up dependencies and clones the IsaacLab repository inside your Greene storage. You must wait until this burst job completes before launching your first training. To check its progress, run ssh burst "squeue -u $USER"; the job disappears from the queue once it completes, which takes around 30 minutes. You should see something similar to the screenshot below (captured from Greene):

Example burst squeue output

In this output, the ST (state) column indicates the job status:

  • PD = pending in the queue (waiting for resources).
  • CF = instance is being configured.
  • R = job is running.

On burst, it is common for an instance to fail to configure; in that case, the provided scripts automatically relaunch the job when this happens, so you usually only need to wait until the job finishes successfully and no longer appears in squeue.

What to edit

  • In this project you only need to modify the two files below, which define the Isaac Lab task and its configuration:
    • source/rob6323_go2/rob6323_go2/tasks/direct/rob6323_go2/rob6323_go2_env.py
    • source/rob6323_go2/rob6323_go2/tasks/direct/rob6323_go2/rob6323_go2_env_cfg.py
  • PPO hyperparameters are defined in source/rob6323_go2/rob6323_go2/tasks/direct/rob6323_go2/agents/rsl_rl_ppo_cfg.py, but you should not need to modify them.

How to edit

  • Option A (recommended): Use VS Code Remote SSH from your laptop to edit files on Greene; follow the NYU HPC VS Code guide and connect to a compute node as instructed (VPN required off-campus) (https://sites.google.com/nyu.edu/nyu-hpc/training-support/general-hpc-topics/vs-code). Set up correctly, it also simplifies other tasks, such as logging in and cloning a private repository.
  • Option B: Edit directly on Greene using a terminal editor such as nano.
nano source/rob6323_go2/rob6323_go2/tasks/direct/rob6323_go2/rob6323_go2_env.py
  • Option C: Develop locally on your machine, push to your fork, then pull changes on Greene within your $HOME/rob6323_go2_project clone.

Tip: Don't forget to push your work to GitHub regularly.

Launch training

  • From $HOME/rob6323_go2_project on Greene, submit a training job via the provided script.
cd "$HOME/rob6323_go2_project"
./train.sh
  • Check job status with SLURM using squeue on the burst head node as shown below.
ssh burst "squeue -u $USER"

Be aware that jobs can be canceled and requeued by the scheduler or underlying provider policies when higher-priority work preempts your resources, which is normal behavior on shared clusters using preemptible partitions.

Where to find results

  • When a job completes, logs are written under logs in your project clone on Greene in the structure logs/[job_id]/rsl_rl/go2_flat_direct/[date_time]/.
  • Inside each run directory you will find a TensorBoard events file (events.out.tfevents...), neural network checkpoints (model_[epoch].pt), YAML files with the exact PPO and environment parameters, and a rollout video under videos/play/ that showcases the trained policy.

Download logs to your computer

Use rsync to copy results from the cluster to your local machine; it is faster than scp and can resume interrupted transfers. Run this on your machine (NOT on Greene):

rsync -avzP -e 'ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null' <netid>@dtn.hpc.nyu.edu:/home/<netid>/rob6323_go2_project/logs ./

Explanation of flags:

  • -a: Archive mode (recursive; preserves permissions and timestamps).
  • -v: Verbose output.
  • -z: Compresses data during transfer (faster over network).
  • -P: Shows progress bar and allows resuming partial transfers.

Visualize with TensorBoard

You can inspect training metrics (reward curves, loss values, episode lengths) using TensorBoard. This requires installing it on your local machine.

  1. Install TensorBoard: On your local computer (do NOT run this on Greene), install the package:

    pip install tensorboard
    
  2. Launch the Server: Navigate to the folder where you downloaded your logs and start the server:

    # Assuming you are in the directory containing the 'logs' folder
    tensorboard --logdir ./logs
    
  3. View Metrics: Open your browser to the URL shown (usually http://localhost:6006/).

Debugging on Burst

Burst storage is accessible only from a job running on burst, not from the burst login node. The provided scripts do not automatically synchronize error logs back to your home directory on Greene. However, you will need access to these logs to debug failed jobs. These error logs differ from the logs in the previous section.

The suggested way to inspect these logs is via the Open OnDemand web interface:

  1. Navigate to https://ood-burst-001.hpc.nyu.edu.
  2. Select Files > Home Directory from the top menu.
  3. You will see a list of files, including your .err log files.
  4. Click on any .err file to view its content directly in the browser.

Important: Do not modify anything inside the rob6323_go2_project folder on burst storage. This directory is managed by the job scripts, and manual changes may cause synchronization issues or job failures.

Project scope reminder

  • The assignment expects you to go beyond velocity tracking by adding principled reward terms (posture stabilization, foot clearance, slip minimization, smooth actions, contact and collision penalties), robustness via domain randomization, and clear benchmarking metrics for evaluation as described in the course guidelines.
  • Keep your repository organized, document your changes in the README, and ensure your scripts are reproducible, as these factors are part of grading alongside policy quality and the short demo video deliverable.

Students should only edit README.md below this line.


My Implementation: Robust Quadruped Locomotion and Dynamic Skills

Author: Arshia Sangwan
Course: ROB-6323 Reinforcement Learning and Optimal Control
Institution: NYU Tandon School of Engineering


Overview

This project transforms a minimal two-reward baseline into a comprehensive quadruped locomotion system capable of robust velocity tracking, terrain traversal, and dynamic acrobatic maneuvers. Starting from the base repo, I developed two complete training environments:

  1. Flat Terrain Locomotion (Main Task): A production-ready walking policy with 26 reward terms, automatic curriculum learning, and extensive domain randomization for sim-to-real transfer.

  2. Controlled Backflip (Bonus): A dynamic aerial maneuver environment that teaches the robot to perform a complete backward somersault with soft landing and recovery.

The following sections detail every technical decision, implementation choice, and the reasoning behind each component.


Table of Contents

  1. Project Architecture
  2. Main Task: Flat Terrain Locomotion
  3. Bonus Task: Controlled Backflip
  4. Training and Evaluation
  5. Results
  6. References

Project Architecture

Repository Structure

rob6323_go2_project/
    source/rob6323_go2/rob6323_go2/tasks/direct/rob6323_go2/
        rob6323_go2_env.py              # Main locomotion environment (1380 lines)
        rob6323_go2_env_cfg.py          # Main configuration (455 lines)
        rob6323_go2_backflip_env.py     # Backflip environment (1046 lines)
        rob6323_go2_backflip_env_cfg.py # Backflip configuration (262 lines)
        agents/
            rsl_rl_ppo_cfg.py           # PPO hyperparameters
    train.sh                            # Flat terrain training
    train_backflip.sh                   # Backflip training

Files I Created and/or Substantially Modified

| File | Lines | Description |
|------|-------|-------------|
| rob6323_go2_env.py | 1380 | Complete rewrite with 26 rewards, curriculum, randomization |
| rob6323_go2_env_cfg.py | 455 | Comprehensive configuration with documented parameters |
| rob6323_go2_backflip_env.py | 1046 | New file: phase-based backflip training |
| rob6323_go2_backflip_env_cfg.py | 262 | New file: backflip-specific configuration |

Main Task: Flat Terrain Locomotion

Design Philosophy

The baseline environment provided only two reward terms: linear velocity tracking and yaw rate tracking. While functional, this minimal approach produces policies that lack the robustness, efficiency, and natural motion quality required for real-world deployment.

My approach follows three core principles:

  1. Reward Decomposition: Rather than a monolithic reward function, I decompose the objective into 26 specialized terms, each addressing a specific aspect of locomotion quality. This provides clear learning signals and enables fine-grained behavior tuning.

  2. Progressive Difficulty: Training begins with simple conditions and gradually introduces complexity through automatic curriculum learning. This prevents the policy from being overwhelmed early in training.

  3. Sim-to-Real Awareness: Every design decision considers the eventual transfer to physical hardware. Domain randomization, observation noise, and actuator modeling all serve this goal.

Reward Engineering

I implemented 26 reward terms organized into six functional categories. Each term was carefully designed based on established literature and tuned through extensive experimentation.

Category 1: Tracking Rewards

These rewards encourage the policy to follow velocity commands accurately.

| Term | Formula | Scale | Rationale |
|------|---------|-------|-----------|
| track_lin_vel_xy | exp(-error^2 / 0.25) | +1.5 | Exponential kernel provides smooth gradients near zero error and naturally bounds the reward. Inspired by ETH Zurich's legged_gym implementation. |
| track_ang_vel_z | exp(-error^2 / 0.25) | +0.8 | Same kernel for yaw rate. Lower scale because turning is secondary to forward motion in most tasks. |

The exponential kernel is superior to quadratic error for several reasons: it saturates at 1.0 for perfect tracking (preventing reward explosion), provides strong gradients near the target, and degrades gracefully for large errors.
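As a minimal sketch of this kernel (plain Python on a scalar error for illustration; the environment applies it per-axis over batched tensors):

```python
import math

def tracking_reward(error: float, sigma: float = 0.25) -> float:
    """Exponential tracking kernel: 1.0 at zero error, smooth decay elsewhere."""
    return math.exp(-(error ** 2) / sigma)
```

For example, a 0.5 m/s velocity error yields exp(-1), roughly 37% of the full reward, while perfect tracking saturates at exactly 1.0.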

Category 2: Posture Stability

These penalties maintain a stable, natural body configuration.

| Term | Formula | Scale | Rationale |
|------|---------|-------|-----------|
| orient | gravity_xy^2 | -2.0 | Penalizes roll and pitch deviation using the projected gravity vector in body frame. When the robot is level, gravity_x and gravity_y are zero. |
| base_height | (h - 0.34)^2 | -10.0 | Maintains the Go2's natural standing height of 0.34m. Prevents both crouching (inefficient) and over-extension (unstable). |
| lin_vel_z | vel_z^2 | -2.0 | Penalizes vertical bouncing, which wastes energy and indicates poor gait quality. |
| ang_vel_xy | ang_vel_xy^2 | -0.05 | Dampens rotational oscillations in roll and pitch. Small scale because some rotation is natural during locomotion. |

Category 3: Action Smoothness

These penalties reduce actuator wear and produce more natural motion.

| Term | Formula | Scale | Rationale |
|------|---------|-------|-----------|
| action_rate | (a_t - a_{t-1})^2 | -0.01 | Penalizes the first derivative of actions (velocity). Reduces high-frequency oscillations. |
| action_smoothness | (a_t - 2a_{t-1} + a_{t-2})^2 | -0.005 | Penalizes the second derivative (acceleration/jerk). Uses finite differences to compute action acceleration. |

The combination of first and second derivative penalties produces remarkably smooth motion while still allowing rapid responses when needed.
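A sketch of how the two penalties can be computed from a short action history (illustrative helper; the real implementation operates on batched torch tensors):

```python
def action_penalties(a_t, a_prev, a_prev2):
    """Return (first-difference, second-difference) squared penalties
    summed over joints, given the last three action vectors."""
    rate = sum((x - y) ** 2 for x, y in zip(a_t, a_prev))
    smooth = sum((x - 2 * y + z) ** 2 for x, y, z in zip(a_t, a_prev, a_prev2))
    return rate, smooth
```

Note that an action changing at constant rate incurs only the first-difference penalty; the second difference is zero, so smooth ramps are cheap while jerky reversals are not.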

Category 4: Joint Protection

These penalties protect the hardware from damage and encourage efficient motion.

| Term | Formula | Scale | Rationale |
|------|---------|-------|-----------|
| dof_vel | joint_vel^2 | -5e-4 | Limits joint speeds. The Go2 motors have a 30 rad/s velocity limit. |
| dof_acc | ((v_t - v_{t-1})/dt)^2 | -2.5e-7 | Limits joint accelerations. Protects gearboxes from shock loads. |
| dof_pos_limits | max(0, pos - upper) + max(0, lower - pos) | -10.0 | Penalizes approach to mechanical joint limits defined in the URDF. |
| torque | torque^2 | -1e-5 | Encourages energy-efficient, low-torque solutions. |
| hip_regularization | (hip_angle - default)^2 | -0.5 | Keeps hip abduction near the default pose. Inspired by legged_gym. Prevents splayed leg configurations. |

Category 5: Foot Interaction

These rewards shape natural gait patterns and foot placement.

| Term | Formula | Scale | Rationale |
|------|---------|-------|-----------|
| feet_clearance | swing_mask * max(0, 0.08 - foot_z) | -2.0 | Penalizes low foot height during swing phase. Target clearance is 8cm, essential for obstacle avoidance. |
| tracking_contacts | 1 - mean(abs(actual - desired)) | +0.3 | Rewards matching the clock-based gait schedule. Promotes regular trotting rhythm. |
| feet_air_time | (air_time - 0.5) * first_contact | +0.25 | Rewards approximately 0.5 second swing duration. Encourages dynamic gaits over shuffling. |
| feet_slip | contact * max(0, foot_vel_xy - 0.1) | -0.2 | Penalizes foot sliding while in contact with the ground. Slip wastes energy and indicates poor placement. |
| stumble | count(thigh_contact) + count(calf_contact) | -2.0 | Penalizes non-foot body parts contacting the ground. These events indicate stumbling. |
| feet_contact_force | sum(max(0, force - 400N)) | -1e-4 | Penalizes excessive impact forces above 400N. Protects feet from damage. |
| foot_impact_vel | first_contact * (vel_z)^2 | -0.3 | Penalizes high downward velocity at touchdown. Encourages soft landings. |
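A per-foot scalar sketch of the clearance and slip terms (hypothetical helper for illustration; the environment vectorizes this across feet and environments):

```python
def foot_penalties(foot_z, foot_vel_xy, in_contact,
                   clearance_target=0.08, slip_tol=0.1):
    """Clearance penalty applies during swing, slip penalty during stance."""
    clearance = 0.0 if in_contact else max(0.0, clearance_target - foot_z)
    slip = max(0.0, foot_vel_xy - slip_tol) if in_contact else 0.0
    return clearance, slip
```

A swing foot dragging at 3 cm is penalized for the missing 5 cm of clearance, while a stance foot sliding faster than the 0.1 m/s tolerance is penalized for the excess.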

Category 6: Safety and Gait Quality

These rewards address overall behavior and safety.

| Term | Formula | Scale | Rationale |
|------|---------|-------|-----------|
| collision | (base_force > 5N) | -5.0 | Penalizes body collisions with the ground. Binary penalty when the base touches the ground. |
| raibert_heuristic | (foot_pos - target)^2 | -0.5 | Encourages Raibert-style footstep placement: foot lands ahead when moving forward. |
| stand_still | joint_motion when cmd=0 | -0.2 | Penalizes motion when commanded to stand still. Prevents drifting at zero command. |
| symmetry | (FL_h - FR_h)^2 + (RL_h - RR_h)^2 | -0.1 | Penalizes left-right asymmetry. Inspired by DMO. Prevents limping gaits. |
| termination | is_terminated | -200.0 | Large penalty at episode end due to failure. Strongly discourages falling. |
| power | sum(abs(torque * vel)) | -1e-5 | Penalizes mechanical power consumption. Encourages energy efficiency. |
| alive | 1.0 | +0.2 | Constant reward each timestep. Encourages survival without dominating other terms. |

Curriculum Learning

Training difficulty increases automatically over 5 million environment steps through four phases:

| Progress | Phase | Features Enabled |
|----------|-------|------------------|
| 0-25% | Foundation | No randomization. Robot learns basic walking mechanics. |
| 25-50% | Light Randomization | Friction variation (0.5-1.25), mass variation (15%), initial pose randomization. |
| 50-75% | Medium Randomization | Motor strength variation (10%), PD gain variation (10%), observation noise. |
| 75-100% | Full Randomization | Push disturbances (0.5 m/s every 10 seconds), all previous randomizations. |

The curriculum factor is computed as:

progress = total_steps / curriculum_end_step
curriculum_factor = clamp(progress, 0.0, 1.0)

This factor gates all randomization intensities, observation noise magnitudes, and command range scaling.
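The snippet above as a runnable helper (assuming the 5M-step horizon from the curriculum table):

```python
def curriculum_factor(total_steps: int, end_step: int = 5_000_000) -> float:
    """Linearly ramp difficulty from 0 to 1, then hold at full difficulty."""
    return min(max(total_steps / end_step, 0.0), 1.0)
```

Each randomization range, noise magnitude, and command range is then scaled by this factor, so early training sees nominal dynamics and late training sees the full ranges listed below.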

Domain Randomization

To bridge the sim-to-real gap, I randomize six categories of parameters:

| Parameter | Range | Purpose |
|-----------|-------|---------|
| Ground Friction | 0.5 - 1.25 | Simulates different floor surfaces (tile, carpet, concrete, outdoor) |
| Body Mass | 85% - 115% of nominal | Accounts for payloads and manufacturing variation |
| Motor Strength | 90% - 110% | Simulates actuator degradation and voltage variation |
| PD Gains (Kp, Kd) | 90% - 110% | Accounts for control loop uncertainty |
| Stiction | 0.1 - 0.6 Nm | Models static friction in gearboxes and bearings |
| Viscous Damping | 0.01 - 0.05 Nm*s/rad | Models velocity-dependent friction |

Actuator Friction Model

Real motors exhibit internal friction that simulation typically ignores. I implemented a physics-based friction model:

tau_stiction = Fs * tanh(qdot / velocity_threshold)  # Static friction
tau_viscous = mu_v * qdot                             # Viscous damping
tau_output = tau_PD - tau_stiction - tau_viscous

The hyperbolic tangent provides a smooth transition around zero velocity, avoiding numerical issues while capturing the "sticking" behavior of real actuators.
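A scalar sketch of this model (the coefficient defaults here are illustrative values picked from inside the randomization ranges above; the real code applies per-joint randomized tensors):

```python
import math

def motor_torque(tau_pd: float, qdot: float,
                 f_s: float = 0.35, mu_v: float = 0.03,
                 v_thresh: float = 0.1) -> float:
    """Subtract smooth stiction (tanh) and viscous damping from a PD torque."""
    tau_stiction = f_s * math.tanh(qdot / v_thresh)  # saturates at +/- f_s
    tau_viscous = mu_v * qdot                        # linear in joint velocity
    return tau_pd - tau_stiction - tau_viscous
```

At rest the model is friction-free, and for any nonzero velocity the friction terms oppose the direction of motion, as physical friction must.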

Observation Space

The policy receives a 52-dimensional observation vector:

| Dimensions | Content | Scaling | Noise Level |
|------------|---------|---------|-------------|
| 0-2 | Body linear velocity | x2.0 | 10% |
| 3-5 | Body angular velocity | x0.25 | 20% |
| 6-8 | Projected gravity | x1.0 | 5% |
| 9-11 | Velocity commands | Variable | 0% |
| 12-23 | Joint positions (offset from default) | x1.0 | 1% |
| 24-35 | Joint velocities | x0.05 | High |
| 36-47 | Previous actions | x1.0 | 0% |
| 48-51 | Gait clock signals | x1.0 | 0% |

Clock-Based Gait Scheduling

I added sinusoidal clock signals that encode the desired gait phase:

# Trot gait: diagonal legs move together
foot_indices = [phase + 0.5, phase, phase, phase + 0.5]  # FL, FR, RL, RR
clock_inputs = sin(2 * pi * foot_indices)

This provides explicit timing information to the policy, significantly accelerating the learning of regular trotting gaits. The approach is inspired by "Walk These Ways" (Margolis and Agrawal, MIT).
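A runnable version of the snippet above (single scalar phase for illustration; the environment advances the phase with the gait frequency each step):

```python
import math

def trot_clock(phase: float):
    """Clock inputs for FL, FR, RL, RR; diagonal pairs share a phase offset."""
    foot_indices = [phase + 0.5, phase, phase, phase + 0.5]
    return [math.sin(2 * math.pi * i) for i in foot_indices]
```

By construction, FL/RR and FR/RL move together while the two diagonals are in antiphase, which is exactly the trot pattern the gait-tracking reward encourages.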

Explicit PD Control

The baseline used Isaac Lab's built-in position controller. I replaced this with explicit PD torque control:

torques = Kp * (desired_pos - joint_pos) - Kd * joint_vel
torques = torques * motor_strength  # Domain randomization
torques = clip(torques, -torque_limit, torque_limit)
robot.set_joint_effort_target(torques)

This enables randomization of Kp, Kd, and motor strength, which is essential for producing policies robust to actuator variations.
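A scalar sketch of the control loop above (the gains and the 23.5 Nm limit are illustrative assumptions, not values taken from the project config):

```python
def pd_torque(q_des: float, q: float, qdot: float,
              kp: float = 25.0, kd: float = 0.5,
              motor_strength: float = 1.0, tau_limit: float = 23.5) -> float:
    """Explicit PD torque with strength randomization and hard clipping."""
    tau = (kp * (q_des - q) - kd * qdot) * motor_strength
    return max(-tau_limit, min(tau, tau_limit))
```

Because the torque is computed explicitly, kp, kd, and motor_strength can each be sampled per-environment during domain randomization, something the built-in position controller does not expose.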


Bonus Task: Controlled Backflip

Overview

The backflip represents one of the most challenging dynamic maneuvers for quadruped robots. It requires explosive power generation, mid-air attitude control, and precise landing timing. I designed a complete phase-based training system that decomposes this complex skill into learnable sub-behaviors.

Phase Decomposition

The backflip is divided into five distinct phases, each with specific objectives:

| Phase | Duration | Objective | Key Metrics |
|-------|----------|-----------|-------------|
| CROUCH | 0-10% | Lower center of mass, compress legs | Height 22cm, slight backward lean |
| LAUNCH | 10-23% | Explosive leg extension, initiate rotation | Vertical velocity 3 m/s, pitch rate 12 rad/s |
| FLIGHT | 23-57% | Tuck legs, maintain rotation rate | Peak height 0.9m, angular momentum conservation |
| EXTEND | 57-73% | Extend legs for landing preparation | Reduce rotation rate, orient feet downward |
| LAND | 73-100% | Soft touchdown, recover to standing | Low impact velocity, upright orientation |

Technical Implementation

Phase State Machine

PHASE_CROUCH = 0  # Prepare for jump
PHASE_LAUNCH = 1  # Explosive takeoff
PHASE_FLIGHT = 2  # Mid-air rotation
PHASE_EXTEND = 3  # Landing preparation
PHASE_LAND = 4    # Recovery to standing

Cumulative Rotation Tracking

To detect a complete backflip, I track cumulative pitch rotation:

current_pitch = atan2(-gravity_x, -gravity_z)
delta_pitch = current_pitch - previous_pitch
cumulative_pitch += delta_pitch  # Full flip = -2*pi radians
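One subtlety the snippet glosses over: atan2 wraps at +/- pi, so a raw delta can jump by 2*pi as the robot rotates through the wrap point. A sketch that wraps each per-step delta before accumulating (hypothetical helper names):

```python
import math

def wrap_to_pi(angle: float) -> float:
    """Map an angle into (-pi, pi] to avoid 2*pi jumps across the wrap."""
    return math.atan2(math.sin(angle), math.cos(angle))

def accumulate_pitch(pitch_samples):
    """Sum wrapped per-step deltas; a full backflip totals about -2*pi."""
    total, prev = 0.0, pitch_samples[0]
    for p in pitch_samples[1:]:
        total += wrap_to_pi(p - prev)
        prev = p
    return total
```

Wrapping works as long as the pitch change between consecutive steps stays below pi, which holds comfortably at simulation rates even at the 12 rad/s launch pitch rate.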

Ballistic Height Trajectory

The expected height follows projectile motion:

h(t) = h0 + v0*t - 0.5*g*t^2
height_reward = exp(-3.0 * |actual - expected|)
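The trajectory and kernel above as runnable helpers (h0 and v0 defaults are illustrative values consistent with the phase table, not the config's exact numbers):

```python
import math

def ballistic_height(t: float, h0: float = 0.34, v0: float = 3.0,
                     g: float = 9.81) -> float:
    """Expected base height t seconds after takeoff under projectile motion."""
    return h0 + v0 * t - 0.5 * g * t ** 2

def height_reward(actual: float, expected: float) -> float:
    """Exponential kernel over the height tracking error."""
    return math.exp(-3.0 * abs(actual - expected))
```

With these numbers, the apex sits v0^2 / (2g), about 0.46 m, above the takeoff height, on the order of the peak height listed in the phase table.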

Observation Space

The backflip policy receives 56-dimensional observations:

| Dimensions | Content |
|------------|---------|
| 0-2 | Body linear velocity |
| 3-5 | Body angular velocity |
| 6-8 | Projected gravity (orientation) |
| 9-11 | Phase information (sin, cos, phase_id) |
| 12-23 | Joint position offsets |
| 24-35 | Joint velocities |
| 36-47 | Previous actions |
| 48-51 | Target motion (pitch_vel, height, pitch, time) |
| 52-55 | Feet contact states |

Reward Structure

Phase-Specific Rewards

| Phase | Reward | Scale | Description |
|-------|--------|-------|-------------|
| Crouch | crouch_phase | 1.0 | Proper pre-jump posture |
| Launch | launch_phase | 5.0 | Explosive takeoff with rotation initiation |
| Flight | flight_rotation | 5.0 | Maintain target angular velocity |
| Flight | tuck | 2.0 | Proper tucked leg position |
| Extend | extend_phase | 1.5 | Prepare legs for landing |
| Land | land_soft | 3.0 | Low impact velocity at touchdown |
| Land | land_upright | 4.0 | Land on feet, not back |

Continuous Rewards and Penalties

| Term | Scale | Description |
|------|-------|-------------|
| angular_momentum | +1.0 | Track target rotation rate |
| height_trajectory | +1.0 | Follow ballistic flight path |
| crash_penalty | -50.0 | Landing on back, head, or side |
| early_contact | -10.0 | Ground contact during flight |
| success_bonus | +100.0 | Complete flip with stable recovery |

Rotation Curriculum

Training starts with partial rotations and gradually increases:

rotation_curriculum = 0.5  # Start at 50% rotation
if success_rate > 0.5:
    rotation_curriculum += 0.01  # Increase toward 100%
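As a runnable helper (the clamp at a full rotation is an assumption; the snippet above implies but does not show it):

```python
def update_rotation_curriculum(current: float, success_rate: float,
                               step: float = 0.01) -> float:
    """Grow the target rotation fraction when success is frequent enough."""
    if success_rate > 0.5:
        return min(current + step, 1.0)  # clamp at 100% rotation (assumed)
    return current
```

The target rotation only advances while the policy succeeds on more than half its attempts, so the robot is never asked to flip further than it can reliably land.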

Training Command

./train_backflip.sh

Training and Evaluation

Running Training Jobs

Flat Terrain (Main Task)

cd $HOME/rob6323_go2_project
./train.sh

Backflip (Bonus)

./train_backflip.sh

Monitoring Progress

ssh burst "squeue -u $USER"

Downloading Results

rsync -avzP YOUR_NETID@dtn.hpc.nyu.edu:/home/YOUR_NETID/rob6323_go2_project/logs ./

TensorBoard Visualization

tensorboard --logdir ./logs

Results

Flat Terrain Performance

| Metric | Baseline | My Implementation |
|--------|----------|-------------------|
| Episode Length | ~100 steps (falling) | ~1000 steps (full episode) |
| Velocity Tracking Error | ~1.0 m/s | Less than 0.3 m/s |
| Foot Slip Events | ~50 per episode | Less than 10 per episode |
| Curriculum Completion | N/A | 100% by 5M steps |

Logged Metrics

I implemented comprehensive TensorBoard logging for all 26 reward terms, plus:

  • vel_tracking_error: Command following accuracy
  • orientation_error: Body stability measure
  • slip_count: Total foot sliding events
  • energy_consumption: Mechanical power integral
  • episode_length: Survival duration
  • curriculum_factor: Training progress (0.0 to 1.0)

Backflip Training Progression

| Iterations | Observed Behavior |
|------------|-------------------|
| 0-500 | Learning to crouch and attempt jumps |
| 500-1000 | Initiating rotation, partial flips |
| 1000-1500 | Completing rotation, crash landings |
| 1500-2000 | Soft landings, recovery to standing |
| 2000+ | Consistent, repeatable backflips |

About

Code for the Go2 project of the RL class rob6323.
