Skip to content

This Guidance demonstrates how to build an AI-assisted robot training and fleet management system using Amazon Bedrock foundation models and AWS Trainium. It helps organizations overcome the complexity of training robots for precise tasks and managing fleets at scale

License

Notifications You must be signed in to change notification settings

aws-solutions-library-samples/guidance-for-ai-driven-robotic-simulation-and-training-on-aws

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Guidance for AI-Driven Robotic Simulation and Training on AWS

This guidance showcases a robotic learning system (Imitation Learning) that combines the intelligence of foundation models with the precision of ML and mathematical algorithms, all accelerated by AWS Trainium/GPU and managed through modern cloud-native technologies. This guidance also enables developers to train (reinforcement learning) robotic agents using NVIDIA Isaac Sim on Amazon EKS with LLM-generated reward functions via Bedrock, then automatically deploy trained models to physical robots through AWS IoT services.

Table of Contents

Required

  1. Overview
  2. Prerequisites
  3. Deployment Steps
  4. Deployment Validation
  5. Running the Guidance
  6. Next Steps
  7. Cleanup
  8. Notices

Optional

  1. FAQ, known issues, additional considerations, and limitations
  2. Revisions
  3. Authors

Overview

This guidance demonstrates how to build an AI-assisted robotic learning system that combines foundation models from Amazon Bedrock with reinforcement learning capabilities accelerated by AWS Trainium. The system enables robots to learn complex manipulation tasks through imitation learning and reinforcement learning, with automatic deployment to physical robots via AWS IoT services.

Why did you build this Guidance? Traditional robotic training requires extensive manual programming and domain expertise. This guidance solves the challenge of creating adaptive robotic systems that can learn from demonstrations and improve through reinforcement learning, leveraging AWS's AI/ML services for scalable robot training.

What problem does this Guidance solve?

  • Reduces the complexity of training robotic manipulation tasks
  • Enables continuous learning and improvement of robot policies
  • Provides scalable infrastructure for robot training using cloud-native technologies
  • Integrates foundation models for intelligent reward function generation
  • Automates the deployment pipeline from simulation to physical robots

Architecture Flow:

  1. Data Collection: UR5 robot performs T-bar pushing tasks in NVIDIA Isaac Sim
  2. Policy Training: ACT (Action Chunking Transformer) policy learns from demonstration data
  3. Reinforcement Learning: Policy is fine-tuned using reward functions generated by Amazon Bedrock
  4. Infrastructure: AWS Trainium/GPU instances accelerate training, managed through Amazon EKS
  5. Deployment: Trained models are deployed to physical robots via AWS IoT services

Cost

You are responsible for the cost of the AWS services used while running this Guidance. As of December 2024, the cost for running this Guidance with the default settings in the US East (N. Virginia) Region is approximately $801.20 per month for processing 100 training episodes with continuous robot learning.

We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance.

Sample Cost Table

The following table provides a sample cost breakdown for deploying this Guidance with the default parameters in the US East (N. Virginia) Region for one month.

AWS service Dimensions Cost [USD]
Amazon EC2 (g4dn.xlarge) 1 instance running 24/7 for Isaac Sim $ 367.20
Amazon EKS Cluster 1 cluster for container orchestration $ 73.00
Amazon EC2 (trn1.2xlarge) 2 instances for Trainium training (8 hours/day) $ 256.00
Amazon Bedrock (Claude 3) 10,000 requests for reward function generation $ 45.00
Amazon S3 500 GB storage for training data and models $ 11.50
AWS Secrets Manager 1 secret for password management $ 0.40
Amazon VPC NAT Gateway and data transfer $ 45.60
AWS IoT Core 50,000 messages for robot deployment $ 2.50
Total $ 801.20

Prerequisites

Operating System

These deployment instructions are optimized to best work on Ubuntu 24.04 LTS with NVIDIA GPU support. Deployment on other OS may require additional steps.

Required packages:

  • Node.js 18+ and npm (for CDK deployment)
  • AWS CLI v2
  • Docker and Docker Compose
  • Python 3.10+
  • NVIDIA Docker runtime (for GPU support)
  • ROS 2 Jazzy

Installation commands:

# Install Node.js and npm
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs

# Install AWS CLI v2
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Install Docker
sudo apt-get update
sudo apt-get install -y docker.io docker-compose
sudo usermod -aG docker $USER

# Install NVIDIA Docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

Third-party tools

Required third-party tools:

  • NVIDIA Isaac Sim 4.5.0: Physics simulation environment for robot training
  • LeRobot: Robotics learning framework for policy training
  • PyTorch 2.3.1: Deep learning framework with CUDA 12.1 support
  • OpenCV: Computer vision library for image processing
  • ROS 2 Jazzy: Robot Operating System for robot control
  • MoveIt: Motion planning framework for robotic arms

AWS account requirements

Required AWS account setup:

  • Amazon Bedrock access: Enable Claude 3 model access in your AWS account
  • EC2 instance limits: Ensure sufficient quota for g4dn.xlarge and trn1.2xlarge instances
  • EKS service: Enable Amazon EKS service in your target region
  • IAM permissions: Administrator access or specific permissions for:
    • EC2 instance management
    • EKS cluster creation
    • Bedrock model invocation
    • S3 bucket operations
    • Secrets Manager access
    • IoT Core messaging
  • VPC: Default VPC or custom VPC with internet gateway
  • Key Pair: EC2 key pair for SSH access to instances

aws cdk bootstrap (if sample code has aws-cdk)

This Guidance uses AWS CDK for infrastructure deployment. If you are using AWS CDK for the first time, please perform the following bootstrapping:

# Install AWS CDK globally
npm install -g aws-cdk

# Bootstrap your AWS account for CDK
cdk bootstrap aws://ACCOUNT-NUMBER/REGION

# Example:
cdk bootstrap aws://123456789012/us-east-1

Note: Replace ACCOUNT-NUMBER with your AWS account ID and REGION with your target AWS region.

Service limits

Critical service limits that may require increases:

  • EC2 Instance Limits:

    • g4dn.xlarge instances: Default limit may be 0-5 per region
    • trn1.2xlarge instances: Default limit may be 0-2 per region
    • Request limit increase
  • EKS Cluster Limit: Default 100 clusters per region (usually sufficient)

  • Bedrock Model Access:

  • S3 Storage: Default limits are typically sufficient for this guidance

Supported Regions

Recommended regions (all required services available):

  • us-east-1 (N. Virginia) - Recommended for best service availability
  • us-west-2 (Oregon)
  • eu-west-1 (Ireland)

Service availability considerations:

  • AWS Trainium (trn1 instances): Limited to specific regions
  • Amazon Bedrock: Claude 3 models available in select regions
  • NVIDIA Isaac Sim: Requires GPU-enabled instances (g4dn, p3, p4 families)

Note: Verify Trainium instance availability in your target region before deployment.

Deployment Steps

  1. Clone the repository using command:

    # Make sure git-lfs is installed (https://git-lfs.com)
    git lfs install
    git clone https://github.com/aws-solutions-library-samples/guidance-for-ai-driven-robotic-simulation-and-training-on-aws.git
  2. Navigate to the repository folder:

    cd guidance-for-ai-driven-robotic-simulation-and-training-on-aws
  3. Navigate to the CDK deployment directory:

    cd deployment/cdk-nodejs
  4. Install CDK dependencies:

    npm install
  5. Configure AWS credentials:

    aws configure
  6. Bootstrap CDK (if first time using CDK in this account/region):

    cdk bootstrap
  7. Deploy the infrastructure stack:

    Follow steps mentioned in this doc

  8. Capture the deployed resources:

    # Get EC2 instance ID
    aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`InstanceId`].OutputValue' --output text
    
    # Get instance public IP
    aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`InstancePublicIp`].OutputValue' --output text
    
    # Get S3 bucket name
    aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`S3BucketName`].OutputValue' --output text
  9. Connect to the EC2 instance via SSH:

    ssh -i your-key.pem ubuntu@<INSTANCE_PUBLIC_IP>
  10. Wait for Isaac Sim installation to complete (check status):

    tail -f /var/log/user-data.log
    # Wait until you see "Phase 2 completed"

Deployment Validation

Validate successful deployment:

  1. CloudFormation Stack Status:

    aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].StackStatus'

    Expected output: "CREATE_COMPLETE"

  2. EC2 Instance Status:

    aws ec2 describe-instances --filters "Name=tag:Name,Values=*robotics*" --query 'Reservations[*].Instances[*].[InstanceId,State.Name,PublicIpAddress]' --output table

    Expected: Instance should be in running state

  3. Isaac Sim Installation:

    # SSH to instance and check
    ls -la /home/ubuntu/isaacsim/installation_complete

    Expected: File should exist

  4. DCV Server Status:

    # On the EC2 instance
    sudo systemctl status dcvserver
    dcv list-sessions

    Expected: DCV server running with active session

  5. Access DCV Web Interface:

    • Open browser to https://<INSTANCE_PUBLIC_IP>:8443
    • Login with username ubuntu and password from Secrets Manager
  6. S3 Bucket Creation:

    aws s3 ls | grep robotics

    Expected: Bucket with robotics prefix should be listed

Running the Guidance

Step 1: Access the Isaac Sim Environment

  1. Connect to the EC2 instance via DCV at https://<INSTANCE_PUBLIC_IP>:8443
  2. Login with username ubuntu and retrieve password from AWS Secrets Manager:
    aws secretsmanager get-secret-value --secret-id <SECRET_NAME> --query SecretString --output text

Step 2: Start Simulation and Data collection

Follow the commands

Iterative Training Approach:

  1. Initial Training: Run RL fine-tune for 30 minutes
  2. Evaluation: Test with data collection to measure performance improvement
  3. Iteration: If performance is insufficient, continue RL training for another 30 minutes
  4. Repeat: Continue until desired accuracy threshold is achieved

Monitoring Progress:

  • Training metrics are logged to console
  • Model checkpoints saved automatically
  • Success rate and accuracy displayed in real-time
  • Nova Pro provides intelligent observations of robot behavior

Next Steps

Customization and Enhancement Options:

  1. Modify Training Parameters:

    • Adjust learning rate, batch size, and training epochs in RL_Finetune.py
    • Customize reward functions for different manipulation tasks
    • Modify success thresholds and accuracy floors
  2. Extend to Different Robot Tasks:

    • Replace T-bar pushing with other manipulation tasks (pick-and-place, assembly)
    • Modify the Isaac Sim scene files in source/ur5_nova/configuration/
    • Update observation and action spaces for new tasks
  3. Scale Training Infrastructure:

    • Deploy multiple Trainium instances for distributed training
    • Use Amazon SageMaker for managed training workflows
    • Implement model versioning with Amazon SageMaker Model Registry
  4. Integrate Additional Foundation Models:

    • Use different Bedrock models for reward function generation
    • Implement multi-modal learning with vision-language models
    • Add natural language instruction following capabilities
  5. Deploy to Physical Robots:

    • Configure AWS IoT Core for robot fleet management
    • Implement over-the-air model updates
    • Add real-world sensor integration and calibration
  6. Production Optimization:

    • Implement model quantization for edge deployment
    • Add monitoring and alerting with Amazon CloudWatch
    • Set up automated retraining pipelines with Amazon EventBridge

Cleanup

To completely remove all resources created by this Guidance:

  1. Stop running processes on EC2 instance:

    # SSH to the instance and stop any running training
    pkill -f python3
    sudo systemctl stop dcvserver
  2. Empty S3 bucket contents:

    # Get bucket name from CloudFormation output
    BUCKET_NAME=$(aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`S3BucketName`].OutputValue' --output text)
    
    # Empty the bucket
    aws s3 rm s3://$BUCKET_NAME --recursive
  3. Delete CloudFormation stacks:

    # Delete all CDK stacks
    cdk destroy --all
    
    # Confirm deletion when prompted
  4. Verify resource deletion:

    # Check that stacks are deleted
    aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE
    
    # Verify EC2 instances are terminated
    aws ec2 describe-instances --filters "Name=tag:Name,Values=*robotics*" --query 'Reservations[*].Instances[*].[InstanceId,State.Name]' --output table
  5. Manual cleanup:

    • EKS clusters: If EKS stack deletion fails, manually delete from console
    • VPC resources: Delete any remaining ENIs or security groups
    • IAM roles: Remove any custom IAM roles if they weren't deleted
    • Secrets Manager: Delete the password secret if it remains

Note: The cleanup process may take 10-15 minutes to complete. Monitor the CloudFormation console to ensure all stacks are successfully deleted.

FAQ, known issues, additional considerations, and limitations (optional)

Known issues

  1. Isaac Sim Installation Timeout:

    • Issue: Isaac Sim download may timeout on slower connections
    • Resolution: SSH to instance and manually restart download:
      cd /home/ubuntu/isaacsim
      wget --continue https://download.isaacsim.omniverse.nvidia.com/isaac-sim-standalone-5.1.0-linux-x86_64.zip
  2. DCV Connection Issues:

    • Issue: Cannot connect to DCV web interface
    • Resolution: Check security group allows port 8443 and restart DCV:
      sudo systemctl restart dcvserver
  3. CUDA Out of Memory:

    • Issue: Training fails with CUDA OOM errors
    • Resolution: Reduce batch size in training scripts or use smaller model
  4. Trainium Instance Unavailability:

    • Issue: trn1 instances not available in region
    • Resolution: Use alternative regions or switch to GPU instances (p3, g4dn)
  5. Bedrock Model Access:

    • Issue: Access denied to Claude 3 models
    • Resolution: Request model access in Bedrock console before deployment

Additional considerations

  • This Guidance creates EC2 instances that are billed per hour while running, including during idle time
  • Trainium instances (trn1.2xlarge) are premium instances with higher hourly costs
  • The guidance creates a VPC with NAT Gateway that incurs hourly charges
  • Isaac Sim requires significant disk space (>50GB) and may increase EBS costs
  • DCV server creates a remote desktop session accessible over the internet - ensure strong passwords
  • Training data is stored in S3 and may accumulate over time, monitor storage costs
  • Bedrock API calls are charged per request - monitor usage during reward function generation
  • The system generates significant network traffic during training which may incur data transfer charges

Security Considerations:

  • DCV web interface is exposed to the internet on port 8443
  • Ensure strong passwords are set via Secrets Manager
  • Consider restricting access to specific IP ranges in security groups
  • Training data may contain sensitive information - ensure proper S3 bucket policies

For any feedback, questions, or suggestions, please use the issues tab under this repo.

Revisions

Document all notable changes to this project.

Consider formatting this section based on Keep a Changelog, and adhering to Semantic Versioning.

Notices

Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided "as is" without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.

About

This Guidance demonstrates how to build an AI-assisted robot training and fleet management system using Amazon Bedrock foundation models and AWS Trainium. It helps organizations overcome the complexity of training robots for precise tasks and managing fleets at scale

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •