Guidance for AI-Driven Robotic Simulation and Training on AWS

This Guidance showcases a robotic imitation-learning system that combines the intelligence of foundation models with the precision of ML and mathematical algorithms, accelerated by AWS Trainium or GPU instances and managed through cloud-native technologies. It also enables developers to train robotic agents with reinforcement learning using NVIDIA Isaac Sim on Amazon EKS, with LLM-generated reward functions from Amazon Bedrock, and then automatically deploy the trained models to physical robots through AWS IoT services.

Overview

This guidance demonstrates how to build an AI-assisted robotic learning system that combines foundation models from Amazon Bedrock with reinforcement learning capabilities accelerated by AWS Trainium. The system enables robots to learn complex manipulation tasks through imitation learning and reinforcement learning, with automatic deployment to physical robots via AWS IoT services.

Why did you build this Guidance? Traditional robotic training requires extensive manual programming and domain expertise. This guidance solves the challenge of creating adaptive robotic systems that can learn from demonstrations and improve through reinforcement learning, leveraging AWS's AI/ML services for scalable robot training.

What problem does this Guidance solve?

Reduces the complexity of training robotic manipulation tasks
Enables continuous learning and improvement of robot policies
Provides scalable infrastructure for robot training using cloud-native technologies
Integrates foundation models for intelligent reward function generation
Automates the deployment pipeline from simulation to physical robots

Architecture Flow:

Data Collection: UR5 robot performs T-bar pushing tasks in NVIDIA Isaac Sim
Policy Training: ACT (Action Chunking Transformer) policy learns from demonstration data
Reinforcement Learning: Policy is fine-tuned using reward functions generated by Amazon Bedrock
Infrastructure: AWS Trainium/GPU instances accelerate training, managed through Amazon EKS
Deployment: Trained models are deployed to physical robots via AWS IoT services

Cost

You are responsible for the cost of the AWS services used while running this Guidance. As of December 2024, the cost for running this Guidance with the default settings in the US East (N. Virginia) Region is approximately $801.20 per month for processing 100 training episodes with continuous robot learning.

We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance.

Sample Cost Table

The following table provides a sample cost breakdown for deploying this Guidance with the default parameters in the US East (N. Virginia) Region for one month.

| AWS service | Dimensions | Monthly cost [USD] |
|---|---|---|
| Amazon EC2 (g4dn.xlarge) | 1 instance running 24/7 for Isaac Sim | $ 367.20 |
| Amazon EKS Cluster | 1 cluster for container orchestration | $ 73.00 |
| Amazon EC2 (trn1.2xlarge) | 2 instances for Trainium training (8 hours/day) | $ 256.00 |
| Amazon Bedrock (Claude 3) | 10,000 requests for reward function generation | $ 45.00 |
| Amazon S3 | 500 GB storage for training data and models | $ 11.50 |
| AWS Secrets Manager | 1 secret for password management | $ 0.40 |
| Amazon VPC | NAT Gateway and data transfer | $ 45.60 |
| AWS IoT Core | 50,000 messages for robot deployment | $ 2.50 |
| Total | | $ 801.20 |
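
As a quick sanity check on the table, the line items can be summed in the shell. This is a minimal sketch: the figures are copied from the sample table above, and the variable names are illustrative only.

```shell
# Line items copied from the sample cost table (USD per month).
costs="367.20 73.00 256.00 45.00 11.50 0.40 45.60 2.50"

# Sum the items with awk, since POSIX shell arithmetic is integer-only.
total=$(printf '%s\n' $costs | awk '{ sum += $1 } END { printf "%.2f", sum }')
echo "Estimated monthly total: \$${total}"
```

Editing the list (for example, dropping the Trainium line when training on GPU only) gives a rough estimate for a modified deployment, subject to current AWS pricing.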

Prerequisites

Operating System

These deployment instructions are optimized to work best on Ubuntu 24.04 LTS with NVIDIA GPU support. Deployment on other operating systems may require additional steps.

Required packages:

Node.js 18+ and npm (for CDK deployment)
AWS CLI v2
Docker and Docker Compose
Python 3.10+
NVIDIA Docker runtime (for GPU support)
ROS 2 Jazzy

Installation commands:

# Install Node.js and npm
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs

# Install AWS CLI v2
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Install Docker
sudo apt-get update
sudo apt-get install -y docker.io docker-compose
sudo usermod -aG docker $USER

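# Install the NVIDIA Container Toolkit (successor to the deprecated nvidia-docker2 package)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker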
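
After installation, it may help to confirm that the expected commands are on the PATH before continuing. The following is a minimal sketch; the function name `missing_tools` is hypothetical, and the command names are assumed to correspond to the required packages listed above.

```shell
# Print each command name from the arguments that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Commands assumed to match the required packages listed above.
missing=$(missing_tools node npm aws docker python3)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi
```

The list can be extended with other commands, such as `ros2` or `nvidia-smi`, depending on which components are installed on the machine.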

Third-party tools

Required third-party tools:

NVIDIA Isaac Sim 4.5.0: Physics simulation environment for robot training
LeRobot: Robotics learning framework for policy training
PyTorch 2.3.1: Deep learning framework with CUDA 12.1 support
OpenCV: Computer vision library for image processing
ROS 2 Jazzy: Robot Operating System for robot control
MoveIt: Motion planning framework for robotic arms

AWS account requirements

Required AWS account setup:

Amazon Bedrock access: Enable Claude 3 model access in your AWS account
EC2 instance limits: Ensure sufficient quota for g4dn.xlarge and trn1.2xlarge instances
EKS service: Enable Amazon EKS service in your target region
IAM permissions: Administrator access or specific permissions for:
- EC2 instance management
- EKS cluster creation
- Bedrock model invocation
- S3 bucket operations
- Secrets Manager access
- IoT Core messaging
VPC: Default VPC or custom VPC with internet gateway
Key Pair: EC2 key pair for SSH access to instances

AWS CDK bootstrap

This Guidance uses AWS CDK for infrastructure deployment. If you are using AWS CDK for the first time, please perform the following bootstrapping:

# Install AWS CDK globally
npm install -g aws-cdk

# Bootstrap your AWS account for CDK
cdk bootstrap aws://ACCOUNT-NUMBER/REGION

# Example:
cdk bootstrap aws://123456789012/us-east-1

Note: Replace ACCOUNT-NUMBER with your AWS account ID and REGION with your target AWS region.
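
The account ID and region do not have to be typed by hand. The sketch below assumes AWS credentials and a default region are already configured; `bootstrap_target` is a hypothetical helper that only formats the environment string.

```shell
# Build the environment string that `cdk bootstrap` expects.
#   $1 = AWS account ID, $2 = AWS region
bootstrap_target() {
  printf 'aws://%s/%s\n' "$1" "$2"
}

# Usage (requires configured AWS credentials):
#   ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
#   REGION=$(aws configure get region)
#   cdk bootstrap "$(bootstrap_target "$ACCOUNT_ID" "$REGION")"
```

Looking the values up from the active credentials reduces the chance of bootstrapping a different account or region than the one used for deployment.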

Service limits

Critical service limits that may require increases:

EC2 Instance Limits:
- g4dn.xlarge instances: Default limit may be 0-5 per region
- trn1.2xlarge instances: Default limit may be 0-2 per region
- Request limit increase
EKS Cluster Limit: Default 100 clusters per region (usually sufficient)
Bedrock Model Access:
- Claude 3 models require explicit access request
- Request model access
S3 Storage: Default limits are typically sufficient for this guidance
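
EC2 On-Demand quotas are expressed in vCPUs rather than instance counts, so it can help to translate the planned instances into vCPUs before requesting an increase. This sketch assumes 4 vCPUs per g4dn.xlarge and 8 vCPUs per trn1.2xlarge (verify against the current instance type documentation) and uses the instance counts from the sample cost table; `required_vcpus` is a hypothetical helper.

```shell
# vCPUs needed = number of instances x vCPUs per instance.
required_vcpus() {
  echo $(( $1 * $2 ))
}

echo "g4dn.xlarge:  $(required_vcpus 1 4) vCPUs"
echo "trn1.2xlarge: $(required_vcpus 2 8) vCPUs"

# To compare against current quotas (requires AWS credentials):
#   aws service-quotas list-service-quotas --service-code ec2 \
#     --query "Quotas[?contains(QuotaName, 'On-Demand')].[QuotaName,Value]" \
#     --output table
```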

Supported Regions

Recommended regions (all required services available):

us-east-1 (N. Virginia) - Recommended for best service availability
us-west-2 (Oregon)
eu-west-1 (Ireland)

Service availability considerations:

AWS Trainium (trn1 instances): Limited to specific regions
Amazon Bedrock: Claude 3 models available in select regions
NVIDIA Isaac Sim: Requires GPU-enabled instances (g4dn, p3, p4 families)

Note: Verify Trainium instance availability in your target region before deployment.
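
One way to verify availability is to ask EC2 which instance types a region offers and compare the answer with the types this Guidance uses. The helper below is a sketch: `missing_types` is a hypothetical name, and the region in the usage comment is only an example.

```shell
# Print the instance types from "$1" (space-separated wanted list) that do
# not appear in "$2" (whitespace-separated list of offered types).
missing_types() {
  offered=" $(echo $2) "   # collapse tabs/newlines to single spaces
  for wanted in $1; do
    case "$offered" in
      *" $wanted "*) ;;          # offered in the region
      *) echo "$wanted" ;;       # not offered
    esac
  done
}

# Usage (requires AWS credentials):
#   offered=$(aws ec2 describe-instance-type-offerings --location-type region \
#     --region us-west-2 \
#     --filters Name=instance-type,Values=g4dn.xlarge,trn1.2xlarge \
#     --query 'InstanceTypeOfferings[].InstanceType' --output text)
#   missing_types "g4dn.xlarge trn1.2xlarge" "$offered"
```

Empty output means every requested type is offered in that region.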

Deployment Steps

Clone the repository:

# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://github.com/aws-solutions-library-samples/guidance-for-ai-driven-robotic-simulation-and-training-on-aws.git

Navigate to the repository folder:

cd guidance-for-ai-driven-robotic-simulation-and-training-on-aws

Navigate to the CDK deployment directory:
```
cd deployment/cdk-nodejs
```
Install CDK dependencies:
```
npm install
```
Configure AWS credentials:
```
aws configure
```
Bootstrap CDK (if first time using CDK in this account/region):
```
cdk bootstrap
```
Deploy the infrastructure stack:

Follow steps mentioned in this doc

Capture the deployed resources:

# Get EC2 instance ID
aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`InstanceId`].OutputValue' --output text

# Get instance public IP
aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`InstancePublicIp`].OutputValue' --output text

# Get S3 bucket name
aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`S3BucketName`].OutputValue' --output text
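
The three lookups above differ only in the output key, so they can be wrapped in a small helper. This is a sketch assuming the stack name and output keys shown above; `stack_output` is a hypothetical function name.

```shell
# Print the value of one CloudFormation stack output.
#   $1 = stack name, $2 = output key
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
    --output text
}

# Usage (requires AWS credentials and a deployed stack):
#   INSTANCE_ID=$(stack_output RoboticsStack InstanceId)
#   PUBLIC_IP=$(stack_output RoboticsStack InstancePublicIp)
#   BUCKET_NAME=$(stack_output RoboticsStack S3BucketName)
```

Storing the values in variables makes them easy to reuse in the SSH and validation steps.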

Connect to the EC2 instance via SSH:

ssh -i your-key.pem ubuntu@<INSTANCE_PUBLIC_IP>

Wait for Isaac Sim installation to complete (check status):

tail -f /var/log/user-data.log
# Wait until you see "Phase 2 completed"
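
Instead of watching the log by hand, a polling loop can wait for the completion message. The sketch below assumes the log path and the "Phase 2 completed" marker shown above; `wait_for_marker` and its default interval and retry count are illustrative choices.

```shell
# Poll a log file until it contains the marker text, or give up.
#   $1 = log file, $2 = marker text,
#   $3 = seconds between checks (default 30), $4 = max checks (default 120)
wait_for_marker() {
  log_file="$1"; marker="$2"; interval="${3:-30}"; max_tries="${4:-120}"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    if grep -q "$marker" "$log_file" 2>/dev/null; then
      echo "Found: $marker"
      return 0
    fi
    tries=$((tries + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for: $marker" >&2
  return 1
}

# Usage on the EC2 instance:
#   wait_for_marker /var/log/user-data.log "Phase 2 completed"
```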

Deployment Validation

Validate successful deployment:

CloudFormation Stack Status:

aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].StackStatus'

Expected output: "CREATE_COMPLETE"

EC2 Instance Status:

aws ec2 describe-instances --filters "Name=tag:Name,Values=*robotics*" --query 'Reservations[*].Instances[*].[InstanceId,State.Name,PublicIpAddress]' --output table

Expected: Instance should be in running state

Isaac Sim Installation:

# SSH to instance and check
ls -la /home/ubuntu/isaacsim/installation_complete

Expected: File should exist

DCV Server Status:

# On the EC2 instance
sudo systemctl status dcvserver
dcv list-sessions

Expected: DCV server running with active session

Access DCV Web Interface:
- Open browser to https://<INSTANCE_PUBLIC_IP>:8443
- Login with username ubuntu and password from Secrets Manager
S3 Bucket Creation:
```
aws s3 ls | grep robotics
```
Expected: Bucket with robotics prefix should be listed
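
The individual checks above can be collected into one pass/fail report. This is a minimal sketch; `check` and `failures` are hypothetical names, and the commands in the usage comment reuse the path and service name given above.

```shell
failures=0

# Run a command quietly and report PASS or FAIL under the given label.
check() {
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $label"
  else
    echo "FAIL: $label"
    failures=$((failures + 1))
  fi
}

# Usage on the EC2 instance:
#   check "Isaac Sim installed" test -e /home/ubuntu/isaacsim/installation_complete
#   check "DCV server active"   systemctl is-active --quiet dcvserver
#   echo "$failures check(s) failed"
```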

Running the Guidance

Step 1: Access the Isaac Sim Environment

Connect to the EC2 instance via DCV at https://<INSTANCE_PUBLIC_IP>:8443

Login with username ubuntu and retrieve password from AWS Secrets Manager:

aws secretsmanager get-secret-value --secret-id <SECRET_NAME> --query SecretString --output text

Step 2: Start Simulation and Data collection

Follow the commands

Iterative Training Approach:

Initial Training: Run RL fine-tune for 30 minutes
Evaluation: Test with data collection to measure performance improvement
Iteration: If performance is insufficient, continue RL training for another 30 minutes
Repeat: Continue until desired accuracy threshold is achieved
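
The iterative approach above can be sketched as a loop. The two hook functions are hypothetical placeholders, not commands from this repository: replace their bodies with the actual fine-tuning and evaluation commands. The success rate is assumed to be an integer percentage.

```shell
# Hypothetical hook: would run RL fine-tuning for the given number of minutes.
run_finetune() {
  echo "fine-tuning for $1 minutes (placeholder)"
}

# Hypothetical hook: would print the measured success rate as an integer percent.
evaluate_success_rate() {
  echo "${SUCCESS_RATE:-0}"
}

# Repeat 30-minute fine-tuning rounds until the success rate reaches the
# threshold or the round budget is used up.
#   $1 = target success rate (percent), $2 = maximum number of rounds
train_until() {
  threshold="$1"; max_rounds="$2"; round=0
  while [ "$round" -lt "$max_rounds" ]; do
    round=$((round + 1))
    run_finetune 30
    rate=$(evaluate_success_rate)
    echo "round $round: success rate ${rate}%"
    if [ "$rate" -ge "$threshold" ]; then
      echo "threshold reached after $round round(s)"
      return 0
    fi
  done
  echo "threshold not reached after $max_rounds round(s)" >&2
  return 1
}

# Usage: train_until 80 6
```

Capping the number of rounds keeps a policy that never reaches the threshold from consuming instance hours indefinitely.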

Monitoring Progress:

Training metrics are logged to console
Model checkpoints saved automatically
Success rate and accuracy displayed in real-time
Nova Pro provides intelligent observations of robot behavior

Next Steps

Customization and Enhancement Options:

Modify Training Parameters:
- Adjust learning rate, batch size, and training epochs in RL_Finetune.py
- Customize reward functions for different manipulation tasks
- Modify success thresholds and accuracy floors
Extend to Different Robot Tasks:
- Replace T-bar pushing with other manipulation tasks (pick-and-place, assembly)
- Modify the Isaac Sim scene files in source/ur5_nova/configuration/
- Update observation and action spaces for new tasks
Scale Training Infrastructure:
- Deploy multiple Trainium instances for distributed training
- Use Amazon SageMaker for managed training workflows
- Implement model versioning with Amazon SageMaker Model Registry
Integrate Additional Foundation Models:
- Use different Bedrock models for reward function generation
- Implement multi-modal learning with vision-language models
- Add natural language instruction following capabilities
Deploy to Physical Robots:
- Configure AWS IoT Core for robot fleet management
- Implement over-the-air model updates
- Add real-world sensor integration and calibration
Production Optimization:
- Implement model quantization for edge deployment
- Add monitoring and alerting with Amazon CloudWatch
- Set up automated retraining pipelines with Amazon EventBridge

Cleanup

To completely remove all resources created by this Guidance:

Stop running processes on EC2 instance:

# SSH to the instance and stop any running training
pkill -f python3
sudo systemctl stop dcvserver

Empty S3 bucket contents:

# Get bucket name from CloudFormation output
BUCKET_NAME=$(aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`S3BucketName`].OutputValue' --output text)

# Empty the bucket
aws s3 rm s3://$BUCKET_NAME --recursive
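
If the stack lookup returns nothing (for example, because the stack name differs), `$BUCKET_NAME` is empty and the delete command runs against a malformed path. A small guard avoids that; `empty_bucket` is a hypothetical helper name.

```shell
# Empty an S3 bucket, refusing to run when the bucket name is missing.
#   $1 = bucket name
empty_bucket() {
  bucket="$1"
  if [ -z "$bucket" ] || [ "$bucket" = "None" ]; then
    echo "Bucket name is empty; check the stack name and output key." >&2
    return 1
  fi
  aws s3 rm "s3://$bucket" --recursive
}

# Usage (requires AWS credentials):
#   empty_bucket "$BUCKET_NAME"
```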

Delete CloudFormation stacks:

# Delete all CDK stacks
cdk destroy --all

# Confirm deletion when prompted

Verify resource deletion:

# Check that stacks are deleted
aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE

# Verify EC2 instances are terminated
aws ec2 describe-instances --filters "Name=tag:Name,Values=*robotics*" --query 'Reservations[*].Instances[*].[InstanceId,State.Name]' --output table

Manual cleanup:
- EKS clusters: If EKS stack deletion fails, manually delete from console
- VPC resources: Delete any remaining ENIs or security groups
- IAM roles: Remove any custom IAM roles if they weren't deleted
- Secrets Manager: Delete the password secret if it remains

Note: The cleanup process may take 10-15 minutes to complete. Monitor the CloudFormation console to ensure all stacks are successfully deleted.

FAQ, known issues, additional considerations, and limitations

Known issues

Isaac Sim Installation Timeout:
- Issue: Isaac Sim download may timeout on slower connections
- Resolution: SSH to instance and manually restart download:
```
cd /home/ubuntu/isaacsim
wget --continue https://download.isaacsim.omniverse.nvidia.com/isaac-sim-standalone-5.1.0-linux-x86_64.zip
```
DCV Connection Issues:
- Issue: Cannot connect to DCV web interface
- Resolution: Check security group allows port 8443 and restart DCV:
```
sudo systemctl restart dcvserver
```
CUDA Out of Memory:
- Issue: Training fails with CUDA OOM errors
- Resolution: Reduce batch size in training scripts or use smaller model
Trainium Instance Unavailability:
- Issue: trn1 instances not available in region
- Resolution: Use alternative regions or switch to GPU instances (p3, g4dn)
Bedrock Model Access:
- Issue: Access denied to Claude 3 models
- Resolution: Request model access in Bedrock console before deployment

Additional considerations

This Guidance creates EC2 instances that are billed per hour while running, including during idle time
Trainium instances (trn1.2xlarge) are premium instances with higher hourly costs
The guidance creates a VPC with NAT Gateway that incurs hourly charges
Isaac Sim requires significant disk space (>50GB) and may increase EBS costs
DCV server creates a remote desktop session accessible over the internet - ensure strong passwords
Training data is stored in S3 and may accumulate over time; monitor storage costs
Bedrock on-demand usage is charged by input and output tokens - monitor usage during reward function generation
The system generates significant network traffic during training which may incur data transfer charges

Security Considerations:

DCV web interface is exposed to the internet on port 8443
Ensure strong passwords are set via Secrets Manager
Consider restricting access to specific IP ranges in security groups
Training data may contain sensitive information - ensure proper S3 bucket policies
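
Restricting DCV access to a single address might look like the sketch below. `single_host_cidr` is a hypothetical helper, `<SECURITY_GROUP_ID>` is a placeholder for the instance's security group, and the usage assumes the caller's public IPv4 address is the one that should be allowed.

```shell
# Turn a single IPv4 address into a /32 CIDR block; reject empty input or
# input containing characters other than digits and dots.
single_host_cidr() {
  case "$1" in
    *[!0-9.]*|"") echo "not an IPv4 address: $1" >&2; return 1 ;;
  esac
  echo "$1/32"
}

# Usage (requires AWS credentials):
#   MY_IP=$(curl -s https://checkip.amazonaws.com)
#   aws ec2 authorize-security-group-ingress --group-id <SECURITY_GROUP_ID> \
#     --protocol tcp --port 8443 --cidr "$(single_host_cidr "$MY_IP")"
```

Adding a narrower rule does not remove a broader one, so any existing rule that opens port 8443 to all addresses would still need to be revoked separately.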

For any feedback, questions, or suggestions, please use the issues tab under this repo.

Revisions

Document all notable changes to this project.

Consider formatting this section based on Keep a Changelog, and adhering to Semantic Versioning.

Notices

Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided "as is" without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.
