This guidance showcases a robotic learning system (Imitation Learning) that combines the intelligence of foundation models with the precision of ML and mathematical algorithms, all accelerated by AWS Trainium/GPU and managed through modern cloud-native technologies. This guidance also enables developers to train (reinforcement learning) robotic agents using NVIDIA Isaac Sim on Amazon EKS with LLM-generated reward functions via Bedrock, then automatically deploy trained models to physical robots through AWS IoT services.
- Overview
- Prerequisites
- Deployment Steps
- Deployment Validation
- Running the Guidance
- Next Steps
- Cleanup
- Notices
Optional
This guidance demonstrates how to build an AI-assisted robotic learning system that combines foundation models from Amazon Bedrock with reinforcement learning capabilities accelerated by AWS Trainium. The system enables robots to learn complex manipulation tasks through imitation learning and reinforcement learning, with automatic deployment to physical robots via AWS IoT services.
Why did you build this Guidance? Traditional robotic training requires extensive manual programming and domain expertise. This guidance solves the challenge of creating adaptive robotic systems that can learn from demonstrations and improve through reinforcement learning, leveraging AWS's AI/ML services for scalable robot training.
What problem does this Guidance solve?
- Reduces the complexity of training robotic manipulation tasks
- Enables continuous learning and improvement of robot policies
- Provides scalable infrastructure for robot training using cloud-native technologies
- Integrates foundation models for intelligent reward function generation
- Automates the deployment pipeline from simulation to physical robots
Architecture Flow:
- Data Collection: UR5 robot performs T-bar pushing tasks in NVIDIA Isaac Sim
- Policy Training: ACT (Action Chunking Transformer) policy learns from demonstration data
- Reinforcement Learning: Policy is fine-tuned using reward functions generated by Amazon Bedrock
- Infrastructure: AWS Trainium/GPU instances accelerate training, managed through Amazon EKS
- Deployment: Trained models are deployed to physical robots via AWS IoT services
You are responsible for the cost of the AWS services used while running this Guidance. As of December 2024, the cost for running this Guidance with the default settings in the US East (N. Virginia) Region is approximately $801.20 per month for processing 100 training episodes with continuous robot learning.
We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance.
The following table provides a sample cost breakdown for deploying this Guidance with the default parameters in the US East (N. Virginia) Region for one month.
| AWS service | Dimensions | Cost [USD] |
|---|---|---|
| Amazon EC2 (g4dn.xlarge) | 1 instance running 24/7 for Isaac Sim | $ 367.20 |
| Amazon EKS Cluster | 1 cluster for container orchestration | $ 73.00 |
| Amazon EC2 (trn1.2xlarge) | 2 instances for Trainium training (8 hours/day) | $ 256.00 |
| Amazon Bedrock (Claude 3) | 10,000 requests for reward function generation | $ 45.00 |
| Amazon S3 | 500 GB storage for training data and models | $ 11.50 |
| AWS Secrets Manager | 1 secret for password management | $ 0.40 |
| Amazon VPC | NAT Gateway and data transfer | $ 45.60 |
| AWS IoT Core | 50,000 messages for robot deployment | $ 2.50 |
| Total | $ 801.20 |
These deployment instructions are optimized to best work on Ubuntu 24.04 LTS with NVIDIA GPU support. Deployment on other OS may require additional steps.
Required packages:
- Node.js 18+ and npm (for CDK deployment)
- AWS CLI v2
- Docker and Docker Compose
- Python 3.10+
- NVIDIA Docker runtime (for GPU support)
- ROS 2 Jazzy
Installation commands:
# Install Node.js and npm
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs
# Install AWS CLI v2
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Install Docker
sudo apt-get update
sudo apt-get install -y docker.io docker-compose
sudo usermod -aG docker $USER
# Install NVIDIA Docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart dockerRequired third-party tools:
- NVIDIA Isaac Sim 4.5.0: Physics simulation environment for robot training
- LeRobot: Robotics learning framework for policy training
- PyTorch 2.3.1: Deep learning framework with CUDA 12.1 support
- OpenCV: Computer vision library for image processing
- ROS 2 Jazzy: Robot Operating System for robot control
- MoveIt: Motion planning framework for robotic arms
Required AWS account setup:
- Amazon Bedrock access: Enable Claude 3 model access in your AWS account
- EC2 instance limits: Ensure sufficient quota for g4dn.xlarge and trn1.2xlarge instances
- EKS service: Enable Amazon EKS service in your target region
- IAM permissions: Administrator access or specific permissions for:
- EC2 instance management
- EKS cluster creation
- Bedrock model invocation
- S3 bucket operations
- Secrets Manager access
- IoT Core messaging
- VPC: Default VPC or custom VPC with internet gateway
- Key Pair: EC2 key pair for SSH access to instances
This Guidance uses AWS CDK for infrastructure deployment. If you are using AWS CDK for the first time, please perform the following bootstrapping:
# Install AWS CDK globally
npm install -g aws-cdk
# Bootstrap your AWS account for CDK
cdk bootstrap aws://ACCOUNT-NUMBER/REGION
# Example:
cdk bootstrap aws://123456789012/us-east-1Note: Replace ACCOUNT-NUMBER with your AWS account ID and REGION with your target AWS region.
Critical service limits that may require increases:
-
EC2 Instance Limits:
- g4dn.xlarge instances: Default limit may be 0-5 per region
- trn1.2xlarge instances: Default limit may be 0-2 per region
- Request limit increase
-
EKS Cluster Limit: Default 100 clusters per region (usually sufficient)
-
Bedrock Model Access:
- Claude 3 models require explicit access request
- Request model access
-
S3 Storage: Default limits are typically sufficient for this guidance
Recommended regions (all required services available):
- us-east-1 (N. Virginia) - Recommended for best service availability
- us-west-2 (Oregon)
- eu-west-1 (Ireland)
Service availability considerations:
- AWS Trainium (trn1 instances): Limited to specific regions
- Amazon Bedrock: Claude 3 models available in select regions
- NVIDIA Isaac Sim: Requires GPU-enabled instances (g4dn, p3, p4 families)
Note: Verify Trainium instance availability in your target region before deployment.
-
Clone the repository using command:
# Make sure git-lfs is installed (https://git-lfs.com) git lfs install git clone https://github.com/aws-solutions-library-samples/guidance-for-ai-driven-robotic-simulation-and-training-on-aws.git -
Navigate to the repository folder:
cd guidance-for-ai-driven-robotic-simulation-and-training-on-aws -
Navigate to the CDK deployment directory:
cd deployment/cdk-nodejs -
Install CDK dependencies:
npm install
-
Configure AWS credentials:
aws configure
-
Bootstrap CDK (if first time using CDK in this account/region):
cdk bootstrap
-
Deploy the infrastructure stack:
-
Capture the deployed resources:
# Get EC2 instance ID aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`InstanceId`].OutputValue' --output text # Get instance public IP aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`InstancePublicIp`].OutputValue' --output text # Get S3 bucket name aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`S3BucketName`].OutputValue' --output text
-
Connect to the EC2 instance via SSH:
ssh -i your-key.pem ubuntu@<INSTANCE_PUBLIC_IP>
-
Wait for Isaac Sim installation to complete (check status):
tail -f /var/log/user-data.log # Wait until you see "Phase 2 completed"
Validate successful deployment:
-
CloudFormation Stack Status:
aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].StackStatus'Expected output:
"CREATE_COMPLETE" -
EC2 Instance Status:
aws ec2 describe-instances --filters "Name=tag:Name,Values=*robotics*" --query 'Reservations[*].Instances[*].[InstanceId,State.Name,PublicIpAddress]' --output table
Expected: Instance should be in
runningstate -
Isaac Sim Installation:
# SSH to instance and check ls -la /home/ubuntu/isaacsim/installation_completeExpected: File should exist
-
DCV Server Status:
# On the EC2 instance sudo systemctl status dcvserver dcv list-sessionsExpected: DCV server running with active session
-
Access DCV Web Interface:
- Open browser to
https://<INSTANCE_PUBLIC_IP>:8443 - Login with username
ubuntuand password from Secrets Manager
- Open browser to
-
S3 Bucket Creation:
aws s3 ls | grep roboticsExpected: Bucket with robotics prefix should be listed
Step 1: Access the Isaac Sim Environment
- Connect to the EC2 instance via DCV at
https://<INSTANCE_PUBLIC_IP>:8443 - Login with username
ubuntuand retrieve password from AWS Secrets Manager:aws secretsmanager get-secret-value --secret-id <SECRET_NAME> --query SecretString --output text
Step 2: Start Simulation and Data collection
Iterative Training Approach:
- Initial Training: Run RL fine-tune for 30 minutes
- Evaluation: Test with data collection to measure performance improvement
- Iteration: If performance is insufficient, continue RL training for another 30 minutes
- Repeat: Continue until desired accuracy threshold is achieved
Monitoring Progress:
- Training metrics are logged to console
- Model checkpoints saved automatically
- Success rate and accuracy displayed in real-time
- Nova Pro provides intelligent observations of robot behavior
Customization and Enhancement Options:
-
Modify Training Parameters:
- Adjust learning rate, batch size, and training epochs in
RL_Finetune.py - Customize reward functions for different manipulation tasks
- Modify success thresholds and accuracy floors
- Adjust learning rate, batch size, and training epochs in
-
Extend to Different Robot Tasks:
- Replace T-bar pushing with other manipulation tasks (pick-and-place, assembly)
- Modify the Isaac Sim scene files in
source/ur5_nova/configuration/ - Update observation and action spaces for new tasks
-
Scale Training Infrastructure:
- Deploy multiple Trainium instances for distributed training
- Use Amazon SageMaker for managed training workflows
- Implement model versioning with Amazon SageMaker Model Registry
-
Integrate Additional Foundation Models:
- Use different Bedrock models for reward function generation
- Implement multi-modal learning with vision-language models
- Add natural language instruction following capabilities
-
Deploy to Physical Robots:
- Configure AWS IoT Core for robot fleet management
- Implement over-the-air model updates
- Add real-world sensor integration and calibration
-
Production Optimization:
- Implement model quantization for edge deployment
- Add monitoring and alerting with Amazon CloudWatch
- Set up automated retraining pipelines with Amazon EventBridge
To completely remove all resources created by this Guidance:
-
Stop running processes on EC2 instance:
# SSH to the instance and stop any running training pkill -f python3 sudo systemctl stop dcvserver -
Empty S3 bucket contents:
# Get bucket name from CloudFormation output BUCKET_NAME=$(aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`S3BucketName`].OutputValue' --output text) # Empty the bucket aws s3 rm s3://$BUCKET_NAME --recursive
-
Delete CloudFormation stacks:
# Delete all CDK stacks cdk destroy --all # Confirm deletion when prompted
-
Verify resource deletion:
# Check that stacks are deleted aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE # Verify EC2 instances are terminated aws ec2 describe-instances --filters "Name=tag:Name,Values=*robotics*" --query 'Reservations[*].Instances[*].[InstanceId,State.Name]' --output table
-
Manual cleanup:
- EKS clusters: If EKS stack deletion fails, manually delete from console
- VPC resources: Delete any remaining ENIs or security groups
- IAM roles: Remove any custom IAM roles if they weren't deleted
- Secrets Manager: Delete the password secret if it remains
Note: The cleanup process may take 10-15 minutes to complete. Monitor the CloudFormation console to ensure all stacks are successfully deleted.
Known issues
-
Isaac Sim Installation Timeout:
- Issue: Isaac Sim download may timeout on slower connections
- Resolution: SSH to instance and manually restart download:
cd /home/ubuntu/isaacsim wget --continue https://download.isaacsim.omniverse.nvidia.com/isaac-sim-standalone-5.1.0-linux-x86_64.zip
-
DCV Connection Issues:
- Issue: Cannot connect to DCV web interface
- Resolution: Check security group allows port 8443 and restart DCV:
sudo systemctl restart dcvserver
-
CUDA Out of Memory:
- Issue: Training fails with CUDA OOM errors
- Resolution: Reduce batch size in training scripts or use smaller model
-
Trainium Instance Unavailability:
- Issue: trn1 instances not available in region
- Resolution: Use alternative regions or switch to GPU instances (p3, g4dn)
-
Bedrock Model Access:
- Issue: Access denied to Claude 3 models
- Resolution: Request model access in Bedrock console before deployment
Additional considerations
- This Guidance creates EC2 instances that are billed per hour while running, including during idle time
- Trainium instances (trn1.2xlarge) are premium instances with higher hourly costs
- The guidance creates a VPC with NAT Gateway that incurs hourly charges
- Isaac Sim requires significant disk space (>50GB) and may increase EBS costs
- DCV server creates a remote desktop session accessible over the internet - ensure strong passwords
- Training data is stored in S3 and may accumulate over time, monitor storage costs
- Bedrock API calls are charged per request - monitor usage during reward function generation
- The system generates significant network traffic during training which may incur data transfer charges
Security Considerations:
- DCV web interface is exposed to the internet on port 8443
- Ensure strong passwords are set via Secrets Manager
- Consider restricting access to specific IP ranges in security groups
- Training data may contain sensitive information - ensure proper S3 bucket policies
For any feedback, questions, or suggestions, please use the issues tab under this repo.
Document all notable changes to this project.
Consider formatting this section based on Keep a Changelog, and adhering to Semantic Versioning.
Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided "as is" without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.