[FEATURE] Cloud Deployment of Darwin #119
HarshAgarwal11 started this conversation in Ideas
Darwin ML Platform - AWS Cloud Infrastructure Refinement
Version: 1.0
Date: January 2026
Scope: Phase 1 - Compute, Workspace, MLFlow, Serve
Table of Contents
1. Architecture Overview
1.1 Architecture Diagram
1.2 Phase 1 Service Dependencies
1.3 Architecture Diagram Notes
Docker Hub (darwinhq/*) can be used instead of ECR, but ECR is better for production
EFS mount path: /var/www/fsx/workspace
2. Prerequisites
2.1 AWS Account Requirements
2.2 Required IAM Permissions for OpenTofu
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "eks:*",
        "rds:*",
        "s3:*",
        "ecr:*",
        "elasticfilesystem:*",
        "iam:*",
        "secretsmanager:*",
        "kms:*",
        "logs:*",
        "route53:*",
        "elasticloadbalancing:*",
        "autoscaling:*",
        "cloudwatch:*"
      ],
      "Resource": "*"
    }
  ]
}
2.3 Subnet CIDR Planning
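To make the subnet sizing in this section concrete, here is a minimal Python sketch that carves /22 subnets out of a VPC range and counts the addresses left for pods and nodes. The VPC CIDR 10.0.0.0/16 and the helper names are hypothetical, chosen only for illustration. The figure of five reserved addresses per subnet is standard AWS behavior (the first four addresses and the last one).

```python
import ipaddress
import itertools

# AWS reserves five addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Return the number of addresses AWS leaves assignable in a subnet."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


def carve_subnets(vpc_cidr: str, new_prefix: int, count: int) -> list[str]:
    """Return the first `count` subnets of size /new_prefix inside the VPC."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return [str(s) for s in itertools.islice(vpc.subnets(new_prefix=new_prefix), count)]


if __name__ == "__main__":
    # Hypothetical VPC CIDR; three private subnets, one per availability zone.
    for cidr in carve_subnets("10.0.0.0/16", 22, 3):
        print(cidr, usable_ips(cidr))
```

A /22 leaves roughly four times as many assignable addresses as a /24, which matters when each pod takes its own VPC IP address.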
Recommended: /22 for private subnets (EKS pods need many IPs)
3. Infrastructure Tasks Breakdown
Task Matrix
Legend:
Total Estimated Effort: ~73 hours (excluding testing and iteration)
4. OpenTofu Module Structure
4.1 Recommended Directory Structure
4.2 Provider Configuration
5. Detailed Implementation Steps
Task 1: VPC Setup
What OpenTofu Does:
Variables needed:
Task 2-3: Public and Private Subnets
What OpenTofu Does:
Task 4: NAT Gateway
What OpenTofu Does:
Cost Consideration:
Recommendation: Single for dev, Multi for prod
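The recommendation above could be captured as a small rule. This sketch assumes that "Multi" means one NAT gateway per availability zone. The function name and the environment labels "dev" and "prod" are hypothetical and not taken from the Darwin codebase.

```python
def nat_gateway_count(environment: str, az_count: int) -> int:
    """Return how many NAT gateways to create for an environment.

    Assumption: "prod" gets one gateway per availability zone; every other
    environment shares a single gateway to keep costs down.
    """
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    return az_count if environment == "prod" else 1


if __name__ == "__main__":
    print(nat_gateway_count("dev", 3))
    print(nat_gateway_count("prod", 3))
```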
Task 5: Route Tables and Networking
What OpenTofu Does:
Task 6: VPC Endpoints
What OpenTofu Does:
Cost:
Task 7: Security Groups
What OpenTofu Does:
Task 8: RDS MySQL
What OpenTofu Does:
Database Schemas Required:
Darwin services expect these databases (create after RDS is up):
darwin_compute
darwin_mlflow
darwin_workspace
darwin_serve
Task 9: EFS
What OpenTofu Does:
Task 10: S3 Buckets
What OpenTofu Does:
Task 11: ECR Repositories
What OpenTofu Does:
Alternative: Use Docker Hub (darwinhq/*)
Task 12: EKS Cluster (Auto Mode)
What OpenTofu Does:
Task 13: EKS Node Pools
What OpenTofu Does:
Task 14: IAM Roles for Service Accounts (IRSA)
What OpenTofu Does:
Task 15: Upload Kubeconfig to S3
What OpenTofu Does:
Note: DCM downloads kubeconfig from S3 path: mlp/cluster_manager/configs/<cluster_name>
Task 16: ArgoCD Installation
What OpenTofu Does:
Task 17: Kubernetes Operators
What OpenTofu Does:
Task 18: Darwin Helm Deployment
What OpenTofu Does:
6. Code Changes Required
6.1 Configuration Changes (Non-Breaking)
CONFIG_SERVICE_MYSQL_HOST to RDS endpoint
services.yaml: CONFIG_SERVICE_S3_PATH to real S3 bucket
constants/constants.go: KubeConfigS3Prefix matches S3 key
constants/constants.py: BASE_EFS_PATH matches mount path
6.2 Code Changes for Node Pool Selection
File: darwin-compute/core/src/compute_core/util/utils.py
Current code uses node selectors like:
group["nodeSelector"]["darwin.dream11.com/resource"] = "ray-cluster"
group["nodeSelector"]["karpenter.k8s.aws/instance-family"] = "p4d"
Changes needed:
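As a purely illustrative sketch (not the project's actual change), node selectors like the ones shown above could be passed in as data rather than assigned inline, so that a different node pool only needs a different mapping. The function name `apply_node_selectors` and the constant `DEFAULT_SELECTORS` are hypothetical. The label keys and values are copied from the snippet above.

```python
from typing import Mapping

# Selector values copied from the existing code; kept as data so they can be
# swapped per node pool or environment without touching the function.
DEFAULT_SELECTORS = {
    "darwin.dream11.com/resource": "ray-cluster",
    "karpenter.k8s.aws/instance-family": "p4d",
}


def apply_node_selectors(group: dict, selectors: Mapping[str, str]) -> dict:
    """Merge `selectors` into group["nodeSelector"], creating it if absent.

    Existing selector keys that are not in `selectors` are left untouched.
    """
    node_selector = group.setdefault("nodeSelector", {})
    node_selector.update(selectors)
    return group


if __name__ == "__main__":
    print(apply_node_selectors({"name": "worker"}, DEFAULT_SELECTORS))
```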
6.3 Serve Node Pool Configuration
For Jupyter kernels, Spark History Server, and model inference pods:
6.4 Files to Modify
darwin-compute/core/src/compute_core/util/utils.py: add_eks_tolerations(), add new node selectors
darwin-compute/core/src/compute_core/util/yaml_generator_v2/head_node_handler.py
darwin-compute/core/src/compute_core/util/yaml_generator_v2/worker_node_handler.py
darwin-cluster-manager/services/jupyterClient/utils.go
darwin-cluster-manager/services/spark_history_server/utils.go
7. Deployment Sequence
Phase 1: Foundation (OpenTofu)
Phase 2: EKS Cluster (OpenTofu)
Phase 3: Operators (OpenTofu/Helm)
# 9. Apply all helm releases
tofu apply -target=module.helm_releases
Phase 4: Darwin Deployment (OpenTofu/Helm)
Phase 5: Validation
8. Validation Checklist
Infrastructure Validation
aws ec2 describe-vpcs --filters "Name=tag:Name,Values=darwin-*"
aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID"
aws ec2 describe-nat-gateways
aws rds describe-db-instances
aws efs describe-file-systems
aws eks describe-cluster --name darwin-dev
Kubernetes Validation
kubectl get nodes
kubectl get pods -n kube-system
kubectl get pods -n ray-system
kubectl get pods -n monitoring
kubectl get pods -n darwin
Connectivity Validation
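Where `nc` is not installed in a pod image, the MySQL port check in this section can be approximated with Python's standard library. This is a sketch: the helper name `can_connect` is hypothetical, and it assumes the RDS hostname is exposed through an `RDS_HOST` environment variable, as in the commands in this section.

```python
import os
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


if __name__ == "__main__":
    rds_host = os.environ.get("RDS_HOST", "")
    if rds_host:
        print("MySQL reachable:", can_connect(rds_host, 3306))
    else:
        print("RDS_HOST is not set")
```

A successful TCP connection only shows that the security groups and routing allow traffic; it does not verify database credentials.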
kubectl exec -it <pod> -- nc -zv $RDS_HOST 3306
kubectl exec -it <pod> -- aws s3 ls $BUCKET
kubectl exec -it <pod> -- ls /var/www/fsx/workspace
9. Estimated Costs
Monthly Cost Breakdown (Dev Environment)
Production Multipliers
Appendix A: Environment Variables Reference
Darwin Services Environment Variables
Appendix B: Troubleshooting Guide
Common Issues
Debug Commands
Document generated: January 2026
Author: Darwin Platform Team
Version: 1.0