The Distributed AI Training Platform is a scalable, cloud-native system for training AI models across multiple worker nodes, orchestrated by a master node. It uses Kubernetes on Amazon EKS for orchestration, Amazon DocumentDB for data storage, Amazon S3 for model storage, Apache Kafka for message passing, and Spring Boot as the application framework.
- Scalability: Distributes training tasks across multiple worker nodes.
- Reliability: Ensures high availability using Kubernetes deployments.
- Security: Connects securely to AWS services using SSL/TLS and IAM roles.
- Monitoring: Provides health checks via Spring Boot Actuator.
- Master Service: Orchestrates training, communicates via Kafka, stores metadata in DocumentDB, and uploads models to S3.
- Worker Service: Performs AI training, receives tasks via Kafka, and interacts with DocumentDB and S3.
- Amazon DocumentDB: Stores training metadata and intermediate results.
- Amazon S3: Stores datasets and trained models.
- Apache Kafka: Facilitates message passing between master and workers.
- Kubernetes (EKS): Orchestrates deployments in the `ap-south-1` region.
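
The master/worker split described above can be illustrated with a small, self-contained sketch. It is not the platform's actual code: an in-memory `BlockingQueue` stands in for the Kafka topic, threads stand in for worker pods, and the `TrainingTask` and `ShardResult` types are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class TaskDistributionSketch {

    /** Hypothetical task message: one dataset shard to train on. */
    record TrainingTask(int shardId) {}

    /** Hypothetical result message reported back by a worker. */
    record ShardResult(int shardId, String workerName) {}

    /** Publishes one task per shard and lets the given number of workers drain the queue. */
    static List<ShardResult> run(int shards, int workers) {
        // In-memory stand-in for the Kafka topic that carries tasks (assumption).
        BlockingQueue<TrainingTask> taskTopic = new LinkedBlockingQueue<>();
        Queue<ShardResult> results = new ConcurrentLinkedQueue<>();

        // Master side: publish one task per shard.
        for (int shard = 0; shard < shards; shard++) {
            taskTopic.add(new TrainingTask(shard));
        }

        // Worker side: each worker polls until no tasks remain.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            String workerName = "worker-" + w;
            pool.submit(() -> {
                TrainingTask task;
                while ((task = taskTopic.poll()) != null) {
                    // Real training would happen here; the sketch only records who took the shard.
                    results.add(new ShardResult(task.shardId(), workerName));
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        List<ShardResult> results = run(6, 3);
        System.out.println("Completed shards: " + results.size());
    }
}
```

Which worker handles which shard varies from run to run, which mirrors how adding worker replicas spreads the same task stream over more consumers.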
- AWS account with access to EKS, DocumentDB, S3, and IAM.
- `kubectl`, Docker, Maven, AWS CLI, and Java 17 installed.
- An EKS cluster in the `ap-south-1` region.
- IAM Roles for Service Accounts (IRSA) configured for EKS.

Clone the repository:

    git clone https://github.com/adityatiwari/distributed-ai-training-platform.git
    cd distributed-ai-training-platform

- DocumentDB:
  - Create a cluster in `ap-south-1`.
  - Endpoint: `ai-training-db.cluster-ct48ogi24zxp.ap-south-1.docdb.amazonaws.com:27017`.
  - Database: `trainingdb`, Username: `dbuser`, Password: `SiliconValley100%`.
  - Enable SSL.
- S3:
  - Create a bucket (e.g., `ai-training-models`) in `ap-south-1`.
- Kafka:
  - Deploy Kafka in EKS (endpoint: `kafka.default.svc.cluster.local:9092`).
- IAM Role (IRSA):
  - Create an IAM role (`ai-training-role`) with DocumentDB and S3 permissions.
  - Associate it with a Kubernetes service account (`ai-training-sa`).
- Master Service (`master-service/src/main/resources/application.properties`):

      spring.data.mongodb.uri=mongodb://dbuser:SiliconValley100%25@ai-training-db.cluster-ct48ogi24zxp.ap-south-1.docdb.amazonaws.com:27017/trainingdb?ssl=true
      spring.kafka.bootstrap-servers=kafka.default.svc.cluster.local:9092
      aws.region=ap-south-1
      management.endpoints.web.exposure.include=health
      management.endpoint.health.show-details=always

- Worker Service (`worker-service/src/main/resources/application.properties`):

      spring.data.mongodb.uri=mongodb://dbuser:SiliconValley100%25@ai-training-db.cluster-ct48ogi24zxp.ap-south-1.docdb.amazonaws.com:27017/trainingdb?ssl=true
      spring.kafka.bootstrap-servers=kafka.default.svc.cluster.local:9092
      aws.region=ap-south-1
      management.endpoints.web.exposure.include=health
      management.endpoint.health.show-details=always

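
Note that the password appears as `SiliconValley100%25` in the connection URI: the literal `%` has to be percent-encoded, otherwise the URI is malformed. The following sketch shows one way to produce the encoded form with the JDK alone. The class and method names (`MongoUriSketch`, `encodeCredential`, `buildUri`) are hypothetical helpers for illustration, not part of the repository.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class MongoUriSketch {

    /** Percent-encodes a username or password for use in a MongoDB connection URI. */
    static String encodeCredential(String raw) {
        // URLEncoder performs form encoding, which turns spaces into '+';
        // a URI needs %20 instead, so that one case is patched afterwards.
        return URLEncoder.encode(raw, StandardCharsets.UTF_8).replace("+", "%20");
    }

    /** Assembles a URI in the same shape as the one in application.properties. */
    static String buildUri(String user, String password, String hostAndPort, String database) {
        return "mongodb://" + encodeCredential(user) + ":" + encodeCredential(password)
                + "@" + hostAndPort + "/" + database + "?ssl=true";
    }

    public static void main(String[] args) {
        String uri = buildUri(
                "dbuser",
                "SiliconValley100%",
                "ai-training-db.cluster-ct48ogi24zxp.ap-south-1.docdb.amazonaws.com:27017",
                "trainingdb");
        // The '%' in the password is emitted as '%25'.
        System.out.println(uri);
    }
}
```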
- Download certificates:

      curl -o rds-ap-south-1-bundle.pem https://truststore.pki.rds.amazonaws.com/ap-south-1/ap-south-1-bundle.pem
      curl -o s3-root-ca.pem https://www.amazontrust.com/repository/AmazonRootCA1.pem

  Split `rds-ap-south-1-bundle.pem` into `ap-south-1-cert-1.pem` (root CA) and `ap-south-1-subordinate-ca.pem` (subordinate CA).

- Import into the JVM’s `cacerts`:

      keytool -import -trustcacerts -file ap-south-1-cert-1.pem -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -alias rds-ap-south-1-root-ca -noprompt
      keytool -import -trustcacerts -file ap-south-1-subordinate-ca.pem -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -alias rds-ap-south-1-subordinate-ca -noprompt
      keytool -import -trustcacerts -file s3-root-ca.pem -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -alias s3-root-ca -noprompt

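
The steps above do not say how the bundle is split. One possible approach, sketched here with the JDK only, is to extract each `BEGIN CERTIFICATE` / `END CERTIFICATE` block into its own file. The class name `PemBundleSplitter` and the `cert-<n>.pem` output names are assumptions for this example; the files would still need to be renamed to match the names used in the `keytool` commands, and the order of certificates inside the bundle is not guaranteed, so it is worth inspecting each file (for example with `keytool -printcert -file cert-1.pem`) before importing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PemBundleSplitter {

    // Matches one PEM certificate block, including its BEGIN/END marker lines.
    private static final Pattern CERT_BLOCK = Pattern.compile(
            "-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", Pattern.DOTALL);

    /** Returns every certificate block found in the bundle, in file order. */
    static List<String> split(String bundle) {
        List<String> certs = new ArrayList<>();
        Matcher matcher = CERT_BLOCK.matcher(bundle);
        while (matcher.find()) {
            certs.add(matcher.group() + "\n");
        }
        return certs;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PemBundleSplitter.java rds-ap-south-1-bundle.pem
        String bundle = Files.readString(Path.of(args[0]));
        List<String> certs = split(bundle);
        for (int i = 0; i < certs.size(); i++) {
            // Hypothetical output names: cert-1.pem, cert-2.pem, ...
            Path out = Path.of("cert-" + (i + 1) + ".pem");
            Files.writeString(out, certs.get(i));
            System.out.println("Wrote " + out);
        }
    }
}
```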
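- Master Service:

      # Authenticate Docker to ECR before pushing (skip if already logged in)
      aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin 676206948598.dkr.ecr.ap-south-1.amazonaws.com
      cd master-service
      mvn clean package
      docker build -t 676206948598.dkr.ecr.ap-south-1.amazonaws.com/master-service:latest .
      docker push 676206948598.dkr.ecr.ap-south-1.amazonaws.com/master-service:latest
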
- Worker Service:

      cd worker-service
      mvn clean package
      docker build -t 676206948598.dkr.ecr.ap-south-1.amazonaws.com/worker-service:latest .
      docker push 676206948598.dkr.ecr.ap-south-1.amazonaws.com/worker-service:latest

- ECR Secret:

      kubectl create secret docker-registry ecr-secret \
        --docker-server=676206948598.dkr.ecr.ap-south-1.amazonaws.com \
        --docker-username=AWS \
        --docker-password="$(aws ecr get-login-password --region ap-south-1)"

- Deployments (`deployments.yaml`):

      # Master Service Deployment
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: master-deployment
        labels:
          app: master
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: master
        template:
          metadata:
            labels:
              app: master
          spec:
            serviceAccountName: ai-training-sa
            imagePullSecrets:
              - name: ecr-secret
            containers:
              - name: master
                image: 676206948598.dkr.ecr.ap-south-1.amazonaws.com/master-service:latest
                ports:
                  - containerPort: 8080
                env:
                  - name: SPRING_DATA_MONGODB_URI
                    value: "mongodb://dbuser:SiliconValley100%25@ai-training-db.cluster-ct48ogi24zxp.ap-south-1.docdb.amazonaws.com:27017/trainingdb?ssl=true"
                  - name: SPRING_KAFKA_BOOTSTRAP_SERVERS
                    value: "kafka.default.svc.cluster.local:9092"
                  - name: AWS_REGION
                    value: "ap-south-1"
                resources:
                  requests:
                    memory: "512Mi"
                    cpu: "250m"
                  limits:
                    memory: "1Gi"
                    cpu: "500m"
                readinessProbe:
                  httpGet:
                    path: /actuator/health
                    port: 8080
                  initialDelaySeconds: 30
                  periodSeconds: 10
      ---
      # Worker Service Deployment
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: worker-deployment
        labels:
          app: worker
      spec:
        replicas: 3
        selector:
          matchLabels:
            app: worker
        template:
          metadata:
            labels:
              app: worker
          spec:
            serviceAccountName: ai-training-sa
            initContainers:
              - name: wait-for-kafka
                image: busybox
                command: ["sh", "-c", "until nc -z kafka.default.svc.cluster.local 9092; do echo 'Waiting for Kafka...'; sleep 2; done"]
            containers:
              - name: worker
                image: 676206948598.dkr.ecr.ap-south-1.amazonaws.com/worker-service:latest
                env:
                  - name: SPRING_DATA_MONGODB_URI
                    value: "mongodb://dbuser:SiliconValley100%25@ai-training-db.cluster-ct48ogi24zxp.ap-south-1.docdb.amazonaws.com:27017/trainingdb?ssl=true"
                  - name: SPRING_KAFKA_BOOTSTRAP_SERVERS
                    value: "kafka.default.svc.cluster.local:9092"
                  - name: AWS_REGION
                    value: "ap-south-1"
                resources:
                  requests:
                    memory: "1Gi"
                    cpu: "500m"
                  limits:
                    memory: "2Gi"
                    cpu: "1"

  Apply:

      kubectl apply -f deployments.yaml

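
The resource requests in these manifests determine how much cluster capacity the platform reserves. As a rough aid for sizing node groups, the sketch below adds up the requested CPU and memory for one master plus a given number of workers, using the request values from `deployments.yaml`. The class and method names are hypothetical, and the calculation ignores Kafka, system pods, and the limits.

```java
public class ResourceRequestSketch {

    // Request values copied from deployments.yaml (CPU in millicores, memory in MiB).
    static final int MASTER_CPU_MILLICORES = 250;
    static final int MASTER_MEMORY_MIB = 512;
    static final int WORKER_CPU_MILLICORES = 500;
    static final int WORKER_MEMORY_MIB = 1024;

    /** Total CPU requested by one master and the given number of workers. */
    static int totalCpuMillicores(int workerReplicas) {
        return MASTER_CPU_MILLICORES + workerReplicas * WORKER_CPU_MILLICORES;
    }

    /** Total memory requested by one master and the given number of workers. */
    static int totalMemoryMiB(int workerReplicas) {
        return MASTER_MEMORY_MIB + workerReplicas * WORKER_MEMORY_MIB;
    }

    public static void main(String[] args) {
        for (int replicas : new int[] {3, 5}) {
            System.out.println(replicas + " workers: "
                    + totalCpuMillicores(replicas) + "m CPU, "
                    + totalMemoryMiB(replicas) + "Mi memory");
        }
        // prints:
        // 3 workers: 1750m CPU, 3584Mi memory
        // 5 workers: 2750m CPU, 5632Mi memory
    }
}
```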
- Service (`services.yaml`):

      apiVersion: v1
      kind: Service
      metadata:
        name: master-service
      spec:
        selector:
          app: master
        ports:
          - protocol: TCP
            port: 80
            targetPort: 8080
        type: ClusterIP

  Apply:

      kubectl apply -f services.yaml

- Check pod status:

      kubectl get pods -l app=master
      kubectl get pods -l app=worker

- Check deployment status:

      kubectl rollout status deployment master-deployment
      kubectl rollout status deployment worker-deployment

- Check logs:

      kubectl logs -l app=master --tail=100
      kubectl logs -l app=worker --tail=100

Scale the worker deployment (for example, to five replicas):

    kubectl scale deployment worker-deployment --replicas=5

- Health endpoint: `/actuator/health` (master service).
- Use `kubectl logs` and `kubectl describe pod` for debugging.
- Integrate with AWS CloudWatch for logging.
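
For a quick manual check of the health endpoint, a small client like the following could be used. This is a sketch under assumptions: it expects the master service to be reachable at `http://localhost:8080` (for instance through `kubectl port-forward svc/master-service 8080:80`), and it assumes the usual Actuator response shape in which the top-level `"status"` field appears first in the JSON body. The class name `HealthCheckSketch` is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HealthCheckSketch {

    // Captures the value of the first "status" field in the response body.
    private static final Pattern STATUS = Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");

    /** Returns true when the first "status" field in the body is UP. */
    static boolean isUp(String body) {
        Matcher matcher = STATUS.matcher(body);
        return matcher.find() && "UP".equals(matcher.group(1));
    }

    public static void main(String[] args) throws Exception {
        // Assumed default URL; pass a different one as the first argument.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/actuator/health";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("HTTP " + response.statusCode() + ": " + response.body());
        System.out.println(isUp(response.body()) ? "Service is UP" : "Service is NOT up");
    }
}
```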
- Pod Not Ready: Check `kubectl describe pod` for readiness probe failures.
- DocumentDB Issues: Verify the URI, credentials, and `cacerts` truststore.
- S3 Access: Verify the IAM role permissions and the IRSA setup.
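
When IRSA is wired up correctly, EKS injects the `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` environment variables into pods that use the annotated service account. A missing variable usually points at the service account annotation or the pod's `serviceAccountName`. The sketch below checks for both variables; the class name `IrsaEnvCheck` is hypothetical, and running it inside the pod (for example as a startup log line) is an assumption about how it would be used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class IrsaEnvCheck {

    // Environment variables that EKS injects for IAM Roles for Service Accounts.
    private static final List<String> REQUIRED =
            List.of("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE");

    /** Returns the IRSA variables that are absent or blank in the given environment. */
    static List<String> missingIrsaVariables(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingIrsaVariables(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("IRSA variables present; role: " + System.getenv("AWS_ROLE_ARN"));
        } else {
            System.out.println("IRSA does not appear to be configured; missing: " + missing);
        }
    }
}
```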
- Update code in `master-service` or `worker-service`.
- Rebuild and push images:

      cd master-service
      mvn clean package
      docker build -t 676206948598.dkr.ecr.ap-south-1.amazonaws.com/master-service:latest .
      docker push 676206948598.dkr.ecr.ap-south-1.amazonaws.com/master-service:latest

  Repeat for `worker-service`.
- Restart deployments:

      kubectl rollout restart deployment master-deployment
      kubectl rollout restart deployment worker-deployment

- Used the JVM’s `cacerts` truststore to simplify SSL configuration.
- Configured Spring Boot Actuator for reliable health checks.
- Debugged SSL issues with `-Djavax.net.debug=ssl:handshake:verbose`.
- Add liveness probes for automatic pod restarts.
- Implement CI/CD with GitHub Actions.
- Integrate Prometheus and Grafana for monitoring.
For questions, contact Aditya Tiwari at [email protected].