Darwin ML Platform - Agent Entry Point

Darwin is an enterprise-grade, end-to-end machine learning platform. This repository handles deployment orchestration - building Docker images and deploying the entire platform to Kubernetes using Helm charts.


🚀 Setup Workflow

1. ./init.sh      # Interactive use case selection (creates .setup/enabled-services.yaml)
2. ./setup.sh     # Build images, create Kind cluster, push to local registry
3. ./start.sh     # Deploy to Kubernetes via Helm

init.sh Modes:

| Mode | Command | Description |
|------|---------|-------------|
| Default | ./init.sh | Simplified preset selection (Training / Inference) |
| Dev Mode | ./init.sh --dev-mode | Granular service-by-service selection |
| All | ./init.sh --all | Enable all services without prompts |

Presets (Default Mode):

| Preset | Features Enabled | Use Case |
|--------|------------------|----------|
| Training | Compute + MLflow | Model training, experiments, distributed compute |
| Inference | Serve + MLflow | Model deployment, real-time predictions |

Other Flags:

  • ./setup.sh -y - Skip prompts (auto-answer yes)
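
For unattended runs, the three workflow steps can be chained using the flags listed above. The following is a minimal sketch, not part of the repository: the function name `bootstrap_all` is hypothetical, and it assumes the three scripts live in the repository root and exit non-zero on failure.

```shell
#!/bin/sh
# Run the three setup steps non-interactively, stopping at the first failure.
# Usage: bootstrap_all [repo-dir]   (repo-dir defaults to the current directory)
bootstrap_all() {
  repo_dir="${1:-.}"
  (
    cd "$repo_dir" || exit 1
    # 1. Enable all services without prompts.
    ./init.sh --all || exit 1
    # 2. Build images and create the Kind cluster, auto-answering yes.
    ./setup.sh -y || exit 1
    # 3. Deploy to Kubernetes via Helm.
    ./start.sh
  )
}
```

Calling `bootstrap_all` from the repository root runs all three steps. The subshell keeps the caller's working directory unchanged.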

🏗️ Project Architecture

Platform Components

| Feature | Applications | Description |
|---------|--------------|-------------|
| Compute | darwin-compute, darwin-cluster-manager | Ray cluster management & K8s orchestration |
| Workspace | darwin-workspace | Project & Jupyter environment management |
| Feature Store | darwin-ofs-v2, darwin-ofs-v2-admin, darwin-ofs-v2-consumer | Online feature serving (<10 ms latency) |
| MLflow | darwin-mlflow, darwin-mlflow-app | Experiment tracking & model registry |
| Serve | ml-serve-app, artifact-builder | Model deployment & Docker image building |
| Catalog | darwin-catalog | Data asset discovery & lineage |
| Chronos | chronos, chronos-consumer | Event processing & metadata tracking |
| Workflow | darwin-workflow | ML pipeline orchestration (Airflow-based) |

Datastores

| Datastore | Usage |
|-----------|-------|
| MySQL | Metadata storage for all services |
| Cassandra | Feature Store values (high-throughput) |
| OpenSearch | Chronos events, Compute metadata |
| Kafka + Zookeeper | Event streaming, feature materialization |
| LocalStack | S3 emulation for artifacts |
| Airflow | Workflow DAG execution |
| Elasticsearch | Workflow search (alternative to OpenSearch) |

Infrastructure Operators

  • KubeRay Operator (v1.1.0) - Ray cluster lifecycle management
  • Nginx - Ingress controller
  • Grafana - Monitoring dashboards

📁 Key Files Reference

| File | Purpose |
|------|---------|
| init.sh | Interactive service selection wizard (run first) |
| setup.sh | Creates cluster & builds all images |
| start.sh | Deploys platform via Helm with config overrides |
| services.yaml | Application registry - defines available services, datastores, operators |
| service-dependencies.yaml | Service-to-service and service-to-datastore dependencies |
| .setup/config.env | Runtime configuration (generated) |
| .setup/enabled-services.yaml | User-selected services config (generated by init.sh) |
| helm/darwin/ | Main Helm umbrella chart |
| kind/ | Local Kubernetes cluster config |
| deployer/ | Base images (Python, Java, Go) and build scripts |

🔧 Agent Instructions

  1. Run init.sh first - This creates .setup/enabled-services.yaml with the user's service selections
  2. Load prompts on-demand - Only read prompts relevant to the current task
  3. Check .setup/enabled-services.yaml - This is the source of truth for which services are enabled
  4. Check services.yaml - Defines available applications, datastores, and operators
  5. Check service-dependencies.yaml - Understand service dependencies before enabling/disabling
  6. Check .setup/config.env - Contains current KUBECONFIG and DOCKER_REGISTRY
  7. Respect .odin/ conventions - Each submodule must have build.sh, setup.sh, start.sh
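
Before acting on these instructions, it can help to confirm that the files they refer to exist. The check below is a rough sketch: the function name `missing_setup_files` is hypothetical, and it assumes services.yaml and service-dependencies.yaml sit in the repository root, which this document does not state explicitly.

```shell
#!/bin/sh
# List the files named in the instructions above that are missing from a checkout.
# Usage: missing_setup_files [repo-dir]
# Prints one missing path per line and returns non-zero if anything is missing.
missing_setup_files() {
  repo_dir="${1:-.}"
  rc=0
  for f in services.yaml service-dependencies.yaml \
           .setup/enabled-services.yaml .setup/config.env; do
    if [ ! -f "$repo_dir/$f" ]; then
      echo "$f"
      rc=1
    fi
  done
  return "$rc"
}
```

A missing file under .setup/ most likely means init.sh or setup.sh has not been run yet, since both files there are generated.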

Common Operations

Check Cluster Status

kubectl get pods -n darwin          # Darwin services
kubectl get pods -n ray             # Ray clusters
kubectl get pods -n serve           # Model serving pods

Rebuild a Single Service

# Build and push image (DOCKER_REGISTRY is defined in .setup/config.env)
source .setup/config.env
sh deployer/scripts/image-builder.sh -a <app-name> -t <base-path> -p <path> -e <base-image> -r "$DOCKER_REGISTRY"

# Restart deployment
kubectl rollout restart deployment/<service-name> -n darwin

Access Services (Local)

  • Compute: http://localhost/compute/*
  • Feature Store: http://localhost/feature-store/*
  • MLflow UI: http://localhost/mlflow-app/*
  • Chronos: http://localhost/chronos/*
  • Catalog: http://localhost/darwin-catalog/*
  • Workspace: http://localhost/workspace/*
  • Workflow: http://localhost/workflow/*
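
To see which of these routes the ingress is answering, each prefix can be probed with curl. This is a sketch under assumptions: the function names are hypothetical, and it requests the bare prefix because no health-check paths are documented here, so a 404 still indicates that the ingress responded.

```shell
#!/bin/sh
# Path prefixes copied from the list above.
local_service_urls() {
  for prefix in compute feature-store mlflow-app chronos darwin-catalog workspace workflow; do
    echo "http://localhost/$prefix/"
  done
}

# Print "<http-status> <url>" for each route.
# Any real HTTP status means something answered; 000 means curl could not
# connect or timed out.
check_local_routes() {
  local_service_urls | while read -r url; do
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")"
    echo "$code $url"
  done
}
```

Run `check_local_routes` after ./start.sh finishes. Routes for services that were not enabled in .setup/enabled-services.yaml are expected to fail.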

Adding New Services

  1. Add entry to services.yaml under applications:
  2. Add dependencies to service-dependencies.yaml
  3. Create Helm subchart in helm/darwin/charts/services/
  4. Update init.sh if it's a new feature group
  5. Update start.sh with helm path mapping in get_helm_path()

Adding New Datastores

  1. Add entry to services.yaml under datastores:
  2. Create templates in helm/darwin/charts/datastores/templates/
  3. Update service-dependencies.yaml for services that need it

📦 Service Dependencies (Quick Reference)

darwin-compute         → darwin-cluster-manager
darwin-workspace       → darwin-compute
darwin-workflow        → darwin-compute, darwin-cluster-manager
ml-serve-app           → artifact-builder, darwin-cluster-manager, darwin-mlflow-app
darwin-mlflow-app      → darwin-mlflow
darwin-ofs-v2          → darwin-ofs-v2-admin
darwin-ofs-v2-consumer → darwin-ofs-v2-admin
chronos-consumer       → chronos
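
The map above lists direct dependencies only. The sketch below shows one way to expand a service into everything it needs, dependencies first. It hard-codes the map from this quick reference rather than parsing service-dependencies.yaml (whose schema is not shown here), it covers service-to-service dependencies only, and the function names are hypothetical.

```shell
#!/bin/sh
# Direct dependencies, copied from the quick-reference map above.
deps_of() {
  case "$1" in
    darwin-compute)         echo "darwin-cluster-manager" ;;
    darwin-workspace)       echo "darwin-compute" ;;
    darwin-workflow)        echo "darwin-compute darwin-cluster-manager" ;;
    ml-serve-app)           echo "artifact-builder darwin-cluster-manager darwin-mlflow-app" ;;
    darwin-mlflow-app)      echo "darwin-mlflow" ;;
    darwin-ofs-v2)          echo "darwin-ofs-v2-admin" ;;
    darwin-ofs-v2-consumer) echo "darwin-ofs-v2-admin" ;;
    chronos-consumer)       echo "chronos" ;;
    *)                      echo "" ;;
  esac
}

# Depth-first walk: print each dependency before the service that needs it,
# skipping names that were already visited.
_resolve() {
  case " $_seen " in *" $1 "*) return ;; esac
  _seen="$_seen $1"
  for _dep in $(deps_of "$1"); do
    _resolve "$_dep"
  done
  echo "$1"
}

# Usage: resolve_deps <service>
# Prints the service and its transitive dependencies, one name per line.
resolve_deps() {
  _seen=""
  _resolve "$1"
}
```

For example, `resolve_deps darwin-workspace` prints darwin-cluster-manager, darwin-compute, and darwin-workspace, in that order, which is a safe order in which to enable them.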

🛠️ CLI Tools

Darwin CLI

Unified command-line interface for all Darwin services:

source .venv/bin/activate
darwin config set --env darwin-local
darwin serve configure
darwin serve create --name my-model --type api --space serve
darwin serve deploy-model --serve-name my-model --model-uri mlflow-artifacts:/...

📖 Full documentation: darwin-cli/README.md


📊 Ray Runtimes

| Image | Ray Version | Python | Spark |
|-------|-------------|--------|-------|
| ray:2.37.0 | 2.37.0 | 3.10 | - |
| ray:2.53.0 | 2.53.0 | 3.10 | - |
| ray:2.37.0-darwin-sdk | 2.37.0 | 3.10 | 3.5.0 |

The Darwin SDK runtime (ray:2.37.0-darwin-sdk) adds Spark 3.5.0 integration for distributed data processing.


📚 Additional Documentation

| Document | Location |
|----------|----------|
| Main README | README.md |
| Darwin CLI | darwin-cli/README.md |
| Helm Umbrella Chart | helm/darwin/UMBRELLA_CHART.md |
| Deployment Order | helm/darwin/DEPLOYMENT_ORDER.md |
| Feature Store Architecture | feature-store/ARCHITECTURE.md |

🐛 Troubleshooting

Common Issues

Cluster not reachable:

source .setup/config.env
kubectl cluster-info

Service not starting:

kubectl describe pod <pod-name> -n darwin
kubectl logs <pod-name> -n darwin

Helm deployment failed:

helm status darwin -n darwin
helm history darwin -n darwin

LocalStack S3 issues:

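# Forward the LocalStack port (this command blocks; leave it running)
kubectl port-forward svc/darwin-localstack -n darwin 4566:4566
# In a second terminal, list buckets through the forwarded port
AWS_ACCESS_KEY_ID=test AWS_SECRET_ACCESS_KEY=test aws s3 ls --endpoint-url=http://localhost:4566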

🎯 Quick Task Navigation

| Task | Steps |
|------|-------|
| First-time setup (simple) | Run ./init.sh → Select Training/Inference → ./setup.sh → ./start.sh |
| First-time setup (advanced) | Run ./init.sh --dev-mode → Select individual services → ./setup.sh → ./start.sh |
| Add new microservice | Edit services.yaml → Add Helm chart → Update init.sh/start.sh |
| Enable/disable service | Edit .setup/enabled-services.yaml → Run ./start.sh |
| Rebuild images | Run ./setup.sh -y |
| Debug pod | Run kubectl logs / kubectl describe → Check service dependencies |
| Deploy model | Use Darwin CLI (see darwin-cli/README.md#serve-commands) |