Skip to content

dream-horizon-org/darwin

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

140 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Darwin ML Platform

🌐 Visit Darwin Platform

Darwin is an enterprise-grade, end-to-end machine learning platform designed for production-scale AI/ML workloads. It provides a unified ecosystem for the complete ML lifecycleβ€”from distributed compute and feature engineering to experiment tracking, model deployment, and real-time inference serving.


🎯 Why Darwin?

Darwin solves critical challenges in production ML infrastructure:

  • Unified Platform: Single platform for training, serving, and feature engineeringβ€”no context switching between disparate tools
  • Production-Grade Scalability: Built on Kubernetes and Ray for elastic, distributed compute at scale
  • Cost Optimization: Intelligent auto-scaling, spot instance support, and policy-based auto-termination
  • Developer Velocity: SDK-first design with CLI tools for rapid experimentation and deployment
  • Enterprise Ready: Multi-tenancy, RBAC, audit logging, and metadata lineage out of the box
  • Low-Latency Serving: Sub-10ms feature retrieval and optimized model inference pipelines

πŸ—οΈ Architecture Overview

graph TB
    subgraph "User Interface Layer"
        UI[Workspace UI/Jupyter]
        CLI[Darwin CLI]
        SDK[Python SDKs]
    end

    subgraph "Orchestration Layer"
        Workspace[Workspace Service<br/>Projects & Codespaces]
        MLflow[MLflow<br/>Experiment Tracking]
        Chronos[Chronos<br/>Event & Metadata]
        Workflow[Darwin Workflow<br/>Pipeline Orchestration]
    end

    subgraph "Compute Layer"
        Compute[Darwin Compute<br/>Cluster Management]
        DCM[Darwin Cluster Manager<br/>K8s Orchestration]
        Ray[Ray Clusters<br/>Distributed Execution]
    end

    subgraph "Data Layer"
        FS[Feature Store<br/>Online/Offline Features]
        Catalog[Darwin Catalog<br/>Asset Discovery]
    end

    subgraph "Serving Layer"
        Serve[ML Serve<br/>Model Deployment]
        Builder[Artifact Builder<br/>Image Building]
    end

    subgraph "Infrastructure"
        MySQL[(MySQL<br/>Metadata)]
        Cassandra[(Cassandra<br/>Features)]
        OpenSearch[(OpenSearch<br/>Events)]
        S3[(S3<br/>Artifacts)]
        Kafka[(Kafka<br/>Streaming)]
        K8s[Kubernetes/EKS]
    end

    UI --> Workspace
    CLI --> Serve
    CLI --> Compute
    SDK --> Compute
    SDK --> FS
    SDK --> MLflow

    Workspace --> Compute
    Workspace --> Chronos
    MLflow --> S3
    MLflow --> MySQL

    Compute --> DCM
    DCM --> Ray
    Ray --> K8s

    Serve --> Builder
    Serve --> DCM
    Builder --> K8s

    FS --> Cassandra
    FS --> Kafka
    FS --> MySQL
    Catalog --> MySQL
    Catalog --> OpenSearch
    Chronos --> OpenSearch
    Chronos --> Kafka

    Workflow --> Compute
    Workflow --> MySQL
    Workflow --> Airflow[(Airflow<br/>DAG Execution)]

    style Darwin fill:#e1f5ff
    style Compute fill:#ffe1e1
    style FS fill:#e1ffe1
    style Serve fill:#fff5e1
Loading

πŸ“¦ Platform Components

1. Darwin Compute

Distributed compute orchestration for ML workloads

  • Ray Cluster Management: Create, scale, and manage Ray 2.37.0 clusters on Kubernetes
  • Multi-Runtime Support: Pre-configured runtimes (Ray + Python 3.10 + Spark 3.5.1)
  • Resource Optimization:
    • Spot/on-demand instance mixing
    • Auto-termination policies (idle detection, CPU thresholds)
    • Cost monitoring with Slack alerts
  • Package Management: Dynamic installation of PyPI, Maven, and workspace packages
  • Jupyter Integration: Managed Jupyter notebooks with direct cluster access
  • Job Scheduling: Ray job submission and monitoring

SDK: darwin-compute

from darwin_compute import ComputeCluster

cluster = ComputeCluster(env="prod")
result = cluster.create_with_yaml("cluster-config.yaml")
cluster.start(cluster_id=result['cluster_id'])

2. Darwin Cluster Manager (DCM)

Low-level Kubernetes orchestration service (Go)

  • Helm-based Ray cluster deployment via KubeRay operator
  • Dynamic values.yaml generation for cluster configurations
  • Remote command execution on cluster pods
  • Jupyter pod lifecycle management
  • FastAPI serve deployment orchestration

3. Feature Store

High-performance feature serving and engineering platform

Components:

  • darwin-ofs-v2 (App): Low-latency online feature serving (<10ms)
  • darwin-ofs-v2-admin: Feature group management, schema versioning
  • darwin-ofs-v2-consumer: Kafka-based feature materialization
  • darwin-ofs-v2-populator: Bulk ingestion from Parquet/Delta tables

Capabilities:

  • Real-time feature retrieval with Cassandra backend
  • Point-in-time correctness for training datasets
  • Feature validation and schema evolution
  • Spark integration for batch feature pipelines
  • Multi-tenant feature isolation

Storage Architecture:

  • Cassandra: High-throughput feature values
  • MySQL: Feature metadata and schemas
  • Kafka: Real-time feature streaming

SDK: darwin_fs

from darwin_fs import FeatureStoreClient

fs = FeatureStoreClient()
features = fs.fetch_features(
    feature_group="user_engagement",
    keys=[123, 456]
)

4. ML Serve

Production model deployment and serving platform

  • Serve Lifecycle: Create, configure, deploy, monitor, undeploy
  • Multi-Environment: Dev, staging, UAT, production with environment-specific configs
  • Backend Support:
    • FastAPI serves for REST inference
    • Ray Serve for distributed model serving (experimental)
  • Artifact Management: Git-based Docker image builds
  • Auto-Scaling: HPA-based horizontal pod autoscaling
  • Feature Store Integration: Native integration for online feature retrieval

Deployment Workflow:

# Complete model deployment via Darwin CLI

# 1. Configure environment and authentication
source .venv/bin/activate
darwin config set --env darwin-local
darwin serve configure  # Uses default token for darwin-local

# 2. Create environment (one-time setup)
darwin serve environment create --name local --domain-suffix .local --cluster-name kind

# 3. Create serve definition
darwin serve create --name my-model --type api --space serve --description "My ML model"

# 4. Deploy model
darwin serve deploy-model \
  --serve-name my-model \
  --artifact-version v1 \
  --model-uri mlflow-artifacts:/1/abc123/artifacts/model \
  --cores 4 \
  --memory 8 \
  --node-capacity spot \
  --min-replicas 2 \
  --max-replicas 10

πŸ“– For complete Serve CLI commands and deployment options, see darwin-cli/README.md#serve-commands


5. Artifact Builder

Docker image building service for ML models

  • Build images from GitHub repositories with custom Dockerfiles
  • Queue-based build system with status tracking
  • Container registry integration (ECR, GCR)
  • Integration with ML Serve deployment pipeline

6. Darwin MLflow

Experiment tracking and model registry

  • MLflow 2.12.2 with custom FastAPI authentication layer
  • Experiment and run tracking (parameters, metrics, artifacts)
  • Model registry with versioning
  • User-based experiment permissions
  • S3/LocalStack artifact storage
  • Custom UI with enhanced authorization

SDK: darwin_mlflow (wraps MLflow client)

import darwin_mlflow as mlflow

mlflow.log_params({"lr": 0.001, "epochs": 100})
mlflow.log_metric("accuracy", 0.95)
mlflow.sklearn.log_model(model, "model")

7. Chronos (Event Processing & Metadata)

Event ingestion, transformation, and lineage tracking

  • Event Sources: REST API for raw events from services
  • Transformers: Python/JSONPath-based event processing
  • Entity Extraction: Automatic entity creation (clusters, users, jobs)
  • Relationship Mapping: Build lineage graphs between entities
  • Queue Processing: Async consumption from Kafka/SQS

Use Cases:

  • Cluster lifecycle tracking
  • Workflow execution lineage
  • Audit logs and compliance
  • Metadata dependencies (data β†’ model β†’ deployment)

8. Darwin Workspace

Project and development environment management

  • Project Management: Multi-user project organization
  • Codespace Lifecycle: Create and manage Jupyter/VSCode environments
  • Compute Integration: Attach Ray clusters to development environments
  • Shared Storage: FSx/EFS integration for persistent workspaces
  • Event Publishing: Workspace state changes tracked in Chronos

9. Darwin Catalog

Data asset discovery and governance

  • Asset Management: Register datasets, tables, models
  • Schema Tracking: Schema evolution and versioning
  • Lineage: OpenLineage-based data lineage tracking
  • Search: Full-text search across data assets
  • Metadata: Tags, descriptions, ownership, quality metrics
  • Integration: Spark and Airflow job lineage capture

10. Darwin Workflow

ML pipeline orchestration and scheduling

  • Workflow Definition: Define multi-step ML pipelines with task dependencies
  • DAG Management: Create, deploy, and manage Airflow DAGs programmatically
  • Job Cluster Integration: Automatic Ray cluster provisioning for workflow tasks
  • Conditional Execution: Support for branching and conditional task execution
  • Callback Events: Event-driven notifications on workflow state changes

Components:

  • App Layer: FastAPI REST API for workflow management
  • Core: Workflow orchestration logic and DAG services
  • Airflow Integration: Custom operators for Darwin platform integration
  • SDK: Python SDK with CLI for workflow creation and management

SDK: darwin_workflow

from darwin_workflow import WorkflowClient

client = WorkflowClient(env="prod")

# Create a workflow
workflow = client.create_workflow(
    name="feature-pipeline",
    tasks=[
        {"name": "extract", "type": "ray_job", "script": "extract.py"},
        {"name": "transform", "type": "ray_job", "script": "transform.py", "depends_on": ["extract"]},
        {"name": "load", "type": "ray_job", "script": "load.py", "depends_on": ["transform"]}
    ]
)

# Trigger workflow run
client.trigger_workflow(workflow_id=workflow['id'])

Use Cases:

  • Scheduled feature engineering pipelines
  • Model retraining workflows
  • Data processing DAGs with Ray/Spark tasks
  • Multi-step ML experiments with dependencies

11. Darwin CLI

Unified command-line interface for all Darwin ML Platform services

  • Compute cluster management
  • Workspace and codespace operations
  • Model serving deployment (Serve)
  • MLflow experiment tracking
  • Feature Store operations
  • Catalog and lineage queries
  • Workflow orchestration

πŸ“– For complete Darwin CLI documentation and all available commands, see darwin-cli/README.md


πŸ‘₯ User Personas

Data Scientists

Use Darwin for: Experimentation, training, model development

  • Launch Ray clusters via SDK for distributed training
  • Track experiments with MLflow
  • Access features from Feature Store
  • Deploy models with Darwin CLI serve commands

ML Engineers

Use Darwin for: Production model deployment and monitoring

  • Configure multi-environment serves (dev/staging/prod)
  • Build and deploy artifacts from GitHub
  • Manage auto-scaling policies
  • Monitor model performance and resource usage

Data Engineers

Use Darwin for: Feature pipelines and data infrastructure

  • Create and manage feature groups in Feature Store
  • Build Spark-based feature engineering pipelines
  • Track data lineage in Catalog
  • Publish features to Kafka for real-time materialization

Platform Engineers

Use Darwin for: Infrastructure management and operations

  • Deploy and configure Darwin platform via Helm
  • Manage Kubernetes resources and policies
  • Monitor costs and resource utilization
  • Configure multi-tenancy and RBAC

πŸš€ Getting Started

Prerequisites

  • Kubernetes cluster (Kind for local, EKS for production)
  • Helm 3.8+
  • kubectl
  • Docker
  • Python 3.9.7+

Quick Start: Local Deployment

# 1. Initialize configuration (select your use case)
./init.sh

# 2. Setup Kind cluster and get images
./setup.sh              # Pull release images (default)
./setup.sh -d           # Build images locally (dev mode)
./setup.sh -y           # Non-interactive, pull release images
./setup.sh -y -d        # Non-interactive, build locally
./setup.sh -y --clean   # Clean install with release images
./setup.sh -y -d --clean # Clean install, build locally

# 3. Deploy Darwin platform to Kubernetes
./start.sh

Choosing Your Use Case

By default, init.sh offers two simplified presets:

Preset Features Use Case
Training Compute + MLFlow Model training, experiments, distributed compute with Ray clusters
Inference Serve + MLFlow Model deployment, real-time inference endpoints

You can select one or both presets. Dependencies are automatically resolved.

Advanced: Granular Service Selection

For fine-grained control over individual services, use dev mode:

./init.sh --dev-mode

This enables the original service-by-service selection:

If you want to... Enable
Run distributed data processing jobs or spin up short-lived compute clusters Compute
Work interactively with persistent code and notebooks attached to scalable clusters Workspace (includes Compute)
Store, version, and serve features for ML training and inference Feature Store
Track experiments, log metrics, and manage model versions MLflow
Deploy trained models as real-time inference endpoints Serve (includes Artifact Builder)
Discover and track lineage across datasets, models, and pipelines Catalog
Capture platform events and build metadata graphs Chronos
Orchestrate multi-step ML pipelines with scheduling and dependencies Workflow (includes Compute, Airflow)

Tip: Dependencies are resolved automatically. For example, enabling Workspace will also enable Compute, and enabling Serve will include Artifact Builder and MLflow.

Access Services:

  • Compute: http://localhost/compute/*
  • Feature Store: http://localhost/feature-store/*
  • MLflow UI: http://localhost/mlflow/*
  • Chronos API: http://localhost/chronos/*
  • Catalog API: http://localhost/darwin-catalog/*
  • Workspace: http://localhost/workspace/*
  • Workflow: http://localhost/workflow/*

Quick Start: Create and Use a Ray Cluster

# Create a cluster via REST API
curl --location 'http://localhost/compute/cluster' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "cluster_name": "my-first-cluster",
    "tags": ["demo"],
    "runtime": "0.0",
    "inactive_time": 30,
    "start_cluster": true,
    "head_node_config": {
        "cores": 4,
        "memory": 8,
        "node_capacity_type": "ondemand"
    },
    "worker_node_configs": [
        {
            "cores_per_pods": 2,
            "memory_per_pods": 4,
            "min_pods": 1,
            "max_pods": 2,
            "disk_setting": null,
            "node_capacity_type": "ondemand"
        }
    ],
    "user": "user@example.com"
}'

# Wait for Cluster to become Active
curl http://localhost/compute/cluster/{cluster_id}/metadata
# Wait until the status shows active.

# Response will include cluster_id
# Get Cluster Dashboards link via below API using cluster_id
curl --location 'http://localhost/compute/cluster/{cluster_id}/dashboards'
# Access Jupyter notebook at the returned jupyter_lab_url
# Monitor Ray cluster at the ray_dashboard_url

# Stop the cluster when done
curl --location --request POST 'http://localhost/compute/cluster/stop-cluster/{cluster_id}' \
  --header 'msd-user: {"email": "user@example.com"}'

Understanding Runtime Parameter:

The runtime field specifies which pre-built Docker image to use for your Ray cluster. Darwin supports multiple runtimes with different Python versions and pre-installed libraries:

  • "0.0": Default runtime with Ray 2.37.0, Python 3.10, Spark 3.5.1, and darwin-sdk
  • Custom runtimes can be registered with specific library combinations

To check available runtimes:

curl http://localhost/compute/get-runtimes | python3 -m json.tool

Or use the Python SDK:

# Install SDK
pip install -e darwin-compute/sdk

# Create a cluster
from darwin_compute import ComputeCluster

cluster = ComputeCluster(env="darwin-local")
response = cluster.create_with_yaml("examples/cluster-config.yaml")
cluster_id = response['cluster_id']

# Check and wait until cluster status becomes active
cluster.get_info(cluster_id)

# Stop when done
cluster.stop(cluster_id)

Quick Start: Deploy a Model

# Activate Darwin CLI
source .venv/bin/activate

# 1. Configure Darwin CLI
darwin config set --env darwin-local

# 2. Configure Serve authentication (uses default token for darwin-local)
darwin serve configure

# 3. Create environment
darwin serve environment create --name local --domain-suffix .local --cluster-name kind

# 4. Create serve
darwin serve create \
  --name iris-classifier \
  --type api \
  --space serve \
  --description "Iris classification model"

# 5. Deploy model
darwin serve deploy-model \
  --serve-name iris-classifier \
  --artifact-version v1 \
  --model-uri mlflow-artifacts:/1/2b2b1b5727a14c5ca81b44e899979745/artifacts/model \
  --cores 2 \
  --memory 4 \
  --node-capacity spot \
  --min-replicas 1 \
  --max-replicas 2

# 6. Make predictions
curl -X POST http://localhost/iris-classifier/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [[5.1, 3.5, 1.4, 0.2]]}'

πŸ“– For complete Serve CLI documentation, see darwin-cli/README.md#serve-commands

πŸ“š Complete End-to-End Examples

For comprehensive step-by-step guides covering the full ML lifecycle (training, deployment, and inference), see these examples:

Example Framework Task Type Guide
Iris Classification Spark + Sklearn Multi-class Classification examples/iris-classification/README.md
Wine Classification Spark + LightGBM Multi-class Classification examples/lightgbm-wine-classification/README.md
Diabetes Regression Spark + XGBoost Regression examples/xgboost-diabetes-regression/README.md

Each example demonstrates:

  • βœ… Platform setup and configuration
  • βœ… Compute cluster creation with Spark support
  • βœ… Hybrid approach: Spark for data processing + native frameworks for training
  • βœ… Model training in Jupyter notebooks
  • βœ… MLflow experiment tracking and model registration
  • βœ… Model deployment with darwin-cli
  • βœ… Real-time inference testing
  • βœ… Complete resource cleanup

Quick Start: Use Feature Store

# Install SDK
pip install -e feature-store/python/darwin_fs

# Fetch features
from darwin_fs import FeatureStoreClient

fs = FeatureStoreClient(env="local")
features = fs.fetch_features(
    feature_group_name="user_features",
    feature_columns=["age", "tenure", "activity_score"],
    primary_key_names=["user_id"],
    primary_key_values=[[123], [456], [789]]
)

🎯 Quick Start: Submit a Spark Job Using Darwin SDK

Darwin SDK provides seamless integration with Apache Spark on Ray clusters. Here's how to run distributed Spark workloads using Darwin as your Spark session provider:

Step 1: Create a Ray Cluster

curl --location 'http://localhost/compute/cluster' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "cluster_name": "spark-demo-cluster",
    "tags": ["spark", "demo"],
    "runtime": "0.0",
    "inactive_time": 60,
    "start_cluster": true,
    "head_node_config": {
        "cores": 4,
        "memory": 8,
        "node_capacity_type": "ondemand"
    },
    "worker_node_configs": [{
        "cores_per_pods": 2,
        "memory_per_pods": 4,
        "min_pods": 1,
        "max_pods": 2,
        "disk_setting": null,
        "node_capacity_type": "ondemand"
    }],
    "user": "user@example.com"
}'

Save the cluster_id from the response.

Step 2: Wait for Cluster to be Ready

# Check cluster status
curl http://localhost/compute/cluster/{cluster_id}/metadata

# Wait until status shows "active"
# Then verify pods are running
kubectl get pods -n ray -l ray.io/cluster={cluster_id}-kuberay

Step 3: Create Your Spark Job

Create a file my_spark_job.py:

#!/usr/bin/env python3
"""
Darwin SDK Spark Job Example
"""
import os
import ray

# Initialize Ray (connects to running Ray cluster)
ray.init()

# Set environment variables
os.environ["ENV"] = "LOCAL"
os.environ["CLUSTER_ID"] = os.getenv("CLUSTER_ID", "your-cluster-id")
os.environ["DARWIN_COMPUTE_URL"] = "http://darwin-compute.darwin.svc.cluster.local:8000"

print("=" * 60)
print("Darwin SDK Spark Job")
print(f"Cluster ID: {os.environ['CLUSTER_ID']}")
print("=" * 60)

# Initialize Spark using darwin-sdk
from darwin import init_spark_with_configs

spark_configs = {
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    "spark.sql.session.timeZone": "UTC",
    "spark.sql.shuffle.partitions": "10",
    "spark.default.parallelism": "10",
    "spark.driver.memory": "1g",
    "spark.executor.memory": "1g",
}

spark = init_spark_with_configs(spark_configs=spark_configs)
print(f"βœ“ Spark initialized (version: {spark.version})")

# Create and process DataFrame
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200),
    (3, "Charlie", 300),
], ["id", "name", "score"])

print("\nDataFrame Contents:")
df.show()

print(f"\nTotal records: {df.count()}")
print(f"Average score: {df.agg({'score': 'avg'}).collect()[0][0]}")

# Stop Spark cleanly
from darwin import stop_spark
stop_spark()

print("\nβœ“ Job completed successfully!")

Step 4: Submit the Job

Option A: Using submit_spark_job.sh Script

cd darwin-sdk/darwin
./submit_spark_job.sh \
  --cluster-name {cluster_id} \
  --namespace ray \
  --job-file /path/to/my_spark_job.py \
  --wait

Option B: Using Ray Jobs API

# Port-forward to Ray dashboard
kubectl port-forward -n ray svc/{cluster_id}-kuberay-head-svc 8265:8265 &

# Submit job
curl -X POST "http://localhost:8265/api/jobs/" \
  -H "Content-Type: application/json" \
  -d '{
    "entrypoint": "python my_spark_job.py",
    "runtime_env": {
      "working_dir": "./",
      "env_vars": {
        "CLUSTER_ID": "'{cluster_id}'",
        "ENV": "LOCAL"
      }
    },
    "metadata": {
      "name": "darwin-spark-demo"
    }
  }'

Option C: Using Ray Python Client

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://localhost:8265")

job_id = client.submit_job(
    entrypoint="python my_spark_job.py",
    runtime_env={
        "working_dir": "./",
        "env_vars": {
            "CLUSTER_ID": "{cluster_id}",
            "ENV": "LOCAL"
        }
    }
)

print(f"Submitted job: {job_id}")

# Wait for completion
client.wait_until_status(job_id, "SUCCEEDED")
print(client.get_job_logs(job_id))

Step 5: Monitor Job Execution

# Check job status
SUBMISSION_ID="raysubmit_xxxxxxxx"
curl "http://localhost:8265/api/jobs/${SUBMISSION_ID}"

# View job logs
curl "http://localhost:8265/api/jobs/${SUBMISSION_ID}/logs"

# Or use Ray Dashboard
open http://localhost:8265

Darwin SDK Spark Functions

Function Description
init_spark_with_configs(spark_configs) Initialize Spark with custom configurations
start_spark(spark_conf) Start Spark with default Glue catalog configs
get_raydp_spark_session() Get existing Spark session
stop_spark() Stop Spark session cleanly

Troubleshooting

Issue: "Runtime given is incorrect"

# Check available runtimes
curl http://localhost/compute/get-runtimes

Issue: Ray job stuck in PENDING

# Check Ray head pod
kubectl describe pod {cluster_id}-kuberay-head-xxx -n ray

Issue: Connection refused when submitting job

# Restart port-forward
pkill -f "port-forward.*8265"
kubectl port-forward -n ray svc/{cluster_id}-kuberay-head-svc 8265:8265 &

Issue: Cluster not starting due to long init script

If your init_script in the cluster configuration is too long, the cluster may fail to start. This happens because init scripts are executed during pod startup and have timeout limitations.

Solutions:

  • Use the Library Installation API to install packages instead of init scripts
  • Create a custom runtime with your dependencies pre-installed
  • Split long scripts into smaller, essential commands

πŸ§ͺ Creating Your First Project

This guide walks you through your first end-to-end experience on Darwin β€” from compute creation to deployment.

πŸ“– Looking for complete step-by-step guides? See our comprehensive examples:

πŸ”§ 1) Create Compute

Create a Ray cluster for your ML workload:

curl --location 'http://localhost/compute/cluster' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "cluster_name": "housing-project",
    "tags": ["tutorial", "housing-prices"],
    "runtime": "0.0",
    "inactive_time": 60,
    "head_node_config": {
        "cores": 4,
        "memory": 8
    },
    "worker_node_configs": [
        {
            "cores": 2,
            "memory": 4,
            "min_pods": 1,
            "max_pods": 2
        }
    ],
    "user": "user@example.com"
}'

Save the cluster_id from the response - you'll need it for the next steps.

πŸ“Š 2) Check Status

Check your cluster status:

curl http://localhost/compute/cluster/{cluster_id}/metadata

Wait until the status shows active.

πŸ““ 3) Open Jupyter Notebook

Once the cluster is ready, access the Jupyter notebook at:

http://localhost/kind-0/{cluster_id}-jupyter

Open this URL in your browser to start working in the workspace.

🏑 4) Copy and Run Example: housing-prices

In the Jupyter notebook, copy the example project: /examples/housing-prices/ in Jupyter notebook. The model will be logged automatically to MLflow.

🏷️ 5) Check Your Model in the Registry

Verify your trained model in the Darwin MLflow UI:

http://localhost/mlflow-app/experiments

Navigate to your experiment to see the registered model with metrics and parameters.

πŸš€ 6) Deploy with Darwin CLI

Deploy your trained model (replace <experiment_id> and <run_id> with values from MLflow UI):

πŸ“– Sample training script for house price prediction: examples/house-price-prediction/train_house_pricing_model.ipynb

# Activate Darwin CLI
source .venv/bin/activate

# 1. Configure Darwin CLI (one-time)
darwin config set --env darwin-local
darwin serve configure

# 2. Create environment
darwin serve environment create --name local --domain-suffix .local --cluster-name kind

# 3. Create serve
darwin serve create \
  --name housing-model \
  --type api \
  --space serve \
  --description "House Price Prediction model"

# 4. Deploy model
darwin serve deploy-model \
  --serve-name housing-model \
  --artifact-version v1 \
  --model-uri mlflow-artifacts:/1/<experiment_id>/<run_id>/artifacts/model \
  --cores 2 \
  --memory 4 \
  --node-capacity spot \
  --min-replicas 1 \
  --max-replicas 2

πŸ“– For complete Serve CLI documentation, see darwin-cli/README.md#serve-commands

🌐 7) Test Your Endpoint

Test your deployed model:

curl -X POST http://localhost/housing-model/predict \
  -H "Content-Type: application/json" \
  -d '{
    "features": {
      "MedInc": 3.5214,
      "HouseAge": 15.0,
      "AveRooms": 6.575757575757576,
      "AveBedrms": 1.0196969696969697,
      "Population": 1447.0,
      "AveOccup": 3.0144927536231883,
      "Latitude": 37.63,
      "Longitude": -122.43
    }
  }'

Once deployed, your model is accessible as a real-time inference API.


πŸ”¬ Additional ML Examples

For more comprehensive examples with different ML frameworks and use cases:

Example Framework Dataset Task Complete Guide
Iris Classification Spark + Sklearn Iris (150 samples, 4 features) Multi-class Classification πŸ“– View Guide
Wine Classification Spark + LightGBM Wine (178 samples, 13 features) Multi-class Classification πŸ“– View Guide
Diabetes Regression Spark + XGBoost Diabetes (442 samples, 10 features) Regression πŸ“– View Guide

What's included in each example:

  • βœ… Complete platform setup scripts (init-example.sh)
  • βœ… Training notebooks with Spark data processing
  • βœ… MLflow experiment tracking and model registration
  • βœ… Step-by-step deployment instructions
  • βœ… Sample inference requests with expected outputs
  • βœ… Troubleshooting guides

Quick reference for training scripts:

πŸ’‘ Tip: All Spark-based examples use the hybrid approach - Spark for distributed data processing and native frameworks (sklearn, LightGBM, XGBoost) for model training. This ensures compatibility with any MLflow server version and eliminates Spark dependencies at serving time.


πŸ§ͺ Example Workflows

End-to-End ML Workflow

sequenceDiagram
    participant DS as Data Scientist
    participant Compute as Darwin Compute
    participant MLflow as MLflow
    participant FS as Feature Store
    participant Serve as ML Serve

    DS->>Compute: Create Ray cluster
    Compute-->>DS: Cluster ID + Jupyter link
    DS->>FS: Fetch training features
    FS-->>DS: Feature dataset
    DS->>Compute: Train model (Ray/Spark)
    DS->>MLflow: Log experiment + model
    DS->>Serve: Deploy model (Darwin CLI)
    Serve->>MLflow: Fetch model artifact
    Serve->>FS: Configure feature retrieval
    Serve-->>DS: Inference endpoint
    DS->>Compute: Stop cluster
Loading

Feature Engineering Pipeline

graph LR
    A[Raw Data<br/>S3/Delta Lake] --> B[Spark Pipeline<br/>Feature Transform]
    B --> C[Kafka Topic<br/>feature-updates]
    C --> D[Feature Store<br/>Consumer]
    D --> E[Cassandra<br/>Materialized Features]
    E --> F[Online Serving<br/>< 10ms latency]
    B --> G[Catalog<br/>Lineage Tracking]
Loading

πŸ“Š Technology Stack

Layer Technologies
Languages Python 3.9.7, Java 11, Go 1.18
Compute Ray 2.37.0, Apache Spark 3.5.1
Web Frameworks FastAPI, Spring Boot, Vert.x
Orchestration Kubernetes (EKS/Kind), Helm 3, KubeRay Operator v1.1.0
Databases MySQL 8.0, Cassandra 5.0, OpenSearch 2.11
Streaming Apache Kafka 7.4.0
Storage S3 (AWS/LocalStack), FSx, EFS
Experiment Tracking MLflow 2.12.2
Monitoring Prometheus, Grafana, Ray Dashboard
Container Registry ECR, GCR, Local Docker Registry

πŸ”§ Configuration

Darwin uses a declarative configuration approach:

Service Selection (init.sh)

Interactive wizard to select platform components:

Default Mode - Simplified preset selection:

./init.sh
# Prompts for:
# - Training preset (Compute + MLFlow)
# - Inference preset (Serve + MLFlow)

Dev Mode - Granular service-by-service selection:

./init.sh --dev-mode
# Prompts for enabling individual:
# - Applications (Compute, Feature Store, MLflow, etc.)
# - Datastores (MySQL, Cassandra, Kafka, etc.)
# - Ray images and Serve runtimes

All Mode - Enable everything:

./init.sh --all
# Enables all services without prompts

Generates .setup/enabled-services.yaml with user selections.

Environment Variables

Key configuration via .setup/config.env (auto-generated):

KUBECONFIG=./.setup/kindkubeconfig.yaml
DOCKER_REGISTRY=127.0.0.1:32768

Helm Values

Customize deployments via helm/darwin/values.yaml:

global:
  namespace: darwin
  
services:
  compute:
    enabled: true
    replicas: 2
  
datastores:
  mysql:
    enabled: true
  cassandra:
    enabled: true

🏒 Deployment Patterns

Local Development (Kind)

  • Single-node Kubernetes cluster
  • Local Docker registry
  • HostPath-based persistent storage
  • Nginx Ingress at localhost/*

Production (EKS)

  • Multi-AZ high availability
  • Mixed spot/on-demand node groups
  • Auto-scaling with Karpenter
  • Network policies and security groups
  • S3-backed artifact storage
  • RDS for MySQL (optional)
  • Multi-tenant namespace isolation

πŸ“ˆ Observability

Metrics

  • Prometheus: Cluster resource utilization, service metrics
  • Grafana: Pre-configured dashboards for compute, serving, features
  • Ray Dashboard: Job execution, task profiling, resource usage

Logging

  • Centralized logging via stdout/stderr
  • Application logs in /app/logs
  • Structured logging with context

Events & Lineage

  • Chronos: Event-driven tracking of all platform operations
  • Catalog: Data lineage via OpenLineage
  • Elasticsearch-based search and analytics

Alerts

  • Slack integration for cost alerts
  • Long-running cluster notifications
  • Failed deployment alerts

πŸ“š SDKs & APIs

Available SDKs

  • darwin-compute: Ray cluster management
  • darwin_fs: Feature Store client
  • darwin_mlflow: MLflow wrapper with auth
  • darwin-workspace (internal): Workspace orchestration
  • darwin_workflow: Workflow orchestration and pipeline management

REST APIs

All services expose FastAPI/Spring Boot REST APIs:

  • Feature Store: /feature-store/*, /feature-store-admin/*
  • Darwin Compute: /cluster/*, /jupyter/*
  • ML Serve: /api/v1/serve/*, /api/v1/artifact/*
  • Chronos: /api/v1/event/*, /api/v1/sources/*
  • Catalog: /v1/assets/*, /v1/lineage/*
  • Workflow: /api/v3/workflow/*, /api/v3/workflow-run/*

API documentation available at <service-url>/docs (Swagger UI).


🧩 Extensibility

Available Runtimes

Darwin provides pre-built Ray runtimes for cluster creation. The Runtime Name is what you pass to the API when creating clusters (e.g., "runtime": "0.1").

Runtime Name Image Ray Version Python Class Type
0.0 ray:2.37.0 2.37.0 3.10 CPU Ray Only
0.1 ray:2.53.0 2.53.0 3.10 CPU Ray Only

Custom Runtimes

Add new Ray runtimes by creating Dockerfiles in darwin-compute/runtimes/:

# darwin-compute/runtimes/cpu/Ray2.37_Py3.11_CustomLibs/Dockerfile
FROM rayproject/ray:2.37.0-py311
RUN pip install jupyterlab==4.3.0
RUN pip install custom-library

Register in services.yaml:

ray-images:
  - image-name: ray:2.37.0-py311-custom
    dockerfile-path: darwin-compute/runtimes/cpu/Ray2.37_Py3.11_CustomLibs

Custom Transformers (Chronos)

Create Python transformers for event processing:

# Chronos transformer
def transform(event):
    return {
        "event_type": "cluster_created",
        "entities": [{"type": "cluster", "id": event["cluster_id"]}],
        "relationships": [{"from": user, "to": cluster, "type": "owns"}]
    }

🀝 Contributing

See CONTRIBUTING.md for development setup, coding standards, and contribution guidelines.


πŸ“„ License

[License information to be added]


πŸ“ž Support

For issues, questions, or feature requests, please open an issue in the repository or contact the platform team.


πŸ—ΊοΈ Project Structure

darwin-distro/
β”œβ”€β”€ darwin-compute/          # Ray cluster management service
β”œβ”€β”€ darwin-cluster-manager/  # Kubernetes orchestration (Go)
β”œβ”€β”€ darwin-workflow/         # ML pipeline orchestration (Airflow integration)
β”œβ”€β”€ feature-store/           # Feature Store (Java)
β”œβ”€β”€ mlflow/                  # MLflow experiment tracking
β”œβ”€β”€ ml-serve-app/            # Model serving platform
β”œβ”€β”€ artifact-builder/        # Docker image builder
β”œβ”€β”€ chronos/                 # Event processing & metadata
β”œβ”€β”€ workspace/               # Project & codespace management
β”œβ”€β”€ darwin-catalog/          # Data catalog & lineage
β”œβ”€β”€ darwin-sdk/              # Platform SDK with Spark integration
β”œβ”€β”€ hermes-cli/              # Serve CLI backend (used by darwin-cli)
β”œβ”€β”€ darwin-cli/              # Unified CLI for all Darwin services
β”œβ”€β”€ helm/                    # Helm charts for deployment
β”‚   └── darwin/              # Umbrella chart
β”‚       β”œβ”€β”€ charts/
β”‚       β”‚   β”œβ”€β”€ datastores/  # MySQL, Cassandra, Kafka, Airflow, etc.
β”‚       β”‚   └── services/    # Application services
β”œβ”€β”€ deployer/                # Build scripts and base images
β”œβ”€β”€ kind/                    # Local Kubernetes setup
β”œβ”€β”€ examples/                # Example notebooks and configs
β”œβ”€β”€ init.sh                  # Interactive configuration wizard
β”œβ”€β”€ setup.sh                 # Build and cluster setup
β”œβ”€β”€ start.sh                 # Deploy platform
└── services.yaml            # Service registry

Darwin ML Platform β€” Unified, scalable, production-ready machine learning infrastructure.

About

ML Platform at Dream11

Topics

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors