Traditional schedulers react to load. ACO predicts it.
Most job schedulers are reactive: they see a spike after it starts. By then, SLAs are already at risk.
ACO combines ant colony optimization, LSTM-based spike prediction, and workload-intent routing to place jobs on the right node before contention hits.
| Metric | Result |
|---|---|
| Scheduling latency (P99) | < 10ms (5-node, 20-job cluster) |
| Resource utilisation improvement | +28% vs first-fit baseline |
| SLA adherence under burst | 95%+ |
| Spike recall (LSTM predictor) | > 80% |
| Test coverage | 202 tests passing |
| External dependencies at runtime | Zero — fully in-memory |
Standard schedulers treat workloads as opaque units and nodes as interchangeable bins.
They don't know:
- A GPU inference job cannot tolerate spot interruption
- A batch actor on a node about to spike will miss its deadline
- Placing a replay buffer and a latency-critical job on the same node will destroy P99
ACO treats workloads as profiles with intent and nodes as typed resources with predictable futures.
Three overlapping signals, combined in < 10ms:
| Signal | Source | Role |
|---|---|---|
| Pheromone (learned history) | ACO colony convergence | Which placements have worked before |
| Heuristic (cost + headroom + spike risk) | CostEngine | How good is this node right now |
| Intent (workload type + constraints) | WorkloadIntentRouter | Which nodes are even eligible |
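On the ant side, the first two signals combine through the standard ACO rule, τ^α × η^β, evaluated only over intent-eligible nodes. A minimal sketch of that combination (function and variable names are illustrative, not the repo's API):

```python
import numpy as np

def placement_probabilities(tau_row, eta, eligible, alpha=1.0, beta=2.0):
    """Combine pheromone (tau), heuristic (eta), and the intent-derived
    eligibility mask into per-node selection probabilities.

    tau_row  -- pheromone values for this job across nodes, shape (n_nodes,)
    eta      -- CostEngine scores for this job across nodes, shape (n_nodes,)
    eligible -- boolean mask from the intent router, shape (n_nodes,)
    """
    weights = (tau_row ** alpha) * (eta ** beta)
    weights = np.where(eligible, weights, 0.0)   # ineligible nodes get zero weight
    total = weights.sum()
    if total == 0.0:                             # no eligible node: caller must handle
        return None
    return weights / total

# 5 nodes: one ruled out by intent, one with a zero cost-engine score
tau  = np.array([0.5, 0.5, 0.9, 0.2, 0.5])
eta  = np.array([0.8, 0.6, 0.7, 0.0, 0.9])
mask = np.array([True, True, True, True, False])
print(placement_probabilities(tau, eta, mask))
```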
```
POST /jobs
  ↓
OrchestratorService
  ├── Admit job (resource sanity checks)
  ├── WorkloadIntentRouter → SchedulingStrategy
  │      GPU Inference?  → ON_DEMAND only, fast path
  │      Batch + GPU?    → SPOT allowed, full colony
  │      LC + deadline?  → tighten SLA threshold, fast path
  │      Preemptible?    → SPOT preferred, forgiving SLA
  ↓
aco_schedule()
  ├── Filter nodes: capacity + arch + colocation policy
  ├── CostEngine.score_node() × n_nodes → η array
  │      reliability_factor × cost_efficiency × sla_headroom × prediction_factor
  ├── Fast path (latency-critical): deterministic argmax(η) → < 1ms
  └── Full colony (20 ants × 5 iterations): probabilistic → ≤ 8ms
  ↓
Allocate → NodeAgent.execute_job() (async, non-blocking)
Return 202: { node_id, scheduling_latency_ms }
```
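The routing step above maps a job's workload type (plus deadline and preemption hints) to a scheduling strategy before any node scoring happens. The router ships six strategies; the sketch below covers only the four called out in the flow plus a default, with hypothetical enum and field names:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Strategy(Enum):
    GPU_INFERENCE = auto()     # ON_DEMAND only, fast path
    BATCH_GPU = auto()         # SPOT allowed, full colony
    LATENCY_CRITICAL = auto()  # tightened SLA threshold, fast path
    PREEMPTIBLE = auto()       # SPOT preferred, forgiving SLA
    DEFAULT = auto()

@dataclass
class JobSpec:
    workload_type: str
    needs_gpu: bool = False
    deadline_ms: Optional[float] = None
    preemptible: bool = False

def classify(job: JobSpec) -> Strategy:
    # Order matters: the most constrained intents win.
    if job.workload_type == "gpu-inference":
        return Strategy.GPU_INFERENCE
    if job.workload_type == "batch" and job.needs_gpu:
        return Strategy.BATCH_GPU
    if job.workload_type == "latency-critical" and job.deadline_ms is not None:
        return Strategy.LATENCY_CRITICAL
    if job.preemptible:
        return Strategy.PREEMPTIBLE
    return Strategy.DEFAULT
```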
```
┌──────────────────────────────────────────────────────────┐
│                      FastAPI :8000                        │
│   POST /jobs    GET /metrics    GET /predict/:id          │
│   POST /simulation/start    POST /upload-trace            │
└──────────────────────────┬───────────────────────────────┘
                           │ submit_job()
                           ▼
┌──────────────────────────────────────────────────────────┐
│                    OrchestratorService                    │
│ ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│ │ Admission   │→ │ IntentRouter │→ │  aco_schedule()  │  │
│ │ Controller  │  │ classify job │  │ ACO or fast path │  │
│ └─────────────┘  └──────────────┘  └──────────────────┘  │
└──────────────────────────┬───────────────────────────────┘
                           │
            ┌──────────────┼──────────────┐
            ▼              ▼              ▼
     ┌─────────────┐ ┌───────────┐ ┌─────────────┐
     │ ACO Colony  │ │ CostEngine│ │    LSTM     │
     │   20 Ants   │ │ η per pair│ │  Predictor  │
     │   5 Iters   │ │           │ │  per node   │
     └─────────────┘ └───────────┘ └─────────────┘
                           │
                           ▼  (background, every 5s)
┌──────────────────────────────────────────────────────────┐
│                    TelemetryCollector                     │
│  tick() → NodeTelemetry → update state                    │
│  every 10 ticks: refit LSTM per node                      │
│  Source: synthetic Gaussian OR Alibaba 2018 OR Borg 2019  │
└──────────────────────────────────────────────────────────┘
                           │
                           ▼  (per node, background heartbeat)
┌──────────────────────────────────────────────────────────┐
│                      NodeAgent (×5)                       │
│  execute_job() → asyncio.sleep → complete_job()           │
│  _heartbeat_loop() → telemetry push every 5s              │
└──────────────────────────────────────────────────────────┘
```
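The data plane at the bottom of this diagram is deliberately thin in V2: job execution is simulated with `asyncio.sleep` and telemetry is pushed on a fixed heartbeat. A stripped-down sketch of that pattern (class and method names are illustrative, not the repo's `NodeAgent`):

```python
import asyncio
from typing import Awaitable, Callable

class NodeAgentSketch:
    """Illustrative only: async job execution plus a periodic heartbeat."""

    def __init__(self, node_id: str, heartbeat_interval: float = 5.0):
        self.node_id = node_id
        self.heartbeat_interval = heartbeat_interval
        self.running_jobs: set[str] = set()

    async def execute_job(self, job_id: str, duration_s: float) -> None:
        # Non-blocking: the orchestrator has already returned 202 by the time this runs.
        self.running_jobs.add(job_id)
        await asyncio.sleep(duration_s)      # stands in for real work in V2
        self.running_jobs.discard(job_id)    # complete_job() in the real agent

    async def heartbeat_loop(self, push: Callable[[dict], Awaitable[None]]) -> None:
        # Push a small telemetry payload on a fixed interval.
        while True:
            await push({"node_id": self.node_id, "running_jobs": len(self.running_jobs)})
            await asyncio.sleep(self.heartbeat_interval)
```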
| Node | Arch | Instance | CPU | Mem | GPU | Cost/hr |
|---|---|---|---|---|---|---|
| `node-cpu-01` | x86_64 | ON_DEMAND | 32c | 128GB | — | $0.48 |
| `node-arm-02` | ARM64 | SPOT | 16c | 64GB | — | $0.12 |
| `node-api-03` | x86_64 | ON_DEMAND | 8c | 32GB | — | $0.18 |
| `node-gpu-04` | GPU_NODE | ON_DEMAND | 16c | 128GB | A100 | $3.20 |
| `node-gpu-05` | GPU_NODE | SPOT | 16c | 128GB | A100 | $1.10 |
The intent router naturally directs GPU inference to node-gpu-04 (ON_DEMAND, never spot), batch actors to node-arm-02 (cheap ARM64), and latency-critical API traffic to node-api-03 — without explicit configuration.
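That behaviour falls out of the eligibility filter crossing each strategy against instance type, architecture, and GPU presence. A hedged sketch of such a filter over the catalogue above (field names are illustrative):

```python
def eligible_nodes(strategy: str, nodes: list[dict]) -> list[dict]:
    """Illustrative intent filter: which nodes a strategy may even consider."""
    out = []
    for n in nodes:
        if strategy == "gpu-inference" and not (n["gpu"] and n["instance"] == "ON_DEMAND"):
            continue                                   # inference never tolerates spot
        if strategy == "latency-critical" and n["instance"] == "SPOT":
            continue                                   # protect P99
        if strategy == "batch-gpu" and not n["gpu"]:
            continue                                   # SPOT GPUs are fine for batch
        out.append(n)
    return out

nodes = [
    {"id": "node-cpu-01", "instance": "ON_DEMAND", "gpu": False},
    {"id": "node-arm-02", "instance": "SPOT",      "gpu": False},
    {"id": "node-gpu-04", "instance": "ON_DEMAND", "gpu": True},
    {"id": "node-gpu-05", "instance": "SPOT",      "gpu": True},
]
print([n["id"] for n in eligible_nodes("gpu-inference", nodes)])  # ['node-gpu-04']
```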
ACO core:
- `PheromoneMatrix`: 2D NumPy `τ[n_jobs][n_nodes]`, evaporation rate 10%/iter, floor 0.01 (prevents stagnation)
- `Ant`: roulette-wheel selection on `τ^α × η^β` with `α=1.0, β=2.0`
- `Colony`: 20 ants × 5 iterations, early stop after 3 stagnant iterations
- Fast path: a single latency-critical job → deterministic `argmax(η)` in < 1ms (no variance acceptable for P99)
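Evaporation with a floor is what keeps the matrix from locking onto early winners; the roulette-wheel draw then reuses the τ^α × η^β probabilities sketched earlier. A minimal sketch of the pheromone update (names and shapes are assumptions mirroring the bullets above):

```python
import numpy as np

class PheromoneMatrixSketch:
    """Illustrative 2D pheromone store: tau[job_idx][node_idx]."""

    def __init__(self, n_jobs: int, n_nodes: int, rho: float = 0.10, floor: float = 0.01):
        self.tau = np.ones((n_jobs, n_nodes))
        self.rho = rho          # 10% evaporation per iteration
        self.floor = floor      # lower bound that prevents stagnation

    def evaporate(self) -> None:
        # Decay every cell, but never below the floor.
        self.tau = np.maximum(self.tau * (1.0 - self.rho), self.floor)

    def deposit(self, job_idx: int, node_idx: int, quality: float) -> None:
        # Better placements (higher cost-engine quality) reinforce their cell more.
        self.tau[job_idx, node_idx] += quality
```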
LSTM spike predictor:
- Architecture: `(1, 10, 1)` input → LSTM(hidden=32) → Linear(32→1)
- Per-node model, refits every 10 telemetry ticks (~50ms refit time)
- Cold-start handled: uses CPU average with `confidence=0.1` until ≥10 samples
- Confidence grows linearly: 10 samples → 0.5, 500 samples → 1.0
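A sketch of that shape, assuming a PyTorch-style module and reading the confidence ramp literally from the bullets above (the training loop and the repo's actual class names are omitted):

```python
import torch
import torch.nn as nn

class SpikePredictorSketch(nn.Module):
    """Illustrative per-node predictor: (batch=1, seq=10, features=1) input,
    LSTM(hidden=32), Linear(32 -> 1) producing the next-interval CPU estimate."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)            # (1, 10, hidden)
        return self.head(out[:, -1, :])  # use the last time step only

def confidence(n_samples: int) -> float:
    """Cold-start ramp: 0.1 below 10 samples, then linear from 0.5 to 1.0 at 500."""
    if n_samples < 10:
        return 0.1
    return min(1.0, 0.5 + 0.5 * (n_samples - 10) / (500 - 10))
```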
```
score(job, node) = reliability_factor
                 × cost_efficiency_factor
                 × sla_headroom_factor
                 × prediction_factor   ← penalises nodes about to spike
```
Each factor lies in (0.0, 1.0]. A composite score of 0.0 means never place here; the hard gate enforces it: latency-critical jobs never land on high-risk SPOT nodes.
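Because the score is a product, any factor near zero effectively vetoes a node, and the hard gate makes that veto explicit. A sketch under those assumptions (the sub-score formulas here are placeholders, not the repo's CostEngine):

```python
def score_node(job: dict, node: dict, spike_risk: float) -> float:
    """Illustrative composite score; 0.0 means 'never place here'."""
    # Hard gate: latency-critical work never lands on a high-risk SPOT node.
    if job["latency_critical"] and node["instance"] == "SPOT" and spike_risk > 0.5:
        return 0.0

    reliability     = 1.0 if node["instance"] == "ON_DEMAND" else 0.6      # placeholder values
    cost_efficiency = min(1.0, job["budget_per_hr"] / node["cost_per_hr"]) # cheaper is better
    sla_headroom    = max(0.05, 1.0 - node["cpu_util"])                    # free-capacity proxy
    prediction      = max(0.05, 1.0 - spike_risk)                          # penalise imminent spikes

    return reliability * cost_efficiency * sla_headroom * prediction
```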
Real cluster data instead of synthetic noise:
- Alibaba 2018: 8-day trace, 300s intervals, CPU range 16–79%, diurnal cycles, burst events
- Borg 2019: Google cluster trace, auto-detected by column names, hot-swap at runtime via `POST /upload-trace`
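Replaying a trace amounts to streaming per-machine CPU samples into the telemetry loop in timestamp order. A rough sketch of that adapter step, assuming a CSV with machine-id, timestamp, and CPU-utilisation columns (actual column names differ between traces, which is why the real adapter auto-detects them):

```python
import pandas as pd

def iter_trace(path: str, time_col: str, machine_col: str, cpu_col: str):
    """Illustrative replay: yield (machine_id, cpu_percent) samples in time order.
    Column names are passed in because Alibaba and Borg exports differ."""
    df = pd.read_csv(path, usecols=[time_col, machine_col, cpu_col])
    for _, row in df.sort_values(time_col).iterrows():
        yield row[machine_col], row[cpu_col]
```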
```bash
pip install -r requirements.txt
python run.py
# Open: http://localhost:8000
```

```bash
curl -L "https://zenodo.org/records/14564935/files/machine_usage_days_1_to_8_grouped_300_seconds.csv" \
  -o alibaba_trace.csv
curl -X POST http://localhost:8000/upload-trace \
  -F "file=@alibaba_trace.csv"
```

```bash
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"workload_type": "latency-critical", "cpu_cores_min": 2.0, "memory_gb_min": 4.0, "priority": 90}'
# Response: {"status": "SCHEDULED", "node_id": "node-api-03", "scheduling_latency_ms": 4.2}
```

```bash
python -m pytest tests/ -v
# 202 tests, 0 failures
```

| Test file | Phase | Tests | Focus |
|---|---|---|---|
| `test_aco_phase2.py` | ACO Core | 43 | PheromoneMatrix, Ant η, Colony convergence, performance |
| `test_predictor.py` | LSTM | 20 | Training, cold-start, spike detection, confidence |
| `test_cost_engine.py` | CostEngine | 30 | All 4 sub-scores, edge cases, threshold overrides |
| `test_aco_phase5.py` | Orchestration | 32 | aco_schedule integration, OrchestratorService lifecycle |
| `test_intent_router.py` | Intent Router | 24 | Strategy classification, deadline override, colocation |
| `test_aco_phase7.py` | Telemetry | 20 | TelemetryCollector, per-node profiles, prediction cache |
| `test_trace_adapter.py` | Trace Replay | 13 | Adapter loading, trace replay, collector integration |
| `test_data_plane.py` | NodeAgent | 20 | execute_job, heartbeat, end-to-end async |
V2 is intentionally in-memory to validate scheduling algorithms without infrastructure complexity.
| V2 (current) | V3 (production target) |
|---|---|
| In-memory `node_state` dict | Redis / etcd cluster state |
| In-process `NodeAgent` | Real agents over HTTP via `httpx` |
| Synthetic / trace telemetry | Real node_exporter / Prometheus |
| `asyncio.sleep()` simulation | Actual K8s pod execution |
| Single-process FastAPI | FastAPI + load balancer |
| No persistence | Postgres job history + S3 pheromone snapshots |
```
aco_core/
├── pheromone.py                  PheromoneMatrix (NumPy, evaporation, deposit)
├── ant.py                        Single ant: η computation + roulette-wheel selection
└── colony.py                     Colony: 20 ants × 5 iters, fast path, fallback
orchestrator/
├── control_plane/
│   ├── scheduler.py              aco_schedule() — main entry point
│   ├── orchestration_service.py  OrchestratorService — state machine
│   ├── intent_router.py          WorkloadIntentRouter — 6 strategies
│   ├── cost_engine.py            CostEngine — 4-factor composite score
│   └── predictor.py              LSTM predictor — per-node, refits every 10 ticks
├── data_plane/
│   └── agent.py                  NodeAgent — async job execution + heartbeat
└── telemetry/
    ├── collector.py              TelemetryCollector — drives prediction loop
    └── trace_adapter.py          Alibaba 2018 + Borg 2019 trace replay
api/
└── main.py                       FastAPI app, dashboard, simulation loop, lifespan
tests/                            202 tests (pytest-asyncio)
```
Aravind Sundaresan — Infrastructure & Distributed Systems Engineer
Microsoft (distributed validation platform, 17K+ microservices) · Ex-Amazon (Alexa device infrastructure)