Distributed Experiment Runner (xrun)


A Python CLI for launching, tracking, and aggregating ML experiments across the GPUs of a single local machine. YAML-configured sweeps, SQLite-backed metric and artifact tracking, automatic GPU pinning, and optional Docker containerization for reproducible environments.

Cuts a typical experiment setup (spin up, pin GPUs, seed, start TensorBoard, copy configs, track results) from 20+ minutes of shell wrangling to ~30 seconds.

What it does

  • YAML configs — single-run or full grid sweeps
  • Local GPU scheduler with automatic GPU pinning via CUDA_VISIBLE_DEVICES
  • SQLite tracking DB for runs, metrics, and artifacts (no server required)
  • Docker integration for reproducible environments (optional)
  • Sweep aggregation — summarize metrics across runs with xrun summarize
  • Failure isolation — a crashing run doesn't kill the sweep

CLI usage

# Install
pip install -r requirements.txt
pip install -e .   # makes `xrun` available; or use `python -m src.cli`

# Detect GPUs
xrun gpus
# => Detected GPUs: [0, 1, 2, 3]

# Launch an experiment or sweep
xrun run examples/sample_experiment/experiment.yaml

# List recent runs
xrun list --name mnist_mlp --limit 10

# Inspect a single run
xrun show mnist_mlp_abc12345

# Aggregate across all runs with a name prefix
xrun summarize --name mnist_mlp

Example sweep config

# examples/sample_experiment/experiment.yaml
base:
  name: mnist_mlp
  command: "python train.py --cfg {cfg} --out {artifacts_dir}"
  resources:
    gpus: 1
  artifacts_dir: "runs/{name}/{run_id}"
  params:
    epochs: 10
    batch_size: 64

sweep:
  lr: [1.0e-3, 5.0e-3, 1.0e-2]
  hidden: [64, 128, 256]
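
A sweep block expands to the Cartesian product of its lists, with each combination merged on top of base.params. A minimal sketch of that expansion (src/config.py's exact behavior, naming, and validation may differ):

# Illustrative sketch of sweep expansion: the Cartesian product of the sweep
# lists, merged over base.params. Not the actual src/config.py code.
from itertools import product

base_params = {"epochs": 10, "batch_size": 64}
sweep = {"lr": [1e-3, 5e-3, 1e-2], "hidden": [64, 128, 256]}

run_params = [
    {**base_params, **dict(zip(sweep, combo))}
    for combo in product(*sweep.values())
]
print(len(run_params))   # 9
print(run_params[0])     # {'epochs': 10, 'batch_size': 64, 'lr': 0.001, 'hidden': 64}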

Launch the 9-run grid (3 × 3):

xrun run examples/sample_experiment/experiment.yaml --max-concurrent 2

The scheduler keeps at most 2 experiments running at a time, pinning each to its own GPU via CUDA_VISIBLE_DEVICES.
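
Under the hood, pinning amounts to restricting each subprocess's view of the GPUs. A minimal sketch of that mechanism (the actual src/scheduler.py interface isn't documented here and may differ):

# Illustrative sketch of GPU pinning: each job gets one GPU index from a pool
# and an environment where CUDA_VISIBLE_DEVICES exposes only that device.
# Not the actual src/scheduler.py code.
import os
import subprocess

def launch_pinned(command: str, gpu_id: int) -> subprocess.Popen:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # the job sees this GPU as device 0
    return subprocess.Popen(command, shell=True, env=env)

# With --max-concurrent 2, at most two such processes run at once; a GPU index
# returns to the pool when its process exits.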

Docker integration (optional)

Add a docker: block to the config to run each experiment inside a container:

base:
  name: resnet_ablation
  command: "python train.py --cfg {cfg}"
  docker:
    image: "pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime"
    gpus: 1
    mounts:
      - "/data/imagenet:/data/imagenet:ro"
    env:
      PYTHONUNBUFFERED: "1"
  resources:
    gpus: 1

Every run now starts a fresh container, mounts the workdir at /workspace, and passes GPU access via --gpus device=N.
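
For intuition, the kind of docker run invocation assembled from the config above might look like the following sketch (the real builder lives in src/docker_env.py; its interface and exact flags aren't shown here):

# Illustrative sketch of building a `docker run` command from a docker: block.
# Not the actual src/docker_env.py code; the host paths and inner command below
# are hypothetical.
def build_docker_cmd(image, gpu_id, mounts, env, workdir, inner_cmd):
    parts = [
        "docker", "run", "--rm",
        "--gpus", f"device={gpu_id}",            # mirrors --gpus device=N above
        "-v", f"{workdir}:/workspace", "-w", "/workspace",
    ]
    for mount in mounts:
        parts += ["-v", mount]
    for key, value in env.items():
        parts += ["-e", f"{key}={value}"]
    return parts + [image, "bash", "-lc", inner_cmd]

cmd = build_docker_cmd(
    image="pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime",
    gpu_id=1,
    mounts=["/data/imagenet:/data/imagenet:ro"],
    env={"PYTHONUNBUFFERED": "1"},
    workdir="/path/to/project",                  # hypothetical host path
    inner_cmd="python train.py --cfg cfg.yaml",
)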

Tracking schema

The SQLite database (default experiments.db) stores:

runs       -- run_id, name, config_json, status, started_at, ended_at, exit_code, gpu_ids, ...
metrics    -- run_id, key, value, step, wall_time
artifacts  -- run_id, name, path, size_bytes, registered_at

Your train.py can log metrics and register artifacts from inside the container:

from src.tracking import ExperimentStore
store = ExperimentStore("/workspace/experiments.db")
store.log_metric(run_id, "val_auc", 0.812, step=epoch)
store.register_artifact(run_id, "checkpoint", "/workspace/runs/ckpt.pt")
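
The same database can also be queried directly. A sketch of the kind of aggregation xrun summarize performs, using only the schema above (not the actual CLI code path):

# Illustrative query against the schema above; `xrun summarize` may aggregate
# differently. 'val_auc' is just the example metric key logged earlier.
import sqlite3

con = sqlite3.connect("experiments.db")
rows = con.execute(
    """
    SELECT r.run_id, MAX(m.value) AS best_val_auc
    FROM runs r JOIN metrics m ON m.run_id = r.run_id
    WHERE r.name LIKE 'mnist_mlp%' AND m.key = 'val_auc'
    GROUP BY r.run_id
    ORDER BY best_val_auc DESC
    """
).fetchall()
for run_id, best in rows:
    print(f"{run_id}  best val_auc = {best:.3f}")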

Layout

src/
  cli.py          Click-based CLI (`xrun`)
  config.py       YAML config + sweep expansion (pydantic)
  scheduler.py    Local GPU pool + job scheduler (threaded)
  docker_env.py   `docker run` command builder
  tracking.py     SQLite experiment store
tests/
  test_config.py    Sweep expansion tests
  test_tracking.py  Metrics + summarize tests
examples/sample_experiment/
  experiment.yaml     3 × 3 grid demo
  train.py            Minimal training script (no GPU required)

Why not use X?

  • Why not MLflow / W&B? Those are great but require a server, account, or network. xrun's backend is a single SQLite file, and everything runs locally.
  • Why not Ray Tune / Optuna? Different purpose — Tune is for optimizing hyperparameters with adaptive schedulers. xrun is for running a manual grid or ad-hoc experiment and tracking the results.
  • Why not SLURM? SLURM is great for large shared clusters. xrun targets the local-GPU-box or small team case where SLURM is overkill.

Honest trade-offs

  • Single-host only. Scale-out would require an RPC layer over scheduler.py.
  • Scheduler uses Python threads (fine for launching subprocesses), not async.
  • experiments.db is SQLite — fine for thousands of runs, not millions.
  • GPU detection relies on nvidia-smi; CPU-only machines fall back to running max_concurrent jobs in parallel without GPU pinning (see the sketch below).
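
A minimal sketch of nvidia-smi-based detection with a CPU-only fallback (the real logic in src/scheduler.py isn't reproduced here and may differ):

# Illustrative sketch of GPU detection with a CPU-only fallback; not the actual
# src/scheduler.py code.
import subprocess

def detect_gpus() -> list[int]:
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [int(line) for line in out.splitlines() if line.strip()]
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []  # no GPUs detected: jobs run without CUDA_VISIBLE_DEVICES pinning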
