Distributed Experiment Runner (xrun)


A Python CLI for launching, tracking, and aggregating ML experiments across the GPUs of a single local machine. YAML-configured sweeps, SQLite-backed metric and artifact tracking, automatic GPU pinning, and optional Docker containerization for reproducible environments.

Cuts a typical experiment setup (spin up, pin GPUs, seed, start TensorBoard, copy configs, track results) from 20+ minutes of shell wrangling to ~30 seconds.

What it does

  • YAML configs — single-run or full grid sweeps
  • Local GPU scheduler with automatic GPU pinning via CUDA_VISIBLE_DEVICES
  • SQLite tracking DB for runs, metrics, and artifacts (no server required)
  • Docker integration for reproducible environments (optional)
  • Sweep aggregation — summarize metrics across runs with xrun summarize
  • Failure isolation — a crashing run doesn't kill the sweep

CLI usage

# Install
pip install -r requirements.txt
pip install -e .   # makes `xrun` available; or use `python -m src.cli`

# Detect GPUs
xrun gpus
# => Detected GPUs: [0, 1, 2, 3]

# Launch an experiment or sweep
xrun run examples/sample_experiment/experiment.yaml

# List recent runs
xrun list --name mnist_mlp --limit 10

# Inspect a single run
xrun show mnist_mlp_abc12345

# Aggregate across all runs with a name prefix
xrun summarize --name mnist_mlp

Example sweep config

# examples/sample_experiment/experiment.yaml
base:
  name: mnist_mlp
  command: "python train.py --cfg {cfg} --out {artifacts_dir}"
  resources:
    gpus: 1
  artifacts_dir: "runs/{name}/{run_id}"
  params:
    epochs: 10
    batch_size: 64

sweep:
  lr: [1.0e-3, 5.0e-3, 1.0e-2]
  hidden: [64, 128, 256]
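
A sweep block expands to the Cartesian product of its lists, with each combination merged on top of base.params. A minimal sketch of that expansion (src/config.py's exact behavior, naming, and validation may differ):

# Illustrative sketch of sweep expansion: the Cartesian product of the sweep
# lists, merged over base.params. Not the actual src/config.py code.
from itertools import product

base_params = {"epochs": 10, "batch_size": 64}
sweep = {"lr": [1e-3, 5e-3, 1e-2], "hidden": [64, 128, 256]}

run_params = [
    {**base_params, **dict(zip(sweep, combo))}
    for combo in product(*sweep.values())
]
print(len(run_params))   # 9
print(run_params[0])     # {'epochs': 10, 'batch_size': 64, 'lr': 0.001, 'hidden': 64}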

Launch the 9-run grid (3 × 3):

xrun run examples/sample_experiment/experiment.yaml --max-concurrent 2

The scheduler keeps at most 2 experiments running at a time, pinning each to its own GPU via CUDA_VISIBLE_DEVICES.
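
Under the hood, pinning amounts to restricting each subprocess's view of the GPUs. A minimal sketch of that mechanism (the actual src/scheduler.py interface isn't documented here and may differ):

# Illustrative sketch of GPU pinning: each job gets one GPU index from a pool
# and an environment where CUDA_VISIBLE_DEVICES exposes only that device.
# Not the actual src/scheduler.py code.
import os
import subprocess

def launch_pinned(command: str, gpu_id: int) -> subprocess.Popen:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # the job sees this GPU as device 0
    return subprocess.Popen(command, shell=True, env=env)

# With --max-concurrent 2, at most two such processes run at once; a GPU index
# returns to the pool when its process exits.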

Docker integration (optional)

Add a docker: block to the config to run each experiment inside a container:

base:
  name: resnet_ablation
  command: "python train.py --cfg {cfg}"
  docker:
    image: "pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime"
    gpus: 1
    mounts:
      - "/data/imagenet:/data/imagenet:ro"
    env:
      PYTHONUNBUFFERED: "1"
  resources:
    gpus: 1

Every run now starts a fresh container, mounts the workdir at /workspace, and passes GPU access via --gpus device=N.
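
For intuition, the kind of docker run invocation assembled from the config above might look like the following sketch (the real builder lives in src/docker_env.py; its interface and exact flags aren't shown here):

# Illustrative sketch of building a `docker run` command from a docker: block.
# Not the actual src/docker_env.py code; the host paths and inner command below
# are hypothetical.
def build_docker_cmd(image, gpu_id, mounts, env, workdir, inner_cmd):
    parts = [
        "docker", "run", "--rm",
        "--gpus", f"device={gpu_id}",            # mirrors --gpus device=N above
        "-v", f"{workdir}:/workspace", "-w", "/workspace",
    ]
    for mount in mounts:
        parts += ["-v", mount]
    for key, value in env.items():
        parts += ["-e", f"{key}={value}"]
    return parts + [image, "bash", "-lc", inner_cmd]

cmd = build_docker_cmd(
    image="pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime",
    gpu_id=1,
    mounts=["/data/imagenet:/data/imagenet:ro"],
    env={"PYTHONUNBUFFERED": "1"},
    workdir="/path/to/project",                  # hypothetical host path
    inner_cmd="python train.py --cfg cfg.yaml",
)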

Tracking schema

The SQLite database (default experiments.db) stores:

runs       -- run_id, name, config_json, status, started_at, ended_at, exit_code, gpu_ids, ...
metrics    -- run_id, key, value, step, wall_time
artifacts  -- run_id, name, path, size_bytes, registered_at

Your train.py can log metrics and register artifacts from inside the container:

from src.tracking import ExperimentStore
store = ExperimentStore("/workspace/experiments.db")
store.log_metric(run_id, "val_auc", 0.812, step=epoch)
store.register_artifact(run_id, "checkpoint", "/workspace/runs/ckpt.pt")
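
The same database can also be queried directly. A sketch of the kind of aggregation xrun summarize performs, using only the schema above (not the actual CLI code path):

# Illustrative query against the schema above; `xrun summarize` may aggregate
# differently. 'val_auc' is just the example metric key logged earlier.
import sqlite3

con = sqlite3.connect("experiments.db")
rows = con.execute(
    """
    SELECT r.run_id, MAX(m.value) AS best_val_auc
    FROM runs r JOIN metrics m ON m.run_id = r.run_id
    WHERE r.name LIKE 'mnist_mlp%' AND m.key = 'val_auc'
    GROUP BY r.run_id
    ORDER BY best_val_auc DESC
    """
).fetchall()
for run_id, best in rows:
    print(f"{run_id}  best val_auc = {best:.3f}")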

Layout

src/
  cli.py          Click-based CLI (`xrun`)
  config.py       YAML config + sweep expansion (pydantic)
  scheduler.py    Local GPU pool + job scheduler (threaded)
  docker_env.py   `docker run` command builder
  tracking.py     SQLite experiment store
tests/
  test_config.py    Sweep expansion tests
  test_tracking.py  Metrics + summarize tests
examples/sample_experiment/
  experiment.yaml     3 × 3 grid demo
  train.py            Minimal training script (no GPU required)

Why not use X?

  • Why not MLflow / W&B? Those are great but require a server, account, or network. xrun's backend is a single SQLite file, and everything runs locally.
  • Why not Ray Tune / Optuna? Different purpose — Tune is for optimizing hyperparameters with adaptive schedulers. xrun is for running a manual grid or ad-hoc experiment and tracking the results.
  • Why not SLURM? SLURM is great for large shared clusters. xrun targets the local-GPU-box or small team case where SLURM is overkill.

Honest trade-offs

  • Single-host only. Scale-out would require an RPC layer over scheduler.py.
  • Scheduler uses Python threads (fine for launching subprocesses), not async.
  • experiments.db is SQLite — fine for thousands of runs, not millions.
  • GPU detection relies on nvidia-smi; CPU-only machines fall back to running max_concurrent jobs in parallel without GPU pinning (see the sketch below).
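
A minimal sketch of nvidia-smi-based detection with a CPU-only fallback (the real logic in src/scheduler.py isn't reproduced here and may differ):

# Illustrative sketch of GPU detection with a CPU-only fallback; not the actual
# src/scheduler.py code.
import subprocess

def detect_gpus() -> list[int]:
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [int(line) for line in out.splitlines() if line.strip()]
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []  # no GPUs detected: jobs run without CUDA_VISIBLE_DEVICES pinning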
