A scalable, distributed system designed to parallelize machine learning hyperparameter optimization. This framework decouples job scheduling from execution, allowing you to run computationally intensive tuning jobs across multiple worker nodes. It uses RabbitMQ for reliable message queuing and PostgreSQL for persistent storage of experiment results and metrics.
By moving away from sequential execution, this project enables faster experimentation cycles and better resource utilization.
- Distributed Architecture: Decoupled Scheduler and Worker nodes communicate via RabbitMQ.
- Persistent Storage: All experiment metadata, trial configurations, and results are stored in PostgreSQL.
- Extensible Design: Modular adapter system supporting scikit-learn and PyTorch models (easily extensible to others).
- CLI Management: A built-in Command Line Interface (CLI) to launch experiments and query results.
- Fault Tolerance: Durable queues ensure jobs are not lost if workers go offline.
The system consists of four main components:
- Scheduler: Generates hyperparameter combinations (currently Grid Search) and publishes tuning jobs to the message queue.
- Message Queue (RabbitMQ): Buffers jobs and distributes them to available workers.
- Workers: Stateless consumers that pick up jobs, train the model with the specified hyperparameters, and evaluate performance.
- Database (PostgreSQL): Stores experiment definitions, trial statuses, and final metrics.
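The Scheduler's grid expansion can be sketched in a few lines. The search-space dict below is illustrative only, not the project's actual defaults:

```python
from itertools import product

def expand_grid(search_space):
    """Return every combination of the listed hyperparameter values."""
    keys = list(search_space)
    return [dict(zip(keys, values))
            for values in product(*(search_space[k] for k in keys))]

# Hypothetical search space; each resulting dict becomes one tuning job.
grid = {"n_estimators": [50, 100], "max_depth": [3, 5, 7]}
trials = expand_grid(grid)  # 2 x 3 = 6 trial configurations
```

Each configuration is then published to the queue as an independent job, which is what lets workers process trials in parallel.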
distributed-tuner/
├── cli/ # Command Line Interface
│ └── main.py # Entry point for CLI commands
├── common/ # Shared resources
│ ├── db.py # Database connection logic
│ ├── models.py # SQLAlchemy ORM models
│ └── queue.py # RabbitMQ publisher logic
├── scheduler/ # Experiment orchestration
│ └── scheduler.py # Logic to generate and dispatch jobs
├── worker/ # Worker node implementation
│ ├── adapters/ # Model-specific adapters (sklearn, torch)
│ └── worker.py # Main worker loop (consumes jobs)
├── scripts/ # Utility scripts
│ └── init_db.py # Database schema initialization
├── docker/ # Docker configuration files
├── requirements.txt # Python dependencies
└── README.md # Project documentation
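The scheduler and workers must agree on a job payload format. A minimal sketch of what `common/queue.py` might serialize (the field names here are assumptions, not the project's actual schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TuningJob:
    """Hypothetical wire format for one tuning trial."""
    experiment_id: int
    trial_id: int
    params: dict

    def to_message(self) -> bytes:
        # RabbitMQ message bodies are bytes; JSON keeps them language-neutral.
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_message(cls, body: bytes) -> "TuningJob":
        return cls(**json.loads(body))

job = TuningJob(experiment_id=1, trial_id=7, params={"max_depth": 5})
restored = TuningJob.from_message(job.to_message())  # round-trips cleanly
```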
Before running the project, ensure you have the following installed:
- Python 3.8+
- PostgreSQL: Running locally or accessible via network.
- RabbitMQ: Running locally or accessible via network.
Note
Currently, the database and RabbitMQ connection settings (localhost) are hardcoded in common/db.py and worker/worker.py. Ensure both services are running on their default ports.
- Clone the repository

  git clone &lt;repository-url&gt;
  cd distributed-tuner
- Create and activate a virtual environment

  python -m venv venv
  .\venv\Scripts\activate     # Windows
  source venv/bin/activate    # Linux/Mac
- Install dependencies

  pip install -r requirements.txt
- Initialize the database (creates the necessary tables in your PostgreSQL database)

  python -m scripts.init_db
Start one or more worker processes. Each worker will listen for jobs on the tuning_jobs queue.
python -m worker.worker

You should see: "Worker started. Waiting for jobs..."
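The core of the worker loop can be sketched as below. The payload shape is an assumption (see the hedged job format above the tree is not guaranteed), and `run_worker` requires a live RabbitMQ broker, so it is defined but not invoked here:

```python
import json

def handle_job(body: bytes) -> dict:
    """Parse one job payload; assumes {"trial_id": ..., "params": {...}}."""
    job = json.loads(body)
    # A real worker would hand job["params"] to the matching model adapter,
    # train, evaluate, and write metrics to PostgreSQL.
    return {"trial_id": job["trial_id"], "params": job["params"],
            "status": "completed"}

def run_worker() -> None:
    """Consume from the durable tuning_jobs queue (needs a running broker)."""
    import pika  # RabbitMQ client listed in the tech stack

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="tuning_jobs", durable=True)

    def on_message(ch, method, properties, body):
        handle_job(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after success

    channel.basic_qos(prefetch_count=1)  # one job at a time per worker
    channel.basic_consume(queue="tuning_jobs", on_message_callback=on_message)
    print("Worker started. Waiting for jobs...")
    channel.start_consuming()
```

Acknowledging only after the job completes is what gives the fault tolerance described above: if a worker dies mid-trial, RabbitMQ redelivers the job to another worker.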
Open a new terminal (with venv activated) and use the CLI to start a new tuning experiment.
python -m cli.main run

This will create an experiment and dispatch trial jobs to the queue. The running worker(s) will pick these up immediately.
View the status and metrics of your trials.
python -m cli.main results

Output format: TrialID | Status | MetricName = Value
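A rough sketch of how a Click-based `results` command could render that output line. The formatter and sample row are invented for illustration; the real command reads trials from PostgreSQL via common/db.py:

```python
def format_row(trial_id: int, status: str, metric: str, value: float) -> str:
    """Render one result line: TrialID | Status | MetricName = Value."""
    return f"{trial_id} | {status} | {metric} = {value:.4f}"

def build_results_command():
    """Build the Click command (import deferred; requires click installed)."""
    import click

    @click.command()
    def results():
        # In the real CLI these rows would come from the database;
        # this sample row is made up for illustration.
        for row in [(1, "completed", "accuracy", 0.9231)]:
            click.echo(format_row(*row))

    return results
```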
- Language: Python
- ORM: SQLAlchemy (declarative base)
- Messaging: Pika (RabbitMQ client)
- CLI: Click
- ML Libraries: scikit-learn, PyTorch, NumPy, Pandas
Anuj Mundu