Distributed Hyperparameter Tuning Framework

Overview

A scalable, distributed system designed to parallelize machine learning hyperparameter optimization. This framework decouples job scheduling from execution, allowing you to run computationally intensive tuning jobs across multiple worker nodes. It uses RabbitMQ for reliable message queuing and PostgreSQL for persistent storage of experiment results and metrics.

By moving away from sequential execution, this project enables faster experimentation cycles and better resource utilization.

Key Features

  • Distributed Architecture: Decoupled Scheduler and Worker nodes communicate via RabbitMQ.
  • Persistent Storage: All experiment metadata, trial configurations, and results are stored in PostgreSQL.
  • Extensible Design: Modular adapter system supporting scikit-learn and PyTorch models (easily extensible to others).
  • CLI Management: A built-in Command Line Interface (CLI) to launch experiments and query results.
  • Fault Tolerance: Durable queues ensure jobs are not lost if workers go offline.
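The adapter contract implied by the Extensible Design bullet might look like the following minimal sketch. The class and method names here are illustrative assumptions, not the project's actual API:

```python
from abc import ABC, abstractmethod

class ModelAdapter(ABC):
    """Hypothetical base class for model-specific adapters."""

    @abstractmethod
    def train_and_evaluate(self, params: dict) -> dict:
        """Train with the given hyperparameters; return {metric_name: value}."""

class DummyAdapter(ModelAdapter):
    """Stand-in adapter used only to show the contract; a real sklearn
    or torch adapter would fit and score an actual model here."""

    def train_and_evaluate(self, params: dict) -> dict:
        # Pretend accuracy improves with a larger (hypothetical) C parameter.
        return {"accuracy": min(0.99, 0.5 + 0.1 * params.get("C", 1))}
```

Under this design, supporting a new ML framework only requires subclassing the base adapter and implementing its single training method.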

System Architecture

The system consists of four main components:

  1. Scheduler: Generates hyperparameter combinations (Grid Search capabilities) and publishes tuning jobs to the message queue.
  2. Message Queue (RabbitMQ): Buffers jobs and distributes them to available workers.
  3. Workers: Stateless consumers that pick up jobs, train the model with the specified hyperparameters, and evaluate performance.
  4. Database (PostgreSQL): Stores experiment definitions, trial statuses, and final metrics.
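The scheduler's grid-generation and dispatch step can be sketched roughly as follows. The queue name tuning_jobs matches the usage guide; the parameter grid, message shape, and function names are illustrative assumptions:

```python
import itertools
import json

def expand_grid(param_grid):
    """Cartesian product of hyperparameter values -> list of trial configs."""
    keys = list(param_grid)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*param_grid.values())]

def publish_trials(channel, experiment_id, param_grid):
    """Publish one message per trial to the durable tuning_jobs queue."""
    channel.queue_declare(queue="tuning_jobs", durable=True)
    for trial_params in expand_grid(param_grid):
        body = json.dumps({"experiment_id": experiment_id,
                           "params": trial_params})
        channel.basic_publish(exchange="", routing_key="tuning_jobs",
                              body=body)
```

Declaring the queue as durable is what gives the fault tolerance described above: RabbitMQ keeps the queue across broker restarts, and unacknowledged jobs are redelivered if a worker dies.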

Project Structure

distributed-tuner/
├── cli/                 # Command Line Interface
│   └── main.py          # Entry point for CLI commands
├── common/              # Shared resources
│   ├── db.py            # Database connection logic
│   ├── models.py        # SQLAlchemy ORM models
│   └── queue.py         # RabbitMQ publisher logic
├── scheduler/           # Experiment orchestration
│   └── scheduler.py     # Logic to generate and dispatch jobs
├── worker/              # Worker node implementation
│   ├── adapters/        # Model-specific adapters (sklearn, torch)
│   └── worker.py        # Main worker loop (consumes jobs)
├── scripts/             # Utility scripts
│   └── init_db.py       # Database schema initialization
├── docker/              # Docker configuration files
├── requirements.txt     # Python dependencies
└── README.md            # Project documentation

Prerequisites

Before running the project, ensure you have the following installed:

  • Python 3.8+
  • PostgreSQL: Running locally or accessible via network.
  • RabbitMQ: Running locally or accessible via network.

Note

Currently, the database and RabbitMQ connection settings (localhost) are hardcoded in common/db.py and worker/worker.py. Ensure both services are running on their default ports.
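For reference, the hardcoded settings implied by this note presumably resemble the following; the exact credentials and database name are assumptions, so check the two files above for the real values:

```python
# Assumed defaults; verify against common/db.py and worker/worker.py.
POSTGRES_URL = "postgresql://postgres:postgres@localhost:5432/tuner"
RABBITMQ_HOST = "localhost"  # RabbitMQ's default AMQP port is 5672
```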

Installation & Setup

  1. Clone the repository

    git clone <repository-url>
    cd distributed-tuner
  2. Create and activate a virtual environment

    python -m venv venv
    # Windows
    .\venv\Scripts\activate
    # Linux/Mac
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Initialize the database: create the necessary tables in your PostgreSQL database.

    python -m scripts.init_db
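The tables created by this step presumably correspond to SQLAlchemy declarative models along these lines. The table and column names are assumptions based on this README, not the actual schema in common/models.py:

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Experiment(Base):
    """One tuning experiment, owning many trials."""
    __tablename__ = "experiments"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

class Trial(Base):
    """A single hyperparameter combination and its outcome."""
    __tablename__ = "trials"
    id = Column(Integer, primary_key=True)
    experiment_id = Column(Integer, ForeignKey("experiments.id"))
    status = Column(String, default="PENDING")
    metric_name = Column(String)
    metric_value = Column(Float)
```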

Usage Guide

1. Start a Worker

Start one or more worker processes. Each worker will listen for jobs on the tuning_jobs queue.

python -m worker.worker

You should see: "Worker started. Waiting for jobs..."
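A worker's consume loop can be sketched with pika's blocking consumer. The handler body is a placeholder: adapter dispatch and metric storage are elided, and the function layout is an assumption rather than the actual worker/worker.py code:

```python
import json

def handle_job(ch, method, properties, body):
    """Process one trial job, then acknowledge so RabbitMQ can drop it."""
    job = json.loads(body)
    # ... look up the adapter for job["params"], train, record metrics ...
    ch.basic_ack(delivery_tag=method.delivery_tag)
    return job  # returned only so this sketch is easy to exercise

def run_worker(host="localhost"):
    """Blocking consume loop on the durable tuning_jobs queue."""
    import pika  # imported here so the handler above stays importable
    connection = pika.BlockingConnection(pika.ConnectionParameters(host))
    channel = connection.channel()
    channel.queue_declare(queue="tuning_jobs", durable=True)
    channel.basic_qos(prefetch_count=1)  # hand out one job at a time
    channel.basic_consume(queue="tuning_jobs",
                          on_message_callback=handle_job)
    print("Worker started. Waiting for jobs...")
    channel.start_consuming()
```

The prefetch_count of 1 keeps a slow trial on one worker from starving the others: RabbitMQ will not send a worker a new job until it acknowledges the current one.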

2. Launch an Experiment

Open a new terminal (with venv activated) and use the CLI to start a new tuning experiment.

python -m cli.main run

This will create an experiment and dispatch trial jobs to the queue. The running worker(s) will pick these up immediately.
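The run command, built with Click as noted under Technical Details, might be structured like this; the command-group layout and the --name option are assumptions, not the actual cli/main.py interface:

```python
import click

@click.group()
def cli():
    """Management commands for the tuner (hypothetical layout)."""

@cli.command()
@click.option("--name", default="experiment", help="Experiment name.")
def run(name):
    """Create an experiment and dispatch its trial jobs."""
    click.echo(f"Dispatching trials for experiment '{name}'")
    # ... create the DB row, expand the grid, publish to tuning_jobs ...

if __name__ == "__main__":
    cli()
```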

3. Check Results

View the status and metrics of your trials.

python -m cli.main results

Output format: TrialID | Status | MetricName = Value
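The row rendering behind this output can be sketched as a small helper; the function is hypothetical, and the real CLI may format rows differently:

```python
def format_result(trial_id, status, metrics):
    """Render one trial row as `TrialID | Status | MetricName = Value`."""
    metric_str = ", ".join(f"{k} = {v}" for k, v in metrics.items())
    return f"{trial_id} | {status} | {metric_str}"
```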

Technical Details

  • Language: Python
  • ORM: SQLAlchemy (declarative base)
  • Messaging: Pika (RabbitMQ client)
  • CLI: Click
  • ML Libraries: scikit-learn, PyTorch, NumPy, Pandas

Author

Anuj Mundu

  • MCA Student | ML Practitioner
  • Focused on applied machine learning and real-world optimization problems.
  • GitHub | LinkedIn
