A scalable, distributed system designed to parallelize machine learning hyperparameter optimization. This framework decouples job scheduling from execution, allowing you to run computationally intensive tuning jobs across multiple worker nodes. It uses RabbitMQ for reliable message queuing and PostgreSQL for persistent storage of experiment results and metrics.
By moving away from sequential execution, this project enables faster experimentation cycles and better resource utilization.
- Distributed Architecture: Decoupled Scheduler and Worker nodes communicate via RabbitMQ.
- Persistent Storage: All experiment metadata, trial configurations, and results are stored in PostgreSQL.
- Extensible Design: Modular adapter system supporting scikit-learn and PyTorch models (easily extensible to others).
- CLI Management: A built-in Command Line Interface (CLI) to launch experiments and query results.
- Fault Tolerance: Durable queues ensure jobs are not lost if workers go offline.
The system consists of four main components:
- Scheduler: Generates hyperparameter combinations (currently Grid Search) and publishes tuning jobs to the message queue.
- Message Queue (RabbitMQ): Buffers jobs and distributes them to available workers.
- Workers: Stateless consumers that pick up jobs, train the model with the specified hyperparameters, and evaluate performance.
- Database (PostgreSQL): Stores experiment definitions, trial statuses, and final metrics.
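The Scheduler's grid expansion can be sketched in a few lines. The search-space dict below is illustrative only, not the project's actual defaults:

```python
from itertools import product

def expand_grid(search_space):
    """Return every combination of the listed hyperparameter values."""
    keys = list(search_space)
    return [dict(zip(keys, values))
            for values in product(*(search_space[k] for k in keys))]

# Hypothetical search space; each resulting dict becomes one tuning job.
grid = {"n_estimators": [50, 100], "max_depth": [3, 5, 7]}
trials = expand_grid(grid)  # 2 x 3 = 6 trial configurations
```

Each configuration is then published to the queue as an independent job, which is what lets workers process trials in parallel.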
distributed-tuner/
├── cli/ # Command Line Interface
│ └── main.py # Entry point for CLI commands
├── common/ # Shared resources
│ ├── db.py # Database connection logic
│ ├── models.py # SQLAlchemy ORM models
│ └── queue.py # RabbitMQ publisher logic
├── scheduler/ # Experiment orchestration
│ └── scheduler.py # Logic to generate and dispatch jobs
├── worker/ # Worker node implementation
│ ├── adapters/ # Model-specific adapters (sklearn, torch)
│ └── worker.py # Main worker loop (consumes jobs)
├── scripts/ # Utility scripts
│ └── init_db.py # Database schema initialization
├── docker/ # Docker configuration files
├── requirements.txt # Python dependencies
└── README.md # Project documentation
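The scheduler and workers must agree on a job payload format. A minimal sketch of what `common/queue.py` might serialize (the field names here are assumptions, not the project's actual schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TuningJob:
    """Hypothetical wire format for one tuning trial."""
    experiment_id: int
    trial_id: int
    params: dict

    def to_message(self) -> bytes:
        # RabbitMQ message bodies are bytes; JSON keeps them language-neutral.
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_message(cls, body: bytes) -> "TuningJob":
        return cls(**json.loads(body))

job = TuningJob(experiment_id=1, trial_id=7, params={"max_depth": 5})
restored = TuningJob.from_message(job.to_message())  # round-trips cleanly
```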
Before running the project, ensure you have the following installed:
- Python 3.8+
- PostgreSQL: Running locally or accessible via network.
- RabbitMQ: Running locally or accessible via network.
Note
Currently, the database and RabbitMQ connection settings (localhost) are hardcoded in common/db.py and worker/worker.py. Ensure both services are running on their default ports.
- Clone the repository

  git clone &lt;repository-url&gt;
  cd distributed-tuner
- Create and activate a virtual environment

  python -m venv venv
  .\venv\Scripts\activate     # Windows
  source venv/bin/activate    # Linux/Mac
- Install dependencies

  pip install -r requirements.txt
- Initialize the database (creates the necessary tables in your PostgreSQL database)

  python -m scripts.init_db
Start one or more worker processes. Each worker will listen for jobs on the tuning_jobs queue.
python -m worker.worker

You should see: "Worker started. Waiting for jobs..."
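The core of the worker loop can be sketched as below. The payload shape is an assumption (see the hedged job format above the tree is not guaranteed), and `run_worker` requires a live RabbitMQ broker, so it is defined but not invoked here:

```python
import json

def handle_job(body: bytes) -> dict:
    """Parse one job payload; assumes {"trial_id": ..., "params": {...}}."""
    job = json.loads(body)
    # A real worker would hand job["params"] to the matching model adapter,
    # train, evaluate, and write metrics to PostgreSQL.
    return {"trial_id": job["trial_id"], "params": job["params"],
            "status": "completed"}

def run_worker() -> None:
    """Consume from the durable tuning_jobs queue (needs a running broker)."""
    import pika  # RabbitMQ client listed in the tech stack

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="tuning_jobs", durable=True)

    def on_message(ch, method, properties, body):
        handle_job(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after success

    channel.basic_qos(prefetch_count=1)  # one job at a time per worker
    channel.basic_consume(queue="tuning_jobs", on_message_callback=on_message)
    print("Worker started. Waiting for jobs...")
    channel.start_consuming()
```

Acknowledging only after the job completes is what gives the fault tolerance described above: if a worker dies mid-trial, RabbitMQ redelivers the job to another worker.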
Open a new terminal (with venv activated) and use the CLI to start a new tuning experiment.
python -m cli.main run

This will create an experiment and dispatch trial jobs to the queue. The running worker(s) will pick these up immediately.
View the status and metrics of your trials.
python -m cli.main results

Output format: TrialID | Status | MetricName = Value
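A rough sketch of how a Click-based `results` command could render that output line. The formatter and sample row are invented for illustration; the real command reads trials from PostgreSQL via common/db.py:

```python
def format_row(trial_id: int, status: str, metric: str, value: float) -> str:
    """Render one result line: TrialID | Status | MetricName = Value."""
    return f"{trial_id} | {status} | {metric} = {value:.4f}"

def build_results_command():
    """Build the Click command (import deferred; requires click installed)."""
    import click

    @click.command()
    def results():
        # In the real CLI these rows would come from the database;
        # this sample row is made up for illustration.
        for row in [(1, "completed", "accuracy", 0.9231)]:
            click.echo(format_row(*row))

    return results
```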
- Language: Python
- ORM: SQLAlchemy (declarative base)
- Messaging: Pika (RabbitMQ client)
- CLI: Click
- ML Libraries: scikit-learn, PyTorch, NumPy, Pandas
Anuj Mundu