An educational distributed deep learning system in which students' computers become part of a compute cluster that trains a GAN (Generative Adversarial Network) to generate images.
This project demonstrates distributed machine learning by:
- Using students' computers as a distributed compute cluster
- Coordinating training through a PostgreSQL database (no complex networking!)
- Training a DCGAN (Deep Convolutional GAN) to generate realistic images
- Teaching distributed systems, parallel training, and GANs simultaneously
Main process (instructor/admin):
- Creates work units (batches of image indices)
- Aggregates gradients from workers
- Applies optimizer steps
- Tracks training progress
Worker process (students/workers):
- Polls database for available work
- Computes gradients on assigned image batches
- Uploads gradients back to database
- Runs continuously until training completes
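The worker loop described above can be sketched as follows. The helper names and the in-memory queue are illustrative assumptions; the real worker polls PostgreSQL for work units instead.

```python
import time

def run_worker(fetch_work, compute_gradients, upload_gradients,
               poll_interval=0.0):
    """Poll for work units until the coordinator signals completion."""
    while True:
        unit = fetch_work()           # ask the coordinator for a batch
        if unit == "DONE":            # training finished; worker exits
            break
        if unit is None:              # nothing available yet; back off
            time.sleep(poll_interval)
            continue
        grads = compute_gradients(unit)   # forward/backward on the batch
        upload_gradients(unit, grads)     # report results back

# Simulated run over two work units followed by a completion marker.
queue = [[0, 1, 2], [3, 4, 5], "DONE"]
uploaded = []
run_worker(
    fetch_work=lambda: queue.pop(0),
    compute_gradients=lambda batch: [i * 0.1 for i in batch],
    upload_gradients=lambda unit, g: uploaded.append((unit, g)),
)
```

Because the worker only ever pulls work and pushes results, it needs no inbound connections, which is why no port forwarding is required.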
PostgreSQL database:
- Stores model weights, gradients, work units
- Acts as communication hub (no port forwarding needed!)
- Tracks worker statistics for monitoring
- Note: the instructor/admin needs to set up a PostgreSQL database that students can reach
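To make the database's role concrete, here is one possible shape for the tables it stores. The table and column names are assumptions, not the project's actual schema, and sqlite3 is used only so the sketch is self-contained; the real system uses PostgreSQL.

```python
import sqlite3

# Illustrative schema: weights, work units, gradients, and worker stats.
SCHEMA = """
CREATE TABLE model_weights (step INTEGER PRIMARY KEY, payload BLOB);
CREATE TABLE work_units    (id INTEGER PRIMARY KEY, image_indices TEXT,
                            status TEXT DEFAULT 'pending', claimed_by TEXT);
CREATE TABLE gradients     (work_unit_id INTEGER, worker_id TEXT,
                            payload BLOB);
CREATE TABLE worker_stats  (worker_id TEXT PRIMARY KEY, units_done INTEGER);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

With this layout, every interaction between coordinator and workers is an ordinary SQL read or write, which is what lets the database act as the communication hub.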
Quick links:
- Getting Started - Introduction and concepts
- Installation Guide - Choose your setup path
- Student Guide - How to participate as a worker
- Instructor Guide - Running the coordinator
- Configuration Reference - All config options
- Architecture - System design details
- FAQ - Frequently asked questions
Choose your installation path:
| Setup Path | Best For | GPU Required | Documentation |
|---|---|---|---|
| Dev Container † | Full development environment | Optional | Setup guide |
| Native Python | Direct local control | Optional | Setup guide |
| Conda | Conda users | Optional | Setup guide |
| Google Colab | Zero installation, free GPU | No (provided) | Setup guide |
| Local Training | Single GPU, no database | Optional | Setup guide |
† Recommended configuration
- New to the project? Start with the Getting Started Guide.
- For students: See the Student Guide for how to participate as a worker.
- For instructors: See the Instructor Guide for running the coordinator and managing training.
- Database-coordinated training: No complex networking, works across firewalls
- Fault tolerant: Workers can disconnect and reconnect; stalled work is automatically reassigned
- Flexible hardware: CPU and GPU workers can participate together
- Educational: Learn distributed systems, GANs, and parallel training
- Distributed systems: Coordination, fault tolerance, atomic operations
- Deep learning: GAN training, gradient aggregation, data parallelism
- Practical skills: PostgreSQL, PyTorch, collaborative computing
This is an educational project! Contributions welcome:
- Bug fixes and improvements
- Additional GAN architectures
- Gradient compression techniques
See the Contributing Guide for more details.
MIT License - See LICENSE file for details