gperdrizet/GANNs-with-friends

Distributed GAN training with students as workers

An educational distributed deep learning system where students become part of a compute cluster to train a GAN (Generative Adversarial Network) to generate images.

Concept

This project demonstrates distributed machine learning by:

  • Using students' computers as a distributed compute cluster
  • Coordinating training through a PostgreSQL database (no complex networking!)
  • Training a DCGAN to generate realistic images
  • Teaching distributed systems, parallel training, and GANs simultaneously

Architecture

Main process (instructor/admin):

  • Creates work units (batches of image indices)
  • Aggregates gradients from workers
  • Applies optimizer steps
  • Tracks training progress
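
The aggregation step can be sketched as follows. Everything here is illustrative, not the project's actual API: NumPy arrays stand in for PyTorch tensors, and a plain-SGD update stands in for whatever optimizer the project actually uses.

```python
import numpy as np

def aggregate_gradients(worker_grads):
    """Average the gradients uploaded by workers (data parallelism:
    each worker computed gradients on a different image batch)."""
    return [np.mean(np.stack(per_param), axis=0)
            for per_param in zip(*worker_grads)]

def apply_sgd_step(weights, grads, lr=0.1):
    """Plain SGD update; the real project may use Adam or similar."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Two workers each return gradients for a two-parameter model.
worker_grads = [
    [np.array([1.0, 3.0]), np.array([0.5])],   # worker A's batch
    [np.array([3.0, 1.0]), np.array([1.5])],   # worker B's batch
]
weights = [np.array([0.0, 0.0]), np.array([0.0])]

avg = aggregate_gradients(worker_grads)         # [[2., 2.], [1.]]
weights = apply_sgd_step(weights, avg, lr=0.1)  # [[-0.2, -0.2], [-0.1]]
```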

Worker process (students/workers):

  • Polls database for available work
  • Computes gradients on assigned image batches
  • Uploads gradients back to database
  • Runs continuously until training completes
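
A minimal sketch of that poll loop, with plain Python callables standing in for the actual database queries (all names here are illustrative, not the project's API):

```python
import time

def run_worker(claim_work, compute_gradients, upload_gradients,
               training_done, poll_interval=0.0):
    """Worker loop sketch: repeatedly claim a work unit (a batch of
    image indices), compute gradients on it, and upload the result.
    The callables stand in for database queries in the real system."""
    processed = 0
    while not training_done():
        unit = claim_work()              # ask the database for pending work
        if unit is None:                 # nothing available right now
            time.sleep(poll_interval)    # back off, then poll again
            continue
        grads = compute_gradients(unit)  # forward/backward on the batch
        upload_gradients(unit, grads)    # store gradients for aggregation
        processed += 1
    return processed

# In-memory stand-ins for the database:
pending = [[0, 1, 2], [3, 4, 5]]         # two work units (image-index batches)
uploaded = []
count = run_worker(
    claim_work=lambda: pending.pop(0) if pending else None,
    compute_gradients=lambda unit: [i * 0.1 for i in unit],  # fake gradients
    upload_gradients=lambda unit, g: uploaded.append((tuple(unit), g)),
    training_done=lambda: not pending and len(uploaded) == 2,
)
```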

PostgreSQL database:

  • Stores model weights, gradients, work units
  • Acts as communication hub (no port forwarding needed!)
  • Tracks worker statistics for monitoring
  • Note: the instructor/admin must set up a PostgreSQL database that students can reach
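
To illustrate the coordination idea, here is a hypothetical compare-and-swap claim against an in-memory SQLite table. The real project uses PostgreSQL, and the table and column names here are invented for the sketch; on PostgreSQL, `FOR UPDATE SKIP LOCKED` serves the same purpose under real concurrency.

```python
import sqlite3

# Illustrative schema; the project's actual tables likely differ.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE work_units (
    id INTEGER PRIMARY KEY,
    batch_indices TEXT,            -- e.g. JSON list of image indices
    status TEXT DEFAULT 'pending',
    worker TEXT
)""")
db.executemany("INSERT INTO work_units (batch_indices) VALUES (?)",
               [("[0,1,2]",), ("[3,4,5]",)])

def claim_work(conn, worker_id):
    """Atomically claim one pending unit: the conditional UPDATE either
    flips a row to 'claimed' for this worker or matches nothing, so two
    workers can never end up holding the same unit."""
    while True:
        row = conn.execute(
            "SELECT id, batch_indices FROM work_units "
            "WHERE status = 'pending' LIMIT 1").fetchone()
        if row is None:
            return None                  # no pending work remains
        claimed = conn.execute(
            "UPDATE work_units SET status = 'claimed', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, row[0])).rowcount
        conn.commit()
        if claimed:                      # lost the race? loop and retry
            return row

a = claim_work(db, "student-1")
b = claim_work(db, "student-2")
c = claim_work(db, "student-3")  # no work left -> None
```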

Documentation

See the full documentation for complete guides.

Quick start

Choose your installation path:

Setup Path      Best For                      GPU Required   Documentation
Dev Container   Full development environment  Optional       Setup guide
Native Python   Direct local control          Optional       Setup guide
Conda           Conda users                   Optional       Setup guide
Google Colab    Zero installation, free GPU   No (provided)  Setup guide
Local Training  Single GPU, no database       Optional       Setup guide

† Recommended configuration

Features

  • Database-coordinated training: No complex networking, works across firewalls
  • Fault tolerant: Workers can disconnect/reconnect, automatic work reassignment
  • Flexible hardware: CPU and GPU workers can participate together
  • Educational: Learn distributed systems, GANs, and parallel training

What students learn

  • Distributed systems: Coordination, fault tolerance, atomic operations
  • Deep learning: GAN training, gradient aggregation, data parallelism
  • Practical skills: PostgreSQL, PyTorch, collaborative computing

Contributing

This is an educational project! Contributions welcome:

  • Bug fixes and improvements
  • Additional GAN architectures
  • Gradient compression techniques

See the Contributing Guide for more details.

License

MIT License. See the LICENSE file for details.