PollutionMap is a distributed pipeline that processes air quality readings from sensors across multiple cities in parallel, producing a city-level pollution summary.
The air quality dataset is submitted to a central scheduler, which splits it into chunks and distributes them across multiple worker nodes that process them in parallel. This scheduler-worker approach scales horizontally: adding workers reduces processing time roughly in proportion, until queueing and aggregation overhead begins to dominate. For a small dataset the difference is negligible, but for a dataset with millions of sensor readings from thousands of cities, splitting the work across multiple workers can be many times faster than running a single sequential script.
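The chunk-splitting step can be sketched in a few lines of Python. This is an illustrative helper, not PollutionMap's actual code; the function name and the even-slice strategy are assumptions:

```python
def split_into_chunks(readings, num_chunks):
    """Split a list of sensor readings into num_chunks roughly equal
    slices, one per worker. Illustrative sketch only."""
    size, rem = divmod(len(readings), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `rem` chunks take one extra reading each.
        end = start + size + (1 if i < rem else 0)
        chunks.append(readings[start:end])
        start = end
    return chunks

readings = [{"city": "Oslo", "pm25": 12}, {"city": "Lima", "pm25": 48},
            {"city": "Pune", "pm25": 61}, {"city": "Oslo", "pm25": 15}]
chunks = split_into_chunks(readings, 3)
# 3 chunks; every reading lands in exactly one chunk
```

Any partitioning that covers each reading exactly once works here; even slices keep worker load balanced when readings are uniform in cost.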
- A client submits a dataset to the scheduler
- The scheduler partitions it into chunks and pushes tasks to a Redis queue
- Multiple workers poll the queue and process their assigned chunks in parallel
- Results are aggregated into a final output by a reduce stage
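The steps above can be simulated in-process with plain Python, using a deque as a stand-in for the Redis queue. Everything here (record shape, chunk size, the per-city mean computed in the reduce) is illustrative, not the project's actual schema:

```python
from collections import defaultdict, deque

# Stand-in for the Redis task queue; the real pipeline pushes task
# payloads to Redis and workers pull them with a blocking pop.
queue = deque()

readings = [
    {"city": "Oslo", "pm25": 12}, {"city": "Oslo", "pm25": 15},
    {"city": "Lima", "pm25": 48}, {"city": "Pune", "pm25": 61},
]

# 1. Scheduler: partition the dataset and enqueue one task per chunk.
for i in range(0, len(readings), 2):
    queue.append(readings[i:i + 2])

# 2. Workers: poll the queue and compute per-city partial sums/counts.
partials = []
while queue:
    chunk = queue.popleft()
    acc = defaultdict(lambda: [0, 0])  # city -> [sum, count]
    for r in chunk:
        acc[r["city"]][0] += r["pm25"]
        acc[r["city"]][1] += 1
    partials.append(acc)

# 3. Reduce: merge the partials into a city-level mean PM2.5 summary.
totals = defaultdict(lambda: [0, 0])
for acc in partials:
    for city, (s, n) in acc.items():
        totals[city][0] += s
        totals[city][1] += n
summary = {city: s / n for city, (s, n) in totals.items()}
# summary == {"Oslo": 13.5, "Lima": 48.0, "Pune": 61.0}
```

Because each worker emits sums and counts rather than means, the reduce stage can merge partial results in any order without losing accuracy.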
- Python + FastAPI — scheduler API
- Redis — task queue and distributed state
- Docker — each worker runs in its own container, making it easy to scale horizontally
- HTTP REST — communication between client, scheduler, and workers
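A minimal `docker-compose.yml` for this layout might look like the following. The service names, build contexts, and port mapping are assumptions inferred from the commands below, not the repository's actual file:

```yaml
services:
  redis:
    image: redis:7
  scheduler:
    build: ./scheduler        # hypothetical build context
    ports:
      - "8000:8000"           # FastAPI scheduler API
    depends_on: [redis]
  worker:
    build: ./worker           # hypothetical build context
    depends_on: [redis]       # scale with: docker compose up --scale worker=3
```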
Prerequisites: Docker must be installed and running.
Start the scheduler and 3 workers:
```bash
docker compose up --scale worker=3
```

Submit a job:

```bash
PYTHONPATH=. python -m client.submit_job dataset/sample_dataset.json 3
```

Check live metrics (active workers, pending and running tasks):

```bash
curl http://localhost:8000/metrics
```