This project, developed for the "Advanced Methods for Scientific Computing" course at Politecnico di Milano, goes beyond a simple SIR model. Our mission was to engineer a high-performance, parallel simulation capable of modeling real-world epidemic dynamics across a large, geographically complex area like the United States. We leveraged C++ and MPI to build a scalable and efficient tool for computational epidemiology, demonstrating how HPC techniques can be applied to solve critical, large-scale societal problems.
This project was a collaborative effort by a talented international team of five engineers. The primary repository is owned by my teammate, Nada Khaled. As a key contributor and strategist for the team, my specific role focused on two main areas:
- Architecting the Parallelization Strategy:
  - I took the lead in designing the core MPI-based parallelization architecture, focusing on an efficient domain decomposition strategy and the implementation of ghost cell communication to ensure data consistency between processes.
  - My work was crucial for enabling the simulation to scale and run efficiently on multiple processor cores.
- Team Management & Strategy:
  - I helped to guide the team's overall strategy, ensuring our technical decisions were aligned with the project's goals.
  - This involved facilitating discussions, helping to resolve technical roadblocks, and ensuring that our collaborative workflow was smooth and productive.
The datasets used in this project are sourced from the JHU CSSE COVID-19 Dataset. Our main input dataset is from February 2, 2021.
Before using a CSV file in the simulation:
- Population data is added from external sources for each state
- Missing values are filled using values from previous records
- Dataset is cleaned to retain only the first 9 essential columns:
  - Province_State
  - Population
  - Last_Update
  - Lat
  - Long_
  - Confirmed
  - Deaths
  - Recovered
  - Active
- States are reordered to place geographically adjacent states together
- Header row is added with column count and row count
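The missing-value step above can be illustrated with a small forward-fill routine. This is only a sketch in C++: the project's own preprocessing lives in the Python scripts, and the function name `forwardFill`, the `fallback` parameter, and the reading of "previous records" as "earlier entries of the same column" are assumptions made for illustration.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Replace each missing entry with the most recent known value before it.
// Entries with no earlier record fall back to `fallback` (hypothetical default: 0).
std::vector<double> forwardFill(const std::vector<std::optional<double>>& column,
                                double fallback = 0.0) {
    std::vector<double> filled;
    filled.reserve(column.size());
    double last = fallback;
    for (const auto& value : column) {
        if (value.has_value()) {
            last = *value;  // remember the newest known value
        }
        filled.push_back(last);
    }
    assert(filled.size() == column.size());
    return filled;
}
```

For a column `{10, missing, 12, missing}` this yields `{10, 10, 12, 12}`.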
.
├── data
│   ├── output          # Simulation results
│   ├── test_results    # Test outputs
│   ├── analysis        # Analysis plots and metrics
│   └── test_datasets   # Raw CSVs for testing
├── header
│   ├── main
│   └── test
├── scripts             # Python scripts for analysis and plotting
└── src                 # C++ source code
    ├── main.cpp        # Main simulation file
    ├── main
    └── test            # Test suite for simulation
- Data Distribution
  - Optimal block division
  - Load balancing
  - Neighbor cell mapping
- MPI Communication
  - Block distribution
  - Ghost cell updates
  - Result gathering
- SIR Model
  - Differential equations
  - Parameter tuning
  - State management
- Output Handling
  - CSV writing
  - Logging
  - Error handling
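To make the data distribution items above concrete, here is a minimal sketch of a balanced block division with a neighbor lookup. It assumes a one-dimensional, non-periodic layout of cells across ranks. The names `BlockRange`, `blockRange`, `Neighbors`, and `neighborsOf` are hypothetical, and the project's actual decomposition may differ.

```cpp
#include <cassert>
#include <cstddef>

struct BlockRange {
    std::size_t begin;  // index of the first cell owned by the rank
    std::size_t end;    // one past the last owned cell
};

// Split `totalCells` over `numRanks` so block sizes differ by at most one;
// the first (totalCells % numRanks) ranks each take one extra cell.
BlockRange blockRange(std::size_t totalCells, int numRanks, int rank) {
    assert(numRanks > 0 && rank >= 0 && rank < numRanks);
    const std::size_t ranks = static_cast<std::size_t>(numRanks);
    const std::size_t r = static_cast<std::size_t>(rank);
    const std::size_t base = totalCells / ranks;
    const std::size_t extra = totalCells % ranks;
    const std::size_t begin = r * base + (r < extra ? r : extra);
    const std::size_t size = base + (r < extra ? 1 : 0);
    return {begin, begin + size};
}

struct Neighbors {
    int left;   // rank owning the cells just before this block, or -1
    int right;  // rank owning the cells just after this block, or -1
};

// Neighbor ranks whose boundary cells would be mirrored as ghost cells.
Neighbors neighborsOf(int numRanks, int rank) {
    assert(numRanks > 0 && rank >= 0 && rank < numRanks);
    return {rank > 0 ? rank - 1 : -1, rank + 1 < numRanks ? rank + 1 : -1};
}
```

With 50 cells on 4 ranks, the blocks are `[0, 13)`, `[13, 26)`, `[26, 38)`, and `[38, 50)`. In an MPI program the `-1` sentinel would typically be replaced by `MPI_PROC_NULL`, so that sends and receives to a missing neighbor become no-ops.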
- Load balancing strategies
- Asynchronous communication
- Efficient I/O operations
- C++17 or later compiler
- MPI Library
- Python 3.x with required packages
# C++ Requirements
sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install openmpi-bin libopenmpi-dev
# Python Requirements
sudo apt-get install python3 python3-pip
pip3 install numpy pandas matplotlib seaborn
# Create necessary directories
mkdir -p data/output
# Build
make clean
make all
# Run simulation (e.g., with 4 processes)
mpirun -np 4 ./sir_simulation
# Plot simulation results
python scripts/PlottingSIRModelResults.py
# Create test directories
mkdir -p data/test_results data/analysis
# Build tests
make clean
make test
# Run tests
mpirun -np 4 ./sir_test_suite
# Analyze test results
python scripts/analyze_results.py
- Place raw CSV in data/test_datasets/
- Run preprocessing:
python scripts/clean_sort_dataset.py data/test_datasets/your_dataset.csv
- Verify the preprocessing steps:
- Population data added
- Missing values filled
- Only essential columns retained
- States geographically sorted
- Update test configurations in src/test/TestSuite.cpp
- sorted_01-01-2021.csv
  - First wave 2020 data
  - 50 states complete data
  - Used for base temporal tests
- sorted_02-05-2021.csv
  - Second wave 2021 data
  - 50 states complete data
  - Used for comparative analysis
- Must contain all 50 US states
- Population values must be positive
- Missing values handled as zeros
- Dates in YYYY-MM-DD format
- data/output/: Main simulation results
- data/test_results/: Test outputs
- data/analysis/: Analysis plots and metrics
- Main Results:
python scripts/PlottingSIRModelResults.py
- Test Analysis:
python scripts/analyze_results.py
Location: data/output/simulation_results.csv
Time,S_avg,I_avg,R_avg
0.0,0.950000,0.050000,0.000000
0.2,0.947331,0.052669,0.000000
0.4,0.944516,0.055484,0.000000
...
Where:
- Time: Simulation timestep
- S_avg: Proportion of susceptible population
- I_avg: Proportion of infected population
- R_avg: Proportion of recovered population
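As an illustration of the row layout, the sketch below formats one line of the results file. The function name `formatResultRow` is hypothetical, and the precision (one decimal for the time, six for each proportion) is inferred from the sample rows rather than taken from the project's writer.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Format one result row as "Time,S_avg,I_avg,R_avg".
// Precision is an assumption read off the sample rows: %.1f for time, %.6f otherwise.
std::string formatResultRow(double time, double sAvg, double iAvg, double rAvg) {
    assert(time >= 0.0);
    char buffer[128];
    const int written = std::snprintf(buffer, sizeof(buffer), "%.1f,%.6f,%.6f,%.6f",
                                      time, sAvg, iAvg, rAvg);
    if (written < 0 || static_cast<std::size_t>(written) >= sizeof(buffer)) {
        throw std::runtime_error("result row does not fit the buffer");
    }
    return std::string(buffer, static_cast<std::size_t>(written));
}
```

Calling `formatResultRow(0.2, 0.947331, 0.052669, 0.0)` produces the second sample row, `0.2,0.947331,0.052669,0.000000`.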
Location: data/test_results/<test_name>_p<num_processes>_results.csv
Time,S_avg,I_avg,R_avg
0.0,0.950000,0.050000,0.000000
...
Location: data/output/timing_log.csv
PhaseName,Statistic,Value,Units,NumRanks
distributeBlocks_Total,Min,0.000123,s,4
distributeBlocks_Total,Max,0.000145,s,4
distributeBlocks_Total,Avg,0.000134,s,4
...
Location: plots/sir_global_line_plot.png
- Shows the temporal evolution of S, I, R populations
- X-axis: Time steps
- Y-axis: Population proportions
- Three lines: Susceptible (blue), Infected (red), Recovered (green)
Location: plots/infection_heatmap_per_rank.png
- Visualizes infection spread across MPI ranks
- X-axis: Time steps
- Y-axis: MPI ranks
- Color intensity: Infection level (darker = higher infection)
Location: plots/timing_comparison_phases.png
- Compares execution times across simulation phases
- Shows min/max/avg times for each phase
- Helps identify performance bottlenecks
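The min/max/avg values per phase can be derived from one timing per rank. The sketch below does this on a plain vector and is not the project's implementation: `TimingStats` and `summarize` are hypothetical names, and gathering the per-rank timings on one process is assumed to have happened already.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct TimingStats {
    double min;
    double max;
    double avg;
};

// Reduce one phase's per-rank wall-clock times (seconds) to min, max and average.
TimingStats summarize(const std::vector<double>& perRankSeconds) {
    assert(!perRankSeconds.empty());
    const double lowest = *std::min_element(perRankSeconds.begin(), perRankSeconds.end());
    const double highest = *std::max_element(perRankSeconds.begin(), perRankSeconds.end());
    const double sum = std::accumulate(perRankSeconds.begin(), perRankSeconds.end(), 0.0);
    return {lowest, highest, sum / static_cast<double>(perRankSeconds.size())};
}
```

In an MPI program the same three numbers could instead be obtained with `MPI_Reduce` using `MPI_MIN`, `MPI_MAX`, and `MPI_SUM`, which avoids gathering every timing on a single rank. A large gap between Max and Avg for a phase suggests that one rank is doing more work than the others.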
-
Convergence Check
- S + I + R should always sum to 1.0
- Values should stabilize over time
- Final R value indicates total affected population
-
Performance Metrics
- Load balance: Compare execution times across ranks
- Communication overhead: Check MPI phase timings
- Scalability: Compare timings with different process counts
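The checks listed above can be automated with a few small helpers. This is a sketch under stated assumptions: the names `isConserved`, `speedup`, and `efficiency` and the default tolerance of `1e-6` are illustrative and do not come from the project's test suite.

```cpp
#include <cassert>
#include <cmath>

// True when the proportions S + I + R equal 1 within `tolerance`.
// The default tolerance is an assumption matching the six printed decimals.
bool isConserved(double s, double i, double r, double tolerance = 1e-6) {
    return std::fabs((s + i + r) - 1.0) <= tolerance;
}

// Speedup of a parallel run relative to a single-process run of the same case.
double speedup(double serialSeconds, double parallelSeconds) {
    assert(serialSeconds > 0.0 && parallelSeconds > 0.0);
    return serialSeconds / parallelSeconds;
}

// Parallel efficiency: speedup divided by the number of ranks (1.0 is ideal).
double efficiency(double serialSeconds, double parallelSeconds, int numRanks) {
    assert(numRanks > 0);
    return speedup(serialSeconds, parallelSeconds) / static_cast<double>(numRanks);
}
```

For example, a run that takes 8 s on one process and 2 s on four has a speedup of 4 and an efficiency of 1.0. Efficiency that falls as ranks are added usually points to communication overhead or load imbalance.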
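The SIR model is based on the following set of differential equations:
$$
\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad
\frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I, \qquad
\frac{dR}{dt} = \gamma I
$$
where: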
- S, I, and R are the numbers of susceptible, infected, and recovered individuals
- N is the total population size (assumed constant)
- β (beta) is the transmission rate
- γ (gamma) is the recovery rate
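A single time step of the model can be sketched as follows, assuming the standard SIR form in the symbols defined above. Forward Euler is used here only because it is the simplest scheme, and the project may use a different integrator. `SirState` and `eulerStep` are hypothetical names.

```cpp
#include <cassert>

struct SirState {
    double S;  // susceptible
    double I;  // infected
    double R;  // recovered
};

// One forward-Euler step of the standard SIR equations:
//   dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I
// N is taken as S + I + R, which the step keeps constant.
SirState eulerStep(const SirState& state, double beta, double gamma, double dt) {
    const double N = state.S + state.I + state.R;
    assert(N > 0.0 && dt > 0.0);
    const double newInfections = beta * state.S * state.I / N * dt;
    const double newRecoveries = gamma * state.I * dt;
    return {state.S - newInfections,
            state.I + newInfections - newRecoveries,
            state.R + newRecoveries};
}
```

Whatever leaves S enters I, and whatever leaves I enters R, so S + I + R is unchanged by the step, apart from floating-point rounding. The step works equally on counts or on proportions (N = 1).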
Parameters are tuned based on:
- Literature values
- Calibration with observed data
- Sensitivity analysis to assess impact
States are managed using a discrete event simulation approach:
- Events are scheduled for infections, recoveries, and data logging.
- Future events are predicted based on current state and parameters.
- State is updated at each event, and new events are scheduled as needed.
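The event-driven pattern described above can be sketched with a priority queue ordered by event time. All names here (`Event`, `EventQueue`, `Counts`, `runEvents`) and the fixed `recoveryDelay` are assumptions made for illustration, not the project's state manager.

```cpp
#include <cassert>
#include <queue>
#include <vector>

enum class EventType { Infection, Recovery, Log };

struct Event {
    double time;
    EventType type;
};

// Comparator that makes the priority queue pop the earliest event first.
struct EarliestFirst {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

using EventQueue = std::priority_queue<Event, std::vector<Event>, EarliestFirst>;

struct Counts {
    long S;
    long I;
    long R;
};

// Process events in time order up to `endTime`; each infection schedules a
// recovery `recoveryDelay` later. Returns the number of events processed.
int runEvents(EventQueue& queue, Counts& counts, double endTime, double recoveryDelay) {
    assert(recoveryDelay > 0.0);
    int processed = 0;
    while (!queue.empty() && queue.top().time <= endTime) {
        const Event event = queue.top();
        queue.pop();
        switch (event.type) {
            case EventType::Infection:
                if (counts.S > 0) {
                    --counts.S;
                    ++counts.I;
                    queue.push({event.time + recoveryDelay, EventType::Recovery});
                }
                break;
            case EventType::Recovery:
                if (counts.I > 0) {
                    --counts.I;
                    ++counts.R;
                }
                break;
            case EventType::Log:
                break;  // a full simulation would write an output row here
        }
        ++processed;
    }
    return processed;
}
```

Starting from `Counts{9, 1, 0}` with one infection at t = 1.0, one log event at t = 0.5, and a delay of 2.0, the loop handles three events (log, infection, scheduled recovery) and ends with S = 8, I = 1, R = 1.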