🦠 PARALLEL SIR SIMULATION USING MPI & RUNGE-KUTTA METHOD.

🎯 Project Mission

This project, developed for the "Advanced Methods for Scientific Computing" course at Politecnico di Milano, goes beyond a simple SIR model. Our mission was to engineer a high-performance, parallel simulation capable of modeling real-world epidemic dynamics across a large, geographically complex area like the United States. We leveraged C++ and MPI to build a scalable and efficient tool for computational epidemiology, demonstrating how HPC techniques can be applied to solve critical, large-scale societal problems.

Contributors & My Role

This project was a collaborative effort by a talented international team of five engineers. The primary repository is owned by my teammate, Nada Khaled. As a key contributor and strategist for the team, my specific role focused on two main areas:

Architecting the Parallelization Strategy:

I took the lead in designing the core MPI-based parallelization architecture, focusing on an efficient domain decomposition strategy and the implementation of ghost cell communication to ensure data consistency between processes.
My work was crucial for enabling the simulation to scale and run efficiently on multiple processor cores.

Team Management & Strategy:

I helped to guide the team's overall strategy, ensuring our technical decisions were aligned with the project's goals.
This involved facilitating discussions, helping to resolve technical roadblocks, and ensuring that our collaborative workflow was smooth and productive.

Team Members:

Data Source & Preprocessing

Original Data Source

The datasets used in this project are sourced from the JHU CSSE COVID-19 Dataset. Our main input dataset is from February 2, 2021.

Data Preprocessing Steps

Before using a CSV file in the simulation:

Population data is added from external sources for each state
Missing values are filled using values from previous records
Dataset is cleaned to retain only the first 9 essential columns:
- Province_State
- Population
- Last_Update
- Lat
- Long_
- Confirmed
- Deaths
- Recovered
- Active
States are reordered to place geographically adjacent states together
Header row is added with column count and row count
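
The "missing values are filled using values from previous records" step is a forward fill. The project performs preprocessing in scripts/clean_sort_dataset.py; the snippet below is only a minimal C++ sketch of the same idea. The helper name `forwardFill`, the use of `std::optional` to mark missing entries, and the choice to leave leading gaps unfilled are assumptions made for illustration.

```cpp
#include <optional>
#include <vector>

// Hypothetical helper: replace each missing entry with the most recent
// preceding value. Entries before the first known value stay missing.
std::vector<std::optional<double>> forwardFill(
    const std::vector<std::optional<double>>& column) {
    std::vector<std::optional<double>> filled;
    filled.reserve(column.size());
    std::optional<double> last;  // most recent non-missing value seen so far
    for (const auto& value : column) {
        if (value.has_value()) {
            last = value;
        }
        filled.push_back(last);
    }
    return filled;
}
```

For a column such as (missing, 3, missing, missing, 7) this yields (missing, 3, 3, 3, 7).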

Project Structure

.
├── data
│   ├── output                # Simulation results
│   ├── test_results          # Test outputs
│   ├── analysis              # Analysis plots and metrics
│   └── test_datasets         # Raw CSVs for testing
├── header
│   ├── main
│   └── test
├── scripts                   # Python scripts for analysis and plotting
└── src                       # C++ source code
    ├── main.cpp              # Main simulation file
    ├── main
    └── test                  # Test suite for simulation

Implementation Details

Key Components

Data Distribution
- Optimal block division
- Load balancing
- Neighbor cell mapping
MPI Communication
- Block distribution
- Ghost cell updates
- Result gathering
SIR Model
- Differential equations
- Parameter tuning
- State management
Output Handling
- CSV writing
- Logging
- Error handling
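
One common way to get a balanced block division is to give every rank the same base number of cells and hand the leftover cells to the first few ranks. The sketch below shows that scheme; it is not taken from the project source, and the names `BlockRange` and `divideBlocks` are hypothetical.

```cpp
#include <vector>

// Half-open range [begin, end) of cell indices owned by one rank.
struct BlockRange {
    int begin;
    int end;
};

// Hypothetical helper: split numCells items over numRanks ranks so that
// block sizes differ by at most one. The first (numCells % numRanks)
// ranks receive one extra cell.
std::vector<BlockRange> divideBlocks(int numCells, int numRanks) {
    std::vector<BlockRange> ranges;
    if (numRanks <= 0 || numCells < 0) {
        return ranges;  // invalid input: return an empty division
    }
    const int base = numCells / numRanks;
    const int remainder = numCells % numRanks;
    int begin = 0;
    for (int rank = 0; rank < numRanks; ++rank) {
        const int size = base + (rank < remainder ? 1 : 0);
        ranges.push_back(BlockRange{begin, begin + size});
        begin += size;
    }
    return ranges;
}
```

With 50 states on 4 processes this gives blocks of 13, 13, 12 and 12 cells, covering indices [0, 13), [13, 26), [26, 38) and [38, 50).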

Performance Optimization

Load balancing strategies
Asynchronous communication
Efficient I/O operations
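
The data movement behind a ghost cell update can be shown without MPI by emulating all blocks in a single process. The sketch below assumes a 1D chain of blocks with one ghost cell per side; `Block` and `exchangeGhostCells` are hypothetical names. In an MPI build each copy would typically become a send/receive pair between neighboring ranks (for example nonblocking `MPI_Isend` / `MPI_Irecv`), which is not shown here.

```cpp
#include <cstddef>
#include <vector>

// One block of a 1D decomposition: interior values plus one ghost cell
// on each side holding a copy of the neighbor's boundary value.
struct Block {
    std::vector<double> interior;
    double ghostLeft = 0.0;
    double ghostRight = 0.0;
};

// Serial stand-in for a ghost cell update: every block copies the
// boundary value of each neighbor into its own ghost cell. Blocks at the
// ends of the chain keep their previous ghost value (an assumption of
// this sketch).
void exchangeGhostCells(std::vector<Block>& blocks) {
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        if (i > 0 && !blocks[i - 1].interior.empty()) {
            blocks[i].ghostLeft = blocks[i - 1].interior.back();
        }
        if (i + 1 < blocks.size() && !blocks[i + 1].interior.empty()) {
            blocks[i].ghostRight = blocks[i + 1].interior.front();
        }
    }
}
```

Only interior values are read and only ghost values are written, so the order in which blocks are visited does not affect the result. This mirrors why the exchanges between ranks can overlap in a parallel run.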

Building & Running

Prerequisites

C++17 or later compiler
MPI Library
Python 3.x with required packages

Installation Steps

# C++ Requirements
sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install openmpi-bin libopenmpi-dev

# Python Requirements
sudo apt-get install python3 python3-pip
pip3 install numpy pandas matplotlib seaborn

Main Simulation

# Create necessary directories
mkdir -p data/output

# Build
make clean
make all

# Run simulation (e.g., with 4 processes)
mpirun -np 4 ./sir_simulation

# Plot simulation results
python scripts/PlottingSIRModelResults.py

Test Suite

# Create test directories
mkdir -p data/test_results data/analysis

# Build tests
make clean
make test

# Run tests
mpirun -np 4 ./sir_test_suite

# Analyze test results
python scripts/analyze_results.py

Testing & Analysis

Adding New Test Data

Place raw CSV in data/test_datasets/

Run preprocessing:

python scripts/clean_sort_dataset.py data/test_datasets/your_dataset.csv

Verify the preprocessing steps:
- Population data added
- Missing values filled
- Only essential columns retained
- States geographically sorted
Update test configurations in src/test/TestSuite.cpp

Available Test Datasets

sorted_01-01-2021.csv
- First wave 2020 data
- 50 states complete data
- Used for base temporal tests
sorted_02-05-2021.csv
- Second wave 2021 data
- 50 states complete data
- Used for comparative analysis

Test Requirements

Must contain all 50 US states
Population values must be positive
Missing values handled as zeros
Dates in YYYY-MM-DD format
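
The requirements above can be checked mechanically before a dataset is added to the test suite. The following is a rough sketch, not the project's validation code: `StateRecord`, `hasIsoDateShape` and `satisfiesTestRequirements` are hypothetical names, and the check only counts records and inspects the date shape; it does not verify state names or calendar validity.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical record holding only the fields the checks below need.
struct StateRecord {
    std::string provinceState;
    double population;
    std::string lastUpdate;  // expected as YYYY-MM-DD
};

// True when the text has the shape YYYY-MM-DD (digits only; month and
// day ranges are not validated here).
bool hasIsoDateShape(const std::string& text) {
    static const std::regex pattern(R"(\d{4}-\d{2}-\d{2})");
    return std::regex_match(text, pattern);
}

// Checks the listed requirements: 50 records, positive population and
// a date in the expected shape for every record.
bool satisfiesTestRequirements(const std::vector<StateRecord>& records) {
    if (records.size() != 50) {
        return false;
    }
    for (const auto& record : records) {
        if (record.population <= 0.0) {
            return false;
        }
        if (!hasIsoDateShape(record.lastUpdate)) {
            return false;
        }
    }
    return true;
}
```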

Output & Analysis

File Structure

data/output/: Main simulation results
data/test_results/: Test outputs
data/analysis/: Analysis plots and metrics

Analysis Scripts

Main Results:
```
python scripts/PlottingSIRModelResults.py
```

Test Analysis:
```
python scripts/analyze_results.py
```

Simulation Results

Output File Formats

Main Simulation Results

Location: data/output/simulation_results.csv

Time,S_avg,I_avg,R_avg
0.0,0.950000,0.050000,0.000000
0.2,0.947331,0.052669,0.000000
0.4,0.944516,0.055484,0.000000
...

Where:

Time: Simulation time at which the row was recorded (0.2 apart in the sample above)
S_avg: Proportion of susceptible population
I_avg: Proportion of infected population
R_avg: Proportion of recovered population
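
Since the three columns are proportions, each data row can be sanity-checked by parsing it and confirming that S_avg + I_avg + R_avg is close to 1. The sketch below assumes the four-column layout shown above; `ResultRow`, `parseResultRow`, `conservesPopulation` and the default tolerance of 1e-6 are illustrative choices, not part of the project.

```cpp
#include <cmath>
#include <sstream>
#include <string>

// One data row of the results file: Time,S_avg,I_avg,R_avg.
struct ResultRow {
    double time;
    double s;
    double i;
    double r;
};

// Parses a comma-separated data row. Returns false if a field is
// missing or not numeric, which also rejects the header line.
bool parseResultRow(const std::string& line, ResultRow& row) {
    std::istringstream stream(line);
    char c1 = 0, c2 = 0, c3 = 0;
    if (!(stream >> row.time >> c1 >> row.s >> c2 >> row.i >> c3 >> row.r)) {
        return false;
    }
    return c1 == ',' && c2 == ',' && c3 == ',';
}

// True when the three proportions sum to 1 within the given tolerance.
bool conservesPopulation(const ResultRow& row, double tolerance = 1e-6) {
    return std::fabs(row.s + row.i + row.r - 1.0) <= tolerance;
}
```

The tolerance should be at least as large as the rounding in the file, which prints six decimal places.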

Test Results

Location: data/test_results/<test_name>_p<num_processes>_results.csv

Time,S_avg,I_avg,R_avg
0.0,0.950000,0.050000,0.000000
...

Performance Metrics

Location: data/output/timing_log.csv

PhaseName,Statistic,Value,Units,NumRanks
distributeBlocks_Total,Min,0.000123,s,4
distributeBlocks_Total,Max,0.000145,s,4
distributeBlocks_Total,Avg,0.000134,s,4
...

Generated Plots

1. SIR Evolution

Location: plots/sir_global_line_plot.png

Shows the temporal evolution of S, I, R populations
X-axis: Time steps
Y-axis: Population proportions
Three lines: Susceptible (blue), Infected (red), Recovered (green)

2. Infection Heatmap

Location: plots/infection_heatmap_per_rank.png

Visualizes infection spread across MPI ranks
X-axis: Time steps
Y-axis: MPI ranks
Color intensity: Infection level (darker = higher infection)

3. Performance Analysis

Location: plots/timing_comparison_phases.png

Compares execution times across simulation phases
Shows min/max/avg times for each phase
Helps identify performance bottlenecks

Interpreting Results

Convergence Check
- S + I + R should always sum to 1.0
- Values should stabilize over time
- Final R value indicates total affected population
Performance Metrics
- Load balance: Compare execution times across ranks
- Communication overhead: Check MPI phase timings
- Scalability: Compare timings with different process counts

🔍 Detailed Implementation

SIR Model Equations

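The SIR model is based on the following set of differential equations:

$$
\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad
\frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I, \qquad
\frac{dR}{dt} = \gamma I
$$

where: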

S, I, and R are the numbers of susceptible, infected, and recovered individuals
N is the total population size (assumed constant)
β (beta) is the transmission rate
γ (gamma) is the recovery rate

Parameter Tuning

Parameters are tuned based on:

Literature values
Calibration with observed data
Sensitivity analysis to assess impact

State Management

States are managed using a discrete event simulation approach:

Events are scheduled for infections, recoveries, and data logging.
Future events are predicted based on current state and parameters.
State is updated at each event, and new events are scheduled as needed.
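
The event-driven loop described above can be sketched with a time-ordered priority queue. This is an illustration only, not the project's implementation: the types `Event`, `EventQueue` and `Counts`, the function `runEvents`, and the fixed delay between an infection and its recovery are all assumptions.

```cpp
#include <queue>
#include <vector>

enum class EventType { Infection, Recovery, Log };

struct Event {
    double time;
    EventType type;
};

// Comparator that puts the earliest event on top of the priority queue.
struct LaterFirst {
    bool operator()(const Event& a, const Event& b) const {
        return a.time > b.time;
    }
};

using EventQueue = std::priority_queue<Event, std::vector<Event>, LaterFirst>;

struct Counts {
    int s;
    int i;
    int r;
};

// Processes events in time order up to endTime and returns the time of
// the last processed event. Each infection schedules a recovery after
// a fixed delay (a simplification for this sketch).
double runEvents(EventQueue& queue, Counts& state, double endTime,
                 double recoveryDelay) {
    double now = 0.0;
    while (!queue.empty() && queue.top().time <= endTime) {
        const Event event = queue.top();
        queue.pop();
        now = event.time;
        switch (event.type) {
            case EventType::Infection:
                if (state.s > 0) {
                    --state.s;
                    ++state.i;
                    queue.push(Event{now + recoveryDelay, EventType::Recovery});
                }
                break;
            case EventType::Recovery:
                if (state.i > 0) {
                    --state.i;
                    ++state.r;
                }
                break;
            case EventType::Log:
                break;  // a full implementation would record the state here
        }
    }
    return now;
}
```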
