This repository is being prepared for archival. It represents a comprehensive historical record of COVID-19 data collection, analysis, and modeling efforts for South Africa from March 2020 through the end of 2021. The data and code are preserved for:
- Historical research and retrospective analysis
- Academic reference and citation
- Public health preparedness and future pandemic response
- Data science education and methodology demonstration
- Transparency and reproducibility of pandemic-era decision support
The repository remains publicly available under open licenses (MIT for code, CC BY-SA 4.0 for data) to ensure long-term accessibility for researchers, policymakers, and the public.
- Project Overview
- Historical Context and Impact
- Repository Structure
- Data Availability
- Data Coverage and Completeness
- Technical Documentation
- Data Sources
- Scholarly Work and Citations
- Community Showcase
- Licenses
- Contributors
- Contact
The Coronavirus COVID-19 Data Repository for South Africa was created, maintained, and hosted by the Data Science for Social Impact (DSFSI) research group at the University of Pretoria, led by Dr. Vukosi Marivate.
This repository served as the primary open-access COVID-19 data infrastructure for South Africa, providing:
- Systematically collected data from official sources (NICD, DoH)
- Provincial and district-level granularity for spatial analysis
- Multiple data streams: cases, deaths, recoveries, testing, vaccinations, hospital surveillance, excess mortality, and mobility
- Analysis notebooks for epidemiological modeling and forecasting
- Standardized data formats enabling reproducible research
- Community collaboration through open-source contribution
- 70+ scholarly citations and publications
- Powered 14+ independent web applications and dashboards used by the public
- Daily updates maintained throughout the pandemic (March 2020 onwards)
- Collaboration with 40+ volunteers from academia, industry, and civil society
- Africa-wide model inspiring the broader COVID-19 Africa project
- Published in Data Science Journal (peer-reviewed methodology paper)
Disclaimer: We worked to keep data as accurate as possible. Data was collated from NICD and DoH official reports and statements. Updates occurred only after official announcements to ensure data integrity and traceability.
- Blog Posts:
- Africa-wide Effort: COVID-19 Africa Repository
- Dashboard: COVID-19 Dashboard
This repository was established in March 2020 at the onset of the COVID-19 pandemic in South Africa. It filled a critical gap by providing:
- Open Access to Official Data: Making government health data accessible in machine-readable formats
- Rapid Response Infrastructure: Enabling real-time analysis by researchers, journalists, and policymakers
- Transparent Methodology: All data collection, cleaning, and analysis methods documented and open-source
- Community-Driven Science: Demonstrating collaborative data science for public good
- African Data Leadership: Setting standards for pandemic data sharing across the continent
The repository supported evidence-based decision-making during the pandemic and continues to serve as a historical record for retrospective analysis and future pandemic preparedness.
Primary directory containing CSV files for South African COVID-19 data:
- Provincial Timelines: Cumulative confirmed cases, deaths, recoveries, testing, and vaccinations by province
- District Data: Fine-grained geographic data organized by province (Eastern Cape, Free State, Gauteng, KwaZulu-Natal, Limpopo, Mpumalanga, Northern Cape, North West, Western Cape)
- Hospital Surveillance: NICD hospital admission and surveillance data
- Excess Mortality: SAMRC (South African Medical Research Council) excess deaths data
- Mobility Data: Apple, Google, and Facebook mobility indicators
- Health System Data: Public and private hospital facility information
- Official Reports: Archive of DoH PDF reports and extracted data
See Data Availability section for detailed dataset listing.
Jupyter notebooks for epidemiological analysis:
- Growth Analysis: Provincial and international COVID-19 growth comparisons
- Bayesian Inference Models: Statistical modeling of spreading rates
- R0 Estimation: Reproduction number calculations (based on Kevin Systrom's real-time R0 method)
- Mobility Studies: Correlation between mobility patterns and case trends
- Geospatial Visualization: Choropleth maps and spatial analysis
- Vaccination Analytics: Vaccine rollout tracking and analysis
Usage: Clone the repository and run notebooks locally. Some notebooks fetch external data; others use local /data files.
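As a rough illustration of the kind of growth comparison listed above, the sketch below estimates a doubling time from a cumulative case series by fitting a straight line to the log of the counts. It is not taken from the repository's notebooks: the helper name `doubling_time_days` and the sample numbers are hypothetical, and the approach assumes roughly exponential growth over the chosen window.

```python
import numpy as np

def doubling_time_days(cumulative_cases):
    """Estimate doubling time (in days) from daily cumulative counts.

    Assumes roughly exponential growth over the window and strictly
    positive counts. Hypothetical helper, not part of the repository.
    """
    counts = np.asarray(cumulative_cases, dtype=float)
    if counts.size < 2 or np.any(counts <= 0):
        raise ValueError("need at least two strictly positive counts")
    days = np.arange(counts.size)
    # Slope of log(count) against day is the exponential growth rate r.
    growth_rate, _intercept = np.polyfit(days, np.log(counts), 1)
    if growth_rate <= 0:
        return float("inf")  # a flat or shrinking series never doubles
    # Doubling time follows from exp(r * t) = 2, so t = ln(2) / r.
    return float(np.log(2) / growth_rate)

# Made-up series that doubles every 3 days (not real data).
sample = [100 * 2 ** (d / 3) for d in range(10)]
print(round(doubling_time_days(sample), 2))
```

A short window (one to two weeks) keeps the exponential assumption more plausible; over longer periods the estimate blends phases with very different growth.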
Standardized repository for COVID-19 forecasting and projection models, inspired by the Reich Lab COVID-19 Forecast Hub.
- Model Directory: /modelhub/notebooks/
- Template: model_template.ipynb for standardized model development
- Automation: Models designed for nightly automated execution
- Licensing: Model code (MIT), Model outputs (CC BY-SA 4.0)
Python scripts for data collection and processing:
- realtime_r0.py - Automated R0 (effective reproduction number) calculation
- gp_pdf_extractor.py - Extracts COVID-19 data from Gauteng Province PDF reports (requires pdfplumber)
- mobility_scraper.py - Scrapes mobility data from various sources
- sacoronavirus_provincial_vaccine.py - Vaccination data scraper
Dependencies: See individual script headers and /notebooks/covid-model/requirements.txt
Go-based CLI tool for scraping NICD/DoH media releases. Note: No longer functional as NICD/DoH stopped providing individual patient data in releases after March 24, 2020.
Build from source: run go build in the ./scraper directory
Directory for storing output graphics, charts, and visualizations generated from analysis notebooks.
API-related code and infrastructure for programmatic data access.
Additional documentation, methodological notes, and resources.
All datasets are located in the /data directory and are available in CSV format for ease of analysis.
Testing and Reporting Lag: Daily reports reflect new positive test results released by the National Department of Health or NICD, but significant lag exists between test date and reporting date.
Example: In epidemiological Week 1 of 2021 (3-9 Jan), approximately 33,000 new cases were reported in daily announcements. However, the NICD Testing Summary Report for Week 3 of 2021 showed 43,635 positive tests for Week 1. This discrepancy arises because:
- Cases reported in daily announcements may be from prior weeks
- Tests conducted in a given week may only be reported in subsequent weeks
Implication: Temporal analyses must account for reporting lag. The reported date does not necessarily reflect the actual test date or illness onset date.
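One way to compare daily announcements with weekly test-date totals is to turn a cumulative series into daily increments and sum them over Sunday-to-Saturday weeks, as in the 3-9 Jan 2021 example. The sketch below uses made-up numbers and a hypothetical helper name (`weekly_new_cases`); it also works through the gap between the two figures quoted above. Because the announced figure is approximate and includes cases from earlier weeks, the resulting share is indicative only.

```python
import pandas as pd

def weekly_new_cases(cumulative: pd.Series) -> pd.Series:
    """Sum daily increments of a cumulative series into Sunday-to-Saturday weeks.

    `cumulative` must have a DatetimeIndex. Each week is labelled by its
    Saturday end date. Hypothetical helper, not part of the repository.
    """
    daily = cumulative.sort_index().diff().dropna()
    return daily.resample("W-SAT").sum()

# Made-up cumulative totals (not real data) covering 2-10 Jan 2021.
dates = pd.date_range("2021-01-02", "2021-01-10", freq="D")
cumulative = pd.Series(
    [1000, 1100, 1250, 1400, 1600, 1850, 2100, 2400, 2500], index=dates
)
weekly = weekly_new_cases(cumulative)
print(weekly)

# Gap between the two figures quoted in the text for Week 1 of 2021.
announced = 33_000      # approximate total from daily announcements
by_test_date = 43_635   # NICD Testing Summary Report figure
shortfall = by_test_date - announced
shortfall_share = shortfall / by_test_date
print(shortfall, round(shortfall_share, 3))
```

Weekly totals built this way are still indexed by reporting date, so they describe when cases were announced rather than when tests were taken.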
Complete, maintained datasets covering the pandemic period:
| Dataset | Description | Repository Link | Raw CSV URL |
|---|---|---|---|
| Provincial Confirmed Cases | Cumulative confirmed COVID-19 cases by province over time | Link | CSV |
| Provincial Recoveries | Cumulative recoveries by province over time | Link | CSV |
| Provincial Testing | Cumulative COVID-19 tests conducted by province | Link | CSV |
| Provincial Deaths | Cumulative deaths by province over time | Link | CSV |
| Vaccination Timeline | COVID-19 vaccination rollout data over time | Link | CSV |
| Death Statistics | Detailed death statistics and demographics | Link | CSV |
| Transmission Type | Classification of transmission types (local, imported, etc.) | Link | CSV |
| Testing Timeline | National-level testing data over time | Link | CSV |
| District Data | District and subdistrict level data by province | Link | Multiple CSVs |
| DoH PDF Reports | Department of Health PDF reports and extracted data | Link | Multiple files |
| DoH WhatsApp Archive | Archive of DoH WhatsApp case update messages | Link | Text files |
| Health Facilities | Public and private hospital/facility information | Link | CSV |
| NICD Daily Reports | NICD daily national reports | Link | CSV |
| NICD Hospital Surveillance | Hospital admission and surveillance data from NICD | Link | CSV |
| SAMRC Excess Deaths | Excess mortality data by province (SAMRC) | Link | CSV |
| Mobility Data | Apple, Google, and Facebook mobility indicators | Link | Multiple CSVs |
IMPORTANT NOTE: From approximately March 24, 2020, DoH and NICD no longer provided individual case-level data.
For provincial-level analysis from March 26, 2020 onwards, use the provincial_cumulative_timeline_* datasets.
For individual case data up to March 25, 2020, use the confirmed_cases dataset below.
| Dataset | Coverage Period | Repository Link | Raw CSV URL |
|---|---|---|---|
| Individual Confirmed Cases | March 5 - March 25, 2020 | Link | CSV |
| Individual Deaths | Limited early period | Link | CSV |
- Primary Data Collection Period: March 2020 - End of 2021
- Individual Case Data: March 5, 2020 - March 25, 2020 (discontinued by authorities)
- Provincial Aggregate Data: March 2020 - End of 2021
- Vaccination Data: Begins with vaccine rollout in 2021
- Hospital Surveillance: Varies by dataset
- Excess Mortality: Provided by SAMRC on periodic basis through 2021
- National Level: All-South Africa aggregates
- Provincial Level: All 9 provinces (Eastern Cape, Free State, Gauteng, KwaZulu-Natal, Limpopo, Mpumalanga, Northern Cape, North West, Western Cape)
- District Level: District and subdistrict data where available (see /data/district_data/)
- Reporting Changes: Data collection methods and reporting formats evolved throughout the pandemic
- Missing Periods: Some datasets may have gaps due to reporting delays or changes in official data release practices
- Source Dependency: All data depends on official government reporting; accuracy reflects source data quality
- Backfilling: Some data points were backfilled or corrected in subsequent official reports
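Corrections and backfilling tend to show up as drops in a cumulative series, which should otherwise never decrease. The following sketch flags such points before any daily increments are computed. The helper name `find_downward_revisions` and the sample values are hypothetical; the repository does not necessarily include a check like this.

```python
import pandas as pd

def find_downward_revisions(cumulative: pd.Series) -> pd.Series:
    """Return the negative day-on-day changes in a cumulative series.

    A cumulative count should never fall, so a negative change usually
    signals a correction or backfill in the official reports.
    """
    changes = cumulative.diff()
    return changes[changes < 0]

# Made-up example: the total is revised down on the fourth day.
cumulative = pd.Series(
    [10, 15, 22, 20, 30],
    index=pd.date_range("2020-06-01", periods=5, freq="D"),
)
revisions = find_downward_revisions(cumulative)
print(revisions)
```

How to handle a flagged date (clip to zero, spread the correction over earlier days, or leave it as reported) depends on the analysis and is best stated explicitly.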
Primary Analysis Environment: Python 3.7+
Core Dependencies (for COVID modeling notebooks in /notebooks/covid-model/):
pip install -r notebooks/covid-model/requirements.txt

Key Libraries:
- pandas - Data manipulation and analysis
- numpy - Numerical computing
- scipy - Scientific computing (statistics, interpolation)
- pymc3 - Bayesian statistical modeling
- matplotlib, seaborn - Visualization
- requests - HTTP requests for data fetching
- pdfplumber - PDF data extraction (for GP PDF extractor script)
Some notebooks use R for statistical analysis. Required packages vary by notebook.
- Encoding: UTF-8
- Delimiter: Comma (,)
- Date Format: DD-MM-YYYY (primary format, some datasets use YYYY-MM-DD)
- Province Codes: Standard South African province abbreviations (EC, FS, GP, KZN, LP, MP, NC, NW, WC)
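Because two date formats occur across datasets, parsing with an explicit format avoids silent day/month swaps. Below is a minimal sketch that tries DD-MM-YYYY first and falls back to YYYY-MM-DD. The helper name `parse_mixed_dates` is hypothetical and the sample values are made up.

```python
import pandas as pd

def parse_mixed_dates(values: pd.Series) -> pd.Series:
    """Parse DD-MM-YYYY strings, falling back to YYYY-MM-DD.

    Values matching neither format become NaT rather than raising.
    """
    text = values.astype(str).str.strip()
    primary = pd.to_datetime(text, format="%d-%m-%Y", errors="coerce")
    fallback = pd.to_datetime(text, format="%Y-%m-%d", errors="coerce")
    # Keep the primary parse where it succeeded, else use the fallback.
    return primary.fillna(fallback)

raw = pd.Series(["05-03-2020", "2020-03-06", "not a date"])
parsed = parse_mixed_dates(raw)
print(parsed)
```

Counting the NaT values after parsing gives a quick check on whether a file contains a third, unexpected format.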
District-level data follows strict standards documented in /data/district_data/README.md:
- Required Columns: date, YYYYMMDD, source
- District Levels:
  - Level 0: Province-level data
  - Level 1: District-level data
  - Level 2: Subdistrict-level data
- Column Management: Do NOT rename existing columns; update friendly names in combined_district_keys.csv
- New Columns: Add at the end, before the source column
Most datasets require source attribution (typically the last column). Each new data point should reference the official report or statement.
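The column and attribution rules above lend themselves to a quick automated check. The sketch below is one possible reading of those rules, not the project's own tooling: the helper names `check_district_frame` and `add_column_before_source`, the district column names and the example URL are all hypothetical.

```python
import pandas as pd

REQUIRED_COLUMNS = ["date", "YYYYMMDD", "source"]

def check_district_frame(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means none."""
    problems = []
    for column in REQUIRED_COLUMNS:
        if column not in df.columns:
            problems.append(f"missing required column: {column}")
    if "source" in df.columns:
        if df.columns[-1] != "source":
            problems.append("source should be the last column")
        # A row counts as unattributed if source is missing or whitespace.
        blank = df["source"].isna() | (df["source"].astype(str).str.strip() == "")
        if blank.any():
            problems.append(f"{int(blank.sum())} row(s) have no source")
    return problems

def add_column_before_source(df: pd.DataFrame, name: str, values) -> pd.DataFrame:
    """Return a copy with a new column placed just before `source`."""
    out = df.copy()
    out.insert(out.columns.get_loc("source"), name, values)
    return out

# Made-up frame: the second row lacks a source.
frame = pd.DataFrame({
    "date": ["01-06-2020", "02-06-2020"],
    "YYYYMMDD": [20200601, 20200602],
    "Example District": [12, 15],
    "source": ["https://example.org/report-1", None],
})
print(check_district_frame(frame))
extended = add_column_before_source(frame, "Another District", [3, 4])
print(list(extended.columns))
```

Returning a list of problems rather than raising on the first one lets a contributor fix everything in a single pass.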
# Clone the repository
git clone https://github.com/dsfsi/covid19za.git
cd covid19za
# Install Python dependencies
pip install -r notebooks/covid-model/requirements.txt
# Explore data
cd data
ls -la
# Run analysis notebooks
cd ../notebooks
jupyter notebook

All CSV files can be accessed directly via raw GitHub URLs:
import pandas as pd
# Load provincial confirmed cases
url = "https://raw.githubusercontent.com/dsfsi/covid19za/master/data/covid19za_provincial_cumulative_timeline_confirmed.csv"
df = pd.read_csv(url)

- Navigate to the /notebooks directory
- Open desired notebook in Jupyter
- Some notebooks fetch external data; others use local /data files
- Ensure dependencies are installed
# Calculate real-time R0
python scripts/realtime_r0.py
# Extract data from Gauteng Province PDFs
pip install pdfplumber
python scripts/gp_pdf_extractor.py
# Scrape mobility data
python scripts/mobility_scraper.py

All data in this repository was collected from official government and public health sources to ensure accuracy and credibility.
| Source | Description | URL |
|---|---|---|
| NICD | National Institute for Communicable Diseases - Primary source for epidemiological data, alerts, and reports | nicd.ac.za |
| Department of Health (DoH) | South African National Department of Health - Official health announcements and statistics | health.gov.za |
| DoH Twitter | Real-time health updates and announcements | @HealthZA |
| SA Government | Official government media statements and policy announcements | gov.za |
| SAMRC | South African Medical Research Council - Excess mortality data | samrc.ac.za |
| Source | Description | URL |
|---|---|---|
| DHIS Data Dictionary | National Department of Health Data Dictionary - Standard health data definitions | dd.dhmis.org |
| Statistics SA | Statistics South Africa - Demographic and population data | statssa.gov.za |
| Apple Mobility | Apple mobility trend reports | apple.com/covid19/mobility |
| Google Mobility | Google Community Mobility Reports | google.com/covid19/mobility |
| Facebook Data for Good | Facebook mobility and connectivity data | dataforgood.facebook.com |
| MedPages | South African medical information resource | medpages.info |
- Daily Monitoring: Official sources were monitored daily for new reports and announcements
- Manual Verification: Data was manually verified against multiple sources when available
- Source Attribution: Each data point includes reference to the source document/announcement
- Version Control: All changes tracked via Git for full transparency and reproducibility
- Community Review: Open-source model allowed community verification and error reporting
For visualizations, notebooks, or web applications:
Data Science for Social Impact Research Group @ University of Pretoria, Coronavirus COVID-19 (2019-nCoV) Data Repository for South Africa. Available on: https://github.com/dsfsi/covid19za.
For academic publications - Journal Article:
@article{marivate2020use,
Author = {Vukosi Marivate and Herkulaas MvE Combrink},
Journal = {Data Science Journal},
Number = {1},
Pages = {1-7},
Title = {Use of Available Data To Inform The COVID-19 Outbreak in South Africa: A Case Study},
Volume = {19},
Year = {2020},
DOI = {10.5334/dsj-2020-019},
URL = {https://doi.org/10.5334/dsj-2020-019}
}

For academic publications - Dataset:
@dataset{marivate_vukosi_2020_3819126,
author = {Marivate, Vukosi and Arbi, Riaz and Combrink, Herkulaas and
de Waal, Alta and Dryza, Henkho and Egersdorfer, Derrick and
Garnett, Shaun and Gordon, Brent and Greyling, Lizel and
Lebogo, Ofentswe and Mackie, Dave and Merry, Bruce and
Mkhondwane, S'busiso and Mokoatle, Mpho and Moodley, Shivan and
Mtsweni, Jabu and Mtsweni, Nompumelelo and Myburgh, Paul and
Richter, Jannik and Rikhotso, Vuthlari and Rosen, Simon and
Sefara, Joseph and van der Walt, Anelda and van Heerden, Schalk and
Welsh, Jay and Hazelhurst, Scott and Petersen, Chad and
Mbuvha, Rendani and Dhlamini, Nelisiwe and James, Vaibhavi},
title = {{Coronavirus disease (COVID-19) case data - South Africa}},
month = mar,
year = 2020,
publisher = {Zenodo},
doi = {10.5281/zenodo.3819126},
url = {https://doi.org/10.5281/zenodo.3819126}
}

This repository has been cited in 70+ scholarly publications spanning epidemiology, public health, data science, and social sciences.
Explore citations: Google Scholar
- Marivate, V., & Combrink, H. M. (2020). "Use of Available Data To Inform The COVID-19 Outbreak in South Africa: A Case Study." Data Science Journal, 19(1), 1-7. DOI: 10.5334/dsj-2020-019
This repository has powered numerous independent projects that made COVID-19 data accessible to the South African public:
| # | Project | Description | Links | Creator | Country | Status |
|---|---|---|---|---|---|---|
| 1 | Covid-19 SA Data | Data visualizations for the COVID-19 outbreak in South Africa | Archive • GitHub | Simon Rosen | South Africa | Archived |
| 2 | Covid-19 Testing Map | Map of COVID-19 testing facilities | Archive • GitHub | Yannick Zehnder | Switzerland | Archived |
| 3 | Coronavirus Map | Interactive coronavirus mapping tool | Archive • GitHub | Jay Welsh | South Africa | Archived |
| 4 | Covid-19 Telegram Bot | Coronavirus statistics via Telegram | Bot Link | CodeChap | South Africa | Unknown |
| 5 | Xitsonga Dashboard | COVID-19 dashboard in Xitsonga language | Archive | xitsonga.org | South Africa | Archived |
| 6 | Hospital Capacity Viz | Mapping local hospital capacity (public and private) | Archive • GitHub | Nompumelelo | South Africa | Archived |
| 7 | Covid-19 Trends | COVID-19 analytics dashboard for South Africa | Archive • GitHub | Schalk van Heerden | South Africa | Archived |
| 8 | Tshivenda Dashboard | COVID-19 dashboard in Tshivenda language | Archive | luvenda.com | South Africa | Archived |
| 9 | Health Facilities Map | Map of health facilities with comparable details | Website • GitHub | Team | South Africa | Active |
| 10 | R Interactive Map | Health facilities viewer built with R (afrimapr) | Archive • GitHub | Dr Andy South | United Kingdom | Archived |
| 11 | R Number Estimation | Estimating effective reproductive number for SA and provinces (last updated July 2022) | Website | Louis Rossouw | South Africa | Inactive |
| 12 | Provincial Modeling | COVID-19 modeling using reported and excess deaths (last updated July 2021) | Website | Louis Rossouw | South Africa | Inactive |
| 13 | Provincial Visualization | Deaths, cases, recoveries with mobility data visualization | Archive • GitHub | Christopher Marais | South Africa | Archived |
| 14 | Multi-strain Optimization | Differential Evolution for long-term multi-strain modeling | IEEE Paper | CJ Pretorius & MC du Plessis | South Africa | Active |
- 14+ web applications built by independent developers (many now archived as pandemic-era projects)
- Multi-lingual accessibility (English, Xitsonga, Tshivenda)
- International reach (projects from Switzerland, UK, SA)
- Diverse applications: mapping, forecasting, messaging bots, academic research
Note: Many community projects have been archived or discontinued as the active pandemic phase ended. Wayback Machine links are provided where original sites are no longer available.
Note: As this repository transitions to archival status, active contribution may be limited. However, error corrections and data quality improvements are welcome.
During active maintenance, the contribution process was:
- Choose or Create an Issue: Browse existing issues or create a new one describing your contribution
- Assign to Yourself: Take ownership of the issue
- Adopt a File: Add your name to covid19za_volunteer_adopted_files.csv, listing which files you'll work on
- Fork the Repository: Create your own fork
- Make Changes: Ensure data accuracy and include source attribution
- Submit Pull Request: Follow GitHub PR guidelines
- Data Accuracy: Verify all data against official sources
- Source Attribution: Include source for every data point
- Format Consistency: Follow existing CSV structure and date formats
- Documentation: Update relevant documentation for significant changes
- Data Science Africa COVID-19 Response
- IndabaX South Africa: Vukosi Marivate - Using data science to inform the COVID-19 outbreak in Africa
- Stanford CS472 Data science and AI for COVID-19
This project was made possible by the dedication of 40+ volunteers from academia, industry, and civil society.
Full contributor list: GitHub Contributors Graph
Special thanks to all volunteers who contributed data collection, validation, analysis, and development throughout the pandemic.
- Code (MIT License): All software, scripts, and notebooks are licensed under the MIT License, allowing free use, modification, and distribution with attribution
- Data (CC BY-SA 4.0): All datasets are licensed under Creative Commons Attribution-ShareAlike 4.0, requiring attribution and share-alike for derivative works
These open licenses ensure long-term accessibility and reusability for research, education, and public health purposes.
Project Lead:
- Dr. Vukosi Marivate
- Email: vukosi.marivate@cs.up.ac.za
- Twitter: @vukosi
- Affiliation: Data Science for Social Impact Research Group, University of Pretoria
Research Group:
- Data Science for Social Impact (DSFSI)
- Department of Computer Science, University of Pretoria
For Questions:
- Data Issues: Please open a GitHub issue describing the problem
- Research Collaborations: Contact Dr. Vukosi Marivate via email
- General Inquiries: vukosi.marivate@cs.up.ac.za
This project was made possible through support from:
- University of Pretoria - Hosting and institutional support
- Data Science for Social Impact Research Group - Research leadership and coordination
- 40+ volunteer contributors - Data collection, validation, and analysis
- South African data science community - Code reviews, testing, and feedback
- 14+ independent developers - Building applications and tools using this data
- 70+ researchers - Citing and validating this work through scholarly publications
We acknowledge the critical work of:
- NICD (National Institute for Communicable Diseases) - Primary data collection and public health surveillance
- National Department of Health - Official reporting and public health communication
- SAMRC (South African Medical Research Council) - Excess mortality analysis
- All healthcare workers and public health officials who worked tirelessly during the pandemic
- Repository: github.com/dsfsi/covid19za
- Zenodo DOI: 10.5281/zenodo.3819126
- Data Science Journal DOI: 10.5334/dsj-2020-019
- Primary Language: Python
- Data Format: CSV
- Status: Archival
- Last Active Update: End of 2021 (varies by dataset)
- COVID-19 Africa Repository - Africa-wide COVID-19 data effort
- Health Facility Map - South African health facility mapping
This repository stands as a testament to open science, community collaboration, and the power of data to inform public health responses. We hope it continues to serve researchers, educators, and public health professionals for years to come.