A comprehensive COVID-19 data processing pipeline that ingests and analyzes data from multiple sources. The project implements a batch processing architecture built on Spring Boot, Apache Spark, HDFS, and Hive.
```
Data Sources → Data Ingestion → HDFS (Raw) → Spark Processing → Hive → API → React Frontend
     │               │               │               │            │       │          │
 Our World      Spring Boot     HDFS Storage    Apache Spark    Apache  REST API   React UI
  in Data        Ingestion      (Data Lake)      (ETL Jobs)      Hive    (JSON)    (Charts)
```
- Data Sources: Our World in Data COVID-19 dataset
- Data Ingestion: Spring Boot service for downloading and storing data in HDFS (see the sketch after this list)
- Data Processing: Apache Spark jobs for ETL processing
- Data Storage: HDFS for raw data storage
- Data Warehouse: Apache Hive for analytical queries
- API Layer: Spring Boot REST API
- Frontend: React with TypeScript and Recharts
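
For a feel of the ingestion step, here is a minimal sketch that downloads the Our World in Data JSON and lands it unchanged in HDFS. It assumes the Hadoop HDFS client on the classpath; the class name and the `raw/` subdirectory are illustrative, and the project's actual `DataIngestionService` may be structured differently.

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OwidIngestionSketch {
    public static void main(String[] args) throws Exception {
        String sourceUrl = "https://covid.ourworldindata.org/data/owid-covid-data.json";

        // Download the raw OWID snapshot.
        HttpClient http = HttpClient.newHttpClient();
        HttpResponse<InputStream> response = http.send(
                HttpRequest.newBuilder(URI.create(sourceUrl)).GET().build(),
                HttpResponse.BodyHandlers.ofInputStream());

        // Land it unmodified in the HDFS data lake under the configured base path.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
             InputStream in = response.body()) {
            Path target = new Path("/covid19-data/raw/owid-covid-data.json");
            try (var out = fs.create(target, true)) {  // true = overwrite existing file
                in.transferTo(out);
            }
        }
    }
}
```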
- Docker and Docker Compose
- Java 17+
- Node.js 18+ (for local development)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd Covid19_Data_Tracker
  ```

- Start all services:

  ```bash
  docker-compose up -d
  ```

- Run the Spark job to process data:

  ```bash
  docker exec spark-job /opt/spark/run-covid-job.sh
  ```

- Access the applications:
  - Frontend: http://localhost:3000
  - Backend API: http://localhost:8081/api
  - HDFS NameNode: http://localhost:9870
  - Hive Server: http://localhost:10000
  - Spark Master: http://localhost:8080
- Start the backend:

  ```bash
  mvn spring-boot:run
  ```

- Start the frontend:

  ```bash
  cd covid19-visualization
  npm install
  npm start
  ```
- Batch Data Ingestion: Automated data collection from Our World in Data
- ETL Pipeline: Apache Spark jobs for data transformation and cleaning
- Data Quality: Validation and error handling for data integrity
- Schema Enforcement: Consistent data schema across all countries
- COVID-19 Metrics: Cases, deaths, recoveries, and trends
- Country-wise Analysis: Data analysis by country
- Time Series Analysis: Trend visualization over time
- Statistical Analysis: Summary statistics and data insights
- Interactive Dashboards: Real-time charts and graphs
- Geographic Data: Country and regional comparisons
- Comparative Analysis: Side-by-side comparisons of different metrics
The application configuration is in `src/main/resources/application.yml`:

```yaml
# Data Sources
data-sources:
  our-world-in-data:
    url: https://covid.ourworldindata.org/data/owid-covid-data.json

# HDFS Configuration
hdfs:
  namenode: hdfs://namenode:9000
  base-path: /covid19-data

# Hive Configuration
hive:
  jdbc-url: jdbc:hive2://hive-server:10000/default
```
- `SPRING_PROFILES_ACTIVE`: Active Spring profile (dev, prod, docker)
- `HDFS_NAMENODE`: HDFS NameNode URL
- `HIVE_JDBC_URL`: Hive JDBC connection URL
- `GET /api/covid19/latest` - Latest COVID-19 data
- `GET /api/covid19/country/{country}` - Data by country
- `GET /api/covid19/range?start={date}&end={date}` - Data by date range
- `GET /api/covid19/summary` - Summary statistics
- `GET /api/health` - Health check
- `POST /api/ingest` - Trigger data ingestion
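
A hedged sketch of how one of these endpoints might be wired up in Spring Boot; `Covid19QueryService` is a hypothetical abstraction, not necessarily the project's class, and the `Covid19Data` model it returns is shown below.

```java
import java.util.List;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/covid19")
public class Covid19QueryController {

    // Hypothetical query abstraction over the Hive-backed data service.
    public interface Covid19QueryService {
        List<Covid19Data> findByCountry(String country);
    }

    private final Covid19QueryService service;

    public Covid19QueryController(Covid19QueryService service) {
        this.service = service;
    }

    // GET /api/covid19/country/{country} - data for one country, serialized as JSON.
    @GetMapping("/country/{country}")
    public List<Covid19Data> byCountry(@PathVariable String country) {
        return service.findByCountry(country);
    }
}
```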
```java
import java.time.LocalDate;

public class Covid19Data {
    private LocalDate date;                // Reporting date
    private String country;                // Country name
    private Integer totalCases;            // Cumulative confirmed cases
    private Integer totalDeaths;           // Cumulative deaths
    private Integer newCases;              // Cases reported on this date
    private Integer newDeaths;             // Deaths reported on this date
    private Double totalCasesPerMillion;   // Cumulative cases per million people
    private Double totalDeathsPerMillion;  // Cumulative deaths per million people
    private String dataSource;             // Originating dataset, e.g. Our World in Data
}
```
- COVID-19 Data Processing: Processes Our World in Data JSON and creates Hive tables
- Schema Enforcement: Ensures consistent data structure across all countries
- Data Validation: Validates and cleans incoming data
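
The real OWID file is a nested JSON object keyed by country code, so the actual job also flattens and reshapes it; the sketch below shows only the overall read-clean-write shape of such a job. Paths and the table name are assumptions drawn from the configuration above.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Covid19EtlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("covid19-etl-sketch")
                .enableHiveSupport()  // allows writing results as Hive tables
                .getOrCreate();

        // Read the raw JSON previously landed in HDFS (multiLine for one large document).
        Dataset<Row> raw = spark.read()
                .option("multiLine", true)
                .json("hdfs://namenode:9000/covid19-data/raw/owid-covid-data.json");

        // Placeholder cleaning step: drop rows containing nulls.
        // The real job enforces the Covid19Data schema per country.
        Dataset<Row> cleaned = raw.na().drop();

        // Persist for analytical queries over JDBC (hive-server:10000).
        cleaned.write().mode("overwrite").saveAsTable("covid19_daily");

        spark.stop();
    }
}
```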
```bash
# Run the Spark job manually
docker exec spark-job /opt/spark/run-covid-job.sh

# Check job status
docker logs spark-job
```
```bash
# Run unit tests
mvn test

# Run unit and integration tests
mvn verify
```
```bash
# Test health endpoint
curl http://localhost:8081/api/health

# Test data endpoints
curl http://localhost:8081/api/covid19/latest
curl http://localhost:8081/api/covid19/summary
```
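
These curl checks can also be captured as a Spring Boot test. A minimal sketch assuming `spring-boot-starter-test`; the test class and its expectations are illustrative, not from the repository:

```java
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.web.servlet.MockMvc;

@SpringBootTest
@AutoConfigureMockMvc
class HealthEndpointTest {

    @Autowired
    private MockMvc mockMvc;

    @Test
    void healthEndpointRespondsOk() throws Exception {
        // Mirrors the curl check above: GET /api/health should return HTTP 200.
        mockMvc.perform(get("/api/health")).andExpect(status().isOk());
    }
}
```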
```
src/
├── main/
│   ├── java/com/covid19_tracker/
│   │   ├── config/        # Configuration classes
│   │   ├── model/         # Data models
│   │   ├── repository/    # Data access layer
│   │   ├── service/       # Business logic
│   │   ├── batch/         # Spring Batch jobs
│   │   ├── ingestion/     # Data ingestion services
│   │   ├── hive/          # Hive data service
│   │   └── api/           # REST controllers
│   └── resources/
│       └── application.yml  # Application configuration
└── test/                    # Test classes

spark-jobs/
├── src/main/java/
│   └── com/covid19_tracker/spark/
│       └── Covid19DataProcessor.java  # Spark ETL job
└── Dockerfile

covid19-visualization/
├── src/
│   ├── components/  # React components
│   └── services/    # API services
└── package.json
```
- Update `DataSourcesConfig.java` (a hypothetical shape is sketched after this list)
- Add an ingestion method in `DataIngestionService.java`
- Update the Spark job in `Covid19DataProcessor.java`
- Update the API endpoints in `RestApiController.java`
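
The configuration class might take a shape like the following. This is a hypothetical sketch, with field names chosen so Spring Boot's relaxed binding maps them onto the `data-sources` block in `application.yml`; it is not the repository's actual `DataSourcesConfig.java`.

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

// Registered via @EnableConfigurationProperties or @ConfigurationPropertiesScan.
@ConfigurationProperties(prefix = "data-sources")
public class DataSourcesConfig {

    /** One entry per upstream dataset. */
    public static class Source {
        private String url;
        public String getUrl() { return url; }
        public void setUrl(String url) { this.url = url; }
    }

    private Source ourWorldInData = new Source();  // binds data-sources.our-world-in-data
    private Source myNewSource = new Source();     // hypothetical new source being added

    public Source getOurWorldInData() { return ourWorldInData; }
    public void setOurWorldInData(Source source) { this.ourWorldInData = source; }
    public Source getMyNewSource() { return myNewSource; }
    public void setMyNewSource(Source source) { this.myNewSource = source; }
}
```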
- Build the application:

  ```bash
  mvn clean package -DskipTests
  ```

- Deploy with Docker:

  ```bash
  docker-compose up -d
  ```

- Run data processing:

  ```bash
  docker exec spark-job /opt/spark/run-covid-job.sh
  ```

- Monitor the application:

  ```bash
  docker-compose logs -f covid19-tracker
  ```
- Application health: `/api/health`
- HDFS health: HDFS NameNode web UI
- Hive health: Hive Server web UI
- Spark health: Spark Master web UI
- Application logs: Spring Boot logging
- Spark job logs: Docker logs
- Container logs: Docker logs
- Spring Boot Actuator metrics
- Custom business metrics
- Performance monitoring
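
One hedged example of how a custom business metric could be published through the Actuator's Micrometer registry; the metric name and the component itself are assumptions, not code from the repository:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class IngestionMetrics {

    private final Counter ingestedRecords;

    public IngestionMetrics(MeterRegistry registry) {
        // Exposed under /actuator/metrics/covid19.ingested.records.
        this.ingestedRecords = Counter.builder("covid19.ingested.records")
                .description("COVID-19 records ingested into HDFS")
                .register(registry);
    }

    public void recordIngested(long count) {
        ingestedRecords.increment(count);
    }
}
```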
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Data sources: Our World in Data
- Technologies: Spring Boot, Apache Spark, Apache Hadoop, Apache Hive, React
- Community: Open source contributors and researchers