COVID-19 Data Tracker

A comprehensive data processing pipeline that ingests and analyzes COVID-19 data from multiple sources. The project implements a batch processing architecture using Spring Boot, Apache Spark, HDFS, and Apache Hive.

πŸ—οΈ Architecture Overview

Data Sources β†’ Data Ingestion β†’ HDFS (Raw) β†’ Spark Processing β†’ Hive β†’ API β†’ React Frontend
     ↓              ↓              ↓              ↓              ↓        ↓         ↓
   Our World    Spring Boot    HDFS Storage   Apache Spark   Apache    REST API   React UI
   in Data      Ingestion      (Data Lake)    (ETL Jobs)     Hive      (JSON)     (Charts)

Key Components

  • Data Sources: Our World in Data COVID-19 dataset
  • Data Ingestion: Spring Boot service for downloading and storing data in HDFS (see the sketch after this list)
  • Data Processing: Apache Spark jobs for ETL processing
  • Data Storage: HDFS for raw data storage
  • Data Warehouse: Apache Hive for analytical queries
  • API Layer: Spring Boot REST API
  • Frontend: React with TypeScript and Recharts
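
For illustration, a minimal sketch of the ingestion step: downloading the Our World in Data JSON and streaming it into HDFS. The class name and raw-data path are assumptions, not the project's actual code; the NameNode URL matches the configuration shown later.

import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OwidIngestionSketch {
    public static void main(String[] args) throws Exception {
        String sourceUrl = "https://covid.ourworldindata.org/data/owid-covid-data.json";
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
             InputStream in = new URL(sourceUrl).openStream();
             FSDataOutputStream out = fs.create(new Path("/covid19-data/raw/owid-covid-data.json"), true)) {
            in.transferTo(out); // stream the download straight into the data lake
        }
    }
}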

πŸš€ Quick Start

Prerequisites

  • Docker and Docker Compose
  • Java 17+
  • Node.js 18+ (for local development)

Running with Docker Compose

  1. Clone the repository:

    git clone <repository-url>
    cd Covid19_Data_Tracker
  2. Start all services:

    docker-compose up -d
  3. Run the Spark job to process data:

    docker exec spark-job /opt/spark/run-covid-job.sh
  4. Access the applications (the REST API is served at http://localhost:8081)

Local Development

  1. Start the backend:

    mvn spring-boot:run
  2. Start the frontend:

    cd covid19-visualization
    npm install
    npm start

πŸ“Š Features

Data Processing

  • Batch Data Ingestion: Automated data collection from Our World in Data
  • ETL Pipeline: Apache Spark jobs for data transformation and cleaning
  • Data Quality: Validation and error handling for data integrity
  • Schema Enforcement: Consistent data schema across all countries

Analytics

  • COVID-19 Metrics: Total and new cases, deaths, and per-million rates
  • Country-wise Analysis: Data broken down by country
  • Time Series Analysis: Metrics tracked across reporting dates
  • Statistical Analysis: Summary statistics and data insights

Visualization

  • Interactive Dashboards: Dynamic charts and graphs rendered with Recharts
  • Time Series Analysis: Trend visualization over time
  • Geographic Data: Country and regional comparisons
  • Comparative Analysis: Side-by-side comparisons of different metrics

πŸ”§ Configuration

Application Properties

The application configuration is in src/main/resources/application.yml:

# Data Sources
data-sources:
  our-world-in-data:
    url: https://covid.ourworldindata.org/data/owid-covid-data.json

# HDFS Configuration
hdfs:
  namenode: hdfs://namenode:9000
  base-path: /covid19-data

# Hive Configuration
hive:
  jdbc-url: jdbc:hive2://hive-server:10000/default
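
Querying the warehouse over the configured JDBC URL can be done with plain JDBC. A hedged sketch, assuming the Hive JDBC driver is on the classpath and using a hypothetical covid19_raw table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // URL matches the hive.jdbc-url setting above
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-server:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT country, MAX(total_cases) FROM covid19_raw GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}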

Environment Variables

  • SPRING_PROFILES_ACTIVE: Active Spring profile (dev, prod, docker)
  • HDFS_NAMENODE: HDFS NameNode URL
  • HIVE_JDBC_URL: Hive JDBC connection URL
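
Spring Boot's relaxed binding maps these variables onto the matching application.yml keys (HDFS_NAMENODE overrides hdfs.namenode, and so on). A minimal sketch of a typed properties class for the hdfs block; the class name and accessors are assumptions:

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

@Component
@ConfigurationProperties(prefix = "hdfs")
public class HdfsProperties {
    private String namenode;  // hdfs.namenode, overridable via HDFS_NAMENODE
    private String basePath;  // hdfs.base-path

    public String getNamenode() { return namenode; }
    public void setNamenode(String namenode) { this.namenode = namenode; }
    public String getBasePath() { return basePath; }
    public void setBasePath(String basePath) { this.basePath = basePath; }
}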

πŸ“ˆ API Endpoints

COVID-19 Data

  • GET /api/covid19/latest - Latest COVID-19 data
  • GET /api/covid19/country/{country} - Data by country
  • GET /api/covid19/range?start={date}&end={date} - Data by date range
  • GET /api/covid19/summary - Summary statistics

System

  • GET /api/health - Health check
  • POST /api/ingest - Trigger data ingestion
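
As an illustration of how one of these endpoints could be wired, a sketch of the country query in Spring (controller and repository names are assumptions, not the project's actual classes):

import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/covid19")
public class Covid19QueryController {

    // Hypothetical data-access abstraction over the Hive-backed store
    public interface Covid19Repository {
        List<Covid19Data> findByCountry(String country);
    }

    private final Covid19Repository repository;

    public Covid19QueryController(Covid19Repository repository) {
        this.repository = repository;
    }

    // GET /api/covid19/country/{country} returns rows for one country as JSON
    @GetMapping("/country/{country}")
    public List<Covid19Data> byCountry(@PathVariable String country) {
        return repository.findByCountry(country);
    }
}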

πŸ—„οΈ Data Models

COVID-19 Data

import java.time.LocalDate;

public class Covid19Data {
    private LocalDate date;               // observation date
    private String country;               // country name
    private Integer totalCases;           // cumulative confirmed cases
    private Integer totalDeaths;          // cumulative deaths
    private Integer newCases;             // cases reported on this date
    private Integer newDeaths;            // deaths reported on this date
    private Double totalCasesPerMillion;  // cumulative cases per million people
    private Double totalDeathsPerMillion; // cumulative deaths per million people
    private String dataSource;            // originating dataset, e.g. "Our World in Data"

    // getters and setters omitted
}

πŸ”„ Data Processing

Spark Job

  • COVID-19 Data Processing: Processes Our World in Data JSON and creates Hive tables
  • Schema Enforcement: Ensures consistent data structure across all countries
  • Data Validation: Validates and cleans incoming data
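
A minimal sketch of this job, assuming Hive support is enabled in the session and using a hypothetical covid19_raw table name (the real job also flattens the per-country OWID structure, which this sketch skips):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Covid19EtlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("covid19-etl")
                .enableHiveSupport()  // lets saveAsTable create Hive tables
                .getOrCreate();

        // Read the raw OWID JSON landed in HDFS by the ingestion service
        Dataset<Row> raw = spark.read()
                .option("multiLine", true)
                .json("hdfs://namenode:9000/covid19-data/raw/owid-covid-data.json");

        // Persist the parsed data as a Hive table for the API layer to query
        raw.write().mode("overwrite").saveAsTable("covid19_raw");

        spark.stop();
    }
}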

Job Execution

# Run the Spark job manually
docker exec spark-job /opt/spark/run-covid-job.sh

# Check job status
docker logs spark-job

πŸ§ͺ Testing

Unit Tests

mvn test

Integration Tests

mvn verify

API Tests

# Test health endpoint
curl http://localhost:8081/api/health

# Test data endpoints
curl http://localhost:8081/api/covid19/latest
curl http://localhost:8081/api/covid19/summary

πŸ“ Development

Project Structure

src/
β”œβ”€β”€ main/
β”‚   β”œβ”€β”€ java/com/covid19_tracker/
β”‚   β”‚   β”œβ”€β”€ config/          # Configuration classes
β”‚   β”‚   β”œβ”€β”€ model/           # Data models
β”‚   β”‚   β”œβ”€β”€ repository/      # Data access layer
β”‚   β”‚   β”œβ”€β”€ service/         # Business logic
β”‚   β”‚   β”œβ”€β”€ batch/           # Spring Batch jobs
β”‚   β”‚   β”œβ”€β”€ ingestion/       # Data ingestion services
β”‚   β”‚   β”œβ”€β”€ hive/            # Hive data service
β”‚   β”‚   └── api/             # REST controllers
β”‚   └── resources/
β”‚       └── application.yml  # Application configuration
└── test/                    # Test classes

spark-jobs/
β”œβ”€β”€ src/main/java/
β”‚   └── com/covid19_tracker/spark/
β”‚       └── Covid19DataProcessor.java  # Spark ETL job
└── Dockerfile

covid19-visualization/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ components/          # React components
β”‚   └── services/            # API services
└── package.json

Adding New Data Sources

  1. Update DataSourcesConfig.java
  2. Add ingestion method in DataIngestionService.java
  3. Update Spark job in Covid19DataProcessor.java (see the sketch after this list)
  4. Update API endpoints in RestApiController.java
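
For step 3, a hedged sketch of merging a second source into the warehouse. Paths and table names are assumptions, and unionByName with allowMissingColumns requires Spark 3.1+:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MergeNewSourceSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("merge-new-source")
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> existing = spark.table("covid19_raw");
        Dataset<Row> added = spark.read()
                .json("hdfs://namenode:9000/covid19-data/raw/new-source.json");

        // allowMissingColumns=true null-fills columns absent from one source;
        // write to a new table rather than overwriting one being read from
        existing.unionByName(added, true)
                .write()
                .mode("overwrite")
                .saveAsTable("covid19_merged");

        spark.stop();
    }
}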

πŸš€ Deployment

Production Deployment

  1. Build the application:

    mvn clean package -DskipTests
  2. Deploy with Docker:

    docker-compose up -d
  3. Run data processing:

    docker exec spark-job /opt/spark/run-covid-job.sh
  4. Monitor the application:

    docker-compose logs -f covid19-tracker

πŸ“Š Monitoring and Logging

Health Checks

  • Application health: /api/health
  • HDFS health: HDFS NameNode web UI
  • Hive health: Hive Server web UI
  • Spark health: Spark Master web UI

Logging

  • Application logs: Spring Boot logging
  • Spark job logs: Docker logs
  • Container logs: Docker logs

Metrics

  • Spring Boot Actuator metrics
  • Custom business metrics
  • Performance monitoring
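
A sketch of how a custom business metric might be registered with Micrometer, which backs Spring Boot Actuator's metrics endpoint (the class and metric name are assumptions):

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class IngestionMetrics {
    private final Counter ingestedRecords;

    public IngestionMetrics(MeterRegistry registry) {
        // exposed under /actuator/metrics/covid19.ingested.records
        this.ingestedRecords = registry.counter("covid19.ingested.records");
    }

    public void recordIngested(long count) {
        ingestedRecords.increment(count);
    }
}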

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Data sources: Our World in Data
  • Technologies: Spring Boot, Apache Spark, Apache Hadoop, Apache Hive, React
  • Community: Open source contributors and researchers
