This project provides a robust solution for extracting, transforming, and loading (ETL) Reddit data into an Amazon Redshift data warehouse, built on modern tools and services for scalability, reliability, and efficiency.
The pipeline automates extraction of Reddit data via the Reddit API, applies transformations, and loads the results into Redshift for analytics and reporting. Key services and tools involved:
- Apache Airflow: Orchestration of the ETL workflow.
- Celery: Task queue for parallel processing.
- PostgreSQL: Temporary storage for intermediate data.
- Amazon S3: Data storage for raw and processed data.
- AWS Glue: Data transformation and cataloging.
- Amazon Athena: Query service for data exploration.
- Amazon Redshift: Data warehousing and analytics.
Key features:
- Automated data extraction from the Reddit API.
- Parallel data processing using Celery and Airflow.
- Scalable data storage on S3 and Redshift.
- Flexible data transformations using AWS Glue.
- Easy ad-hoc querying with Athena.
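To illustrate the kind of cleaning the transformation step performs, here is a minimal pure-Python sketch that flattens a raw Reddit API submission into a flat record suitable for columnar loading. The `flatten_post` helper and the chosen fields are hypothetical, introduced only for this example; in this project the actual transformations run in AWS Glue.

```python
from datetime import datetime, timezone

def flatten_post(raw: dict) -> dict:
    """Flatten a raw Reddit API submission into a flat record.

    The selected fields are illustrative assumptions; a real Glue job
    would define its own schema.
    """
    return {
        "id": raw["id"],
        "subreddit": raw["subreddit"],
        "title": raw["title"].strip(),
        "author": raw.get("author") or "[deleted]",
        "score": int(raw.get("score", 0)),
        "num_comments": int(raw.get("num_comments", 0)),
        # Reddit timestamps are Unix epoch seconds (UTC).
        "created_at": datetime.fromtimestamp(
            raw["created_utc"], tz=timezone.utc
        ).isoformat(),
    }

sample = {
    "id": "abc123",
    "subreddit": "dataengineering",
    "title": "  Building an ETL pipeline  ",
    "author": None,
    "score": "42",
    "num_comments": 7,
    "created_utc": 1700000000,
}
record = flatten_post(sample)
print(record["title"])   # Building an ETL pipeline
print(record["author"])  # [deleted]
```

Normalizing types and sentinel values (deleted authors, string-typed scores) before loading keeps the Redshift schema strict and the downstream queries simple.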
Prerequisites:
- An AWS account with access to Redshift, Glue, Athena, and S3.
- Reddit API credentials.
- Docker (optional, for containerized deployment).
- Python 3.7+.
Setup:

1. Clone this repository:

        git clone https://github.com/Prasadmuthyala/Reddit-pipeline.git
        cd reddit-etl-pipeline

2. Install the required dependencies:

        pip install -r requirements.txt

3. Configure your AWS credentials and Reddit API keys.

4. Start the Airflow web server:

        airflow webserver -p 8080

5. Start the Celery worker:

        celery -A airflow_worker worker --loglevel=info

6. Trigger the ETL DAG in the Airflow UI.
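The DAG you trigger in the last step might look roughly like the following sketch. The DAG id, task names, schedule, and the `extract_reddit` / `load_to_s3` callables are assumptions for illustration, not this repository's actual DAG definition.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task callables -- the real project implements these steps.
def extract_reddit(**context):
    """Pull posts from the Reddit API and return them as a list of dicts."""
    ...

def load_to_s3(**context):
    """Write the extracted posts to the raw S3 bucket."""
    ...

with DAG(
    dag_id="reddit_etl",            # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",     # assumed schedule
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_reddit", python_callable=extract_reddit)
    load = PythonOperator(task_id="load_to_s3", python_callable=load_to_s3)
    extract >> load  # extraction must finish before the S3 load runs
```

Because the tasks run under Celery workers, independent DAG runs (and independent tasks within a run) can execute in parallel across workers.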
How the pipeline works:
- Data extraction: Reddit data is fetched through the Reddit API.
- Data transformation: the data is cleaned and structured using AWS Glue.
- Data loading: processed data is loaded into Amazon Redshift for analysis.
- Querying: data can be explored ad hoc via Amazon Athena.
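Raw data in S3 is typically laid out under date-partitioned keys so that Glue crawlers can register partitions and Athena can prune them at query time. The Hive-style layout below (`raw/subreddit=.../dt=...`) is an assumed convention for illustration, not necessarily the one this repository uses.

```python
from datetime import date

def raw_s3_key(subreddit: str, run_date: date, batch: str) -> str:
    """Build a Hive-style partitioned S3 key for a raw extract.

    Hive-style `key=value` path segments let Glue crawlers and Athena
    treat `subreddit` and `dt` as partition columns.
    """
    return (
        f"raw/subreddit={subreddit}/"
        f"dt={run_date.isoformat()}/"
        f"{batch}.json"
    )

key = raw_s3_key("dataengineering", date(2024, 1, 15), "batch-0001")
print(key)  # raw/subreddit=dataengineering/dt=2024-01-15/batch-0001.json
```

With this layout, an Athena query filtered on `dt` scans only the matching date partitions instead of the whole bucket.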
A special thank you to @airscholar for the inspiration and valuable teachings that helped shape this project. Your work was a great learning resource!
