This project provides a robust solution for extracting, transforming, and loading (ETL) Reddit data into an Amazon Redshift data warehouse, built on modern tools and services for scalability, reliability, and efficiency.
The pipeline automates extraction of Reddit data via the Reddit API, applies transformations, and loads the results into Redshift for analytics and reporting. Key services and tools involved:
- Apache Airflow: Orchestration of the ETL workflow.
- Celery: Task queue for parallel processing.
- PostgreSQL: Temporary storage for intermediate data.
- Amazon S3: Data storage for raw and processed data.
- AWS Glue: Data transformation and cataloging.
- Amazon Athena: Query service for data exploration.
- Amazon Redshift: Data warehousing and analytics.
Key features:
- Automated data extraction from the Reddit API.
- Parallel data processing using Celery and Airflow.
- Scalable data storage on S3 and Redshift.
- Flexible data transformations using AWS Glue.
- Easy ad-hoc querying with Athena.
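To illustrate the kind of cleaning the transformation step performs, here is a minimal pure-Python sketch that flattens a raw Reddit API submission into a flat record suitable for columnar loading. The `flatten_post` helper and the chosen fields are hypothetical, introduced only for this example; in this project the actual transformations run in AWS Glue.

```python
from datetime import datetime, timezone

def flatten_post(raw: dict) -> dict:
    """Flatten a raw Reddit API submission into a flat record.

    The selected fields are illustrative assumptions; a real Glue job
    would define its own schema.
    """
    return {
        "id": raw["id"],
        "subreddit": raw["subreddit"],
        "title": raw["title"].strip(),
        "author": raw.get("author") or "[deleted]",
        "score": int(raw.get("score", 0)),
        "num_comments": int(raw.get("num_comments", 0)),
        # Reddit timestamps are Unix epoch seconds (UTC).
        "created_at": datetime.fromtimestamp(
            raw["created_utc"], tz=timezone.utc
        ).isoformat(),
    }

sample = {
    "id": "abc123",
    "subreddit": "dataengineering",
    "title": "  Building an ETL pipeline  ",
    "author": None,
    "score": "42",
    "num_comments": 7,
    "created_utc": 1700000000,
}
record = flatten_post(sample)
print(record["title"])   # Building an ETL pipeline
print(record["author"])  # [deleted]
```

Normalizing types and sentinel values (deleted authors, string-typed scores) before loading keeps the Redshift schema strict and the downstream queries simple.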
Prerequisites:
- An AWS account with access to Redshift, Glue, Athena, and S3.
- Reddit API credentials.
- Docker (optional, for containerized deployment).
- Python 3.7+.
Setup:

1. Clone this repository:

        git clone https://github.com/Prasadmuthyala/Reddit-pipeline.git
        cd reddit-etl-pipeline

2. Install the required dependencies:

        pip install -r requirements.txt

3. Configure your AWS credentials and Reddit API keys.

4. Start the Airflow web server:

        airflow webserver -p 8080

5. Start the Celery worker:

        celery -A airflow_worker worker --loglevel=info

6. Trigger the ETL DAG in the Airflow UI.
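The DAG you trigger in the last step might look roughly like the following sketch. The DAG id, task names, schedule, and the `extract_reddit` / `load_to_s3` callables are assumptions for illustration, not this repository's actual DAG definition.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task callables -- the real project implements these steps.
def extract_reddit(**context):
    """Pull posts from the Reddit API and return them as a list of dicts."""
    ...

def load_to_s3(**context):
    """Write the extracted posts to the raw S3 bucket."""
    ...

with DAG(
    dag_id="reddit_etl",            # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",     # assumed schedule
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_reddit", python_callable=extract_reddit)
    load = PythonOperator(task_id="load_to_s3", python_callable=load_to_s3)
    extract >> load  # extraction must finish before the S3 load runs
```

Because the tasks run under Celery workers, independent DAG runs (and independent tasks within a run) can execute in parallel across workers.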
How the pipeline works:
- Data extraction: Reddit data is fetched through the Reddit API.
- Data transformation: the data is cleaned and structured using AWS Glue.
- Data loading: processed data is loaded into Amazon Redshift for analysis.
- Querying: data can be explored ad hoc via Amazon Athena.
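Raw data in S3 is typically laid out under date-partitioned keys so that Glue crawlers can register partitions and Athena can prune them at query time. The Hive-style layout below (`raw/subreddit=.../dt=...`) is an assumed convention for illustration, not necessarily the one this repository uses.

```python
from datetime import date

def raw_s3_key(subreddit: str, run_date: date, batch: str) -> str:
    """Build a Hive-style partitioned S3 key for a raw extract.

    Hive-style `key=value` path segments let Glue crawlers and Athena
    treat `subreddit` and `dt` as partition columns.
    """
    return (
        f"raw/subreddit={subreddit}/"
        f"dt={run_date.isoformat()}/"
        f"{batch}.json"
    )

key = raw_s3_key("dataengineering", date(2024, 1, 15), "batch-0001")
print(key)  # raw/subreddit=dataengineering/dt=2024-01-15/batch-0001.json
```

With this layout, an Athena query filtered on `dt` scans only the matching date partitions instead of the whole bucket.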
A special thank you to @airscholar for the inspiration and valuable teachings that helped shape this project. Your work was a great learning resource!
