This repository contains an automated data pipeline built with Apache Airflow. The project handles the end-to-end extraction, transformation, and loading (ETL) of Wikipedia pageview traffic data into a Snowflake data warehouse.
This is the first version of the LaunchSentiment data pipeline, built as part of a data engineering capstone project. The project validates a market hypothesis: that fluctuations in a company's Wikipedia pageviews can serve as a sentiment indicator for its stock performance, with emphasis placed on the Big Five tech companies: Microsoft, Amazon, Apple, Google, and Facebook.
The pipeline manages the lifecycle of Wikipedia's traffic data. It automates the ingestion of raw, compressed files, applies cleaning logic via Python to ensure data quality, and stages the data for analytical processing.
- Orchestration: Apache Airflow
- Data Warehouse: Snowflake
- Languages: Python (Pandas), SQL
- Infrastructure: Docker
The DAG (Directed Acyclic Graph) is structured into distinct functional units to ensure reliability and scalability:
- Extraction & Parsing:
  - Uses a custom `dataframe_parser` to efficiently read and process raw `.gz` files.
  - Standardizes raw traffic logs into a structured format for downstream use.
- Transformation:
  - Performs cleaning and deduplication using Python.
  - Exports processed data to a temporary storage path before cloud ingestion.
- Snowflake Integration (`upload_to_stage`):
  - Utilizes the SnowflakeHook to manage secure warehouse connections.
  - Executes a `PUT` command to move files into the `@WIKI_STAGING_STAGE` internal stage before loading into the raw layer; after filtering for the five companies, the data is inserted into staging.
  - Includes `OVERWRITE=TRUE` logic to ensure idempotency, allowing tasks to be safely re-run without creating duplicate data.
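The staging upload described above can be sketched as follows. This is a minimal illustration, assuming the `apache-airflow-providers-snowflake` package and a connection id of `snowflake_default`; the real task's names and paths may differ.

```python
def build_put_command(local_path: str, stage: str = "@WIKI_STAGING_STAGE") -> str:
    """Build the PUT statement that copies a local file into the internal
    stage. OVERWRITE=TRUE keeps re-runs idempotent: a re-uploaded file
    replaces the old copy instead of piling up duplicates."""
    return f"PUT file://{local_path} {stage} OVERWRITE=TRUE"


def upload_to_stage(local_path: str) -> None:
    """Push a processed file into the Snowflake internal stage."""
    # Imported lazily so this module stays importable without Airflow.
    from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

    # The connection id is an assumption; it must match the connection
    # configured under Admin > Connections in the Airflow UI.
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    hook.run(build_put_command(local_path))
```

From the stage, the files are then loaded into the raw layer and filtered into staging, as described above.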
- `dags.py`: The core script containing the DAG definition and task logic.
- `db_conn.py`: Contains the SQL logic: table creation, data insertion, etc.
- `file_parser.py`: Parses the pageviews file content for 1 December 2025 at 10:00 PM.
- Dockerized Environment: The entire stack is containerized, ensuring that the Airflow scheduler, webserver, and worker run consistently across all environments.
- Snowflake Data Warehouse: Snowflake holds the 8 million rows of raw data and the final production table, ready to be used for sentiment analysis.
To run this pipeline locally:
- Clone the Repository:
  `git clone https://github.com/Human-Gechi/first_airflow_task.git`
  `cd first_airflow_task`
- Ensure you have Airflow installed.
- Initialize Docker:
docker-compose up -d
- Configure Connections:
Set up your Snowflake credentials in the Airflow UI under
`Admin > Connections` to allow the SnowflakeHook to authenticate.
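The same connection can also be provisioned without the UI, e.g. as an environment variable in the Docker setup. All field values below are placeholders; substitute your own account details:

```shell
# Equivalent to creating the connection under Admin > Connections.
# Airflow picks up any variable named AIRFLOW_CONN_<CONN_ID>.
export AIRFLOW_CONN_SNOWFLAKE_DEFAULT='snowflake://MY_USER:MY_PASSWORD@/?account=my_account&warehouse=MY_WH&database=MY_DB&schema=MY_SCHEMA&role=MY_ROLE'
```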
Challenges Encountered
- Network Delays: Network delays were experienced when loading 8+ million rows into Snowflake in one go. This was rectified by chunking the data into 1 million rows per insertion.
- Duplicated files in /tmp in Airflow's containers: After repeatedly running the pipeline, each time changing the file names of the chunked .csv files, the row count grew past 25 million. This was rectified by running `rm -f` on all the created files in /tmp.
- Snowflake Configuration on Airflow: It was unclear which credentials were required to create a Snowflake connection. This was rectified after several searches and some trial and error to see what worked.
Ogechukwu Abimbola Okoli
I help scale data pipelines by building the "digital plumbing" that moves and cleans information.
- Focus: Building reliable and scalable data pipelines
- LinkedIn: https://shorturl.at/PwBSf