This project sets up a fully dockerized Apache Airflow pipeline to scrape quotes from the web, transform the data, and load it into a PostgreSQL database.
Everything runs through Docker Compose, and the entire project can be bootstrapped with a single shell script.
## Setup

```shell
chmod +x create_airflow_scraper_project.sh
./create_airflow_scraper_project.sh
source venv/bin/activate
docker-compose up --build
```

Wait a few moments for the containers to start.
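The generated `docker-compose.yaml` is not shown here; a minimal sketch of what its services might include, assuming a standard Airflow + Postgres layout (the image tags, the `airflow` password, and the webserver service name are guesses, though `airflow-postgres`, the `airflow` user, and the `scraperdb` database match the `docker exec` command used later in this README):

```yaml
services:
  postgres:
    image: postgres:15                 # assumed Postgres version
    container_name: airflow-postgres   # matches the docker exec command below
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow       # assumed; check the real compose file
      POSTGRES_DB: scraperdb
  airflow-webserver:
    image: apache/airflow:2.9.0        # assumed Airflow version
    ports:
      - "8080:8080"                    # Airflow's default webserver port
    volumes:
      - ./dags:/opt/airflow/dags       # mounts the dags/ folder from the repo
```

The real file generated by the bootstrap script will also define an Airflow scheduler and init steps; this sketch only shows the shape of the setup.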
## Access the Airflow UI

Open the Airflow web UI (by default at http://localhost:8080 in a standard Docker Compose setup) and log in with:

- Username: `admin`
- Password: `admin`
## Run the DAG
Trigger the `etl_web_scraper` DAG from the Airflow UI. It will scrape quote data, store it in PostgreSQL, and log each step.
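The scraping logic inside `etl_web_scraper.py` is not shown in this README. As a minimal sketch, assuming a quotes.toscrape.com-style page layout (`div.quote` blocks containing `span.text`, `small.author`, and `a.tag` elements) — the function name `parse_quotes` and the CSS selectors are illustrative, not taken from the project:

```python
from bs4 import BeautifulSoup


def parse_quotes(html: str) -> list[dict]:
    """Parse quote/author/tags records out of one scraped page.

    Assumes quotes.toscrape.com-style markup; adjust the selectors
    to whatever site the DAG actually targets.
    """
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.select("div.quote"):
        records.append({
            "quote": block.select_one("span.text").get_text(strip=True),
            "author": block.select_one("small.author").get_text(strip=True),
            "tags": [t.get_text(strip=True) for t in block.select("a.tag")],
        })
    return records
```

In the DAG, a function like this would sit in the extract task, with the resulting records passed on to the transform and load steps.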
## Check the PostgreSQL Database

Enter the database container:
```shell
docker exec -it airflow-postgres psql -U airflow -d scraperdb
```

Then, inside `psql`, run:

```sql
\c scraperdb
\dt
SELECT quote, author, tags, created_at FROM quotes LIMIT 5;
```

## Output
Each quote includes:
- Quote text
- Author
- Tags
- Timestamp of extraction (`created_at`)
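The fields above suggest a transform step that flattens each scraped record into a row matching the `quotes` table. A sketch of what that might look like, assuming the `tags` list is joined into a comma-separated string before insertion (a guess at the schema — the real DAG may store tags differently):

```python
from datetime import datetime, timezone


def to_db_row(record: dict) -> tuple:
    """Flatten one scraped record into a (quote, author, tags, created_at)
    tuple matching the columns queried above. The comma-joined tags format
    is an assumption about the project's schema."""
    return (
        record["quote"],
        record["author"],
        ",".join(record["tags"]),
        datetime.now(timezone.utc),  # extraction timestamp -> created_at
    )
```

The load task would then pass these tuples to a parameterized `INSERT` against the `quotes` table.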
## Project Structure
```
airflow-webscraper/
├── dags/
│   └── etl_web_scraper.py
├── logs/
├── plugins/
├── docker-compose.yaml
├── requirements.txt
├── .gitignore
└── create_airflow_scraper_project.sh
```
## Tech Stack
- Python + Airflow
- Docker Compose
- PostgreSQL
- BeautifulSoup for web scraping