Using Apache Airflow to schedule our ETL process
Prerequisites to run
- Download and install Docker (Docker Desktop)
- Ensure that WSL 2 is enabled as the Docker backend on Windows (a quick check is sketched after this list)
- We recommend using DBeaver to access the database
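A quick way to verify that WSL 2 is active (this assumes a reasonably recent Windows/WSL build; in Docker Desktop the corresponding setting is Settings > General > "Use the WSL 2 based engine"):

```bash
# Run in a Windows terminal (PowerShell or CMD).
# Lists installed distributions and the WSL version each one uses; the
# docker-desktop entries should show VERSION 2 when Docker uses the WSL 2 backend.
wsl -l -v

# Shows the default WSL version and kernel details (recent WSL builds only).
wsl --status
```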
Setting Up Docker
- In VS Code, open a terminal in the folder where the docker-compose file is located
- Create the directories needed: in the terminal run mkdir ./dags ./plugins ./logs
- Next, in the terminal type echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
- Now, to initialize Airflow, type in the terminal: docker-compose up airflow-init
- Finally, to run Airflow, type in the terminal: docker-compose up
- To check the health of our containers we type: docker ps (the full sequence is consolidated below)
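For reference, here is the same setup consolidated into one sequence, run from the folder containing docker-compose.yaml (this assumes the official Airflow docker-compose file, as in the Airflow docs):

```bash
# Run from the folder that contains docker-compose.yaml
mkdir ./dags ./plugins ./logs                          # folders that Airflow mounts into the containers

# Write the host user id into .env so files created by the containers are owned by you
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env

docker-compose up airflow-init                         # one-off: initialise the metadata DB and default user
docker-compose up                                      # start the webserver, scheduler, workers, etc.
docker ps                                              # each container should report a (healthy) status
```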
Running the ETL
- Ensure all zipped files/folders are extracted together and saved in the same location
- Open up VS Code and have a terminal open
- In the folder where the docker-compose file is, type docker-compose up in the terminal, then press Enter
- This will start Airflow on localhost:8081
- In a browser go to localhost:8081, look for spotify_dag, and then activate (unpause) the DAG
- Once this is active you can trigger the DAG manually or wait for it to run on its schedule (a CLI alternative is sketched below)
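If you prefer the command line to the web UI, the same activation and trigger can be done with the Airflow CLI inside one of the running containers. A minimal sketch, assuming the service name airflow-webserver from the official Airflow docker-compose file (adjust if yours differs):

```bash
# Unpause (activate) the DAG, trigger a run immediately, then list recent runs
docker-compose exec airflow-webserver airflow dags unpause spotify_dag
docker-compose exec airflow-webserver airflow dags trigger spotify_dag
docker-compose exec airflow-webserver airflow dags list-runs -d spotify_dag
```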
Notes:
- Necessary ETL files are found in the dags folder
- IMPORTANT - copy our DAG files into (or replace) the dags folder created in the Setting Up Docker instructions
- Within spotify_dag.py (located in the dags folder) you can change the schedule_interval to minutes, hours, or days
- Also note that in order for the scheduler to work your computer cannot switch off; unlike a cloud deployment, it only runs while the machine is on
- Note that this will pull my Spotify played songs; to use your own, you will need to remove the Refresh part of spotify_etl and manually insert your own recently played tracks token