- Extract random user data from the https://randomuser.me/ API, using Apache Airflow as the data orchestration tool.
- Trigger a Kafka producer and a Kafka consumer on a schedule to start streaming data.
- Kafka monitoring and management using Apache ZooKeeper, Confluent Control Center, and Confluent Schema Registry.
- Run a Spark master and Spark workers to process the streaming data.
- Apache Cassandra as a Distributed Database.
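The extraction step above can be sketched as a small helper the Airflow task would call: flatten one result from the randomuser.me API response into the record the Kafka producer publishes. The field mapping follows the API's documented response shape; the topic name and producer call in the trailing comment are assumptions, not confirmed project details.

```python
# Hypothetical sketch: flatten one `results` entry from the
# randomuser.me API into the flat record streamed to Kafka.
import json


def format_user(res: dict) -> dict:
    """Pick the fields the pipeline keeps from one API result."""
    location = res["location"]
    return {
        "first_name": res["name"]["first"],
        "last_name": res["name"]["last"],
        "gender": res["gender"],
        "address": f"{location['street']['number']} {location['street']['name']}, "
                   f"{location['city']}, {location['country']}",
        "email": res["email"],
        "username": res["login"]["username"],
    }


# Inside the DAG's PythonOperator, each formatted record would then be
# serialized and sent to a Kafka topic (topic name is an assumption):
# producer.send("users_created", json.dumps(format_user(res)).encode("utf-8"))
```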
- Have Docker installed on your computer, then verify the installation with `docker --version` and `docker compose version`.
- In your terminal, run `docker compose up -d`.
- Trigger the `user_automation` DAG in the Airflow web UI.
- Check that the Kafka topic has been created in Control Center.
- In the terminal, submit the Spark job: `spark-submit --master spark://localhost:7077 spark_stream.py`.
- Connect to the Cassandra cluster and run some CQL queries (e.g. `SELECT`) to verify the ingested data.
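The consumer side of these steps can be sketched the same way: deserialize a Kafka message value and build the parameterized CQL insert that `spark_stream.py` would execute against Cassandra. The keyspace and table names (`spark_streams.created_users`) are assumptions for illustration.

```python
# Hypothetical sketch of the sink step: turn a Kafka message value
# (JSON bytes) into a row tuple for a parameterized CQL insert.
import json

# Keyspace/table/column names are assumptions, not confirmed schema.
INSERT_CQL = (
    "INSERT INTO spark_streams.created_users "
    "(first_name, last_name, gender, address, email, username) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def to_row(message_value: bytes) -> tuple:
    """Deserialize one Kafka message into the column order of INSERT_CQL."""
    user = json.loads(message_value)
    return (
        user["first_name"], user["last_name"], user["gender"],
        user["address"], user["email"], user["username"],
    )


# Against a live cluster, the row would be written via the Cassandra driver:
#   session.execute(INSERT_CQL, to_row(msg))
# and verified from cqlsh with something like:
#   SELECT * FROM spark_streams.created_users LIMIT 10;
```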
A lot can still be done:
- Managed cloud services for Airflow, Kafka, and Spark (e.g. Cloud Composer or Amazon MWAA for Airflow, Amazon MSK for Kafka, and EMR for Spark).
- Data quality tests.
- OLAP operations with a data warehouse for analytical purposes.
