In this project, you will build an end-to-end data engineering pipeline for real-time stock market data using Apache Kafka.
We will use the following technologies: Python, Amazon Web Services (AWS), Apache Kafka, AWS Glue, Amazon Athena, and SQL.
- Programming Language - Python
- Amazon Web Services (AWS)
- S3 (Simple Storage Service)
- Athena
- Glue Crawler
- Glue Catalog
- EC2
- Apache Kafka
Ensure that you have Java installed on your EC2 instance. Check the version and install it if needed:

```bash
java -version
sudo yum install java
java -version
```

Download and extract Apache Kafka:

```bash
wget https://downloads.apache.org/kafka/3.5.2/kafka_2.13-3.5.2.tgz
tar -xvf kafka_2.13-3.5.2.tgz
```

Navigate to the Kafka directory and start ZooKeeper:

```bash
cd kafka_2.13-3.5.2/
bin/zookeeper-server-start.sh config/zookeeper.properties
```

- Open a new terminal session (or duplicate the current one).
- SSH into your EC2 instance.
- Set Kafka heap options:
```bash
export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
```

- Start the Kafka server:

```bash
cd kafka_2.13-3.5.2/
bin/kafka-server-start.sh config/server.properties
```

To ensure Kafka is accessible from outside the instance, update the `server.properties` file:
- Open the `server.properties` file:

```bash
sudo nano config/server.properties
```

- Modify the `advertised.listeners` property to use the public IP of your EC2 instance:

```properties
advertised.listeners=PLAINTEXT://<public-ip>:9092
```

- Open a new terminal session (or duplicate the current one).
- SSH into your EC2 instance.
- Create a topic named `demo`:

```bash
cd kafka_2.13-3.5.2/
bin/kafka-topics.sh --create --topic demo --bootstrap-server <kafka-ip>:9092 --replication-factor 1 --partitions 1
```

Start a Kafka producer to send messages to the `demo` topic:

```bash
bin/kafka-console-producer.sh --topic demo --bootstrap-server <kafka-ip>:9092
```

Open a new terminal session, SSH into your EC2 instance, and start a Kafka consumer to read messages from the `demo` topic:
```bash
cd kafka_2.13-3.5.2
bin/kafka-console-consumer.sh --topic demo --bootstrap-server <kafka-ip>:9092
```

To create an IAM user and get access and secret keys in AWS, follow these steps:
- Go to the AWS Management Console.
- Sign in using your AWS account credentials.
- In the AWS Management Console, open the IAM service by either:
- Typing “IAM” in the search bar and selecting it.
- Finding IAM under “Security, Identity, & Compliance” in the Services menu.
- In the IAM dashboard, click on Users in the left-hand menu.
- Click the Add user button at the top of the page.
- Enter the User name for the new IAM user.
- Select Programmatic access under Access type. This will enable the user to access AWS services via API, CLI, and SDK.
- Optionally, select AWS Management Console access if you want to allow the user to log in to the AWS Management Console. Set a custom password if needed.
- Click Next: Permissions.
- Choose how you want to assign permissions to the user:
- Attach existing policies directly: Choose policies that grant permissions directly to the user.
- Add user to group: Add the user to an IAM group with predefined policies.
- Copy permissions from existing user: Copy permissions from another user.
- Attach customer managed policies: Attach custom policies if you have them.
- Select the policies or groups you want to assign and click Next: Tags.
- Add any tags you want to assign to the user (e.g., `Department: Finance`).
- Click Next: Review.
- Review the user details and permissions.
- Click the Create user button.
- Once the user is created, you will see a success message with the user’s Access key ID and Secret access key.
- Click Download .csv to save the credentials to a file or copy them manually. Important: Save the secret access key securely, as it is only displayed once.
- Open your terminal or command prompt.
- Run the command `aws configure`.
- Enter the Access Key ID and Secret Access Key when prompted, along with the default region name and output format.
These steps will create an IAM user with programmatic access and allow you to use the access key and secret key to interact with AWS services.
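To confirm the stored keys work from Python, here is a minimal sketch. It assumes the third-party `boto3` package; the STS client is passed in as a parameter so the function is easy to test without touching AWS:

```python
def identity_arn(sts) -> str:
    """Return the ARN behind the configured credentials.

    `sts` is a boto3 STS client, e.g. boto3.client("sts"). A successful
    call proves that `aws configure` stored usable keys.
    """
    return sts.get_caller_identity()["Arn"]

# Usage (requires boto3 and valid credentials):
#   import boto3
#   print(identity_arn(boto3.client("sts")))
```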
- **Create an S3 Bucket**
- Log in to the AWS Management Console.
- Navigate to the S3 service.
- Click on "Create bucket."
- Enter a unique name for the bucket.
- Choose a region and configure any other settings as needed.
- Click "Create bucket."
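Bucket names must be globally unique and follow S3's naming rules. A quick checker for the common rules (a simplified subset, not the full AWS specification):

```python
import re

# Subset of S3 bucket naming rules: 3-63 characters; lowercase letters,
# digits, hyphens, and dots; must start and end with a letter or digit;
# no consecutive dots.
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    return bool(BUCKET_RE.match(name)) and ".." not in name

print(is_valid_bucket_name("stock-market-kafka-demo"))  # True
print(is_valid_bucket_name("Stock_Market"))             # False: uppercase and underscore
```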
- **Update Your Kafka Consumer Notebook**
- Ensure your Kafka consumer notebook is configured to write data to the newly created S3 bucket.
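The consumer-to-S3 step can be sketched in Python. This assumes the third-party `kafka-python` and `boto3` packages; the bucket, topic, and broker address are placeholders, not values from the original project:

```python
import json
from datetime import datetime, timezone

def s3_key(topic: str, ts: datetime) -> str:
    # Date-partitioned keys give the Glue crawler a folder structure to infer.
    return f"{topic}/{ts:%Y-%m-%d}/stock_market_{ts:%H%M%S%f}.json"

def consume_to_s3(bucket: str, topic: str = "demo",
                  bootstrap: str = "<kafka-ip>:9092") -> None:
    """Read JSON messages from Kafka and land each one as an object in S3.

    Requires third-party packages: pip install kafka-python boto3
    """
    import boto3
    from kafka import KafkaConsumer

    s3 = boto3.client("s3")
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:
        s3.put_object(
            Bucket=bucket,
            Key=s3_key(topic, datetime.now(timezone.utc)),
            Body=json.dumps(message.value).encode("utf-8"),
        )
```

Writing one object per message keeps the sketch simple; batching messages before each `put_object` would reduce S3 request volume in a real deployment.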
- **Navigate to AWS Glue**
- Log in to the AWS Management Console.
- Go to the AWS Glue service.
- **Create a Glue Crawler**
- In the Glue Console, select "Crawlers" from the left-hand menu.
- Click "Add crawler."
- Provide a name for the crawler.
- Choose "Data stores" as the source type.
- Select "S3" and specify the path to your S3 bucket.
- Set up a crawler schedule (e.g., run on demand or schedule at intervals).
- Choose or create an IAM role that has permissions to access S3 and Glue.
- Configure output settings, including the database where the table metadata will be stored.
- **Run the Crawler**
- After creating the crawler, select it from the list and click "Run crawler."
- The crawler will scan your S3 bucket, infer the schema, and create tables in the specified Glue database.
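Instead of clicking "Run crawler," the same step can be scripted. A sketch assuming `boto3`, with the Glue client passed in as a parameter; the crawler name in the usage comment is a placeholder:

```python
import time

def run_crawler(glue, name: str, poll_seconds: float = 10.0) -> str:
    """Start a Glue crawler and block until it returns to the READY state.

    `glue` is a boto3 Glue client, e.g. boto3.client("glue").
    """
    glue.start_crawler(Name=name)
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return state
        time.sleep(poll_seconds)

# Usage (requires boto3 and an existing crawler):
#   import boto3
#   run_crawler(boto3.client("glue"), "stock-market-crawler")
```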
- **Navigate to AWS Athena**
- Log in to the AWS Management Console.
- Go to the Athena service.
- **Configure Athena**
- Set up a query result location in S3 if not already configured.
- Choose the database that was created by the Glue Crawler.
- **Run Queries**
- Use the Athena query editor to run SQL queries on the tables created by the Glue Crawler.
- Analyze your data and generate reports as needed.
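Queries can also be submitted programmatically. A sketch assuming `boto3`; the database, table, and bucket names in the usage comment are placeholders for whatever the crawler created:

```python
def run_athena_query(athena, sql: str, database: str, output_s3: str) -> str:
    """Submit a SQL query to Athena and return the query execution id.

    `athena` is a boto3 Athena client; `output_s3` is the query-result
    location, e.g. "s3://<your-bucket>/athena-results/".
    """
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return response["QueryExecutionId"]

# Usage (requires boto3):
#   import boto3
#   qid = run_athena_query(
#       boto3.client("athena"),
#       "SELECT * FROM stock_market_data LIMIT 10",
#       database="stock_market_db",
#       output_s3="s3://<your-bucket>/athena-results/",
#   )
```

Athena queries run asynchronously, so the returned id would then be polled with `get_query_execution` until the query succeeds.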
- **Monitor Glue Crawler and Athena**
- Regularly check the AWS Glue and Athena dashboards for job statuses and query results.
- **Permissions and Security**
- Ensure that the IAM roles used by Glue and Athena have the necessary permissions to access S3 and other resources.
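As a sketch of what the S3 portion of such a policy might look like (the bucket name is a placeholder; in practice you might start from AWS managed policies such as `AWSGlueServiceRole` and `AmazonAthenaFullAccess` and narrow down from there):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadWriteProjectBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::stock-market-kafka-demo",
        "arn:aws:s3:::stock-market-kafka-demo/*"
      ]
    }
  ]
}
```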
