Skip to content

Commit 971d67c

Browse files
authored
Merge pull request #3 from aws-samples/feature/sfn-emr-integration
Changes: Stepfunctions enhancement to support direct SDK Integration …
2 parents a795f62 + 5229dfd commit 971d67c

17 files changed

+268
-263
lines changed

README.md

Lines changed: 34 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
1-
# Run a data processing job on Amazon EMR Serverless with AWS Step Functions
1+
# Running a Data Processing Job on EMR Serverless with AWS Step Functions and AWS Lambda using Terraform (By HashiCorp)
22

3+
*Update Feb 2023* – AWS Step Functions adds direct integration for 35 services including Amazon EMR Serverless. In the current version of this blog, we are able to submit an EMR Serverless job by invoking the APIs directly from a Step Functions workflow. We are using the Lambda only for polling the status of the job in EMR. Read more about this feature enhancement [here](https://aws.amazon.com/about-aws/whats-new/2023/02/aws-step-functions-integration-35-services-emr-serverless/).
34

45
In this blog we showcase how to build and orchestrate a [Scala](https://www.scala-lang.org/) Spark Application using [Amazon EMR Serverless](https://aws.amazon.com/emr/serverless/) , AWS Step Functions and [Terraform By HashiCorp](https://www.terraform.io/). In this end to end solution we execute a Spark job on EMR Serverless which processes sample click-stream data in Amazon S3 bucket and stores the aggregation results in Amazon S3.
56

@@ -18,8 +19,8 @@ Overview of the steps and the AWS Services used in this solution:
1819
* [Amazon EMR Serverless](https://aws.amazon.com/emr/serverless/) Application - provides the option to submit a Spark job.
1920
* [AWS Lambda](https://aws.amazon.com/lambda/):
2021
* Ingestion Lambda – This lambda processes the incoming request and pushes the data into Firehose stream.
21-
* EMR Start Job Lambda - This lambda starts the EMR Serverless application, the EMR job process converts the ingested user click logs into output in another S3 bucket.
22-
* [AWS Step Functions](https://aws.amazon.com/step-functions) triggers the EMR Start Job Lambda which submits the application to EMR Serverless for processing of the ingested log files.
22+
* EMR Job Status Check Lambda - This lambda does a polling mechanism to check the status of the job that was submitted to EMR Serverless.
23+
* [AWS Step Functions](https://aws.amazon.com/step-functions) Submits the data processing job to an EMR Serverless application and triggers a Lambda which polls to check the status of the submitted job.
2324
* [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3)
2425
* Firehose Delivery Bucket - Stores the ingested application logs in parquet file format
2526
* Loggregator Source Bucket - Stores the scala code/jar for EMR job execution
@@ -40,8 +41,8 @@ Overview of the steps and the AWS Services used in this solution:
4041

4142
### Design Decisions
4243

43-
* We use AWS Step Functions and AWS Lambda in this use case to trigger the EMR Serverless Application. In real world, the data processing application could be long running and may exceed AWS Lambda’s execution timeout. Tools like [Amazon Managed Workflows for Apache Airflow (MWAA)](https://aws.amazon.com/managed-workflows-for-apache-airflow/) can be used. Amazon Managed Apache airflow is a managed orchestration service makes it easier to set up and operate end-to-end data pipelines in the cloud at scale
44-
* AWS Lambda Code & EMR Serverless Log Aggregation code are developed using Java & Scala respectively. These can any done using any supported languages in these use cases.
44+
* We use AWS Step Functions and its support for SDK Integrations with EMR Serverless to submit the data processing job to the EMR Serverless Application.
45+
* AWS Lambda Code & EMR Serverless Log Aggregation code are developed using Java & Scala respectively.
4546
* AWS CLI V2 is required for querying Amazon EMR Serverless applications from command line. These can be viewed from AWS Console also. A sample CLI command provided below in the “Testing” section below.
4647

4748
### Steps
@@ -64,7 +65,7 @@ To run the commands individually
6465
Set the application deployment region and account number. An example below. Modify as needed.
6566

6667
```
67-
$ APP_DIR=$PWD
68+
$ APP_DIR=$PWD
6869
$ APP_PREFIX=clicklogger
6970
$ STAGE_NAME=dev
7071
$ REGION=us-east-1
@@ -74,9 +75,10 @@ $ APP_DIR=$PWD
7475
Maven build AWS Lambda Application Jar & Scala Application package
7576

7677
```
77-
$ cd $APP_DIR/source/clicklogger
78+
$ cd $APP_DIR/source/clicklogger
7879
$ mvn clean package
79-
$ sbt reload
80+
$ cd $APP_DIR/source/loggregator
81+
$ sbt reload
8082
$ sbt compile
8183
$ sbt package
8284
```
@@ -85,14 +87,13 @@ $ sbt reload
8587
Deploy the AWS Infrastructure using Terraform
8688

8789
```
88-
$ terraform init
90+
$ terraform init
8991
$ terraform plan
9092
$ terraform apply --auto-approve
9193
```
9294

9395
### Testing
9496

95-
9697

9798
Once the application is built and deployed, you can also insert sample data for the EMR processing. An example as below. Note exec.sh has multiple sample insertions for AWS Lambda. The ingested logs will be used by the EMR Serverless Application job
9899

@@ -114,11 +115,16 @@ Validate the Deployments
114115
![Alt text](assets/s3_source_parquet_files.png?raw=true "Title")
115116

116117
* Run AWS Step Function to validate the Serverless application
117-
* Open AWS Console > AWS Step Function > Open "clicklogger-dev-state-machine".
118-
* The step function will show the steps that ran to trigger the AWS Lambda and EMR Serverless Application
119-
* Start a new execution to trigger the AWS Lambda and EMR Serverless Application/Job
120-
* Once the AWS Step Function is successful, navigate to Amazon S3 > clicklogger-dev-outputs-bucket- to see the output files.
121-
* These will be partitioned by year/month/date/response.md. A sample is shown below
118+
* Open AWS Console > AWS Step Function > Open "clicklogger-dev-state-machine".
119+
* The step function will show the steps that ran to trigger the AWS Lambda and Job submission to EMR Serverless Application
120+
* Start a new StepFunctions execution to trigger the workflow with the sample input below. Enter the date value equal to the date when sample data was ingested to S3 with the ingest lambda.
121+
```
122+
{
123+
"InputDate": "2023-02-08"
124+
}
125+
```
126+
* Once the AWS Step Function is successful, navigate to Amazon S3 > <your-region>-clicklogger-dev-loggregator-output-<your-Account-Number> to see the output files.
127+
* These will be partitioned by year/month/date/response.md. A sample is shown below
122128

123129
![Alt text](assets/s3_output_response_file.png?raw=true "Title")
124130

@@ -129,7 +135,6 @@ AWS CLI can be used to check the deployed AWS Serverless Application
129135
$ aws emr-serverless list-applications \
130136
| jq -r '.applications[] | select(.name=="clicklogger-dev-loggregrator-emr-<Your-Account-Number>").id'
131137
132-
133138
```
134139

135140
![Alt text](assets/step_function_success.png?raw=true "Title")
@@ -138,7 +143,6 @@ EMR Studio
138143

139144
* Open AWS Console, Navigate to “EMR” > “Serverless” tab on the left pane.
140145
* Select “clicklogger-dev-studio” and click “Manage Applications”
141-
* The Application created by the stack will be as shown below clicklogger-dev-loggregator-emr-<Your-Account-Number>
142146

143147

144148

@@ -184,11 +188,19 @@ S3 and created services can be deleted using CLI also. Execute the below command
184188
185189
# CLI Commands to delete the S3
186190
187-
aws s3 rb s3://clicklogger-dev-emr-serverless-logs-bucket-<your-account-number> --force
188-
aws s3 rb s3://clicklogger-dev-firehose-delivery-bucket-<your-account-number> --force
189-
aws s3 rb s3://clicklogger-dev-loggregator-output-bucket-<your-account-number> --force
190-
aws s3 rb s3://clicklogger-dev-loggregator-source-bucket-<your-account-number> --force
191-
aws s3 rb s3://clicklogger-dev-loggregator-source-bucket-<your-account-number> --force
191+
APP_DIR=$PWD
192+
APP_PREFIX=clicklogger
193+
STAGE_NAME=dev
194+
REGION=us-east-1
195+
196+
ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')
197+
echo $ACCOUNT_ID
198+
199+
aws s3 rb s3://$REGION-$APP_PREFIX-$STAGE_NAME-emr-logs-$ACCOUNT_ID --force
200+
aws s3 rb s3://$REGION-$APP_PREFIX-$STAGE_NAME-firehose-delivery-$ACCOUNT_ID --force
201+
aws s3 rb s3://$REGION-$APP_PREFIX-$STAGE_NAME-loggregator-output-$ACCOUNT_ID --force
202+
aws s3 rb s3://$REGION-$APP_PREFIX-$STAGE_NAME-loggregator-source-$ACCOUNT_ID --force
203+
aws s3 rb s3://$REGION-$APP_PREFIX-$STAGE_NAME-emr-studio-$ACCOUNT_ID --force
192204
193205
# Destroy the AWS Infrastructure
194206
terraform destroy --auto-approve

assets/AWSStepFunction.png

-12.3 KB
Binary file not shown.

assets/AWSStepFunction.png.license

Lines changed: 0 additions & 3 deletions
This file was deleted.
-181 Bytes
Loading

assets/step_function_caught.png

-74.1 KB
Binary file not shown.

assets/step_function_caught.png.license

Lines changed: 0 additions & 3 deletions
This file was deleted.

assets/step_function_success.png

10.8 KB
Loading

assets/step_function_uncaught.png

-75 KB
Binary file not shown.

assets/step_function_uncaught.png.license

Lines changed: 0 additions & 3 deletions
This file was deleted.

0 commit comments

Comments
 (0)