# Running a Data Processing Job on EMR Serverless with AWS Step Functions and AWS Lambda using Terraform (By HashiCorp)
*Update Feb 2023* – AWS Step Functions has added direct integrations for 35 services, including Amazon EMR Serverless. In the current version of this blog, we submit an EMR Serverless job by invoking the API directly from a Step Functions workflow and use AWS Lambda only for polling the status of the job in EMR Serverless. Read more about this feature enhancement [here](https://aws.amazon.com/about-aws/whats-new/2023/02/aws-step-functions-integration-35-services-emr-serverless/).

In this blog we showcase how to build and orchestrate a [Scala](https://www.scala-lang.org/) Spark application using [Amazon EMR Serverless](https://aws.amazon.com/emr/serverless/), AWS Step Functions, and [Terraform by HashiCorp](https://www.terraform.io/). In this end-to-end solution we execute a Spark job on EMR Serverless that processes sample click-stream data in an Amazon S3 bucket and stores the aggregation results in Amazon S3.

Overview of the steps and the AWS Services used in this solution:

* [Amazon EMR Serverless](https://aws.amazon.com/emr/serverless/) Application – provides the option to submit a Spark job.
* [AWS Lambda](https://aws.amazon.com/lambda/):
  * Ingestion Lambda – processes the incoming request and pushes the data into the Firehose delivery stream.
  * EMR Job Status Check Lambda – polls EMR Serverless for the status of the job that was submitted.
* [AWS Step Functions](https://aws.amazon.com/step-functions) – submits the data processing job to an EMR Serverless application and triggers a Lambda that polls for the status of the submitted job.
* Firehose Delivery Bucket – stores the ingested application logs in Parquet file format.
* Loggregator Source Bucket – stores the Scala code/JAR for EMR job execution.

### Design Decisions
* We use AWS Step Functions and its support for SDK integrations with EMR Serverless to submit the data processing job to the EMR Serverless application.
* The AWS Lambda code and the EMR Serverless log aggregation code are written in Java and Scala respectively.
* AWS CLI v2 is required for querying Amazon EMR Serverless applications from the command line; they can also be viewed in the AWS Console. A sample CLI command is provided in the “Testing” section below.

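For context, the calls that the Step Functions SDK integration and the status-check Lambda make can also be exercised with AWS CLI v2. The sketch below uses placeholder values; the application ID, execution role ARN, S3 path, and main class are illustrative and not created by this snippet:

```
# Submit a Spark job run to an EMR Serverless application (placeholder values)
$ aws emr-serverless start-job-run \
    --region us-east-1 \
    --application-id <application-id> \
    --execution-role-arn arn:aws:iam::<account-id>:role/<emr-serverless-job-role> \
    --job-driver '{"sparkSubmit": {"entryPoint": "s3://<loggregator-source-bucket>/<loggregator-jar>.jar", "sparkSubmitParameters": "--class <MainClass>"}}'

# Poll the status of the submitted job run (the same check the status Lambda performs via the SDK)
$ aws emr-serverless get-job-run \
    --region us-east-1 \
    --application-id <application-id> \
    --job-run-id <job-run-id>
```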
### Steps
To run the commands individually:

Set the application deployment region and account number. An example is shown below; modify as needed.

```
$ APP_DIR=$PWD
$ APP_PREFIX=clicklogger
$ STAGE_NAME=dev
$ REGION=us-east-1
```

Build the AWS Lambda application JAR and the Scala application package using Maven and sbt:

```
$ cd $APP_DIR/source/clicklogger
$ mvn clean package
$ cd $APP_DIR/source/loggregator
$ sbt reload
$ sbt compile
$ sbt package
```
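As a quick sanity check before deploying, confirm that the build artifacts exist. The paths below assume default Maven and sbt output locations and are illustrative:

```
# Assumed default build output locations; adjust to the actual project layout
$ ls $APP_DIR/source/clicklogger/target/*.jar
$ ls $APP_DIR/source/loggregator/target/scala-*/*.jar
```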
Deploy the AWS infrastructure using Terraform:

```
$ terraform init
$ terraform plan
$ terraform apply --auto-approve
```
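After `terraform apply` completes, the deployed EMR Serverless application can be verified from the command line. A minimal sketch using AWS CLI v2:

```
# List EMR Serverless applications in the deployment region
$ aws emr-serverless list-applications --region $REGION
```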
### Testing
Once the application is built and deployed, you can insert sample data for the EMR processing. An example is shown below. Note that exec.sh contains multiple sample insertions for AWS Lambda. The ingested logs will be used by the EMR Serverless application job.

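A minimal sketch of one such insertion, assuming a hypothetical ingestion Lambda function name and payload shape; the actual function name and fields come from exec.sh and the Terraform deployment:

```
# Invoke the ingestion Lambda with a sample click-log record
# (function name and payload fields below are illustrative placeholders)
$ aws lambda invoke \
    --function-name clicklogger-dev-ingestion-lambda \
    --cli-binary-format raw-in-base64-out \
    --payload '{"requestid": "sample-001", "component": "checkout", "action": "click"}' \
    response.json
```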
* Run the AWS Step Functions state machine to validate the serverless application:
  * Open AWS Console > AWS Step Functions > open "clicklogger-dev-state-machine".
  * The state machine shows the steps that ran to trigger the AWS Lambda and the job submission to the EMR Serverless application.
  * Start a new Step Functions execution to trigger the workflow with the sample input below. Set the date value to the date when the sample data was ingested to S3 with the ingestion Lambda.

    ```
    {
      "InputDate": "2023-02-08"
    }
    ```

  * Once the AWS Step Functions execution is successful, navigate to Amazon S3 > <your-region>-clicklogger-dev-loggregator-output-<your-Account-Number> to see the output files.
  * These will be partitioned by year/month/date/response.md. A sample is shown below.

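In addition to the console, the workflow can be started and the outputs inspected from the command line. A minimal sketch, substituting the state machine ARN and output bucket name from your own deployment:

```
# Start a Step Functions execution with the sample input
$ aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:us-east-1:<account-id>:stateMachine:clicklogger-dev-state-machine \
    --input '{"InputDate": "2023-02-08"}'

# List the generated output objects, partitioned by year/month/date
$ aws s3 ls s3://<your-region>-clicklogger-dev-loggregator-output-<your-account-number>/ --recursive
```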