Commit af412c9

Merge pull request #1 from aws-samples/feature/initial-commit
initial commit
2 parents 5140ad8 + 218f100 commit af412c9

File tree

61 files changed

+3006
-9
lines changed

.gitignore

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
.DS_Store
2+
.idea
3+
.mvn
4+
.terraform
5+
.vscode
6+
terraform.tfstate.backup
7+
.terraform.lock.hcl
8+
target
9+
out
10+
source/clicklogger/src/main/.DS_Store
11+
source/clicklogger/src/main/java/.DS_Store
12+
source/clicklogger/src/main/java/com/.DS_Store
13+
source/clicklogger/src/main/java/com/clicklogs/.DS_Store
14+
terraform/workspaces/us-east-1/terraform.tfstate
15+
terraform/workspaces/us-east-1/.terraform/*
16+
terraform/workspaces/us-east-1/.terraform*
17+
source/clicklogger/src/test/.DS_Store
18+
source/clicklogger/src/test/java/.DS_Store
19+
source/clicklogger/src/test/java/com/.DS_Store
20+
assets/.$emr-serverless-click-logs-from-web-application.drawio.bkp
21+
assets/.$emr-serverless-click-logs-from-web-application.drawio.dtmp

HELP.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
# Getting Started
2+
3+
### Reference Documentation
4+
For further reference, please consider the following sections:
5+
6+
* [Official Apache Maven documentation](https://maven.apache.org/guides/index.html)
7+
* [Spring Boot Maven Plugin Reference Guide](https://docs.spring.io/spring-boot/docs/2.4.2/maven-plugin/reference/html/)
8+
* [Create an OCI image](https://docs.spring.io/spring-boot/docs/2.4.2/maven-plugin/reference/html/#build-image)
9+
* [Spring Web](https://docs.spring.io/spring-boot/docs/2.4.2/reference/htmlsingle/#boot-features-developing-web-applications)
10+
11+
### Guides
12+
The following guides illustrate how to use some features concretely:
13+
14+
* [Building a RESTful Web Service](https://spring.io/guides/gs/rest-service/)
15+
* [Serving Web Content with Spring MVC](https://spring.io/guides/gs/serving-web-content/)
16+
* [Building REST services with Spring](https://spring.io/guides/tutorials/bookmarks/)
17+

LICENSES/MIT-0.txt

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
2+
3+
Permission is hereby granted, free of charge, to any person obtaining a copy of
4+
this software and associated documentation files (the "Software"), to deal in
5+
the Software without restriction, including without limitation the rights to
6+
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7+
the Software, and to permit persons to whom the Software is furnished to do so.
8+
9+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11+
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12+
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13+
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14+
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15+

README.md

Lines changed: 208 additions & 9 deletions
@@ -1,17 +1,216 @@
1-
## My Project
1+
# Running a Data Processing Job on EMR Serverless with AWS Step Functions and AWS Lambda using Terraform (By HashiCorp)
22

3-
TODO: Fill this README out!
43

5-
Be sure to:
4+
In this blog we showcase how to build and orchestrate a [Scala](https://www.scala-lang.org/) Spark application using [Amazon EMR Serverless](https://aws.amazon.com/emr/serverless/), AWS Step Functions, and [Terraform By HashiCorp](https://www.terraform.io/). In this end-to-end solution we execute a Spark job on EMR Serverless that processes sample click-stream data in an Amazon S3 bucket and stores the aggregation results in Amazon S3.
5+
6+
With EMR Serverless, customers don’t have to configure, optimize, secure, or operate clusters to run applications. You will continue to get the benefits of [Amazon EMR](https://aws.amazon.com/emr/), such as open source compatibility, concurrency, and optimized runtime performance for popular data frameworks. EMR Serverless is suitable for customers who want ease in operating applications using open-source frameworks. It offers quick job startup, automatic capacity management, and straightforward cost controls.
7+
8+
Several infrastructure as code (IaC) frameworks, such as the AWS CDK and Terraform, are available today to help customers define their infrastructure. Terraform, an AWS Partner Network (APN) Advanced Technology Partner and member of the AWS DevOps Competency, is an infrastructure as code tool similar to [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) that allows you to create, update, and version your AWS infrastructure. Terraform provides a friendly syntax along with other features such as planning (visibility into changes before they actually happen), graphing, and the ability to create templates that break infrastructure configurations into smaller chunks for better maintenance and reusability. We will leverage the capabilities and features of Terraform to build an API-based ingestion process into AWS. Let’s get started!
9+
10+
We provide the Terraform infrastructure definition and the source code for an AWS Lambda function that ingests sample user clicks from an online website into an [Amazon Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/) delivery stream. The solution leverages Firehose’s capability to convert the incoming data into Parquet files (an open-source file format for Hadoop) using the [AWS Glue](https://aws.amazon.com/glue/) catalog before pushing them to [Amazon S3](https://aws.amazon.com/s3/). The generated Parquet log files in S3 are then processed by an EMR Serverless job, which writes a report detailing aggregate click-stream statistics to an S3 bucket. The EMR Serverless job is triggered using [AWS Step Functions](https://aws.amazon.com/step-functions). The sample architecture and code are set up as described below.
11+
12+
The provided samples include the source code for building the infrastructure with Terraform and for running the Amazon EMR application. Setup scripts are provided to create the sample ingestion of incoming application logs using AWS Lambda. A similar ingestion pattern was implemented with Terraform in an earlier [blog](https://aws.amazon.com/blogs/developer/provision-aws-infrastructure-using-terraform-by-hashicorp-an-example-of-web-application-logging-customer-data/).
13+
14+
Overview of the steps and the AWS Services used in this solution:
615

7-
* Change the title in this README
8-
* Edit your repository description on GitHub
16+
* Java source build – Provided application code is packaged & built using Apache Maven
17+
* Terraform commands are used to deploy the infrastructure in AWS.
18+
* [Amazon EMR Serverless](https://aws.amazon.com/emr/serverless/) Application - provides the option to submit a Spark job.
19+
* [AWS Lambda](https://aws.amazon.com/lambda/):
20+
* Ingestion Lambda – This lambda processes the incoming request and pushes the data into Firehose stream.
21+
* EMR Start Job Lambda - This lambda starts the EMR Serverless application. The EMR job then converts the ingested user click logs into output in another S3 bucket.
22+
* [AWS Step Functions](https://aws.amazon.com/step-functions) triggers the EMR Start Job Lambda which submits the application to EMR Serverless for processing of the ingested log files.
23+
* [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3)
24+
* Firehose Delivery Bucket - Stores the ingested application logs in parquet file format
25+
* Loggregator Source Bucket - Stores the scala code/jar for EMR job execution
26+
* Loggregator Output Bucket - EMR processed output is stored in this bucket
27+
* EMR Serverless logs Bucket - Stores EMR process application logs
28+
* Sample AWS CLI invoke commands (run as part of the initial setup process) insert the data using the Ingestion Lambda. The Firehose stream converts the incoming stream into a Parquet file and stores it in an S3 bucket.
929

10-
## Security
30+
31+
![Alt text](assets/emr-serverless-click-logs-from-web-application.drawio.png?raw=true "Title")
32+
### Prerequisites
1133

12-
See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
34+
* [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) - At the time of writing, version 2.7.18 was used. It is required to run aws emr-serverless CLI commands from your local machine. Optionally, all the AWS services used in this blog can also be viewed and operated from the AWS Console.
35+
* Make sure [Java](https://www.java.com/en/download/) is installed and JDK/JRE 8 is set in the environment path of your machine. For instructions, see [Java Development Kit](https://www.java.com/en/download/)
36+
* [Apache Maven](https://maven.apache.org/download.cgi) – Java Lambdas are built using mvn packages and are deployed using Terraform into AWS
37+
* [Scala Build Tool](https://www.scala-sbt.org/download.html) (sbt) - Version 1.4.7 is used at the time of this article. Make sure to download and install based on your operating system needs.
38+
* Set up [Terraform](https://www.terraform.io/downloads). For steps, see Terraform downloads. Version 1.2.5 is used at the time of this article.
39+
* An [AWS Account](https://aws.amazon.com/free/)
1340

14-
## License
41+
### Design Decisions
1542

16-
This library is licensed under the MIT-0 License. See the LICENSE file.
43+
* We use AWS Step Functions and AWS Lambda in this use case to trigger the EMR Serverless application. In the real world, the data processing application could be long running and may exceed AWS Lambda’s execution timeout. In that case, tools like [Amazon Managed Workflows for Apache Airflow (MWAA)](https://aws.amazon.com/managed-workflows-for-apache-airflow/) can be used. MWAA is a managed orchestration service that makes it easier to set up and operate end-to-end data pipelines in the cloud at scale.
44+
* The AWS Lambda code and the EMR Serverless log aggregation code are developed using Java and Scala respectively. Either can be written in any language supported for these use cases.
45+
* AWS CLI V2 is required for querying Amazon EMR Serverless applications from the command line. These can also be viewed from the AWS Console. A sample CLI command is provided in the “Testing” section below.
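
The timeout concern in the first design decision can be made concrete with a small polling sketch. This is not the repository's Lambda code: `wait_for_job`, the injected `get_state` callable, and the `"TIMED_OUT"` sentinel are hypothetical names used only for illustration, and the terminal state names are assumed to match the job run states that EMR Serverless reports. Any caller that waits like this inside a single Lambda invocation is bounded by the Lambda timeout, which is why an orchestrator is a more natural place for the waiting.

```python
import time

# Job run states treated as terminal in this sketch (assumption: these match
# the states EMR Serverless reports for a job run).
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=30, max_polls=20, sleep=time.sleep):
    """Poll get_state() until a terminal state or the polling budget runs out.

    get_state is a hypothetical zero-argument callable that returns the
    current job state as a string. Returns the terminal state, or the
    sketch-only sentinel "TIMED_OUT" if none was reached in max_polls polls.
    """
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    return "TIMED_OUT"
```

With the defaults, the loop waits up to 20 x 30 seconds; a job that outlives that budget returns `"TIMED_OUT"` and would need a retry or a longer-lived orchestrator.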
1746

47+
### Steps
48+
49+
50+
Clone [this repository](https://github.com/aws-samples/aws-emr-serverless-using-terraform) and execute the commands below to spin up the infrastructure and the application.
51+
The provided “exec.sh” shell script builds the Java application jar (for the Lambda ingestion) and the Scala application jar (for the EMR processing), and deploys the AWS infrastructure needed for this use case.
52+
53+
Execute the below commands
54+
55+
56+
```
57+
$ chmod +x exec.sh
58+
$ ./exec.sh
59+
```
60+
61+
62+
To run the commands individually
63+
64+
Set the application deployment region and account number. An example is shown below; modify as needed.
65+
66+
```
67+
$ APP_DIR=$PWD
68+
$ APP_PREFIX=clicklogger
69+
$ STAGE_NAME=dev
70+
$ REGION=us-east-1
71+
$ ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')
72+
```
73+
74+
Build the AWS Lambda application jar with Maven and the Scala application package with sbt
75+
76+
```
77+
$ cd $APP_DIR/source/clicklogger
78+
$ mvn clean package
79+
$ sbt reload
80+
$ sbt compile
81+
$ sbt package
82+
```
83+
84+
85+
Deploy the AWS Infrastructure using Terraform
86+
87+
```
88+
$ terraform init
89+
$ terraform plan
90+
$ terraform apply --auto-approve
91+
```
92+
93+
### Testing
94+
95+
96+
97+
Once the application is built and deployed, you can insert sample data for the EMR processing, as in the example below. Note that exec.sh contains multiple sample insertions for AWS Lambda. The ingested logs will be used by the EMR Serverless application job.
98+
99+
The sample AWS CLI invoke command below inserts sample data for the application logs
100+
101+
```
102+
aws lambda invoke --function-name clicklogger-dev-ingestion-lambda --cli-binary-format raw-in-base64-out --payload '{"requestid":"OAP-guid-001","contextid":"OAP-ctxt-001","callerid":"OrderingApplication","component":"login","action":"load","type":"webpage"}' out
103+
```
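
As a minimal sketch, the same invoke command can be assembled programmatically, which helps when generating several sample insertions. The helper names `build_click_payload` and `build_invoke_command` are hypothetical and not part of the repository. The payload fields and CLI flags are the ones used in the command above.

```python
import json
import shlex


def build_click_payload(request_id, context_id, caller_id, component, action, page_type):
    """Return a compact JSON string with the same fields as the sample payload."""
    return json.dumps(
        {
            "requestid": request_id,
            "contextid": context_id,
            "callerid": caller_id,
            "component": component,
            "action": action,
            "type": page_type,
        },
        separators=(",", ":"),
    )


def build_invoke_command(function_name, payload, outfile="out"):
    """Return a shell-quoted `aws lambda invoke` command line for the payload."""
    args = [
        "aws", "lambda", "invoke",
        "--function-name", function_name,
        "--cli-binary-format", "raw-in-base64-out",
        "--payload", payload,
        outfile,
    ]
    # shlex.quote keeps the JSON payload intact as a single shell argument.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    sample = build_click_payload(
        "OAP-guid-001", "OAP-ctxt-001", "OrderingApplication", "login", "load", "webpage"
    )
    print(build_invoke_command("clicklogger-dev-ingestion-lambda", sample))
```

Quoting through `shlex.quote` avoids the hand-typed quoting mistakes that are easy to make when a JSON document is passed on the command line.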
104+
105+
Validate the Deployments
106+
107+
* Output – Once the Lambda is successfully executed, you should see the output in S3 buckets as shown below
108+
* Validate the saved ingested data as below
109+
* Navigate to the bucket created as part of the stack.
110+
* Select the file and view the file from “Select From” sub tab.
111+
* You should see that the ingested stream was converted into a Parquet file.
112+
* Select the file and view the data. A sample is shown below
113+
114+
![Alt text](assets/s3_source_parquet_files.png?raw=true "Title")
115+
116+
* Run AWS Step Function to validate the Serverless application
117+
* Open AWS Console > AWS Step Function > Open "clicklogger-dev-state-machine".
118+
* The step function will show the steps that ran to trigger the AWS Lambda and EMR Serverless Application
119+
* Start a new execution to trigger the AWS Lambda and EMR Serverless Application/Job
120+
* Once the AWS Step Function is successful, navigate to Amazon S3 > clicklogger-dev-outputs-bucket- to see the output files.
121+
* These will be partitioned by year/month/date/response.md. A sample is shown below
122+
123+
![Alt text](assets/s3_output_response_file.png?raw=true "Title")
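
The year/month/date partition layout described above can be sketched as a small helper that maps a run date to the expected object key. `output_key` is a hypothetical name, and the zero-padded format is inferred from the sample key `2022/07/28/response.md` in this README; the Spark job's own code is not shown here.

```python
from datetime import date


def output_key(run_date, filename="response.md"):
    """Return the date-partitioned key, e.g. 2022/07/28/response.md."""
    # date supports strftime-style format specs inside f-strings.
    return f"{run_date:%Y/%m/%d}/{filename}"


if __name__ == "__main__":
    print(output_key(date(2022, 7, 28)))
```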
124+
125+
126+
The AWS CLI can be used to check the deployed EMR Serverless application
127+
128+
```
129+
$ aws emr-serverless list-applications \
130+
| jq -r '.applications[] | select(.name=="clicklogger-dev-loggregrator-emr-<Your-Account-Number>").id'
131+
132+
133+
```
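
For readers without jq, the filter above can be approximated in a few lines of Python over the JSON that `aws emr-serverless list-applications` prints. This is only a sketch: `find_application_ids` is a hypothetical helper, and the sample response in the usage example is invented for illustration, using only the `applications`, `name`, and `id` fields that the jq expression itself relies on.

```python
import json


def find_application_ids(list_applications_json, app_name):
    """Return the ids of all applications whose name equals app_name."""
    response = json.loads(list_applications_json)
    return [
        app["id"]
        for app in response.get("applications", [])
        if app.get("name") == app_name
    ]


if __name__ == "__main__":
    # Invented sample response; real output contains more fields per application.
    sample = json.dumps(
        {
            "applications": [
                {"id": "app-example-1", "name": "some-other-application"},
                {"id": "app-example-2", "name": "my-application"},
            ]
        }
    )
    print(find_application_ids(sample, "my-application"))
```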
134+
135+
![Alt text](assets/step_function_success.png?raw=true "Title")
136+
137+
EMR Studio
138+
139+
* Open AWS Console, Navigate to “EMR” > “Serverless” tab on the left pane.
140+
* Select “clicklogger-dev-studio” and click “Manage Applications”
141+
142+
143+
144+
![Alt text](assets/EMRStudioApplications.png?raw=true "Title")
145+
146+
![Alt text](assets/EMRServerlessApplication.png?raw=true "Title")
147+
148+
Reviewing the Serverless Application Output:
149+
150+
151+
* Open AWS Console, Navigate to Amazon S3
152+
* Open the outputs S3 bucket. This will be like - us-east-1-clicklogger-dev-loggregator-output-<YOUR-ACCOUNT-NUMBER>
153+
* The EMR Serverless application writes the output based on the date partition as below
154+
* 2022/07/28/response.md
155+
* Output of the file will be like below
156+
157+
```
158+
159+
|*createdTime*|*callerid*|*component*|*count*
160+
|------------|-----------|-----------|-------
161+
*07-28-2022*|OrderingApplication|checkout|2
162+
*07-28-2022*|OrderingApplication|login|2
163+
*07-28-2022*|OrderingApplication|products|2
164+
```
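
The report above is a count of click events grouped by creation date, caller, and component. The following sketch reproduces that grouping in plain Python to show the shape of the aggregation. It is not the Scala Spark job from the repository, and it assumes each event already carries a `createdTime` string in the `MM-DD-YYYY` form seen in the report. The function names are hypothetical.

```python
from collections import Counter


def aggregate_clicks(events):
    """Count events per (createdTime, callerid, component), sorted by key."""
    counts = Counter(
        (event["createdTime"], event["callerid"], event["component"])
        for event in events
    )
    return [(*key, count) for key, count in sorted(counts.items())]


def render_rows(rows):
    """Format aggregated rows in the pipe-delimited style of the sample report."""
    return [
        f"*{created}*|{caller}|{component}|{count}"
        for created, caller, component, count in rows
    ]


if __name__ == "__main__":
    events = [
        {"createdTime": "07-28-2022", "callerid": "OrderingApplication", "component": name}
        for name in ["login", "checkout", "products", "login", "checkout", "products"]
    ]
    for line in render_rows(aggregate_clicks(events)):
        print(line)
```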
165+
166+
## Cleanup
167+
168+
169+
The provided "./cleanup.sh" script has the required steps to delete all the files from the Amazon S3 buckets that were created as part of this blog. The terraform destroy command then cleans up the AWS infrastructure that was spun up as described above.
170+
171+
172+
```
173+
$ chmod +x cleanup.sh
174+
$ ./cleanup.sh
175+
```
176+
177+
* To do the steps manually,
178+
179+
The S3 buckets and created services can also be deleted using the CLI. Execute the commands below (an example; modify as needed):
180+
181+
```
182+
183+
184+
# CLI commands to delete the S3 buckets
185+
186+
aws s3 rb s3://clicklogger-dev-emr-serverless-logs-bucket-<your-account-number> --force
187+
aws s3 rb s3://clicklogger-dev-firehose-delivery-bucket-<your-account-number> --force
188+
aws s3 rb s3://clicklogger-dev-loggregator-output-bucket-<your-account-number> --force
189+
aws s3 rb s3://clicklogger-dev-loggregator-source-bucket-<your-account-number> --force
190+
191+
192+
# Destroy the AWS Infrastructure
193+
terraform destroy --auto-approve
194+
195+
196+
```
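
When the prefix, stage, or account number differs from the example, the bucket removal commands can be generated rather than edited by hand. The sketch below assumes the naming pattern visible in the commands above (`<prefix>-<stage>-<suffix>-<account-number>`). `cleanup_commands` is a hypothetical helper and only prints commands; it does not run them.

```python
# Bucket name suffixes taken from the manual cleanup commands above.
BUCKET_SUFFIXES = [
    "emr-serverless-logs-bucket",
    "firehose-delivery-bucket",
    "loggregator-output-bucket",
    "loggregator-source-bucket",
]


def cleanup_commands(app_prefix, stage_name, account_id):
    """Return one `aws s3 rb --force` command per bucket created by the stack."""
    return [
        f"aws s3 rb s3://{app_prefix}-{stage_name}-{suffix}-{account_id} --force"
        for suffix in BUCKET_SUFFIXES
    ]


if __name__ == "__main__":
    # Placeholder account number for illustration only.
    for command in cleanup_commands("clicklogger", "dev", "123456789012"):
        print(command)
```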
197+
198+
199+
200+
## Conclusion
201+
202+
203+
To recap, in this post we built, deployed, and ran a data processing Spark job on Amazon EMR Serverless that interacts with various AWS services. The post walked through deploying a Lambda function packaged with Java using Maven, and Scala application code for an EMR Serverless application triggered by AWS Step Functions, all with infrastructure as code. You may use any combination of applicable programming languages to build your Lambda functions and EMR job application. EMR Serverless can be triggered manually, automated, or orchestrated using AWS services such as AWS Step Functions and Amazon Managed Workflows for Apache Airflow.
204+
205+
We encourage you to test this example and see for yourself how this overall application design works within AWS. Then it is just a matter of swapping in your own code base, packaging it, and letting Amazon EMR Serverless handle the processing efficiently.
206+
207+
If you implement this example and run into any issues, or have any questions or feedback about this blog, please provide your comments below!
208+
209+
## References
210+
211+
* [Terraform: Beyond the basics with AWS](https://aws.amazon.com/blogs/apn/terraform-beyond-the-basics-with-aws/)
212+
* [Amazon EMR Serverless General Availability](https://aws.amazon.com/about-aws/whats-new/2022/06/amazon-emr-serverless-generally-available/)
213+
* [Amazon EMR Serverless Now Generally Available – Run Big Data Applications without Managing Servers](https://aws.amazon.com/blogs/aws/amazon-emr-serverless-now-generally-available-run-big-data-applications-without-managing-servers/)
214+
* [Provision AWS infrastructure using Terraform (By HashiCorp): an example of web application logging customer data](https://aws.amazon.com/blogs/developer/provision-aws-infrastructure-using-terraform-by-hashicorp-an-example-of-web-application-logging-customer-data/)
215+
216+

assets/AWSStepFunction.png

12.3 KB

assets/AWSStepFunction.png.license

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
2+
3+
SPDX-License-Identifier: MIT-0
114 KB
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
2+
3+
SPDX-License-Identifier: MIT-0

assets/EMRStudioApplications.png

129 KB
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
2+
3+
SPDX-License-Identifier: MIT-0
