
Commit ec34fa0

emr serverless python dependencies (#250)
1 parent 367ff58 commit ec34fa0

File tree

14 files changed (+477, -1 lines)

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -5,6 +5,7 @@ node_modules/
 .project
 .settings/
 target/
+volume/

 .idea/

@@ -13,7 +14,7 @@ __pycache__/
 *.log

 .terraform/
-terraform.tfstate
+terraform.tfstate*
 .terraform.lock*

 .venv/
Dockerfile-aws

Lines changed: 58 additions & 0 deletions

# This is a multi-stage Dockerfile that can be used to build many different types of
# bundled dependencies for PySpark projects.
# The `base` stage installs generic tools necessary for packaging.
#
# There are `export-` and `build-` stages for the different types of projects.
# - python-packages - Generic support for Python projects with pyproject.toml
# - poetry - Support for Poetry projects
#
# This Dockerfile is generated automatically as part of the emr-cli tool.
# Feel free to modify it for your needs, but leave the `build-` and `export-`
# stages related to your project.
#
# To build manually, you can use the following command, assuming
# the Docker BuildKit backend is enabled. https://docs.docker.com/build/buildkit/
#
# Example for building a poetry project and saving the output to dist/ folder
# docker build --target export-poetry --output dist .

## ----------------------------------------------------------------------------
## Base stage for python development
## ----------------------------------------------------------------------------
FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3 tar gzip

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# EMR 6.x uses Python 3.7 - limit Poetry version to 1.5.1
ENV POETRY_VERSION=1.5.1
RUN python3 -m pip install --upgrade pip
RUN curl -sSL https://install.python-poetry.org | python3 -

ENV PATH="$PATH:/root/.local/bin"

WORKDIR /app

COPY . .

# Test stage - installs test dependencies defined in pyproject.toml
FROM base as test
RUN python3 -m pip install .[test]

## ----------------------------------------------------------------------------
## Build and export stages for Poetry Python projects
## ----------------------------------------------------------------------------
# Build stage for poetry
FROM base as build-poetry
RUN poetry self add poetry-plugin-bundle && \
    poetry bundle venv dist/bundle && \
    tar -czvf dist/pyspark_deps.tar.gz -C dist/bundle . && \
    rm -rf dist/bundle

FROM scratch as export-poetry
COPY --from=build-poetry /app/dist/pyspark_deps.tar.gz /
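For the AWS flow, the archive exported by this Dockerfile has to be uploaded next to the job code so it can be referenced via `spark.archives`. A minimal sketch, assuming a hypothetical bucket name (the actual bucket is created by the Terraform configuration applied with `make deploy-aws`):

```
# Export the bundled venv to dist/ (same command as in the header comment above;
# `make build-aws` does the equivalent using Dockerfile-aws)
docker build --target export-poetry --output dist .

# Upload the archive for the job to pick up (bucket name is a placeholder)
aws s3 cp dist/pyspark_deps.tar.gz s3://<your-emr-bucket>/
```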
Dockerfile-localstack

Lines changed: 30 additions & 0 deletions

## ----------------------------------------------------------------------------
## Base stage for python development
## ----------------------------------------------------------------------------
FROM --platform=linux/amd64 localstack/localstack:latest AS base

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# EMR 6.x uses Python 3.7 - limit Poetry version to 1.5.1
ENV POETRY_VERSION=1.5.1
RUN python3 -m pip install --upgrade pip
RUN curl -sSL https://install.python-poetry.org | python3 -

ENV PATH="$PATH:/root/.local/bin"

WORKDIR /app

COPY . .

## ----------------------------------------------------------------------------
## Build and export stages for standard Python projects
## ----------------------------------------------------------------------------
# Build stage - installs required dependencies and creates a venv package
FROM base as build-poetry
RUN poetry self add poetry-plugin-bundle && \
    poetry bundle venv dist/bundle

FROM scratch as export-poetry
COPY --from=build-poetry /app/dist/bundle /pyspark_env/
Makefile

Lines changed: 48 additions & 0 deletions

export AWS_ACCESS_KEY_ID ?= test
export AWS_SECRET_ACCESS_KEY ?= test
export AWS_DEFAULT_REGION = us-east-1

init:
	terraform workspace new local &
	terraform workspace new aws &
	terraform init

build:
	docker build . --file Dockerfile-localstack --output .

build-aws:
	docker build . --file Dockerfile-aws --output .

deploy:
	docker-compose up --detach
	terraform workspace select local
	AWS_ENDPOINT_URL=https://localhost.localstack.cloud:4566 terraform apply --auto-approve

deploy-aws:
	terraform workspace select aws
	terraform apply --auto-approve

run:
	terraform workspace select local
	./start_job.sh local

run-aws:
	terraform workspace select aws
	./start_job.sh aws

stop:
	docker-compose down

destroy:
	terraform workspace select local
	./stop-application.sh
	terraform destroy --auto-approve

destroy-aws:
	terraform workspace select aws
	./stop-application.sh aws
	terraform destroy --auto-approve

test-ci:
	make init build deploy run; return_code=`echo $$?`;\
	make stop; exit $$return_code;
README.md

Lines changed: 69 additions & 0 deletions

# EMR Serverless with Python dependencies

[AWS has this example](https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/dependencies) of how to add Python dependencies to an EMR job. Unfortunately, the same pattern isn't currently possible on LocalStack. This project serves as an example of a workaround that still lets you add your own dependencies and modules to your EMR Spark jobs.

## Requirements
- Make
- Terraform ~>1.9.1
- [LocalStack](https://github.com/localstack/localstack)
- [awslocal](https://github.com/localstack/awscli-local)

## Init

This initializes Terraform and the Terraform workspaces.

```
make init
```

## Build

This builds the Python dependencies for the Spark job. Here is the first difference with AWS: instead of packaging the dependencies as we do for AWS, we save the environment to the project folder so it can be mounted into the LocalStack container.

```
# For LocalStack, we create a /pyspark_env folder
make build

# For AWS, we create pyspark_deps.tar.gz
make build-aws
```

## Deploy

Creates the following resources:
- IAM role
- IAM policy
- S3 bucket
- EMR Serverless application

```
# Starts LocalStack using docker-compose and applies the Terraform configuration.
LOCALSTACK_AUTH_TOKEN=<your_auth_token> make deploy

# Applies the Terraform configuration to AWS.
make deploy-aws
```

## Run job

We can finally run our Spark job. Notice the differences in `start_job.sh` between LocalStack and AWS. For AWS we add `spark.archives` to the configuration and reference the environment's interpreter as `environment/bin/python`. For LocalStack, we rely on the volume mounted into the container instead of the archive and use the absolute path `/tmp/environment/bin/python`.

```
# LocalStack
make run

# AWS
make run-aws
```
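To make the difference concrete, below is a minimal sketch of what the two `start_job.sh` variants might look like, roughly following the AWS sample linked above. The application ID, role ARN, bucket, and entry-point path are hypothetical placeholders; the actual script is not shown in this commit.

```
# AWS variant (sketch): ship the bundled venv via spark.archives and point PYSPARK_PYTHON at it
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket>/main.py",
      "sparkSubmitParameters": "--conf spark.archives=s3://<bucket>/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
    }
  }'

# LocalStack variant (sketch): no archive needed; the venv is already mounted at /tmp/environment
awslocal emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket>/main.py",
      "sparkSubmitParameters": "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=/tmp/environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=/tmp/environment/bin/python"
    }
  }'
```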

## Destroy

Finally, we can destroy the environment. We make sure to stop the application first.

```
# LocalStack
make destroy

# AWS
make destroy-aws
```
docker-compose.yml

Lines changed: 22 additions & 0 deletions

services:
  localstack:
    container_name: "${LOCALSTACK_DOCKER_NAME:-localstack-main}"
    # Using this image will significantly decrease the job execution time
    # image: localstack/localstack-pro:latest-bigdata
    image: localstack/localstack-pro:latest
    ports:
      - "127.0.0.1:4566:4566"            # LocalStack Gateway
      - "127.0.0.1:4510-4559:4510-4559"  # external services port range
      - "127.0.0.1:443:443"              # LocalStack HTTPS Gateway (Pro)
    environment:
      # Activate LocalStack Pro: https://docs.localstack.cloud/getting-started/auth-token/
      - LOCALSTACK_AUTH_TOKEN=${LOCALSTACK_AUTH_TOKEN:-}  # required for Pro
      - LOCALSTACK_API_KEY=${LOCALSTACK_API_KEY:-}  # required for CI
      # LocalStack configuration: https://docs.localstack.cloud/references/configuration/
      - DEBUG=${DEBUG:-0}
      - PERSISTENCE=${PERSISTENCE:-0}
      - HIVE_DEFAULT_VERSION=3.1.3
    volumes:
      - "${LOCALSTACK_VOLUME_DIR:-./volume}:/var/lib/localstack"
      - "/var/run/docker.sock:/var/run/docker.sock"
      - "./pyspark_env:/tmp/environment"
Lines changed: 51 additions & 0 deletions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAccessForEMRSamples",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::*.elasticmapreduce",
        "arn:aws:s3:::*.elasticmapreduce/*"
      ]
    },
    {
      "Sid": "FullAccessToOutputBucket",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::${bucket}",
        "arn:aws:s3:::${bucket}/*"
      ]
    },
    {
      "Sid": "GlueCreateAndReadDataCatalog",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:CreateDatabase",
        "glue:GetDataBases",
        "glue:CreateTable",
        "glue:GetTable",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:CreatePartition",
        "glue:BatchCreatePartition",
        "glue:GetUserDefinedFunctions"
      ],
      "Resource": ["*"]
    }
  ]
}
Lines changed: 11 additions & 0 deletions

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "EMRServerlessTrustPolicy",
    "Action": "sts:AssumeRole",
    "Effect": "Allow",
    "Principal": {
      "Service": "emr-serverless.amazonaws.com"
    }
  }]
}
main.py

Lines changed: 9 additions & 0 deletions

from jobs.spark_run import SparkRun

# importing typer to validate it is in the environment
import typer

if __name__ == "__main__":
    spark_runner = SparkRun()
    spark_runner.run()
    spark_runner.stop()
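The `import typer` only succeeds inside the job if typer is part of the bundled environment, which means it has to be declared as a project dependency. The project's pyproject.toml is not included in this commit; with Poetry, declaring the dependency would look roughly like this:

```
# Hypothetical: declare typer so `poetry bundle` includes it in the packaged venv
poetry add typer
```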
jobs/spark_run.py

Lines changed: 22 additions & 0 deletions

from pyspark.sql import SparkSession
from pyspark.sql.functions import col


class SparkRun:

    def __init__(self) -> None:
        self.spark = SparkSession.builder.appName("ExtremeWeather").getOrCreate()

    def run(self) -> None:
        df = self.spark.createDataFrame(
            [
                ("sue", 32),
                ("li", 3),
                ("bob", 75),
                ("heo", 13),
            ],
            ["first_name", "age"],
        )
        print(df.select(col("first_name"), col("age")).first())

    def stop(self):
        self.spark.stop()
