# Spark Data Lake example

In this example, we build a Data Lake and process aggregations from the NY taxi dataset with a Spark application. This `README` is a step-by-step deployment guide. You can read more about this example solution in the [documentation](https://awslabs.github.io/aws-data-solutions-framework/).

We use a self-contained application where developers can manage both the business code (Spark code in the `./spark` folder) and the infrastructure code (AWS CDK code in the `./infra` folder).

The business code is a simple **PySpark** application packaged as a standard Python project following the [packaging best practices](https://packaging.python.org/en/latest/tutorials/packaging-projects/):
* A `pyproject.toml` file used to install internal dependencies (packages defined in the code structure) and external dependencies (libraries from PyPI).
* An `src` folder containing the business code organized in Python packages (`__init__.py` files).
* A `test` folder containing the unit tests, run via the `pytest .` command from the root folder of the Spark project. You can use the [EMR VS Code toolkit](https://marketplace.visualstudio.com/items?itemName=AmazonEMR.emr-tools) to test the application locally on an EMR local runtime.
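To illustrate the layout above, here is a minimal sketch of what a unit test in the `test` folder could look like. The function and test names are hypothetical, not taken from this project; the real tests exercise PySpark transformations, but the `pytest` pattern is the same:

```python
# Hypothetical example: a pure business function as it might live under src/,
# and the pytest-style tests that `pytest .` would discover in the test folder.

def trip_revenue(fare: float, tip: float, tolls: float) -> float:
    """Toy stand-in for taxi fare math; returns the total rounded to cents."""
    return round(fare + tip + tolls, 2)


def test_trip_revenue_sums_components():
    assert trip_revenue(10.0, 2.5, 1.25) == 13.75


def test_trip_revenue_rounds_to_cents():
    assert trip_revenue(10.111, 0.0, 0.0) == 10.11
```

`pytest` discovers any `test_*` function in `test_*.py` files, so running `pytest .` from the Spark project root executes these automatically.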

The infrastructure code is an AWS CDK application that uses the AWS DSF library to create the required resources. It contains two CDK stacks:
* An **application stack** which provisions the Data Lake, the data catalog, and the Spark runtime resources via the following constructs:
  * A `DataLakeStorage`
  * A `DataCatalogDatabase`
  * A `SparkEmrServerlessRuntime`
  * A `SparkEmrServerlessJob`
* A **CICD stack** which provisions a CICD pipeline to manage the application development lifecycle via the following constructs:
  * A `SparkEmrCICDPipeline`
  * An `ApplicationStackFactory`
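The `ApplicationStackFactory` exists so the CICD pipeline can instantiate the same application stack once per stage (staging, then production) with stage-specific settings. Here is a stdlib-only sketch of that factory idea; the class and method names are illustrative, not the DSF API:

```python
# Stdlib-only illustration of the factory pattern behind ApplicationStackFactory.
# All names are hypothetical; the real DSF construct produces CDK stacks.
from dataclasses import dataclass


@dataclass
class ApplicationStack:
    """Stand-in for the CDK application stack (Data Lake + Spark runtime)."""
    stage: str
    removal_policy: str


class StackFactory:
    """Creates one application stack per pipeline stage."""

    def create_stack(self, stage: str) -> ApplicationStack:
        # Illustrative policy choice: non-prod stages may destroy data on
        # teardown, while production retains it.
        policy = "DESTROY" if stage == "staging" else "RETAIN"
        return ApplicationStack(stage=stage, removal_policy=policy)


factory = StackFactory()
staging = factory.create_stack("staging")
prod = factory.create_stack("prod")
```

The pipeline calls the factory once per stage, which is why the application stack is written as reusable code rather than instantiated directly at the app root.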

## Prerequisites

1. [Install the AWS CDK CLI](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install)
2. [Bootstrap the CICD account](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_bootstrap)
3. [Bootstrap the staging and production accounts](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.pipelines-readme.html#cdk-environment-bootstrapping) with a trust relationship from the CICD account:

```bash
cdk bootstrap \
--profile staging \
--trust <CICD_ACCOUNT_ID> \
--cloudformation-execution-policies "POLICY_ARN" \
aws://<STAGING_ACCOUNT_ID>/<REGION>
```
4. [Install the git-remote-codecommit](https://docs.aws.amazon.com/codecommit/latest/userguide/setting-up-git-remote-codecommit.html#setting-up-git-remote-codecommit-install) utility to interact with AWS CodeCommit.

## Getting started

1. Copy the `spark-data-lake` folder somewhere else on your machine and initialize a new Git repository:

```bash
cp -R ../spark-data-lake <MY_LOCAL_PATH>
cd <MY_LOCAL_PATH>
git init
```

2. Modify `./infra/requirements.txt` to add the `aws_dsf` library as a dependency:

```
aws-cdk-lib==2.94.0
constructs>=10.2.55, <11.0.0
aws_dsf==1.0.0-rc1
```

3. From the `./infra` folder, create a Python 3 virtual environment and activate it:

```bash
cd infra
python3 -m venv .venv
source .venv/bin/activate
```

4. Install the dependencies, including the AWS DSF library:

```bash
pip install -r requirements.txt
```

5. Provide the target account and region information for the staging and production stages of the CICD pipeline. Also configure the global removal policy if you want to delete all the resources, including the data, when deleting the example. Create a `cdk.context.json` file with the following content:

```json
{
  "staging": {
    "accountId": "<STAGING_ACCOUNT_ID>",
    "region": "<STAGING_REGION>"
  },
  "prod": {
    "accountId": "<PRODUCTION_ACCOUNT_ID>",
    "region": "<PRODUCTION_REGION>"
  },
  "@aws-data-solutions-framework/removeDataOnDestroy": true
}
```
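Inside the CDK app, these context values can then be read (for example via `app.node.try_get_context("staging")`) to target each stage's account and region. A stdlib-only sketch of that lookup, with the JSON inlined and dummy account IDs for illustration:

```python
import json

# Inline copy of the cdk.context.json shape, with made-up values for
# illustration; a CDK app would read the same keys via node.try_get_context.
context = json.loads("""
{
  "staging": {"accountId": "111111111111", "region": "eu-west-1"},
  "prod": {"accountId": "222222222222", "region": "eu-west-1"}
}
""")


def target_env(stage: str) -> str:
    """Build the aws://<account>/<region> environment string CDK uses."""
    cfg = context[stage]
    return f"aws://{cfg['accountId']}/{cfg['region']}"


print(target_env("staging"))  # aws://111111111111/eu-west-1
```

This mirrors the `aws://<ACCOUNT>/<REGION>` format already used in the `cdk bootstrap` command above.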

6. Deploy the CICD pipeline stack:

```bash
cdk deploy CICDPipelineStack
```

7. Add the CICD pipeline Git repository as a remote. The exact command is provided as an output of the `CICDPipelineStack`. Then push the code to the repository:

```bash
git remote add demo codecommit::<REGION>://SparkTest
git push demo
```