Commit e229c18

docs: spark data lake doc update (#172)

* docs: add example docs
* Update website config
* Fix the build

1 parent 0c767a0

File tree

12 files changed: +149 -108 lines

.projenrc.ts

Lines changed: 1 addition & 1 deletion
```diff
@@ -8,7 +8,7 @@ const CDK_CONSTRUCTS_VERSION = '10.2.55';
 const JSII_VERSION = '~5.0.0';
 const KUBECTL_LAYER_VERSION='v27';
-const repositoryUrl = 'git@github.com:awslabs/aws-data-solutions-framework.git';
+const repositoryUrl = 'https://github.com/awslabs/aws-data-solutions-framework.git';
 const homepage = 'https://awslabs.github.io/aws-data-solutions-framework/';
 const author = 'Amazon Web Services';
 const authorAddress = 'https://aws.amazon.com';
```

README.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -10,9 +10,9 @@ You can leverage AWS DSF to implement your data platform in weeks rather than in
 - Use the framework to build your data solutions instead of building cloud infrastructure from scratch.
 - Compose data solutions using integrated building blocks via Infrastructure as Code (IaC).
 - Benefit from smart defaults and built-in AWS best practices.
-- Customize or extend according your requirements.
+- Customize or extend according to your requirements.

-**Get started** by exploring the [framework](./framework/) and available [examples](./example/)
+**Get started** by exploring the [framework](./framework/) and available [examples](./examples/). Learn more in the [documentation](https://awslabs.github.io/aws-data-solutions-framework/).

 ## Security

 See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
```

examples/spark-data-lake/README.md

Lines changed: 100 additions & 0 deletions
# Spark Data Lake example

In this example, we build a data lake and process aggregations from the NY taxi dataset with a Spark application. This `README` is a step-by-step deployment guide. You can read more details about this example solution in the [documentation](https://awslabs.github.io/aws-data-solutions-framework/).

We are using a self-contained application where developers can manage both the business code (Spark code in the `./spark` folder) and the infrastructure code (AWS CDK code in the `./infra` folder).

The business code is a simple **PySpark** application packaged as a standard Python project following the [best practices](https://packaging.python.org/en/latest/tutorials/packaging-projects/):
* A `pyproject.toml` file is used to install internal dependencies (packages defined in the code structure) and external dependencies (libraries from PyPI).
* An `src` folder containing the business code organized in Python packages (`__init__.py` files).
* A `test` folder containing the unit tests, run via the `pytest .` command from the root folder of the Spark project. You can use the [EMR VSCode toolkit](https://marketplace.visualstudio.com/items?itemName=AmazonEMR.emr-tools) to locally test the application on an EMR local runtime.
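The `test` folder convention above can be illustrated with a minimal pytest-style unit test. The `average_fare_by_zone` function below is a hypothetical, plain-Python stand-in for the PySpark aggregation, not code from the example:

```python
# test_aggregations.py -- hypothetical unit test living in the `test` folder.
from collections import defaultdict

def average_fare_by_zone(trips):
    """Toy stand-in for the Spark aggregation: mean fare per pickup zone."""
    totals = defaultdict(lambda: [0.0, 0])  # zone -> [fare sum, trip count]
    for trip in trips:
        acc = totals[trip["zone"]]
        acc[0] += trip["fare"]
        acc[1] += 1
    return {zone: total / count for zone, (total, count) in totals.items()}

def test_average_fare_by_zone():
    trips = [
        {"zone": "Manhattan", "fare": 10.0},
        {"zone": "Manhattan", "fare": 14.0},
        {"zone": "Queens", "fare": 20.0},
    ]
    result = average_fare_by_zone(trips)
    assert result == {"Manhattan": 12.0, "Queens": 20.0}
```

Running `pytest .` from the Spark project root would pick up and execute this test.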
The infrastructure code is an AWS CDK application using the AWS DSF library to create the required resources. It contains 2 CDK stacks:
* An **application stack** which provisions the data lake, data catalog, and Spark runtime resources via the following constructs:
  * A `DataLakeStorage`
  * A `DataCatalogDatabase`
  * A `SparkEmrServerlessRuntime`
  * A `SparkEmrServerlessJob`
* A **CICD stack** which provisions a CICD pipeline to manage the application development lifecycle via the following constructs:
  * A `SparkEmrCICDPipeline`
  * An `ApplicationStackFactory`
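The application stack wiring can be sketched roughly as follows. This is a minimal illustration, not the actual stack from `./infra`; the `aws_dsf` module name and the construct parameter names are assumptions and should be checked against the AWS DSF documentation:

```python
# Hypothetical sketch of the application stack using AWS DSF constructs.
# Parameter names are illustrative assumptions, not a verified API.
from aws_cdk import Stack
from constructs import Construct
import aws_dsf as dsf

class ApplicationStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # Data lake storage: S3 buckets with encryption and lifecycle policies
        storage = dsf.DataLakeStorage(self, "DataLakeStorage")

        # Glue database pointing at the data lake storage
        database = dsf.DataCatalogDatabase(
            self, "TaxiDatabase",
            name="nyc_taxi",
            location_bucket=storage.silver_bucket,  # assumed property name
        )

        # EMR Serverless application to run the Spark job
        runtime = dsf.SparkEmrServerlessRuntime(
            self, "SparkRuntime", name="spark-data-lake"
        )
```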
## Prerequisites

1. [Install the AWS CDK CLI](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install)
2. [Bootstrap the CICD account](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_bootstrap)
3. [Bootstrap the staging and production accounts](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.pipelines-readme.html#cdk-environment-bootstrapping) with a trust relationship from the CICD account:

   ```bash
   cdk bootstrap \
     --profile staging \
     --trust <CICD_ACCOUNT_ID> \
     --cloudformation-execution-policies "<POLICY_ARN>" \
     aws://<STAGING_ACCOUNT_ID>/<REGION>
   ```

4. [Install the git-remote-codecommit](https://docs.aws.amazon.com/codecommit/latest/userguide/setting-up-git-remote-codecommit.html#setting-up-git-remote-codecommit-install) utility to interact with AWS CodeCommit
## Getting started

1. Copy the `spark-data-lake` folder somewhere else on your machine and initialize a new Git repository:

   ```bash
   cp -R ../spark-data-lake <MY_LOCAL_PATH>
   cd <MY_LOCAL_PATH>
   git init
   ```

2. Modify `./infra/requirements.txt` to add the `aws_dsf` library as a dependency:

   ```
   aws-cdk-lib==2.94.0
   constructs>=10.2.55, <11.0.0
   aws_dsf==1.0.0-rc1
   ```

3. From the `./infra` folder, create a Python 3 virtual environment and activate it:

   ```bash
   cd infra
   python3 -m venv .venv
   source .venv/bin/activate
   ```

4. Install the AWS DSF library:

   ```bash
   pip install -r requirements.txt
   ```
5. Provide the target account and region information for the staging and production steps of the CICD pipeline.
   Also configure the global removal policy if you want to delete all the resources, including the data, when deleting the example.

   Create a `cdk.context.json` file with the following content:

   ```json
   {
     "staging": {
       "accountId": "<STAGING_ACCOUNT_ID>",
       "region": "<STAGING_REGION>"
     },
     "prod": {
       "accountId": "<PRODUCTION_ACCOUNT_ID>",
       "region": "<PRODUCTION_REGION>"
     },
     "@aws-data-solutions-framework/removeDataOnDestroy": true
   }
   ```
6. Deploy the CICD pipeline stack:

   ```bash
   cdk deploy CICDPipelineStack
   ```

7. Add the CICD pipeline Git repository as a remote. The command is provided by the `CICDPipelineStack` as an output. Then push the code to the repository:

   ```bash
   git remote add demo codecommit::<REGION>://SparkTest
   git push demo
   ```
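Inside the CDK app, values from `cdk.context.json` are typically read back with `node.try_get_context`. The sketch below mimics that lookup with the standard `json` module so the expected shape of the file can be validated; the function and error message are illustrative, not code from the example:

```python
# Minimal sketch: parse a cdk.context.json payload and pick an environment.
# In the real CDK app this would be app.node.try_get_context("staging").
import json

def get_environment(context_json: str, stage: str) -> dict:
    """Return {'accountId': ..., 'region': ...} for the given stage."""
    context = json.loads(context_json)
    env = context[stage]
    missing = {"accountId", "region"} - env.keys()
    if missing:
        raise ValueError(f"cdk.context.json is missing {missing} for '{stage}'")
    return env

example = '''{
  "staging": {"accountId": "111111111111", "region": "eu-west-1"},
  "prod": {"accountId": "222222222222", "region": "eu-west-1"},
  "@aws-data-solutions-framework/removeDataOnDestroy": true
}'''
print(get_environment(example, "staging")["region"])  # eu-west-1
```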
Lines changed: 1 addition & 98 deletions

The removed content (a near-verbatim duplicate of the new `examples/spark-data-lake/README.md` above) was replaced with a single placeholder line:

```
# replace this
```

framework/package.json

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

package.json

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

website/docs/constructs/index.md

Lines changed: 6 additions & 4 deletions
```diff
@@ -7,17 +7,19 @@ sidebar_label: Introduction

 AWS DSF is an open-source framework that simplifies implementation and delivery of integrated, customizable, and ready-to-deploy solutions that address the most common data analytics requirements.

+![AWS DSF Overview](../../static/img/aws-dsf-overview.png)
+
 AWS DSF uses infrastructure as code and [AWS CDK](https://aws.amazon.com/cdk/) to package AWS products together into easy-to-use solutions. It provides an abstraction atop AWS services based on AWS CDK L3 [constructs](https://docs.aws.amazon.com/cdk/v2/guide/constructs.html).
 L3 constructs are opinionated implementations of common technical patterns and generally create multiple resources configured to work with each other. For example, we provide a construct that creates a complete data lake storage with three different Amazon S3 buckets, encryption, and data lifecycle policies.
 This means that you can create a data lake in your CDK application with just a few lines of code.

-Constructs are written in Typescript but available in both Typescript (on NPM) and Python (on Pypi).
+Constructs are written in Typescript but available in both Typescript (on NPM) and Python (on PyPI).

-The AWS CDK L3 constructs in ADSF are built following these tenets:
-* They simplify the use of AWS products in common situations via configuration helpers and smart defaults.
+The AWS CDK L3 constructs in AWS DSF are built following these tenets:
+* They simplify the use of AWS services in common situations via configuration helpers and smart defaults.
 * Even if they provide smart defaults, you can customize them using the construct parameters to better fit your requirements.
 * If customizing the parameters is not enough, CDK composability allows you to build your own abstractions by composing lower-level constructs.
 * They are [well architected](https://aws.amazon.com/architecture/well-architected/). We use [CDK-nag](https://github.com/cdklabs/cdk-nag) and the [AWS Solutions rules](https://github.com/cdklabs/cdk-nag/blob/main/RULES.md#awssolutions) in the vetting process.

-You can use AWS DSF to accelerate building your analytics solutions, and/or you can use solutions that've been built with it.
+You can use AWS DSF to accelerate building your analytics solutions, and/or you can use solutions that have been built with it.
```
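As a hedged illustration of the "few lines of code" claim above, a data lake storage could be provisioned roughly like this in Python (the `aws_dsf` module name is an assumption based on this commit's `requirements.txt`; the construct signature is not verified here):

```python
# Hypothetical sketch: a complete data lake storage in a few lines.
import aws_cdk as cdk
import aws_dsf as dsf  # assumed Python package name for AWS DSF

app = cdk.App()
stack = cdk.Stack(app, "DataLakeStack")
# One construct call provisions the S3 buckets, encryption,
# and data lifecycle policies with smart defaults.
dsf.DataLakeStorage(stack, "MyDataLakeStorage")
app.synth()
```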
Lines changed: 7 additions & 0 deletions
```json
{
  "label": "Solutions",
  "position": 4,
  "link": {
    "type": "generated-index"
  }
}
```
Lines changed: 28 additions & 0 deletions
---
sidebar_position: 1
sidebar_label: Spark Data Lake example
---

# Spark Data Lake

Build a data lake, and process data with Spark.

In this example, we use AWS DSF to quickly build an end-to-end solution to store and process data, with a multi-environment (staging, production) CICD pipeline for the application logic. Our Spark application processes the NYC Taxi dataset.

We use several constructs from AWS DSF:
- [`DataLakeStorage`](/docs/constructs/library/data-lake-storage)
- [`DataCatalogDatabase`](/docs/constructs/library/data-catalog-database)
- [`SparkEmrServerlessRuntime`](/docs/constructs/library/spark-emr-serverless-runtime)
- [`SparkEmrServerlessJob`](/docs/constructs/library/spark-job)
- [`SparkEmrCICDPipeline`](/docs/constructs/library/spark-cicd-pipeline)
- [`ApplicationStackFactory`](/docs/constructs/library/spark-cicd-pipeline#defining-a-cdk-stack-for-the-spark-application)
- [`PySparkApplicationPackage`](/docs/constructs/library/pyspark-application-package)

This is what we will build!

![Data lake storage](../../../static/img/spark-data-lake.png)

## Deployment guide

You can follow the [deployment guide](https://github.com/awslabs/aws-data-solutions-framework/tree/main/examples/spark-data-lake) from the AWS DSF GitHub repository to deploy the solution.

website/docusaurus.config.js

Lines changed: 2 additions & 1 deletion
```diff
@@ -12,7 +12,8 @@ const config = {
   title: niceProjectName,
   tagline: 'Accelerate building your data analytics solutions with AWS Data Solutions Framework',
   url: 'https://' + organization + '.github.io',
-  baseUrl: '/',
+  // baseUrl: '/', // uncomment for local dev
+  baseUrl: '/aws-data-solutions-framework/',
   trailingSlash: false,
   onBrokenLinks: 'throw',
   onBrokenMarkdownLinks: 'warn',
```
