# Spark Data Lake example

In this example, we build a Data Lake and process aggregations from the NY taxi dataset with a Spark application. This `README` is a step-by-step deployment guide. You can read more about this example solution in the [documentation](https://awslabs.github.io/aws-data-solutions-framework/).

We use a self-contained application where developers can manage both the business code (Spark code in the `./spark` folder) and the infrastructure code (AWS CDK code in the `./infra` folder).

The business code is a simple **PySpark** application packaged as a standard Python project following the [packaging best practices](https://packaging.python.org/en/latest/tutorials/packaging-projects/):
* A `pyproject.toml` file used to install internal dependencies (packages defined in the code structure) and external dependencies (libraries from PyPI).
* An `src` folder containing the business code organized in Python packages (`__init__.py` files).
* A `test` folder containing the unit tests, run via the `pytest .` command from the root folder of the Spark project. You can use the [EMR VS Code toolkit](https://marketplace.visualstudio.com/items?itemName=AmazonEMR.emr-tools) to test the application locally on an EMR local runtime.
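To illustrate the layout above, here is a minimal sketch of what a unit test in the `test` folder could look like. The function and test names are hypothetical, not taken from this project; the real tests exercise PySpark transformations, but the `pytest` pattern is the same:

```python
# Hypothetical example: a pure business function as it might live under src/,
# and the pytest-style tests that `pytest .` would discover in the test folder.

def trip_revenue(fare: float, tip: float, tolls: float) -> float:
    """Toy stand-in for taxi fare math; returns the total rounded to cents."""
    return round(fare + tip + tolls, 2)


def test_trip_revenue_sums_components():
    assert trip_revenue(10.0, 2.5, 1.25) == 13.75


def test_trip_revenue_rounds_to_cents():
    assert trip_revenue(10.111, 0.0, 0.0) == 10.11
```

`pytest` discovers any `test_*` function in `test_*.py` files, so running `pytest .` from the Spark project root executes these automatically.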

The infrastructure code is an AWS CDK application that uses the AWS DSF library to create the required resources. It contains two CDK stacks:
* An **application stack** which provisions the Data Lake, the data catalog, and the Spark runtime resources via the following constructs:
  * A `DataLakeStorage`
  * A `DataCatalogDatabase`
  * A `SparkEmrServerlessRuntime`
  * A `SparkEmrServerlessJob`
* A **CICD stack** which provisions a CICD pipeline to manage the application development lifecycle via the following constructs:
  * A `SparkEmrCICDPipeline`
  * An `ApplicationStackFactory`
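The `ApplicationStackFactory` exists so the CICD pipeline can instantiate the same application stack once per stage (staging, then production) with stage-specific settings. Here is a stdlib-only sketch of that factory idea; the class and method names are illustrative, not the DSF API:

```python
# Stdlib-only illustration of the factory pattern behind ApplicationStackFactory.
# All names are hypothetical; the real DSF construct produces CDK stacks.
from dataclasses import dataclass


@dataclass
class ApplicationStack:
    """Stand-in for the CDK application stack (Data Lake + Spark runtime)."""
    stage: str
    removal_policy: str


class StackFactory:
    """Creates one application stack per pipeline stage."""

    def create_stack(self, stage: str) -> ApplicationStack:
        # Illustrative policy choice: non-prod stages may destroy data on
        # teardown, while production retains it.
        policy = "DESTROY" if stage == "staging" else "RETAIN"
        return ApplicationStack(stage=stage, removal_policy=policy)


factory = StackFactory()
staging = factory.create_stack("staging")
prod = factory.create_stack("prod")
```

The pipeline calls the factory once per stage, which is why the application stack is written as reusable code rather than instantiated directly at the app root.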

## Prerequisites

1. [Install the AWS CDK CLI](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install)
2. [Bootstrap the CICD account](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_bootstrap)
3. [Bootstrap the staging and production accounts](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.pipelines-readme.html#cdk-environment-bootstrapping) with a trust relationship from the CICD account:

```bash
cdk bootstrap \
--profile staging \
--trust <CICD_ACCOUNT_ID> \
--cloudformation-execution-policies "POLICY_ARN" \
aws://<STAGING_ACCOUNT_ID>/<REGION>
```
4. [Install the git-remote-codecommit](https://docs.aws.amazon.com/codecommit/latest/userguide/setting-up-git-remote-codecommit.html#setting-up-git-remote-codecommit-install) utility to interact with AWS CodeCommit.

## Getting started

1. Copy the `spark-data-lake` folder somewhere else on your machine and initialize a new Git repository:

```bash
cp -R ../spark-data-lake <MY_LOCAL_PATH>
cd <MY_LOCAL_PATH>
git init
```

2. Modify `./infra/requirements.txt` to add the `aws_dsf` library as a dependency:

```
aws-cdk-lib==2.94.0
constructs>=10.2.55, <11.0.0
aws_dsf==1.0.0-rc1
```

3. From the `./infra` folder, create a Python 3 virtual environment and activate it:

```bash
cd infra
python3 -m venv .venv
source .venv/bin/activate
```

4. Install the dependencies, including the AWS DSF library:

```bash
pip install -r requirements.txt
```

5. Provide the target account and region information for the staging and production stages of the CICD pipeline. Also configure the global removal policy if you want to delete all the resources, including the data, when deleting the example. Create a `cdk.context.json` file with the following content:

```json
{
  "staging": {
    "accountId": "<STAGING_ACCOUNT_ID>",
    "region": "<STAGING_REGION>"
  },
  "prod": {
    "accountId": "<PRODUCTION_ACCOUNT_ID>",
    "region": "<PRODUCTION_REGION>"
  },
  "@aws-data-solutions-framework/removeDataOnDestroy": true
}
```
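Inside the CDK app, these context values can then be read (for example via `app.node.try_get_context("staging")`) to target each stage's account and region. A stdlib-only sketch of that lookup, with the JSON inlined and dummy account IDs for illustration:

```python
import json

# Inline copy of the cdk.context.json shape, with made-up values for
# illustration; a CDK app would read the same keys via node.try_get_context.
context = json.loads("""
{
  "staging": {"accountId": "111111111111", "region": "eu-west-1"},
  "prod": {"accountId": "222222222222", "region": "eu-west-1"}
}
""")


def target_env(stage: str) -> str:
    """Build the aws://<account>/<region> environment string CDK uses."""
    cfg = context[stage]
    return f"aws://{cfg['accountId']}/{cfg['region']}"


print(target_env("staging"))  # aws://111111111111/eu-west-1
```

This mirrors the `aws://<ACCOUNT>/<REGION>` format already used in the `cdk bootstrap` command above.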

6. Deploy the CICD pipeline stack:

```bash
cdk deploy CICDPipelineStack
```

7. Add the CICD pipeline Git repository as a remote. The exact command is provided as an output of the `CICDPipelineStack`. Then push the code to the repository:

```bash
git remote add demo codecommit::<REGION>://SparkTest
git push demo
```