Commit c3454de (parent 2a35157): Added documentation

4 files changed: +91 −65 lines
documentation/userguide/docs/pipelines.md

Lines changed: 91 additions & 65 deletions
…mechanisms. Different business units might have their own data lake, the diversi…
Scikit Learn, Spark, SparkML, SageMaker, Athena… and consequently, the diversity of tools and use cases results in a
wide variety of CI/CD standards, which makes collaboration difficult.

In order to distribute data ingestion and processing, data.all introduces data.all pipelines:

- data.all takes care of the CI/CD infrastructure
- data.all integrates with <a href="https://awslabs.github.io/aws-ddk/">AWS DDK</a>

## Creating a pipeline

data.all pipelines are created from the UI, under Pipelines. Similar to datasets, in the creation form of
the pipeline we have to specify:

- Name, Description and tags
- Environment and Team
- Development strategy: GitFlow or Trunk-based
- Development stages: dev, test, prod, qa, ... At least the "prod" stage is required.
- Template: corresponds to the --template parameter that can be passed to the DDK init command. See the <a href="https://awslabs.github.io/aws-ddk/release/latest/api/cli/aws_ddk.html#ddk-init">docs</a> for more details.
- Input/output datasets: from the environment and team selected, we can choose whether this pipeline has an input and/or output dataset.

![create_pipeline](pictures/pipelines/pip_create_form.png#zoom#shadow)

When a pipeline is created, a CICD CloudFormation stack is deployed in the environment AWS account.
It contains a CodePipeline pipeline (or several, for the GitFlow development strategy) that reads from an AWS CodeCommit repository.

In the first run of the CodePipeline pipeline, a DDK application is initialized in the pipeline repository. This DDK app is then deployed in subsequent runs.
If you want to change the deploy commands of the AWS CodeBuild deploy stage, note that the buildspec of the CodeBuild step is part of the CodeCommit repository.
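As an illustration of where those commands live, a deploy buildspec generally has the shape sketched below. This fragment is hypothetical (the phase contents and file layout in your repository may differ); check the actual buildspec file in your CodeCommit repository before editing it.

```yaml
# Hypothetical buildspec sketch -- not the verbatim file data.all generates.
version: 0.2
phases:
  install:
    commands:
      - pip install -r requirements.txt   # install DDK/CDK dependencies
  build:
    commands:
      - ddk deploy                        # edit this phase to change the deploy commands
```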

!!!abstract "GitFlow and branches"
    If you selected GitFlow as the development strategy, you probably noticed that the CodePipelines for the non-prod stages fail in their first run because they cannot find their source branch.
    After the first successful run of the prod CodePipeline pipeline, simply create branches in the CodeCommit repository for the other stages and you are ready to go.
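The branch creation can be sketched as follows. The non-prod stage names (dev, test) are assumptions; the demo below runs in a throwaway local repository, while in practice you would run the loop inside your cloned CodeCommit repository and uncomment the push.

```shell
# Create one branch per non-prod stage so that each stage's CodePipeline
# finds its source. Stage names (dev, test) are assumptions.
demo="$(mktemp -d)"
cd "$demo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first prod run"

for stage in dev test; do
  git branch "$stage"
  # git push origin "$stage"   # uncomment inside the real CodeCommit repository
done
git branch --list
```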

## Working with pipelines

### Cloning the repository
1. Install git: `sudo yum install git`
1. Install pip: `sudo yum -y install python-pip`
1. Install git-remote-codecommit: `sudo pip install git-remote-codecommit`
1. Set up credentials and clone your pipeline repository. Copy the credentials from the AWS Credentials button in the pipeline overview tab.

![created_pipeline](pictures/pipelines/pip_overview.png#zoom#shadow)

### Environment variables
From the repository we can access the following environment variables:

![created_pipeline](pictures/pipelines/env_vars.png#zoom#shadow)

!!!abstract "No more hardcoding parameters"
    Use these environment variables in your code and avoid hardcoding IAM roles and S3 bucket names. Use the ENVTEAM IAM role
    to access your team's datasets. With the input/output variables you no longer need to look up the names of Glue databases and S3 buckets.
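For instance, a job script can reference the injected variables instead of hardcoded names. The variable names below are made up for illustration; use the exact names shown in the environment-variables view above.

```shell
# Hypothetical variable names -- check the environment-variables view for
# the real ones exposed to your pipeline (defaults here are for the demo).
INPUT_S3_BUCKET="${INPUT_S3_BUCKET:-example-input-bucket}"
OUTPUT_GLUE_DB="${OUTPUT_GLUE_DB:-example_output_db}"

# Reference the variables instead of hardcoding bucket or database names:
echo "Reading raw data from s3://${INPUT_S3_BUCKET}/raw/"
echo "Writing curated tables to Glue database ${OUTPUT_GLUE_DB}"
```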

## Deploying to multiple AWS accounts/Environments
By default, the DDK application is deployed in the same account as the CICD resources. The data pipelines that we build with DDK
constructs are deployed in that same environment account even when we define multiple development stages.

Your enterprise might instead use one AWS account for CICD resources and one AWS account per development stage to host the data pipelines.
For this scenario, in which you want to deploy the DDK application to different AWS accounts, this is our proposed approach:

### Setting up the environments

For example, the Data Science team has 3 AWS accounts: DS-DEV, DS-TEST and DS-PROD. In data.all we create 3 environments, each linked to one of these accounts: DS-DEV-Environment, DS-TEST-Environment and DS-PROD-Environment.
We also link the CICD account to data.all by creating the CICD-Environment.

The DS-DEV, DS-TEST and DS-PROD accounts need to be bootstrapped with the following command, assuming 111111111111 is the CICD account. The -e parameter needs to be set according to the stage of the account.

`ddk bootstrap -e dev -a 111111111111`
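Spelled out for all three stages of this example, the bootstrap invocations look like the sketch below; each printed command must then be run with credentials for the matching DS-* account.

```shell
# 111111111111 is the CICD account from the example above. The -e flag
# selects the stage; run each printed command with credentials for the
# matching account (DS-DEV, DS-TEST, DS-PROD).
CICD_ACCOUNT="111111111111"
for stage in dev test prod; do
  echo "ddk bootstrap -e ${stage} -a ${CICD_ACCOUNT}"
done
```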

### Create pipeline

We create the pipeline in the CICD-Environment, so the CICD stack will be deployed to the CICD account. Create the pipeline selecting trunk-based development with the prod stage only.

### Customize the ddk.json configuration file

We customize the ddk.json file in the CodeCommit repository. More info <a href="https://awslabs.github.io/aws-ddk/release/stable/how-to/multi-account-deployment.html">here</a>.

```json
{
    "environments": {
        "cicd": {
            "account": "111111111111",
            "region": "us-west-2"
        },
        "dev": {
            "account": "222222222222",
            "region": "us-west-2",
            "resources": {
                "ddk-bucket": {"versioned": false, "removal_policy": "destroy"}
            }
        },
        "test": {
            "account": "333333333333",
            "region": "us-west-2",
            "resources": {
                "ddk-bucket": {"versioned": true, "removal_policy": "retain"}
            }
        }
    }
}
```

With this configuration, the CodePipeline pipeline self-mutates and adds steps that deploy to the other accounts. Note that the DDK multi-account strategy is trunk-based.
