Commit 541460c

Adding initial work to internal repo
Merge branch 'master' of https://github.com/paulu-aws/chembl-opentargets-data-lake-example into mainline
2 parents 326dbcb + 0554c70 commit 541460c

26 files changed: 2022 additions, 0 deletions

.gitignore

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
*.js
!jest.config.js
*.d.ts
node_modules

# CDK asset staging directory
.cdk.staging
cdk.out

.npmignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
*.ts
!*.d.ts

# CDK asset staging directory
.cdk.staging
cdk.out

DeployChemblOpenTargetsEnv.sh

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
npm run build
cdk bootstrap
cdk deploy BaselineStack --require-approval never
cdk deploy CoreDataLake --require-approval never
cdk deploy ChemblStack --require-approval never
cdk deploy OpenTargetsStack --require-approval never
cdk deploy AnalyticsStack --require-approval never

InstallCdkDependencies.sh

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
npm install -g aws-cdk
npm install "@types/node" --save-dev
npm update
sudo yum install jq -y
npm install

README.md

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
<h1 id='HPG9CA7YL2o'>ChEMBL and Open Targets in an AWS Data Lake</h1>

Companion code for an upcoming AWS blog post on enrolling ChEMBL and Open Targets into a data lake on AWS.<br/>

<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/D5akZWKUWmfWEhA8u4loEA?a=U93UPcmkUsuoToxZr2QpWU5nosB1RwimIsIW5TtaJvEa' id='HPG9CASUn8g' alt='' width='800' height='380'></img></div><h2 id='HPG9CAlPR6i'>To install this in your own AWS account:</h2>

Your local machine needs the AWS CLI installed, along with IAM permissions set up (through an IAM role or a .aws/credentials file). I like to use Cloud9 as my IDE, as it comes with both of those already set up.<br/>

<br/>

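Before running the commands, it can help to confirm that the CLI resolves the credentials and region you expect. A minimal sketch, assuming the AWS CLI is on your PATH; `check_aws_setup` is an illustrative helper name and not part of this repo:<br/>

```shell
# Print the IAM identity and the configured default region the AWS CLI will use.
# check_aws_setup is a hypothetical helper name, not part of this repo.
check_aws_setup() {
  # Fails if no usable credentials are found.
  aws sts get-caller-identity --query 'Arn' --output text || return 1
  # Prints the region from the CLI configuration, if one is set there.
  aws configure get region
}
```

Calling `check_aws_setup` prints the caller ARN followed by the configured region. A non-zero exit status suggests that credentials or a configured region are missing, although the region may also be supplied through an environment variable instead.<br/>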
Run the following commands:<br/>

<pre id='HPG9CAKfUT3'>git clone https://github.com/paulu-aws/chembl-opentargets-data-lake-example.git<br>cd chembl-opentargets-data-lake-example<br>./InstallCdkDependencies.sh<br>./DeployChemblOpenTargetsEnv.sh</pre>

Wait for Chembl and OpenTargets to be ‘staged’ into the baseline stack.<br/>

<br/>

The ‘baseline stack’ in the CDK application spins up a VPC with an S3 bucket (for OpenTargets) and an RDS Postgres instance (for Chembl). It also spins up a small helper EC2 instance that downloads those assets from <a href="http://OpenTargets.org">OpenTargets.org</a> and EMBL-EBI and stages them in their ‘raw’ form.<br/>

<br/>

Go to Systems Manager in the AWS console, and then to the ‘Run Command’ section. You will see the currently running command documents.<br/>

<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/x4lfduQeC3Ww-DyK8loIAg?a=6aMBuWAgnWaZ5pQaJndaM06ob734VpmiCI5xfguyPaca' id='HPG9CA9WNsB' alt='' width='1276' height='612'></img></div><br/>
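The same Run Command view is available from a terminal. A minimal sketch, assuming the AWS CLI is configured for the account and region the stacks were deployed to; `list_running_commands` is an illustrative name and not part of this repo:<br/>

```shell
# List Systems Manager Run Command invocations that are still in progress.
# list_running_commands is a hypothetical helper name, not part of this repo.
list_running_commands() {
  aws ssm list-commands \
    --filters key=Status,value=InProgress \
    --query 'Commands[].[CommandId,DocumentName,Status]' \
    --output table
}
```

Once `list_running_commands` stops listing the staging commands, they are no longer running. Changing the filter value to `Failed` or `Success` shows how they ended.<br/>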

It takes about an hour for Chembl to build. If you want to watch the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.<br/>

<br/>

<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/Fj7sA3VuIuvdPOHl017Xcg?a=EYFlHaKY8weEGFezDR4ld3sEhBMWl88afFdDjJQ15H8a' id='HPG9CADqhgF' alt='' width='1242' height='666'></img></div><br/>

That will open an SSM session window. Run the following command to tail the log output.<br/>

<pre id='HPG9CAuziva'>tail -f /home/ssm-user/progressLog</pre>

<br/>

<div data-section-style='11' class='tall' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/rMcRhjzUcIGQVYeBFxup4Q?a=2NRscRrktD9kLK7rDqqD9bO3aXtTYttCeaEWLwDXVgIa' id='HPG9CAgo8Yy' alt='' width='1115' height='1030'></img></div><br/>

<h2 id='HPG9CAe1Pmp'>Enroll Chembl and OpenTargets into the data lake</h2>

Once the database has finished importing, go to Glue in the AWS console, and then to the ‘Workflows’ section.<br/>

<br/>

<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/K0liqaLzOGNHdODU_fN_MA?a=GQQahtSxVQNvaU6AkEjATwCE0WJglr630LH3bZcngB0a' id='HPG9CADnepH' alt='' width='1177' height='631'></img></div><br/>

Select the openTargetsDataLakeEnrollment workflow, click ‘Actions’, then ‘Run’.<br/>

<br/>

<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/UV0-ZlwmK_KF9L9MfaUgfA?a=97k7vof4qlurzy3zSsmPVhomgCpRUJfREq8UCNZSzt4a' id='HPG9CAgkuAH' alt='' width='1177' height='631'></img></div><br/>

Do the same for the chemblDataLakeEnrollmentWorkflow. Wait for the workflows to finish.<br/>

<br/>

The two workflows run in parallel. The openTargetsDataLakeEnrollmentWorkflow takes about 170 minutes to complete, while the Chembl enrollment finishes in about 30 minutes.<br/>
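
The console steps for starting a workflow and waiting on it can also be scripted. A minimal sketch, assuming the AWS CLI is configured for the deployment account; `run_workflow` and its 60-second default polling interval are illustrative choices, not part of this repo:<br/>

```shell
# Start a Glue workflow, then poll until it is no longer RUNNING.
# run_workflow is a hypothetical helper; $1 is the workflow name,
# $2 is an optional polling interval in seconds (default 60).
run_workflow() {
  local name="$1" interval="${2:-60}" run_id status
  run_id="$(aws glue start-workflow-run --name "$name" \
    --query 'RunId' --output text)" || return 1
  while :; do
    status="$(aws glue get-workflow-run --name "$name" --run-id "$run_id" \
      --query 'Run.Status' --output text)" || return 1
    echo "$name: $status"
    [ "$status" = "RUNNING" ] || break
    sleep "$interval"
  done
  # Succeed only if the run finished normally.
  [ "$status" = "COMPLETED" ]
}
```

For example, `run_workflow openTargetsDataLakeEnrollmentWorkflow & run_workflow chemblDataLakeEnrollmentWorkflow & wait` would start both and wait for them. The workflow names here are copied from the text above; check them against the names shown in the Glue console before running.<br/>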

<h2 id='HPG9CAYpovV'>Query and Conquer!</h2>

Go to Athena in the AWS Console.<br/>

<br/>

If you haven't used Athena in your account before, you will need to define a storage location for your query results. Click the ‘Settings’ tab in the top right, specify the bucket where you would like Athena results stored, and click save.<br/>

<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/d9imQFzWnNdhWYDAo9Bt1A?a=8Q4UOXPqvG1fk3knDX9x2wr9Jeu9g8V2tPRYsnE3Vlga' id='HPG9CASHmiz' alt='' width='800' height='429'></img></div><br/>

Now, click the ‘Databases’ dropdown:<br/>

<br/>

You will see four databases listed; you only want to use two of them:<br/>

<br/>

<u><i><b>Use:</b></i></u><br/>

<br/>

<b>chembl-25-dl </b>- This is the ‘dl’ or ‘data lake’ Chembl database. Always use tables in this database when running Chembl queries. Part of the chemblDataLakeEnrollment workflow converts the ‘source’ Chembl Postgres formats into a ‘data lake’ friendly Parquet format optimized for Athena. <br/>

<br/>

<b>opentargets-1911-dl </b>- This is the ‘dl’ or ‘data lake’ OpenTargets database. Always use tables in this database when running OpenTargets queries. Part of the openTargetsDataLakeEnrollment workflow converts the ‘source’ OpenTargets JSON and CSV formats into a ‘data lake’ Parquet format optimized for Athena. <br/>

<br/>

<u><i><b>Don't use:</b></i></u><br/>

<br/>

<b>chembl-25-src - </b>This represents the ‘src’ or ‘source’ Chembl Postgres database. By design, the source database is not directly queryable from Athena, so you will not use this database. <br/>

<br/>

<b>opentargets-1911-src - </b>This is the ‘src’ or ‘source’ OpenTargets database. When you query its tables, you are directly querying the original JSON and CSV files from OpenTargets. Performance may be slow, as those formats are not optimized for querying with Athena. <br/>

<br/>

<br/>
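
The ‘dl’ databases can also be queried from a terminal. A minimal sketch, assuming the AWS CLI is configured and a results bucket already exists; `athena_query` is an illustrative helper and `s3://my-athena-results/` in the usage note is a placeholder, neither of which is part of this repo:<br/>

```shell
# Run one SQL statement in Athena and print the result values.
# athena_query is a hypothetical helper:
#   $1 = database, $2 = SQL text, $3 = S3 location for query results.
athena_query() {
  local database="$1" sql="$2" output="$3" qid state
  qid="$(aws athena start-query-execution \
    --query-string "$sql" \
    --query-execution-context "Database=$database" \
    --result-configuration "OutputLocation=$output" \
    --query 'QueryExecutionId' --output text)" || return 1
  # Poll until the query leaves the QUEUED/RUNNING states.
  while :; do
    state="$(aws athena get-query-execution --query-execution-id "$qid" \
      --query 'QueryExecution.Status.State' --output text)" || return 1
    case "$state" in
      QUEUED|RUNNING) sleep 2 ;;
      *) break ;;
    esac
  done
  if [ "$state" != "SUCCEEDED" ]; then
    echo "query $qid ended in state $state" >&2
    return 1
  fi
  aws athena get-query-results --query-execution-id "$qid" \
    --query 'ResultSet.Rows[].Data[].VarCharValue' --output text
}
```

For example, `athena_query chembl-25-dl 'SHOW TABLES' s3://my-athena-results/` would list the tables in the Chembl data lake database, which is a reasonable first query since the table names depend on what the enrollment workflow produced.<br/>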

bin/aws.ts

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
#!/usr/bin/env node
import 'source-map-support/register';
import * as cdk from '@aws-cdk/core';
import { BaselineStack } from '../lib/baseline-stack';
import { DatalakeStack } from '../lib/datalake-stack';
import { OpenTargetsStack } from '../lib/opentargets-stack';
import { ChemblStack } from '../lib/chembl-25-stack';
import { AnalyticsStack } from '../lib/analytics-stack';
import s3 = require('@aws-cdk/aws-s3');

// Baseline resources: VPC, OpenTargets source bucket, Chembl Postgres database.
const app = new cdk.App();
const baseline = new BaselineStack(app, 'BaselineStack');

// Shared data lake bucket that both datasets are enrolled into.
const coreDataLake = new DatalakeStack(app, 'CoreDataLake', {

});



const chemblStack = new ChemblStack(app, 'ChemblStack', {
  database: baseline.ChemblDb,
  accessSecurityGroup: baseline.chemblDBChemblDbAccessSg,
  databaseSecret: baseline.chemblDBSecret,
  dataLakeBucket: coreDataLake.DataLakeBucket
});

const openTargetsStack = new OpenTargetsStack(app, 'OpenTargetsStack', {
  sourceBucket: baseline.OpenTargetsSourceBucket,
  sourceBucketDataPrefix: '/opentargets/sourceExports/19.11/output/',
  dataLakeBucket: coreDataLake.DataLakeBucket
});

const analyticsStack = new AnalyticsStack(app, 'AnalyticsStack', {
  targetVpc: baseline.Vpc,
});

// Let the SageMaker notebook role read both enrolled datasets.
chemblStack.grantRead(analyticsStack.NotebookRole);
openTargetsStack.grantRead(analyticsStack.NotebookRole);

cdk.json

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{
  "app": "npx ts-node bin/aws.ts"
}

jest.config.js

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
module.exports = {
  "roots": [
    "<rootDir>/test"
  ],
  testMatch: [ '**/*.test.ts'],
  "transform": {
    "^.+\\.tsx?$": "ts-jest"
  },
}

lib/analytics-stack.ts

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
import * as cdk from '@aws-cdk/core';
import ec2 = require('@aws-cdk/aws-ec2');
import iam = require('@aws-cdk/aws-iam');
import rds = require('@aws-cdk/aws-rds');
import glue = require('@aws-cdk/aws-glue');
import s3 = require('@aws-cdk/aws-s3');
import s3assets = require('@aws-cdk/aws-s3-assets');
import sagemaker = require('@aws-cdk/aws-sagemaker');
import { DataSetStack, DataSetStackProps} from './dataset-stack';

export interface AnalyticsStackProps extends cdk.StackProps{
  targetVpc: ec2.Vpc
}


export class AnalyticsStack extends cdk.Stack {

  public readonly NotebookRole: iam.Role;

  constructor(scope: cdk.Construct, id: string, props: AnalyticsStackProps) {
    super(scope, id, props);


    const notebookSg = new ec2.SecurityGroup(this, 'notebookSg',{
      vpc: props.targetVpc
    });

    const athenaStagingDirectory = new s3.Bucket(this, 'athenaStagingDir', {});

    const lifecycleCode = [
      {"content": cdk.Fn.base64(`
        wget -O /home/ec2-user/SageMaker/opentargets.chembl.example.ipynb https://raw.githubusercontent.com/paulu-aws/chembl-opentargets-data-lake-example/master/scripts/sagemaker.opentargets.chembl.example.ipynb
        sudo chown ec2-user /home/ec2-user/SageMaker/opentargets.chembl.example.ipynb
        sed -i 's/XXXXAthenaStagingDirectoryXXXX/${athenaStagingDirectory.bucketName}/g' /home/ec2-user/SageMaker/opentargets.chembl.example.ipynb
        sed -i 's/XXXXAthenaRegionXXXX/${cdk.Stack.of(this).region}/g' /home/ec2-user/SageMaker/opentargets.chembl.example.ipynb
      `) }
    ];
    const sageMakerInstanceLifecyclePolicy = new sagemaker.CfnNotebookInstanceLifecycleConfig(this, 'notebookLifecyclePolicy', {
      notebookInstanceLifecycleConfigName: "Boostrap-Chembl-OpenTargets-Demo-Notebook",
      onStart: lifecycleCode

    });

    const notebookPolicy = {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:PutMetricData",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "logs:CreateLogGroup",
            "logs:DescribeLogStreams",
          ],
          "Resource": "*"
        }
      ]
    };

    const notebookPolicyDoc = iam.PolicyDocument.fromJson(notebookPolicy);

    this.NotebookRole = new iam.Role(this, 'notebookInstanceRole', {
      roleName: "chemblOpenTargetsNotebookRole",
      assumedBy: new iam.ServicePrincipal('sagemaker.amazonaws.com'),
      inlinePolicies: {
        "notebookPermissions": notebookPolicyDoc
      }
    });

    athenaStagingDirectory.grantReadWrite(this.NotebookRole);



    new sagemaker.CfnNotebookInstance(this, 'analyticsNotebook', {
      instanceType : 'ml.t2.medium',
      volumeSizeInGb: 100,
      securityGroupIds: [notebookSg.securityGroupId],
      subnetId: props.targetVpc.selectSubnets({subnetType: ec2.SubnetType.PRIVATE}).subnetIds[0],
      notebookInstanceName: "Chembl-OpenTargets-Demo-Notebook",
      roleArn: this.NotebookRole.roleArn,
      directInternetAccess: 'Disabled',
      lifecycleConfigName: sageMakerInstanceLifecyclePolicy.notebookInstanceLifecycleConfigName
    });


  }
}