Commit f8c3462

author
EC2 Default User
committed
Merge branch 'master' of github.com:paulu-aws/chembl-opentargets-data-lake-example
2 parents f0fb3f8 + 5c6cfcf commit f8c3462

File tree

1 file changed

+75
-21
lines changed

README.md

Lines changed: 75 additions & 21 deletions
@@ -1,49 +1,103 @@
1-
<h1 id='HPG9CA7YL2o'>Chembl and Open Targets in an AWS Data Lake</h1>
1+
<h1 id='HPG9CA7YL2o'>ChEMBL and Open Targets in an AWS Data Lake</h1>
22

33
Companion code for an upcoming AWS blog post on enrolling ChEMBL and OpenTargets into a data lake on AWS.<br/>
44

5+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/D5akZWKUWmfWEhA8u4loEA?a=U93UPcmkUsuoToxZr2QpWU5nosB1RwimIsIW5TtaJvEa' id='HPG9CASUn8g' alt='' width='800' height='380'></img></div><h2 id='HPG9CAlPR6i'>To install this in your own AWS account:</h2>
6+
7+
Your local machine needs the AWS CLI installed and IAM permissions configured (through an IAM role or a .aws/credentials file). I like to use Cloud9 as my IDE because it comes with both already set up.<br/>
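A quick way to confirm both prerequisites before deploying is sketched below. The function name `check_prereqs` is hypothetical and not part of this repo; it only relies on `aws sts get-caller-identity`, which succeeds when the CLI can find valid credentials.

```shell
#!/usr/bin/env bash
# Sketch: verify that the AWS CLI is installed and credentials resolve.
# ASSUMPTION: check_prereqs is an illustrative helper, not shipped with this repo.
check_prereqs() {
  # Is the aws executable on the PATH?
  if ! command -v aws >/dev/null 2>&1; then
    echo "aws CLI not found"
    return 1
  fi
  # get-caller-identity fails when no valid credentials are configured.
  if ! aws sts get-caller-identity >/dev/null 2>&1; then
    echo "aws CLI found, but no valid credentials"
    return 2
  fi
  echo "prerequisites OK"
}
```

Source the snippet and call `check_prereqs` before running the install and deploy scripts; a non-zero return code means one of the two prerequisites is missing.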
8+
9+
<br/>
10+
11+
Run the following commands:<br/>
12+
13+
<pre id='HPG9CAKfUT3'>git clone https://github.com/paulu-aws/chembl-opentargets-data-lake-example.git<br>cd chembl-opentargets-data-lake-example<br>./InstallCdkDependencies.sh<br>./DeployChemblOpenTargetsEnv.sh</pre>
14+
15+
Wait for ChEMBL and OpenTargets to be ‘staged’ into the baseline stack.<br/>
16+
17+
<br/>
18+
19+
The ‘baseline stack’ in the CDK application spins up a VPC with an S3 bucket (for OpenTargets) and an RDS Postgres instance (for ChEMBL). It also spins up a small helper EC2 instance that downloads those assets from <a href="http://OpenTargets.org">OpenTargets.org</a> and EMBL-EBI and stages them in their ‘raw’ form.<br/>
20+
21+
<br/>
22+
23+
Go to Systems Manager in the AWS console, and then the ‘Run Command’ section. You will see the currently running command documents. <br/>
24+
25+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/x4lfduQeC3Ww-DyK8loIAg?a=6aMBuWAgnWaZ5pQaJndaM06ob734VpmiCI5xfguyPaca' id='HPG9CA9WNsB' alt='' width='1276' height='612'></img></div><br/>
26+
27+
It takes about an hour for ChEMBL to build. If you get impatient and want to see the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.<br/>
28+
29+
<br/>
30+
31+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/Fj7sA3VuIuvdPOHl017Xcg?a=EYFlHaKY8weEGFezDR4ld3sEhBMWl88afFdDjJQ15H8a' id='HPG9CADqhgF' alt='' width='1242' height='666'></img></div><br/>
32+
33+
That will open an SSM session window. Run the following command to tail the log output.<br/>
34+
35+
<pre id='HPG9CAuziva'>tail -f /home/ssm-user/progressLog</pre>
36+
37+
<br/>
38+
39+
<div data-section-style='11' class='tall' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/rMcRhjzUcIGQVYeBFxup4Q?a=2NRscRrktD9kLK7rDqqD9bO3aXtTYttCeaEWLwDXVgIa' id='HPG9CAgo8Yy' alt='' width='1115' height='1030'></img></div><br/>
40+
41+
<h2 id='HPG9CAe1Pmp'>Enroll ChEMBL and OpenTargets into the data lake</h2>
42+
43+
Once the database has finished importing, go to Glue in the AWS console, and then the “Workflows” section.<br/>
44+
545
<br/>
646

7-
To install this in your own AWS account:<br/>
47+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/K0liqaLzOGNHdODU_fN_MA?a=GQQahtSxVQNvaU6AkEjATwCE0WJglr630LH3bZcngB0a' id='HPG9CADnepH' alt='' width='1177' height='631'></img></div><br/>
48+
49+
Select the openTargetsDataLakeEnrollment workflow, and click ‘Actions’, then ‘Run’.<br/>
850

951
<br/>
1052

11-
<div style="" data-section-style='6' class=""><ul id='HPG9CAdvuXu'><li id='HPG9CAsuX2r' class='' value='1'>Clone this repo
53+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/UV0-ZlwmK_KF9L9MfaUgfA?a=97k7vof4qlurzy3zSsmPVhomgCpRUJfREq8UCNZSzt4a' id='HPG9CAgkuAH' alt='' width='1177' height='631'></img></div><br/>
1254

13-
<br/></li></ul></div><pre id='HPG9CAKfUT3'>git clone <a href="https://github.com/paulu-aws/chembl-opentargets-data-lake-example.git">https://github.com/paulu-aws/chembl-opentargets-data-lake-example.git</a></pre>
55+
Do the same for the chemblDataLakeEnrollmentWorkflow. Wait for the workflows to finish.<br/>
1456

15-
<div style="" data-section-style='6' class="list-numbering-continue"><ul id='HPG9CAPCYqe'><li id='HPG9CASYAsv' class='' value='1'>Install the CDK dependencies
57+
<br/>
1658

17-
<br/></li></ul></div><pre id='HPG9CAwKcSI'>./InstallCdkDependencies.sh</pre>
59+
Both workflows run in parallel. The openTargetsDataLakeEnrollmentWorkflow takes about 170 minutes to complete, while the ChEMBL enrollment finishes in about 30 minutes.<br/>
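The console steps above can also be sketched with the AWS CLI. The helper `start_workflows` and the `DRY_RUN` switch are hypothetical conveniences, not part of this repo; the workflow names are passed in by the caller and should be confirmed first with `aws glue list-workflows`, since this README spells them both with and without the `Workflow` suffix.

```shell
#!/usr/bin/env bash
# Sketch: start the enrollment workflows from the CLI instead of the console.
# ASSUMPTION: workflow names come from the caller; confirm the exact names in
# your account with `aws glue list-workflows` before running for real.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

start_workflows() {
  local wf
  for wf in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "aws glue start-workflow-run --name $wf"
    else
      aws glue start-workflow-run --name "$wf"
    fi
  done
}

# Names as written in this README; adjust if list-workflows shows otherwise.
start_workflows openTargetsDataLakeEnrollmentWorkflow chemblDataLakeEnrollmentWorkflow
```

Set `DRY_RUN=0` to actually start the runs. Each call returns a run ID that `aws glue get-workflow-run` can use to check status.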
1860

19-
<div style="" data-section-style='6' class="list-numbering-continue"><ul id='HPG9CAHILzl'><li id='HPG9CAK6MvP' class='' value='1'>Deploy the CDK Stacks
61+
<h2 id='HPG9CAYpovV'>Query and Conquer!</h2>
2062

21-
<br/></li></ul></div><pre id='HPG9CAeahes'>./DeployChemblOpenTargetsEnv.sh</pre>
63+
Go to Athena in the AWS Console.<br/>
2264

23-
<div class="list-numbering-restart-at" data-section-style='6' style="--indent0: 4"><ul id='HPG9CAS1C4M'><li id='HPG9CA0az5T' class='parent' value='1'>Wait for Chembl and OpenTargets to be ‘staged’ into the baseline stack.
65+
<br/>
2466

25-
<br/></li><ul><li id='HPG9CAimAzR' class=''>The ‘baseline stack’ in the CDK application spins up a VPC with an S3 bucket (for OpenTargets) and an RDS Postgres instance (for Chembl). It also spins up a little helper EC2 instance that stages those assets in their ‘raw’ form<a href="http://OpenTargets.org"> OpenTargets.org</a> and EMBL-EBI into your account.
67+
If you haven’t used Athena in your account before, you will need to define a storage location for your query results. Click on the ‘Settings’ tab in the top right, specify a bucket name where you would like Athena results stored, and click save.<br/>
2668

27-
<br/></li><li id='HPG9CA9PXT3' class=''>Go to Systems Manager in the AWS console, and then the ‘Run Command’ section. You will see the currently running command documents. 
69+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/d9imQFzWnNdhWYDAo9Bt1A?a=8Q4UOXPqvG1fk3knDX9x2wr9Jeu9g8V2tPRYsnE3Vlga' id='HPG9CASHmiz' alt='' width='800' height='429'></img></div><br/>
2870

29-
<br/></li><li id='HPG9CA9WNsB' class=''><span data-section-style='11' style='max-width:168%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/x4lfduQeC3Ww-DyK8loIAg?a=6aMBuWAgnWaZ5pQaJndaM06ob734VpmiCI5xfguyPaca' id='HPG9CA9WNsB' alt='' width='1276' height='612'></img></span></li><li id='HPG9CAQFK7w' class=''>It takes about an hour for Chembl to build. If you get impatient and want to see the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.
71+
Now, click the ‘Databases’ dropdown:<br/>
3072

31-
<br/></li><li id='HPG9CADqhgF' class=''><span data-section-style='11' style='max-width:155%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/Fj7sA3VuIuvdPOHl017Xcg?a=EYFlHaKY8weEGFezDR4ld3sEhBMWl88afFdDjJQ15H8a' id='HPG9CADqhgF' alt='' width='1242' height='666'></img></span></li><li id='HPG9CAByHqU' class=''>That will open a SSM session window run the following command
73+
<br/>
3274

33-
<br/></li></ul></ul></div><pre id='HPG9CAuziva'> <code>tail -f progressLog</code></pre>
75+
You will see four databases listed, but you only want to use two of them:<br/>
3476

35-
<div style="" data-section-style='6' class=""><ul id='HPG9CAaYWQI'><li id='HPG9CAgo8Yy' class='' value='1'><span data-section-style='11' class='tall' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/rMcRhjzUcIGQVYeBFxup4Q?a=2NRscRrktD9kLK7rDqqD9bO3aXtTYttCeaEWLwDXVgIa' id='HPG9CAgo8Yy' alt='' width='1115' height='1030'></img></span></li></ul></div><br/>
77+
<br/>
3678

37-
<div style="" data-section-style='6' class="list-numbering-continue"><ul id='HPG9CA8jP0B'><li id='HPG9CA6hIcf' class='' value='1'>Once the database has finished importing, go to Glue in the AWS console, and then the “Workflows” section
79+
<u><i><b>Use:</b></i></u><br/>
3880

39-
<br/></li><li id='HPG9CADnepH' class=''><span data-section-style='11' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/K0liqaLzOGNHdODU_fN_MA?a=GQQahtSxVQNvaU6AkEjATwCE0WJglr630LH3bZcngB0a' id='HPG9CADnepH' alt='' width='1177' height='631'></img></span></li><li id='HPG9CApeYdR' class=''>Select the openTargetsDataLakeEnrollment workflow, and click ‘Actions’, then 'Run'
81+
<br/>
4082

41-
<br/></li><li id='HPG9CAgkuAH' class=''><span data-section-style='11' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/UV0-ZlwmK_KF9L9MfaUgfA?a=97k7vof4qlurzy3zSsmPVhomgCpRUJfREq8UCNZSzt4a' id='HPG9CAgkuAH' alt='' width='1177' height='631'></img></span></li><li id='HPG9CA1tRSd' class=''>Do the same for the chemblDataLakeEnrollmentWorkflow
83+
<b>chembl-25-dl </b>- This is the ‘dl’ or ‘data lake’ ChEMBL database. Always use tables in this database when running ChEMBL queries. Part of the chemblDataLakeEnrollment workflow converts the ‘source’ ChEMBL Postgres formats into a ‘data lake’ friendly Parquet format optimized for Athena. <br/>
4284

43-
<br/></li><li id='HPG9CA4pCXY' class=''>Wait for the workflows to finish.
85+
<br/>
4486

45-
<br/></li></ul></div><br/>
87+
<b>opentargets-1911-dl </b>- This is the ‘dl’ or ‘data lake’ OpenTargets database. Always use tables in this database when running OpenTargets queries. Part of the openTargetsDataLakeEnrollment workflow converts the ‘source’ OpenTargets JSON and CSV formats into a ‘data lake’ Parquet format optimized for Athena. <br/>
4688

47-
You can now query opentargets and chembl data through Athena!<br/>
89+
<br/>
90+
91+
<u><i><b>Don’t use:</b></i></u><br/>
92+
93+
<br/>
94+
95+
<b>chembl-25-src - </b>This represents the ‘src’ or ‘source’ ChEMBL Postgres database. By design, the source database is not directly queryable from Athena, so you will not use this database. <br/>
96+
97+
<br/>
98+
99+
<b>opentargets-1911-src - </b>This is the ‘src’ or ‘source’ OpenTargets database. When you query tables in this database, you are directly querying the original JSON and CSV files from OpenTargets. Performance may be slow because those formats are not optimized for querying with Athena. <br/>
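As a rough illustration of querying one of the ‘dl’ databases outside the console, the sketch below builds an `aws athena start-query-execution` call. The helper `athena_query`, the `DRY_RUN` switch, and the `RESULTS_BUCKET` placeholder are hypothetical; `molecule_dictionary` is a table in the public ChEMBL schema and is only assumed to exist under that name in chembl-25-dl.

```shell
#!/usr/bin/env bash
# Sketch: submit an Athena query against a data lake database from the CLI.
# ASSUMPTIONS: RESULTS_BUCKET is a placeholder for your own Athena results
# bucket; the table name in the example call is assumed, not confirmed.
# DRY_RUN=1 (the default here) only prints the shell-quoted command.
DRY_RUN="${DRY_RUN:-1}"
RESULTS_BUCKET="${RESULTS_BUCKET:-my-athena-results-bucket}"

athena_query() {
  local database="$1" sql="$2"
  local cmd=(aws athena start-query-execution
    --query-string "$sql"
    --query-execution-context "Database=$database"
    --result-configuration "OutputLocation=s3://$RESULTS_BUCKET/")
  if [ "$DRY_RUN" = "1" ]; then
    # %q prints each argument in a form that can be pasted back into bash.
    printf '%q ' "${cmd[@]}"
    echo
  else
    "${cmd[@]}"
  fi
}

athena_query chembl-25-dl 'SELECT COUNT(*) FROM molecule_dictionary'
```

With `DRY_RUN=0` the call returns a query execution ID; results land in the bucket given by `RESULTS_BUCKET` and can be fetched with `aws athena get-query-results`.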
100+
101+
<br/>
48102

49103
<br/>
