Commit 5c6cfcf

author
EC2 Default User
committed
More content to readme.
1 parent 92a53eb commit 5c6cfcf

1 file changed

+66
-10
lines changed

README.md

Lines changed: 66 additions & 10 deletions
@@ -1,8 +1,8 @@
1-
<h1 id='HPG9CA7YL2o'>Chembl and Open Targets in an AWS Data Lake</h1>
1+
<h1 id='HPG9CA7YL2o'>ChEMBL and Open Targets in an AWS Data Lake</h1>
22

33
Companion code for an upcoming AWS blog post on enrolling ChEMBL and Open Targets into a data lake on AWS.<br/>
44

5-
<h2 id='HPG9CAlPR6i'>To install this in your own AWS account:</h2>
5+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/D5akZWKUWmfWEhA8u4loEA?a=U93UPcmkUsuoToxZr2QpWU5nosB1RwimIsIW5TtaJvEa' id='HPG9CASUn8g' alt='' width='800' height='380'></img></div><h2 id='HPG9CAlPR6i'>To install this in your own AWS account:</h2>
66

77
Your local machine needs the AWS CLI installed, along with IAM permissions set up (through an IAM role or a .aws/credentials file). I like to use Cloud9 as my IDE because it comes with both already set up.<br/>
88

@@ -22,26 +22,82 @@ The ‘baseline stack’ in the CDK application spins up a VPC with an S3 bucket
2222

2323
Go to Systems Manager in the AWS console, and then the ‘Run Command’ section. You will see the currently running command documents. <br/>
2424

25-
<div data-section-style='11' style='max-width:168%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/x4lfduQeC3Ww-DyK8loIAg?a=6aMBuWAgnWaZ5pQaJndaM06ob734VpmiCI5xfguyPaca' id='HPG9CA9WNsB' alt='' width='1276' height='612'></img></div>It takes about an hour for Chembl to build. If you get impatient and want to see the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.<br/>
25+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/x4lfduQeC3Ww-DyK8loIAg?a=6aMBuWAgnWaZ5pQaJndaM06ob734VpmiCI5xfguyPaca' id='HPG9CA9WNsB' alt='' width='1276' height='612'></img></div><br/>
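If you prefer the command line, roughly the same view should be available from the AWS CLI. This is a minimal sketch and not part of this repository: the function name list_running_commands is made up for illustration, and it assumes your CLI credentials can call Systems Manager in the region where the stack was deployed.<br/>

```shell
# Hypothetical helper (not part of this repo): list Run Command invocations
# that are still in progress, one tab-separated row per command:
# <command id> <document name> <status>
list_running_commands() {
  aws ssm list-commands \
    --filters key=Status,value=InProgress \
    --query 'Commands[].[CommandId,DocumentName,Status]' \
    --output text
}
```

Calling list_running_commands with no arguments prints nothing once every command document has finished.<br/>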
2626

27-
<div data-section-style='11' style='max-width:155%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/Fj7sA3VuIuvdPOHl017Xcg?a=EYFlHaKY8weEGFezDR4ld3sEhBMWl88afFdDjJQ15H8a' id='HPG9CADqhgF' alt='' width='1242' height='666'></img></div>That will open a SSM session window. Run the following command to tail the log output.<br/>
27+
It takes about an hour for ChEMBL to build. If you get impatient and want to see the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.<br/>
2828

29-
<pre id='HPG9CAuziva'>tail -f /home/ssm-user/progressLog</pre>
29+
<br/>
30+
31+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/Fj7sA3VuIuvdPOHl017Xcg?a=EYFlHaKY8weEGFezDR4ld3sEhBMWl88afFdDjJQ15H8a' id='HPG9CADqhgF' alt='' width='1242' height='666'></img></div><br/>
32+
33+
That will open an SSM session window. Run the following command to tail the log output.<br/>
34+
35+
<pre id='HPG9CAuziva'>tail -f /home/ssm-user/progressLog</pre>
36+
37+
<br/>
3038

31-
<div data-section-style='11' class='tall' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/rMcRhjzUcIGQVYeBFxup4Q?a=2NRscRrktD9kLK7rDqqD9bO3aXtTYttCeaEWLwDXVgIa' id='HPG9CAgo8Yy' alt='' width='1115' height='1030'></img></div><br/>
39+
<div data-section-style='11' class='tall' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/rMcRhjzUcIGQVYeBFxup4Q?a=2NRscRrktD9kLK7rDqqD9bO3aXtTYttCeaEWLwDXVgIa' id='HPG9CAgo8Yy' alt='' width='1115' height='1030'></img></div><br/>
3240

3341
<h2 id='HPG9CAe1Pmp'>Enroll ChEMBL and Open Targets into the data lake</h2>
3442

3543
Once the database has finished importing, go to Glue in the AWS console, and then the “Workflows” section.<br/>
3644

37-
<div data-section-style='11' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/K0liqaLzOGNHdODU_fN_MA?a=GQQahtSxVQNvaU6AkEjATwCE0WJglr630LH3bZcngB0a' id='HPG9CADnepH' alt='' width='1177' height='631'></img></div>Select the openTargetsDataLakeEnrollment workflow, and click ‘Actions’, then 'Run'<br/>
45+
<br/>
46+
47+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/K0liqaLzOGNHdODU_fN_MA?a=GQQahtSxVQNvaU6AkEjATwCE0WJglr630LH3bZcngB0a' id='HPG9CADnepH' alt='' width='1177' height='631'></img></div><br/>
48+
49+
Select the openTargetsDataLakeEnrollment workflow, click ‘Actions’, then ‘Run’.<br/>
50+
51+
<br/>
52+
53+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/UV0-ZlwmK_KF9L9MfaUgfA?a=97k7vof4qlurzy3zSsmPVhomgCpRUJfREq8UCNZSzt4a' id='HPG9CAgkuAH' alt='' width='1177' height='631'></img></div><br/>
54+
55+
Do the same for the chemblDataLakeEnrollmentWorkflow. Wait for the workflows to finish.<br/>
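The two console steps above can also be sketched with the AWS CLI. This is an illustrative helper, not code from this repository: start_workflows is a hypothetical name, and the workflow names are passed as arguments because the exact names depend on what the CDK stack created in your account (check the Glue ‘Workflows’ page first).<br/>

```shell
# Hypothetical helper (not part of this repo): start each named Glue workflow
# and print "<workflow name> <run id>" on its own line.
start_workflows() {
  for wf in "$@"; do
    run_id=$(aws glue start-workflow-run --name "$wf" \
      --query RunId --output text) || return 1
    printf '%s %s\n' "$wf" "$run_id"
  done
}
```

For example, start_workflows openTargetsDataLakeEnrollment chemblDataLakeEnrollmentWorkflow would start both workflows, assuming those are the names shown in your console. Keep the printed run IDs if you want to check progress later.<br/>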
56+
57+
<br/>
58+
59+
Both workflows run in parallel, but the openTargetsDataLakeEnrollmentWorkflow takes about 170 minutes to complete, while the ChEMBL enrollment finishes in about 30 minutes. <br/>
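Since the runs take a while, it can help to poll for completion instead of refreshing the console. The sketch below is an assumption, not part of this repository: wait_for_workflow is a hypothetical name, and it expects the workflow name and a run ID returned by aws glue start-workflow-run. It treats any final status other than COMPLETED as a failure.<br/>

```shell
# Hypothetical helper (not part of this repo): poll a Glue workflow run until
# it leaves the RUNNING/STOPPING states.
# Usage: wait_for_workflow <workflow-name> <run-id> [poll-seconds]
# Returns 0 if the run ends as COMPLETED, 1 otherwise (STOPPED, ERROR, CLI failure).
wait_for_workflow() {
  wf=$1
  run_id=$2
  interval=${3:-60}
  while :; do
    status=$(aws glue get-workflow-run --name "$wf" --run-id "$run_id" \
      --query Run.Status --output text) || return 1
    echo "$wf: $status"
    case $status in
      RUNNING|STOPPING) sleep "$interval" ;;
      COMPLETED) return 0 ;;
      *) return 1 ;;
    esac
  done
}
```

A COMPLETED workflow run can still contain failed jobs, so it is worth glancing at the run details in the Glue console before querying.<br/>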
60+
61+
<h2 id='HPG9CAYpovV'>Query and Conquer!</h2>
62+
63+
Go to Athena in the AWS Console.<br/>
64+
65+
<br/>
66+
67+
If you haven't used Athena in your account before, you will need to define a storage location for your query results. Click on the ‘Settings’ tab in the top right, specify a bucket name where you would like Athena results stored, and click save.<br/>
3868

39-
<div data-section-style='11' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/UV0-ZlwmK_KF9L9MfaUgfA?a=97k7vof4qlurzy3zSsmPVhomgCpRUJfREq8UCNZSzt4a' id='HPG9CAgkuAH' alt='' width='1177' height='631'></img></div>Do the same for the chemblDataLakeEnrollmentWorkflow<br/>
69+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/d9imQFzWnNdhWYDAo9Bt1A?a=8Q4UOXPqvG1fk3knDX9x2wr9Jeu9g8V2tPRYsnE3Vlga' id='HPG9CASHmiz' alt='' width='800' height='429'></img></div><br/>
4070

41-
Wait for the workflows to finish.<br/>
71+
Now, click the ‘Databases’ dropdown:<br/>
4272

4373
<br/>
4474

45-
You can now query opentargets and chembl data through Athena!<br/>
75+
You will see four databases listed, but you only want to use two of them:<br/>
76+
77+
<br/>
78+
79+
<u><i><b>Use:</b></i></u><br/>
80+
81+
<br/>
82+
83+
<b>chembl-25-dl </b>- This is the ‘dl’ or ‘data lake’ ChEMBL database. Always use tables in this database when running ChEMBL queries. Part of the chemblDataLakeEnrollment workflow converts the ‘source’ ChEMBL Postgres formats into a ‘data lake’ friendly Parquet format optimized for Athena. <br/>
84+
85+
<br/>
86+
87+
<b>opentargets-1911-dl </b>- This is the ‘dl’ or ‘data lake’ OpenTargets database. Always use tables in this database when running OpenTargets queries. Part of the openTargetsDataLakeEnrollment workflow converts the ‘source’ OpenTargets json and csv formats into a ‘data lake’ friendly Parquet format optimized for Athena. <br/>
88+
89+
<br/>
90+
91+
<u><i><b>Don't use:</b></i></u><br/>
92+
93+
<br/>
94+
95+
<b>chembl-25-src - </b>This represents the ‘src’ or ‘source’ ChEMBL Postgres database. By design, the source database is not directly queryable from Athena, so you will not use this database. <br/>
96+
97+
<br/>
98+
99+
<b>opentargets-1911-src - </b>This is the ‘src’ or ‘source’ OpenTargets database. When you query tables in this database, you are directly querying the original json and csv files from OpenTargets. Performance may be slow because those formats are not optimized for querying with Athena. <br/>
100+
101+
<br/>
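To see which tables landed in one of the ‘dl’ databases without opening the console, a query can be submitted from the AWS CLI. This is a minimal sketch under stated assumptions, not part of this repository: list_tables is a hypothetical name, and the S3 output location is whatever bucket you chose for Athena results. The database is passed through the query execution context, so its hyphenated name needs no quoting in the SQL.<br/>

```shell
# Hypothetical helper (not part of this repo): submit "SHOW TABLES" against
# one Athena database and print the resulting query execution ID.
# Usage: list_tables <database> <s3-output-location>
list_tables() {
  aws athena start-query-execution \
    --query-string "SHOW TABLES" \
    --query-execution-context "Database=$1" \
    --result-configuration "OutputLocation=$2" \
    --query QueryExecutionId --output text
}
```

For example, list_tables chembl-25-dl s3://your-results-bucket/athena/ prints a query execution ID (the bucket name is a placeholder). Once the query has finished, aws athena get-query-results --query-execution-id with that ID returns the table names.<br/>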
46102

47103
<br/>
