README.md
Lines changed: 66 additions & 10 deletions
@@ -1,8 +1,8 @@
1
-
<h1 id='HPG9CA7YL2o'>Chembl and Open Targets in an AWS Data Lake</h1>
1
+
<h1 id='HPG9CA7YL2o'>ChEMBL and Open Targets in an AWS Data Lake</h1>
2
2
3
3
Companion code for an upcoming AWS blog post on enrolling ChEMBL and Open Targets into a data lake on AWS.<br/>
4
4
5
-
<h2 id='HPG9CAlPR6i'>To install this in your own AWS account:</h2>
5
+
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/D5akZWKUWmfWEhA8u4loEA?a=U93UPcmkUsuoToxZr2QpWU5nosB1RwimIsIW5TtaJvEa' id='HPG9CASUn8g' alt='' width='800' height='380'></img></div><h2 id='HPG9CAlPR6i'>To install this in your own AWS account:</h2>
6
6
7
7
Your local machine needs the AWS CLI installed and IAM permissions set up (through an IAM role or a .aws/credentials file). I like to use Cloud9 as my IDE because it comes with both already set up.<br/>
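Before deploying, it can help to confirm that some credential source is visible on the machine. The sketch below is only an illustration and not part of this repository: the function name is made up, it checks just two common sources (environment variables and the shared credentials file), and it does not validate the credentials or detect IAM roles attached to an instance.<br/>

```python
import os
from pathlib import Path


def find_credential_source(env=None, home=None):
    """Return a label for the first credential source found, or None.

    Hypothetical helper: only looks at environment variables and the
    shared credentials file. It does not verify that the credentials work
    and ignores other sources such as instance or container roles.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Static keys exported in the shell take precedence over the file.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"

    # AWS_SHARED_CREDENTIALS_FILE overrides the default ~/.aws/credentials.
    credentials_file = Path(
        env.get("AWS_SHARED_CREDENTIALS_FILE") or home / ".aws" / "credentials"
    )
    if credentials_file.is_file():
        return f"credentials file: {credentials_file}"

    return None
```

Calling `find_credential_source()` with no arguments inspects the current environment and home directory. A result of `None` on Cloud9 or EC2 does not necessarily mean something is wrong, since role-based credentials are not covered by this check.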
8
8
@@ -22,26 +22,82 @@ The ‘baseline stack’ in the CDK application spins up a VPC with an S3 bucket
22
22
23
23
Go to Systems Manager in the AWS console, and then the ‘Run Command’ section. You will see the currently running command documents. <br/>
24
24
25
-
<div data-section-style='11' style='max-width:168%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/x4lfduQeC3Ww-DyK8loIAg?a=6aMBuWAgnWaZ5pQaJndaM06ob734VpmiCI5xfguyPaca' id='HPG9CA9WNsB' alt='' width='1276' height='612'></img></div>It takes about an hour for ChEMBL to build. If you get impatient and want to see the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.<br/>
<div data-section-style='11' style='max-width:155%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/Fj7sA3VuIuvdPOHl017Xcg?a=EYFlHaKY8weEGFezDR4ld3sEhBMWl88afFdDjJQ15H8a' id='HPG9CADqhgF' alt='' width='1242' height='666'></img></div>That will open an SSM session window. Run the following command to tail the log output.<br/>
27
+
It takes about an hour for ChEMBL to build. If you get impatient and want to see the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.<br/>
<h2 id='HPG9CAe1Pmp'>Enroll ChEMBL and OpenTargets into the data lake</h2>
34
42
35
43
Once the database has finished importing, go to Glue in the AWS console, and then the “Workflows” section.<br/>
36
44
37
-
<div data-section-style='11' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/K0liqaLzOGNHdODU_fN_MA?a=GQQahtSxVQNvaU6AkEjATwCE0WJglr630LH3bZcngB0a' id='HPG9CADnepH' alt='' width='1177' height='631'></img></div>Select the openTargetsDataLakeEnrollment workflow, and click ‘Actions’, then ‘Run’.<br/>
Do the same for the chemblDataLakeEnrollmentWorkflow. Wait for the workflows to finish.<br/>
56
+
57
+
<br/>
58
+
59
+
The two workflows run in parallel. The openTargetsDataLakeEnrollmentWorkflow takes about 170 minutes to complete, while the ChEMBL enrollment finishes in about 30 minutes. <br/>
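The same two workflows can also be started and monitored from a script instead of the console. The following is a minimal sketch, not code from this repository. It assumes boto3 is installed, and it takes the workflow names from this README, so check them against the names shown in the Glue console before relying on it. The function name and polling interval are illustrative.<br/>

```python
import time

# Names as written in this README; confirm them in the Glue console.
WORKFLOW_NAMES = (
    "openTargetsDataLakeEnrollmentWorkflow",
    "chemblDataLakeEnrollmentWorkflow",
)

# Glue workflow run statuses after which no further progress is made.
TERMINAL_STATUSES = {"COMPLETED", "STOPPED", "ERROR"}


def run_workflows(glue, names=WORKFLOW_NAMES, poll_seconds=60, sleep=time.sleep):
    """Start every workflow, then poll until each reaches a terminal status.

    `glue` is expected to behave like boto3.client("glue").
    Returns a dict mapping workflow name to its final status.
    """
    # Start all workflows first so that they run in parallel.
    run_ids = {name: glue.start_workflow_run(Name=name)["RunId"] for name in names}

    final = {}
    while len(final) < len(run_ids):
        for name, run_id in run_ids.items():
            if name in final:
                continue
            run = glue.get_workflow_run(Name=name, RunId=run_id)["Run"]
            if run["Status"] in TERMINAL_STATUSES:
                final[name] = run["Status"]
        if len(final) < len(run_ids):
            sleep(poll_seconds)
    return final
```

With credentials configured, `run_workflows(boto3.client("glue"))` would block until both runs end. Given the run times quoted above, a polling interval of a minute or more is plenty.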
60
+
61
+
<h2 id='HPG9CAYpovV'>Query and Conquer!</h2>
62
+
63
+
Go to Athena in the AWS Console.<br/>
64
+
65
+
<br/>
66
+
67
+
If you haven't used Athena in your account before, you will need to define a storage location for your query results. Click the ‘Settings’ tab in the top right, specify the bucket where you would like Athena results stored, and click save.<br/>
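When queries are submitted through the API rather than the console, the result location can be supplied with each query instead. The sketch below only builds the request arguments. The helper name, the bucket name in the usage note, and the `athena-results/` prefix are placeholders, not values from this project.<br/>

```python
def build_query_request(sql, database, results_bucket, prefix="athena-results/"):
    """Build keyword arguments for Athena's StartQueryExecution call.

    Hypothetical helper: `results_bucket` must be a bucket you own and
    `prefix` is an arbitrary folder for the result files.
    """
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            "OutputLocation": f"s3://{results_bucket}/{prefix}",
        },
    }
```

Assuming boto3 is installed, the request could then be sent with `boto3.client("athena").start_query_execution(**build_query_request("SELECT 1", "chembl-25-dl", "my-athena-results-bucket"))`.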
38
68
39
-
<div data-section-style='11' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/UV0-ZlwmK_KF9L9MfaUgfA?a=97k7vof4qlurzy3zSsmPVhomgCpRUJfREq8UCNZSzt4a' id='HPG9CAgkuAH' alt='' width='1177' height='631'></img></div>Do the same for the chemblDataLakeEnrollmentWorkflow.<br/>
You can now query OpenTargets and ChEMBL data through Athena!<br/>
75
+
You will see four databases listed, but you only want to use two of them:<br/>
76
+
77
+
<br/>
78
+
79
+
<u><i><b>Use:</b></i></u><br/>
80
+
81
+
<br/>
82
+
83
+
<b>chembl-25-dl </b>- This is the ‘dl’ or ‘data lake’ ChEMBL database. Always use tables in this database when running ChEMBL queries. Part of the chemblDataLakeEnrollment workflow converts the ‘source’ ChEMBL Postgres formats into a ‘data lake’ friendly Parquet format optimized for Athena. <br/>
84
+
85
+
<br/>
86
+
87
+
<b>opentargets-1911-dl </b>- This is the ‘dl’ or ‘data lake’ OpenTargets database. Always use tables in this database when running OpenTargets queries. Part of the openTargetsDataLakeEnrollment workflow converts the ‘source’ OpenTargets JSON and CSV formats into a ‘data lake’ friendly Parquet format optimized for Athena. <br/>
88
+
89
+
<br/>
90
+
91
+
<u><i><b>Don't use:</b></i></u><br/>
92
+
93
+
<br/>
94
+
95
+
<b>chembl-25-src - </b>This represents the ‘src’ or ‘source’ ChEMBL Postgres database. By design, the source database is not directly queryable from Athena, so you will not use this database. <br/>
96
+
97
+
<br/>
98
+
99
+
<b>opentargets-1911-src - </b>This is the ‘src’ or ‘source’ OpenTargets database. When you query it, you are directly querying the original JSON and CSV files from OpenTargets. Performance may be slow because those formats are not optimized for querying with Athena. <br/>
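One way to keep queries pointed at the right databases is to route every query through a small helper that only knows the two ‘dl’ names. This is a sketch under assumptions: the helper names are made up, and `molecule_dictionary` in the usage note is a table from the public ChEMBL schema that may or may not be present under that name in your data lake. Hyphenated names generally need double quotes in Athena SELECT statements, which is why the second helper quotes both parts.<br/>

```python
# Only the 'data lake' databases described above; the '-src' ones are omitted.
DATA_LAKE_DATABASES = {
    "chembl": "chembl-25-dl",
    "opentargets": "opentargets-1911-dl",
}


def list_tables_sql(dataset):
    """Return a query that lists the tables in a dataset's 'dl' database."""
    database = DATA_LAKE_DATABASES[dataset.lower()]
    return (
        "SELECT table_name FROM information_schema.tables "
        f"WHERE table_schema = '{database}' ORDER BY table_name"
    )


def qualified_table(dataset, table):
    """Return a double-quoted "database"."table" reference for SELECT queries."""
    database = DATA_LAKE_DATABASES[dataset.lower()]
    return f'"{database}"."{table}"'
```

For example, `list_tables_sql("chembl")` gives a query you can paste into the Athena editor to discover table names, and `qualified_table("chembl", "molecule_dictionary")` produces a reference suitable for a `FROM` clause. An unknown dataset name raises `KeyError`, which keeps the ‘src’ databases from being used by accident.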