Merge branch 'master' of github.com:paulu-aws/chembl-opentargets-data-lake-example

EC2 Default User · EC2 Default User · commit f8c3462e62e3 · 2020-03-09T21:47:25.000Z
diff --git a/README.md b/README.md
@@ -1,49 +1,103 @@
-<h1 id='HPG9CA7YL2o'>Chembl and Open Targets in an AWS Data Lake</h1>
+<h1 id='HPG9CA7YL2o'>ChEMBL and Open Targets in an AWS Data Lake</h1>
 
 Companion code for upcoming AWS blogpost on enrolling chembl and opentargets into a data lake on AWS<br/>
 
+<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/D5akZWKUWmfWEhA8u4loEA?a=U93UPcmkUsuoToxZr2QpWU5nosB1RwimIsIW5TtaJvEa' id='HPG9CASUn8g' alt='' width='800' height='380'></img></div><h2 id='HPG9CAlPR6i'>To install this in your own AWS account:</h2>
+
+Your local machine needs to have the AWS CLI installed on your machine along with IAM permissions setup (through IAM role or .aws/credentials file). I like to use Cloud9 as my IDE as it comes with both of those already setup for me.<br/>
+
+<br/>
+
+Run the following commands<br/>
+
+<pre id='HPG9CAKfUT3'>git clone https://github.com/paulu-aws/chembl-opentargets-data-lake-example.git<br>cd chembl-opentargets-data-lake-example<br>./InstallCdkDependencies.sh<br>./DeployChemblOpenTargetsEnv.sh</pre>
+
+Wait for Chembl and OpenTargets to be ‘staged’ into the baseline stack.<br/>
+
+<br/>
+
+The ‘baseline stack’ in the CDK application spins up a VPC with an S3 bucket (for OpenTargets) and an RDS Postgres instance (for Chembl). It also spins up a little helper EC2 instance that stages those assets in their ‘raw’ form after downloading them from<a href="http://OpenTargets.org"> OpenTargets.org</a> and EMBL-EBI.<br/>
+
+<br/>
+
+Go to Systems Manager in the AWS console, and then the ‘Run Command’ section. You will see the currently running command documents. <br/>
+
+<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/x4lfduQeC3Ww-DyK8loIAg?a=6aMBuWAgnWaZ5pQaJndaM06ob734VpmiCI5xfguyPaca' id='HPG9CA9WNsB' alt='' width='1276' height='612'></img></div><br/>
+
+It takes about an hour for Chembl to build. If you get impatient and want to see the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.<br/>
+
+<br/>
+
+<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/Fj7sA3VuIuvdPOHl017Xcg?a=EYFlHaKY8weEGFezDR4ld3sEhBMWl88afFdDjJQ15H8a' id='HPG9CADqhgF' alt='' width='1242' height='666'></img></div><br/>
+
+That will open a SSM session window. Run the following command to tail the log output.<br/>
+
+<pre id='HPG9CAuziva'>tail -f /home/ssm-user/progressLog</pre>
+
+<br/>
+
+<div data-section-style='11' class='tall' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/rMcRhjzUcIGQVYeBFxup4Q?a=2NRscRrktD9kLK7rDqqD9bO3aXtTYttCeaEWLwDXVgIa' id='HPG9CAgo8Yy' alt='' width='1115' height='1030'></img></div><br/>
+
+<h2 id='HPG9CAe1Pmp'>Enroll Chembl and OpenTargets into the data lake</h2>
+
+Once the database has finished importing, go to Glue in the AWS console, and then the “Workflows” section<br/>
+
 <br/>
 
-To install this in your own AWS account:<br/>
+<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/K0liqaLzOGNHdODU_fN_MA?a=GQQahtSxVQNvaU6AkEjATwCE0WJglr630LH3bZcngB0a' id='HPG9CADnepH' alt='' width='1177' height='631'></img></div><br/>
+
+Select the openTargetsDataLakeEnrollment workflow, and click ‘Actions’, then 'Run'<br/>
 
 <br/>
 
-<div style="" data-section-style='6' class=""><ul id='HPG9CAdvuXu'><li id='HPG9CAsuX2r' class='' value='1'>Clone this repo
+<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/UV0-ZlwmK_KF9L9MfaUgfA?a=97k7vof4qlurzy3zSsmPVhomgCpRUJfREq8UCNZSzt4a' id='HPG9CAgkuAH' alt='' width='1177' height='631'></img></div><br/>
 
-<br/></li></ul></div><pre id='HPG9CAKfUT3'>git clone <a href="https://github.com/paulu-aws/chembl-opentargets-data-lake-example.git">https://github.com/paulu-aws/chembl-opentargets-data-lake-example.git</a></pre>
+Do the same for the chemblDataLakeEnrollmentWorkflow. Wait for the workflows to finish.<br/>
 
-<div style="" data-section-style='6' class="list-numbering-continue"><ul id='HPG9CAPCYqe'><li id='HPG9CASYAsv' class='' value='1'>Install the CDK dependencies
+<br/>
 
-<br/></li></ul></div><pre id='HPG9CAwKcSI'>./InstallCdkDependencies.sh</pre>
+Both workflows will run in parallel, but it will take the openTargetsDataLakeEnrollmentWorkflow ~170 minutes to complete while the Chembl enrollment will finish in about 30 minutes. <br/>
 
-<div style="" data-section-style='6' class="list-numbering-continue"><ul id='HPG9CAHILzl'><li id='HPG9CAK6MvP' class='' value='1'>Deploy the CDK Stacks
+<h2 id='HPG9CAYpovV'>Query an Conquer!</h2>
 
-<br/></li></ul></div><pre id='HPG9CAeahes'>./DeployChemblOpenTargetsEnv.sh</pre>
+Go to Athena in the AWS Console.<br/>
 
-<div class="list-numbering-restart-at" data-section-style='6' style="--indent0: 4"><ul id='HPG9CAS1C4M'><li id='HPG9CA0az5T' class='parent' value='1'>Wait for Chembl and OpenTargets to be ‘staged’ into the baseline stack.
+<br/>
 
-<br/></li><ul><li id='HPG9CAimAzR' class=''>The ‘baseline stack’ in the CDK application spins up a VPC with an S3 bucket (for OpenTargets) and an RDS Postgres instance (for Chembl). It also spins up a little helper EC2 instance that stages those assets in their ‘raw’ form<a href="http://OpenTargets.org"> OpenTargets.org</a> and EMBL-EBI into your account.
+If you havent used Athena in your account before, you will need to define a storage location for your query results. Click on the ‘Settings’ tab in the top right and specify a bucket name where you would like Athena results stored and click save.<br/>
 
-<br/></li><li id='HPG9CA9PXT3' class=''>Go to Systems Manager in the AWS console, and then the ‘Run Command’ section. You will see the currently running command documents. 
+<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/d9imQFzWnNdhWYDAo9Bt1A?a=8Q4UOXPqvG1fk3knDX9x2wr9Jeu9g8V2tPRYsnE3Vlga' id='HPG9CASHmiz' alt='' width='800' height='429'></img></div><br/>
 
-<br/></li><li id='HPG9CA9WNsB' class=''><span data-section-style='11' style='max-width:168%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/x4lfduQeC3Ww-DyK8loIAg?a=6aMBuWAgnWaZ5pQaJndaM06ob734VpmiCI5xfguyPaca' id='HPG9CA9WNsB' alt='' width='1276' height='612'></img></span></li><li id='HPG9CAQFK7w' class=''>It takes about an hour for Chembl to build. If you get impatient and want to see the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.
+Now, click the ‘Databases’ dropdown:<br/>
 
-<br/></li><li id='HPG9CADqhgF' class=''><span data-section-style='11' style='max-width:155%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/Fj7sA3VuIuvdPOHl017Xcg?a=EYFlHaKY8weEGFezDR4ld3sEhBMWl88afFdDjJQ15H8a' id='HPG9CADqhgF' alt='' width='1242' height='666'></img></span></li><li id='HPG9CAByHqU' class=''>That will open a SSM session window run the following command
+<br/>
 
-<br/></li></ul></ul></div><pre id='HPG9CAuziva'> <code>tail -f progressLog</code></pre>
+You will see 4 databases listed, you only want to use 2 of them:<br/>
 
-<div style="" data-section-style='6' class=""><ul id='HPG9CAaYWQI'><li id='HPG9CAgo8Yy' class='' value='1'><span data-section-style='11' class='tall' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/rMcRhjzUcIGQVYeBFxup4Q?a=2NRscRrktD9kLK7rDqqD9bO3aXtTYttCeaEWLwDXVgIa' id='HPG9CAgo8Yy' alt='' width='1115' height='1030'></img></span></li></ul></div><br/>
+<br/>
 
-<div style="" data-section-style='6' class="list-numbering-continue"><ul id='HPG9CA8jP0B'><li id='HPG9CA6hIcf' class='' value='1'>Once the database has finished importing, go to Glue in the AWS console, and then the “Workflows” section
+<u><i><b>Use:</b></i></u><br/>
 
-<br/></li><li id='HPG9CADnepH' class=''><span data-section-style='11' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/K0liqaLzOGNHdODU_fN_MA?a=GQQahtSxVQNvaU6AkEjATwCE0WJglr630LH3bZcngB0a' id='HPG9CADnepH' alt='' width='1177' height='631'></img></span></li><li id='HPG9CApeYdR' class=''>Select the openTargetsDataLakeEnrollment workflow, and click ‘Actions’, then 'Run'
+<br/>
 
-<br/></li><li id='HPG9CAgkuAH' class=''><span data-section-style='11' style='max-width:147%'><img src='https://quip-amazon.com/blob/HPG9AAwumxR/UV0-ZlwmK_KF9L9MfaUgfA?a=97k7vof4qlurzy3zSsmPVhomgCpRUJfREq8UCNZSzt4a' id='HPG9CAgkuAH' alt='' width='1177' height='631'></img></span></li><li id='HPG9CA1tRSd' class=''>Do the same for the chemblDataLakeEnrollmentWorkflow
+<b>chembl-25-dl </b>- This is the ‘dl’ or ‘data lake’ Chembl database. Always use tables in this database when running Chembl queries. Part of the chemblDataLakeEnrollment workflow converts the ‘source’ Chembl Postgres formats into a ‘data lake’ friendly parquet format optimized for Athena. <br/>
 
-<br/></li><li id='HPG9CA4pCXY' class=''>Wait for the workflows to finish.
+<br/>
 
-<br/></li></ul></div><br/>
+<b>opentargets-1911-dl </b>- This is the ‘dl’ or ‘data lake’ OpenTargets database. Always use this table when running OpenTarget queries. Part of the chemblDataLakeEnrollment workflow converts the ‘source’ OpenTargets json and csv formats into a ‘data lake’ parquet format optimized for Athena. <br/>
 
-You can now query opentargets and chembl data through Athena!<br/>
+<br/>
+
+<u><i><b>Dont use:</b></i></u><br/>
+
+<br/>
+
+<b>chembl-25-src - </b>This represents the ‘src’ or ‘source’ Chembl postgres database. By design, the source database is not directly queryable from Athena, so you will not use this database. <br/>
+
+<br/>
+
+<b>opentargets-1911-src - </b>This is the ‘src’ or ‘source’ table. When you query this table, you are directly querying the original chembl json and csv files<b> </b>from OpenTargets. The performance may be slow as those formats are not optimized for querying with Athena. <br/>
+
+<br/>
 
 <br/>