Commit a4d89ea

root authored and committed
Merging initial aws-samples auto created repo with source.
2 parents 9453875 + 25b8fb5 commit a4d89ea

4 files changed: +179 -52 lines changed

CODE_OF_CONDUCT.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
## Code of Conduct
2+
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3+
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4+
[email protected] with any additional questions or comments.

CONTRIBUTING.md

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
1+
# Contributing Guidelines
2+
3+
Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4+
documentation, we greatly value feedback and contributions from our community.
5+
6+
Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7+
information to effectively respond to your bug report or contribution.
8+
9+
10+
## Reporting Bugs/Feature Requests
11+
12+
We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13+
14+
When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15+
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16+
17+
* A reproducible test case or series of steps
18+
* The version of our code being used
19+
* Any modifications you've made relevant to the bug
20+
* Anything unusual about your environment or deployment
21+
22+
23+
## Contributing via Pull Requests
24+
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25+
26+
1. You are working against the latest source on the *master* branch.
27+
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28+
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29+
30+
To send us a pull request, please:
31+
32+
1. Fork the repository.
33+
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34+
3. Ensure local tests pass.
35+
4. Commit to your fork using clear commit messages.
36+
5. Send us a pull request, answering any default questions in the pull request interface.
37+
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38+
39+
GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40+
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41+
42+
43+
## Finding contributions to work on
44+
Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45+
46+
47+
## Code of Conduct
48+
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49+
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50+
[email protected] with any additional questions or comments.
51+
52+
53+
## Security issue notifications
54+
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
55+
56+
57+
## Licensing
58+
59+
See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60+
61+
We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.

LICENSE

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2+
3+
Permission is hereby granted, free of charge, to any person obtaining a copy of
4+
this software and associated documentation files (the "Software"), to deal in
5+
the Software without restriction, including without limitation the rights to
6+
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7+
the Software, and to permit persons to whom the Software is furnished to do so.
8+
9+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11+
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12+
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13+
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14+
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15+

README.md

Lines changed: 99 additions & 52 deletions
@@ -1,103 +1,150 @@
1-
<h1 id='HPG9CA7YL2o'>Data Lake as Code; Featuring ChEMBL and Open Targets</h1>
1+
# Data Lake as Code; Featuring ChEMBL and Open Targets
22

3-
Companion code for upcoming AWS blogpost on enrolling chembl and opentargets into a data lake on AWS<br/>
3+
Companion code for upcoming AWS blogpost on enrolling chembl and opentargets into a data lake on AWS
44

5-
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/D5akZWKUWmfWEhA8u4loEA?a=U93UPcmkUsuoToxZr2QpWU5nosB1RwimIsIW5TtaJvEa' id='HPG9CASUn8g' alt='' width='800' height='380'></img></div><h2 id='HPG9CAlPR6i'>To install this in your own AWS account:</h2>
5+
![](https://quip-amazon.com/blob/HPG9AAwumxR/D5akZWKUWmfWEhA8u4loEA?a=U93UPcmkUsuoToxZr2QpWU5nosB1RwimIsIW5TtaJvEa)
66

7-
Your local machine needs to have the AWS CLI installed on your machine along with IAM permissions setup (through IAM role or .aws/credentials file). I like to use Cloud9 as my IDE as it comes with both of those already setup for me.<br/>
7+
## To install this in your own AWS account:
88

9-
<br/>
9+
Your local machine needs the AWS CLI installed, along with IAM permissions set up (through an IAM role or a .aws/credentials file). I like to use Cloud9 as my IDE as it comes with both of those already set up for me.
1010

11-
Run the following commands<br/>
11+
Run the following commands
1212

13-
<pre id='HPG9CAKfUT3'>git clone https://github.com/paulu-aws/chembl-opentargets-data-lake-example.git<br>cd chembl-opentargets-data-lake-example<br>./InstallCdkDependencies.sh<br>./DeployChemblOpenTargetsEnv.sh</pre>
13+
```shell
14+
git clone https://github.com/paulu-aws/chembl-opentargets-data-lake-example.git
15+
cd chembl-opentargets-data-lake-example
16+
./InstallCdkDependencies.sh
17+
./DeployChemblOpenTargetsEnv.sh
18+
```
1419

15-
Wait for Chembl and OpenTargets to be ‘staged’ into the baseline stack.<br/>
20+
Wait for Chembl and OpenTargets to be ‘staged’ into the baseline stack.
1621

17-
<br/>
22+
The ‘baseline stack’ in the CDK application spins up a VPC with an S3 bucket (for OpenTargets) and an RDS Postgres instance (for ChEMBL). It also spins up a little helper EC2 instance that stages those assets in their ‘raw’ form after downloading them from [OpenTargets.org](http://OpenTargets.org) and EMBL-EBI.
1823

19-
The ‘baseline stack’ in the CDK application spins up a VPC with an S3 bucket (for OpenTargets) and an RDS Postgres instance (for Chembl). It also spins up a little helper EC2 instance that stages those assets in their ‘raw’ form after downloading them from<a href="http://OpenTargets.org"> OpenTargets.org</a> and EMBL-EBI.<br/>
24+
Go to Systems Manager in the AWS console, and then the ‘Run Command’ section. You will see the currently running command documents.
2025

21-
<br/>
26+
![](https://quip-amazon.com/blob/HPG9AAwumxR/x4lfduQeC3Ww-DyK8loIAg?a=6aMBuWAgnWaZ5pQaJndaM06ob734VpmiCI5xfguyPaca)
2227

23-
Go to Systems Manager in the AWS console, and then the ‘Run Command’ section. You will see the currently running command documents. <br/>
28+
It takes about an hour for Chembl to build. If you get impatient and want to see the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.
2429

25-
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/x4lfduQeC3Ww-DyK8loIAg?a=6aMBuWAgnWaZ5pQaJndaM06ob734VpmiCI5xfguyPaca' id='HPG9CA9WNsB' alt='' width='1276' height='612'></img></div><br/>
30+
![](https://quip-amazon.com/blob/HPG9AAwumxR/Fj7sA3VuIuvdPOHl017Xcg?a=EYFlHaKY8weEGFezDR4ld3sEhBMWl88afFdDjJQ15H8a)
2631

27-
It takes about an hour for Chembl to build. If you get impatient and want to see the progress in real time, go to ‘Session Manager’ in the Systems Manager console, click the ‘Start session’ button, choose the ‘ChembDbImportInstance’ radio button, and click the ‘Start Session’ button.<br/>
32+
That will open an SSM session window. Run the following command to tail the log output.
2833

29-
<br/>
34+
```tail -f /home/ssm-user/progressLog```
3035

31-
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/Fj7sA3VuIuvdPOHl017Xcg?a=EYFlHaKY8weEGFezDR4ld3sEhBMWl88afFdDjJQ15H8a' id='HPG9CADqhgF' alt='' width='1242' height='666'></img></div><br/>
36+
![](https://quip-amazon.com/blob/HPG9AAwumxR/rMcRhjzUcIGQVYeBFxup4Q?a=2NRscRrktD9kLK7rDqqD9bO3aXtTYttCeaEWLwDXVgIa)
3237

33-
That will open a SSM session window. Run the following command to tail the log output.<br/>
38+
## Enroll Chembl and OpenTargets into the data lake
3439

35-
<pre id='HPG9CAuziva'>tail -f /home/ssm-user/progressLog</pre>
40+
Once the database has finished importing, go to Glue in the AWS console, and then the “Workflows” section
3641

37-
<br/>
42+
![](https://quip-amazon.com/blob/HPG9AAwumxR/K0liqaLzOGNHdODU_fN_MA?a=GQQahtSxVQNvaU6AkEjATwCE0WJglr630LH3bZcngB0a)
3843

39-
<div data-section-style='11' class='tall' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/rMcRhjzUcIGQVYeBFxup4Q?a=2NRscRrktD9kLK7rDqqD9bO3aXtTYttCeaEWLwDXVgIa' id='HPG9CAgo8Yy' alt='' width='1115' height='1030'></img></div><br/>
44+
Select the openTargetsDataLakeEnrollment workflow, and click ‘Actions’, then 'Run'
4045

41-
<h2 id='HPG9CAe1Pmp'>Enroll Chembl and OpenTargets into the data lake</h2>
46+
![](https://quip-amazon.com/blob/HPG9AAwumxR/UV0-ZlwmK_KF9L9MfaUgfA?a=97k7vof4qlurzy3zSsmPVhomgCpRUJfREq8UCNZSzt4a)
4247

43-
Once the database has finished importing, go to Glue in the AWS console, and then the “Workflows” section<br/>
48+
Do the same for the chemblDataLakeEnrollmentWorkflow. Wait for the workflows to finish.
4449

45-
<br/>
50+
Both workflows will run in parallel, but it will take the openTargetsDataLakeEnrollmentWorkflow ~170 minutes to complete while the ChEMBL enrollment will finish in about 30 minutes.
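If you would rather start both workflows from a script than from the console, something like the following sketch may work. It assumes boto3's Glue client and its `start_workflow_run` call; the workflow names are copied from this README and may not match the names actually deployed in your account, so treat them as placeholders.

```python
from typing import Any, Dict, Iterable

# Workflow names as written in this README; these are assumptions and
# should be checked against the Glue "Workflows" page in your account.
ENROLLMENT_WORKFLOWS = (
    "openTargetsDataLakeEnrollmentWorkflow",
    "chemblDataLakeEnrollmentWorkflow",
)


def start_enrollment_workflows(
    glue_client: Any, names: Iterable[str] = ENROLLMENT_WORKFLOWS
) -> Dict[str, str]:
    """Start each Glue workflow and return a {workflow name: run id} map."""
    run_ids = {}
    for name in names:
        # start_workflow_run returns a dict containing the new run's "RunId".
        response = glue_client.start_workflow_run(Name=name)
        run_ids[name] = response["RunId"]
    return run_ids


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   print(start_enrollment_workflows(boto3.client("glue")))
```

Passing the client in as an argument keeps the function easy to try out with a stub before pointing it at a real account.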
4651

47-
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/K0liqaLzOGNHdODU_fN_MA?a=GQQahtSxVQNvaU6AkEjATwCE0WJglr630LH3bZcngB0a' id='HPG9CADnepH' alt='' width='1177' height='631'></img></div><br/>
52+
## Query and Conquer!
4853

49-
Select the openTargetsDataLakeEnrollment workflow, and click ‘Actions’, then 'Run'<br/>
54+
Go to Athena in the AWS Console.
5055

51-
<br/>
56+
If you haven't used Athena in your account before, you will need to define a storage location for your query results. Click on the ‘Settings’ tab in the top right and specify a bucket name where you would like Athena results stored and click save.
5257

53-
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/UV0-ZlwmK_KF9L9MfaUgfA?a=97k7vof4qlurzy3zSsmPVhomgCpRUJfREq8UCNZSzt4a' id='HPG9CAgkuAH' alt='' width='1177' height='631'></img></div><br/>
58+
![](https://quip-amazon.com/blob/HPG9AAwumxR/d9imQFzWnNdhWYDAo9Bt1A?a=8Q4UOXPqvG1fk3knDX9x2wr9Jeu9g8V2tPRYsnE3Vlga)
5459

55-
Do the same for the chemblDataLakeEnrollmentWorkflow. Wait for the workflows to finish.<br/>
60+
Now, click the ‘Databases’ dropdown:
5661

57-
<br/>
62+
You will see 4 databases listed, but you only want to use 2 of them:
5863

59-
Both workflows will run in parallel, but it will take the openTargetsDataLakeEnrollmentWorkflow ~170 minutes to complete while the Chembl enrollment will finish in about 30 minutes. <br/>
64+
_**Use:**_
6065

61-
<h2 id='HPG9CAYpovV'>Query an Conquer!</h2>
66+
**chembl-25-dl** - This is the ‘dl’ or ‘data lake’ Chembl database. Always use tables in this database when running Chembl queries. Part of the chemblDataLakeEnrollment workflow converts the ‘source’ Chembl Postgres formats into a ‘data lake’ friendly parquet format optimized for Athena.
6267

63-
Go to Athena in the AWS Console.<br/>
68+
**opentargets-1911-dl** - This is the ‘dl’ or ‘data lake’ OpenTargets database. Always use tables in this database when running OpenTargets queries. Part of the openTargetsDataLakeEnrollment workflow converts the ‘source’ OpenTargets json and csv formats into a ‘data lake’ parquet format optimized for Athena.
6469

65-
<br/>
70+
_**Don't use:**_
6671

67-
If you havent used Athena in your account before, you will need to define a storage location for your query results. Click on the ‘Settings’ tab in the top right and specify a bucket name where you would like Athena results stored and click save.<br/>
72+
**chembl-25-src** - This represents the ‘src’ or ‘source’ Chembl postgres database. By design, the source database is not directly queryable from Athena, so you will not use this database.
6873

69-
<div data-section-style='11' style=''><img src='https://quip-amazon.com/blob/HPG9AAwumxR/d9imQFzWnNdhWYDAo9Bt1A?a=8Q4UOXPqvG1fk3knDX9x2wr9Jeu9g8V2tPRYsnE3Vlga' id='HPG9CASHmiz' alt='' width='800' height='429'></img></div><br/>
74+
**opentargets-1911-src** - This is the ‘src’ or ‘source’ OpenTargets database. When you query its tables, you are directly querying the original json and csv files from OpenTargets. The performance may be slow as those formats are not optimized for querying with Athena.
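As a rough sketch of querying the ‘dl’ database from code rather than the console, the snippet below builds the arguments for boto3's Athena `start_query_execution` call. The table and column names are taken from the grant example elsewhere in this README and are assumed to exist in `chembl-25-dl`; the results bucket is a hypothetical placeholder. Passing the database through `QueryExecutionContext` avoids having to quote the hyphenated database name inside the SQL.

```python
from typing import Any, Dict


def build_athena_request(sql: str, database: str, output_location: str) -> Dict[str, Any]:
    """Build keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        # Setting the database here means the SQL can use bare table names.
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    sql=(
        "SELECT molregno, standard_inchi_key "
        "FROM chembl_25_public_compound_structures LIMIT 10"
    ),
    database="chembl-25-dl",
    output_location="s3://my-athena-results-bucket/",  # hypothetical bucket
)

# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```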
7075

71-
Now, click the ‘Databases’ dropdown:<br/>
76+
77+
## Permissions & Lake Formation
7278

73-
<br/>
79+
There are [two methods of security](https://docs.aws.amazon.com/lake-formation/latest/dg/access-control-overview.html) you can apply to your data lake. The default account configuration, which is likely what you are using at the moment, is essentially “open” Lake Formation permissions and “fine-grained” IAM policies. The DataSetStack construct implements a number of CDK-style grant*() methods. The grantIamRead() method grants a “fine-grained” IAM policy that gives users read access to just the tables in the data set you perform the grant on.
7480

75-
You will see 4 databases listed, you only want to use 2 of them:<br/>
7681

77-
<br/>
7882

79-
<u><i><b>Use:</b></i></u><br/>
83+
For example, in the bin/aws.ts file you can see an example of granting that “fine-grained” IAM read permission. Pretty easy! Here we are passing the role from the notebook, but you can import an existing IAM user, role, or group using the CDK.
84+
```typescript
85+
chemblStack.grantIamRead(analyticsStack.NotebookRole);
86+
openTargetsStack.grantIamRead(analyticsStack.NotebookRole);
87+
```
88+
The other method of security gives you more control: specifically, the ability to control permissions at the database, table, and column level. This requires “fine-grained” Lake Formation permissions and “coarse” IAM permissions. The `grantDatabasePermissions()`, `grantTablePermissions()`, and `grantTableWithColumnPermissions()` methods set up both the fine-grained Lake Formation and coarse IAM permissions for you.
8089

81-
<br/>
90+
8291

83-
<b>chembl-25-dl </b>- This is the ‘dl’ or ‘data lake’ Chembl database. Always use tables in this database when running Chembl queries. Part of the chemblDataLakeEnrollment workflow converts the ‘source’ Chembl Postgres formats into a ‘data lake’ friendly parquet format optimized for Athena. <br/>
92+
Again, another example in the `bin/aws.ts` file:
8493

85-
<br/>
94+
```typescript
95+
const exampleUser = iam.User.fromUserName(coreDataLake, 'exampleGrantee', 'paulUnderwood' );
8696

87-
<b>opentargets-1911-dl </b>- This is the ‘dl’ or ‘data lake’ OpenTargets database. Always use this table when running OpenTarget queries. Part of the chemblDataLakeEnrollment workflow converts the ‘source’ OpenTargets json and csv formats into a ‘data lake’ parquet format optimized for Athena. <br/>
97+
var exampleTableWithColumnsGrant: DataLakeEnrollment.TableWithColumnPermissionGrant = {
98+
table: "chembl_25_public_compound_structures",
99+
// Note that we are NOT including 'canonical_smiles'. That effectively prevents this user from querying that column.
100+
columns: ['molregno', 'molfile', 'standard_inchi', 'standard_inchi_key'],
101+
DatabasePermissions: [],
102+
GrantableDatabasePermissions: [],
103+
TableColumnPermissions: [DataLakeEnrollment.TablePermission.Select],
104+
GrantableTableColumnPermissions: []
105+
};
88106

89-
<br/>
107+
chemblStack.grantTableWithColumnPermissions(exampleUser, exampleTableWithColumnsGrant);
108+
```
109+
90110

91-
<u><i><b>Dont use:</b></i></u><br/>
111+
The `GrantableDatabasePermissions` and `GrantableTableColumnPermissions` properties give the supplied IAM principal permission to grant permissions to others. If you have a data-set steward, or someone who should have the authority to grant permissions to others, you can "grant the permission to grant" using those properties.
92112

93-
<br/>
113+
94114

95-
<b>chembl-25-src - </b>This represents the ‘src’ or ‘source’ Chembl postgres database. By design, the source database is not directly queryable from Athena, so you will not use this database. <br/>
115+
To illustrate the relationship between the fine-grained and coarse permissions, think of it as two doors. An IAM principal needs to have permission to walk through both doors to query the data lake. The DataLakeEnrollment construct handles granting both the fine and coarse permissions for you.
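The two-door idea can be written down as a toy model. This is only a simplified illustration of the analogy, not how AWS actually evaluates IAM or Lake Formation permissions; the function and parameter names are made up for the example, and the column list mirrors the `exampleTableWithColumnsGrant` shown earlier.

```python
from typing import Dict, Iterable, Set


def can_query(
    principal: str,
    columns: Iterable[str],
    iam_readers: Set[str],
    lf_column_grants: Dict[str, Set[str]],
) -> bool:
    """Return True only if the principal gets through both 'doors'."""
    # Door 1: the coarse IAM permission on the data set.
    through_iam_door = principal in iam_readers
    # Door 2: the fine-grained Lake Formation grant must cover every
    # requested column.
    granted = lf_column_grants.get(principal)
    through_lf_door = granted is not None and set(columns) <= granted
    return through_iam_door and through_lf_door


iam_readers = {"paulUnderwood"}
lf_column_grants = {
    "paulUnderwood": {"molregno", "molfile", "standard_inchi", "standard_inchi_key"},
}

allowed = can_query("paulUnderwood", ["molregno"], iam_readers, lf_column_grants)
blocked = can_query("paulUnderwood", ["canonical_smiles"], iam_readers, lf_column_grants)
```

In this model the first query passes both doors, while the second is stopped at the Lake Formation door because `canonical_smiles` was left out of the grant.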
96116

97-
<br/>
117+
![image.png](https://api.quip-amazon.com/2/blob/HPG9AAwumxR/ACYxNvcfFhaRL15neEGWHA)
98118

99-
<b>opentargets-1911-src - </b>This is the ‘src’ or ‘source’ table. When you query this table, you are directly querying the original chembl json and csv files<b> </b>from OpenTargets. The performance may be slow as those formats are not optimized for querying with Athena. <br/>
119+
100120

101-
<br/>
121+
If you decide that you want the additional flexibility of Lake Formation permissions, you need to perform two manual actions before Lake Formation permissions will begin protecting your resources. Until you perform these two steps, you are only protecting your resources with the coarse IAM permission and the Lake Formation permissions won't apply.
102122

103-
<br/>
123+
124+
125+
1) Change the default permissions for newly created databases and tables
126+
127+
128+
129+
Visit the Lake Formation service page in the AWS console, and go to the Settings section on the left.
130+
131+
132+
You need to **UNCHECK** the two boxes and hit Save.
133+
134+
![image.png](https://api.quip-amazon.com/2/blob/HPG9AAwumxR/luIf4C1WcTNeDeixOEbqsg)
135+
136+
2) You need to revoke all of the Lake Formation permissions that have been granted to `IAM_ALLOWED_PRINCIPALS`. If you have used Glue in the past, or the ChEMBL or OpenTargets workflows have already completed, you can see a bunch of them in the Data Permissions section in the Lake Formation console. By unchecking the boxes before, we are now stopping the default behavior where Lake Formation adds an `IAM_ALLOWED_PRINCIPALS` grant to any Glue Tables/Resources created.
137+
138+
139+
140+
Now that we have stopped that default-add `IAM_ALLOWED_PRINCIPALS` behavior, we need to back out any existing grants to `IAM_ALLOWED_PRINCIPALS`. As long as they remain, any IAM principal with coarse IAM permissions to the resource will still be able to query columns or tables they shouldn't have access to.
141+
142+
143+
144+
The `local.datalake.RemoveIamAllowedPrincipals.py` python script will save you the effort of manually revoking those permissions from IAM_ALLOWED_PRINCIPALS. Running the following command will issue the revokes for all IAM_ALLOWED_PRINCIPALS granted permissions.
145+
146+
```
147+
python ./script/local.datalake.RemoveIamAllowedPrincipals.py
148+
```
149+
150+
DON'T RUN THIS COMMAND IF YOU HAVE PEOPLE ALREADY RELYING ON THE AWS GLUE CATALOG (via Athena for example). This will effectively remove their access until you grant them user/role/group specific Lake Formation permissions.
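To give a feel for what a revoke pass like this involves, here is a minimal sketch. It is not the script shipped in this repository. It assumes the entries come from boto3's Lake Formation `list_permissions` call (the `PrincipalResourcePermissions` list) and that each resulting dict is passed to `revoke_permissions`; some resource shapes returned by `list_permissions` may need adjusting before they are accepted by a revoke, so check against your own catalog first.

```python
from typing import Any, Dict, Iterable, List

IAM_ALLOWED_PRINCIPALS = "IAM_ALLOWED_PRINCIPALS"


def build_revokes(entries: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn list_permissions entries into revoke_permissions arguments.

    Only grants made to IAM_ALLOWED_PRINCIPALS are selected; grants to
    named users, roles, or groups are left alone.
    """
    revokes = []
    for entry in entries:
        principal = entry["Principal"]
        if principal["DataLakePrincipalIdentifier"] != IAM_ALLOWED_PRINCIPALS:
            continue
        revokes.append(
            {
                "Principal": principal,
                "Resource": entry["Resource"],
                "Permissions": entry["Permissions"],
            }
        )
    return revokes


# Usage, assuming boto3 is installed and credentials are configured
# (list_permissions is paginated, so a real script would follow NextToken):
#   import boto3
#   lf = boto3.client("lakeformation")
#   page = lf.list_permissions()
#   for kwargs in build_revokes(page["PrincipalResourcePermissions"]):
#       lf.revoke_permissions(**kwargs)
```

Building the list of revokes first, before calling the API, makes it possible to print and review what would be removed, which matters given the warning above.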
