-
My idea is a little different, but I'm also solving a different problem. I've made a test for it.
-
@amiorin Let's not look at the creation process right now. We should use our Authorizer interface to decide who can create/read/update/delete, rather than having per-entity logic.
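To illustrate the idea of a single authorization point instead of per-entity checks, here is a minimal sketch. The class and method names are hypothetical, not OpenMetadata's actual Authorizer API, and the "admins can do anything, everyone else can only read" policy is just an assumed example rule:

```python
from enum import Enum, auto

class Operation(Enum):
    CREATE = auto()
    READ = auto()
    UPDATE = auto()
    DELETE = auto()

class Authorizer:
    """Single authorization point shared by all entity types (hypothetical sketch)."""

    def __init__(self, admin_users):
        self.admin_users = set(admin_users)

    def authorize(self, user, operation, entity):
        # Example policy: admins may do anything; others may only read.
        # Every entity type goes through this one check, so no entity
        # needs its own create/read/update/delete logic.
        if user in self.admin_users:
            return True
        return operation is Operation.READ
```

The point is that the policy lives in one place: adding a new entity type requires no new authorization code, only another call through `authorize`.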
If a user has both Redshift and Glue sources connected and wants to extract the information, we need to make sure it's not duplicated.
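One way to deduplicate tables ingested from two sources is to key them on the physical storage location. This is a sketch under assumptions: the record shape (`source`/`location` dicts) and the normalization rule (trailing slash and case stripped) are hypothetical, not OpenMetadata's actual model:

```python
def dedupe_tables(tables):
    """Collapse table records that point at the same storage location.

    `tables` is a list of dicts with 'source' and 'location' keys
    (assumed shape); the first record seen for a location wins, so a
    table found via both Redshift and Glue is kept only once.
    """
    seen = {}
    for t in tables:
        # Normalize so 's3://b/t' and 's3://b/t/' compare equal.
        key = t["location"].rstrip("/").lower()
        seen.setdefault(key, t)
    return list(seen.values())
```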
-
Flow
Once the crawler runs, it creates a partition 2019 (as indicated in step 1) and sets up the table with its columns. The customer then edits/defines the schema to make it accurate (e.g., updates column names that were not picked up correctly).
The customer goes to the openmetadata-test S3 bucket to configure lifecycle rules on the customer_info prefix.
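The partition-discovery step above can be sketched as follows. This assumes the layout `<table-prefix>/<partition>/<file>` (as in the example) and operates on a plain list of object keys; the function name and signature are hypothetical:

```python
def discover_partitions(keys, prefix):
    """Infer first-level partitions under a table prefix from object keys.

    E.g. a key like 'customer_info/2019/data.csv' under the prefix
    'customer_info/' yields the partition '2019' (assumed layout).
    """
    partitions = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            # Only keys with a further path component define a partition.
            if "/" in rest:
                partitions.add(rest.split("/", 1)[0])
    return sorted(partitions)
```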
Strategy
For the above table, the location is "s3://openmetadata-test/customer_info/", which is of location type "Prefix" (not "Bucket").
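The Prefix-vs-Bucket distinction can be decided directly from the URI. A minimal sketch, assuming the rule that any path component after the bucket name makes the location a "Prefix":

```python
def location_type(s3_uri):
    """Classify an S3 location URI as 'Bucket' or 'Prefix'.

    Assumed rule: 's3://bucket' (or 's3://bucket/') is a Bucket;
    anything with a key component after the bucket is a Prefix.
    """
    path = s3_uri[len("s3://"):]
    bucket, _, rest = path.partition("/")
    return "Prefix" if rest.strip("/") else "Bucket"
```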
@harshach @amiorin Here's what I have in mind so far. I'm mostly focusing on step 3 of the strategy.
Looking for feedback and suggestions.
PS: The above example is limited to one bucket per database. I've seen examples that have a handful of buckets, where each bucket holds several databases and tables, again namespaced by prefix: s3://bucket/database/table/partition/data.csv
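Parsing the multi-database layout just named can be sketched as below. The fixed five-component depth (`bucket/database/table/partition/file`) is assumed from the example path, and the function name is hypothetical:

```python
def parse_object_uri(uri):
    """Split an S3 object URI into the namespace components
    bucket/database/table/partition/file (assumed fixed depth)."""
    parts = uri[len("s3://"):].split("/")
    if len(parts) != 5:
        raise ValueError("expected s3://bucket/database/table/partition/file")
    keys = ("bucket", "database", "table", "partition", "file")
    return dict(zip(keys, parts))
```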