Commit d0cbd9f

Feature: Glue Governed Tables (#560) (#571)
* Feature: Write to Glue Governed Tables (#560)
* Initial Commit
* Minor - Refactoring Work Units Logic
* Major - Checkpoint w/ functional read code/example
* Initial Commit
* Minor - Refactoring Work Units Logic
* Major - Checkpoint w/ functional read code/example
* Minor - Removing unnecessary ensure_session
* Minor - Adding changes from comments and review
* Minor - Adding Abort, Begin, Commit and Extend transactions
* Minor - Adding missing functions
* Minor - Adding missing @Property
* Minor - Disable too many public methods
* Minor - Checkpoint
* Major - Governed tables write operations tested
* Minor - Adding validate flow on branches
* Minor - reducing static checks
* Minor - Adding to_csv code
* Minor - Disabling too-many-branches
* Major - Ready for release
* Minor - Proofreading
* Minor - Removing needless use_threads argument
* Minor - Removing the need to specify table_type when table is already created
* Minor - Fixing _catalog_id call
* Minor - Clarifying SQL filter operation
* Minor - Removing type ignore
* Minor - Reducing scope gitworkflow
* Minor - Fixing _sanitize_name
* Minor - Adding map_types flag
* Minor - Aligning optional path argument with main branch
* Minor tests adjustments
* Minor - Removing Chunked parameter
* Fixing issues from diverged branch
* Major - M1 Launch (Stable)
* Improving read by concatenating zero copies of arrow tables
* [skip ci] - Minor - Killing thread
* [skip ci] - Minor - Passing client instead of session
* Major - Adding Metadata Transaction API changes
* Minor - Adding query as of time to test
* [skip ci] - syncing with main branch
* Merge main and adapt tests to API changes from Erie team
* Merge main
* Merging main - move to poetry
* Merge main and resolve conflicts
* Minor - Sync with main
* Merging with main
* Lint
* Fixing get_table_obj retries
* Green tests
* Minor - Fixing automated merge
* LakeFormation test infra
* Commit protocol change - Erie
* [skip ci] - Minor - Fixing catalog unit test
* [skip ci] - Minor - Adding transaction_id to does_table_exist
* Minor - Missing projection_storage_location_template
* Upgrading botocore
* xfail moto
* Adding s3fs to tox
* LF concurrent modification exception
* catalog.py test
* lint
1 parent cb8c69f commit d0cbd9f

51 files changed, +3155 -1414 lines changed

.github/workflows/minimal-tests.yml

Lines changed: 0 additions & 1 deletion
@@ -4,7 +4,6 @@ on:
   push:
     branches:
       - main
-      - main-governed-tables
   pull_request:
     branches:
       - main

.github/workflows/static-checking.yml

Lines changed: 0 additions & 1 deletion
@@ -4,7 +4,6 @@ on:
   push:
     branches:
       - main
-      - main-governed-tables
   pull_request:
     branches:
       - main

CONTRIBUTING.md

Lines changed: 14 additions & 10 deletions
@@ -153,9 +153,9 @@ or

 ``cd scripts``

-* Deploy the Cloudformation template `base.yaml`
+* Deploy the `base` CDK stack

-``./deploy-base.sh``
+``./deploy-stack.sh base``

 * Return to the project root directory

@@ -175,7 +175,7 @@ or

 * [OPTIONAL] To remove the base test environment cloud formation stack post testing:

-``./test_infra/scripts/delete-base.sh``
+``./test_infra/scripts/delete-stack.sh base``

 ### Full test environment

@@ -210,14 +210,18 @@ or

 ``cd scripts``

-* Deploy the Cloudformation templates `base.yaml` and `databases.yaml`. This step could take about 15 minutes to deploy.
+* Deploy the `base` and `databases` CDK stacks. This step could take about 15 minutes to deploy.

-``./deploy-base.sh``
-``./deploy-databases.sh``
+``./deploy-stack.sh base``
+``./deploy-stack.sh databases``

-* [OPTIONAL] Deploy the Cloudformation template `opensearch.yaml` (if you need to test Amazon OpenSearch Service). This step could take about 15 minutes to deploy.
+* [OPTIONAL] Deploy the `lakeformation` CDK stack (if you need to test against the AWS Lake Formation Service). You must ensure Lake Formation is enabled in the account.

-``./deploy-opensearch.sh``
+``./deploy-stack.sh lakeformation``
+
+* [OPTIONAL] Deploy the `opensearch` CDK stack (if you need to test against the Amazon OpenSearch Service). This step could take about 15 minutes to deploy.
+
+``./deploy-stack.sh opensearch``

 * Go to the `EC2 -> SecurityGroups` console, open the `aws-data-wrangler-*` security group and configure to accept your IP from any TCP port.
 - Alternatively run:
@@ -254,9 +258,9 @@ or

 * [OPTIONAL] To remove the base test environment cloud formation stack post testing:

-``./test_infra/scripts/delete-base.sh``
+``./test_infra/scripts/delete-stack.sh base``

-``./test_infra/scripts/delete-databases.sh``
+``./test_infra/scripts/delete-stack.sh databases``

 ## Recommended Visual Studio Code Recommended setting

README.md

Lines changed: 1 addition & 0 deletions
@@ -139,6 +139,7 @@ FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
   - [029 - S3 Select](https://github.com/awslabs/aws-data-wrangler/blob/main/tutorials/029%20-%20S3%20Select.ipynb)
   - [030 - Data Api](https://github.com/awslabs/aws-data-wrangler/blob/main/tutorials/030%20-%20Data%20Api.ipynb)
   - [031 - OpenSearch](https://github.com/awslabs/aws-data-wrangler/blob/main/tutorials/031%20-%20OpenSearch.ipynb)
+  - [032 - Lake Formation Governed Tables](https://github.com/awslabs/aws-data-wrangler/blob/main/tutorials/032%20-%20Lake%20Formation%20Governed%20Tables.ipynb)
 - [**API Reference**](https://aws-data-wrangler.readthedocs.io/en/2.12.1/api.html)
   - [Amazon S3](https://aws-data-wrangler.readthedocs.io/en/2.12.1/api.html#amazon-s3)
   - [AWS Glue Catalog](https://aws-data-wrangler.readthedocs.io/en/2.12.1/api.html#aws-glue-catalog)

awswrangler/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,7 @@
     dynamodb,
     emr,
     exceptions,
+    lakeformation,
     mysql,
     opensearch,
     postgresql,
@@ -44,6 +45,7 @@
     "s3",
     "sts",
     "redshift",
+    "lakeformation",
     "mysql",
     "postgresql",
     "secretsmanager",

awswrangler/_config.py

Lines changed: 12 additions & 1 deletion
@@ -43,14 +43,15 @@ class _ConfigArg(NamedTuple):
     "redshift_endpoint_url": _ConfigArg(dtype=str, nullable=True, enforced=True),
     "kms_endpoint_url": _ConfigArg(dtype=str, nullable=True, enforced=True),
     "emr_endpoint_url": _ConfigArg(dtype=str, nullable=True, enforced=True),
+    "lakeformation_endpoint_url": _ConfigArg(dtype=str, nullable=True, enforced=True),
     "dynamodb_endpoint_url": _ConfigArg(dtype=str, nullable=True, enforced=True),
     "secretsmanager_endpoint_url": _ConfigArg(dtype=str, nullable=True, enforced=True),
     # Botocore config
     "botocore_config": _ConfigArg(dtype=botocore.config.Config, nullable=True),
 }


-class _Config:  # pylint: disable=too-many-instance-attributes, too-many-public-methods
+class _Config:  # pylint: disable=too-many-instance-attributes,too-many-public-methods
     """Wrangler's Configuration class."""

     def __init__(self) -> None:
@@ -63,6 +64,7 @@ def __init__(self) -> None:
         self.redshift_endpoint_url = None
         self.kms_endpoint_url = None
         self.emr_endpoint_url = None
+        self.lakeformation_endpoint_url = None
         self.dynamodb_endpoint_url = None
         self.secretsmanager_endpoint_url = None
         self.botocore_config = None
@@ -356,6 +358,15 @@ def emr_endpoint_url(self) -> Optional[str]:
     def emr_endpoint_url(self, value: Optional[str]) -> None:
         self._set_config_value(key="emr_endpoint_url", value=value)

+    @property
+    def lakeformation_endpoint_url(self) -> Optional[str]:
+        """Property lakeformation_endpoint_url."""
+        return cast(Optional[str], self["lakeformation_endpoint_url"])
+
+    @lakeformation_endpoint_url.setter
+    def lakeformation_endpoint_url(self, value: Optional[str]) -> None:
+        self._set_config_value(key="lakeformation_endpoint_url", value=value)
+
     @property
     def dynamodb_endpoint_url(self) -> Optional[str]:
         """Property dynamodb_endpoint_url."""

awswrangler/_utils.py

Lines changed: 2 additions & 0 deletions
@@ -93,6 +93,8 @@ def _get_endpoint_url(service_name: str) -> Optional[str]:
         endpoint_url = _config.config.kms_endpoint_url
     elif service_name == "emr" and _config.config.emr_endpoint_url is not None:
         endpoint_url = _config.config.emr_endpoint_url
+    elif service_name == "lakeformation" and _config.config.lakeformation_endpoint_url is not None:
+        endpoint_url = _config.config.lakeformation_endpoint_url
     elif service_name == "dynamodb" and _config.config.dynamodb_endpoint_url is not None:
         endpoint_url = _config.config.dynamodb_endpoint_url
     elif service_name == "secretsmanager" and _config.config.secretsmanager_endpoint_url is not None:
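
Reviewer note: the `elif` chain in `_get_endpoint_url` maps a service name to an optional per-service endpoint override, and this commit adds `lakeformation` to it. The following is a minimal, self-contained sketch of that lookup, not the library's code. `EndpointConfig` and `get_endpoint_url` are hypothetical stand-ins for `_config.config` and `_get_endpoint_url`, and the `http://localhost:4566` URL is only a placeholder for a local test endpoint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointConfig:
    """Hypothetical stand-in for the library's config object."""

    emr_endpoint_url: Optional[str] = None
    lakeformation_endpoint_url: Optional[str] = None
    dynamodb_endpoint_url: Optional[str] = None


def get_endpoint_url(config: EndpointConfig, service_name: str) -> Optional[str]:
    """Return the override for `service_name`, or None to use the default endpoint."""
    # Relies on the same naming convention as the diff: "<service>_endpoint_url".
    return getattr(config, f"{service_name}_endpoint_url", None)


# Placeholder URL (an assumption), e.g. a locally emulated service.
config = EndpointConfig(lakeformation_endpoint_url="http://localhost:4566")

lakeformation_url = get_endpoint_url(config, "lakeformation")  # override is set
emr_url = get_endpoint_url(config, "emr")  # known service, no override
unknown_url = get_endpoint_url(config, "sqs")  # service with no config entry
```

With the property added in `_config.py`, the equivalent override in the library would presumably be set by assigning to `lakeformation_endpoint_url` on the shared config object.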

awswrangler/catalog/_add.py

Lines changed: 15 additions & 4 deletions
@@ -14,7 +14,7 @@
     _parquet_partition_definition,
     _update_table_definition,
 )
-from awswrangler.catalog._utils import _catalog_id, sanitize_table_name
+from awswrangler.catalog._utils import _catalog_id, _transaction_id, sanitize_table_name

 _logger: logging.Logger = logging.getLogger(__name__)

@@ -300,7 +300,8 @@ def add_column(
     table: str,
     column_name: str,
     column_type: str = "string",
-    column_comment: Optional[str] = "",
+    column_comment: Optional[str] = None,
+    transaction_id: Optional[str] = None,
     boto3_session: Optional[boto3.Session] = None,
     catalog_id: Optional[str] = None,
 ) -> None:
@@ -318,6 +319,8 @@ def add_column(
         Column type.
     column_comment : str
         Column Comment
+    transaction_id: str, optional
+        The ID of the transaction (i.e. used with GOVERNED tables).
     boto3_session : boto3.Session(), optional
         Boto3 Session. The default boto3 session will be used if boto3_session receive None.
     catalog_id : str, optional
@@ -341,13 +344,21 @@ def add_column(
     """
     if _check_column_type(column_type):
         client_glue: boto3.client = _utils.client(service_name="glue", session=boto3_session)
-        table_res: Dict[str, Any] = client_glue.get_table(DatabaseName=database, Name=table)
+        table_res: Dict[str, Any] = client_glue.get_table(
+            **_catalog_id(
+                catalog_id=catalog_id,
+                **_transaction_id(transaction_id=transaction_id, DatabaseName=database, Name=table),
+            )
+        )
         table_input: Dict[str, Any] = _update_table_definition(table_res)
         table_input["StorageDescriptor"]["Columns"].append(
             {"Name": column_name, "Type": column_type, "Comment": column_comment}
         )
         res: Dict[str, Any] = client_glue.update_table(
-            **_catalog_id(catalog_id=catalog_id, DatabaseName=database, TableInput=table_input)
+            **_catalog_id(
+                catalog_id=catalog_id,
+                **_transaction_id(transaction_id=transaction_id, DatabaseName=database, TableInput=table_input),
+            )
         )
         if ("Errors" in res) and res["Errors"]:
             for error in res["Errors"]:
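
Reviewer note: the nested `_catalog_id(... **_transaction_id(...))` calls build the keyword arguments for the Glue client, adding the optional keys only when a value is supplied. The bodies of those helpers are not part of this diff, so the sketch below is an assumption about their behavior rather than the library's implementation. `catalog_id_kwargs`, `transaction_id_kwargs`, and `build_get_table_args` are hypothetical names, and `"tx-123"` is a made-up transaction ID.

```python
from typing import Any, Dict, Optional


def catalog_id_kwargs(catalog_id: Optional[str] = None, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for `_catalog_id`: add CatalogId only when provided."""
    if catalog_id is not None:
        kwargs["CatalogId"] = catalog_id
    return kwargs


def transaction_id_kwargs(transaction_id: Optional[str] = None, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for `_transaction_id`: add TransactionId only when provided."""
    if transaction_id is not None:
        kwargs["TransactionId"] = transaction_id
    return kwargs


def build_get_table_args(
    database: str,
    table: str,
    transaction_id: Optional[str] = None,
    catalog_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Compose the helpers the same way `add_column` does in the diff."""
    return catalog_id_kwargs(
        catalog_id=catalog_id,
        **transaction_id_kwargs(transaction_id=transaction_id, DatabaseName=database, Name=table),
    )


# Without a transaction, the request carries only the original arguments.
plain_args = build_get_table_args("db", "tbl")
# With a transaction (made-up ID), TransactionId is added to the request.
governed_args = build_get_table_args("db", "tbl", transaction_id="tx-123")
```

Composing the helpers this way keeps non-governed calls unchanged, since a `None` value never reaches the client as an explicit argument.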
