Commit d84839d

add orphan file deletion configuration to Glue Iceberg tables
1 parent 95dcd92 commit d84839d

6 files changed: +423 -7 lines changed

lib/shortcuts/api.md

Lines changed: 6 additions & 2 deletions
@@ -222,15 +222,19 @@ Create a Glue table backed by Apache Iceberg format on S3.
 | [options.TableType] | <code>String</code> | <code>&#x27;EXTERNAL_TABLE&#x27;</code> | Hard-wired by this shortcut. |
 | [options.IcebergVersion] | <code>String</code> | <code>&#x27;2&#x27;</code> | The table version for the Iceberg table. See [AWS documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-table-iceberginput.html). |
 | [options.EnableOptimizer] | <code>Boolean</code> | <code>false</code> | Whether to enable the snapshot retention optimizer for this Iceberg table. |
-| [options.OptimizerRoleArn] | <code>String</code> | | The ARN of the IAM role for the retention optimizer to use. Required if EnableOptimizer is true. Can be the same role as CompactionRoleArn if both optimizers are enabled. See [AWS documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-tableoptimizer-tableoptimizerconfiguration.html). |
+| [options.OptimizerRoleArn] | <code>String</code> | | The ARN of the IAM role for the retention optimizer to use. Required if EnableOptimizer is true. Can be the same role as CompactionRoleArn or OrphanFileDeletionRoleArn if multiple optimizers are enabled. See [AWS documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-tableoptimizer-tableoptimizerconfiguration.html). |
 | [options.SnapshotRetentionPeriodInDays] | <code>Number</code> | <code>5</code> | The number of days to retain snapshots. See [AWS documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-properties-glue-tableoptimizer-icebergretentionconfiguration.html). |
 | [options.NumberOfSnapshotsToRetain] | <code>Number</code> | <code>1</code> | The minimum number of snapshots to retain. See [AWS documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-properties-glue-tableoptimizer-icebergretentionconfiguration.html). |
 | [options.CleanExpiredFiles] | <code>Boolean</code> | <code>true</code> | Whether to delete expired data files after expiring snapshots. See [AWS documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-properties-glue-tableoptimizer-icebergretentionconfiguration.html). |
 | [options.EnableCompaction] | <code>Boolean</code> | <code>false</code> | Whether to enable the compaction optimizer for this Iceberg table. |
-| [options.CompactionRoleArn] | <code>String</code> | | The ARN of the IAM role for the compaction optimizer to use. Required if EnableCompaction is true. Can be the same role as OptimizerRoleArn if both optimizers are enabled. See [AWS documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-tableoptimizer-tableoptimizerconfiguration.html). |
+| [options.CompactionRoleArn] | <code>String</code> | | The ARN of the IAM role for the compaction optimizer to use. Required if EnableCompaction is true. Can be the same role as OptimizerRoleArn or OrphanFileDeletionRoleArn if multiple optimizers are enabled. See [AWS documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-tableoptimizer-tableoptimizerconfiguration.html). |
 | [options.CompactionStrategy] | <code>String</code> | <code>&#x27;binpack&#x27;</code> | The compaction strategy: binpack, sort, or z-order. See [AWS documentation](https://docs.aws.amazon.com/glue/latest/dg/enable-compaction.html). |
 | [options.MinInputFiles] | <code>Number</code> | <code>100</code> | Minimum number of data files before compaction triggers. |
 | [options.DeleteFileThreshold] | <code>Number</code> | <code>1</code> | Minimum deletes in a file to make it eligible for compaction. |
+| [options.EnableOrphanFileDeletion] | <code>Boolean</code> | <code>false</code> | Whether to enable the orphan file deletion optimizer for this Iceberg table. |
+| [options.OrphanFileDeletionRoleArn] | <code>String</code> | | The ARN of the IAM role for the orphan file deletion optimizer to use. Required if EnableOrphanFileDeletion is true. Can be the same role as OptimizerRoleArn or CompactionRoleArn if multiple optimizers are enabled. See [AWS documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-tableoptimizer-tableoptimizerconfiguration.html). |
+| [options.OrphanFileRetentionPeriodInDays] | <code>Number</code> | <code>3</code> | The number of days to retain orphan files before deleting them. See [AWS documentation](https://docs.aws.amazon.com/glue/latest/dg/enable-orphan-file-deletion.html). |
+| [options.OrphanFileDeletionLocation] | <code>String</code> | | The S3 location to scan for orphan files. Defaults to the table location if not specified. See [AWS documentation](https://docs.aws.amazon.com/glue/latest/dg/enable-orphan-file-deletion.html). |

 <a name="GlueJsonTable"></a>
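The defaults documented for the four new options can be illustrated with a small standalone sketch. The helper name `withOrphanFileDefaults` is hypothetical and is not part of this commit; it only mirrors the default values listed in the table (`false`, no role, `3` days, no explicit location), and the role ARN is a placeholder.

```javascript
// Hypothetical helper (not from the commit): resolves the four orphan file
// deletion options using the defaults documented in api.md.
function withOrphanFileDefaults(options = {}) {
  const {
    EnableOrphanFileDeletion = false,
    OrphanFileDeletionRoleArn,            // no default; required when enabled
    OrphanFileRetentionPeriodInDays = 3,
    OrphanFileDeletionLocation            // no default; table location is used
  } = options;

  return {
    EnableOrphanFileDeletion,
    OrphanFileDeletionRoleArn,
    OrphanFileRetentionPeriodInDays,
    OrphanFileDeletionLocation
  };
}

// Placeholder ARN; per the docs, the same role may also be passed as
// OptimizerRoleArn and CompactionRoleArn when several optimizers are enabled.
const sharedRole = 'arn:aws:iam::123456789012:role/SharedRole';

const resolved = withOrphanFileDefaults({
  EnableOrphanFileDeletion: true,
  OrphanFileDeletionRoleArn: sharedRole
});

console.log(resolved);
```

Only the role has to be supplied when the optimizer is enabled; the retention period and scan location fall back to their defaults.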

lib/shortcuts/glue-iceberg-table.js

Lines changed: 62 additions & 5 deletions
@@ -20,8 +20,8 @@ const GlueTable = require('./glue-table');
  * snapshot retention optimizer for this Iceberg table.
  * @param {String} [options.OptimizerRoleArn=undefined] - The ARN of the IAM
  * role for the retention optimizer to use. Required if EnableOptimizer is
- * true. Can be the same role as CompactionRoleArn if both optimizers are
- * enabled. See [AWS
+ * true. Can be the same role as CompactionRoleArn or OrphanFileDeletionRoleArn
+ * if multiple optimizers are enabled. See [AWS
  * documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-tableoptimizer-tableoptimizerconfiguration.html).
  * @param {Number} [options.SnapshotRetentionPeriodInDays=5] - The number of
  * days to retain snapshots. See [AWS
@@ -36,8 +36,8 @@ const GlueTable = require('./glue-table');
  * compaction optimizer for this Iceberg table.
  * @param {String} [options.CompactionRoleArn=undefined] - The ARN of the IAM
  * role for the compaction optimizer to use. Required if EnableCompaction is
- * true. Can be the same role as OptimizerRoleArn if both optimizers are
- * enabled. See [AWS
+ * true. Can be the same role as OptimizerRoleArn or OrphanFileDeletionRoleArn
+ * if multiple optimizers are enabled. See [AWS
  * documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-tableoptimizer-tableoptimizerconfiguration.html).
  * @param {String} [options.CompactionStrategy='binpack'] - The compaction
  * strategy: binpack, sort, or z-order. See [AWS
@@ -46,6 +46,20 @@ const GlueTable = require('./glue-table');
  * before compaction triggers.
  * @param {Number} [options.DeleteFileThreshold=1] - Minimum deletes in a file
  * to make it eligible for compaction.
+ * @param {Boolean} [options.EnableOrphanFileDeletion=false] - Whether to
+ * enable the orphan file deletion optimizer for this Iceberg table.
+ * @param {String} [options.OrphanFileDeletionRoleArn=undefined] - The ARN of
+ * the IAM role for the orphan file deletion optimizer to use. Required if
+ * EnableOrphanFileDeletion is true. Can be the same role as OptimizerRoleArn
+ * or CompactionRoleArn if multiple optimizers are enabled. See [AWS
+ * documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-tableoptimizer-tableoptimizerconfiguration.html).
+ * @param {Number} [options.OrphanFileRetentionPeriodInDays=3] - The number of
+ * days to retain orphan files before deleting them. See [AWS
+ * documentation](https://docs.aws.amazon.com/glue/latest/dg/enable-orphan-file-deletion.html).
+ * @param {String} [options.OrphanFileDeletionLocation=undefined] - The S3
+ * location to scan for orphan files. Defaults to the table location if not
+ * specified. See [AWS
+ * documentation](https://docs.aws.amazon.com/glue/latest/dg/enable-orphan-file-deletion.html).
  */
 class GlueIcebergTable extends GlueTable {
   constructor(options) {
@@ -62,7 +76,11 @@ class GlueIcebergTable extends GlueTable {
       CompactionRoleArn,
       CompactionStrategy = 'binpack',
       MinInputFiles = 100,
-      DeleteFileThreshold = 1
+      DeleteFileThreshold = 1,
+      EnableOrphanFileDeletion = false,
+      OrphanFileDeletionRoleArn,
+      OrphanFileRetentionPeriodInDays = 3,
+      OrphanFileDeletionLocation
     } = options;

     const required = [Location];
@@ -75,6 +93,9 @@ class GlueIcebergTable extends GlueTable {
     if (EnableCompaction && !CompactionRoleArn)
       throw new Error('You must provide a CompactionRoleArn when EnableCompaction is true');

+    if (EnableOrphanFileDeletion && !OrphanFileDeletionRoleArn)
+      throw new Error('You must provide an OrphanFileDeletionRoleArn when EnableOrphanFileDeletion is true');
+
     const validStrategies = ['binpack', 'sort', 'z-order'];
     if (!validStrategies.includes(CompactionStrategy))
       throw new Error('CompactionStrategy must be one of: binpack, sort, z-order');
@@ -158,6 +179,42 @@ class GlueIcebergTable extends GlueTable {
         this.Resources[compactionLogicalName].Condition = options.Condition;
       }
     }
+
+    // Optionally add TableOptimizer for orphan file deletion
+    if (EnableOrphanFileDeletion) {
+      const orphanLogicalName = `${logicalName}OrphanFileDeletionOptimizer`;
+      const icebergConfiguration = {
+        OrphanFileRetentionPeriodInDays
+      };
+
+      // Only add Location if specified, otherwise it defaults to table location
+      if (OrphanFileDeletionLocation) {
+        icebergConfiguration.Location = OrphanFileDeletionLocation;
+      }
+
+      this.Resources[orphanLogicalName] = {
+        Type: 'AWS::Glue::TableOptimizer',
+        DependsOn: logicalName,
+        Properties: {
+          CatalogId: options.CatalogId || { Ref: 'AWS::AccountId' },
+          DatabaseName: options.DatabaseName,
+          TableName: options.Name,
+          Type: 'orphan_file_deletion',
+          TableOptimizerConfiguration: {
+            RoleArn: OrphanFileDeletionRoleArn,
+            Enabled: true,
+            OrphanFileDeletionConfiguration: {
+              IcebergConfiguration: icebergConfiguration
+            }
+          }
+        }
+      };
+
+      // Apply Condition to orphan file deletion optimizer if specified on the table
+      if (options.Condition) {
+        this.Resources[orphanLogicalName].Condition = options.Condition;
+      }
+    }
   }
 }
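The resource-building logic added to the constructor can be read in isolation as a plain function. The sketch below is an illustration under assumptions, not the commit's code: `buildOrphanFileDeletionOptimizer` is a hypothetical name, it assumes the caller has already decided that orphan file deletion is enabled, and it returns the resource map instead of writing to `this.Resources`.

```javascript
// Hypothetical standalone version of the constructor logic in the diff.
// Assumes orphan file deletion is enabled; returns { [logicalName]: resource }.
function buildOrphanFileDeletionOptimizer(logicalName, options) {
  const {
    OrphanFileDeletionRoleArn,
    OrphanFileRetentionPeriodInDays = 3,
    OrphanFileDeletionLocation
  } = options;

  // Same guard as the commit: a role is required once the optimizer is enabled.
  if (!OrphanFileDeletionRoleArn)
    throw new Error('You must provide an OrphanFileDeletionRoleArn when EnableOrphanFileDeletion is true');

  const icebergConfiguration = { OrphanFileRetentionPeriodInDays };

  // Location is emitted only when given, so the service default
  // (the table location) applies otherwise.
  if (OrphanFileDeletionLocation)
    icebergConfiguration.Location = OrphanFileDeletionLocation;

  const resource = {
    Type: 'AWS::Glue::TableOptimizer',
    DependsOn: logicalName,
    Properties: {
      CatalogId: options.CatalogId || { Ref: 'AWS::AccountId' },
      DatabaseName: options.DatabaseName,
      TableName: options.Name,
      Type: 'orphan_file_deletion',
      TableOptimizerConfiguration: {
        RoleArn: OrphanFileDeletionRoleArn,
        Enabled: true,
        OrphanFileDeletionConfiguration: {
          IcebergConfiguration: icebergConfiguration
        }
      }
    }
  };

  // Carry the table's Condition over to the optimizer, as the commit does.
  if (options.Condition) resource.Condition = options.Condition;

  return { [`${logicalName}OrphanFileDeletionOptimizer`]: resource };
}

// Example call with placeholder values.
const resources = buildOrphanFileDeletionOptimizer('MyTable', {
  Name: 'my_table',
  DatabaseName: 'my_database',
  OrphanFileDeletionRoleArn: 'arn:aws:iam::123456789012:role/SharedRole'
});
console.log(JSON.stringify(resources, null, 2));
```

Leaving `Location` out of the configuration, rather than copying the table location into it, keeps the generated template minimal when no override is given.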

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+{
+  "AWSTemplateFormatVersion": "2010-09-09",
+  "Metadata": {},
+  "Parameters": {},
+  "Rules": {},
+  "Mappings": {},
+  "Conditions": {},
+  "Resources": {
+    "MyTable": {
+      "Type": "AWS::Glue::Table",
+      "Properties": {
+        "CatalogId": {
+          "Ref": "AWS::AccountId"
+        },
+        "DatabaseName": "my_database",
+        "TableInput": {
+          "Description": {
+            "Fn::Sub": "Created by the ${AWS::StackName} CloudFormation stack"
+          },
+          "Name": "my_table",
+          "Parameters": {
+            "EXTERNAL": "TRUE"
+          },
+          "PartitionKeys": [],
+          "TableType": "EXTERNAL_TABLE",
+          "StorageDescriptor": {
+            "Columns": [
+              {
+                "Name": "column",
+                "Type": "string"
+              }
+            ],
+            "Compressed": false,
+            "Location": "s3://fake/location",
+            "NumberOfBuckets": 0,
+            "SerdeInfo": {},
+            "StoredAsSubDirectories": true
+          }
+        },
+        "OpenTableFormatInput": {
+          "IcebergInput": {
+            "MetadataOperation": "CREATE",
+            "Version": "2"
+          }
+        }
+      }
+    },
+    "MyTableRetentionOptimizer": {
+      "Type": "AWS::Glue::TableOptimizer",
+      "DependsOn": "MyTable",
+      "Properties": {
+        "CatalogId": {
+          "Ref": "AWS::AccountId"
+        },
+        "DatabaseName": "my_database",
+        "TableName": "my_table",
+        "Type": "retention",
+        "TableOptimizerConfiguration": {
+          "RoleArn": "arn:aws:iam::123456789012:role/SharedRole",
+          "Enabled": true,
+          "RetentionConfiguration": {
+            "IcebergConfiguration": {
+              "SnapshotRetentionPeriodInDays": 5,
+              "NumberOfSnapshotsToRetain": 1,
+              "CleanExpiredFiles": true
+            }
+          }
+        }
+      }
+    },
+    "MyTableCompactionOptimizer": {
+      "Type": "AWS::Glue::TableOptimizer",
+      "DependsOn": "MyTable",
+      "Properties": {
+        "CatalogId": {
+          "Ref": "AWS::AccountId"
+        },
+        "DatabaseName": "my_database",
+        "TableName": "my_table",
+        "Type": "compaction",
+        "TableOptimizerConfiguration": {
+          "RoleArn": "arn:aws:iam::123456789012:role/SharedRole",
+          "Enabled": true,
+          "CompactionConfiguration": {
+            "IcebergConfiguration": {
+              "Strategy": "binpack",
+              "MinInputFiles": 100,
+              "DeleteFileThreshold": 1
+            }
+          }
+        }
+      }
+    },
+    "MyTableOrphanFileDeletionOptimizer": {
+      "Type": "AWS::Glue::TableOptimizer",
+      "DependsOn": "MyTable",
+      "Properties": {
+        "CatalogId": {
+          "Ref": "AWS::AccountId"
+        },
+        "DatabaseName": "my_database",
+        "TableName": "my_table",
+        "Type": "orphan_file_deletion",
+        "TableOptimizerConfiguration": {
+          "RoleArn": "arn:aws:iam::123456789012:role/SharedRole",
+          "Enabled": true,
+          "OrphanFileDeletionConfiguration": {
+            "IcebergConfiguration": {
+              "OrphanFileRetentionPeriodInDays": 3
+            }
+          }
+        }
+      }
+    }
+  },
+  "Outputs": {}
+}
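This fixture has three optimizers that all use the same `SharedRole` ARN. A quick way to check that property on a generated template is to collect each optimizer's type and role. The helper `optimizerRoles` below is hypothetical, and the inline template is a trimmed stand-in for the fixture rather than the full file.

```javascript
// Hypothetical helper: maps each AWS::Glue::TableOptimizer type in a
// CloudFormation template to the RoleArn it is configured with.
function optimizerRoles(template) {
  const roles = {};
  for (const resource of Object.values(template.Resources || {})) {
    if (resource.Type !== 'AWS::Glue::TableOptimizer') continue;
    const { Type, TableOptimizerConfiguration } = resource.Properties;
    roles[Type] = TableOptimizerConfiguration.RoleArn;
  }
  return roles;
}

// Trimmed stand-in for the fixture: only the fields the helper reads.
const role = 'arn:aws:iam::123456789012:role/SharedRole';
const optimizer = (type) => ({
  Type: 'AWS::Glue::TableOptimizer',
  Properties: { Type: type, TableOptimizerConfiguration: { RoleArn: role, Enabled: true } }
});
const template = {
  Resources: {
    MyTable: { Type: 'AWS::Glue::Table', Properties: {} },
    MyTableRetentionOptimizer: optimizer('retention'),
    MyTableCompactionOptimizer: optimizer('compaction'),
    MyTableOrphanFileDeletionOptimizer: optimizer('orphan_file_deletion')
  }
};

const roles = optimizerRoles(template);
const allShared = Object.values(roles).every((arn) => arn === role);
console.log(roles, allShared);
```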
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+{
+  "AWSTemplateFormatVersion": "2010-09-09",
+  "Metadata": {},
+  "Parameters": {},
+  "Rules": {},
+  "Mappings": {},
+  "Conditions": {},
+  "Resources": {
+    "OrphanFileDeletionRole": {
+      "Type": "AWS::IAM::Role",
+      "Properties": {
+        "AssumeRolePolicyDocument": {}
+      }
+    },
+    "MyTable": {
+      "Type": "AWS::Glue::Table",
+      "Properties": {
+        "CatalogId": {
+          "Ref": "AWS::AccountId"
+        },
+        "DatabaseName": "my_database",
+        "TableInput": {
+          "Description": {
+            "Fn::Sub": "Created by the ${AWS::StackName} CloudFormation stack"
+          },
+          "Name": "my_table",
+          "Parameters": {
+            "EXTERNAL": "TRUE"
+          },
+          "PartitionKeys": [],
+          "TableType": "EXTERNAL_TABLE",
+          "StorageDescriptor": {
+            "Columns": [
+              {
+                "Name": "column",
+                "Type": "string"
+              }
+            ],
+            "Compressed": false,
+            "Location": "s3://fake/location",
+            "NumberOfBuckets": 0,
+            "SerdeInfo": {},
+            "StoredAsSubDirectories": true
+          }
+        },
+        "OpenTableFormatInput": {
+          "IcebergInput": {
+            "MetadataOperation": "CREATE",
+            "Version": "2"
+          }
+        }
+      }
+    },
+    "MyTableOrphanFileDeletionOptimizer": {
+      "Type": "AWS::Glue::TableOptimizer",
+      "DependsOn": "MyTable",
+      "Properties": {
+        "CatalogId": {
+          "Ref": "AWS::AccountId"
+        },
+        "DatabaseName": "my_database",
+        "TableName": "my_table",
+        "Type": "orphan_file_deletion",
+        "TableOptimizerConfiguration": {
+          "RoleArn": {
+            "Fn::GetAtt": [
+              "OrphanFileDeletionRole",
+              "Arn"
+            ]
+          },
+          "Enabled": true,
+          "OrphanFileDeletionConfiguration": {
+            "IcebergConfiguration": {
+              "OrphanFileRetentionPeriodInDays": 7,
+              "Location": "s3://fake/location/subdir"
+            }
+          }
+        }
+      }
+    }
+  },
+  "Outputs": {}
+}
