Commit faa1ce6

v0.0.1-preview
Initial Commit
- template.yaml created
- lambda/app.py created
- lambda/requirements.txt created
- README.md created
1 parent a6732ae commit faa1ce6

10 files changed

+1126
-5
lines changed


.gitignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
1+
.DS_Store
2+
temp
3+
assets/cloudwatch-dashboard.rendered.json
4+
samconfig.toml
5+
.aws-sam
6+
.env.local.json

README.md

Lines changed: 255 additions & 5 deletions
@@ -1,11 +1,261 @@
1-
## My Project
1+
## Monitoring Apache Iceberg Table metadata layer using AWS Lambda, AWS Glue and AWS CloudWatch
22

3-
TODO: Fill this README out!
3+
This repository provides sample code for collecting metrics from an existing Apache Iceberg table stored in Amazon S3. The code consists of an AWS Lambda deployment package that collects metrics and submits them to AWS CloudWatch. The repository also includes helper scripts for deploying a CloudWatch monitoring dashboard to visualize the collected metrics.
44

5-
Be sure to:
5+
### Table of Contents
6+
- [Technical implementation](#technical-implementation)
7+
- [Metrics collected](#metrics-collected)
8+
- [Setup](#setup)
9+
- [Prerequisites](#prerequisites)
10+
- [Build and Deploy](#build-and-deploy)
11+
- [Test Locally](#test-locally)
12+
- [Dependencies](#dependencies)
13+
- [Clean Up](#clean-up)
14+
- [Security](#security)
15+
- [License](#license)
616

7-
* Change the title in this README
8-
* Edit your repository description on GitHub
17+
18+
19+
### Technical implementation
20+
21+
![Architectural diagram of the solution](assets/arch.png)
22+
23+
* AWS Lambda is triggered on every Iceberg snapshot creation to collect metrics and send them to CloudWatch. This is achieved with [S3 event notification](https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html). See the [Setting up S3 event notification](#3-setting-up-s3-event-notification) section.
24+
* The AWS Lambda code uses the `pyiceberg` library and [AWS Glue interactive sessions](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-overview.html) with minimal compute to read the `snapshots`, `partitions` and `files` Apache Iceberg metadata tables with Apache Spark.
25+
* The AWS Lambda code aggregates the information retrieved from the metadata tables into metrics and submits them to AWS CloudWatch.
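
The last step, submitting metrics to CloudWatch, can be sketched roughly as follows. This is a minimal illustration and not the code in `lambda/app.py`: the function names, the `table_name` dimension and the batch size are assumptions made for the example. Only `put_metric_data` is the actual Boto3 CloudWatch call.

```python
from datetime import datetime, timezone


def build_metric_data(metrics, table_name, timestamp=None):
    """Convert a {metric_name: value} dict into CloudWatch MetricData entries."""
    timestamp = timestamp or datetime.now(timezone.utc)
    return [
        {
            "MetricName": name,
            # Hypothetical dimension; use whatever dimensions fit your dashboards.
            "Dimensions": [{"Name": "table_name", "Value": table_name}],
            "Timestamp": timestamp,
            "Value": float(value),
        }
        for name, value in metrics.items()
    ]


def publish_metrics(namespace, metric_data, batch_size=20):
    """Send the entries to CloudWatch in small batches (needs AWS credentials)."""
    import boto3  # imported lazily so build_metric_data works without boto3

    cloudwatch = boto3.client("cloudwatch")
    for start in range(0, len(metric_data), batch_size):
        cloudwatch.put_metric_data(
            Namespace=namespace,
            MetricData=metric_data[start:start + batch_size],
        )
```

For example, `publish_metrics("<your namespace>", build_metric_data({"snapshot.added_records": 10}, "<your table>"))` would submit a single data point.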
26+
27+
28+
### Metrics collected
29+
*Snapshot metrics*
30+
* snapshot.total_data_files
31+
* snapshot.added_data_files
32+
* snapshot.deleted_data_files
33+
* snapshot.total_delete_files
34+
* snapshot.added_records
35+
* snapshot.deleted_records
36+
* snapshot.added_files_size
37+
* snapshot.removed_files_size
38+
* snapshot.added_position_deletes
39+
40+
*Partitions aggregated metrics*
41+
* partitions.avg_record_count
42+
* partitions.max_record_count
43+
* partitions.min_record_count
44+
* partitions.deviation_record_count
45+
* partitions.skew_record_count
46+
* partitions.avg_file_count
47+
* partitions.max_file_count
48+
* partitions.min_file_count
49+
* partitions.deviation_file_count
50+
* partitions.skew_file_count
51+
52+
*Per-partition metrics*
53+
* partitions.file_count
54+
* partitions.record_count
55+
56+
*Files aggregated metrics*
57+
* files.avg_record_count
58+
* files.max_record_count
59+
* files.min_record_count
60+
* files.deviation_record_count
61+
* files.skew_record_count
62+
* files.avg_file_size
63+
* files.max_file_size
64+
* files.min_file_size
65+
66+
## Setup
67+
68+
### Prerequisites
69+
70+
#### Install Docker
71+
72+
This solution uses Docker as a dependency for AWS SAM CLI.
73+
To install Docker, follow the official Docker documentation.
74+
https://docs.docker.com/get-docker/
75+
76+
#### Install SAM CLI
77+
78+
This solution uses the AWS SAM CLI to build, test and deploy the AWS Lambda code that collects the Iceberg table metrics and submits them to AWS CloudWatch.
79+
80+
To install the AWS SAM CLI, follow the AWS documentation. \
81+
https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html
82+
83+
84+
#### Configuring IAM permissions for AWS Glue
85+
86+
- [Step 1: Create an IAM policy for the AWS Glue service](https://docs.aws.amazon.com/glue/latest/dg/create-service-policy.html)
87+
- [Step 2: Create an IAM role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html)
88+
89+
### Build and Deploy
90+
91+
> ! Important - The guidance below uses the AWS Serverless Application Model (SAM) to simplify packaging and deploying the AWS Lambda. If you use your own packaging tool or want to deploy the AWS Lambda manually, explore the following files:
92+
> - template.yaml
93+
> - lambda/requirements.txt
94+
> - lambda/app.py
95+
96+
#### 1. Build AWS Lambda using AWS SAM CLI
97+
98+
Once you've installed [Docker](#install-docker) and [SAM CLI](#install-sam-cli), you are ready to build the AWS Lambda. Open your terminal and run the command below.
99+
100+
```bash
101+
sam build --use-container
102+
```
103+
104+
#### 2. Deploy AWS Lambda using AWS SAM CLI
105+
106+
Once the build is finished, you can deploy your AWS Lambda. SAM will upload the packaged code and deploy the AWS Lambda resource using AWS CloudFormation. Run the command below in your terminal.
107+
108+
```bash
109+
sam deploy --guided
110+
```
111+
112+
##### Parameters
113+
114+
- `CWNamespace` - The CloudWatch namespace to use for the metrics. A namespace is a container for CloudWatch metrics.
115+
- `DBName` - Glue Data Catalog Database Name.
116+
- `TableName` - Apache Iceberg Table name as it appears in the Glue Data Catalog.
117+
- `GlueServiceRole` - ARN of the AWS Glue role you created [earlier](#configuring-iam-permissions-for-aws-glue).
118+
- `Warehouse` - Required catalog property that determines the root path of the data warehouse on S3. This can be any path in your S3 bucket; the value is not critical for this solution.
119+
- `IcebergTableS3BucketName` - Name of the S3 bucket that stores the Iceberg table, required to allow S3 bucket event notification. SAM will add a resource-based permission that allows the S3 bucket to invoke the AWS Lambda.
120+
121+
122+
#### 3. Setting up S3 event notification
123+
124+
You need to set up an automatic trigger that runs the AWS Lambda metrics collection on every Apache Iceberg commit. This solution relies on the S3 event notification feature to trigger the AWS Lambda every time a new `metadata.json` file is written to the S3 `metadata` folder of the table.
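
On the receiving side, the Lambda is invoked with an S3 event that lists the objects written. The sketch below shows one way a handler could pick out new metadata files from that event. It is an illustration only, not the logic in `lambda/app.py`, and the function name `extract_metadata_objects` is hypothetical. Object keys in S3 events are URL-encoded, hence the `unquote_plus` call.

```python
from urllib.parse import unquote_plus


def extract_metadata_objects(event):
    """Return (bucket, key) pairs for newly written metadata.json objects."""
    results = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        # Keys arrive URL-encoded (for example, spaces become '+').
        key = unquote_plus(s3_info.get("object", {}).get("key", ""))
        if bucket and key.endswith("metadata.json"):
            results.append((bucket, key))
    return results
```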
125+
126+
You can follow the AWS documentation on [enabling and configuring event notifications using the Amazon S3 console](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html).
127+
128+
Alternatively, use the Python Boto3 sample code below. Replace the placeholders with your Lambda ARN, bucket name and path to the metadata folder.
129+
130+
```python
import boto3

s3_client = boto3.client('s3')
lambda_arn = "<REPLACE WITH YOUR ARN>"
bucket_name = "<REPLACE WITH YOUR S3 BUCKET NAME>"
path_to_metadata_folder = "<REPLACE WITH YOUR S3 PATH>"

# Invoke the Lambda for every .json object written under the metadata folder.
notification_configuration = {
    'LambdaFunctionConfigurations': [
        {
            'LambdaFunctionArn': lambda_arn,
            'Events': [
                's3:ObjectCreated:Put'
            ],
            'Filter': {
                'Key': {
                    'FilterRules': [
                        {
                            'Name': 'Prefix',
                            'Value': path_to_metadata_folder
                        },
                        {
                            'Name': 'Suffix',
                            'Value': '.json'
                        }
                    ]
                }
            }
        }
    ]
}

# Note: this call replaces the bucket's existing notification configuration.
response = s3_client.put_bucket_notification_configuration(
    Bucket=bucket_name,
    NotificationConfiguration=notification_configuration
)
if response['ResponseMetadata']['HTTPStatusCode'] == 200:
    print("Success")
else:
    print("Something went wrong")
```
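
Because `put_bucket_notification_configuration` overwrites the bucket's whole notification configuration, a bucket that already has other notifications needs a read-merge-write approach. The following is a minimal sketch under that assumption. The helper names are hypothetical, while `get_bucket_notification_configuration` and `put_bucket_notification_configuration` are the actual Boto3 S3 client calls.

```python
def merge_lambda_notification(existing, new_lambda_config):
    """Return a configuration that keeps existing entries and adds one more."""
    # The GET response also carries ResponseMetadata, which is not a valid
    # key in the configuration passed to the PUT call.
    merged = {k: v for k, v in existing.items() if k != "ResponseMetadata"}
    lambda_configs = list(merged.get("LambdaFunctionConfigurations", []))
    lambda_configs.append(new_lambda_config)
    merged["LambdaFunctionConfigurations"] = lambda_configs
    return merged


def add_lambda_notification(bucket_name, new_lambda_config):
    """Read the current configuration, merge, and write it back."""
    import boto3  # imported lazily so the merge helper works without boto3

    s3_client = boto3.client("s3")
    existing = s3_client.get_bucket_notification_configuration(Bucket=bucket_name)
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=merge_lambda_notification(existing, new_lambda_config),
    )
```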
171+
172+
The final result should look like this:
173+
174+
![S3 to AWS Lambda trigger example](assets/trigger.png)
175+
176+
#### 4. (Optional) Create CloudWatch Dashboard
177+
Once your Iceberg table metrics are submitted to CloudWatch, you can start using them for monitoring and for creating alarms. CloudWatch also lets you visualize metrics using CloudWatch Dashboards.
178+
179+
`assets/cloudwatch-dashboard.template.json` is a sample CloudWatch dashboard configuration that uses a subset of the submitted metrics and combines them with AWS Glue native metrics for Apache Iceberg.
180+
The template uses Jinja2 so that you can generate your own dashboard by providing your parameters.
181+
182+
183+
![CloudWatch Dashboard Screenshot](assets/cw-dashboard-screenshot.png)
184+
185+
Run the script below to generate your own CloudWatch dashboard configuration.
186+
Replace input values with the relevant [parameters](#parameters) from previous sections.
187+
188+
```python
import json
from jinja2 import Template

def render_json_template(template_path, data):
    with open(template_path, 'r') as file:
        template_text = file.read()

    template = Template(template_text)
    rendered_json = template.render(data)
    json_data = json.loads(rendered_json)
    return json_data

# Data to fill in the template
data = {
    "CW_NAMESPACE": "<<REPLACE>>",
    "REGION": "<<REPLACE>>",
    "DBNAME": "<<REPLACE>>",
    "TABLENAME": "<<REPLACE>>"
}

# Path to cloudwatch template file
template_path = 'assets/cloudwatch-dashboard.template.json'
rendered_data = render_json_template(template_path, data)
output_path = 'assets/cloudwatch-dashboard.rendered.json'

with open(output_path, 'w') as file:
    json.dump(rendered_data, file, indent=4)

print(f"Your dashboard configuration successfully generated at {output_path}")
```
219+
220+
Now follow these steps to create a CloudWatch dashboard from the rendered JSON.
221+
222+
1. Sign in to the AWS Management Console and navigate to the CloudWatch service.
223+
2. In the navigation pane on the left, click "Dashboards".
224+
3. Click on "Create Dashboard" and give it a name.
225+
4. If the widget configuration popup appears, click "Cancel".
226+
5. Click the "Actions" dropdown menu in the top right corner of the dashboard and select "View/edit source".
227+
   This will open a new tab with the source JSON for the dashboard. You can then paste the rendered JSON into the dashboard source to create your custom dashboard.
228+
6. Click "Update"
229+
7. The new dashboard is expected to be empty at first. Once your AWS Lambda starts generating metrics, they will appear here.
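
As an alternative to the console steps above, the rendered JSON could also be uploaded with Boto3. This is a sketch under assumptions: the helper names are hypothetical and the dashboard name is whatever you choose. `put_dashboard` is the actual CloudWatch client call and expects the dashboard body as a JSON string.

```python
import json


def build_put_dashboard_args(dashboard_name, dashboard):
    """Build the keyword arguments for CloudWatch put_dashboard."""
    return {
        "DashboardName": dashboard_name,
        "DashboardBody": json.dumps(dashboard),  # body must be a JSON string
    }


def create_dashboard(dashboard_name,
                     rendered_path="assets/cloudwatch-dashboard.rendered.json"):
    """Upload the rendered dashboard file (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    with open(rendered_path, "r") as file:
        dashboard = json.load(file)
    cloudwatch = boto3.client("cloudwatch")
    return cloudwatch.put_dashboard(
        **build_put_dashboard_args(dashboard_name, dashboard)
    )
```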
230+
231+
### Test Locally
232+
233+
You can test the code locally using the SAM CLI.
234+
Ensure you have configured the [right AWS permissions](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) to call CloudWatch and AWS Glue.
235+
236+
```bash
237+
sam local invoke IcebergMetricsLambda --env-vars .env.local.json
238+
```
239+
240+
`.env.local.json` - The JSON file that contains values for the Lambda function's environment variables. The Lambda code depends on the environment variables that you pass in the deploy section. You need to create this file and include the relevant [parameters](#parameters) before calling `sam local invoke`.
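
The file maps the function's logical ID to its environment variables, which is the structure `sam local invoke --env-vars` reads. The sketch below generates such a file. The variable names in it are placeholders and an assumption of this example; use the names that `template.yaml` defines for the function's environment.

```python
import json


def build_env_vars(function_logical_id, variables):
    """Wrap environment variables under the function's logical ID."""
    return {function_logical_id: dict(variables)}


def write_env_file(env_vars, path=".env.local.json"):
    """Write the environment variables file used by `sam local invoke`."""
    with open(path, "w") as file:
        json.dump(env_vars, file, indent=4)


# Hypothetical variable names: replace them with the ones from template.yaml.
example_env = build_env_vars(
    "IcebergMetricsLambda",
    {
        "CW_NAMESPACE": "<<REPLACE>>",
        "DBNAME": "<<REPLACE>>",
        "TABLENAME": "<<REPLACE>>",
    },
)
```

Calling `write_env_file(example_env)` would then create `.env.local.json` in the current directory.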
241+
242+
243+
## Dependencies
244+
245+
PyIceberg is a Python implementation for accessing Iceberg tables, without the need of a JVM. \
246+
https://py.iceberg.apache.org
247+
248+
AWS Serverless Application Model (AWS SAM) \
249+
https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html
250+
251+
Docker \
252+
https://docs.docker.com/get-docker/
253+
254+
## Clean Up
255+
256+
1. Delete the AWS Lambda using `sam delete`.
257+
2. Delete the CloudWatch dashboard.
258+
3. Remove the S3 event notification.
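
Steps 2 and 3 can also be scripted. The following is a rough sketch, assuming you know the dashboard name and the Lambda ARN; the helper names are hypothetical. It removes only the notification entry that targets the given Lambda and leaves any other bucket notifications in place.

```python
def without_lambda_notification(existing, lambda_arn):
    """Return the notification configuration without entries for lambda_arn."""
    # Drop ResponseMetadata, which the GET call returns but PUT does not accept.
    cleaned = {k: v for k, v in existing.items() if k != "ResponseMetadata"}
    remaining = [
        config
        for config in cleaned.get("LambdaFunctionConfigurations", [])
        if config.get("LambdaFunctionArn") != lambda_arn
    ]
    if remaining:
        cleaned["LambdaFunctionConfigurations"] = remaining
    else:
        cleaned.pop("LambdaFunctionConfigurations", None)
    return cleaned


def clean_up(bucket_name, lambda_arn, dashboard_name):
    """Delete the dashboard and the S3 trigger (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    boto3.client("cloudwatch").delete_dashboards(DashboardNames=[dashboard_name])
    s3_client = boto3.client("s3")
    existing = s3_client.get_bucket_notification_configuration(Bucket=bucket_name)
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=without_lambda_notification(existing, lambda_arn),
    )
```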
9259

10260
## Security
11261

assets/arch.png

95.8 KB
