Commit 912d749

committed
revamp glue
1 parent e191202 commit 912d749

File tree

  • src/content/docs/aws/services/glue.md

1 file changed: +134 -108 lines changed

src/content/docs/aws/services/glue.md

Lines changed: 134 additions & 108 deletions
@@ -1,6 +1,5 @@
 ---
 title: Glue
-linkTitle: Glue
 description: Get started with Glue on LocalStack
 tags: ["Ultimate"]
 ---
@@ -10,9 +9,9 @@ tags: ["Ultimate"]
 The Glue API in LocalStack Pro allows you to run ETL (Extract-Transform-Load) jobs locally, maintaining table metadata in the local Glue data catalog, and using the Spark ecosystem (PySpark/Scala) to run data processing workflows.

 LocalStack allows you to use the Glue APIs in your local environment.
-The supported APIs are available on our [API coverage page](/references/coverage/coverage_glue/), which provides information on the extent of Glue's integration with LocalStack.
+The supported APIs are available on our [API coverage page](), which provides information on the extent of Glue's integration with LocalStack.

-{{< callout >}}
+:::note
 LocalStack now includes a container-based Glue Job executor, enabling Glue jobs to run within a Docker environment.
 Previously, LocalStack relied on a pre-packaged binary that included Spark and other required components.
 The new executor leverages the `aws-glue-libs` Docker image and provides better production parity, faster startup times, and more reliable execution.
@@ -27,7 +26,7 @@ Key enhancements include:

 To use it, set `GLUE_JOB_EXECUTOR=docker` and `GLUE_JOB_EXECUTOR_PROVIDER=v2` in your LocalStack configuration.
 The new executor additionally deprecates older versions of Glue (`0.9`, `1.0`, `2.0`).
-{{< /callout >}}
+:::

 ## Getting started

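For context, the two settings mentioned in the hunk above are plain configuration variables, so enabling the new executor can be as simple as exporting them before starting LocalStack. A minimal sketch (the exact startup invocation is an assumption, not part of this commit):

```bash
# Sketch: start LocalStack with the container-based Glue job executor enabled.
# GLUE_JOB_EXECUTOR and GLUE_JOB_EXECUTOR_PROVIDER are documented above;
# the CLI invocation itself is an assumption.
export GLUE_JOB_EXECUTOR=docker
export GLUE_JOB_EXECUTOR_PROVIDER=v2
localstack start -d
```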
@@ -36,20 +35,20 @@ This guide is designed for users new to Glue and assumes basic knowledge of the
 Start your LocalStack container using your preferred method.
 We will demonstrate how to create databases and table metadata in Glue, run Glue ETL jobs, import databases from Athena, and run Glue Crawlers with the AWS CLI.

-{{< callout >}}
-In order to run Glue jobs, some additional dependencies have to be fetched from the network, including a Docker image of apprx.
-1.5GB which includes Spark, Presto, Hive and other tools.
+:::note
+In order to run Glue jobs, some additional dependencies have to be fetched from the network, including a Docker image of approximately 1.5GB which includes Spark, Presto, Hive and other tools.
 These dependencies are automatically fetched when you start up the service, so please make sure you're on a decent internet connection when pulling the dependencies for the first time.
-{{< /callout >}}
+:::

 ### Creating Databases and Table Metadata

 The commands below illustrate the creation of some very basic entries (databases, tables) in the Glue data catalog:
-{{< command >}}
-$ awslocal glue create-database --database-input '{"Name":"db1"}'
-$ awslocal glue create-table --database-name db1 --table-input '{"Name":"table1"}'
-$ awslocal glue get-tables --database-name db1
-{{< /command >}}
+
+```bash
+awslocal glue create-database --database-input '{"Name":"db1"}'
+awslocal glue create-table --database-name db1 --table-input '{"Name":"table1"}'
+awslocal glue get-tables --database-name db1
+```

 You should see the following output:

@@ -87,27 +86,32 @@ if __name__ == '__main__':
 ```

 You can now copy the script to an S3 bucket:
-{{< command >}}
-$ awslocal s3 mb s3://glue-test
-$ awslocal s3 cp job.py s3://glue-test/job.py
-{{< / command >}}
+
+```bash
+awslocal s3 mb s3://glue-test
+awslocal s3 cp job.py s3://glue-test/job.py
+```

 Next, you can create a job definition:

-{{< command >}}
-$ awslocal glue create-job --name job1 --role arn:aws:iam::000000000000:role/glue-role \
-    --command '{"Name": "pythonshell", "ScriptLocation": "s3://glue-test/job.py"}'
-{{< / command >}}
+```bash
+awslocal glue create-job \
+  --name job1 \
+  --role arn:aws:iam::000000000000:role/glue-role \
+  --command '{"Name": "pythonshell", "ScriptLocation": "s3://glue-test/job.py"}'
+```

 You can finally start the job execution:

-{{< command >}}
-$ awslocal glue start-job-run --job-name job1
-{{< / command >}}
+```bash
+awslocal glue start-job-run --job-name job1
+```
+
 The returned `JobRunId` can be used to query the status of the job execution, until it becomes `SUCCEEDED`:
-{{< command >}}
-$ awslocal glue get-job-run --job-name job1 --run-id <JobRunId>
-{{< / command >}}
+
+```bash
+awslocal glue get-job-run --job-name job1 --run-id <JobRunId>
+```

 You should see the following output:

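The wait-for-`SUCCEEDED` step in the hunk above lends itself to a small polling loop. A sketch (hypothetical convenience, not part of the commit; it relies on the standard `JobRun.JobRunState` field of the `get-job-run` response):

```bash
# Sketch: poll the job run until it reaches SUCCEEDED.
# Replace <JobRunId> with the ID returned by start-job-run.
RUN_ID=<JobRunId>
while true; do
  STATE=$(awslocal glue get-job-run --job-name job1 --run-id "$RUN_ID" \
    --query 'JobRun.JobRunState' --output text)
  echo "Job state: $STATE"
  [ "$STATE" = "SUCCEEDED" ] && break
  sleep 2
done
```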
@@ -136,16 +140,17 @@ CREATE EXTERNAL TABLE db2.table2 (a1 Date, a2 STRING, a3 INT) LOCATION 's3://tes
 ```

 Then this command will import these DB/table definitions into the Glue data catalog:
-{{< command >}}
-$ awslocal glue import-catalog-to-glue
-{{< /command >}}
+
+```bash
+awslocal glue import-catalog-to-glue
+```

 Afterwards, the databases and tables will be available in Glue.
 You can query the databases with the `get-databases` operation:

-{{< command >}}
-$ awslocal glue get-databases
-{{< /command >}}
+```bash
+awslocal glue get-databases
+```

 You should see the following output:

@@ -166,9 +171,11 @@ You should see the following output:
 ```

 And you can query the tables with the `get-tables` operation:
-{{< command >}}
-$ awslocal glue get-tables --database-name db2
-{{< / command >}}
+
+```bash
+awslocal glue get-tables --database-name db2
+```
+
 You should see the following output:

 ```json
@@ -203,28 +210,33 @@ The example below illustrates crawling tables and partition metadata from S3 buc

 You can first create an S3 bucket with a couple of items:

-{{< command >}}
-$ awslocal s3 mb s3://test
-$ printf "1, 2, 3, 4\n5, 6, 7, 8" > /tmp/file.csv
-$ awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Jan/day=1/file.csv
-$ awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Jan/day=2/file.csv
-$ awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Feb/day=1/file.csv
-$ awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Feb/day=2/file.csv
-{{< / command >}}
+```bash
+awslocal s3 mb s3://test
+printf "1, 2, 3, 4\n5, 6, 7, 8" > /tmp/file.csv
+awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Jan/day=1/file.csv
+awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Jan/day=2/file.csv
+awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Feb/day=1/file.csv
+awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Feb/day=2/file.csv
+```

 You can then create and trigger the crawler:

-{{< command >}}
-$ awslocal glue create-database --database-input '{"Name":"db1"}'
-$ awslocal glue create-crawler --name c1 --database-name db1 --role arn:aws:iam::000000000000:role/glue-role --targets '{"S3Targets": [{"Path": "s3://test/table1"}]}'
-$ awslocal glue start-crawler --name c1
-{{< / command >}}
+```bash
+awslocal glue create-database --database-input '{"Name":"db1"}'
+awslocal glue create-crawler \
+  --name c1 \
+  --database-name db1 \
+  --role arn:aws:iam::000000000000:role/glue-role \
+  --targets '{"S3Targets": [{"Path": "s3://test/table1"}]}'
+awslocal glue start-crawler --name c1
+```

 Finally, you can query the table metadata that has been created by the crawler:

-{{< command >}}
-$ awslocal glue get-tables --database-name db1
-{{< / command >}}
+```bash
+awslocal glue get-tables --database-name db1
+```
+
 You should see the following output:

 ```json
@@ -237,9 +249,11 @@ You should see the following output:
 ```

 You can also query the created table partitions:
-{{< command >}}
-$ awslocal glue get-partitions --database-name db1 --table-name table1
-{{< / command >}}
+
+```bash
+awslocal glue get-partitions --database-name db1 --table-name table1
+```
+
 You should see the following output:

 ```json
@@ -257,9 +271,16 @@ When using JDBC crawlers, you can point your crawler towards a Redshift database

 Below is a rough outline of the steps required to get the integration for the JDBC crawler working.
 You can first create the local Redshift cluster via:
-{{< command >}}
-$ awslocal redshift create-cluster --cluster-identifier c1 --node-type dc1.large --master-username test --master-user-password test --db-name db1
-{{< / command >}}
+
+```bash
+awslocal redshift create-cluster \
+  --cluster-identifier c1 \
+  --node-type dc1.large \
+  --master-username test \
+  --master-user-password test \
+  --db-name db1
+```
+
 The output of this command contains the endpoint address of the created Redshift database:

 ```json
@@ -275,18 +296,23 @@ Then you can use any JDBC or Postgres client to create a table `mytable1` in the

 Next, you're creating the Glue database, the JDBC connection, as well as the crawler:

-{{< command >}}
-$ awslocal glue create-database --database-input '{"Name":"gluedb1"}'
-$ awslocal glue create-connection --connection-input \
+```bash
+awslocal glue create-database --database-input '{"Name":"gluedb1"}'
+awslocal glue create-connection --connection-input \
 '{"Name":"conn1","ConnectionType":"JDBC","ConnectionProperties":{"USERNAME":"test","PASSWORD":"test","JDBC_CONNECTION_URL":"jdbc:redshift://localhost.localstack.cloud:4510/db1"}}'
-$ awslocal glue create-crawler --name c1 --database-name gluedb1 --role arn:aws:iam::000000000000:role/glue-role --targets '{"JdbcTargets":[{"ConnectionName":"conn1","Path":"db1/%/mytable1"}]}'
-$ awslocal glue start-crawler --name c1
-{{< / command >}}
+awslocal glue create-crawler \
+  --name c1 \
+  --database-name gluedb1 \
+  --role arn:aws:iam::000000000000:role/glue-role \
+  --targets '{"JdbcTargets":[{"ConnectionName":"conn1","Path":"db1/%/mytable1"}]}'
+awslocal glue start-crawler --name c1
+```

 Once the crawler has started, you have to wait until the `State` turns to `READY` when querying the current state:

-{{< command >}}
-$ awslocal glue get-crawler --name c1
-{{< /command >}}
+```bash
+awslocal glue get-crawler --name c1
+```

 Once the crawler has finished running and is back in `READY` state, the Glue table within the `gluedb1` DB should have been populated and can be queried via the API.

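Waiting for the crawler to return to `READY` can likewise be scripted. A sketch (hypothetical helper, not part of the commit; `Crawler.State` is the standard field of the `get-crawler` response):

```bash
# Sketch: block until crawler c1 is back in the READY state.
while [ "$(awslocal glue get-crawler --name c1 \
    --query 'Crawler.State' --output text)" != "READY" ]; do
  echo "Waiting for crawler to finish..."
  sleep 2
done
```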
@@ -296,21 +322,27 @@ The Glue Schema Registry allows you to centrally discover, control, and evolve d
 With the Schema Registry, you can manage and enforce schemas and schema compatibilities in your streaming applications.
 It integrates nicely with [Managed Streaming for Kafka (MSK)](../managed-streaming-for-kafka).

-{{< callout >}}
+:::note
 Currently, LocalStack supports the AVRO data format for the Glue Schema Registry.
 Support for other data formats will be added in the future.
-{{< /callout >}}
+:::

 You can create a schema registry with the following command:
-{{< command >}}
-$ awslocal glue create-registry --registry-name demo-registry
-{{< /command >}}
+
+```bash
+awslocal glue create-registry --registry-name demo-registry
+```

 You can create a schema in the newly created registry with the `create-schema` command:
-{{< command >}}
-$ awslocal glue create-schema --schema-name demo-schema --registry-id RegistryName=demo-registry --data-format AVRO --compatibility FORWARD \
-    --schema-definition '{"type":"record","namespace":"Demo","name":"Person","fields":[{"name":"Name","type":"string"}]}'
-{{< /command >}}
+
+```bash
+awslocal glue create-schema --schema-name demo-schema \
+  --registry-id RegistryName=demo-registry \
+  --data-format AVRO \
+  --compatibility FORWARD \
+  --schema-definition '{"type":"record","namespace":"Demo","name":"Person","fields":[{"name":"Name","type":"string"}]}'
+```

 You should see the following output:

 ```json
@@ -331,10 +363,12 @@ You should see the following output:
 ```

 Once the schema has been created, you can create a new version:
-{{< command >}}
-$ awslocal glue register-schema-version --schema-id SchemaName=demo-schema,RegistryName=demo-registry \
-    --schema-definition '{"type":"record","namespace":"Demo","name":"Person","fields":[{"name":"Name","type":"string"}, {"name":"Address","type":"string"}]}'
-{{< /command >}}
+
+```bash
+awslocal glue register-schema-version \
+  --schema-id SchemaName=demo-schema,RegistryName=demo-registry \
+  --schema-definition '{"type":"record","namespace":"Demo","name":"Person","fields":[{"name":"Name","type":"string"}, {"name":"Address","type":"string"}]}'
+```

 You should see the following output:

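Since the registry enforces `FORWARD` compatibility here, it can be useful to validate a new definition before registering it. A sketch using the Glue `check-schema-version-validity` API (LocalStack support for this call is an assumption):

```bash
# Sketch: syntax-check an AVRO schema definition before registering it.
awslocal glue check-schema-version-validity \
  --data-format AVRO \
  --schema-definition '{"type":"record","namespace":"Demo","name":"Person","fields":[{"name":"Name","type":"string"},{"name":"Address","type":"string"}]}'
```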
@@ -352,9 +386,9 @@ You can find a more advanced sample in our [localstack-pro-samples repository on

 LocalStack Glue supports [Delta Lake](https://delta.io), an open-source storage framework that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.

-{{< callout >}}
+:::note
 Please note that Delta Lake tables are only [supported for Glue versions `3.0` and `4.0`](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-delta-lake.html).
-{{< /callout >}}
+:::

 To illustrate this feature, we take a closer look at a Glue sample job that creates a Delta Lake table, puts some data into it, and then queries data from the table.

@@ -390,18 +424,16 @@ print("SQL result:", result.toJSON().collect())

 You can now run the following commands to create and start the Glue job:

-{{< command >}}
-$ awslocal s3 mb s3://test
-$ awslocal s3 cp job.py s3://test/job.py
-$ awslocal glue create-job --name job1 --role arn:aws:iam::000000000000:role/test \
-    --glue-version 4.0 --command '{"Name": "pythonshell", "ScriptLocation": "s3://test/job.py"}'
-$ awslocal glue start-job-run --job-name job1
-<disable-copy>
-{
-    "JobRunId": "c9471f40"
-}
-</disable-copy>
-{{< / command >}}
+```bash
+awslocal s3 mb s3://test
+awslocal s3 cp job.py s3://test/job.py
+awslocal glue create-job --name job1 --role arn:aws:iam::000000000000:role/test \
+  --glue-version 4.0 \
+  --command '{"Name": "pythonshell", "ScriptLocation": "s3://test/job.py"}'
+awslocal glue start-job-run --job-name job1
+```
+
+Retrieve the job run ID from the output of the `start-job-run` command.

 The execution of the Glue job can take a few moments - once the job has finished executing, you should see a log line with the query results in the LocalStack container logs, similar to the output below:

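To watch for that log line as the job runs, you can follow the container logs from the CLI. A sketch (the exact `localstack logs` flags are an assumption):

```bash
# Sketch: follow the LocalStack container logs while the Glue job executes
# (requires DEBUG=1 in the container environment, as noted below).
localstack logs --follow
```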
@@ -411,20 +443,20 @@ SQL result: ['{"name":"test1","key":123}', '{"name":"test2","key":456}']
 ```

 In order to see the logs above, make sure to enable `DEBUG=1` in the LocalStack container environment.
-Alternatively, you can also retrieve the job logs programmatically via the CloudWatch Logs API - for example, using the job run ID `c9471f40` from above:
-{{< command >}}
-$ awslocal logs get-log-events --log-group-name /aws-glue/jobs/logs-v2 --log-stream-name c9471f40
-<disable-copy>
-{ "events": [ ... ] }
-</disable-copy>
-{{< / command >}}
+Alternatively, you can also retrieve the job logs programmatically via the CloudWatch Logs API - for example, using the job run ID from the above command.
+
+```bash
+awslocal logs get-log-events \
+  --log-group-name /aws-glue/jobs/logs-v2 \
+  --log-stream-name <JobRunId>
+```

 ## Resource Browser

 The LocalStack Web Application provides a Resource Browser for Glue.
 You can access the Resource Browser by opening the LocalStack Web Application in your browser, navigating to the **Resources** section, and then clicking on **Glue** under the **Analytics** section.

-<img src="glue-resource-browser.png" alt="Glue Resource Browser" title="Glue Resource Browser" width="900" />
+![Glue Resource Browser](/images/aws/glue-resource-browser.png)

 The Resource Browser allows you to perform the following actions:

@@ -438,12 +470,6 @@ The Resource Browser allows you to perform the following actions:

 ## Examples

-The following Developer Hub applications are using Glue:
-{{< applications service_filter="glu">}}
-
-The following tutorials are using Glue:
-{{< tutorials "/tutorials/schema-evolution-glue-msk">}}
-
 The following code snippets and sample applications provide practical examples of how to use Glue in LocalStack for various use cases:

 - [localstack-pro-samples/glue-etl-jobs](https://github.com/localstack/localstack-pro-samples/tree/master/glue-etl-jobs)
