The Glue API in LocalStack Pro allows you to run ETL (Extract-Transform-Load) jobs locally, maintaining table metadata in the local Glue data catalog, and using the Spark ecosystem (PySpark/Scala) to run data processing workflows.

LocalStack allows you to use the Glue APIs in your local environment.
-The supported APIs are available on our [API coverage page](/references/coverage/coverage_glue/), which provides information on the extent of Glue's integration with LocalStack.
+The supported APIs are available on our [API coverage page](), which provides information on the extent of Glue's integration with LocalStack.

-{{< callout >}}
+:::note
LocalStack now includes a container-based Glue Job executor, enabling Glue jobs to run within a Docker environment.
Previously, LocalStack relied on a pre-packaged binary that included Spark and other required components.
The new executor leverages the `aws-glue-libs` Docker image, provides better production parity, faster startup times, and more reliable execution.
@@ -27,7 +26,7 @@ Key enhancements include:

To use it, set `GLUE_JOB_EXECUTOR=docker` and `GLUE_JOB_EXECUTOR_PROVIDER=v2` in your LocalStack configuration.
The new executor additionally deprecates older versions of Glue (`0.9`, `1.0`, `2.0`).
-{{< /callout >}}
+:::

## Getting started

@@ -36,20 +35,20 @@ This guide is designed for users new to Glue and assumes basic knowledge of the
Start your LocalStack container using your preferred method.
We will demonstrate how to create databases and table metadata in Glue, run Glue ETL jobs, import databases from Athena, and run Glue Crawlers with the AWS CLI.

-{{< callout >}}
-In order to run Glue jobs, some additional dependencies have to be fetched from the network, including a Docker image of apprx.
-1.5GB which includes Spark, Presto, Hive and other tools.
+:::note
+In order to run Glue jobs, some additional dependencies have to be fetched from the network, including a Docker image of approximately 1.5GB which includes Spark, Presto, Hive and other tools.
These dependencies are automatically fetched when you start up the service, so please make sure you're on a decent internet connection when pulling the dependencies for the first time.
-{{< /callout >}}
+:::

### Creating Databases and Table Metadata

The commands below illustrate the creation of some very basic entries (databases, tables) in the Glue data catalog:
Once the crawler has started, you have to wait until the `State` turns to `READY` when querying the current state:
-{{< command >}}
-$ awslocal glue get-crawler --name c1
-{{< /command >}}
+
+```bash
+awslocal glue get-crawler --name c1
+```

Once the crawler has finished running and is back in `READY` state, the Glue table within the `gluedb1` DB should have been populated and can be queried via the API.

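The wait for the `READY` state described above could also be scripted instead of re-running `get-crawler` by hand. The following is a minimal sketch, not part of the change under review: `wait_for_crawler_ready` is a hypothetical helper name, and it assumes a boto3-style Glue client whose `get_crawler` call returns the crawler state under `Crawler.State`.

```python
import time


def wait_for_crawler_ready(glue_client, name, poll_seconds=2.0,
                           timeout_seconds=300.0, sleep=time.sleep):
    """Poll get_crawler until the crawler reports READY (hypothetical helper).

    Returns the number of polls performed. Raises TimeoutError if the
    crawler is still not READY once timeout_seconds have been waited.
    """
    waited = 0.0
    polls = 0
    while True:
        # Same information as `awslocal glue get-crawler --name <name>`.
        state = glue_client.get_crawler(Name=name)["Crawler"]["State"]
        polls += 1
        if state == "READY":
            return polls
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"crawler {name!r} still in state {state} after {waited:.0f}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Example call (assumes boto3 is installed and LocalStack listens on its
# default edge port 4566; both are assumptions, not taken from the guide):
#   import boto3
#   glue = boto3.client("glue", endpoint_url="http://localhost:4566")
#   wait_for_crawler_ready(glue, "c1")
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays; callers would normally leave it at its default.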
@@ -296,21 +322,27 @@ The Glue Schema Registry allows you to centrally discover, control, and evolve d
With the Schema Registry, you can manage and enforce schemas and schema compatibilities in your streaming applications.
It integrates nicely with [Managed Streaming for Kafka (MSK)](../managed-streaming-for-kafka).

-{{< callout >}}
+:::note
Currently, LocalStack supports the AVRO dataformat for the Glue Schema Registry.
Support for other dataformats will be added in the future.
-{{< /callout >}}
+:::

You can create a schema registry with the following command:
@@ -352,9 +386,9 @@ You can find a more advanced sample in our [localstack-pro-samples repository on

LocalStack Glue supports [Delta Lake](https://delta.io), an open-source storage framework that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.

-{{< callout >}}
+:::note
Please note that Delta Lake tables are only [supported for Glue versions `3.0` and `4.0`](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-delta-lake.html).
-{{< /callout >}}
+:::

To illustrate this feature, we take a closer look at a Glue sample job that creates a Delta Lake table, puts some data into it, and then queries data from the table.
Retrieve the job run ID from the output of the `start-job-run` command.

The execution of the Glue job can take a few moments - once the job has finished executing, you should see a log line with the query results in the LocalStack container logs, similar to the output below:
In order to see the logs above, make sure to enable `DEBUG=1` in the LocalStack container environment.
-Alternatively, you can also retrieve the job logs programmatically via the CloudWatch Logs API - for example, using the job run ID `c9471f40`from above:
+Alternatively, you can also retrieve the job logs programmatically via the CloudWatch Logs API - for example, using the job run ID from the above command.
+
+```bash
+awslocal logs get-log-events \
+--log-group-name /aws-glue/jobs/logs-v2 \
+--log-stream-name <JobRunId>
+```
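The same log lookup can be done from Python rather than the CLI. This is a rough sketch under stated assumptions: `fetch_job_log_messages` is a hypothetical helper name, it expects a boto3-style CloudWatch Logs client, and it reads only the first page of events from the `/aws-glue/jobs/logs-v2` group used in the command above.

```python
def fetch_job_log_messages(logs_client, job_run_id,
                           log_group="/aws-glue/jobs/logs-v2"):
    """Return log messages for one Glue job run (hypothetical helper).

    The log stream is named after the job run ID, mirroring the
    `--log-stream-name <JobRunId>` argument of the CLI call.
    Only a single page of events is read; pagination is left out.
    """
    response = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=job_run_id,
        startFromHead=True,  # oldest events first
    )
    return [event["message"] for event in response["events"]]


# Example call (assumes boto3 is installed and LocalStack listens on its
# default edge port 4566; both are assumptions, not taken from the guide):
#   import boto3
#   logs = boto3.client("logs", endpoint_url="http://localhost:4566")
#   for line in fetch_job_log_messages(logs, "<JobRunId>"):
#       print(line)
```

Passing the client in as an argument keeps the helper independent of how the LocalStack endpoint and credentials are configured.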

## Resource Browser

The LocalStack Web Application provides a Resource Browser for Glue.
You can access the Resource Browser by opening the LocalStack Web Application in your browser, navigating to the **Resources** section, and then clicking on **Glue** under the **Analytics** section.