Merged
24 commits
eecd173
feat:added support for upstream lineage for jdbc connectors
alokr-dhub Mar 10, 2026
de41f09
fix: linting checks
alokr-dhub Mar 10, 2026
93f050c
fix: linting fixes
alokr-dhub Mar 10, 2026
9f74b24
feat: added support for glue native connection config
alokr-dhub Mar 11, 2026
4fece1b
fix: handle credential leak.
alokr-dhub Mar 11, 2026
186d04f
fix: glue connection config
alokr-dhub Mar 12, 2026
46465a3
fix: handle v1 and v2 parsing and dbtable extaction from query
alokr-dhub Mar 16, 2026
fd79cb2
fix: linting error
alokr-dhub Mar 16, 2026
ac5d2bc
fix: credential sanitization in jdbc URL
alokr-dhub Mar 16, 2026
2576654
fix: handle joined sql queries
alokr-dhub Mar 16, 2026
89492c6
fix: updated documentation
alokr-dhub Mar 16, 2026
3a269f9
fix: update sqlglot dependency
alokr-dhub Mar 16, 2026
76bea0e
fix: update lock file
alokr-dhub Mar 16, 2026
f3d2c6a
fix: code refactor used JDBC prefix constant
alokr-dhub Mar 23, 2026
4bcb352
fix: use sqlglot_lineage instead of sqlglot parse
alokr-dhub Mar 23, 2026
41b54ae
fix: python lock file
alokr-dhub Mar 23, 2026
18f4f4e
fix: added case for semicolon-separated properties
alokr-dhub Mar 23, 2026
5d6b9de
fix: uv lockfile
alokr-dhub Mar 24, 2026
ed4462e
fix: added support for target platform instance mappling
alokr-dhub Mar 24, 2026
a168feb
fix: updated docs
alokr-dhub Mar 24, 2026
1f4b9c0
Merge branch 'master' into feature/support-glue-job-lineage-for-upstr…
alokr-dhub Mar 25, 2026
bc78c5a
fix: review comments
alokr-dhub Mar 30, 2026
c70793f
Merge branch 'master' into feature/support-glue-job-lineage-for-upstr…
alokr-dhub Mar 30, 2026
b215298
Merge branch 'master' into feature/support-glue-job-lineage-for-upstr…
alokr-dhub Mar 31, 2026
76 changes: 76 additions & 0 deletions metadata-ingestion/docs/sources/glue/README.md
@@ -19,3 +19,79 @@ If you also have files in S3 that you'd like to ingest, we recommend you use Glu
| Glue Job Transform | [Data Job](../../metamodel/entities/dataJob.md) | |
| Glue Job Data source | [Dataset](../../metamodel/entities/dataset.md) | |
| Glue Job Data sink | [Dataset](../../metamodel/entities/dataset.md) | |

### Compatibility

To capture lineage across Glue jobs and databases, one requirement must be met; otherwise the AWS API cannot report any lineage. The job must be created in Glue Studio with the "Generate classic script" option turned on (this option is available in the "Script" tab). Custom scripts that lack the proper annotations will not have lineage reported.

### JDBC Lineage

DataHub extracts upstream lineage for Glue job nodes that read from JDBC databases. Two node styles are supported:

#### Named Glue Connections (Visual Editor)

Glue Studio's visual editor stores connection references as `connection_options.connectionName`. DataHub calls the `GetConnection` API to resolve the connection and determine the platform and database.

Supported connection types:

| Glue `ConnectionType` | DataHub Platform |
| --------------------- | -------------------------------- |
| `JDBC` | Parsed from JDBC URL (see below) |
| `POSTGRESQL` | `postgres` |
| `MYSQL` | `mysql` |
| `REDSHIFT` | `redshift` |
| `ORACLE` | `oracle` |
| `SQLSERVER` | `mssql` |

The table is read from `connection_options.dbtable`. If `dbtable` is absent, DataHub falls back to parsing `connection_options.query` (see [SQL Query Lineage](#sql-query-lineage) below).
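The resolution rules above can be sketched as follows. This is a minimal illustration, not the connector's actual code: the names `NATIVE_CONNECTION_PLATFORMS`, `platform_for_connection_type`, and `table_source` are hypothetical, and the `GetConnection` call itself is omitted.

```python
from typing import Dict, Optional, Tuple

# Native Glue connection types -> DataHub platform, copied from the table above.
# The generic "JDBC" type is deliberately absent: its platform has to be
# derived from the JDBC URL instead.
NATIVE_CONNECTION_PLATFORMS: Dict[str, str] = {
    "POSTGRESQL": "postgres",
    "MYSQL": "mysql",
    "REDSHIFT": "redshift",
    "ORACLE": "oracle",
    "SQLSERVER": "mssql",
}


def platform_for_connection_type(connection_type: str) -> Optional[str]:
    """Return the platform for a native type, or None for "JDBC"/unknown types."""
    return NATIVE_CONNECTION_PLATFORMS.get(connection_type)


def table_source(connection_options: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Prefer `dbtable`; fall back to `query`; return None if neither is set."""
    for key in ("dbtable", "query"):
        value = connection_options.get(key)
        if value:
            return key, value
    return None
```

A caller would branch on the first element of the returned pair: `"dbtable"` values go to dataset name construction, `"query"` values go to SQL parsing.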

#### Inline JDBC Nodes (Script Style)

Script-style nodes set `connection_type` to the database protocol and pass the JDBC URL inline via `connection_options.url`. Supported protocols:

| `connection_type` | DataHub Platform | Default schema |
| ----------------- | ---------------- | -------------- |
| `postgresql` | `postgres` | `public` |
| `mysql` | `mysql` | — |
| `mariadb` | `mysql` | — |
| `redshift` | `redshift` | `public` |
| `oracle` | `oracle` | — |
| `sqlserver` | `mssql` | `dbo` |

Example of a job script call whose arguments DataHub can parse:

```python
# `glueContext` is the GlueContext object created earlier in the Glue job script.
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://myhost:5432/mydb",
        "dbtable": "public.orders",
        # or: "query": "SELECT * FROM public.orders WHERE region = 'US'"
    },
)
```
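To illustrate how a platform and database might be derived from such a URL, here is a rough sketch using only the standard library. It is an assumption about the general approach, not the connector's implementation: `parse_jdbc_url`, `JdbcTarget`, and `PROTOCOLS` are hypothetical names, the semicolon-property keys checked (`databaseName`, `database`) are assumed, and Oracle-style URLs such as `jdbc:oracle:thin:@host:1521:SID` are recognized but their database part is left unresolved.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit

JDBC_PREFIX = "jdbc:"

# Protocol -> (DataHub platform, default schema), copied from the table above.
PROTOCOLS = {
    "postgresql": ("postgres", "public"),
    "mysql": ("mysql", None),
    "mariadb": ("mysql", None),
    "redshift": ("redshift", "public"),
    "oracle": ("oracle", None),
    "sqlserver": ("mssql", "dbo"),
}


class JdbcTarget(NamedTuple):
    platform: str
    database: Optional[str]
    default_schema: Optional[str]


def parse_jdbc_url(url: str) -> Optional[JdbcTarget]:
    """Derive platform and database; credentials are never copied to the result."""
    if not url.startswith(JDBC_PREFIX):
        return None
    rest = url[len(JDBC_PREFIX):]
    protocol = rest.split(":", 1)[0].lower()
    if protocol not in PROTOCOLS:
        return None
    platform, default_schema = PROTOCOLS[protocol]

    # Semicolon style: jdbc:sqlserver://host:1433;databaseName=db;user=...
    head, *props = rest.split(";")
    database = None
    for prop in props:
        key, _, value = prop.partition("=")
        if key.strip().lower() in ("databasename", "database"):
            database = value.strip() or None

    # URL style: jdbc:postgresql://user:pw@host:5432/db?ssl=true
    # urlsplit keeps userinfo in netloc and parameters in query, so the
    # path holds only the database name.
    if database is None and head[len(protocol):].startswith("://"):
        database = urlsplit(head).path.strip("/") or None

    return JdbcTarget(platform, database, default_schema)
```

Because only the path component and the database property are read, a password embedded in the URL's user-info or parameters does not end up in the returned value.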

#### Dataset Name Construction

Given a `dbtable` value and the resolved `(platform, database)`:

- `dbtable = "schema.table"` → `database.schema.table`
- `dbtable = "table"` (no schema) → `database.<default_schema>.table` if the platform has a default schema, otherwise `database.table`
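The two rules above amount to a small function. The sketch below is illustrative only: `build_dataset_name` and `DEFAULT_SCHEMAS` are hypothetical names, the default schemas are taken from the protocol table earlier in this section, and quoted identifiers or three-part names are not handled.

```python
from typing import Dict

# Default schema per DataHub platform, from the protocol table above.
# Platforms without an entry (mysql, oracle) have no default schema.
DEFAULT_SCHEMAS: Dict[str, str] = {
    "postgres": "public",
    "redshift": "public",
    "mssql": "dbo",
}


def build_dataset_name(platform: str, database: str, dbtable: str) -> str:
    """Build the dataset name from a resolved (platform, database) and `dbtable`."""
    if "." in dbtable:
        # "schema.table" -> database.schema.table
        return f"{database}.{dbtable}"
    default_schema = DEFAULT_SCHEMAS.get(platform)
    if default_schema:
        # bare table on a platform with a default schema
        return f"{database}.{default_schema}.{dbtable}"
    # bare table, no default schema
    return f"{database}.{dbtable}"
```

For example, `build_dataset_name("postgres", "mydb", "orders")` and `build_dataset_name("postgres", "mydb", "public.orders")` yield the same name, so both spellings of `dbtable` point at one dataset.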

#### SQL Query Lineage

When `dbtable` is absent and `connection_options.query` is set, DataHub uses [sqlglot](https://github.com/tobymao/sqlglot) to extract table references from the SQL string.

**Supported:** Single-table queries, JOINs, CTEs, subqueries — all referenced tables are emitted as upstream datasets.

```sql
-- All three tables become upstream lineage inputs
SELECT o.id, c.name, p.price
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
```

**Not supported:** Queries that fail to parse, or queries with no table references (e.g. `SELECT 1`). These produce a warning and the node is skipped.

> **Note:** `query`-based lineage reflects the tables referenced in the SQL at ingestion time. Dynamic SQL, parameterized queries, or queries built at runtime cannot be statically analyzed.
31 changes: 31 additions & 0 deletions metadata-ingestion/docs/sources/glue/glue_post.md
@@ -6,10 +6,41 @@ Use the **Important Capabilities** table above as the source of truth for suppor

To capture lineage across Glue jobs and databases, one requirement must be met; otherwise the AWS API cannot report any lineage. The job must be created in Glue Studio with the "Generate classic script" option turned on (this option is available in the "Script" tab). Custom scripts that lack the proper annotations will not have lineage reported.

#### JDBC Upstream Lineage

When a Glue job reads from a JDBC source (e.g. PostgreSQL, MySQL, Redshift, Oracle, SQL Server), the plugin automatically extracts upstream lineage to the referenced tables. This works for both:

- **Direct JDBC connections** specified inline in the job script (via `connection_type` and `connection_options`)
- **Named Glue connections** configured in the Glue console and referenced by `connectionName`

Supported JDBC platforms: PostgreSQL, MySQL, MariaDB, Redshift, Oracle, SQL Server.

The plugin resolves table references from either the `dbtable` parameter or by parsing SQL from the `query` parameter.

#### Aligning URNs with Target Platform Connectors

If you also ingest the JDBC source separately (e.g. using the `postgres` or `mysql` connector) and that connector uses a `platform_instance` or a different `env`, you should configure `target_platform_configs` so the URNs match:

```yaml
source:
  type: glue
  config:
    target_platform_configs:
      postgres:
        platform_instance: prod-postgres
        env: PROD
      mysql:
        platform_instance: prod-mysql
```

When this is configured, dataset URNs produced by the Glue connector will include the same `platform_instance` and `env` as the target platform's connector, ensuring entities merge correctly in DataHub. If the target platform connector does not use a `platform_instance`, no configuration is needed — URNs will match by default.
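To see why the setting matters, the sketch below builds dataset URN strings by hand in the usual DataHub layout, where a platform instance is prefixed to the dataset name. The helper names `dataset_urn` and `urn_for` are hypothetical, and falling back to a default `env` when a platform entry omits it is an assumption. In real code, DataHub's `make_dataset_urn_with_platform_instance` builder would normally be used instead of string formatting.

```python
from typing import Dict, Optional


def dataset_urn(
    platform: str,
    name: str,
    env: str = "PROD",
    platform_instance: Optional[str] = None,
) -> str:
    """Format a dataset URN; the instance, if any, is prefixed to the name."""
    if platform_instance:
        name = f"{platform_instance}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def urn_for(
    platform: str,
    name: str,
    target_platform_configs: Dict[str, Dict[str, str]],
    default_env: str = "PROD",
) -> str:
    """Apply the per-platform overrides (if present) before formatting the URN."""
    cfg = target_platform_configs.get(platform, {})
    return dataset_urn(
        platform,
        name,
        env=cfg.get("env", default_env),  # assumed fallback when env is omitted
        platform_instance=cfg.get("platform_instance"),
    )


# Mirrors the YAML recipe above.
configs = {
    "postgres": {"platform_instance": "prod-postgres", "env": "PROD"},
    "mysql": {"platform_instance": "prod-mysql"},
}
```

With `configs` in place, the Postgres URN carries the `prod-postgres.` prefix. With an empty mapping the prefix is missing, so the URN would not match a dataset ingested by a Postgres connector that sets `platform_instance`.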

### Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.

JDBC upstream lineage from SQL queries (`query` parameter) does not currently apply `target_platform_configs`. Only the `dbtable` code path uses the configured `platform_instance` and `env`.

### Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
4 changes: 4 additions & 0 deletions metadata-ingestion/docs/sources/glue/glue_pre.md
@@ -8,6 +8,7 @@ This plugin extracts the following:
- Column types associated with each table
- Table metadata, such as owner, description and parameters
- Jobs and their component transformations, data sources, and data sinks
- Upstream lineage from JDBC sources (e.g. PostgreSQL, MySQL, Redshift) referenced by Glue jobs

### Prerequisites

@@ -40,12 +41,15 @@ For ingesting jobs (extract_transforms: True), the following additional permissi
"Action": [
"glue:GetDataflowGraph",
"glue:GetJobs",
"glue:GetConnection",
"s3:GetObject"
],
"Resource": "*"
}
```

The `glue:GetConnection` permission is required when Glue jobs reference named connections (e.g. JDBC connections configured in the Glue console). If your jobs only use inline connection parameters, this permission is not needed.

For profiling datasets, the following additional permissions are required:

```
2 changes: 2 additions & 0 deletions metadata-ingestion/pyproject.toml
@@ -631,6 +631,8 @@ glue = [
"boto3>=1.35.0,<2.0.0",
"botocore!=1.23.0,<2.0.0",
"cachetools<6.0.0",
"patchy==2.8.0",
"sqlglot[c]==30.0.3",
"urllib3>=1.26,<3.0",
]

2 changes: 1 addition & 1 deletion metadata-ingestion/setup.py
@@ -633,7 +633,7 @@
},
"flink": {"requests<3.0.0", "tenacity>=8.0.1,<9.0.0"},
"grafana": {"requests<3.0.0", *sqlglot_lib},
- "glue": aws_common | cachetools_lib,
+ "glue": aws_common | cachetools_lib | sqlglot_lib,
# hdbcli is supported officially by SAP, sqlalchemy-hana is built on top but not officially supported
"hana": sql_common
| {