datahub-project · alokr-dhub · Apr 1, 2026 · Mar 10, 2026 · Mar 10, 2026 · Mar 10, 2026
diff --git a/metadata-ingestion/docs/sources/glue/README.md b/metadata-ingestion/docs/sources/glue/README.md
@@ -19,3 +19,79 @@ If you also have files in S3 that you'd like to ingest, we recommend you use Glu
 | Glue Job Transform   | [Data Job](../../metamodel/entities/dataJob.md)           |                    |
 | Glue Job Data source | [Dataset](../../metamodel/entities/dataset.md)            |                    |
 | Glue Job Data sink   | [Dataset](../../metamodel/entities/dataset.md)            |                    |
+
+### Compatibility
+
+To capture lineage across Glue jobs and databases, one requirement must be met; otherwise the AWS API cannot report any lineage. The job must be created in Glue Studio with the "Generate classic script" option turned on (this option is in the "Script" tab). Custom scripts that lack the proper annotations will not have lineage reported.
+
+### JDBC Lineage
+
+DataHub extracts upstream lineage for Glue job nodes that read from JDBC databases. Two node styles are supported:
+
+#### Named Glue Connections (Visual Editor)
+
+Glue Studio's visual editor stores connection references as `connection_options.connectionName`. DataHub calls the `GetConnection` API to resolve the connection and determine the platform and database.
+
+Supported connection types:
+
+| Glue `ConnectionType` | DataHub Platform                 |
+| --------------------- | -------------------------------- |
+| `JDBC`                | Parsed from JDBC URL (see below) |
+| `POSTGRESQL`          | `postgres`                       |
+| `MYSQL`               | `mysql`                          |
+| `REDSHIFT`            | `redshift`                       |
+| `ORACLE`              | `oracle`                         |
+| `SQLSERVER`           | `mssql`                          |
+
+The table name is read from `connection_options.dbtable`. If `dbtable` is absent, DataHub falls back to parsing `connection_options.query` (see [SQL Query Lineage](#sql-query-lineage) below).
+
+#### Inline JDBC Nodes (Script Style)
+
+Script-style nodes set `connection_type` to the database protocol and pass the JDBC URL inline via `connection_options.url`. Supported protocols:
+
+| `connection_type` | DataHub Platform | Default schema |
+| ----------------- | ---------------- | -------------- |
+| `postgresql`      | `postgres`       | `public`       |
+| `mysql`           | `mysql`          | —              |
+| `mariadb`         | `mysql`          | —              |
+| `redshift`        | `redshift`       | `public`       |
+| `oracle`          | `oracle`         | —              |
+| `sqlserver`       | `mssql`          | `dbo`          |
+
+Example job script call whose arguments DataHub can parse:
+
+```python
+datasource = glueContext.create_dynamic_frame.from_options(
+    connection_type="postgresql",
+    connection_options={
+        "url": "jdbc:postgresql://myhost:5432/mydb",
+        "dbtable": "public.orders",
+        # or: "query": "SELECT * FROM public.orders WHERE region = 'US'"
+    },
+)
+```
+
+#### Dataset Name Construction
+
+Given a `dbtable` value and the resolved `(platform, database)`:
+
+- `dbtable = "schema.table"` → `database.schema.table`
+- `dbtable = "table"` (no schema) → `database.<default_schema>.table` if the platform has a default schema, otherwise `database.table`
+
+#### SQL Query Lineage
+
+When `dbtable` is absent and `connection_options.query` is set, DataHub uses [sqlglot](https://github.com/tobymao/sqlglot) to extract table references from the SQL string.
+
+**Supported:** Single-table queries, JOINs, CTEs, subqueries — all referenced tables are emitted as upstream datasets.
+
+```sql
+-- All three tables become upstream lineage inputs
+SELECT o.id, c.name, p.price
+FROM orders o
+JOIN customers c ON o.customer_id = c.id
+JOIN products p ON o.product_id = p.id
+```
+
+**Not supported:** Queries that fail to parse, or queries with no table references (e.g. `SELECT 1`). These produce a warning and the node is skipped.
+
+> **Note:** `query`-based lineage reflects the tables referenced in the SQL at ingestion time. Dynamic SQL, parameterized queries, or queries built at runtime cannot be statically analyzed.
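
Reviewer note on the README hunk above: the rules under "Dataset Name Construction" and the inline-URL case can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the connector's code. `parse_simple_jdbc_url`, `build_dataset_name`, and both lookup dictionaries are hypothetical names, with values copied from the tables in the hunk. Only the plain `jdbc:<protocol>://host:port/database` URL shape is handled; Oracle and SQL Server are left out of the URL parser because their JDBC URLs commonly use a different shape.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Protocol -> DataHub platform, copied from the "Inline JDBC Nodes" table.
# Hypothetical lookup; oracle/sqlserver omitted (different URL shapes).
PROTOCOL_TO_PLATFORM = {
    "postgresql": "postgres",
    "mysql": "mysql",
    "mariadb": "mysql",
    "redshift": "redshift",
}

# Platform -> default schema, copied from the same table.
DEFAULT_SCHEMAS = {"postgres": "public", "redshift": "public", "mssql": "dbo"}


def parse_simple_jdbc_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (platform, database) for jdbc:<protocol>://host[:port]/<database>."""
    if not url.startswith("jdbc:"):
        return None
    parsed = urlparse(url[len("jdbc:"):])
    platform = PROTOCOL_TO_PLATFORM.get(parsed.scheme)
    database = parsed.path.lstrip("/")
    if platform is None or not database:
        return None
    return platform, database


def build_dataset_name(platform: str, database: str, dbtable: str) -> str:
    """Apply the documented naming rules to a dbtable value."""
    if "." in dbtable:
        # "schema.table" -> database.schema.table
        return f"{database}.{dbtable}"
    default_schema = DEFAULT_SCHEMAS.get(platform)
    if default_schema:
        # Bare table on a platform with a default schema
        return f"{database}.{default_schema}.{dbtable}"
    # Bare table, no default schema -> database.table
    return f"{database}.{dbtable}"


resolved = parse_simple_jdbc_url("jdbc:postgresql://myhost:5432/mydb")
if resolved is not None:
    platform, database = resolved
    print(build_dataset_name(platform, database, "orders"))
```

Keeping the default-schema lookup separate from URL parsing mirrors how the README describes the two steps, and makes the "no default schema" branch for MySQL-style platforms easy to check.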
diff --git a/metadata-ingestion/docs/sources/glue/glue_post.md b/metadata-ingestion/docs/sources/glue/glue_post.md
@@ -6,10 +6,41 @@ Use the **Important Capabilities** table above as the source of truth for suppor
 
 To capture lineage across Glue jobs and databases, a requirements must be met – otherwise the AWS API is unable to report any lineage. The job must be created in Glue Studio with the "Generate classic script" option turned on (this option can be accessed in the "Script" tab). Any custom scripts that do not have the proper annotations will not have reported lineage.
 
+#### JDBC Upstream Lineage
+
+When a Glue job reads from a JDBC source (e.g. PostgreSQL, MySQL, Redshift, Oracle, SQL Server), the plugin automatically extracts upstream lineage to the referenced tables. This works for both:
+
+- **Direct JDBC connections** specified inline in the job script (via `connection_type` and `connection_options`)
+- **Named Glue connections** configured in the Glue console and referenced by `connectionName`
+
+Supported JDBC platforms: PostgreSQL, MySQL, MariaDB, Redshift, Oracle, SQL Server.
+
+The plugin resolves table references from either the `dbtable` parameter or by parsing SQL from the `query` parameter.
+
+#### Aligning URNs with Target Platform Connectors
+
+If you also ingest the JDBC source separately (e.g. using the `postgres` or `mysql` connector) and that connector uses a `platform_instance` or a different `env`, you should configure `target_platform_configs` so the URNs match:
+
+```yaml
+source:
+  type: glue
+  config:
+    target_platform_configs:
+      postgres:
+        platform_instance: prod-postgres
+        env: PROD
+      mysql:
+        platform_instance: prod-mysql
+```
+
+When this is configured, dataset URNs produced by the Glue connector include the same `platform_instance` and `env` as the target platform's connector, so the entities merge correctly in DataHub. If the target platform connector uses neither a `platform_instance` nor a different `env`, no configuration is needed: the URNs match by default.
+
 ### Limitations
 
 Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
 
+JDBC upstream lineage from SQL queries (`query` parameter) does not currently apply `target_platform_configs`. Only the `dbtable` code path uses the configured `platform_instance` and `env`.
+
 ### Troubleshooting
 
 If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
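
Reviewer note on the `target_platform_configs` section above: the reason the two connectors must agree can be seen by writing out the dataset URN by hand. The sketch below assumes the usual DataHub dataset URN layout, in which a `platform_instance` is prefixed to the dataset name with a dot. `dataset_urn` and `glue_side_urn` are hypothetical helpers written for illustration, and the `PROD` default for `env` is an assumption, not something this diff states.

```python
from typing import Dict, Optional


def dataset_urn(
    platform: str,
    name: str,
    env: str = "PROD",  # assumed default, for illustration only
    platform_instance: Optional[str] = None,
) -> str:
    """Format a dataset URN string; a platform_instance prefixes the name."""
    qualified = f"{platform_instance}.{name}" if platform_instance else name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{qualified},{env})"


def glue_side_urn(
    platform: str,
    name: str,
    target_platform_configs: Dict[str, Dict[str, str]],
) -> str:
    """Build the URN the Glue side would emit, given per-platform overrides."""
    cfg = target_platform_configs.get(platform, {})
    return dataset_urn(
        platform,
        name,
        env=cfg.get("env", "PROD"),
        platform_instance=cfg.get("platform_instance"),
    )


# URN as emitted by a postgres connector that sets platform_instance.
postgres_side = dataset_urn(
    "postgres", "mydb.public.orders", env="PROD", platform_instance="prod-postgres"
)

# With matching overrides the two URNs are identical, so the entities merge.
configs = {"postgres": {"platform_instance": "prod-postgres", "env": "PROD"}}
print(glue_side_urn("postgres", "mydb.public.orders", configs) == postgres_side)

# Without overrides the Glue-side URN lacks the instance prefix and diverges.
print(glue_side_urn("postgres", "mydb.public.orders", {}) == postgres_side)
```

Per the Limitations note in the same hunk, this alignment would only apply to names resolved from `dbtable`, not to tables extracted from `query`.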
diff --git a/metadata-ingestion/docs/sources/glue/glue_pre.md b/metadata-ingestion/docs/sources/glue/glue_pre.md
@@ -8,6 +8,7 @@ This plugin extracts the following:
 - Column types associated with each table
 - Table metadata, such as owner, description and parameters
 - Jobs and their component transformations, data sources, and data sinks
+- Upstream lineage from JDBC sources (e.g. PostgreSQL, MySQL, Redshift) referenced by Glue jobs
 
 ### Prerequisites
 
@@ -40,12 +41,15 @@ For ingesting jobs (extract_transforms: True), the following additional permissi
     "Action": [
         "glue:GetDataflowGraph",
         "glue:GetJobs",
+        "glue:GetConnection",
         "s3:GetObject",
     ],
     "Resource": "*"
 }
 ```
 
+The `glue:GetConnection` permission is required when Glue jobs reference named connections (e.g. JDBC connections configured in the Glue console). If your jobs only use inline connection parameters, this permission is not needed.
+
 For profiling datasets, the following additional permissions are required:
 
 ```

diff --git a/metadata-ingestion/pyproject.toml b/metadata-ingestion/pyproject.toml
@@ -631,6 +631,8 @@ glue = [
     "boto3>=1.35.0,<2.0.0",
     "botocore!=1.23.0,<2.0.0",
     "cachetools<6.0.0",
+    "patchy==2.8.0",
+    "sqlglot[c]==30.0.3",
     "urllib3>=1.26,<3.0",
 ]
 

diff --git a/metadata-ingestion/setup.py b/metadata-ingestion/setup.py
@@ -633,7 +633,7 @@
     },
     "flink": {"requests<3.0.0", "tenacity>=8.0.1,<9.0.0"},
     "grafana": {"requests<3.0.0", *sqlglot_lib},
-    "glue": aws_common | cachetools_lib,
+    "glue": aws_common | cachetools_lib | sqlglot_lib,
     # hdbcli is supported officially by SAP, sqlalchemy-hana is built on top but not officially supported
     "hana": sql_common
     | {