Conversation

@kyungsoo-datahub
Contributor

Snowflake's access history can return empty column names for certain query types (e.g., DELETE, queries on views over external sources like Google Sheets). This was causing invalid schemaField URNs to be sent to GMS.

This fix adds two layers of protection:

  1. At ingestion source level: Detect empty columns in direct_objects_accessed and fall back to ObservedQuery for DataHub's own SQL parsing
  2. At query subjects generation: Skip empty column names when creating schemaField URNs to prevent invalid URN generation
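The two layers described above can be sketched as follows. This is a minimal illustration, not DataHub's actual code: `parse_audit_entry` and the dictionary shapes are simplified stand-ins, though the key names (`direct_objects_accessed`, `columnName`) and the `ObservedQuery` fallback mirror the PR.

```python
# Layer 1 sketch: detect blank columns in the audit log and fall back
# to DataHub's own SQL parsing (ObservedQuery) instead of trusting
# the malformed column list. Illustrative helper, not the real source.

def has_empty_column(direct_objects_accessed: list) -> bool:
    """Return True if any accessed object has an empty or
    whitespace-only column name."""
    for obj in direct_objects_accessed:
        for col in obj.get("columns", []):
            name = col.get("columnName")
            if not name or not name.strip():
                return True
    return False

def parse_audit_entry(entry: dict):
    if has_empty_column(entry.get("direct_objects_accessed", [])):
        # Malformed audit-log columns: let DataHub parse the SQL itself.
        return ("ObservedQuery", entry["query_text"])
    # Normal path: trust the column info copied from the audit log.
    return ("PreparsedQuery", entry["query_text"])
```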

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Oct 24, 2025
@codecov

codecov bot commented Oct 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


@alwaysmeticulous

alwaysmeticulous bot commented Oct 24, 2025

✅ Meticulous spotted 0 visual differences across 1016 screens tested: view results.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit ecd54eb.

@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Oct 24, 2025
@codecov

codecov bot commented Oct 24, 2025

Bundle Report

Bundle size has no change ✅

upstreams = []
column_usage = {}

has_empty_column = False
Collaborator

do we need to introduce a new flag? Why don't we return ObservedQuery directly from the loop, instead of having break-related logic?

Contributor Author

Good callout. Right, we don't need to wait for the full iteration. Revised.
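The refactor the reviewer suggests can be sketched like this: return the `ObservedQuery` fallback directly from inside the loop, so no `has_empty_column` flag or break-related logic is needed. Names and dictionary shapes are illustrative stand-ins, not DataHub's real types.

```python
# Early-return refactor sketch: the first blank column name short-circuits
# straight to the SQL-parsing fallback.

def build_query(entry: dict):
    upstreams = []
    column_usage = {}
    for obj in entry.get("direct_objects_accessed", []):
        upstreams.append(obj["objectName"])
        cols = []
        for col in obj.get("columns", []):
            name = col.get("columnName")
            if not name or not name.strip():
                # Return immediately: no flag, no break, no post-loop check.
                return ("ObservedQuery", entry["query_text"])
            cols.append(name)
        column_usage[obj["objectName"]] = cols
    return ("PreparsedQuery", upstreams, column_usage)
```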

self.identifiers.snowflake_identifier(modified_column["columnName"])
)
column_name = modified_column["columnName"]
if not column_name or not column_name.strip():
Collaborator

Thank you for addressing also the case of non-empty, but containing only white-spaces column names!
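The check being praised here is worth spelling out: a plain falsiness test (`if not column_name`) would miss names like `"   "`, which are truthy but still produce an invalid URN. The combined condition handles both, and short-circuiting keeps `None.strip()` from ever being called.

```python
# Illustrative predicate matching the check in the diff above.
def is_invalid_column_name(column_name):
    # `not column_name` catches None and ""; `not column_name.strip()`
    # additionally catches whitespace-only names. Short-circuit evaluation
    # guarantees .strip() is never called on None.
    return not column_name or not column_name.strip()
```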

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Oct 28, 2025
columns.add(
self.identifiers.snowflake_identifier(modified_column["columnName"])
)
column_name = modified_column["columnName"]
Collaborator

We need to make it very visible that we decided to parse the query ourselves where we would otherwise use information coming directly from the audit log, for two reasons:

  1. We want to understand why Snowflake would produce such, from our perspective, malformed audit log entries. It would be best to be able to pinpoint the queries involved.
  2. Parsing queries takes much longer than just copying information from the audit log. This change has potential adverse effects on overall ingestion performance, so we need to know how many queries had to be parsed by us.

So to meet above conditions we need to:

  1. Extend the report object for the Snowflake source, so that we can keep count of these queries. Maybe saving the query_id of each query that was forced to be parsed would be a good idea - use LossyList to avoid storing too many. Such a query_id could be used to retrieve the actual query from the warehouse.
  2. We need to print information that this happened. I think at least info level should be used, maybe even warning. It is an open question whether we should go as far as using self.report.warning - in that case the message would appear in the Managed Ingestion UI, which might be overkill. WDYT?

Contributor Author

Thank you for your suggestion. It makes sense to me. I've extended the SnowflakeQueriesExtractorReport to track queries with empty column names. Specifically, I've added a counter (num_queries_with_empty_column_name) and a LossyList (queries_with_empty_column_name) to store the IDs of affected queries. When an empty column name is detected, the source now logs an informational message including the query ID and a note about the performance impact, and updates the new report fields before falling back to SQL parsing.
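The reporting change described in this reply can be sketched as below. The `LossyList` here is a tiny stand-in mimicking DataHub's capped-list utility, and the report class layout is illustrative; only the field names (`num_queries_with_empty_column_name`, `queries_with_empty_column_name`) follow the comment.

```python
from dataclasses import dataclass, field

class LossyList(list):
    """Stand-in for DataHub's LossyList: keeps at most max_elements
    items so the report stays bounded, counting what it drops."""
    def __init__(self, max_elements: int = 10):
        super().__init__()
        self.max_elements = max_elements
        self.dropped = 0

    def append(self, item):
        if len(self) < self.max_elements:
            super().append(item)
        else:
            self.dropped += 1

@dataclass
class QueriesExtractorReport:
    # Counter plus a bounded sample of affected query IDs, as described above.
    num_queries_with_empty_column_name: int = 0
    queries_with_empty_column_name: LossyList = field(default_factory=LossyList)

def record_fallback(report: QueriesExtractorReport, query_id: str) -> None:
    """Called when blank audit-log columns force a SQL-parsing fallback."""
    report.num_queries_with_empty_column_name += 1
    report.queries_with_empty_column_name.append(query_id)
```

The saved query IDs can later be used to fetch the offending queries from the warehouse, as the reviewer suggested.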

query_subject_urns.add(upstream)
if include_fields:
for column in sorted(self.column_usage.get(upstream, [])):
# Skip empty column names to avoid creating invalid URNs
Collaborator

I think we need to print a message here, either warning or info. Same as below.

Contributor Author

Warning logs added. Thank you for the suggestion.
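The second protection layer with the warning log added here might look roughly like this. `make_schema_field_urn` is a simplified stand-in for the real URN builder, used only to show the shape of the invalid URN being avoided.

```python
import logging

logger = logging.getLogger(__name__)

def make_schema_field_urn(dataset_urn: str, column: str) -> str:
    # Simplified stand-in for the real schemaField URN builder.
    return f"urn:li:schemaField:({dataset_urn},{column})"

def field_urns_for(dataset_urn: str, columns):
    """Build schemaField URNs, skipping blank column names with a warning."""
    urns = []
    for column in columns:
        if not column or not column.strip():
            # An empty name would yield "urn:li:schemaField:(...,)",
            # which GMS rejects as an invalid URN.
            logger.warning("Skipping empty column name for %s", dataset_urn)
            continue
        urns.append(make_schema_field_urn(dataset_urn, column))
    return urns
```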

)
column_name = modified_column["columnName"]
if not column_name or not column_name.strip():
has_empty_column = True
Collaborator

I would also add some comment explaining why are we deciding to parse the query ourselves in cases where there are empty column names.

Contributor Author

Added. Thank you for the suggestion.

assert extractor.report.sql_aggregator.num_preparsed_queries == 0


class TestSnowflakeQueryParser:
Collaborator

This test is awesome! Maybe the comments should make it clearer that we are testing the case where Snowflake sends us somehow corrupted results.
Also - why are the imports done inside the functions? Can't we move them to the top?

Contributor Author

Thank you for the comments. Revised.
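A hypothetical pytest-style sketch of the kind of unit test discussed here: feed an audit-log row whose column name is blank and assert the fallback kicks in. The `parse_audit_entry` helper is an illustrative stand-in reproduced inline so the test is self-contained, with imports at module top as the reviewer requested.

```python
def parse_audit_entry(entry):
    # Illustrative: real ingestion logic is more involved.
    for obj in entry.get("direct_objects_accessed", []):
        for col in obj.get("columns", []):
            name = col.get("columnName")
            if not name or not name.strip():
                return "ObservedQuery"
    return "PreparsedQuery"

def test_blank_columns_force_observed_query():
    # Snowflake sends us somehow corrupted results: a whitespace-only
    # column name in the access_history entry.
    entry = {
        "query_text": "SELECT 1",
        "direct_objects_accessed": [
            {"objectName": "db.schema.t", "columns": [{"columnName": "  "}]}
        ],
    }
    assert parse_audit_entry(entry) == "ObservedQuery"
```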

) -> None:
"""Test that QuerySubjects with empty column names doesn't create invalid URNs.
This simulates the Snowflake scenario where DELETE queries return empty column
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about DELETE queries.

Contributor Author

Revised.

"snowflake", "production.dca_core.snowplow_user_engagement_mart__dbt_tmp"
).urn()

# Simulate a DELETE query with subquery where Snowflake returns empty columns
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about DELETE queries.

Contributor Author

Thank you for pointing that out. I revised it to "# Simulate a query where Snowflake's access_history contains empty column names."

upstreams=[upstream_urn],
downstream=downstream_urn,
column_lineage=[
# Snowflake returns empty column names for DELETE operations
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about DELETE queries.
(I mean only the comment, test code is great, please keep it!)

Contributor Author

Makes sense. Revised to "# This simulates a case where an empty column name might be present in the audit log."

This is the scenario that would send invalid URNs to GMS rather than crash in Python,
matching the customer's error: "Provided urn urn:li:schemaField:(...,) is invalid"
Example: SELECT queries on views over Google Sheets or other special data sources
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about Google Sheets or other special sources.

Contributor Author

Thank you for the suggestion. Removed.

).urn()

# Simulate a SELECT query (no downstream) where Snowflake has empty column tracking
# This is common with views over external data sources like Google Sheets
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about Google Sheets or other special sources.

Contributor Author

Removed.

downstream=None, # SELECT query has no downstream
column_lineage=[], # No column lineage because no downstream
column_usage={
# Snowflake returns empty column names for problematic views
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about Google Sheets or other special sources.

Contributor Author

Agreed. Revised to "# Simulate a case where an empty column name is present in the audit log." Thanks.

@skrydal skrydal self-requested a review October 28, 2025 22:10
Collaborator

@skrydal skrydal left a comment

I greatly appreciate your meticulous approach to unit tests, which look exactly the way proper unit tests should! I have left just a couple of comments.

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Oct 30, 2025
This commit improves the robustness and clarity of handling empty column names from audit logs during ingestion.

- Refactors audit log parsing to be cleaner and more direct.
- Adds logging and reporting to track fallbacks to manual SQL parsing.
- Improves comments and test code style for better maintainability.
Collaborator

@skrydal skrydal left a comment

LGTM

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Oct 30, 2025