Conversation

@kyungsoo-datahub
Contributor

Snowflake's access history can return empty column names for certain query types (e.g., DELETE, queries on views over external sources like Google Sheets). This was causing invalid schemaField URNs to be sent to GMS.

This fix adds two layers of protection:

  1. At ingestion source level: Detect empty columns in direct_objects_accessed and fall back to ObservedQuery for DataHub's own SQL parsing
  2. At query subjects generation: Skip empty column names when creating schemaField URNs to prevent invalid URN generation
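The two layers described above can be sketched as follows. This is a minimal illustration, not DataHub's actual code: `parse_audit_entry` and the dictionary shapes are simplified stand-ins, though the key names (`direct_objects_accessed`, `columnName`) and the `ObservedQuery` fallback mirror the PR.

```python
# Layer 1 sketch: detect blank columns in the audit log and fall back
# to DataHub's own SQL parsing (ObservedQuery) instead of trusting
# the malformed column list. Illustrative helper, not the real source.

def has_empty_column(direct_objects_accessed: list) -> bool:
    """Return True if any accessed object has an empty or
    whitespace-only column name."""
    for obj in direct_objects_accessed:
        for col in obj.get("columns", []):
            name = col.get("columnName")
            if not name or not name.strip():
                return True
    return False

def parse_audit_entry(entry: dict):
    if has_empty_column(entry.get("direct_objects_accessed", [])):
        # Malformed audit-log columns: let DataHub parse the SQL itself.
        return ("ObservedQuery", entry["query_text"])
    # Normal path: trust the column info copied from the audit log.
    return ("PreparsedQuery", entry["query_text"])
```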

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Oct 24, 2025
@codecov

codecov bot commented Oct 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


@alwaysmeticulous

alwaysmeticulous bot commented Oct 24, 2025

✅ Meticulous spotted 0 visual differences across 1016 screens tested: view results.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit ecd54eb.

@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Oct 24, 2025
@codecov

codecov bot commented Oct 24, 2025

Bundle Report

Bundle size has no change ✅

upstreams = []
column_usage = {}

has_empty_column = False
Collaborator

do we need to introduce a new flag? Why don't we return ObservedQuery directly from the loop, instead of having break-related logic?

Contributor Author

Good callout. Right, we don't need to wait for the full iteration. Revised.
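The refactor the reviewer suggests can be sketched like this: return the `ObservedQuery` fallback directly from inside the loop, so no `has_empty_column` flag or break-related logic is needed. Names and dictionary shapes are illustrative stand-ins, not DataHub's real types.

```python
# Early-return refactor sketch: the first blank column name short-circuits
# straight to the SQL-parsing fallback.

def build_query(entry: dict):
    upstreams = []
    column_usage = {}
    for obj in entry.get("direct_objects_accessed", []):
        upstreams.append(obj["objectName"])
        cols = []
        for col in obj.get("columns", []):
            name = col.get("columnName")
            if not name or not name.strip():
                # Return immediately: no flag, no break, no post-loop check.
                return ("ObservedQuery", entry["query_text"])
            cols.append(name)
        column_usage[obj["objectName"]] = cols
    return ("PreparsedQuery", upstreams, column_usage)
```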

self.identifiers.snowflake_identifier(modified_column["columnName"])
)
column_name = modified_column["columnName"]
if not column_name or not column_name.strip():
Collaborator

Thank you for addressing also the case of non-empty, but containing only white-spaces column names!
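The check being praised here is worth spelling out: a plain falsiness test (`if not column_name`) would miss names like `"   "`, which are truthy but still produce an invalid URN. The combined condition handles both, and short-circuiting keeps `None.strip()` from ever being called.

```python
# Illustrative predicate matching the check in the diff above.
def is_invalid_column_name(column_name):
    # `not column_name` catches None and ""; `not column_name.strip()`
    # additionally catches whitespace-only names. Short-circuit evaluation
    # guarantees .strip() is never called on None.
    return not column_name or not column_name.strip()
```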

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Oct 28, 2025
columns.add(
self.identifiers.snowflake_identifier(modified_column["columnName"])
)
column_name = modified_column["columnName"]
Collaborator

We need to make it very visible that we decided to parse the query ourselves where we would otherwise use information coming directly from the audit log, for two reasons:

  1. We want to understand why Snowflake would produce such, from our perspective, malformed audit log entries. It would be best to be able to pinpoint the queries involved.
  2. Parsing queries takes much longer than just copying information from the audit log. This change has potential adverse effects on overall ingestion performance, so we need to know how many queries had to be parsed by us.

So to meet above conditions we need to:

  1. Extend the report object for the Snowflake source, so that we can keep count of these queries. Maybe saving the query_id of each query that was forced to be parsed would be a good idea - use LossyList to avoid storing too many. Such a query_id could be used to retrieve the actual query from the warehouse.
  2. We need to print information that this happened. I think at least info level should be used, maybe even warning. It is an open question whether we should go as far as using self.report.warning - in that case the message would appear in the Managed Ingestion UI, which might be overkill. WDYT?

Contributor Author

Thank you for your suggestion. It makes sense to me. I've extended the SnowflakeQueriesExtractorReport to track queries with empty column names. Specifically, I've added a counter (num_queries_with_empty_column_name) and a LossyList (queries_with_empty_column_name) to store the IDs of affected queries. When an empty column name is detected, the source now logs an informational message including the query ID and a note about the performance impact, and updates the new report fields before falling back to SQL parsing.
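The reporting change described in this reply can be sketched as below. The `LossyList` here is a tiny stand-in mimicking DataHub's capped-list utility, and the report class layout is illustrative; only the field names (`num_queries_with_empty_column_name`, `queries_with_empty_column_name`) follow the comment.

```python
from dataclasses import dataclass, field

class LossyList(list):
    """Stand-in for DataHub's LossyList: keeps at most max_elements
    items so the report stays bounded, counting what it drops."""
    def __init__(self, max_elements: int = 10):
        super().__init__()
        self.max_elements = max_elements
        self.dropped = 0

    def append(self, item):
        if len(self) < self.max_elements:
            super().append(item)
        else:
            self.dropped += 1

@dataclass
class QueriesExtractorReport:
    # Counter plus a bounded sample of affected query IDs, as described above.
    num_queries_with_empty_column_name: int = 0
    queries_with_empty_column_name: LossyList = field(default_factory=LossyList)

def record_fallback(report: QueriesExtractorReport, query_id: str) -> None:
    """Called when blank audit-log columns force a SQL-parsing fallback."""
    report.num_queries_with_empty_column_name += 1
    report.queries_with_empty_column_name.append(query_id)
```

The saved query IDs can later be used to fetch the offending queries from the warehouse, as the reviewer suggested.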

query_subject_urns.add(upstream)
if include_fields:
for column in sorted(self.column_usage.get(upstream, [])):
# Skip empty column names to avoid creating invalid URNs
Collaborator

I think we need to print a message here, either warning or info. Same as below.

Contributor Author

Warning logs added. Thank you for the suggestion.
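The second protection layer with the warning log added here might look roughly like this. `make_schema_field_urn` is a simplified stand-in for the real URN builder, used only to show the shape of the invalid URN being avoided.

```python
import logging

logger = logging.getLogger(__name__)

def make_schema_field_urn(dataset_urn: str, column: str) -> str:
    # Simplified stand-in for the real schemaField URN builder.
    return f"urn:li:schemaField:({dataset_urn},{column})"

def field_urns_for(dataset_urn: str, columns):
    """Build schemaField URNs, skipping blank column names with a warning."""
    urns = []
    for column in columns:
        if not column or not column.strip():
            # An empty name would yield "urn:li:schemaField:(...,)",
            # which GMS rejects as an invalid URN.
            logger.warning("Skipping empty column name for %s", dataset_urn)
            continue
        urns.append(make_schema_field_urn(dataset_urn, column))
    return urns
```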

)
column_name = modified_column["columnName"]
if not column_name or not column_name.strip():
has_empty_column = True
Collaborator

I would also add some comment explaining why are we deciding to parse the query ourselves in cases where there are empty column names.

Contributor Author

Added. Thank you for the suggestion.

assert extractor.report.sql_aggregator.num_preparsed_queries == 0


class TestSnowflakeQueryParser:
Collaborator

This test is awesome! Maybe the comments should make it clearer that we are testing the case where Snowflake sends us somehow corrupted results.
Also - why are the imports done inside the functions? Can't we move them to the top?

Contributor Author

Thank you for the comments. Revised.
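A hypothetical pytest-style sketch of the kind of unit test discussed here: feed an audit-log row whose column name is blank and assert the fallback kicks in. The `parse_audit_entry` helper is an illustrative stand-in reproduced inline so the test is self-contained, with imports at module top as the reviewer requested.

```python
def parse_audit_entry(entry):
    # Illustrative: real ingestion logic is more involved.
    for obj in entry.get("direct_objects_accessed", []):
        for col in obj.get("columns", []):
            name = col.get("columnName")
            if not name or not name.strip():
                return "ObservedQuery"
    return "PreparsedQuery"

def test_blank_columns_force_observed_query():
    # Snowflake sends us somehow corrupted results: a whitespace-only
    # column name in the access_history entry.
    entry = {
        "query_text": "SELECT 1",
        "direct_objects_accessed": [
            {"objectName": "db.schema.t", "columns": [{"columnName": "  "}]}
        ],
    }
    assert parse_audit_entry(entry) == "ObservedQuery"
```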

) -> None:
"""Test that QuerySubjects with empty column names doesn't create invalid URNs.
This simulates the Snowflake scenario where DELETE queries return empty column
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about DELETE queries.

Contributor Author

Revised.

"snowflake", "production.dca_core.snowplow_user_engagement_mart__dbt_tmp"
).urn()

# Simulate a DELETE query with subquery where Snowflake returns empty columns
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about DELETE queries.

Contributor Author

Thank you for pointing that out. I revised it to "# Simulate a query where Snowflake's access_history contains empty column names."

upstreams=[upstream_urn],
downstream=downstream_urn,
column_lineage=[
# Snowflake returns empty column names for DELETE operations
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about DELETE queries.
(I mean only the comment, test code is great, please keep it!)

Contributor Author

Makes sense. Revised to "# This simulates a case where an empty column name might be present in the audit log."

This is the scenario that would send invalid URNs to GMS rather than crash in Python,
matching the customer's error: "Provided urn urn:li:schemaField:(...,) is invalid"
Example: SELECT queries on views over Google Sheets or other special data sources
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about Google Sheets or other special sources.

Contributor Author

Thank you for the suggestion. Removed.

).urn()

# Simulate a SELECT query (no downstream) where Snowflake has empty column tracking
# This is common with views over external data sources like Google Sheets
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about Google Sheets or other special sources.

Contributor Author

Removed.

downstream=None, # SELECT query has no downstream
column_lineage=[], # No column lineage because no downstream
column_usage={
# Snowflake returns empty column names for problematic views
Collaborator

I think this is an overstatement - we haven't been able to identify queries or table types for which it happens, let's remove indication about Google Sheets or other special sources.

Contributor Author

Agreed. Revised to "# Simulate a case where an empty column name is present in the audit log." Thanks.

@skrydal skrydal self-requested a review October 28, 2025 22:10
Collaborator

@skrydal skrydal left a comment

I greatly appreciate your meticulous approach to unit tests, which look exactly the way proper unit tests should! I have left just a couple of comments.

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Oct 30, 2025
This commit improves the robustness and clarity of handling empty column names from audit logs during ingestion.

- Refactors audit log parsing to be cleaner and more direct.
- Adds logging and reporting to track fallbacks to manual SQL parsing.
- Improves comments and test code style for better maintainability.
Collaborator

@skrydal skrydal left a comment

LGTM

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Oct 30, 2025