fix(ingest): Handle empty column names from Snowflake access history #15106
base: master
Changes from all commits: ecd54eb, 2b24093, eff19c8, 00e5bfd
```diff
@@ -78,6 +78,7 @@
     ConnectionWrapper,
     FileBackedList,
 )
+from datahub.utilities.lossy_collections import LossyList
 from datahub.utilities.perf_timer import PerfTimer
 
 logger = logging.getLogger(__name__)
 
@@ -169,6 +170,10 @@ class SnowflakeQueriesExtractorReport(Report):
     num_stream_queries_observed: int = 0
     num_create_temp_view_queries_observed: int = 0
     num_users: int = 0
+    num_queries_with_empty_column_name: int = 0
+    queries_with_empty_column_name: LossyList[str] = dataclasses.field(
+        default_factory=LossyList
+    )
 
 
 @dataclass
@@ -626,9 +631,28 @@ def _parse_audit_log_row(
 
             columns = set()
             for modified_column in obj["columns"]:
-                columns.add(
-                    self.identifiers.snowflake_identifier(modified_column["columnName"])
-                )
+                column_name = modified_column["columnName"]
+                # An empty column name in the audit log would cause an error when creating column URNs.
+                # To avoid this and still extract lineage, the raw query text is parsed as a fallback.
+                if not column_name or not column_name.strip():
+                    query_id = res["query_id"]
+                    self.report.num_queries_with_empty_column_name += 1
+                    self.report.queries_with_empty_column_name.append(query_id)
+                    logger.info(f"Query {query_id} has empty column name in audit log.")
+
+                    return ObservedQuery(
+                        query=query_text,
+                        session_id=res["session_id"],
+                        timestamp=timestamp,
+                        user=user,
+                        default_db=res["default_db"],
+                        default_schema=res["default_schema"],
+                        query_hash=get_query_fingerprint(
+                            query_text, self.identifiers.platform, fast=True
+                        ),
+                        extra_info=extra_info,
+                    )
+                columns.add(self.identifiers.snowflake_identifier(column_name))
 
             upstreams.append(dataset)
             column_usage[dataset] = columns
```

> **Review comment** on the `if not column_name or not column_name.strip():` line: Thank you for addressing also the case of non-empty but whitespace-only column names!
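The emptiness check above treats `None`, empty strings, and whitespace-only strings identically. A minimal, self-contained sketch of that predicate (the helper name is mine, not from the PR):

```python
def is_effectively_empty(column_name):
    """True for None, empty strings, and whitespace-only strings.

    Mirrors the `not column_name or not column_name.strip()` check
    used in the diff above; the helper name itself is hypothetical.
    """
    return not column_name or not column_name.strip()

# A column name made only of spaces/tabs counts as empty, so it
# triggers the SQL-parsing fallback instead of column URN creation.
print(is_effectively_empty("   \t"))
```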
```diff
@@ -168,13 +168,28 @@ def get_subjects(
                 query_subject_urns.add(upstream)
                 if include_fields:
                     for column in sorted(self.column_usage.get(upstream, [])):
+                        # Skip empty column names to avoid creating invalid URNs
+                        if not column or not column.strip():
+                            logger.warning(
+                                f"Skipping empty upstream column name for query {self.query_id} on upstream {upstream}"
+                            )
+                            continue
                         query_subject_urns.add(
                             builder.make_schema_field_urn(upstream, column)
                         )
         if downstream_urn:
             query_subject_urns.add(downstream_urn)
             if include_fields:
                 for column_lineage in self.column_lineage:
+                    # Skip empty downstream columns to avoid creating invalid URNs
+                    if (
+                        not column_lineage.downstream.column
+                        or not column_lineage.downstream.column.strip()
+                    ):
+                        logger.warning(
+                            f"Skipping empty downstream column name for query {self.query_id} on downstream {downstream_urn}"
+                        )
+                        continue
                     query_subject_urns.add(
                         builder.make_schema_field_urn(
                             downstream_urn, column_lineage.downstream.column
```

> **Review comment** on the skip-empty-column lines: I think we need to print a message here, either …
>
> **Author reply:** Warning logs added. Thank you for the suggestion.
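The filtering logic in `get_subjects` can be illustrated in isolation. In this sketch, `make_schema_field_urn` is a simplified stand-in for DataHub's builder function and `collect_field_urns` is a hypothetical helper; the real URN format and call sites differ:

```python
def make_schema_field_urn(dataset_urn: str, column: str) -> str:
    # Simplified stand-in for DataHub's builder.make_schema_field_urn;
    # the real URN encoding is more involved.
    return f"urn:li:schemaField:({dataset_urn},{column})"

def collect_field_urns(dataset_urn, columns):
    """Build field URNs, skipping empty or whitespace-only column names."""
    urns = []
    for column in sorted(columns):
        if not column or not column.strip():
            # Mirrors the PR: skip rather than emit an invalid URN.
            continue
        urns.append(make_schema_field_urn(dataset_urn, column))
    return urns

print(collect_field_urns("urn:li:dataset:d1", ["b", "", "   ", "a"]))
```

Only the two non-empty names survive; the empty and whitespace-only entries are dropped before URN construction.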
> **Review comment:** We need to make it very visible that we decided to parse the query, for which we would otherwise use info coming directly from the audit log; this is for 2 reasons. So to meet the above conditions we need to:
>
> - Extend the `report` object for the Snowflake source, so that we can keep count of such queries. Maybe saving the `query_id` for each query which was forced to be parsed would be a good idea; use `LossyList` to not store too many. Such a `query_id` could be used to retrieve the actual query from the warehouse.
> - `info` level should be used, maybe even `warning`. It is an open question whether we should go as far as using `self.report.warning`; in that case the message would appear in the Managed Ingestion UI, and maybe that would be overkill. WDYT?

> **Author reply:** Thank you for your suggestion. It makes sense to me. I've extended the `SnowflakeQueriesExtractorReport` to track queries with empty column names. Specifically, I've added a counter (`num_queries_with_empty_column_name`) and a `LossyList` (`queries_with_empty_column_name`) to store the IDs of affected queries. When an empty column name is detected, the source now logs an informational message including the query ID and a note about the performance impact, and updates the new report fields before falling back to SQL parsing.
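The report extension described in the reply can be sketched without DataHub installed. Here `BoundedList` is a stand-in I am introducing for `datahub.utilities.lossy_collections.LossyList` (the real class samples rather than simply truncating), and `record_empty_column` is a hypothetical helper, not the PR's actual API:

```python
import dataclasses
from typing import List

@dataclasses.dataclass
class BoundedList:
    """Stand-in for DataHub's LossyList: keeps at most `max_elements`
    items so the report cannot grow without bound."""
    max_elements: int = 10
    items: List[str] = dataclasses.field(default_factory=list)
    dropped: int = 0

    def append(self, item: str) -> None:
        if len(self.items) < self.max_elements:
            self.items.append(item)
        else:
            # The real LossyList keeps a sample; this sketch just counts.
            self.dropped += 1

@dataclasses.dataclass
class ExtractorReport:
    """Sketch of the two fields the PR adds to SnowflakeQueriesExtractorReport."""
    num_queries_with_empty_column_name: int = 0
    queries_with_empty_column_name: BoundedList = dataclasses.field(
        default_factory=BoundedList
    )

    def record_empty_column(self, query_id: str) -> None:
        # Called when an audit-log row has an empty column name,
        # right before falling back to parsing the raw query text.
        self.num_queries_with_empty_column_name += 1
        self.queries_with_empty_column_name.append(query_id)

report = ExtractorReport()
for qid in ["q-1", "q-2", "q-3"]:
    report.record_empty_column(qid)
```

Keeping both a counter and the (bounded) list of query IDs means the total impact is always visible even when the list is capped, and a stored `query_id` can be used to retrieve the actual query from the warehouse.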