Feat: Add BigQuery lineage support for PowerBI .pbit files #25511
base: main
Fixes open-metadata#25483
Problem:
- PowerBI .pbit files store multiline DAX expressions as JSON arrays (one string per line)
- Pydantic validation failed with 'Input should be a valid string [type=string_type, input_value=[...], input_type=list]'
- Parser could not ingest .pbit files with multiline DAX measures, source expressions, or dataset expressions
Solution:
- Added Pydantic field_validator decorators to normalize list expressions to multiline strings
- Updated PowerBiMeasures.expression to accept Union[str, List[str]]
- Updated PowerBITableSource.expression to accept Union[str, List[str]]
- Updated DatasetExpression.expression to accept Union[str, List[str]]
- Made expression optional for PowerBiMeasures to handle measures without expressions
- Updated _get_child_measures() to handle None expression values
Testing:
- Successfully parsed .pbit file with 41 tables, 93 measures (72 multiline), 32 multiline sources
- Added 9 new unit tests for validators
- Added integration test case for multiline DAX expressions
- All existing tests pass (backward compatible)
Files changed:
- ingestion/src/metadata/ingestion/source/dashboard/powerbi/models.py
- ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py
- ingestion/tests/unit/test_powerbi_table_measures.py
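The list-to-string normalization described in this commit can be sketched in plain Python. This is a minimal illustration and not the code in models.py: `normalize_expression` is a hypothetical helper standing in for the body of the Pydantic validators applied to the `expression` fields.

```python
from typing import List, Optional, Union


def normalize_expression(value: Union[str, List[str], None]) -> Optional[str]:
    """Join a list-of-lines expression into one multiline string.

    Hypothetical helper: it mirrors what a validator on `expression` could do.
    """
    if value is None:
        # Measures without an expression stay None.
        return None
    if isinstance(value, list):
        # .pbit files store one string per line of the expression.
        return "\n".join(str(line) for line in value)
    # Plain strings pass through unchanged (backward compatible).
    return value


multiline = ["VAR total = SUM(Sales[Amount])", "RETURN total"]
print(normalize_expression(multiline))
```

In the actual models the same logic would presumably live inside a validator that runs before type coercion, so that downstream code only ever sees a string or None.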
Adds support for parsing BigQuery connections in PowerBI .pbit files to
enable lineage tracking from BigQuery tables/views to PowerBI tables.
Features:
- Parses GoogleBigQuery.Database() Power Query M expressions
- Resolves dataset expression references (e.g., Source = S_PJ_CODE)
- Extracts BigQuery project, dataset, and table information
- Handles both direct connections and indirect references through expressions
- Follows existing pattern for Snowflake, Redshift, and Databricks
Implementation:
- Added _parse_bigquery_source() method to parse BigQuery M expressions
- Integrated into parse_table_name_from_source() lineage flow
- Recursively resolves expression references to find BigQuery connections
- Uses regex patterns to extract: [Name="project"], [Name="dataset",Kind="Schema"], [Name="table",Kind="Table"]
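A rough sketch of the regex-based extraction listed above, assuming the `[Name=...,Kind=...]` layout from this commit message. The function name `parse_bigquery_source`, the sample M expression, and the returned dict shape are illustrative assumptions, not the code in metadata.py.

```python
import re
from typing import Dict, Optional

# Matches [Name="..."] with an optional Kind attribute; tolerates a space after the comma.
NAME_KIND_PATTERN = re.compile(r'\[Name="([^"]+)"(?:,\s*Kind="([^"]+)")?\]')


def parse_bigquery_source(expression: str) -> Optional[Dict[str, str]]:
    """Hypothetical stand-in for _parse_bigquery_source(); None if not BigQuery."""
    start = expression.find("GoogleBigQuery.Database")
    if start < 0:
        return None
    result: Dict[str, str] = {}
    # findall yields (name, kind) tuples; kind is "" when the attribute is absent.
    for name, kind in NAME_KIND_PATTERN.findall(expression[start:]):
        if kind == "":
            result.setdefault("project", name)
        elif kind == "Schema":
            result.setdefault("dataset", name)
        elif kind in ("Table", "View"):
            result.setdefault("table", name)
    return result or None


# Assumed sample expression; real .pbit content may differ.
sample = '''let
    Source = GoogleBigQuery.Database(),
    proj = Source{[Name="my-project"]}[Data],
    ds = proj{[Name="my_dataset",Kind="Schema"]}[Data],
    tbl = ds{[Name="my_table",Kind="Table"]}[Data]
in
    tbl'''
print(parse_bigquery_source(sample))
```

`setdefault` keeps the first match for each role, so a later navigation step cannot overwrite the project or dataset already found.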
Testing:
- Verified with Monthly Financial_trusted.pbit file containing BigQuery connections
- Successfully parsed lineage: kap-nami-prod.iruca_aligned.dbo_S_PJ_CODE → Map PJ Code Master
- Expression resolution tested and working
Example lineage chain:
BigQuery: project.dataset.table
↓ (via dataset expression)
Expression: S_PJ_CODE
↓
PowerBI Table: Map PJ Code Master
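The chain above can be followed with a small recursive lookup. The sketch below is an assumption about how such resolution might work: `resolve_bigquery_expression`, the `dataset_expressions` mapping (expression name to M text), and the simplified reference regex are hypothetical, and quoted identifiers such as `#"My Name"` are not handled.

```python
import re
from typing import Dict, Optional, Set

# Matches a bare reference such as `Source = S_PJ_CODE` (an identifier, not a call).
SOURCE_REF_PATTERN = re.compile(
    r"Source[ \t]*=[ \t]*([A-Za-z_]\w*)[ \t]*(?:,|$)", re.MULTILINE
)


def resolve_bigquery_expression(
    expression: str,
    dataset_expressions: Dict[str, str],
    seen: Optional[Set[str]] = None,
) -> Optional[str]:
    """Follow `Source = NAME` references until a BigQuery expression is found."""
    seen = set() if seen is None else seen
    if "GoogleBigQuery.Database" in expression:
        return expression
    match = SOURCE_REF_PATTERN.search(expression)
    if match is None:
        return None
    name = match.group(1)
    # Stop on unknown names and on cycles (A -> B -> A).
    if name in seen or name not in dataset_expressions:
        return None
    seen.add(name)
    return resolve_bigquery_expression(dataset_expressions[name], dataset_expressions, seen)


expressions = {"S_PJ_CODE": "let Source = GoogleBigQuery.Database() in Source"}
table_source = "let\n    Source = S_PJ_CODE\nin\n    Source"
print(resolve_bigquery_expression(table_source, expressions))
```

Tracking visited names in `seen` keeps the recursion finite even if two dataset expressions refer to each other.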
Files changed:
- ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py
.pbit files store table source information in partitions[0].source instead of directly in the table.source field. This commit adds partition support and automatically extracts source from partitions when the source field is empty.
Changes:
- Added PowerBIPartition model to represent table partitions in .pbit files
- Added partitions field to PowerBiTable model
- Added model_validator to extract source from partitions[0] when table.source is None
- This enables lineage parsing for .pbit files where source is in partitions
This fixes the issue where BigQuery lineage was not being detected even though the parsing logic was correct - the source field was simply not being populated.
Testing:
- Verified Map PJ Code Master table now has source populated from partitions
- Confirmed BigQuery lineage detection works: kap-nami-prod.iruca_aligned.dbo_S_PJ_CODE
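The partition fallback can be illustrated on plain dictionaries, without Pydantic. This is a sketch under assumptions: the function below is a hypothetical stand-in for the model validator, and the sample table payload is invented for illustration.

```python
from typing import Any, Dict


def extract_source_from_partitions(table: Dict[str, Any]) -> Dict[str, Any]:
    """Copy partitions[0]['source'] into table['source'] when source is missing."""
    if table.get("source") is None:
        partitions = table.get("partitions") or []
        if partitions and isinstance(partitions[0], dict):
            partition_source = partitions[0].get("source")
            if partition_source:
                # Wrapped in a list, assuming table.source holds a list of sources.
                table["source"] = [partition_source]
    return table


# Assumed shape of a .pbit table entry; field values are illustrative.
raw_table = {
    "name": "Map PJ Code Master",
    "partitions": [
        {"name": "p1", "mode": "import", "source": {"expression": "Source = S_PJ_CODE"}}
    ],
}
print(extract_source_from_partitions(raw_table)["source"])
```

An existing `source` value is left untouched, which is what keeps the change backward compatible for files that already populate the field.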
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Let us know if you need any help!
Code Review 👍 Approved with suggestions (0 resolved / 3 findings)
Well-structured BigQuery lineage parser following existing patterns. Three minor suggestions for defensive coding around edge cases in regex parsing and partition source extraction.
💡 Edge Case: Partition source extraction assumes dict structure
📄 ingestion/src/metadata/ingestion/source/dashboard/powerbi/models.py:196-200
The validator assumes each partition entry is a dict. Suggested fix:
    @model_validator(mode='before')
    @classmethod
    def extract_source_from_partitions(cls, values):
        if isinstance(values, dict):
            if values.get("source") is None and values.get("partitions"):
                partitions = values.get("partitions", [])
                if partitions and len(partitions) > 0:
                    first_partition = partitions[0]
                    if isinstance(first_partition, dict):
                        partition_source = first_partition.get("source")
                    elif hasattr(first_partition, "source"):
                        partition_source = first_partition.source
                    else:
                        partition_source = None
                    if partition_source:
                        values["source"] = [partition_source]
        return values
This is likely low-impact.
💡 Quality: Regex pattern may capture trailing spaces or quotes
📄 ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py:980-985
The regex pattern may leave trailing spaces or quotes on the captured reference name. Suggested improvement:
    ref_name = source_ref_match.group(1).strip()
    # Remove surrounding quotes and hash symbols commonly found in M expressions
    ref_name = re.sub(r'^[#"]+|[#"]+$', '', ref_name).strip()
This is minor since the current implementation likely works for common cases, but it could cause issues with certain M expression naming conventions.
💡 Edge Case: BigQuery parser extracts first Name without Kind as project
📄 ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py:1022-1033
The logic assumes the first Name entry without a Kind is the project. In an edge case where the expression contains metadata or comments with their own Name entry, the current implementation would incorrectly identify "metadata" as the project. Suggested improvement:
    # Find the BigQuery portion of the expression
    bq_start = source_expression.find("GoogleBigQuery.Database")
    if bq_start >= 0:
        bq_expression = source_expression[bq_start:]
        name_matches = re.findall(r'\[Name="([^"]+)"(?:,Kind="([^"]+)")?\]', bq_expression)
This is minor risk as real-world .pbit files likely don't have this pattern, but it would make the parser more robust.
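The second and third review suggestions can be tried in isolation. The sketch below is illustrative only: the helper names and the sample expression (with a leading `Name="metadata"` entry) are assumptions made up to reproduce the edge case the review describes.

```python
import re
from typing import Optional

NAME_KIND = r'\[Name="([^"]+)"(?:,Kind="([^"]+)")?\]'


def clean_reference_name(raw: str) -> str:
    """Strip whitespace, then surrounding quotes and hash symbols."""
    name = raw.strip()
    return re.sub(r'^[#"]+|[#"]+$', "", name).strip()


def first_project_name(expression: str, scoped: bool) -> Optional[str]:
    """Return the first Name without a Kind, optionally searching only the BigQuery part."""
    if scoped:
        start = expression.find("GoogleBigQuery.Database")
        if start < 0:
            return None
        expression = expression[start:]
    for name, kind in re.findall(NAME_KIND, expression):
        if kind == "":
            return name
    return None


# Hypothetical expression where a non-BigQuery Name entry comes first.
tricky = 'meta = [Name="metadata"], Source = GoogleBigQuery.Database(), p = Source{[Name="my-project"]}[Data]'
print(first_project_name(tricky, scoped=False))
print(first_project_name(tricky, scoped=True))
print(clean_reference_name('#"My Expression" '))
```

Scoping the search to the text after `GoogleBigQuery.Database` is what prevents the unrelated entry from being taken as the project.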
Summary
Adds BigQuery lineage extraction support for PowerBI .pbit template files.
Relates to #25509
Motivation
PowerBI .pbit files containing BigQuery data sources currently have no lineage extracted. The parser supports Snowflake, Redshift, and Databricks, but BigQuery was missing. This adds parity for multi-cloud environments.
Implementation
1. BigQuery Expression Parser (metadata.py)
Added _parse_bigquery_source() method following the same pattern as _parse_snowflake_source() and _parse_redshift_source().
Key features:
- Parses [Name="...", Kind="..."] patterns from GoogleBigQuery.Database() expressions
- Resolves expression references such as Source = S_PJ_CODE
- Project: [Name="project"] (no Kind attribute)
- Dataset: [Name="dataset", Kind="Schema"]
- Table: [Name="table", Kind="Table"] or Kind="View"
2. Partitions Support (models.py)
Many BigQuery connections in .pbit files use partitions instead of direct source references.
Added:
- PowerBIPartition Pydantic model with name/mode/source fields
- partitions field to PowerBiTable model
- @model_validator to automatically extract source from partitions[0].source when main source is None
Why this matters:
Without partition support, tables with partition-based sources would have source=None, preventing lineage extraction.
Changes
Commit 1: BigQuery Parser
- Added _parse_bigquery_source() method with docstring and examples
- Integrated into parse_table_name_from_source() flow
Commit 2: Partitions Support
- Added PowerBIPartition model
- Added partitions: Optional[List[PowerBIPartition]] to PowerBiTable
- Added @model_validator(mode='before') for automatic source extraction
Testing
Docker Integration Testing
Tested with production .pbit file (Monthly Financial_trusted.pbit):
BigQuery lineage detected:
Partitions extraction:
Data source summary:
Code Quality
- Follows the existing parser pattern (_parse_snowflake_source, _parse_redshift_source)
- Validator defined with mode='before' and @classmethod
Backward Compatibility
✅ 100% backward compatible - Only affects .pbit files with BigQuery sources. All existing Snowflake/Redshift/Databricks parsing unchanged.
Checklist