Conversation

@AntoineGlacet
Contributor

Summary

Adds BigQuery lineage extraction support for PowerBI .pbit template files.

Relates to #25509

Motivation

PowerBI .pbit files containing BigQuery data sources currently have no lineage extracted. The parser already supports Snowflake, Redshift, and Databricks, but BigQuery support was missing. This adds parity for multi-cloud environments.

Implementation

1. BigQuery Expression Parser (metadata.py)

Added _parse_bigquery_source() method following the same pattern as _parse_snowflake_source() and _parse_redshift_source().

Key features:

  • Pattern matching: Extracts [Name="...", Kind="..."] patterns from GoogleBigQuery.Database() expressions
  • Expression reference resolution: Recursively resolves indirect references like Source = S_PJ_CODE
  • Database/Schema/Table extraction:
    • Project: [Name="project"] (no Kind attribute)
    • Dataset: [Name="dataset", Kind="Schema"]
    • Table: [Name="table", Kind="Table"] or Kind="View"

Example expression parsed:

GoogleBigQuery.Database(
  [Name="kap-nami-prod"],
  [Name="iruca_aligned", Kind="Schema"],  
  [Name="dbo_S_PJ_CODE", Kind="Table"]
)
# Returns: [{"database": "kap-nami-prod", "schema": "iruca_aligned", "table": "dbo_S_PJ_CODE"}]
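To illustrate the pattern matching described above, here is a minimal standalone sketch. The helper name and exact regex are illustrative; the real _parse_bigquery_source() is a method on the source class and adds error handling and debug logging:

```python
import re

def parse_bigquery_expression(expression):
    """Extract project/dataset/table from a GoogleBigQuery.Database() M
    expression. Hypothetical standalone sketch of the parsing approach."""
    if "GoogleBigQuery.Database" not in expression:
        return None
    # Capture every [Name="...", Kind="..."] pair; Kind is optional
    matches = re.findall(r'\[Name="([^"]+)"(?:\s*,\s*Kind="([^"]+)")?\]', expression)
    result = {"database": None, "schema": None, "table": None}
    for name, kind in matches:
        if not kind and result["database"] is None:
            result["database"] = name  # project: Name with no Kind attribute
        elif kind == "Schema":
            result["schema"] = name    # dataset
        elif kind in ("Table", "View"):
            result["table"] = name
    return result
```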

2. Partitions Support (models.py)

Many BigQuery connections in .pbit files use partitions instead of direct source references.

Added:

  • PowerBIPartition Pydantic model with name/mode/source fields
  • partitions field to PowerBiTable model
  • @model_validator to automatically extract source from partitions[0].source when main source is None

Why this matters:
Without partition support, tables with partition-based sources would have source=None, preventing lineage extraction.
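The fallback logic can be sketched independently of Pydantic. In the actual change it lives in a @model_validator(mode='before') on PowerBiTable; the body below is an illustrative approximation:

```python
def extract_source_from_partitions(values):
    """Sketch of the mode='before' validator body: when the table's own
    source is missing, fall back to the first partition's source."""
    if values.get("source") is None and values.get("partitions"):
        first_partition = values["partitions"][0]
        # Tolerate both raw dicts and already-parsed model instances
        if isinstance(first_partition, dict):
            partition_source = first_partition.get("source")
        else:
            partition_source = getattr(first_partition, "source", None)
        if partition_source:
            values["source"] = [partition_source]
    return values
```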

Changes

Commit 1: BigQuery Parser

  • ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py (+97)
    • Added _parse_bigquery_source() method with docstring and examples
    • Integrated into parse_table_name_from_source() flow
    • Proper error handling and debug logging

Commit 2: Partitions Support

  • ingestion/src/metadata/ingestion/source/dashboard/powerbi/models.py (+25, -1)
    • Added PowerBIPartition model
    • Added partitions: Optional[List[PowerBIPartition]] to PowerBiTable
    • Added @model_validator(mode='before') for automatic source extraction

Testing

Docker Integration Testing

Tested with production .pbit file (Monthly Financial_trusted.pbit):

BigQuery lineage detected:

✅ Source: kap-nami-prod.iruca_aligned.dbo_S_PJ_CODE
   Table: Map PJ Code Master
   Method: Expression reference resolution (Source = S_PJ_CODE → GoogleBigQuery.Database())

Partitions extraction:

✅ Table: Map PJ Code Master
   - Source extracted from partitions[0].source
   - Expression normalized from list to string (via PR #25510)
   - Lineage parsed successfully

Data source summary:

  • 1 BigQuery connection - ✅ 100% lineage coverage
  • 11 SharePoint/Excel sources - No DB lineage (expected)
  • 29 embedded/calculated tables - No DB lineage (expected)

Code Quality

  • ✅ Follows existing parser patterns (_parse_snowflake_source, _parse_redshift_source)
  • ✅ Pydantic v2 best practices (mode='before', @classmethod)
  • ✅ Comprehensive error handling with try/except
  • ✅ Debug logging at key decision points
  • ✅ Recursive reference resolution for indirect sources
  • ✅ Non-breaking change (only activates for BigQuery expressions)

Backward Compatibility

100% backward compatible - Only affects .pbit files with BigQuery sources. All existing Snowflake/Redshift/Databricks parsing unchanged.

Checklist

  • Follows established code patterns
  • Integration tested with real .pbit files
  • Proper error handling and logging
  • Pydantic v2 compliant
  • Non-breaking addition
  • Documented with docstrings and examples

Fixes open-metadata#25483

Problem:
- PowerBI .pbit files store multiline DAX expressions as JSON arrays (one string per line)
- Pydantic validation failed with 'Input should be a valid string [type=string_type, input_value=[...], input_type=list]'
- Parser could not ingest .pbit files with multiline DAX measures, source expressions, or dataset expressions

Solution:
- Added Pydantic field_validator decorators to normalize list expressions to multiline strings
- Updated PowerBiMeasures.expression to accept Union[str, List[str]]
- Updated PowerBITableSource.expression to accept Union[str, List[str]]
- Updated DatasetExpression.expression to accept Union[str, List[str]]
- Made expression optional for PowerBiMeasures to handle measures without expressions
- Updated _get_child_measures() to handle None expression values
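The normalization step can be sketched as a plain function; in the PR it is wired up through Pydantic field_validator decorators on the three expression fields listed above:

```python
def normalize_expression(value):
    """Sketch of the list-to-string normalization: .pbit files store
    multiline DAX/M expressions as JSON arrays with one string per line."""
    if isinstance(value, list):
        return "\n".join(str(line) for line in value)
    return value  # already a string, or None for measures without expressions
```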

Testing:
- Successfully parsed .pbit file with 41 tables, 93 measures (72 multiline), 32 multiline sources
- Added 9 new unit tests for validators
- Added integration test case for multiline DAX expressions
- All existing tests pass (backward compatible)

Files changed:
- ingestion/src/metadata/ingestion/source/dashboard/powerbi/models.py
- ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py
- ingestion/tests/unit/test_powerbi_table_measures.py

Adds support for parsing BigQuery connections in PowerBI .pbit files to
enable lineage tracking from BigQuery tables/views to PowerBI tables.

Features:
- Parses GoogleBigQuery.Database() Power Query M expressions
- Resolves dataset expression references (e.g., Source = S_PJ_CODE)
- Extracts BigQuery project, dataset, and table information
- Handles both direct connections and indirect references through expressions
- Follows existing pattern for Snowflake, Redshift, and Databricks

Implementation:
- Added _parse_bigquery_source() method to parse BigQuery M expressions
- Integrated into parse_table_name_from_source() lineage flow
- Recursively resolves expression references to find BigQuery connections
- Uses regex patterns to extract: [Name="project"], [Name="dataset",Kind="Schema"], [Name="table",Kind="Table"]

Testing:
- Verified with Monthly Financial_trusted.pbit file containing BigQuery connections
- Successfully parsed lineage: kap-nami-prod.iruca_aligned.dbo_S_PJ_CODE → Map PJ Code Master
- Expression resolution tested and working

Example lineage chain:
  BigQuery: project.dataset.table
     ↓ (via dataset expression)
  Expression: S_PJ_CODE
     ↓
  PowerBI Table: Map PJ Code Master
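The reference-resolution step in the chain above can be sketched as follows. The dataset_expressions mapping and the helper name are illustrative; the real implementation resolves references inside the parse_table_name_from_source() flow:

```python
import re

def resolve_reference(expression, dataset_expressions, depth=0):
    """Follow indirect references like `Source = S_PJ_CODE` until an
    expression containing GoogleBigQuery.Database() is found."""
    if "GoogleBigQuery.Database" in expression or depth > 5:
        return expression
    match = re.search(r'Source\s*=\s*([#"\w\s]+?)\s*(?:,|$)', expression)
    if match:
        # Strip quoting artifacts such as #"My Ref" around the reference name
        ref_name = re.sub(r'^[#"]+|[#"]+$', "", match.group(1).strip()).strip()
        resolved = dataset_expressions.get(ref_name)
        if resolved:
            return resolve_reference(resolved, dataset_expressions, depth + 1)
    return expression
```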

Files changed:
- ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py

.pbit files store table source information in partitions[0].source instead of
directly in the table.source field. This commit adds partition support and
automatically extracts source from partitions when the source field is empty.

Changes:
- Added PowerBIPartition model to represent table partitions in .pbit files
- Added partitions field to PowerBiTable model
- Added model_validator to extract source from partitions[0] when table.source is None
- This enables lineage parsing for .pbit files where source is in partitions

This fixes the issue where BigQuery lineage was not being detected even though
the parsing logic was correct - the source field was simply not being populated.

Testing:
- Verified Map PJ Code Master table now has source populated from partitions
- Confirmed BigQuery lineage detection works: kap-nami-prod.iruca_aligned.dbo_S_PJ_CODE
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@gitar-bot

gitar-bot bot commented Jan 25, 2026

Code Review: 👍 Approved with suggestions (0 resolved / 3 findings)

Well-structured BigQuery lineage parser following existing patterns. Three minor suggestions for defensive coding around edge cases in regex parsing and partition source extraction.

💡 Edge Case: Partition source extraction assumes dict structure

📄 ingestion/src/metadata/ingestion/source/dashboard/powerbi/models.py:196-200

The extract_source_from_partitions validator accesses partitions[0].get("source") assuming the partition is a dict. However, when Pydantic processes nested models in mode='before' validators, the inner objects might already be parsed into Pydantic models (e.g., PowerBIPartition instances) rather than dicts, depending on how the data is constructed.

If partitions[0] is already a PowerBIPartition instance (not a dict), calling .get("source") will raise an AttributeError.

Suggested fix:
Handle both dict and model instance cases:

@model_validator(mode='before')
@classmethod
def extract_source_from_partitions(cls, values):
    if isinstance(values, dict):
        if values.get("source") is None and values.get("partitions"):
            partitions = values.get("partitions", [])
            if partitions and len(partitions) > 0:
                first_partition = partitions[0]
                if isinstance(first_partition, dict):
                    partition_source = first_partition.get("source")
                elif hasattr(first_partition, "source"):
                    partition_source = first_partition.source
                else:
                    partition_source = None
                if partition_source:
                    values["source"] = [partition_source]
    return values

This is likely low-impact since mode='before' typically receives raw data, but defensive coding would prevent future regressions.

💡 Quality: Regex pattern may capture trailing spaces or quotes

📄 ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py:980-985

The regex pattern r'Source\s*=\s*([A-Za-z0-9_#"&\s]+?)\s*,' uses a character class that includes whitespace (\s) and quotes ("). The subsequent cleanup with .strip().strip('"').strip('#').strip('"') handles some cases, but the order of operations may not fully clean all edge cases.

For example, if the matched string is ` "MyRef" ` (surrounding whitespace and quotes), the current cleanup chain:

  1. .strip() → "MyRef" (quotes remain)
  2. .strip('"') → MyRef (outer quotes removed)
  3. .strip('#') → MyRef
  4. .strip('"') → MyRef

However, patterns like #"My Ref" would result in My Ref" after processing, because .strip('#') only removes leading/trailing #, not the quote that follows it.

Suggested improvement:
Use a more targeted regex or refine the cleanup:

ref_name = source_ref_match.group(1).strip()
# Remove surrounding quotes and hash symbols commonly found in M expressions
ref_name = re.sub(r'^[#"]+|[#"]+$', '', ref_name).strip()

This is minor since the current implementation likely works for common cases, but it could cause issues with certain M expression naming conventions.

💡 Edge Case: BigQuery parser extracts first Name without Kind as project

📄 ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py:1022-1033

The logic assumes the first [Name="..."] pattern without a Kind attribute is the project. This works for the documented pattern but may incorrectly identify the project if the expression contains other Name patterns without Kind before the actual project identifier.

For example, consider an edge case where the expression contains metadata or comments with [Name="something"] patterns before the actual BigQuery connection:

/* Config [Name="metadata"] */ GoogleBigQuery.Database()[Name="actual-project"]...

The current implementation would incorrectly identify "metadata" as the project.

Suggested improvement:
Consider parsing only after detecting GoogleBigQuery.Database by splitting/finding that substring first:

# Find the BigQuery portion of the expression
bq_start = source_expression.find("GoogleBigQuery.Database")
if bq_start >= 0:
    bq_expression = source_expression[bq_start:]
    name_matches = re.findall(r'\[Name="([^"]+)"(?:,Kind="([^"]+)")?\]', bq_expression)

This is minor risk as real-world .pbit files likely don't have this pattern, but it would make the parser more robust.

