Conversation

@AntoineGlacet
Contributor

Summary

Adds BigQuery lineage extraction support for PowerBI .pbit template files.

Relates to #25509

Motivation

PowerBI .pbit files containing BigQuery data sources currently have no lineage extracted. The parser already supports Snowflake, Redshift, and Databricks, but BigQuery support was missing. This adds parity for multi-cloud environments.

Implementation

1. BigQuery Expression Parser (metadata.py)

Added _parse_bigquery_source() method following the same pattern as _parse_snowflake_source() and _parse_redshift_source().

Key features:

  • Pattern matching: Extracts [Name="...", Kind="..."] patterns from GoogleBigQuery.Database() expressions
  • Expression reference resolution: Recursively resolves indirect references like Source = S_PJ_CODE
  • Database/Schema/Table extraction:
    • Project: [Name="project"] (no Kind attribute)
    • Dataset: [Name="dataset", Kind="Schema"]
    • Table: [Name="table", Kind="Table"] or Kind="View"

Example expression parsed:

GoogleBigQuery.Database(
  [Name="kap-nami-prod"],
  [Name="iruca_aligned", Kind="Schema"],  
  [Name="dbo_S_PJ_CODE", Kind="Table"]
)
# Returns: [{"database": "kap-nami-prod", "schema": "iruca_aligned", "table": "dbo_S_PJ_CODE"}]
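To illustrate the pattern matching described above, here is a minimal standalone sketch. The helper name and exact regex are illustrative; the real _parse_bigquery_source() is a method on the source class and adds error handling and debug logging:

```python
import re

def parse_bigquery_expression(expression):
    """Extract project/dataset/table from a GoogleBigQuery.Database() M
    expression. Hypothetical standalone sketch of the parsing approach."""
    if "GoogleBigQuery.Database" not in expression:
        return None
    # Capture every [Name="...", Kind="..."] pair; Kind is optional
    matches = re.findall(r'\[Name="([^"]+)"(?:\s*,\s*Kind="([^"]+)")?\]', expression)
    result = {"database": None, "schema": None, "table": None}
    for name, kind in matches:
        if not kind and result["database"] is None:
            result["database"] = name  # project: Name with no Kind attribute
        elif kind == "Schema":
            result["schema"] = name    # dataset
        elif kind in ("Table", "View"):
            result["table"] = name
    return result
```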

2. Partitions Support (models.py)

Many BigQuery connections in .pbit files use partitions instead of direct source references.

Added:

  • PowerBIPartition Pydantic model with name/mode/source fields
  • partitions field to PowerBiTable model
  • @model_validator to automatically extract source from partitions[0].source when main source is None

Why this matters:
Without partition support, tables with partition-based sources would have source=None, preventing lineage extraction.
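The fallback logic can be sketched independently of Pydantic. In the actual change it lives in a @model_validator(mode='before') on PowerBiTable; the body below is an illustrative approximation:

```python
def extract_source_from_partitions(values):
    """Sketch of the mode='before' validator body: when the table's own
    source is missing, fall back to the first partition's source."""
    if values.get("source") is None and values.get("partitions"):
        first_partition = values["partitions"][0]
        # Tolerate both raw dicts and already-parsed model instances
        if isinstance(first_partition, dict):
            partition_source = first_partition.get("source")
        else:
            partition_source = getattr(first_partition, "source", None)
        if partition_source:
            values["source"] = [partition_source]
    return values
```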

Changes

Commit 1: BigQuery Parser

  • ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py (+97)
    • Added _parse_bigquery_source() method with docstring and examples
    • Integrated into parse_table_name_from_source() flow
    • Proper error handling and debug logging

Commit 2: Partitions Support

  • ingestion/src/metadata/ingestion/source/dashboard/powerbi/models.py (+25, -1)
    • Added PowerBIPartition model
    • Added partitions: Optional[List[PowerBIPartition]] to PowerBiTable
    • Added @model_validator(mode='before') for automatic source extraction

Testing

Docker Integration Testing

Tested with production .pbit file (Monthly Financial_trusted.pbit):

BigQuery lineage detected:

✅ Source: kap-nami-prod.iruca_aligned.dbo_S_PJ_CODE
   Table: Map PJ Code Master
   Method: Expression reference resolution (Source = S_PJ_CODE → GoogleBigQuery.Database())

Partitions extraction:

✅ Table: Map PJ Code Master
   - Source extracted from partitions[0].source
   - Expression normalized from list to string (via PR #25510)
   - Lineage parsed successfully

Data source summary:

  • 1 BigQuery connection - ✅ 100% lineage coverage
  • 11 SharePoint/Excel sources - No DB lineage (expected)
  • 29 embedded/calculated tables - No DB lineage (expected)

Code Quality

  • ✅ Follows existing parser patterns (_parse_snowflake_source, _parse_redshift_source)
  • ✅ Pydantic v2 best practices (mode='before', @classmethod)
  • ✅ Comprehensive error handling with try/except
  • ✅ Debug logging at key decision points
  • ✅ Recursive reference resolution for indirect sources
  • ✅ Non-breaking change (only activates for BigQuery expressions)

Backward Compatibility

100% backward compatible - Only affects .pbit files with BigQuery sources. All existing Snowflake/Redshift/Databricks parsing unchanged.

Checklist

  • Follows established code patterns
  • Integration tested with real .pbit files
  • Proper error handling and logging
  • Pydantic v2 compliant
  • Non-breaking addition
  • Documented with docstrings and examples

Fixes open-metadata#25483

Problem:
- PowerBI .pbit files store multiline DAX expressions as JSON arrays (one string per line)
- Pydantic validation failed with 'Input should be a valid string [type=string_type, input_value=[...], input_type=list]'
- Parser could not ingest .pbit files with multiline DAX measures, source expressions, or dataset expressions

Solution:
- Added Pydantic field_validator decorators to normalize list expressions to multiline strings
- Updated PowerBiMeasures.expression to accept Union[str, List[str]]
- Updated PowerBITableSource.expression to accept Union[str, List[str]]
- Updated DatasetExpression.expression to accept Union[str, List[str]]
- Made expression optional for PowerBiMeasures to handle measures without expressions
- Updated _get_child_measures() to handle None expression values
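The normalization step can be sketched as a plain function; in the PR it is wired up through Pydantic field_validator decorators on the three expression fields listed above:

```python
def normalize_expression(value):
    """Sketch of the list-to-string normalization: .pbit files store
    multiline DAX/M expressions as JSON arrays with one string per line."""
    if isinstance(value, list):
        return "\n".join(str(line) for line in value)
    return value  # already a string, or None for measures without expressions
```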

Testing:
- Successfully parsed .pbit file with 41 tables, 93 measures (72 multiline), 32 multiline sources
- Added 9 new unit tests for validators
- Added integration test case for multiline DAX expressions
- All existing tests pass (backward compatible)

Files changed:
- ingestion/src/metadata/ingestion/source/dashboard/powerbi/models.py
- ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py
- ingestion/tests/unit/test_powerbi_table_measures.py

Adds support for parsing BigQuery connections in PowerBI .pbit files to
enable lineage tracking from BigQuery tables/views to PowerBI tables.

Features:
- Parses GoogleBigQuery.Database() Power Query M expressions
- Resolves dataset expression references (e.g., Source = S_PJ_CODE)
- Extracts BigQuery project, dataset, and table information
- Handles both direct connections and indirect references through expressions
- Follows existing pattern for Snowflake, Redshift, and Databricks

Implementation:
- Added _parse_bigquery_source() method to parse BigQuery M expressions
- Integrated into parse_table_name_from_source() lineage flow
- Recursively resolves expression references to find BigQuery connections
- Uses regex patterns to extract: [Name="project"], [Name="dataset",Kind="Schema"], [Name="table",Kind="Table"]

Testing:
- Verified with Monthly Financial_trusted.pbit file containing BigQuery connections
- Successfully parsed lineage: kap-nami-prod.iruca_aligned.dbo_S_PJ_CODE → Map PJ Code Master
- Expression resolution tested and working

Example lineage chain:
  BigQuery: project.dataset.table
     ↓ (via dataset expression)
  Expression: S_PJ_CODE
     ↓
  PowerBI Table: Map PJ Code Master
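The reference-resolution step in the chain above can be sketched as follows. The dataset_expressions mapping and the helper name are illustrative; the real implementation resolves references inside the parse_table_name_from_source() flow:

```python
import re

def resolve_reference(expression, dataset_expressions, depth=0):
    """Follow indirect references like `Source = S_PJ_CODE` until an
    expression containing GoogleBigQuery.Database() is found."""
    if "GoogleBigQuery.Database" in expression or depth > 5:
        return expression
    match = re.search(r'Source\s*=\s*([#"\w\s]+?)\s*(?:,|$)', expression)
    if match:
        # Strip quoting artifacts such as #"My Ref" around the reference name
        ref_name = re.sub(r'^[#"]+|[#"]+$', "", match.group(1).strip()).strip()
        resolved = dataset_expressions.get(ref_name)
        if resolved:
            return resolve_reference(resolved, dataset_expressions, depth + 1)
    return expression
```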

Files changed:
- ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py

.pbit files store table source information in partitions[0].source instead of
directly in the table.source field. This commit adds partition support and
automatically extracts source from partitions when the source field is empty.

Changes:
- Added PowerBIPartition model to represent table partitions in .pbit files
- Added partitions field to PowerBiTable model
- Added model_validator to extract source from partitions[0] when table.source is None
- This enables lineage parsing for .pbit files where source is in partitions

This fixes the issue where BigQuery lineage was not being detected even though
the parsing logic was correct - the source field was simply not being populated.

Testing:
- Verified Map PJ Code Master table now has source populated from partitions
- Confirmed BigQuery lineage detection works: kap-nami-prod.iruca_aligned.dbo_S_PJ_CODE
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@gitar-bot

gitar-bot bot commented Jan 25, 2026

Code Review: 👍 Approved with suggestions (0 resolved / 3 findings)

Well-structured BigQuery lineage parser following existing patterns. Three minor suggestions for defensive coding around edge cases in regex parsing and partition source extraction.

💡 Edge Case: Partition source extraction assumes dict structure

📄 ingestion/src/metadata/ingestion/source/dashboard/powerbi/models.py:196-200

The extract_source_from_partitions validator accesses partitions[0].get("source") assuming the partition is a dict. However, when Pydantic processes nested models in mode='before' validators, the inner objects might already be parsed into Pydantic models (e.g., PowerBIPartition instances) rather than dicts, depending on how the data is constructed.

If partitions[0] is already a PowerBIPartition instance (not a dict), calling .get("source") will raise an AttributeError.

Suggested fix:
Handle both dict and model instance cases:

@model_validator(mode='before')
@classmethod
def extract_source_from_partitions(cls, values):
    if isinstance(values, dict):
        if values.get("source") is None and values.get("partitions"):
            partitions = values.get("partitions", [])
            if partitions and len(partitions) > 0:
                first_partition = partitions[0]
                if isinstance(first_partition, dict):
                    partition_source = first_partition.get("source")
                elif hasattr(first_partition, "source"):
                    partition_source = first_partition.source
                else:
                    partition_source = None
                if partition_source:
                    values["source"] = [partition_source]
    return values

This is likely low-impact since mode='before' typically receives raw data, but defensive coding would prevent future regressions.

💡 Quality: Regex pattern may capture trailing spaces or quotes

📄 ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py:980-985

The regex pattern r'Source\s*=\s*([A-Za-z0-9_#"&\s]+?)\s*,' uses a character class that includes whitespace (\s) and quotes ("). The subsequent cleanup with .strip().strip('"').strip('#').strip('"') handles some cases, but the order of operations may not fully clean all edge cases.

For example, if the matched string is ` "MyRef" ` (surrounding whitespace and quotes), the current cleanup chain:

  1. .strip() → "MyRef" (quotes remain)
  2. .strip('"') → MyRef (outer quotes removed)
  3. .strip('#') → MyRef
  4. .strip('"') → MyRef

However, patterns like #"My Ref" would result in My Ref" after processing, because .strip('#') only removes leading/trailing #, not the quote that follows it.

Suggested improvement:
Use a more targeted regex or refine the cleanup:

ref_name = source_ref_match.group(1).strip()
# Remove surrounding quotes and hash symbols commonly found in M expressions
ref_name = re.sub(r'^[#"]+|[#"]+$', '', ref_name).strip()

This is minor since the current implementation likely works for common cases, but it could cause issues with certain M expression naming conventions.

💡 Edge Case: BigQuery parser extracts first Name without Kind as project

📄 ingestion/src/metadata/ingestion/source/dashboard/powerbi/metadata.py:1022-1033

The logic assumes the first [Name="..."] pattern without a Kind attribute is the project. This works for the documented pattern but may incorrectly identify the project if the expression contains other Name patterns without Kind before the actual project identifier.

For example, consider an edge case where the expression contains metadata or comments with [Name="something"] patterns before the actual BigQuery connection:

/* Config [Name="metadata"] */ GoogleBigQuery.Database()[Name="actual-project"]...

The current implementation would incorrectly identify "metadata" as the project.

Suggested improvement:
Consider parsing only after detecting GoogleBigQuery.Database by splitting/finding that substring first:

# Find the BigQuery portion of the expression
bq_start = source_expression.find("GoogleBigQuery.Database")
if bq_start >= 0:
    bq_expression = source_expression[bq_start:]
    name_matches = re.findall(r'\[Name="([^"]+)"(?:,Kind="([^"]+)")?\]', bq_expression)

This is minor risk as real-world .pbit files likely don't have this pattern, but it would make the parser more robust.

