Merged
46 commits
9994f7d
chore(powerbi): configure Git LFS for m-query bridge binaries
askumar27 Mar 18, 2026
e7555b1
feat(powerbi): add TypeScript bridge source for MSFT powerquery-parser
askumar27 Mar 18, 2026
e080a41
fix(powerbi): improve index.ts input validation and error handling
askumar27 Mar 18, 2026
ebcf7d7
feat(powerbi): add compiled mquery-parser SEA binaries
askumar27 Mar 18, 2026
85f9fca
fix(powerbi): harden build.sh with Node version check and error handling
askumar27 Mar 19, 2026
502a02c
feat(powerbi): add MQueryBridge subprocess manager with gzip decompre…
askumar27 Mar 19, 2026
4838876
fix(powerbi): harden _bridge.py error handling and resource management
askumar27 Mar 19, 2026
11b5446
test(powerbi): add AST fixtures generated from MSFT powerquery-parser
askumar27 Mar 19, 2026
b47379f
feat(powerbi): add ast_utils.py for NodeIdMap navigation
askumar27 Mar 19, 2026
f42c3ca
fix(powerbi): harden ast_utils.py edge case handling
askumar27 Mar 19, 2026
c322ca7
feat(powerbi): rewrite resolver.py to walk powerquery-parser NodeIdMap
askumar27 Mar 19, 2026
16d3fde
fix(powerbi): add type annotations and quoted identifier handling in …
askumar27 Mar 19, 2026
139441f
feat(powerbi): rewrite pattern_handler.py to use ast_utils NodeIdMap …
askumar27 Mar 19, 2026
4e76543
feat(powerbi): wire bridge into parser.py, fix native_query_parsing flag
askumar27 Mar 19, 2026
b3a6aff
fix(powerbi): narrow BaseException to Exception in parser.py resolver…
askumar27 Mar 19, 2026
6be81cc
chore(powerbi): remove Lark parser, update packaging for SEA binaries
askumar27 Mar 19, 2026
be769ea
docs(powerbi): document native_query_parsing behaviour fix in updatin…
askumar27 Mar 19, 2026
57c4bd4
refactor(powerbi): expose parseExpression global function for PyMiniR…
askumar27 Mar 23, 2026
02396c8
refactor(powerbi): simplify build.sh to esbuild+gzip, commit bundle.j…
askumar27 Mar 23, 2026
9f4675e
refactor(powerbi): replace subprocess SEA bridge with py_mini_racer M…
askumar27 Mar 23, 2026
5bdd0fd
chore(powerbi): add mini-racer dep, update package_data to bundle.js.…
askumar27 Mar 23, 2026
b8e80c7
refactor(powerbi): update generate_fixtures.py to use _bridge.py dire…
askumar27 Mar 23, 2026
b8e5c81
chore(powerbi): delete SEA binaries and sea-config.json
askumar27 Mar 23, 2026
eda9d3f
fix(powerbi): fix mypy errors in _bridge.py and test_native_query_fla…
askumar27 Mar 23, 2026
200fd11
chore(deps): add mini-racer==0.14.1 to constraints.txt
askumar27 Mar 23, 2026
c7d9943
chore(deps): update pyproject.toml and uv.lock for mini-racer, remove…
askumar27 Mar 23, 2026
ed86cc2
docs(powerbi): add PR number and mini-racer dep note to updating-data…
askumar27 Mar 23, 2026
3de26f4
Merge branch 'master' into powerbi_mquery_parser
askumar27 Mar 23, 2026
f307790
fix(powerbi): fix two CI failures after mini-racer migration
askumar27 Mar 23, 2026
7447205
fix(powerbi): replace JSON load_fixture with bridge-backed module fix…
askumar27 Mar 23, 2026
2a7a8c9
fix(powerbi): move ast_utils imports to top, use NodeIdMap alias in f…
askumar27 Mar 23, 2026
f1f12f3
fix(powerbi): delete committed JSON AST fixtures (generated at test t…
askumar27 Mar 23, 2026
e1647b3
fix(powerbi): repurpose generate_fixtures.py as dev-only tool, update…
askumar27 Mar 23, 2026
c01a0f5
fix(powerbi): move index.ts to mquery_bridge root, fix wheel package …
askumar27 Mar 23, 2026
6f965af
fix(powerbi): suppress warnings for non-M-Query (DAX/empty) expressio…
askumar27 Mar 23, 2026
3fb0805
fix(powerbi): inline Snowflake M expression and add return types to f…
askumar27 Mar 23, 2026
6869d7e
docs(powerbi): improve docstrings and comments across M-Query bridge …
askumar27 Mar 23, 2026
ba85ee5
fix(powerbi): restore partial lineage for unresolved server parameters
askumar27 Mar 23, 2026
52e3617
feat(powerbi): add fine-grained M-Query metrics and debug logging for…
askumar27 Mar 23, 2026
dc085da
docs: update breaking changes section for PowerBI M-Query lineage ext…
askumar27 Mar 23, 2026
f8f2e22
chore: merge oss/master into powerbi_mquery_parser
askumar27 Mar 24, 2026
f1ed6cb
fix(powerbi): fix critical error handling bugs in M-Query bridge
askumar27 Mar 24, 2026
9b36819
refactor(powerbi): improve error handling in M-Query bridge
askumar27 Mar 24, 2026
a12dc7e
fix(powerbi): address review comments on M-Query bridge and pattern h…
askumar27 Mar 24, 2026
ea5a705
Merge branch 'master' into powerbi_mquery_parser
askumar27 Mar 24, 2026
70573e8
fix(powerbi): explicitly close MiniRacer context in _clear_bridge() t…
askumar27 Mar 24, 2026
1 change: 1 addition & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -3,3 +3,4 @@
*.tsx text eol=lf
gradlew text eol=lf
metadata-utils/src/test/resources/filterQuery/* text eol=lf

4 changes: 2 additions & 2 deletions docs/how/updating-datahub.md
@@ -33,8 +33,8 @@ This file documents any backwards-incompatible changes in DataHub and assists pe

### Breaking Changes

- #16341 (Ingestion) SQL parsing: View query IDs are now generated using a SHA-256 hash instead of URL-encoding the view URN. This affects all connectors that use view lineage tracking (Snowflake, Oracle, BigQuery, Postgres, MySQL, Hive, Trino, ClickHouse, DB2, Dremio, SQL Server, and others). Previously, query entities had URNs like `urn:li:query:view_urn%3Ali%3Adataset%3A%28...%29`; they now use `urn:li:query:view_<sha256hash>`. After upgrading, the old URL-encoded query entities will become stale and orphaned. To clean them up, enable stateful ingestion with stale entity removal in your recipe and re-run ingestion.

- #16685:(Ingestion) PowerBI M-Query lineage extraction has been rewritten using Microsoft's official `@microsoft/powerquery-parser`. As part of this change, the `native_query_parsing: false` configuration flag now suppresses only expressions containing `Value.NativeQuery`. Previously it suppressed all M-Query lineage extraction. Users who set this flag to block all lineage extraction will now see lineage produced for non-NativeQuery sources (Snowflake, PostgreSQL, MSSQL, etc.). To restore the suppress-all behaviour, add a `table_pattern.deny` rule in your recipe.
- #16341:(Ingestion) SQL parsing: View query IDs are now generated using a SHA-256 hash instead of URL-encoding the view URN. This affects all connectors that use view lineage tracking (Snowflake, Oracle, BigQuery, Postgres, MySQL, Hive, Trino, ClickHouse, DB2, Dremio, SQL Server, and others). Previously, query entities had URNs like `urn:li:query:view_urn%3Ali%3Adataset%3A%28...%29`; they now use `urn:li:query:view_<sha256hash>`. After upgrading, the old URL-encoded query entities will become stale and orphaned. To clean them up, enable stateful ingestion with stale entity removal in your recipe and re-run ingestion.
- #16396: Oracle connector: When connecting via `service_name` to a multitenant Oracle database, the database name used in URNs will now reflect the Pluggable Database (PDB) name instead of the Container Database (CDB) name. In Oracle Multitenant architecture, a CDB is the top-level container (e.g. `cdb`) and a PDB is an individual tenant database within it (e.g. `mypdb`); `service_name` typically routes to the PDB, so the PDB name is the correct identifier for your datasets. This affects both dataset URNs (when `add_database_name_to_urn: true`) and database/schema container URNs (always, since containers always include the database name). If your existing metadata was ingested with the old CDB-based URNs, re-ingesting will create new entities under the corrected URNs. To preserve the old URN shape and avoid re-creating entities, set `urn_db_name` explicitly in your recipe to match your previous CDB name.
- #16628 (Ingestion) Fabric OneLake source: Workspace containers now use the `fabric` platform instead of `fabric-onelake`. This changes workspace container URNs and the `dataPlatformInstance.platform` emitted for workspace entities. Lakehouse, warehouse, schema, and dataset entities remain on `fabric-onelake`.
- **Retention service disabled: only current version retained.** When the retention service is not enabled (not configured or unavailable), the write path now retains only the current version (version 0) and does not create version-history rows. Previously, version history was still written when retention was disabled. **Impact:** Deployments that run without retention enabled will no longer accumulate aspect version history; only the latest aspect value is stored. **Migration:** Enable and configure the retention service (e.g. ingest retention policies from `boot/retention.yaml`) if you need version history for any entity/aspect.
5 changes: 2 additions & 3 deletions metadata-ingestion/constraints.txt
@@ -782,8 +782,6 @@ langcodes==3.5.1
# via spacy
langdetect==1.0.9
# via unstructured
lark==1.1.4
# via acryl-datahub
leb128==1.0.9
# via asynch
linear-tsv==1.1.0
@@ -841,6 +839,8 @@ mdurl==0.1.2
# via markdown-it-py
memray==1.19.1
# via acryl-datahub
mini-racer==0.14.1
# via acryl-datahub
mistune==3.2.0
# via
# acryl-great-expectations
@@ -1370,7 +1370,6 @@ referencing==0.37.0
# jupyter-events
regex==2026.2.28
# via
# lark
# nltk
# tiktoken
requests==2.32.5
10 changes: 5 additions & 5 deletions metadata-ingestion/pyproject.toml
@@ -943,7 +943,7 @@ postgres = [
]

powerbi = [
"lark[regex]==1.1.4",
"mini-racer==0.14.1",
"more-itertools<11.0.0",
"msal>=1.31.1,<2.0.0",
"patchy==2.8.0",
@@ -1419,10 +1419,10 @@ all = [
"jsonpath-ng==1.7.0",
"jupyter_server>=2.14.1,<3.0.0",
"kerberos>=1.3.0,<2.0.0",
"lark[regex]==1.1.4",
"litellm==1.80.5",
"lkml>=1.3.4,<2.0.0",
"looker-sdk>=23.0.0,<26.0.0",
"mini-racer==0.14.1",
"mlflow-skinny>=2.3.0,<2.21.0",
"more-itertools>=8.12.0,<11.0.0",
"moto[s3]>=5.0.0,<6.0.0",
@@ -1570,10 +1570,10 @@ dev = [
"jsonschema<5.0.0",
"jupyter_server>=2.14.1,<3.0.0",
"kerberos>=1.3.0,<2.0.0",
"lark[regex]==1.1.4",
"litellm==1.80.5",
"lkml>=1.3.4,<2.0.0",
"looker-sdk>=23.0.0,<26.0.0",
"mini-racer==0.14.1",
"mixpanel>=4.9.0,<6.0.0",
"mlflow-skinny>=2.3.0,<2.21.0",
"more-itertools>=8.12.0,<11.0.0",
@@ -1761,10 +1761,10 @@ docs = [
"jsonschema<5.0.0",
"jupyter_server>=2.14.1,<3.0.0",
"kerberos>=1.3.0,<2.0.0",
"lark[regex]==1.1.4",
"litellm==1.80.5",
"lkml>=1.3.4,<2.0.0",
"looker-sdk>=23.0.0,<26.0.0",
"mini-racer==0.14.1",
"mixpanel>=4.9.0,<6.0.0",
"mlflow-skinny>=2.3.0,<2.21.0",
"more-itertools>=8.12.0,<11.0.0",
@@ -2147,7 +2147,7 @@ datahub = ["py.typed"]
"datahub.cli.gql" = ["*.gql"]
"datahub.cli.resources" = ["*.md"]
"datahub.ingestion.autogenerated" = ["*.json"]
"datahub.ingestion.source.powerbi" = ["powerbi-lexical-grammar.rule"]
"datahub.ingestion.source.powerbi.m_query.mquery_bridge" = ["bundle.js.gz"]
"datahub.metadata" = ["schema.avsc"]
"datahub.metadata.schemas" = ["*.avsc"]

4 changes: 2 additions & 2 deletions metadata-ingestion/setup.py
@@ -744,7 +744,7 @@
"nifi": {"requests<3.0.0", "packaging<26.0.0", "requests-gssapi<2.0.0"},
"powerbi": (
microsoft_common
| {"lark[regex]==1.1.4", "sqlparse<1.0.0", "more-itertools<11.0.0"}
| {"sqlparse<1.0.0", "more-itertools<11.0.0", "mini-racer==0.14.1"}
| sqlglot_lib
| threading_timeout_common
),
@@ -1204,7 +1204,7 @@
"datahub": ["py.typed"],
"datahub.metadata": ["schema.avsc"],
"datahub.metadata.schemas": ["*.avsc"],
"datahub.ingestion.source.powerbi": ["powerbi-lexical-grammar.rule"],
"datahub.ingestion.source.powerbi.m_query.mquery_bridge": ["bundle.js.gz"],
"datahub.ingestion.autogenerated": ["*.json"],
"datahub.cli.gql": ["*.gql"],
"datahub.cli.resources": ["*.md"],
Original file line number Diff line number Diff line change
@@ -247,8 +247,14 @@ class PowerBiDashboardSourceReport(StaleEntityRemovalSourceReport):
    m_query_parse_attempts: int = 0
    m_query_parse_successes: int = 0
    m_query_parse_timeouts: int = 0
    m_query_native_query_skipped: int = 0
    # Expressions that reached the parser but are not M-Query at all
    # (e.g. DAX computed-table expressions, empty strings, label rows).
    # These fail with MQueryParseError but are expected and logged at INFO.
    m_query_non_mquery_expressions: int = 0
    m_query_parse_validation_errors: int = 0
    m_query_parse_unexpected_character_errors: int = 0
    # Genuine M-Query expressions that the parser could not handle.
    m_query_parse_unknown_errors: int = 0
    m_query_resolver_errors: int = 0
    m_query_resolver_no_lineage: int = 0
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
import gzip
import json
import logging
import threading
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

NodeIdMap = dict[int, dict]

_BUNDLE_PATH = Path(__file__).parent / "mquery_bridge" / "bundle.js.gz"


class MQueryBridgeError(RuntimeError):
    """V8 context error or malformed response from the M-Query bridge."""

    pass


class MQueryParseError(RuntimeError):
    """Parser returned a structured parse error for a specific expression."""

    def __init__(self, message: str, expression: str = "") -> None:
        super().__init__(message)
        self.expression = expression


class MQueryBridge:
    def __init__(self) -> None:
        if not _BUNDLE_PATH.exists():
            raise ImportError(
                f"M-Query bridge bundle not found at {_BUNDLE_PATH}. "
                "Re-installing acryl-datahub[powerbi] may fix this."
            )
        try:
            from py_mini_racer import MiniRacer
        except ImportError as e:
            raise ImportError(
                "PowerBI M-Query parsing requires 'mini-racer'. "
                "Install it with: pip install 'acryl-datahub[powerbi]'"
            ) from e

        # Decompress bundle.js.gz in memory: fast (~50ms for 500KB) and happens once per process.
        try:
            bundle_js = gzip.decompress(_BUNDLE_PATH.read_bytes()).decode("utf-8")
        except (gzip.BadGzipFile, OSError, EOFError) as e:
            raise ImportError(
                f"M-Query bridge bundle at {_BUNDLE_PATH} appears to be corrupt: {e}. "
                "Re-installing acryl-datahub[powerbi] may fix this."
            ) from e
        self._ctx = MiniRacer()
        self._ctx.eval(bundle_js)

    def parse(self, expression: str) -> NodeIdMap:
        """
        Parse an M-Query expression and return a flat node map.

        Each key is a node ID (int); each value is a node dict with at least
        ``kind`` (NodeKind string) and ``id``. Child nodes are embedded inline,
        not as ID references, so you can walk them directly or look up any
        node by ID via the returned map.

        Example: the LetExpression root for ``let x = 1 in x`` is at the
        root of the returned dict and looks roughly like::

            {1: {"kind": "LetExpression", "id": 1, "variableList": {...}, ...},
             2: {"kind": "ArrayWrapper", "id": 2, ...},
             ...}

        Not thread-safe: callers must be single-threaded.

        Raises:
            MQueryParseError: parser returned a structured error for this expression.
            MQueryBridgeError: V8 context error or malformed response.
        """
        # JSPromise is available: __init__ already guaranteed py_mini_racer is installed.
        from py_mini_racer import JSPromise

        try:
            # parseExpression is async, so ctx.call() returns an unresolved plain dict.
            # Use ctx.eval() instead, which returns a JSPromise; call .get() to await it.
            result = self._ctx.eval(f"parseExpression({json.dumps(expression)})")
            if not isinstance(result, JSPromise):
                raise MQueryBridgeError(
                    f"M-Query bridge: expected JSPromise from parseExpression, got {type(result).__name__}"
                )
            raw = result.get()
        except MQueryBridgeError:
            raise
        except Exception as e:
            # Catches all py_mini_racer errors (JSEvalException, JSTimeoutException, etc.)
            # MiniRacerBaseException is not exported from the top-level namespace in mini-racer.
            raise MQueryBridgeError(f"M-Query bridge V8 error: {e}") from e

        if not isinstance(raw, str):
            raise MQueryBridgeError(
                f"M-Query bridge returned non-string result: {type(raw).__name__}"
            )

        try:
            result = json.loads(raw)
        except (json.JSONDecodeError, TypeError) as e:
            raise MQueryBridgeError(
                f"M-Query bridge returned malformed JSON: {e}. Raw: {raw!r}"
            ) from e

        if not result.get("ok"):
            raise MQueryParseError(
                result.get("error", "unknown error"), expression=expression
            )

        node_id_map = result.get("nodeIdMap")
        if node_id_map is None:
            raise MQueryBridgeError(
                "M-Query bridge returned ok=true but 'nodeIdMap' is missing from response"
            )

        return {int(node_id): node for node_id, node in node_id_map}
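
Review note: the response handling at the end of `parse()` can be exercised without a V8 context. The sketch below mirrors those checks on a hand-written response string. `decode_bridge_response` and `BridgeResponseError` are hypothetical names used only for illustration, and the shape of `nodeIdMap` (a list of `[id, node]` pairs) is an assumption inferred from the pair-unpacking comprehension in `parse()`; the TypeScript bridge source is not shown here to confirm it.

```python
import json


class BridgeResponseError(RuntimeError):
    """Hypothetical stand-in for MQueryBridgeError / MQueryParseError."""


def decode_bridge_response(raw: str) -> dict[int, dict]:
    """Turn a bridge response string into {node_id: node}."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as e:
        raise BridgeResponseError(f"malformed JSON: {e}") from e
    if not isinstance(payload, dict):
        raise BridgeResponseError("response is not a JSON object")
    if not payload.get("ok"):
        # The bridge reports parse failures as {"ok": false, "error": "..."}.
        raise BridgeResponseError(payload.get("error", "unknown error"))
    pairs = payload.get("nodeIdMap")
    if pairs is None:
        raise BridgeResponseError("'nodeIdMap' is missing")
    # Assumed shape: [[id, node], ...]; ids are coerced to int.
    return {int(node_id): node for node_id, node in pairs}


sample = json.dumps(
    {"ok": True, "nodeIdMap": [[1, {"kind": "LetExpression", "id": 1}]]}
)
node_map = decode_bridge_response(sample)
print(node_map[1]["kind"])
```

If the bridge instead serialized `nodeIdMap` as a JSON object keyed by ID, the comprehension would need `.items()`; the pair-list form is what the current unpacking implies.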


_bridge_instance: Optional[MQueryBridge] = None
_bridge_lock = threading.Lock()


def get_bridge() -> MQueryBridge:
    """Return the process-wide MQueryBridge, creating it on first call."""
    global _bridge_instance
    if _bridge_instance is None:
        with _bridge_lock:
            if _bridge_instance is None:
                _bridge_instance = MQueryBridge()
    return _bridge_instance


def _clear_bridge() -> None:
    """Drop the singleton so the next call to get_bridge() starts fresh.

    Called after a V8 crash to avoid reusing a broken context, and in tests
    to ensure each test module gets an isolated bridge.
    """
    global _bridge_instance
    with _bridge_lock:
        if _bridge_instance is not None:
            # Explicitly close the V8 context before dropping the reference.
            # If we just set _bridge_instance = None here, Python's GC decides
            # when to finalize the MiniRacer object. If that happens while other
            # threads are active (e.g. in a later, unrelated test), MiniRacer's
            # __del__ -> close() path segfaults. Closing synchronously here, while
            # the lock is held and no concurrent parse() calls are in flight,
            # shuts down V8 cleanly.
            try:
                _bridge_instance._ctx.close()
            except Exception:
                pass
            _bridge_instance = None
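
Review note: `get_bridge()` and `_clear_bridge()` together form a lazily created singleton with double-checked locking and a close-before-drop teardown. A minimal, dependency-free sketch of that pattern follows. `FakeContext`, `get_context`, and `clear_context` are hypothetical stand-ins and do not involve mini-racer; the sketch only illustrates the ordering of the lock, the re-check, and the explicit `close()`.

```python
import threading
from typing import Optional


class FakeContext:
    """Hypothetical stand-in for a MiniRacer context; only tracks close()."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


_instance: Optional[FakeContext] = None
_lock = threading.Lock()


def get_context() -> FakeContext:
    """Create the shared context once, even if several threads race here."""
    global _instance
    if _instance is None:  # fast path: skip the lock once initialized
        with _lock:
            if _instance is None:  # re-check now that the lock is held
                _instance = FakeContext()
    return _instance


def clear_context() -> None:
    """Close the context synchronously, then drop the reference."""
    global _instance
    with _lock:
        if _instance is not None:
            _instance.close()  # deterministic shutdown instead of relying on GC
            _instance = None


first = get_context()
same = get_context()
clear_context()
second = get_context()
```

After `clear_context()`, the old object has been closed explicitly and the next `get_context()` call builds a fresh one, which is the behaviour the comment in `_clear_bridge()` describes for the real V8 context.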