Skip to content

feat(powerbi): replace Lark M-Query parser with Microsoft powerquery-parser#16685

Merged
askumar27 merged 46 commits intomasterfrom
powerbi_mquery_parser
Mar 24, 2026
Merged

feat(powerbi): replace Lark M-Query parser with Microsoft powerquery-parser#16685
askumar27 merged 46 commits intomasterfrom
powerbi_mquery_parser

Conversation

@askumar27
Copy link
Copy Markdown
Contributor

@askumar27 askumar27 commented Mar 19, 2026

📋 Summary

Replace DataHub's hand-written Lark M-Query parser with Microsoft's official @microsoft/powerquery-parser — the same TypeScript parser used in the VS Code Power Query extension. The bridge is compiled to a gzip-compressed JS bundle (bundle.js.gz, ~125 KB) evaluated in-process by py_mini_racer (V8). No Node.js runtime required from users.

🎯 Motivation

The existing Lark-based parser is a partial, manually-maintained approximation of the M-Query specification with several known correctness problems:

  • Parse timeouts / silent lineage loss: The Earley parser runs in O(n³) on ambiguous inputs, causing m_query_parse_timeouts on complex real-world expressions
  • BaseException catch-all: parser.py silently converted unexpected errors (including KeyboardInterrupt) into "Unknown Parsing Error" warnings
  • native_query_parsing=False bug: The flag suppressed all M-Query parsing, not just Value.NativeQuery expressions — users could not suppress native queries without also losing all other lineage
  • Grammar deviations from spec: Silent failures on valid M-Query expressions that the grammar couldn't handle
  • Maintenance burden: pattern_handler.py was ~1,600 lines of defensive token-chain walking to compensate for Lark's flat, semantically-weak tree representation

Microsoft maintains @microsoft/powerquery-parser with full spec coverage, structured error recovery, and a rich typed AST. This PR replaces the Lark parser with it while preserving the parser.get_upstream_tables() public API signature exactly.

⚖️ Approach: Why PyMiniRacer over a Node.js SEA binary

The initial implementation shipped @microsoft/powerquery-parser as a platform-native Node.js SEA (Single Executable Application) binary — similar to how DataHub bundles jdk4py. While functional, that approach had significant drawbacks that led us to switch to py_mini_racer before merge.

Old approach: Node.js SEA binary

index.ts ──tsc+esbuild──► bundle.js ──node --experimental-sea-config──► binary
binary ──gzip──► mquery-parser-{platform}.gz   (~17 MB each, 5 platforms → ~85 MB in wheel)

Cons:

  • Size: Each platform binary is ~17 MB gzip-compressed (~66 MB uncompressed). Five platforms (linux-x64, linux-aarch64, darwin-x64, darwin-arm64, win32-x64) would produce an ~85 MB wheel — close to PyPI's 100 MB limit, with no headroom for future @microsoft/powerquery-parser version bumps.
  • Git LFS required: Binaries had to be stored in Git LFS, adding CI complexity (lfs: true on every checkout, git lfs pull required) and an operational dependency on LFS infrastructure.
  • Subprocess overhead: Each process spawned a subprocess, managed a stdin/stdout NDJSON protocol, tracked PIDs, handled crash-and-restart, and cleaned up temp dirs. ~250 lines of lifecycle code.
  • Platform matrix: Required running build.sh on five separate CI runners to produce all platform binaries. The PR was blocked on darwin-arm64 only.
  • No binary for new platforms: Adding a platform (e.g., linux-aarch64-musl for Alpine) required a new CI build matrix entry.

New approach: PyMiniRacer in-process V8

index.ts ──tsc──► dist/index.js ──esbuild --minify --platform=browser --format=iife──► bundle.js
bundle.js ──gzip -9──► bundle.js.gz   (~125 KB — 136× smaller than one SEA binary)

At runtime:

bundle_js = gzip.decompress(_BUNDLE_PATH.read_bytes())  # ~50 ms, once per process
ctx = MiniRacer()
ctx.eval(bundle_js)
# Per expression:
promise = ctx.eval(f"parseExpression({json.dumps(expression)})")
raw = promise.get()  # awaits the JS Promise

Advantages over the old approach:

  • 136× smaller: 125 KB vs 17 MB. The entire bundle.js.gz is a plain git file — no LFS, no size concerns, fits trivially in any wheel.
  • No subprocess: Parsing is in-process. No PID tracking, no crash-restart, no temp dirs, no NDJSON framing. _bridge.py is ~60 lines vs ~250.
  • Platform-agnostic: mini-racer ships pre-built V8 wheels for all five platforms from PyPI — no build matrix needed. Adding a new platform is a pip install away.
  • Single-step build: build.sh runs npm ci → tsc → esbuild → gzip and produces one committed artifact. No cross-compilation, no CI runners needed.
  • No Git LFS: bundle.js.gz is 125 KB — comfortably committed as a regular git file.

The only trade-off is one new runtime dependency (mini-racer>=0.12.0) added to the powerbi extras, which ships its own V8 wheels from PyPI and requires no system-level Node.js.

🔧 Changes Overview

New components:

  • mquery_bridge/ — TypeScript bridge wrapping @microsoft/powerquery-parser; index.ts exposes parseExpression() via globalThis for py_mini_racer's bare V8 context; compiled to bundle.js.gz via esbuild --platform=browser --format=iife --minify + gzip
  • bundle.js.gz — 125 KB committed artifact; no LFS; loaded once per process by _bridge.py
  • _bridge.py — Decompresses bundle.js.gz in memory, evaluates it in MiniRacer, exposes MQueryBridge.parse() returning NodeIdMap = dict[int, dict]; singleton via get_bridge()
  • ast_utils.py — Typed navigation helpers (find_nodes_by_kind, get_invoke_callee_name, get_literal_value, get_record_field_values, resolve_identifier) over the flat NodeIdMap format

Rewritten (same public API, new implementation):

  • resolver.py — Forward traversal of the typed AST; dispatch table over ~80 node kinds; circular-reference guard via (let_node_id, name) pairs
  • pattern_handler.py — ~50% smaller; argument extraction uses ast_utils helpers instead of Lark token-chain walking; all 14 platform handler classes retained

Modified:

  • parser.py — Bridge call replaces Lark; native_query_parsing=False now correctly filters only Value.NativeQuery expressions; BaseException narrowed to Exception
  • setup.pylark[regex]==1.1.4 removed from powerbi extras; mini-racer>=0.12.0 added; package_data points to bundle.js.gz instead of binaries/*.gz; .gitattributes LFS config removed
  • config.py — Added observability counters to PowerBiDashboardSourceReport: m_query_parse_attempts, m_query_parse_successes, m_query_parse_timeouts, m_query_native_query_skipped, m_query_non_mquery_expressions (DAX/empty expressions that were previously silently pre-filtered), m_query_parse_unknown_errors (genuine M-Query failures only), m_query_resolver_successes, m_query_resolver_no_lineage, m_query_resolver_errors

Deleted:

  • validator.py — Logic re-implemented correctly in parser.py
  • tree_function.py — Replaced by ast_utils.py
  • powerbi-lexical-grammar.rule — No longer needed
  • sea-config.json and binaries/*.gz — Replaced by bundle.js.gz

🏗️ Architecture / Design Notes

┌──────────────────────────────────────────────────────────┐
│  Layer 1: In-Process JS Bridge                           │
│                                                          │
│  mquery_bridge/index.ts ──build.sh──► bundle.js.gz      │
│                                              │           │
│  _bridge.py: gzip.decompress() → MiniRacer.eval()       │
│              MiniRacer.eval("parseExpression(...)").get()│
└──────────────────────────────────────────────────────────┘
                      │ NodeIdMap (dict[int, dict])
                      ▼
┌──────────────────────────────────────────────────────────┐
│  Layer 2: Python Tree-Walker                             │
│                                                          │
│  ast_utils.py ──► resolver.py ──► pattern_handler.py    │
│                        │                                 │
│                        ▼                                 │
│                  data_classes.py (Lineage — unchanged)   │
└──────────────────────────────────────────────────────────┘
image

Key decisions:

  • --platform=browser --format=iife for esbuildpy_mini_racer's V8 context has no Node.js globals (exports, require, process). IIFE wraps the bundle in (()=>{...})(), making it self-contained. --platform=node would produce CommonJS which V8 can't evaluate.
  • globalThis not global — Bare V8 has globalThis but not Node.js's global. The bridge registers parseExpression on globalThis so py_mini_racer can find it.
  • ctx.eval() + promise.get() not ctx.call() — In mini-racer 0.14, ctx.call() does not resolve async JS Promises (returns {} instead). ctx.eval() returns a JSPromise object; calling .get() blocks until the Promise resolves.
  • Lazy py_mini_racer importfrom py_mini_racer import MiniRacer is inside MQueryBridge.__init__(), not at module top. The PowerBI connector can be imported without mini-racer installed; only instantiation fails, with a clear ImportError.
  • Gzip decompression in memorybundle.js.gz is decompressed via gzip.decompress() on first MQueryBridge instantiation (~50 ms). No temp files, no cache dirs, no cleanup.
  • Hard-fail if bundle missing — Clean ImportError with actionable message > silent degradation.

Updating the parser version: Run build.sh (requires Node.js 16+ on the developer's machine) after bumping @microsoft/powerquery-parser in package.json, commit the new bundle.js.gz and package-lock.json. No platform-specific builds required.

🧪 Testing

Existing suite (unchanged, used as acceptance criteria):

  • tests/integration/powerbi/test_m_parser.py — 38 M-Query expressions across all supported platforms (Snowflake, MSSQL, PostgreSQL, BigQuery, Redshift, Databricks, Athena, Oracle), native queries, Table.Combine, parameterised queries, DROP stripping, encoding edge cases — 39 passed, 1 xfailed
  • tests/integration/powerbi/test_ingest.py — golden-file end-to-end MCP shape and URN generation

New tests:

  • tests/unit/test_mquery_bridge.py — bridge lifecycle, valid/invalid expression handling, singleton behaviour; tests hit the real V8 engine (no mocking)
  • tests/unit/test_ast_utils.py — all ast_utils helpers; bridge-backed (parses live via V8 at test time, no static fixtures); covers null literals, empty records, identifier resolution, node-kind search
  • tests/unit/test_native_query_flag.py — explicitly validates the corrected native_query_parsing semantics: false suppresses only Value.NativeQuery expressions; non-native expressions produce lineage normally

📊 Impact Assessment

Affected components: metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/ only. powerbi.py and all callers of get_upstream_tables() are untouched.

Breaking changes:

⚠️ native_query_parsing: false behaviour change

Previously this flag suppressed all M-Query lineage extraction. It now suppresses only expressions containing Value.NativeQuery. Users who set this flag to block all lineage will now see lineage from non-NativeQuery sources (Snowflake, PostgreSQL, MSSQL, etc.). This is documented in docs/how/updating-datahub.md.

Performance: See Perf Comparison section below for measured numbers from a production ingestion run. The headline: M-Query parse time drops from 2,816s to 28s (101×), total wall time drops from 62 min to ~27 min (~2.3×), and parse success rate increases from 30.5% to 75.5%. Timeout incidents drop to near zero — the MSFT parser runs in linear time vs. Earley's O(n³). The one-time cost of loading bundle.js.gz (~50 ms) is paid once per ingestion process, not per expression.

Risk level: Medium — core lineage extraction logic is fully rewritten; mitigated by the existing 38-test corpus passing without modification.

🚀 Deployment Notes

No new user-facing runtime dependencies beyond the powerbi extras. mini-racer>=0.12.0 is added to the powerbi extras in setup.py. It ships pre-built V8 wheels for linux-x86_64, linux-aarch64, darwin-x86_64, darwin-arm64, and win32-x86_64 from PyPI — no system Node.js required.

lark[regex] removed. The Lark grammar dependency is dropped from the powerbi extras.

📈 Performance comparison: Lark vs MSFT

Measured on a production PowerBI ingestion run (~2,900 M-Query expressions).

Speed

Metric Lark MSFT Change
Total wall time 3,707s (62 min) ~1,600s (~27 min) ~2.3× faster
M-Query parse time 2,816s 28s 101× faster

Parse quality

Metric Lark MSFT Notes
Parse attempts 1,486 2,890 Lark's pre-filter hid 1,456 expressions before counting
Parse successes 453 (30.5%) 2,182 (75.5%) +45 percentage points
Parse validation errors 1,404 0 Lark-specific error class, eliminated
Parse unexpected char errors 792 0 Lark-specific error class, eliminated
Genuine M-Query parse failures 241 27 Real failures only, after separating non-M-Query expressions
Non-M-Query expressions (DAX, empty) hidden 681 Now counted separately, logged at INFO not WARNING

Lineage output (semantic diff via datahub check metadata-diff)

Metric Result
Net upstream lineage vs Lark +42 entries (new lineage Lark never found)
Column lineage accuracy 30 URNs had wrong database name in Lark; now correct
Raw MCP count −159 vs Lark — caused entirely by Lark over-emitting duplicate MCPs, not by MSFT dropping anything

Add Git LFS tracking for compressed Node.js SEA binaries that will be used
by the @microsoft/powerquery-parser integration. Each platform-specific
binary (linux-x64, linux-aarch64, darwin-x64, darwin-arm64, win32-x64)
will be gzip-compressed to 12-18 MB and stored via LFS.
Build the Node.js SEA binary for darwin-arm64 using the TypeScript bridge
source. Adds build.sh, package-lock.json, .gitignore (ignoring node_modules/,
dist/, sea-prep.blob, and uncompressed binaries), and fixes src/index.ts to
use the async tryLexParse API with ResultKind instead of the non-existent
isError property.
- Add Node.js 20+ version validation at script start
- Add error handling for SEA blob generation failure
- Add mkdir -p binaries before binary copy
…helpers

Replace all Lark tree_function calls with _get_arg_values / _get_record_args
helpers that operate directly on the embedded NodeIdMap dicts produced by
the Microsoft powerquery-parser bridge. Add _get_data_source_tokens for
NativeQueryLineage first-arg resolution through the let scope.

Preserve DataAccessFunctionDetail.parameters field threading through all
handler classes. Restore xfail marker on dangling-comma test — the MSFT
parser correctly rejects it.
- Rewrite parser.py to drop Lark/validator, call _bridge.get_bridge() directly
- Fix native_query_parsing=False semantics: now suppresses only expressions
  containing Value.NativeQuery instead of all M-Query parsing
- Add m_query_native_query_skipped counter to PowerBiDashboardSourceReport
- Add tests/unit/test_native_query_flag.py with 3 tests for the fixed semantics
- Remove 13 Lark-specific test_parse_m_query* tests from test_m_parser.py
- Update test_unsupported_data_platform to reflect new bridge-based behavior
- Delete tree_function.py, validator.py, and powerbi-lexical-grammar.rule
- Remove lark[regex]==1.1.4 from powerbi extras in setup.py
- Update package_data to include mquery_bridge binaries instead of grammar rule
- Update test_powerbi_parser.py to use dict-based AST nodes instead of lark Tree/Token
@github-actions
Copy link
Copy Markdown
Contributor

Linear: ING-1997

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata docs Issues and Improvements to docs labels Mar 19, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 73.64341% with 136 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...powerbi/m_query/mquery_bridge/generate_fixtures.py 0.00% 34 Missing ⚠️
...ngestion/source/powerbi/m_query/pattern_handler.py 80.00% 31 Missing ⚠️
...tahub/ingestion/source/powerbi/m_query/resolver.py 74.76% 27 Missing ⚠️
...atahub/ingestion/source/powerbi/m_query/_bridge.py 76.47% 16 Missing ⚠️
...ahub/ingestion/source/powerbi/m_query/ast_utils.py 82.41% 16 Missing ⚠️
...datahub/ingestion/source/powerbi/m_query/parser.py 76.92% 12 Missing ⚠️

❌ Your patch status has failed because the patch coverage (73.64%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

📢 Thoughts on this report? Let us know!

@alwaysmeticulous
Copy link
Copy Markdown

alwaysmeticulous bot commented Mar 19, 2026

🔴 Meticulous spotted visual differences in 17 of 1811 screens tested: view and approve differences detected.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 52e3617. This comment will update as new commits are pushed.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 19, 2026

Bundle Report

Changes will decrease total bundle size by 284.22kB (-1.24%) ⬇️. This is within the configured threshold ✅

Detailed changes
Bundle name Size Change
datahub-react-web-esm 22.68MB -284.22kB (-1.24%) ⬇️

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

Asset Name Size Change Total Size Change (%)
assets/index-*.js -365.31kB 12.45MB -2.85%
assets/flinklogo-*.svg (New) 81.09kB 81.09kB 100.0% 🚀

…iniRacer

- Rewrites _bridge.py to load bundle.js.gz into a MiniRacer V8 context
  instead of spawning a Node.js SEA subprocess
- Fixes index.ts to assign parseExpression to globalThis (not Node's global)
  so it is accessible in py_mini_racer's V8 context
- Fixes build.sh to emit IIFE output (--platform=browser --format=iife)
  so the bundle runs without Node.js built-ins (exports, require, process)
- Uses ctx.eval() + JSPromise.get() to await the async parseExpression
  function, since ctx.call() does not resolve Promises in mini-racer 0.14
- Regenerates bundle.js.gz from the corrected build
… no-lineage cases

- Split m_query_parse_unknown_errors into m_query_non_mquery_expressions
  (DAX/empty/label expressions) and m_query_parse_unknown_errors (genuine
  M-Query parse failures), making the 241→708 metric jump self-explanatory
- Log full expression at DEBUG when parse succeeds but no lineage is extracted,
  covering both "unsupported function" and "handler returned empty" cases, so
  expressions can be copy-pasted directly into local tests for investigation
- Add debug logs at every silent Lineage.empty() return in pattern_handler.py
  (Redshift, MySQL, MSSql, TwoStepDataAccessPattern, create_reference_table)
  and in resolver.py (missing output expression), explaining why each case
  produced no lineage rather than silently dropping it
Resolved conflict: kept deletion of powerbi-lexical-grammar.rule (Lark
parser was removed in this branch).
- Clear MiniRacer singleton on TimeoutException to avoid reusing a
  corrupted V8 context after async_raise interrupts mid-eval
- Guard result["nodeIdMap"] access with .get() + explicit check to raise
  MQueryBridgeError instead of a KeyError misclassified as resolver error
- Fix TRACE_POWERBI_MQUERY_PARSER env var to use get_trace_powerbi_mquery_parser()
  from env_vars.py instead of a raw os.getenv call
- Replace parameters=None + type:ignore + __post_init__ with
  field(default_factory=dict) in DataAccessFunctionDetail
- Updated bundling process to avoid minification, preserving error message details.
- Enhanced error formatting to include stage information (Lex/Parse) for better debugging.
- Added new tests to validate error handling and parsing of minimal section documents.
- Updated existing tests to reflect changes in error message structure.
Copy link
Copy Markdown
Contributor

@treff7es treff7es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very nice pr with huge improvement in the mquery parser. Good job!
I approved it with minimal comments.
The onyl concerning one is the get_bridge thread-safety

…andler

- Make get_bridge() thread-safe via double-checked locking with threading.Lock()
- Add try/except around gzip decompression to give a clear error on corrupt bundle
- Use dict.get() instead of dict[] for accessor.items lookups to avoid KeyError
  on malformed AST nodes (Oracle, TwoStepDataAccessPattern, MySQL handlers)
…o prevent GC segfault

py_mini_racer's __del__ -> close() is not thread-safe. When _clear_bridge() simply
set _bridge_instance = None, Python's GC could finalize the MiniRacer object later
(in an unrelated test) while other threads were active, causing a segfault. Explicitly
calling _ctx.close() while holding the lock shuts down the V8 context synchronously
and cleanly, before any threads can interfere.
@askumar27 askumar27 merged commit 32cfa00 into master Mar 24, 2026
71 of 72 checks passed
@askumar27 askumar27 deleted the powerbi_mquery_parser branch March 24, 2026 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Issues and Improvements to docs ingestion PR or Issue related to the ingestion of metadata needs-review Label for PRs that need review from a maintainer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants