feat(powerbi): replace Lark M-Query parser with Microsoft powerquery-parser#16685
feat(powerbi): replace Lark M-Query parser with Microsoft powerquery-parser#16685
Conversation
Add Git LFS tracking for compressed Node.js SEA binaries that will be used by the @microsoft/powerquery-parser integration. Each platform-specific binary (linux-x64, linux-aarch64, darwin-x64, darwin-arm64, win32-x64) will be gzip-compressed to 12-18 MB and stored via LFS.
Build the Node.js SEA binary for darwin-arm64 using the TypeScript bridge source. Adds build.sh, package-lock.json, .gitignore (ignoring node_modules/, dist/, sea-prep.blob, and uncompressed binaries), and fixes src/index.ts to use the async tryLexParse API with ResultKind instead of the non-existent isError property.
- Add Node.js 20+ version validation at script start - Add error handling for SEA blob generation failure - Add mkdir -p binaries before binary copy
…helpers Replace all Lark tree_function calls with _get_arg_values / _get_record_args helpers that operate directly on the embedded NodeIdMap dicts produced by the Microsoft powerquery-parser bridge. Add _get_data_source_tokens for NativeQueryLineage first-arg resolution through the let scope. Preserve DataAccessFunctionDetail.parameters field threading through all handler classes. Restore xfail marker on dangling-comma test — the MSFT parser correctly rejects it.
- Rewrite parser.py to drop Lark/validator, call _bridge.get_bridge() directly - Fix native_query_parsing=False semantics: now suppresses only expressions containing Value.NativeQuery instead of all M-Query parsing - Add m_query_native_query_skipped counter to PowerBiDashboardSourceReport - Add tests/unit/test_native_query_flag.py with 3 tests for the fixed semantics - Remove 13 Lark-specific test_parse_m_query* tests from test_m_parser.py - Update test_unsupported_data_platform to reflect new bridge-based behavior
- Delete tree_function.py, validator.py, and powerbi-lexical-grammar.rule - Remove lark[regex]==1.1.4 from powerbi extras in setup.py - Update package_data to include mquery_bridge binaries instead of grammar rule - Update test_powerbi_parser.py to use dict-based AST nodes instead of lark Tree/Token
|
Linear: ING-1997 |
Codecov Report❌ Patch coverage is ❌ Your patch status has failed because the patch coverage (73.64%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage. 📢 Thoughts on this report? Let us know! |
|
🔴 Meticulous spotted visual differences in 17 of 1811 screens tested: view and approve differences detected. Meticulous evaluated ~8 hours of user flows against your PR. Last updated for commit 52e3617. This comment will update as new commits are pushed. |
Bundle ReportChanges will decrease total bundle size by 284.22kB (-1.24%) ⬇️. This is within the configured threshold ✅ Detailed changes
Affected Assets, Files, and Routes:view changes for bundle: datahub-react-web-esmAssets Changed:
|
…iniRacer - Rewrites _bridge.py to load bundle.js.gz into a MiniRacer V8 context instead of spawning a Node.js SEA subprocess - Fixes index.ts to assign parseExpression to globalThis (not Node's global) so it is accessible in py_mini_racer's V8 context - Fixes build.sh to emit IIFE output (--platform=browser --format=iife) so the bundle runs without Node.js built-ins (exports, require, process) - Uses ctx.eval() + JSPromise.get() to await the async parseExpression function, since ctx.call() does not resolve Promises in mini-racer 0.14 - Regenerates bundle.js.gz from the corrected build
…gz, remove LFS config
… no-lineage cases - Split m_query_parse_unknown_errors into m_query_non_mquery_expressions (DAX/empty/label expressions) and m_query_parse_unknown_errors (genuine M-Query parse failures), making the 241→708 metric jump self-explanatory - Log full expression at DEBUG when parse succeeds but no lineage is extracted, covering both "unsupported function" and "handler returned empty" cases, so expressions can be copy-pasted directly into local tests for investigation - Add debug logs at every silent Lineage.empty() return in pattern_handler.py (Redshift, MySQL, MSSql, TwoStepDataAccessPattern, create_reference_table) and in resolver.py (missing output expression), explaining why each case produced no lineage rather than silently dropping it
Resolved conflict: kept deletion of powerbi-lexical-grammar.rule (Lark parser was removed in this branch).
- Clear MiniRacer singleton on TimeoutException to avoid reusing a corrupted V8 context after async_raise interrupts mid-eval - Guard result["nodeIdMap"] access with .get() + explicit check to raise MQueryBridgeError instead of a KeyError misclassified as resolver error - Fix TRACE_POWERBI_MQUERY_PARSER env var to use get_trace_powerbi_mquery_parser() from env_vars.py instead of a raw os.getenv call - Replace parameters=None + type:ignore + __post_init__ with field(default_factory=dict) in DataAccessFunctionDetail
- Updated bundling process to avoid minification, preserving error message details. - Enhanced error formatting to include stage information (Lex/Parse) for better debugging. - Added new tests to validate error handling and parsing of minimal section documents. - Updated existing tests to reflect changes in error message structure.
treff7es
left a comment
There was a problem hiding this comment.
This is a very nice pr with huge improvement in the mquery parser. Good job!
I approved it with minimal comments.
The onyl concerning one is the get_bridge thread-safety
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/_bridge.py
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/_bridge.py
Outdated
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/pattern_handler.py
Outdated
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/pattern_handler.py
Outdated
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/pattern_handler.py
Outdated
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/pattern_handler.py
Outdated
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/pattern_handler.py
Outdated
Show resolved
Hide resolved
…andler - Make get_bridge() thread-safe via double-checked locking with threading.Lock() - Add try/except around gzip decompression to give a clear error on corrupt bundle - Use dict.get() instead of dict[] for accessor.items lookups to avoid KeyError on malformed AST nodes (Oracle, TwoStepDataAccessPattern, MySQL handlers)
…o prevent GC segfault py_mini_racer's __del__ -> close() is not thread-safe. When _clear_bridge() simply set _bridge_instance = None, Python's GC could finalize the MiniRacer object later (in an unrelated test) while other threads were active, causing a segfault. Explicitly calling _ctx.close() while holding the lock shuts down the V8 context synchronously and cleanly, before any threads can interfere.
📋 Summary
Replace DataHub's hand-written Lark M-Query parser with Microsoft's official
@microsoft/powerquery-parser— the same TypeScript parser used in the VS Code Power Query extension. The bridge is compiled to a gzip-compressed JS bundle (bundle.js.gz, ~125 KB) evaluated in-process bypy_mini_racer(V8). No Node.js runtime required from users.🎯 Motivation
The existing Lark-based parser is a partial, manually-maintained approximation of the M-Query specification with several known correctness problems:
m_query_parse_timeoutson complex real-world expressionsBaseExceptioncatch-all:parser.pysilently converted unexpected errors (includingKeyboardInterrupt) into "Unknown Parsing Error" warningsnative_query_parsing=Falsebug: The flag suppressed all M-Query parsing, not justValue.NativeQueryexpressions — users could not suppress native queries without also losing all other lineagepattern_handler.pywas ~1,600 lines of defensive token-chain walking to compensate for Lark's flat, semantically-weak tree representationMicrosoft maintains
@microsoft/powerquery-parserwith full spec coverage, structured error recovery, and a rich typed AST. This PR replaces the Lark parser with it while preserving theparser.get_upstream_tables()public API signature exactly.⚖️ Approach: Why PyMiniRacer over a Node.js SEA binary
The initial implementation shipped
@microsoft/powerquery-parseras a platform-native Node.js SEA (Single Executable Application) binary — similar to how DataHub bundlesjdk4py. While functional, that approach had significant drawbacks that led us to switch topy_mini_racerbefore merge.Old approach: Node.js SEA binary
Cons:
@microsoft/powerquery-parserversion bumps.lfs: trueon every checkout,git lfs pullrequired) and an operational dependency on LFS infrastructure.build.shon five separate CI runners to produce all platform binaries. The PR was blocked on darwin-arm64 only.linux-aarch64-muslfor Alpine) required a new CI build matrix entry.New approach: PyMiniRacer in-process V8
At runtime:
Advantages over the old approach:
bundle.js.gzis a plain git file — no LFS, no size concerns, fits trivially in any wheel._bridge.pyis ~60 lines vs ~250.mini-racerships pre-built V8 wheels for all five platforms from PyPI — no build matrix needed. Adding a new platform is apip installaway.build.shrunsnpm ci → tsc → esbuild → gzipand produces one committed artifact. No cross-compilation, no CI runners needed.bundle.js.gzis 125 KB — comfortably committed as a regular git file.The only trade-off is one new runtime dependency (
mini-racer>=0.12.0) added to thepowerbiextras, which ships its own V8 wheels from PyPI and requires no system-level Node.js.🔧 Changes Overview
New components:
mquery_bridge/— TypeScript bridge wrapping@microsoft/powerquery-parser;index.tsexposesparseExpression()viaglobalThisforpy_mini_racer's bare V8 context; compiled tobundle.js.gzviaesbuild --platform=browser --format=iife --minify+ gzipbundle.js.gz— 125 KB committed artifact; no LFS; loaded once per process by_bridge.py_bridge.py— Decompressesbundle.js.gzin memory, evaluates it inMiniRacer, exposesMQueryBridge.parse()returningNodeIdMap = dict[int, dict]; singleton viaget_bridge()ast_utils.py— Typed navigation helpers (find_nodes_by_kind,get_invoke_callee_name,get_literal_value,get_record_field_values,resolve_identifier) over the flatNodeIdMapformatRewritten (same public API, new implementation):
resolver.py— Forward traversal of the typed AST; dispatch table over ~80 node kinds; circular-reference guard via(let_node_id, name)pairspattern_handler.py— ~50% smaller; argument extraction usesast_utilshelpers instead of Lark token-chain walking; all 14 platform handler classes retainedModified:
parser.py— Bridge call replaces Lark;native_query_parsing=Falsenow correctly filters onlyValue.NativeQueryexpressions;BaseExceptionnarrowed toExceptionsetup.py—lark[regex]==1.1.4removed from powerbi extras;mini-racer>=0.12.0added;package_datapoints tobundle.js.gzinstead ofbinaries/*.gz;.gitattributesLFS config removedconfig.py— Added observability counters toPowerBiDashboardSourceReport:m_query_parse_attempts,m_query_parse_successes,m_query_parse_timeouts,m_query_native_query_skipped,m_query_non_mquery_expressions(DAX/empty expressions that were previously silently pre-filtered),m_query_parse_unknown_errors(genuine M-Query failures only),m_query_resolver_successes,m_query_resolver_no_lineage,m_query_resolver_errorsDeleted:
validator.py— Logic re-implemented correctly inparser.pytree_function.py— Replaced byast_utils.pypowerbi-lexical-grammar.rule— No longer neededsea-config.jsonandbinaries/*.gz— Replaced bybundle.js.gz🏗️ Architecture / Design Notes
Key decisions:
--platform=browser --format=iifefor esbuild —py_mini_racer's V8 context has no Node.js globals (exports,require,process). IIFE wraps the bundle in(()=>{...})(), making it self-contained.--platform=nodewould produce CommonJS which V8 can't evaluate.globalThisnotglobal— Bare V8 hasglobalThisbut not Node.js'sglobal. The bridge registersparseExpressiononglobalThissopy_mini_racercan find it.ctx.eval()+promise.get()notctx.call()— Inmini-racer0.14,ctx.call()does not resolve async JS Promises (returns{}instead).ctx.eval()returns aJSPromiseobject; calling.get()blocks until the Promise resolves.py_mini_racerimport —from py_mini_racer import MiniRaceris insideMQueryBridge.__init__(), not at module top. The PowerBI connector can be imported withoutmini-racerinstalled; only instantiation fails, with a clearImportError.bundle.js.gzis decompressed viagzip.decompress()on firstMQueryBridgeinstantiation (~50 ms). No temp files, no cache dirs, no cleanup.ImportErrorwith actionable message > silent degradation.Updating the parser version: Run
build.sh(requires Node.js 16+ on the developer's machine) after bumping@microsoft/powerquery-parserinpackage.json, commit the newbundle.js.gzandpackage-lock.json. No platform-specific builds required.🧪 Testing
Existing suite (unchanged, used as acceptance criteria):
tests/integration/powerbi/test_m_parser.py— 38 M-Query expressions across all supported platforms (Snowflake, MSSQL, PostgreSQL, BigQuery, Redshift, Databricks, Athena, Oracle), native queries,Table.Combine, parameterised queries, DROP stripping, encoding edge cases — 39 passed, 1 xfailedtests/integration/powerbi/test_ingest.py— golden-file end-to-end MCP shape and URN generationNew tests:
tests/unit/test_mquery_bridge.py— bridge lifecycle, valid/invalid expression handling, singleton behaviour; tests hit the real V8 engine (no mocking)tests/unit/test_ast_utils.py— allast_utilshelpers; bridge-backed (parses live via V8 at test time, no static fixtures); covers null literals, empty records, identifier resolution, node-kind searchtests/unit/test_native_query_flag.py— explicitly validates the correctednative_query_parsingsemantics:falsesuppresses onlyValue.NativeQueryexpressions; non-native expressions produce lineage normally📊 Impact Assessment
Affected components:
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/only.powerbi.pyand all callers ofget_upstream_tables()are untouched.Breaking changes:
Performance: See Perf Comparison section below for measured numbers from a production ingestion run. The headline: M-Query parse time drops from 2,816s to 28s (101×), total wall time drops from 62 min to ~27 min (~2.3×), and parse success rate increases from 30.5% to 75.5%. Timeout incidents drop to near zero — the MSFT parser runs in linear time vs. Earley's O(n³). The one-time cost of loading
bundle.js.gz(~50 ms) is paid once per ingestion process, not per expression.Risk level: Medium — core lineage extraction logic is fully rewritten; mitigated by the existing 38-test corpus passing without modification.
🚀 Deployment Notes
No new user-facing runtime dependencies beyond the
powerbiextras.mini-racer>=0.12.0is added to thepowerbiextras insetup.py. It ships pre-built V8 wheels forlinux-x86_64,linux-aarch64,darwin-x86_64,darwin-arm64, andwin32-x86_64from PyPI — no system Node.js required.lark[regex]removed. The Lark grammar dependency is dropped from thepowerbiextras.📈 Performance comparison: Lark vs MSFT
Measured on a production PowerBI ingestion run (~2,900 M-Query expressions).
Speed
Parse quality
Lineage output (semantic diff via
datahub check metadata-diff)