feat(powerbi): replace Lark M-Query parser with Microsoft powerquery-parser #16685
Merged
46 commits
9994f7d chore(powerbi): configure Git LFS for m-query bridge binaries (askumar27)
e7555b1 feat(powerbi): add TypeScript bridge source for MSFT powerquery-parser (askumar27)
e080a41 fix(powerbi): improve index.ts input validation and error handling (askumar27)
ebcf7d7 feat(powerbi): add compiled mquery-parser SEA binaries (askumar27)
85f9fca fix(powerbi): harden build.sh with Node version check and error handling (askumar27)
502a02c feat(powerbi): add MQueryBridge subprocess manager with gzip decompre… (askumar27)
4838876 fix(powerbi): harden _bridge.py error handling and resource management (askumar27)
11b5446 test(powerbi): add AST fixtures generated from MSFT powerquery-parser (askumar27)
b47379f feat(powerbi): add ast_utils.py for NodeIdMap navigation (askumar27)
f42c3ca fix(powerbi): harden ast_utils.py edge case handling (askumar27)
c322ca7 feat(powerbi): rewrite resolver.py to walk powerquery-parser NodeIdMap (askumar27)
16d3fde fix(powerbi): add type annotations and quoted identifier handling in … (askumar27)
139441f feat(powerbi): rewrite pattern_handler.py to use ast_utils NodeIdMap … (askumar27)
4e76543 feat(powerbi): wire bridge into parser.py, fix native_query_parsing flag (askumar27)
b3a6aff fix(powerbi): narrow BaseException to Exception in parser.py resolver… (askumar27)
6be81cc chore(powerbi): remove Lark parser, update packaging for SEA binaries (askumar27)
be769ea docs(powerbi): document native_query_parsing behaviour fix in updatin… (askumar27)
57c4bd4 refactor(powerbi): expose parseExpression global function for PyMiniR… (askumar27)
02396c8 refactor(powerbi): simplify build.sh to esbuild+gzip, commit bundle.j… (askumar27)
9f4675e refactor(powerbi): replace subprocess SEA bridge with py_mini_racer M… (askumar27)
5bdd0fd chore(powerbi): add mini-racer dep, update package_data to bundle.js.… (askumar27)
b8e80c7 refactor(powerbi): update generate_fixtures.py to use _bridge.py dire… (askumar27)
b8e5c81 chore(powerbi): delete SEA binaries and sea-config.json (askumar27)
eda9d3f fix(powerbi): fix mypy errors in _bridge.py and test_native_query_fla… (askumar27)
200fd11 chore(deps): add mini-racer==0.14.1 to constraints.txt (askumar27)
c7d9943 chore(deps): update pyproject.toml and uv.lock for mini-racer, remove… (askumar27)
ed86cc2 docs(powerbi): add PR number and mini-racer dep note to updating-data… (askumar27)
3de26f4 Merge branch 'master' into powerbi_mquery_parser (askumar27)
f307790 fix(powerbi): fix two CI failures after mini-racer migration (askumar27)
7447205 fix(powerbi): replace JSON load_fixture with bridge-backed module fix… (askumar27)
2a7a8c9 fix(powerbi): move ast_utils imports to top, use NodeIdMap alias in f… (askumar27)
f1f12f3 fix(powerbi): delete committed JSON AST fixtures (generated at test t… (askumar27)
e1647b3 fix(powerbi): repurpose generate_fixtures.py as dev-only tool, update… (askumar27)
c01a0f5 fix(powerbi): move index.ts to mquery_bridge root, fix wheel package … (askumar27)
6f965af fix(powerbi): suppress warnings for non-M-Query (DAX/empty) expressio… (askumar27)
3fb0805 fix(powerbi): inline Snowflake M expression and add return types to f… (askumar27)
6869d7e docs(powerbi): improve docstrings and comments across M-Query bridge … (askumar27)
ba85ee5 fix(powerbi): restore partial lineage for unresolved server parameters (askumar27)
52e3617 feat(powerbi): add fine-grained M-Query metrics and debug logging for… (askumar27)
dc085da docs: update breaking changes section for PowerBI M-Query lineage ext… (askumar27)
f8f2e22 chore: merge oss/master into powerbi_mquery_parser (askumar27)
f1ed6cb fix(powerbi): fix critical error handling bugs in M-Query bridge (askumar27)
9b36819 refactor(powerbi): improve error handling in M-Query bridge (askumar27)
a12dc7e fix(powerbi): address review comments on M-Query bridge and pattern h… (askumar27)
ea5a705 Merge branch 'master' into powerbi_mquery_parser (askumar27)
70573e8 fix(powerbi): explicitly close MiniRacer context in _clear_bridge() t… (askumar27)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3,3 +3,4 @@ | |
| *.tsx text eol=lf | ||
| gradlew text eol=lf | ||
| metadata-utils/src/test/resources/filterQuery/* text eol=lf | ||
|
|
||
156 changes: 156 additions & 0 deletions
metadata-ingestion/src/datahub/ingestion/source/powerbi/m_query/_bridge.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,156 @@ | ||
| import gzip | ||
| import json | ||
| import logging | ||
| import threading | ||
| from pathlib import Path | ||
| from typing import Optional | ||
| | ||
| logger = logging.getLogger(__name__) | ||
| | ||
| NodeIdMap = dict[int, dict] | ||
| | ||
| _BUNDLE_PATH = Path(__file__).parent / "mquery_bridge" / "bundle.js.gz" | ||
| | ||
| | ||
| class MQueryBridgeError(RuntimeError): | ||
|     """V8 context error or malformed response from the M-Query bridge.""" | ||
| | ||
|     pass | ||
| | ||
| | ||
| class MQueryParseError(RuntimeError): | ||
|     """Parser returned a structured parse error for a specific expression.""" | ||
| | ||
|     def __init__(self, message: str, expression: str = "") -> None: | ||
|         super().__init__(message) | ||
|         self.expression = expression | ||
| | ||
| | ||
| class MQueryBridge: | ||
|     def __init__(self) -> None: | ||
|         if not _BUNDLE_PATH.exists(): | ||
|             raise ImportError( | ||
|                 f"M-Query bridge bundle not found at {_BUNDLE_PATH}. " | ||
|                 "Re-installing acryl-datahub[powerbi] may fix this." | ||
|             ) | ||
|         try: | ||
|             from py_mini_racer import MiniRacer | ||
|         except ImportError as e: | ||
|             raise ImportError( | ||
|                 "PowerBI M-Query parsing requires 'mini-racer'. " | ||
|                 "Install it with: pip install 'acryl-datahub[powerbi]'" | ||
|             ) from e | ||
| | ||
|         # Decompress bundle.js.gz in memory: fast (~50ms for 500KB) and happens once per process. | ||
|         try: | ||
|             bundle_js = gzip.decompress(_BUNDLE_PATH.read_bytes()).decode("utf-8") | ||
|         except (gzip.BadGzipFile, OSError, EOFError) as e: | ||
|             raise ImportError( | ||
|                 f"M-Query bridge bundle at {_BUNDLE_PATH} appears to be corrupt: {e}. " | ||
|                 "Re-installing acryl-datahub[powerbi] may fix this." | ||
|             ) from e | ||
|         self._ctx = MiniRacer() | ||
|         self._ctx.eval(bundle_js) | ||
| | ||
|     def parse(self, expression: str) -> NodeIdMap: | ||
|         """ | ||
|         Parse an M-Query expression and return a flat node map. | ||
| | ||
|         Each key is a node ID (int); each value is a node dict with at least | ||
|         ``kind`` (NodeKind string) and ``id``. Child nodes are embedded inline, | ||
|         not as ID references, so you can walk them directly or look up any | ||
|         node by ID via the returned map. | ||
| | ||
|         Example: the LetExpression root for ``let x = 1 in x`` is at the | ||
|         root of the returned dict and looks roughly like:: | ||
| | ||
|             {1: {"kind": "LetExpression", "id": 1, "variableList": {...}, ...}, | ||
|              2: {"kind": "ArrayWrapper", "id": 2, ...}, | ||
|              ...} | ||
| | ||
|         Not thread-safe: callers must be single-threaded. | ||
| | ||
|         Raises: | ||
|             MQueryParseError: parser returned a structured error for this expression. | ||
|             MQueryBridgeError: V8 context error or malformed response. | ||
|         """ | ||
|         # JSPromise is available: __init__ already guaranteed py_mini_racer is installed. | ||
|         from py_mini_racer import JSPromise | ||
| | ||
|         try: | ||
|             # parseExpression is async, so ctx.call() returns an unresolved plain dict. | ||
|             # Use ctx.eval() instead, which returns a JSPromise; call .get() to await it. | ||
|             result = self._ctx.eval(f"parseExpression({json.dumps(expression)})") | ||
|             if not isinstance(result, JSPromise): | ||
|                 raise MQueryBridgeError( | ||
|                     f"M-Query bridge: expected JSPromise from parseExpression, got {type(result).__name__}" | ||
|                 ) | ||
|             raw = result.get() | ||
|         except MQueryBridgeError: | ||
|             raise | ||
|         except Exception as e: | ||
|             # Catches all py_mini_racer errors (JSEvalException, JSTimeoutException, etc.) | ||
|             # MiniRacerBaseException is not exported from the top-level namespace in mini-racer. | ||
|             raise MQueryBridgeError(f"M-Query bridge V8 error: {e}") from e | ||
| | ||
|         if not isinstance(raw, str): | ||
|             raise MQueryBridgeError( | ||
|                 f"M-Query bridge returned non-string result: {type(raw).__name__}" | ||
|             ) | ||
| | ||
|         try: | ||
|             result = json.loads(raw) | ||
|         except (json.JSONDecodeError, TypeError) as e: | ||
|             raise MQueryBridgeError( | ||
|                 f"M-Query bridge returned malformed JSON: {e}. Raw: {raw!r}" | ||
|             ) from e | ||
| | ||
|         if not result.get("ok"): | ||
|             raise MQueryParseError( | ||
|                 result.get("error", "unknown error"), expression=expression | ||
|             ) | ||
| | ||
|         node_id_map = result.get("nodeIdMap") | ||
|         if node_id_map is None: | ||
|             raise MQueryBridgeError( | ||
|                 "M-Query bridge returned ok=true but 'nodeIdMap' is missing from response" | ||
|             ) | ||
| | ||
|         return {int(node_id): node for node_id, node in node_id_map} | ||
| | ||
| | ||
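The tail of `parse()` above turns the bridge's JSON reply into either a node map or one of two exception types. A minimal standalone sketch of that envelope handling follows. It assumes the bridge serializes `nodeIdMap` as a list of `[id, node]` pairs, which is the shape the final dict comprehension iterates over but is not confirmed by this diff. `decode_envelope`, `BridgeError`, and `ParseError` are hypothetical names used only for illustration.

```python
import json
from typing import Any


class BridgeError(RuntimeError):
    """Hypothetical stand-in for MQueryBridgeError (transport/format failure)."""


class ParseError(RuntimeError):
    """Hypothetical stand-in for MQueryParseError (expression-level failure)."""


def decode_envelope(raw: Any) -> dict[int, dict]:
    """Validate a bridge reply and return {node_id: node}.

    Assumed reply shape (not confirmed by the diff):
        {"ok": true, "nodeIdMap": [[1, {...}], [2, {...}]]}
        {"ok": false, "error": "message"}
    """
    if not isinstance(raw, str):
        raise BridgeError(f"non-string result: {type(raw).__name__}")
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as e:
        raise BridgeError(f"malformed JSON: {e}") from e
    if not isinstance(payload, dict):
        raise BridgeError(f"expected a JSON object, got {type(payload).__name__}")

    # A well-formed reply with ok=false is a problem with the expression,
    # not with the bridge, so it gets its own exception type.
    if not payload.get("ok"):
        raise ParseError(payload.get("error", "unknown error"))

    pairs = payload.get("nodeIdMap")
    if pairs is None:
        raise BridgeError("ok=true but 'nodeIdMap' is missing")

    # JSON object keys and array items may arrive as strings or numbers;
    # normalize every ID to int.
    return {int(node_id): node for node_id, node in pairs}
```

Keeping the two exception types separate lets a caller skip a single unparseable expression while treating a bridge failure as a reason to reset the V8 context.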
| _bridge_instance: Optional[MQueryBridge] = None | ||
| _bridge_lock = threading.Lock() | ||
| | ||
| | ||
| def get_bridge() -> MQueryBridge: | ||
|     """Return the process-wide MQueryBridge, creating it on first call.""" | ||
|     global _bridge_instance | ||
|     if _bridge_instance is None: | ||
|         with _bridge_lock: | ||
|             if _bridge_instance is None: | ||
|                 _bridge_instance = MQueryBridge() | ||
|     return _bridge_instance | ||
| | ||
| | ||
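`get_bridge()` uses double-checked locking: an unlocked check for the common case, then a second check under the lock so that only one thread constructs the instance. The sketch below isolates that pattern with a hypothetical `ExpensiveResource` class standing in for `MQueryBridge`. The names `get_resource` and `demo` are illustrative and not part of the PR.

```python
import threading
from typing import List, Optional


class ExpensiveResource:
    """Hypothetical stand-in for an object that is costly to construct."""

    instances_created = 0

    def __init__(self) -> None:
        # Only ever runs while _lock is held, so the counter is safe to bump.
        ExpensiveResource.instances_created += 1


_instance: Optional[ExpensiveResource] = None
_lock = threading.Lock()


def get_resource() -> ExpensiveResource:
    """Return the process-wide instance, creating it on first call."""
    global _instance
    if _instance is None:  # fast path: no lock once initialized
        with _lock:
            if _instance is None:  # re-check: another thread may have won the race
                _instance = ExpensiveResource()
    return _instance


def demo(num_threads: int = 8) -> int:
    """Call get_resource() from several threads; return the number of distinct objects seen."""
    seen: List[ExpensiveResource] = []

    def worker() -> None:
        seen.append(get_resource())

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len({id(obj) for obj in seen})
```

The lock guards construction only. As the `parse()` docstring states, calls on the shared instance are not thread-safe, so callers still need to serialize their use of it.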
| def _clear_bridge() -> None: | ||
|     """Drop the singleton so the next call to get_bridge() starts fresh. | ||
| | ||
|     Called after a V8 crash to avoid reusing a broken context, and in tests | ||
|     to ensure each test module gets an isolated bridge. | ||
|     """ | ||
|     global _bridge_instance | ||
|     with _bridge_lock: | ||
|         if _bridge_instance is not None: | ||
|             # Explicitly close the V8 context before dropping the reference. | ||
|             # If we just set _bridge_instance = None here, Python's GC decides | ||
|             # when to finalize the MiniRacer object. If that happens while other | ||
|             # threads are active (e.g. in a later, unrelated test), MiniRacer's | ||
|             # __del__ -> close() path segfaults. Closing synchronously here, while | ||
|             # the lock is held and no concurrent parse() calls are in flight, | ||
|             # shuts down V8 cleanly. | ||
|             try: | ||
|                 _bridge_instance._ctx.close() | ||
|             except Exception: | ||
|                 pass | ||
|             _bridge_instance = None | ||
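The comment in `_clear_bridge()` explains why the context is closed explicitly instead of being left to the garbage collector. A minimal sketch of that close-then-drop reset is shown below, using a hypothetical `FakeContext` in place of the MiniRacer context so that it runs without `mini-racer` installed. `Holder`, `get_holder`, and `clear_holder` are illustrative names.

```python
import threading
from typing import Optional


class FakeContext:
    """Hypothetical stand-in for a native resource such as a V8 context."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class Holder:
    """Owns one context, the way MQueryBridge owns self._ctx."""

    def __init__(self) -> None:
        self.ctx = FakeContext()


_holder: Optional[Holder] = None
_holder_lock = threading.Lock()


def get_holder() -> Holder:
    """Return the shared holder, creating it if needed."""
    global _holder
    with _holder_lock:
        if _holder is None:
            _holder = Holder()
        return _holder


def clear_holder() -> None:
    """Close the native resource synchronously, then drop the singleton."""
    global _holder
    with _holder_lock:
        if _holder is not None:
            try:
                # Release now, at a known point, instead of whenever the
                # garbage collector eventually finalizes the object.
                _holder.ctx.close()
            except Exception:
                pass  # best effort: a failed close must not block the reset
            _holder = None
```

After `clear_holder()`, the next `get_holder()` call builds a fresh holder, and calling `clear_holder()` twice in a row is a no-op the second time.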