[ML] Harden pytorch_inference with TorchScript model graph validation #2936
Merged
edsavage merged 30 commits into elastic:main on Mar 12, 2026
Add a static TorchScript graph validation layer that rejects models containing operations not observed in supported transformer architectures. This reduces the attack surface by ensuring only known-safe operation sets are permitted, complementing the existing Sandbox2/seccomp defenses.
New files:
- CSupportedOperations: allowlist of 71 ops from 10 reference architectures
- CModelGraphValidator: recursive graph walker and validation logic
- CModelGraphValidatorTest: 10 unit tests covering pass/fail/edge cases
- extract_model_ops.py: developer tool to regenerate the allowlist
Relates to elastic/ml-team#1770
Made-with: Cursor
…onfig
- Move script to dev-tools/extract_model_ops/ subdirectory
- Extract REFERENCE_MODELS dict to reference_models.json config file
- Add requirements.txt for virtual environment setup
- Add README.md with setup, usage, and configuration instructions
- Update CSupportedOperations path references
Made-with: Cursor
…ript
- Add all 10 elastic/* models from HuggingFace to reference_models.json
- Make extract_model_ops.py resilient to individual model load/trace failures (continues to next model instead of crashing)
- Add sentencepiece and protobuf to requirements.txt
- Add .gitignore for .venv directory
- Update CSupportedOperations.cc comment with expanded model list
- Op union remains 71 ops (Elastic models use same base architectures)
Made-with: Cursor
Remove bart and elastic/multilingual-e5-small which cannot be traced or scripted with the current transformers/torch versions. Made-with: Cursor
Explain why both a short forbidden list and a broad allowed list are maintained: targeted error messages, safety net against accidental allowlist expansion, and defence-in-depth. Made-with: Cursor
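The dual-list idea described in this commit can be illustrated with a minimal Python sketch. It is not the C++ implementation: the two sets below are small illustrative subsets built from op names mentioned elsewhere in this PR, and the function names `classify_ops` and `describe` are hypothetical.

```python
# Illustrative subsets only; the full lists live in CSupportedOperations.
FORBIDDEN_OPERATIONS = frozenset(
    {"aten::from_file", "prim::CallFunction", "prim::CallMethod"}
)
ALLOWED_OPERATIONS = frozenset(
    {"aten::ones", "aten::rand", "aten::hash", "prim::Loop"}
)

# Safety net against accidental allowlist expansion:
# an op must never appear in both lists.
assert not (FORBIDDEN_OPERATIONS & ALLOWED_OPERATIONS)


def classify_ops(observed):
    """Split observed op names into (forbidden, unrecognised) sorted lists."""
    observed = set(observed)
    forbidden = sorted(observed & FORBIDDEN_OPERATIONS)
    unrecognised = sorted(observed - ALLOWED_OPERATIONS - FORBIDDEN_OPERATIONS)
    return forbidden, unrecognised


def describe(observed):
    """Return a targeted message: forbidden ops are reported before unknown ones."""
    forbidden, unrecognised = classify_ops(observed)
    if forbidden:
        return "forbidden operations: " + ", ".join(forbidden)
    if unrecognised:
        return "unrecognised operations: " + ", ".join(unrecognised)
    return "accepted"
```

Keeping the forbidden set separate means a known-dangerous op produces a specific message rather than being reported as merely unknown.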
Re-ran extraction with torch 2.7.1 (matching the libtorch version linked by ml-cpp) -- op set is identical to the 2.10.0 run. Pin torch version in requirements.txt and fix the comment. Made-with: Cursor
Aids debugging when a legitimate model is unexpectedly rejected after a PyTorch upgrade, and provides an audit trail of what was loaded. Made-with: Cursor
…Method Use torch::jit::Inline() to flatten method calls before collecting operations. This ensures ops hidden behind prim::CallMethod are surfaced for validation. After inlining, prim::CallMethod and prim::CallFunction should not appear; add them to the forbidden list so any unresolvable call is explicitly rejected. Made-with: Cursor
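To show why inlining matters before collecting ops, here is a toy Python model of the idea. It does not use or reproduce torch::jit::Inline; the dictionary-of-methods representation and the `inline` helper are assumptions made purely for illustration.

```python
def inline(methods, entry="forward", _stack=()):
    """Return the flat list of op kinds for `entry`, expanding resolvable calls.

    `methods` maps a method name to a list of (kind, target) pairs, where
    `target` names the callee for prim::CallMethod nodes and is None otherwise.
    """
    if entry in _stack:
        raise ValueError("recursive call to " + entry)
    flat = []
    for kind, target in methods[entry]:
        if kind == "prim::CallMethod" and target in methods:
            # Replace the call node with the callee's own (inlined) ops.
            flat.extend(inline(methods, target, _stack + (entry,)))
        else:
            # Ordinary ops, and calls that cannot be resolved, are kept as-is.
            flat.append(kind)
    return flat


methods = {
    "forward": [("aten::ones", None), ("prim::CallMethod", "helper")],
    "helper": [("aten::from_file", None)],
}
flattened = inline(methods)
```

In this toy example the op hidden inside `helper` becomes visible in `flattened`, while a call that cannot be resolved would remain as `prim::CallMethod` and so be caught by the forbidden list.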
Reject models whose inlined computation graph exceeds 1M nodes. Typical transformer models have O(10k) nodes; the generous limit prevents pathologically crafted models from causing excessive memory or CPU usage during graph traversal. Made-with: Cursor
Construct scriptable modules with define() and validate them through the full CModelGraphValidator pipeline. Covers: a valid module with allowed ops, a module with unrecognised ops, node count tracking, and a parent/child module pair that exercises graph inlining. Made-with: Cursor
Made-with: Cursor
Adds validate_allowlist.py alongside extract_model_ops.py in dev-tools/extract_model_ops/. The script parses ALLOWED_OPERATIONS and FORBIDDEN_OPERATIONS directly from CSupportedOperations.cc, then traces every model in validation_models.json and checks for false positives.
validation_models.json is a superset of reference_models.json that also includes task-specific models (NER, sentiment analysis) matching the bin/pytorch_inference/examples/ test data.
A wrapper script (run_validation.sh) automatically creates the Python venv and installs dependencies on first run. A CMake target is registered for convenient invocation:
    cmake --build <build-dir> -t validate_pytorch_inference_models
Made-with: Cursor
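Parsing the op lists out of the C++ source could look roughly like the sketch below. The layout of CSupportedOperations.cc is not shown in this PR, so `SAMPLE_CC` (including the `TStringSet` type name) is an invented stand-in, and `parse_op_list` and `false_positives` are hypothetical helpers rather than the script's real functions.

```python
import re

# Invented stand-in for the C++ source; the real file layout may differ.
SAMPLE_CC = """
const TStringSet FORBIDDEN_OPERATIONS = {
    "aten::from_file",
    "prim::CallMethod",
};
const TStringSet ALLOWED_OPERATIONS = {
    // tensor creation
    "aten::ones",
    "aten::rand",
};
"""


def parse_op_list(source, name):
    """Extract the quoted op names from the brace-enclosed list called `name`."""
    match = re.search(re.escape(name) + r"\s*=?\s*\{(.*?)\};", source, re.DOTALL)
    if match is None:
        raise ValueError("list not found: " + name)
    return set(re.findall(r'"([A-Za-z_]+::[A-Za-z0-9_]+)"', match.group(1)))


def false_positives(model_ops, allowed, forbidden):
    """Ops from a legitimate model that the validator would reject."""
    return sorted(op for op in model_ops if op not in allowed or op in forbidden)


allowed = parse_op_list(SAMPLE_CC, "ALLOWED_OPERATIONS")
forbidden = parse_op_list(SAMPLE_CC, "FORBIDDEN_OPERATIONS")
```

Reading the lists from the C++ file, rather than duplicating them in Python, keeps the check tied to what the binary actually enforces.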
Extend the allowlist validation to cover models directly referenced in the Elasticsearch repo and its eland import tool: the packaged multilingual-e5-small, the cross-encoder reranker from the docs, the sentence-transformers embedding model from eland tests, and the DPR question encoder. All 24 models pass validation with no false positives. Made-with: Cursor
Extract the base64-encoded TorchScript models from PyTorchModelIT, TextExpansionQueryIT, and TextEmbeddingQueryIT in the Elasticsearch repo and validate them against our operation allowlist. These toy models use basic ops (aten::ones, aten::rand, aten::hash, prim::Loop, etc.) that weren't in the transformer-derived allowlist, so add them. All are safe tensor/control-flow operations with no I/O capability. The validation script now accepts --pt-dir to validate pre-saved .pt files alongside HuggingFace models. The CMake target passes the new es_it_models directory automatically. Made-with: Cursor
Create six malicious .pt model fixtures that exercise specific attack vectors the CModelGraphValidator must detect:
- malicious_file_reader: uses aten::from_file to read arbitrary files
- malicious_mixed_file_reader: hides aten::from_file among allowed ops
- malicious_hidden_in_submodule: buries unrecognised ops 3 levels deep
- malicious_conditional: hides unrecognised ops inside if-branches
- malicious_many_unrecognised: uses sin/cos/tan/exp (unknown arch)
- malicious_file_reader_in_submodule: forbidden op hidden in child module
Each test loads the real .pt file via torch::jit::load and verifies the validator correctly identifies and rejects it. Includes the Python generator script for reproducibility.
Made-with: Cursor
Replace the bash wrapper script with cmake/run-validation.cmake that works across all CI platforms (Linux, macOS, Windows). The CMake script searches for python3, python3.12, python3.11, python3.10, python3.9, and python — handling Linux build machines where Python is only available as python3.12 (via make altinstall) and Windows where the canonical name is python. It also prepends the venv's torch/lib directory to the dynamic library search path to avoid conflicts with any system-installed libtorch. Made-with: Cursor
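The interpreter search and library-path handling described above are implemented in CMake; the following Python sketch only mirrors the logic for illustration. The function names `find_python` and `prepend_library_path` are hypothetical, and the candidate order is taken from the commit message.

```python
import os
import shutil

# Candidate interpreter names, in the order listed in the commit message.
CANDIDATES = ("python3", "python3.12", "python3.11", "python3.10", "python3.9", "python")


def find_python(which=shutil.which, candidates=CANDIDATES):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


def prepend_library_path(env, torch_lib_dir, var="LD_LIBRARY_PATH"):
    """Return a copy of `env` with `torch_lib_dir` first in the search path `var`."""
    updated = dict(env)
    existing = updated.get(var)
    if existing:
        updated[var] = torch_lib_dir + os.pathsep + existing
    else:
        updated[var] = torch_lib_dir
    return updated
```

Passing `which` as a parameter is a choice made here so the lookup can be exercised without depending on what is installed on the machine.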
Add the Python allowlist validation as a step in test_all_parallel (used by CI) and precommit (used by developers). Both use OPTIONAL=TRUE so the validation is gracefully skipped with a warning when Python 3 is not available or pip cannot install dependencies (e.g. in Docker containers without network access). The standalone validate_pytorch_inference_models target remains a hard failure for explicit use. Made-with: Cursor
✅ Snyk checks have passed. No issues have been found so far.
Made-with: Cursor
…gram Made-with: Cursor
Replace relative "../Foo.h" includes with <Foo.h> by adding the parent source directory to the test target's include path. Also remove unnecessary backslash escapes in extract_model_ops README. Made-with: Cursor
…sion Made-with: Cursor
Deduplicate collect_graph_ops, graph inlining, and HuggingFace model loading/tracing logic shared between extract_model_ops.py and validate_allowlist.py into a common module. Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
- Check MAX_NODE_COUNT during graph traversal to prevent resource exhaustion on pathologically large models (bail out immediately in collectBlockOps and collectModuleOps).
- Two-pass validation: check forbidden ops first, skip unrecognised op scan when forbidden ops are found.
- Add aten::as_strided to FORBIDDEN_OPERATIONS (key enabler of heap-leak and ROP chain attacks).
- Change LOG_FATAL to HANDLE_FATAL in the c10::Error catch block so an exception during validation terminates the process.
- Fix CHANGELOG asciidoc link syntax.
- Move generate_malicious_models.py to dev-tools/.
- Remove redundant Python test scripts now that C++ integration tests cover the same attack models.
- Remove PR cross-references from comments per reviewer request.
Made-with: Cursor
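The first two items (bailing out during traversal and the two-pass check) can be sketched in Python over a toy graph. This is an assumption-laden illustration, not the C++ code: nodes are plain dicts with a `kind` and optional nested `blocks`, and `collect_ops`, `validate`, and `GraphTooLarge` are hypothetical names.

```python
MAX_NODE_COUNT = 1_000_000


class GraphTooLarge(Exception):
    """Raised as soon as the traversal has seen more nodes than the limit."""


def collect_ops(nodes, ops=None, count=0, limit=MAX_NODE_COUNT):
    """Walk nodes and their nested blocks; return (set of op kinds, node count)."""
    if ops is None:
        ops = set()
    for node in nodes:
        count += 1
        if count > limit:
            # Bail out during the walk instead of after collecting everything.
            raise GraphTooLarge("graph exceeds %d nodes" % limit)
        ops.add(node["kind"])
        for block in node.get("blocks", ()):
            ops, count = collect_ops(block, ops, count, limit)
    return ops, count


def validate(nodes, allowed, forbidden, limit=MAX_NODE_COUNT):
    """Pass 1: forbidden ops. Pass 2 (unrecognised ops) runs only if pass 1 is clean."""
    ops, count = collect_ops(nodes, limit=limit)
    found_forbidden = sorted(ops & forbidden)
    unrecognised = [] if found_forbidden else sorted(ops - allowed)
    return {
        "valid": not found_forbidden and not unrecognised,
        "forbidden": found_forbidden,
        "unrecognised": unrecognised,
        "node_count": count,
    }


# Toy graph: a conditional whose branches hide one forbidden and one unknown op.
graph = [
    {"kind": "prim::If", "blocks": [[{"kind": "aten::from_file"}], [{"kind": "aten::sin"}]]}
]
result = validate(graph, allowed={"prim::If"}, forbidden={"aten::from_file"})
```

Here `result` reports only the forbidden op, because the unrecognised-op scan is skipped once a forbidden op has been found.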
Add a C++ test (testAllowlistCoversReferenceModels) that loads a
golden JSON file containing per-architecture TorchScript op sets
extracted from 18 reference HuggingFace models and verifies every
op is in ALLOWED_OPERATIONS and none are in FORBIDDEN_OPERATIONS.
This catches allowlist regressions in CI without requiring Python
or network access. When PyTorch is upgraded, regenerate the golden
file with:
python3 extract_model_ops.py --golden \
bin/pytorch_inference/unittest/testfiles/reference_model_ops.json
The --golden flag is a new addition to extract_model_ops.py that
outputs per-model op sets as structured JSON.
Made-with: Cursor
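The golden-file check is written in C++, but its logic can be sketched in Python. The JSON layout assumed here (a mapping from model name to a list of op names) and the model names are illustrative guesses, and `check_golden` is a hypothetical helper.

```python
import json


def check_golden(golden_text, allowed, forbidden):
    """Return {model: problems} for every model whose ops the lists would reject."""
    failures = {}
    for model, ops in json.loads(golden_text).items():
        not_allowed = sorted(set(ops) - allowed)
        banned = sorted(set(ops) & forbidden)
        if not_allowed or banned:
            failures[model] = {"not_allowed": not_allowed, "forbidden": banned}
    return failures


# Assumed file shape: per-model op sets keyed by a (made-up) model name.
GOLDEN = json.dumps(
    {
        "example-model-a": ["aten::ones", "aten::rand"],
        "example-model-b": ["aten::ones", "aten::sin"],
    }
)

failures = check_golden(
    GOLDEN,
    allowed={"aten::ones", "aten::rand"},
    forbidden={"aten::from_file"},
)
```

An empty result would mean the allowlist still covers every reference model, which is the regression the C++ test guards against.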
Made-with: Cursor
valeriy42
approved these changes
Mar 12, 2026
Contributor
valeriy42
left a comment
LGTM. Great work!
I left only one comment regarding automatic running against the PyTorch edge branch.
Contributor
@edsavage, although it's an enhancement, for obvious reasons, I added backporting to all supported versions.
Contributor
Author
Yes, that's a really good idea, I'll look into doing that in a separate PR.
edsavage added a commit to edsavage/ml-cpp that referenced this pull request on Mar 12, 2026
…elastic#2936) Add a static TorchScript graph validation layer that rejects models containing operations not observed in supported transformer architectures. This reduces the attack surface by ensuring only known-safe operation sets are permitted, complementing the existing Sandbox2/seccomp defenses. (cherry picked from commit 38f6653)
edsavage added a commit that referenced this pull request on Mar 12, 2026
…idation (#2936) (#2988) Add a static TorchScript graph validation layer that rejects models containing operations not observed in supported transformer architectures. This reduces the attack surface by ensuring only known-safe operation sets are permitted, complementing the existing Sandbox2/seccomp defenses. Backports #2936
edsavage added a commit that referenced this pull request on Mar 12, 2026
…#2936) (#2987) Add a static TorchScript graph validation layer that rejects models containing operations not observed in supported transformer architectures. This reduces the attack surface by ensuring only known-safe operation sets are permitted, complementing the existing Sandbox2/seccomp defenses. Backports #2936
edsavage added a commit that referenced this pull request on Mar 13, 2026
…#2936) (#2986) Add a static TorchScript graph validation layer that rejects models containing operations not observed in supported transformer architectures. This reduces the attack surface by ensuring only known-safe operation sets are permitted, complementing the existing Sandbox2/seccomp defenses. Backports #2936
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request on Mar 13, 2026
…lidation (elastic#2936)" This reverts commit 38f6653.
edsavage added a commit to edsavage/ml-cpp that referenced this pull request on Mar 15, 2026
This reverts commit 4f1ec3e.
Summary
Implements security hardening for pytorch_inference by validating TorchScript model graphs before execution, addressing elastic/ml-team#1770.

Model Graph Validation (C++)
- CModelGraphValidator: Validates TorchScript model graphs by inlining all method calls (torch::jit::Inline) and recursively inspecting every node (including sub-blocks inside prim::If/prim::Loop).
- CSupportedOperations: Defines a dual-list security model:
  - FORBIDDEN_OPERATIONS (4 ops): aten::execute_with_args, aten::from_file, prim::CallFunction, prim::CallMethod; rejected immediately with a clear error.
  - ALLOWED_OPERATIONS (82 ops): Exhaustive allowlist of safe tensor/control-flow ops derived from tracing reference models against PyTorch 2.7.1.
- … (MAX_NODE_COUNT = 1,000,000): Guards against resource exhaustion from excessively large graphs.
- … DEBUG level during validation.

Operation Allowlist Tooling (Python)
- dev-tools/extract_model_ops/: Self-contained tooling directory with:
  - extract_model_ops.py: generates the C++ allowlist from reference HuggingFace models
  - validate_allowlist.py: integration test verifying no false positives against 24 HuggingFace models + 3 Elasticsearch integration test models
  - reference_models.json / validation_models.json: model configurations
  - es_it_models/: extracted .pt models from Elasticsearch's PyTorchModelIT, TextExpansionQueryIT, TextEmbeddingQueryIT
  - requirements.txt: pinned to torch==2.7.1 matching the libtorch build version

CI Integration
- cmake/run-validation.cmake: Portable CMake script that locates Python 3 (searching python3, python3.12, ..., python), manages a virtual environment, handles DYLD_LIBRARY_PATH/LD_LIBRARY_PATH for libtorch conflicts, and runs the validation. Supports OPTIONAL=TRUE for graceful skip when Python or network is unavailable.
- … test_all_parallel and precommit with OPTIONAL=TRUE: runs automatically when Python is available, skips with a warning otherwise (e.g. in Docker containers without network).
- … validate_pytorch_inference_models available for explicit verification (hard failure mode).

C++ Tests
- … (CModelGraphValidatorTest.cc): Tests for allowed/forbidden/unrecognised ops, graph inlining, node count enforcement, and integration tests using torch::jit::Module::define().
- … .pt fixtures testing detection of aten::from_file, hidden ops in submodules, conditional branches, and mixed scenarios.

Test plan
- … (cmake --build ... -t test)
- CModelGraphValidator tests pass (50 pytorch_inference test cases)
- … (.pt)
- OPTIONAL=TRUE gracefully skips when Python unavailable

Made with Cursor