-
Notifications
You must be signed in to change notification settings - Fork 161
Add llm_ptq PR test, Cleanup dependency installation in modelopt docker #338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds a new GPU-enabled E2E GitHub Actions workflow with PR change detection and gating, adjusts CI triggers/timeouts, switches Docker builds to install the local repository and expands CUDA archs, moves Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Dev as Developer
participant GH as GitHub
participant WF as "E2E Example tests"
participant CF as check-file-changes
participant WC as wait-checks
participant PRJ as example-tests-pr
participant NPJ as example-tests-non-pr
participant RC as example-pr-required-check
Dev->>GH: Push / PR / Manual dispatch
GH->>WF: Trigger workflow
rect rgba(230,240,255,0.5)
note right of WF: Change detection gate
WF->>CF: compute merge-base & detect changed files
CF-->>WF: any_changed = true/false
end
alt any_changed == true
rect rgba(235,255,235,0.5)
note over WF,WC: Wait for prerequisite checks (DCO or linux)
WF->>WC: await match_pattern ^DCO$|^linux$
WC-->>WF: prereqs passed/failed
end
alt Context is PR and prereqs passed
WF->>PRJ: run GPU container tests (matrix: llm_ptq)
PRJ-->>WF: results
else Context is non-PR
WF->>NPJ: run GPU container tests (matrix: llm_ptq)
NPJ-->>WF: results
end
else any_changed == false
note right of WF: Skip example test jobs
end
rect rgba(255,245,230,0.6)
note right of RC: Required-check gate enforces PR status
WF->>RC: evaluate outcomes
RC-->>GH: report pass/fail status
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #338 +/- ##
=======================================
Coverage 73.82% 73.82%
=======================================
Files 172 172
Lines 17438 17438
=======================================
Hits 12874 12874
Misses 4564 4564 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (8)
examples/pruning/cifar_resnet.ipynb (1)
25-33
: Make the install cell idempotent and dev‑env friendly.Avoid reinstalling over editable installs; gate on missing packages.
Apply:
-! pip install nvidia-modelopt torchvision +import importlib.util, subprocess, sys +missing = [m for m in ("modelopt", "torchvision") if importlib.util.find_spec(m) is None] +if missing: + subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "nvidia-modelopt", "torchvision"]) +else: + print("Dependencies already present; skipping install")README.md (1)
69-73
: Provide HTTPS clone as default and keep SSH as optional.SSH requires keys; HTTPS works for most users.
- git clone [email protected]:NVIDIA/TensorRT-Model-Optimizer.git + # HTTPS (recommended for most users) + git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git + # Or SSH (requires GitHub SSH keys) + # git clone [email protected]:NVIDIA/TensorRT-Model-Optimizer.gitexamples/onnx_ptq/docker/Dockerfile (2)
8-12
: Drop SSH host scan unless you actually fetch via SSH.It adds network dependency and build variability.
- && mkdir -p -m 0600 ~/.ssh \ - && ssh-keyscan github.com >> ~/.ssh/known_hosts + # SSH known_hosts not needed since we don't fetch via SSH
28-32
: Tighten pip steps and validate env — TensorRT Python not presentRunning
import tensorrt
raised ModuleNotFoundError — the Python TensorRT package isn't installed in the current environment, so duplication with the base image was not observed.File: examples/onnx_ptq/docker/Dockerfile Lines: 28-32
-RUN pip install -r TensorRT-Model-Optimizer/examples/onnx_ptq/requirements.txt +RUN pip install --no-cache-dir -r TensorRT-Model-Optimizer/examples/onnx_ptq/requirements.txt && pip check @@ -RUN pip install -e "./TensorRT-Model-Optimizer[hf,onnx]" +RUN pip install --no-cache-dir -e "./TensorRT-Model-Optimizer[hf,onnx]" && pip checkdocs/source/getting_started/_installation_for_Linux.rst (1)
40-41
: Minor grammar/clarity."examples's" → "examples'"; also link to a concrete file for discoverability.
- dependencies pre-installed. You may need to install additional dependencies from the examples's `requirements.txt` file. + dependencies pre-installed. You may need to install additional dependencies from the examples' `requirements.txt` file (e.g., ``examples/llm_ptq/requirements.txt``)..github/workflows/example_tests.yml (3)
63-65
: Runner label is custom; configure actionlint or add a comment.actionlint warns on unknown label. If you lint this repo, add the label to actionlint.yaml or gate actionlint for this workflow.
75-81
: Install order OK; consider caching to speed up.You already use a proxy cache. Optionally add “pip install -U pip wheel” first and “--upgrade” to ensure fresh extras.
- pip install ".[all]" + python -m pip install -U pip wheel + pip install -U ".[all]"
82-91
: YAML anchors trip actionlint; inline container/steps for non‑PR job.GitHub Actions supports anchors, but actionlint’s diagnostics indicate it flags alias usage for mapping nodes here. To keep lint green, duplicate the mapping and steps.
example-tests-non-pr: if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }} # Runner list at https://github.com/nv-gha-runners/enterprise-runner-configuration/blob/main/docs/runner-groups.md runs-on: linux-amd64-gpu-h100-latest-1 timeout-minutes: 90 strategy: matrix: EXAMPLE: [llm_ptq] - container: *example_container - steps: *example_steps + container: + image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2 + env: + LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" + PATH: "/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}" + PIP_CONSTRAINT: "" + steps: + - uses: actions/checkout@v4 + - uses: nv-gha-runners/setup-proxy-cache@main + - name: Run example tests + run: | + python -m pip install -U pip wheel + pip install -U ".[all]" + if [ -f examples/$EXAMPLE/requirements.txt ]; then pip install -r examples/$EXAMPLE/requirements.txt; fi + pytest -s tests/examples/$EXAMPLE
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
.github/workflows/example_tests.yml
(1 hunks).github/workflows/gpu_tests.yml
(0 hunks).github/workflows/unit_tests.yml
(0 hunks).gitlab/tests.yml
(1 hunks)README.md
(1 hunks)docker/Dockerfile
(2 hunks)docs/source/getting_started/_installation_for_Linux.rst
(2 hunks)examples/cnn_qat/requirements.txt
(1 hunks)examples/llm_eval/requirements.txt
(1 hunks)examples/onnx_ptq/docker/Dockerfile
(1 hunks)examples/onnx_ptq/requirements.txt
(1 hunks)examples/pruning/cifar_resnet.ipynb
(1 hunks)setup.py
(1 hunks)
💤 Files with no reviewable changes (2)
- .github/workflows/unit_tests.yml
- .github/workflows/gpu_tests.yml
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/example_tests.yml
63-63: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
82-82: "steps" section is missing in job "example-tests-non-pr"
(syntax-check)
85-85: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
90-90: "container" section is alias node but mapping node is expected
(syntax-check)
91-91: "steps" section must be sequence node but got alias node with "" tag
(syntax-check)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (9)
setup.py (1)
81-82
: LGTM: move torchvision out of base install.This aligns with per‑example deps and reduces accidental torch churn from base installs.
Please confirm CI images satisfy torch>=2.6 to avoid resolver surprises when examples add torchvision.
examples/llm_eval/requirements.txt (1)
7-7
: Keep torchvision — it's used in examples/llm_eval/modeling.py (import:from torchvision.datasets.utils import download_url
at line 61).Likely an incorrect or invalid review comment.
examples/onnx_ptq/requirements.txt (1)
5-5
: Keep torchvision in examples/onnx_ptq/requirements.txt — it's required.Imported in examples/onnx_ptq/evaluation.py (import torchvision.transforms as transforms; from torchvision.datasets import ImageNet) and examples/onnx_ptq/evaluate.py (from torchvision.datasets import ImageNet). Keep the dependency.
examples/cnn_qat/requirements.txt (1)
1-1
: Pin torchvision to a torch‑compatible range (or verify in CI)examples/cnn_qat/requirements.txt:1 lists unpinned
torchvision
. I ran the suggested verification but it failed with: ModuleNotFoundError: No module named 'torch' — verification couldn't be completed in this environment. Either pintorchvision
to the Torch used in CI (to avoid CUDA/Torch mismatches) or re-run the check inside CI/an environment withtorch
installed and add a post-install pairing check.docker/Dockerfile (1)
26-26
: Precompilation step: good call.Precompiling extensions avoids per-run latency. Keep.
Consider caching pip wheels in the image (or via a proxy cache) to further reduce CI runtimes.
docs/source/getting_started/_installation_for_Linux.rst (2)
15-26
: Version matrix needs verification against current releases.Please confirm Python (>=3.10,<3.13), CUDA (>=12.0), PyTorch (>=2.6), TensorRT-LLM (1.1.0rc2.post2), ONNX Runtime (1.22), TensorRT (>=10.0) are accurate minima and reflect what ModelOpt tests against today.
I can add a small “Tested with” table if you share the CI matrix.
165-168
: Good: document precompilation.This matches the Dockerfile step and sets user expectations.
.github/workflows/example_tests.yml (2)
69-74
: Container env and PIP_CONSTRAINT reset look right.Helps avoid pin conflicts when upgrading deps in job.
93-100
: Required-check gate looks good.Enforces failure when tests are expected but skipped/failed.
f83a0fe
to
d44fb7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (2)
docker/Dockerfile (2)
21-27
: Don’t preinstall example deps in the base image; it contradicts PR goals.This bakes llm_ptq deps globally and can clash with user torch/torchvision installs. Install at runtime in workflows/examples only.
RUN pip install -e "./TensorRT-Model-Optimizer[all,dev-test]" RUN rm -rf TensorRT-Model-Optimizer/.git RUN python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()" -RUN pip install -r TensorRT-Model-Optimizer/examples/llm_ptq/requirements.txt
7-7
: Fix invalid TORCH_CUDA_ARCH_LIST entry (“12.0+PTX”) and stray “10.0”.PyTorch expects SM versions; “12.0+PTX” is not a valid SM, and adding non‑supported SMs slows/ breaks builds. Keep only supported SMs and add +PTX to the highest.
- TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0 12.0+PTX" + TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0+PTX"If you truly target SM 10.0, confirm toolchain support and use
10.0+PTX
as the highest instead of12.0+PTX
.
🧹 Nitpick comments (3)
.github/workflows/code_quality.yml (1)
18-18
: Timeout bump looks fine; consider pip caching to keep it under 30 mins.Add setup-python’s pip cache to speed up tox/pre-commit runs and reduce flakiness.
- uses: actions/setup-python@v5 with: python-version: "3.12" + cache: 'pip' + cache-dependency-path: | + requirements*.txt + pyproject.toml + tox.inidocker/Dockerfile (1)
3-6
: Optional: enable pip no‑cache to reduce layer size.You set
PIP_NO_CACHE_DIR=off
. Consider enabling for image builds.- PIP_NO_CACHE_DIR=off \ + PIP_NO_CACHE_DIR=on \.github/workflows/example_tests.yml (1)
71-74
: Consider exporting PATH to TRT bin if tests need CLI tools.If tests call trtexec or other TRT binaries, uncomment PATH; otherwise this is fine.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
.github/workflows/code_quality.yml
(1 hunks).github/workflows/example_tests.yml
(1 hunks).github/workflows/gpu_tests.yml
(0 hunks).github/workflows/pages.yml
(1 hunks).github/workflows/unit_tests.yml
(0 hunks).gitlab/tests.yml
(1 hunks)docker/Dockerfile
(2 hunks)tests/examples/speculative_decoding/test_medusa.py
(0 hunks)
💤 Files with no reviewable changes (3)
- tests/examples/speculative_decoding/test_medusa.py
- .github/workflows/gpu_tests.yml
- .github/workflows/unit_tests.yml
✅ Files skipped from review due to trivial changes (1)
- .github/workflows/pages.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- .gitlab/tests.yml
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/example_tests.yml
63-63: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
82-82: "steps" section is missing in job "example-tests-non-pr"
(syntax-check)
85-85: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
90-90: "container" section is alias node but mapping node is expected
(syntax-check)
91-91: "steps" section must be sequence node but got alias node with "" tag
(syntax-check)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (4)
.github/workflows/example_tests.yml (4)
79-81
: Runtime example dep install matches PR intent. LGTM.Installs project first, then per‑example requirements; aligns with avoiding baked deps.
If multiple requirements.txt files exist under an example, confirm install order doesn’t cause conflicts.
92-100
: Required‑check guard is solid.The gate fails the PR when relevant changes occur but GPU tests didn’t run/pass. Good safety net.
63-64
: Confirm runner label exists or add it to actionlint config.No actionlint config found in .github; workflows reference runs-on: linux-amd64-gpu-h100-latest-1 — ensure that self-hosted runner label exists in your org or add it to .github/actionlint.yaml to silence actionlint warnings.
Locations:
- .github/workflows/example_tests.yml — runs-on: linux-amd64-gpu-h100-latest-1 (lines 63, 85)
- .github/workflows/gpu_tests.yml — runs-on: linux-amd64-gpu-h100-latest-1 (line 79)
57-58
: Gate waits verified — reusable workflow present and pattern matches job names.
- .github/workflows/_wait_for_checks.yml exists and declares workflow_call inputs: match_pattern and delay.
- example_tests.yml, gpu_tests.yml and unit_tests.yml call it with match_pattern '^DCO$|^linux$' (unit_tests defines the linux job and a check-dco that matches ^DCO$).
- _wait_for_checks.yml does not export outputs — not required here because callers only use needs to depend on the job, not consume outputs.
d44fb7e
to
c01a56f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (4)
docker/Dockerfile (3)
7-7
: Fix TORCH_CUDA_ARCH_LIST; "12.0+PTX" is not a valid SM and only the highest SM should have +PTX.
Use SMs (8.0/8.6/8.7/8.9/9.0). Add 10.0 only if your toolchain supports it.- TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0 12.0+PTX" + TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0+PTX" +# If SM 10.0 is supported end-to-end, use: +# TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0+PTX"
27-27
: Remove baked llm_ptq requirements; install at runtime in workflows/examples.
This contradicts “install example deps at runtime” and can clash with user torch builds.-RUN pip install -r TensorRT-Model-Optimizer/examples/llm_ptq/requirements.txt +# Example-specific dependencies are installed at runtime in CI or example entrypoints.
29-30
: Avoid chmod 777; switch to non-root user with proper ownership.
777 is a security smell and breaks umask expectations.-# Allow users to run without root -RUN chmod -R 777 /workspace +# Allow users to run without root (least privilege) +RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /workspace +USER appuser.github/workflows/example_tests.yml (1)
82-91
: Inline anchors for container/steps in non‑PR job; actionlint flags alias/syntax errors.
Replace YAML aliases with concrete blocks.example-tests-non-pr: if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }} # Runner list at https://github.com/nv-gha-runners/enterprise-runner-configuration/blob/main/docs/runner-groups.md runs-on: linux-amd64-gpu-h100-latest-1 timeout-minutes: 90 strategy: matrix: EXAMPLE: [llm_ptq] - container: *example_container - steps: *example_steps + container: + image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2 + env: + LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib" + PIP_CONSTRAINT: "" + steps: + - uses: actions/checkout@v4 + - uses: nv-gha-runners/setup-proxy-cache@main + - name: Run example tests + run: | + pip install ".[all,dev-test]" + find examples/$EXAMPLE -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done + pytest -s tests/examples/$EXAMPLE
🧹 Nitpick comments (2)
.github/workflows/pages.yml (1)
26-26
: Timeout bump acknowledged; add pip cache to reduce need for longer timeouts.
Caching pip deps will speed up tox builds and may let you keep the old timeout.- - uses: actions/setup-python@v5 - with: - python-version: "3.12" + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + cache: pip + cache-dependency-path: tox.ini.github/workflows/example_tests.yml (1)
71-74
: Env var expansion won’t happen here; LD_LIBRARY_PATH will literally contain “${LD_LIBRARY_PATH}”.
Set the full value explicitly or export in a run step.- LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" + LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
.github/workflows/code_quality.yml
(1 hunks).github/workflows/example_tests.yml
(1 hunks).github/workflows/gpu_tests.yml
(0 hunks).github/workflows/pages.yml
(1 hunks).github/workflows/unit_tests.yml
(0 hunks).gitlab/tests.yml
(1 hunks)docker/Dockerfile
(2 hunks)tests/examples/speculative_decoding/test_medusa.py
(0 hunks)
💤 Files with no reviewable changes (3)
- .github/workflows/gpu_tests.yml
- tests/examples/speculative_decoding/test_medusa.py
- .github/workflows/unit_tests.yml
🚧 Files skipped from review as they are similar to previous changes (2)
- .gitlab/tests.yml
- .github/workflows/code_quality.yml
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/example_tests.yml
63-63: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
82-82: "steps" section is missing in job "example-tests-non-pr"
(syntax-check)
85-85: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
90-90: "container" section is alias node but mapping node is expected
(syntax-check)
91-91: "steps" section must be sequence node but got alias node with "" tag
(syntax-check)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (1)
.github/workflows/example_tests.yml (1)
79-81
: CI installs example deps at runtime (good); align Dockerfile accordingly.
Keep this pattern; remove baked example deps and dev-test extras from the base image as suggested.
docker/Dockerfile
Outdated
pip install -r "$req_file" || exit 1; \ | ||
done | ||
RUN python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()" | ||
RUN pip install -r TensorRT-Model-Optimizer/examples/llm_ptq/requirements.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can skip this too. If we start with the TRT LLM docker, most likely the dependencies are met
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Maybe we can reconsider our docker strategy. Should we just not provide a separate dockerfile and directly use a trtllm docker?
Only change on top of trtllm is setting some env vars (for Onnx TensorRT) and install libgl1
which was causing some import errors previously as some other examples may need it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That may work as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll work on dockerfile removal as a separate PR after merging this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM;
c01a56f
to
11835b1
Compare
Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: Keval Morabia <[email protected]>
11835b1
to
b423dbb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
.github/workflows/example_tests.yml (1)
82-91
: Inline container/steps for non‑PR job to satisfy actionlint (anchors here trip parser).actionlint flags “container alias mapping expected” and “steps must be sequence” for the aliases. Duplicate the blocks inline for
example-tests-non-pr
.example-tests-non-pr: if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }} # Runner list at https://github.com/nv-gha-runners/enterprise-runner-configuration/blob/main/docs/runner-groups.md - runs-on: linux-amd64-gpu-h100-latest-1 + runs-on: linux-amd64-gpu-h100-latest-1 timeout-minutes: 120 strategy: matrix: EXAMPLE: [llm_ptq] - container: *example_container - steps: *example_steps + container: + image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2 + env: + LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" + # PATH: "/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}" + PIP_CONSTRAINT: "" + steps: + - uses: actions/checkout@v4 + - uses: nv-gha-runners/setup-proxy-cache@main + - name: Run example tests + run: | + pip install ".[all,dev-test]" + find examples/${{ matrix.EXAMPLE }} -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done + pytest -s tests/examples/${{ matrix.EXAMPLE }}
🧹 Nitpick comments (4)
.github/workflows/example_tests.yml (4)
63-63
: actionlint “unknown runner label”: add linter config or include self‑hosted labels.Either:
- Add an actionlint config whitelisting the custom label; or
- Use a label array including self-hosted + linux to be explicit.
Option A (add config file at .github/actionlint.yaml):
runnerLabels: - self-hosted - linux - linux-amd64-gpu-h100-latest-1Option B (explicit labels):
- runs-on: linux-amd64-gpu-h100-latest-1 + runs-on: [self-hosted, linux, linux-amd64-gpu-h100-latest-1]Run actionlint with the config to confirm it’s clean.
Also applies to: 85-85
69-74
: LD_LIBRARY_PATH/PATH expansion in container.env won’t interpolate shell vars.In the container env block,
${LD_LIBRARY_PATH}
and${PATH}
are treated as literals, not expanded. Either drop the suffix or export/merge in a step.Minimal safe fix: set in a step before installs so you preserve existing values:
container: &example_container image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2 - env: - LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" - # PATH: "/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}" - PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages + env: + PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages @@ steps: &example_steps - uses: actions/checkout@v4 - uses: nv-gha-runners/setup-proxy-cache@main + - name: Merge env paths + shell: bash + run: | + echo "LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" >> "$GITHUB_ENV" + echo "PATH=/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}" >> "$GITHUB_ENV"Apply the same change in the non‑PR job if you inline steps there.
36-48
: Broaden change detection to catch packaging/test dependency edits.Include pyproject.toml and example requirements variants so PR gating doesn’t miss impactful changes.
files: | .github/workflows/example_tests.yml examples/llm_ptq/** modelopt/torch/** tests/examples/llm_ptq/** setup.py + pyproject.toml + examples/**/requirements*.txt
75-81
: Harden steps: pin external actions, upgrade pip toolchain, fail fast.
- Pin nv-gha-runners action to a commit SHA (avoid @main drift).
- Upgrade pip/setuptools/wheel before installing.
- Prefer
python -m pip
andset -euo pipefail
.- uses: actions/checkout@v4 - - uses: nv-gha-runners/setup-proxy-cache@main + - uses: nv-gha-runners/setup-proxy-cache@<commit-sha> - name: Run example tests - run: | - pip install ".[all,dev-test]" - find examples/${{ matrix.EXAMPLE }} -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done - pytest -s tests/examples/${{ matrix.EXAMPLE }} + shell: bash + run: | + set -euo pipefail + python -m pip install -U pip setuptools wheel + python -m pip install ".[all,dev-test]" + find examples/${{ matrix.EXAMPLE }} -name "requirements.txt" | while read req_file; do python -m pip install -r "$req_file" || exit 1; done + pytest -s tests/examples/${{ matrix.EXAMPLE }}If you want, I can resolve the exact commit SHA and raise a follow-up PR to pin it.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
.github/workflows/code_quality.yml
(1 hunks).github/workflows/example_tests.yml
(1 hunks).github/workflows/gpu_tests.yml
(0 hunks).github/workflows/pages.yml
(1 hunks).github/workflows/unit_tests.yml
(0 hunks).gitlab/tests.yml
(1 hunks)docker/Dockerfile
(2 hunks)tests/examples/speculative_decoding/test_medusa.py
(0 hunks)
💤 Files with no reviewable changes (3)
- .github/workflows/unit_tests.yml
- .github/workflows/gpu_tests.yml
- tests/examples/speculative_decoding/test_medusa.py
🚧 Files skipped from review as they are similar to previous changes (4)
- .github/workflows/pages.yml
- .github/workflows/code_quality.yml
- .gitlab/tests.yml
- docker/Dockerfile
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/example_tests.yml
63-63: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
82-82: "steps" section is missing in job "example-tests-non-pr"
(syntax-check)
85-85: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
90-90: "container" section is alias node but mapping node is expected
(syntax-check)
91-91: "steps" section must be sequence node but got alias node with "" tag
(syntax-check)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (3)
.github/workflows/example_tests.yml (3)
12-16
: Concurrency group LGTM.The group key cleanly de-duplicates runs per PR branch vs. SHA; cancel-in-progress is appropriate.
49-59
: Wait‑gate logic LGTM.Gating on DCO and linux unit tests before GPU runs is sound; reusable workflow usage looks correct.
92-100
: Required‑check guard LGTM with skip handling.Condition correctly fails PR when relevant changes occur and GPU tests didn’t succeed; handles skipped paths.
Signed-off-by: Keval Morabia <[email protected]>
b423dbb
to
24db1b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/pruning/cifar_resnet.ipynb (1)
339-343
: Bug: misuse of load_state_dict arguments.
load_state_dict
doesn’t take a device argument; this passesdevice
asstrict
. Also missesmap_location
ontorch.load
.Apply:
- model.load_state_dict(torch.load(ckpt_path), device) + state = torch.load(ckpt_path, map_location=device) + model.load_state_dict(state)
♻️ Duplicate comments (3)
docker/Dockerfile (2)
3-6
: Invalid/overbroad TORCH_CUDA_ARCH_LIST; drop 12.0 and only +PTX on highest SM.“12.0+PTX” is not recognized by many current PyTorch/CUDA combos and bloats build. Keep supported SMs; add +PTX only to the highest.
Apply:
- TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0 12.0+PTX" + TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0+PTX" +# If your toolchain truly supports SM 10.0, use: +# TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0+PTX"
27-27
: Avoid world-writable perms; use non-root user with ownership.
chmod -R 777 /workspace
is a security smell and breaks shared volumes’ perms.Apply:
-# Allow users to run without root -RUN chmod -R 777 /workspace +# Allow users to run without root +RUN useradd -m -u 1000 appuser && \ + chown -R appuser:appuser /workspace +USER appuser.github/workflows/example_tests.yml (1)
82-91
: Fix YAML anchor usage for container/steps — actionlint errors; inline blocks.Anchors here trigger “container alias mapping expected” and “steps must be sequence” in actionlint; inline the container and steps for the non-PR job.
Apply:
example-tests-non-pr: if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }} # Runner list at https://github.com/nv-gha-runners/enterprise-runner-configuration/blob/main/docs/runner-groups.md runs-on: linux-amd64-gpu-h100-latest-1 timeout-minutes: 90 strategy: matrix: EXAMPLE: [llm_ptq] - container: *example_container - steps: *example_steps + container: + image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2.post2 + env: + LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib" + PIP_CONSTRAINT: "" + steps: + - uses: actions/checkout@v4 + - uses: nv-gha-runners/setup-proxy-cache@main + - name: Run example tests + run: | + pip install ".[all,dev-test]" + find examples/${{ matrix.EXAMPLE }} -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done + pytest -s tests/examples/${{ matrix.EXAMPLE }}
🧹 Nitpick comments (5)
examples/pruning/cifar_resnet.ipynb (4)
25-33
: Avoid unpinned, unconditional installs; gate with import checks and use %pip.Unconditionally installing nvidia-modelopt can override the editable/source install in your Docker/CI and create version skew. Prefer conditional installs and Jupyter-native %pip; only install torchvision explicitly here.
Apply:
-! pip install nvidia-modelopt torchvision +%pip install -q torchvision +import importlib +if importlib.util.find_spec("modelopt") is None: + %pip install -q nvidia-modelopt
60-61
: Graceful CPU fallback for device selection.
torch.device("cuda")
will hard-fail on CPU-only envs (e.g., CI preview, Colab T4 quota exhaustion).Apply:
-device = torch.device("cuda") +device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
287-288
: Loss averaging mismatch.
epoch_loss
sums per-batch mean losses; dividing by number of samples underestimates the value. Divide by number of batches or accumulate sum of per-sample losses.Apply:
- epoch_loss /= len(train_loader.dataset) + epoch_loss /= len(train_loader)
662-666
: Typo in checkpoint filename (“seaarch”).This will create inconsistent artifacts and breaks resume-by-name tooling.
Apply:
- "checkpoint": "modelopt_seaarch_checkpoint_fastnas.pth", + "checkpoint": "modelopt_search_checkpoint_fastnas.pth",.github/workflows/example_tests.yml (1)
71-74
: Env var expansion in container.env won’t interpolate ${LD_LIBRARY_PATH}.GitHub Actions doesn’t expand env references inside the container env block; the literal string is passed. Either set a fixed value or export in a run step.
Apply:
- env: - LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" - # PATH: "/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}" - PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages + env: + LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib" + PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packagesOptionally append in the step:
- name: Append LD_LIBRARY_PATH run: echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib" >> $GITHUB_ENV
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
.github/workflows/code_quality.yml
(1 hunks).github/workflows/example_tests.yml
(1 hunks).github/workflows/gpu_tests.yml
(0 hunks).github/workflows/pages.yml
(1 hunks).github/workflows/unit_tests.yml
(0 hunks).gitlab/tests.yml
(1 hunks)README.md
(1 hunks)docker/Dockerfile
(2 hunks)docs/source/getting_started/_installation_for_Linux.rst
(2 hunks)examples/cnn_qat/requirements.txt
(1 hunks)examples/llm_eval/requirements.txt
(1 hunks)examples/onnx_ptq/docker/Dockerfile
(1 hunks)examples/onnx_ptq/requirements.txt
(1 hunks)examples/pruning/cifar_resnet.ipynb
(1 hunks)setup.py
(1 hunks)tests/examples/speculative_decoding/test_medusa.py
(0 hunks)
💤 Files with no reviewable changes (3)
- .github/workflows/gpu_tests.yml
- .github/workflows/unit_tests.yml
- tests/examples/speculative_decoding/test_medusa.py
✅ Files skipped from review due to trivial changes (1)
- examples/cnn_qat/requirements.txt
🚧 Files skipped from review as they are similar to previous changes (9)
- examples/onnx_ptq/requirements.txt
- examples/llm_eval/requirements.txt
- setup.py
- .github/workflows/pages.yml
- README.md
- docs/source/getting_started/_installation_for_Linux.rst
- .github/workflows/code_quality.yml
- examples/onnx_ptq/docker/Dockerfile
- .gitlab/tests.yml
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/example_tests.yml
63-63: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
82-82: "steps" section is missing in job "example-tests-non-pr"
(syntax-check)
85-85: label "linux-amd64-gpu-h100-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
90-90: "container" section is alias node but mapping node is expected
(syntax-check)
91-91: "steps" section must be sequence node but got alias node with "" tag
(syntax-check)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
🔇 Additional comments (2)
docker/Dockerfile (1)
20-25
: Nice: local editable install + precompile improves dev ergonomics.Please confirm
TensorRT-Model-Optimizer[all]
excludes example/test-only deps (e.g., torchvision) per PR goal; CI workflow installs dev-test separately..github/workflows/example_tests.yml (1)
79-81
: Install order looks good; keep example deps runtime-only.Repo + dev-test first, then example-specific requirements, then pytest — matches PR goals.
What does this PR do?
Testing
Summary by CodeRabbit
Documentation
Chores
Tests