
Conversation

i-riyad
Contributor

@i-riyad i-riyad commented Sep 9, 2025

What does this PR do?

Type of change: Bug fix

Overview: The cuDNN library path is not set in the Docker image, so the CUDA and TensorRT execution providers (EPs) were not working.

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

Additional Information

Summary by CodeRabbit

  • Bug Fixes

    • Improved GPU library path handling in the ONNX PTQ Docker image to reduce runtime linking errors and enhance container stability.
  • Chores

    • Simplified Docker image setup by using fixed environment paths and streamlining package installation for more predictable dependency resolution and consistent tool availability.

@i-riyad i-riyad requested a review from a team as a code owner September 9, 2025 06:33
@i-riyad i-riyad requested a review from ajrasane September 9, 2025 06:33
@i-riyad i-riyad force-pushed the rislam/cudnn-path-fix branch from e117c8b to a1842ad Compare September 9, 2025 06:33

coderabbitai bot commented Sep 9, 2025

Walkthrough

Replace the dynamic, Python-based discovery of TensorRT/cuDNN paths in the Dockerfile with static environment variables: TRT_PATH is set to /usr/local/lib/python3.12/dist-packages/tensorrt and CUDNN_LIB_DIR to /usr/local/lib/python3.12/dist-packages/nvidia/cudnn/lib. LD_LIBRARY_PATH and PATH are set via ENV; the RUN step installs tensorrt==10.13.2.6 and drops the Python introspection.

Changes

Cohort / File(s) Summary of changes
Docker env path updates & RUN simplification
examples/onnx_ptq/docker/Dockerfile
Replace runtime Python-based discovery and shell exports with static ENV declarations: set TRT_PATH=/usr/local/lib/python3.12/dist-packages/tensorrt, add CUDNN_LIB_DIR=/usr/local/lib/python3.12/dist-packages/nvidia/cudnn/lib, rebuild LD_LIBRARY_PATH to prepend CUDNN_LIB_DIR and TRT_PATH/lib, update PATH to include TRT_PATH/bin; simplify RUN to install tensorrt==10.13.2.6 and remove Python introspection and export steps.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I nibble bytes where libraries hide,
Static paths tuck CuDNN by my side.
TRT found, no longer lost in night,
Docker carrots lined up just right.
Hop on, code—my build runs light. 🐇


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0466bf0 and c4474ec.

📒 Files selected for processing (1)
  • examples/onnx_ptq/docker/Dockerfile (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (2)
examples/onnx_ptq/docker/Dockerfile (2)

19-19: Ensure trtexec is in PATH or symlink it
NGC images usually install trtexec at /usr/src/tensorrt/bin, so ENV PATH="${TRT_PATH}/bin:${PATH}" may not include it—manually verify in your built container (e.g. docker run --rm trt-ptq:dev which trtexec) and either symlink the binary into /usr/local/bin (as in the prior diff) or conditionally export the actual directory.


15-15: Remove manual TensorRT pip install to avoid version skew
The base image nvcr.io/nvidia/tensorrt:25.08-py3 already bundles matching TensorRT libraries and Python wheels. Manually running pip install tensorrt==10.13.2.6 risks desynchronizing system libs and the wheel, leading to runtime loader errors. Remove this line or align the wheel version exactly with what’s shipped in the image. Unable to validate versions in this environment—please verify manually.

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title clearly and concisely states the main change by specifying that the PR fixes the CuDNN library path in the TensorRT Docker setup, which accurately reflects the primary update to the Dockerfile.
Description Check ✅ Passed The PR description succinctly identifies the change as a bug fix, explains the issue with the missing CuDNN library path, and directly relates to the modifications in the Dockerfile.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
examples/onnx_ptq/docker/Dockerfile (1)

17-17: Use sysconfig over site.getsitepackages for robustness.

site.getsitepackages() can be absent/variant across venvs and distros. sysconfig.get_paths()['purelib'] is more reliable.

-export CUDNN_LIB_DIR=$(python -c "import site; print(site.getsitepackages()[0])")/nvidia/cudnn/lib
+CUDNN_LIB_DIR=$(python -c "import sysconfig, os; print(os.path.join(sysconfig.get_paths()['purelib'], 'nvidia', 'cudnn', 'lib'))")
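As a runnable illustration of the suggested approach (the resulting path is whatever the local interpreter reports, and the `nvidia/cudnn/lib` subdirectory only exists once the cuDNN wheel is installed):

```shell
# Resolve the cuDNN wheel's lib dir from the interpreter's purelib path.
# sysconfig.get_paths()['purelib'] is defined in every CPython environment,
# including venvs where site.getsitepackages() may be absent.
CUDNN_LIB_DIR=$(python3 -c "import sysconfig, os; print(os.path.join(sysconfig.get_paths()['purelib'], 'nvidia', 'cudnn', 'lib'))")
echo "$CUDNN_LIB_DIR"
```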
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6fa34c and a1842ad.

📒 Files selected for processing (1)
  • examples/onnx_ptq/docker/Dockerfile (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
examples/onnx_ptq/docker/Dockerfile (1)

18-18: Double-check library preference order to avoid version skew.

Prepending cuDNN ahead of TensorRT libs changes resolution precedence. Ensure this matches the cuDNN/TRT versions in 25.08 to avoid ABI mismatches at runtime.

Would you confirm that the resolved libcudnn*.so at runtime matches the expected version from the base image by running ldd on trtexec inside the built image and sharing the libcudnn paths?

Comment on lines 17 to 19
export CUDNN_LIB_DIR=$(python -c "import site; print(site.getsitepackages()[0])")/nvidia/cudnn/lib && \
export LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${TRT_PATH}/lib:/usr/include:${LD_LIBRARY_PATH}" && \
export PATH="${TRT_PATH}/bin:${PATH}"


🛠️ Refactor suggestion

⚠️ Potential issue

Exports in RUN don’t persist; fix by registering lib dirs with ldconfig (and drop /usr/include).

The three exports only affect this build layer and won’t be present at runtime or later layers. Also, /usr/include does not belong in LD_LIBRARY_PATH. Persist by adding the library dirs to ld.so.conf.d and running ldconfig; optionally symlink trtexec to be on PATH.

Apply:

-    export CUDNN_LIB_DIR=$(python -c "import site; print(site.getsitepackages()[0])")/nvidia/cudnn/lib && \
-    export LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${TRT_PATH}/lib:/usr/include:${LD_LIBRARY_PATH}" && \
-    export PATH="${TRT_PATH}/bin:${PATH}"
+    CUDNN_LIB_DIR=$(python -c "import sysconfig, os; print(os.path.join(sysconfig.get_paths()['purelib'], 'nvidia', 'cudnn', 'lib'))") && \
+    echo "${TRT_PATH}/lib" > /etc/ld.so.conf.d/tensorrt.conf && \
+    if [ -d "${CUDNN_LIB_DIR}" ]; then echo "${CUDNN_LIB_DIR}" > /etc/ld.so.conf.d/cudnn.conf; fi && \
+    ldconfig && \
+    ln -sfn "${TRT_PATH}/bin/trtexec" /usr/local/bin/trtexec
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
export CUDNN_LIB_DIR=$(python -c "import site; print(site.getsitepackages()[0])")/nvidia/cudnn/lib && \
export LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${TRT_PATH}/lib:/usr/include:${LD_LIBRARY_PATH}" && \
export PATH="${TRT_PATH}/bin:${PATH}"
# Register TensorRT and cuDNN libs with ldconfig instead of ephemeral exports
CUDNN_LIB_DIR=$(python -c "import sysconfig, os; print(os.path.join(sysconfig.get_paths()['purelib'], 'nvidia', 'cudnn', 'lib'))") && \
echo "${TRT_PATH}/lib" > /etc/ld.so.conf.d/tensorrt.conf && \
if [ -d "${CUDNN_LIB_DIR}" ]; then echo "${CUDNN_LIB_DIR}" > /etc/ld.so.conf.d/cudnn.conf; fi && \
ldconfig && \
ln -sfn "${TRT_PATH}/bin/trtexec" /usr/local/bin/trtexec
🤖 Prompt for AI Agents
In examples/onnx_ptq/docker/Dockerfile around lines 17 to 19, the RUN-layer
exports (CUDNN_LIB_DIR, LD_LIBRARY_PATH, PATH) do not persist to later layers or
runtime and /usr/include should not be in LD_LIBRARY_PATH; instead create a file
in /etc/ld.so.conf.d (e.g., /etc/ld.so.conf.d/trt_cudnn.conf) containing the
resolved CUDNN lib directory and ${TRT_PATH}/lib, run ldconfig to register them,
remove /usr/include from the LD path commands, and if a persistent PATH change
is desired create a symlink for trtexec into /usr/local/bin (or add an ENV PATH
entry) so the runtime can find the binary.

export LD_LIBRARY_PATH="$TRT_PATH/lib:/usr/include:${LD_LIBRARY_PATH}" && \
export PATH="$TRT_PATH/bin:${PATH}"
export CUDNN_LIB_DIR=$(python -c "import site; print(site.getsitepackages()[0])")/nvidia/cudnn/lib && \
export LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${TRT_PATH}/lib:/usr/include:${LD_LIBRARY_PATH}" && \


⚠️ Potential issue

Remove /usr/include from LD_LIBRARY_PATH.

The dynamic linker doesn’t search header directories. Keeping it here is incorrect and can mask configuration issues.

🤖 Prompt for AI Agents
In examples/onnx_ptq/docker/Dockerfile around line 18, remove "/usr/include"
from the LD_LIBRARY_PATH assignment because header directories should not be in
the dynamic linker search path; update the assignment to only include the actual
library directories (CUDNN_LIB_DIR and TRT_PATH/lib and any existing
LD_LIBRARY_PATH), ensuring you don't introduce empty path entries or duplicate
colons when expanding LD_LIBRARY_PATH.
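The colon-handling the prompt describes can be sketched as follows (the directory values here are hypothetical stand-ins for the build-time paths):

```shell
# Hypothetical directories standing in for the resolved build-time values
CUDNN_LIB_DIR=/opt/cudnn/lib
TRT_PATH=/opt/tensorrt

# Start from an empty LD_LIBRARY_PATH so the expansion below is observable
unset LD_LIBRARY_PATH

# ${VAR:+:$VAR} appends ":<old value>" only when the old value is non-empty,
# so no empty entry (which the dynamic linker treats as the current
# directory) is introduced and no duplicate colons appear
LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${TRT_PATH}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
echo "$LD_LIBRARY_PATH"
```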

@i-riyad i-riyad force-pushed the rislam/cudnn-path-fix branch from a1842ad to 0466bf0 Compare September 9, 2025 06:47

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
examples/onnx_ptq/docker/Dockerfile (1)

18-18: Remove /usr/include from LD_LIBRARY_PATH.

Header directories are not searched by the dynamic linker and keeping them in LD_LIBRARY_PATH is incorrect.

-ENV LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${TRT_PATH}/lib:/usr/include:${LD_LIBRARY_PATH}"
+ENV LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${TRT_PATH}/lib:${LD_LIBRARY_PATH}"

If you adopt ldconfig per my other comment, you can drop this ENV entirely.

🧹 Nitpick comments (2)
examples/onnx_ptq/docker/Dockerfile (2)

15-15: Avoid reinstalling TensorRT inside a TensorRT base image; verify version alignment.

The base image nvcr.io/nvidia/tensorrt:25.08-py3 typically ships with a matched TRT stack. Layering pip-installed tensorrt==10.13.2.6 can cause ABI/version skew with preinstalled CUDA/cuDNN/TRT libs.

Please confirm what trtexec --version and Python import tensorrt; print(tensorrt.__version__) report in a container built from this Dockerfile. If they differ, drop the pip install and rely on the image’s TRT, or upgrade the base image to the matching tag.

If you keep the pip install, prefer avoiding cache bloat:

-RUN pip install tensorrt==10.13.2.6
+RUN pip install --no-cache-dir tensorrt==10.13.2.6

19-19: Verify trtexec location; the pip wheel may not provide bin/trtexec.

${TRT_PATH}/bin might not exist in the Python wheel. In NVIDIA TRT containers, trtexec usually lives under /usr/src/tensorrt/bin. Prefer a symlink to /usr/local/bin to make it discoverable instead of PATH editing.

Apply (already included in the earlier diff):

-ENV PATH="${TRT_PATH}/bin:${PATH}"
+RUN ln -sf /usr/src/tensorrt/bin/trtexec /usr/local/bin/trtexec

Please confirm which trtexec succeeds in a built image.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1842ad and 0466bf0.

📒 Files selected for processing (1)
  • examples/onnx_ptq/docker/Dockerfile (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality

Comment on lines 16 to 17
ENV TRT_PATH=/usr/local/lib/python3.10/dist-packages/tensorrt
ENV CUDNN_LIB_DIR=/usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib


🛠️ Refactor suggestion

Hardcoding site-packages paths is brittle; resolve at build-time and register with ldconfig.

The paths will break if Python minor changes (e.g., 3.11) or if the wheel layout shifts. Prefer computing locations via Python and adding them to ld.so.conf.d, then run ldconfig, so runtime doesn’t depend on LD_LIBRARY_PATH.

Apply:

-RUN pip install tensorrt==10.13.2.6
-ENV TRT_PATH=/usr/local/lib/python3.10/dist-packages/tensorrt
-ENV CUDNN_LIB_DIR=/usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib
-ENV LD_LIBRARY_PATH="${CUDNN_LIB_DIR}:${TRT_PATH}/lib:/usr/include:${LD_LIBRARY_PATH}"
-ENV PATH="${TRT_PATH}/bin:${PATH}"
+RUN pip install --no-cache-dir tensorrt==10.13.2.6 && \
+    python - <<'PY'
+import os, sysconfig, pathlib
+pure = sysconfig.get_paths()['purelib']
+trt = pathlib.Path(pure) / 'tensorrt'
+cudnn = pathlib.Path(pure) / 'nvidia' / 'cudnn' / 'lib'
+paths = [str(trt / 'lib')]
+if cudnn.is_dir():
+    paths.append(str(cudnn))
+open('/etc/ld.so.conf.d/tensorrt_cudnn.conf','w').write('\n'.join(paths)+'\n')
+PY
+    && ldconfig && \
+    ln -sf /usr/src/tensorrt/bin/trtexec /usr/local/bin/trtexec
+# No LD_LIBRARY_PATH or PATH mutations needed

This is resilient to Python version changes and avoids fragile ENV-based linker config.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In examples/onnx_ptq/docker/Dockerfile around lines 16-17, avoid hardcoding
Python site-packages paths; instead run a small Python command at build time to
resolve the actual TensorRT and cuDNN library directories, write those resolved
library paths into a file under /etc/ld.so.conf.d/, and run ldconfig so the
dynamic linker finds them at runtime; remove the brittle ENV-assigned
site-packages paths and, if necessary, set only logical ENV names (e.g., TRT_PKG
or CUDNN_PKG) while relying on the ldconfig-registered library locations for
linking.

@i-riyad i-riyad force-pushed the rislam/cudnn-path-fix branch from 0466bf0 to c4474ec Compare September 9, 2025 06:58

codecov bot commented Sep 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.87%. Comparing base (d6d2e75) to head (d0be96b).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #306   +/-   ##
=======================================
  Coverage   73.86%   73.87%           
=======================================
  Files         172      172           
  Lines       17415    17415           
=======================================
+ Hits        12864    12865    +1     
+ Misses       4551     4550    -1     

@i-riyad i-riyad enabled auto-merge (squash) September 9, 2025 22:50
@i-riyad i-riyad merged commit 669ae05 into main Sep 9, 2025
22 checks passed
@i-riyad i-riyad deleted the rislam/cudnn-path-fix branch September 9, 2025 23:18
jingyu-ml pushed a commit that referenced this pull request Sep 10, 2025
Signed-off-by: Riyad Islam <[email protected]>
Signed-off-by: Jingyu Xin <[email protected]>
benchislett pushed a commit that referenced this pull request Sep 15, 2025
