
Cuda/NVML Components: Dynamically Search for the Shared Objects #347

Merged
Treece-Burgess merged 1 commit into icl-utk-edu:master from Treece-Burgess:04-22-2025-shared-libraries-search
Apr 30, 2025


@Treece-Burgess
Contributor

@Treece-Burgess Treece-Burgess commented Apr 24, 2025

Pull Request Description

This PR follows up on the closed PR #328, which requested that the components load numbered versions of the shared objects at runtime instead of the unnumbered ones. Rather than hard-coding the numbered versions, the components now search for the shared objects dynamically.

For libcuda, libcudart, libnvperf_host, libcupti, and libnvidia-ml, three naming schemes are searched for:

  • Unnumbered e.g. libcudart.so
  • Numbered with .1 e.g. libcudart.so.1
  • A catch-all on the library name, e.g. libcudart (for libcudart this would match either libcudart.so.12 or libcudart.so.12.5.82)
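
The three naming schemes above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the helper names `matches_catch_all` and `find_shared_object` are hypothetical, the search order (unnumbered, then `.1`, then catch-all) is assumed from the order of the list, and the sketch only locates a path that a caller would presumably hand to `dlopen`.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 when file_name is "<stem>.so" or
 * "<stem>.so.<anything>", e.g. libcudart.so.12 or libcudart.so.12.5.82. */
static int matches_catch_all(const char *file_name, const char *stem)
{
    size_t stem_len = strlen(stem);
    if (strncmp(file_name, stem, stem_len) != 0)
        return 0;
    const char *rest = file_name + stem_len;
    /* Requiring ".so" right after the stem keeps "libcuda" from
     * matching "libcudart.so". */
    if (strncmp(rest, ".so", 3) != 0)
        return 0;
    return rest[3] == '\0' || rest[3] == '.';
}

/* Hypothetical helper: writes the first matching path into out and
 * returns 0, or returns -1 when nothing in dir matches.
 * Assumed order: "<stem>.so", then "<stem>.so.1", then the catch-all. */
static int find_shared_object(const char *dir, const char *stem,
                              char *out, size_t out_len)
{
    static const char *suffixes[] = { ".so", ".so.1" };
    for (size_t i = 0; i < sizeof(suffixes) / sizeof(suffixes[0]); i++) {
        int n = snprintf(out, out_len, "%s/%s%s", dir, stem, suffixes[i]);
        if (n > 0 && (size_t)n < out_len && access(out, F_OK) == 0)
            return 0;
    }

    /* Catch-all: scan the directory for any other versioned name. */
    int status = -1;
    DIR *d = opendir(dir);
    if (d != NULL) {
        struct dirent *entry;
        while ((entry = readdir(d)) != NULL) {
            if (!matches_catch_all(entry->d_name, stem))
                continue;
            int n = snprintf(out, out_len, "%s/%s", dir, entry->d_name);
            if (n > 0 && (size_t)n < out_len) {
                status = 0;
                break;
            }
        }
        closedir(d);
    }

    if (status != 0 && out_len > 0)
        out[0] = '\0'; /* leave an empty string on failure */
    return status;
}
```

Note that `readdir` returns entries in no particular order, so when several versioned files exist this sketch picks whichever one the directory scan yields first.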

Testing was done on Methane at ICL (1 * A100) and Athena at Oregon (4 * A100s) using the PAPI utilities to verify:

  • Setting PAPI_CUDA_ROOT to Cuda Toolkit install directory: ✅
  • Setting Cuda and NVML environment variables (PAPI_CUDA_RUNTIME, PAPI_CUDA_CUPTI, PAPI_CUDA_PERFWORKS, and PAPI_NVML_MAIN): ✅
  • Searching for other .so variations:
    • Removed libcudart.so and libcupti.so: Found numbered version
    • Moved libnvidia-ml.so: Found numbered version
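
As a rough illustration of how the environment variables in the test list might select the directory to search, here is a minimal sketch. It rests on assumptions that this PR does not state: that a component-specific variable such as PAPI_CUDA_RUNTIME takes precedence over PAPI_CUDA_ROOT, and that the toolkit libraries live under a `lib64` subdirectory of the root. The helper name `pick_search_dir` is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: chooses the directory to search for one library.
 * Assumption: the component-specific variable (e.g. PAPI_CUDA_RUNTIME)
 * wins over the toolkit root (e.g. PAPI_CUDA_ROOT).
 * Assumption: libraries sit in "<root>/lib64".
 * Returns buf on success, or NULL when neither variable is set. */
static const char *pick_search_dir(const char *override_var,
                                   const char *root_var,
                                   char *buf, size_t buf_len)
{
    const char *override_dir = getenv(override_var);
    if (override_dir != NULL && override_dir[0] != '\0') {
        int n = snprintf(buf, buf_len, "%s", override_dir);
        return (n > 0 && (size_t)n < buf_len) ? buf : NULL;
    }

    const char *root_dir = getenv(root_var);
    if (root_dir != NULL && root_dir[0] != '\0') {
        int n = snprintf(buf, buf_len, "%s/lib64", root_dir);
        return (n > 0 && (size_t)n < buf_len) ? buf : NULL;
    }

    return NULL;
}
```

A caller would invoke it once per library, for example `pick_search_dir("PAPI_CUDA_CUPTI", "PAPI_CUDA_ROOT", buf, sizeof buf)`, and then search the returned directory for the shared object.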

Note: The function linked_cuda_rt was removed because it did not work properly and would return PAPI_EMISC. In testing, removing it did not appear to alter functionality.

Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc.
  • Commits
    Commits are self-contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

@Treece-Burgess Treece-Burgess force-pushed the 04-22-2025-shared-libraries-search branch from e5497e9 to d0d1761 Compare April 24, 2025 15:44
@Treece-Burgess
Contributor Author

Hello @scaronni, I have updated the Cuda and NVML component code to search for variations of the different shared objects. If you have time to review or test the code and notice any issues, please let me know!

@tokey-tahmid

I am reviewing this PR.


@tokey-tahmid tokey-tahmid left a comment


Approving the PR after performing the following tests:

  • Methane (1 * A100) + cuda/12.5.1
    • PAPI Utilities: ✅
    • Cuda tests: ✅
  • Hexane (1 * H100 && 1 * V100) + cuda/12.5.1
    • PAPI Utilities: ✅
    • Cuda tests: ✅
  • Guyot (8 * A100) + cuda/12.5.1
    • PAPI Utilities: ✅
    • Cuda tests: ✅

@Treece-Burgess Treece-Burgess force-pushed the 04-22-2025-shared-libraries-search branch 2 times, most recently from 57b56b5 to 2839665 Compare April 30, 2025 00:55
…udart.so, libcudart.so.1 or libcudart (catch all).
@Treece-Burgess Treece-Burgess force-pushed the 04-22-2025-shared-libraries-search branch from 2839665 to a959c67 Compare April 30, 2025 00:55
@Treece-Burgess Treece-Burgess merged commit 757b32e into icl-utk-edu:master Apr 30, 2025
8 checks passed
2 participants