Description
Bug Report: CuOPT 25.12+ Fails to Load on Databricks Due to nvJitLink Version Requirement
Summary
CuOPT 25.12+ introduces a backward-incompatible change that prevents it from running on Databricks ML Runtime 16.4 (and likely other managed environments): it appears to have a hard dependency on nvidia-nvjitlink-cu12 >= 12.9.79. This blocks Databricks users on the affected runtimes from using GPU-accelerated routing optimization with CuOPT.
Environment
Platform:
- Databricks ML Runtime 16.4 (GPU)
- Databricks Serverless GPU Compute
Hardware:
- GPU: NVIDIA A10G (23028 MiB, Compute Capability 8.6)
- Architecture: x86_64
Software Versions:
- CuOPT: 25.12.0 (latest)
- CUDA Runtime: 12.6
- CUDA Driver: 535.161.07
- Python: 3.12.3
- nvidia-nvjitlink-cu12: 12.4.127 (provided by Databricks, unfixable by users)
- Required by CuOPT: >= 12.9.79
Databricks Runtime Components:
- nvidia-cuda-runtime-cu12: 12.9.79
- nvidia-cublas-cu12: 12.9.1.4
- nvidia-cusolver-cu12: 11.7.5.82
- nvidia-cudnn-cu12: 9.1.0.70
Issue Description
When attempting to install and use CuOPT 25.12+ on Databricks ML Runtime 16.4, the library fails to load with the following error:
RuntimeWarning: Failed to load libcuopt library: libcuopt.so.
Error: /local_disk0/.ephemeral_nfs/envs/pythonEnv-.../lib/python3.12/site-packages/libcuopt/lib64/../../nvidia/cusolver/lib/libcusolver.so.11:
undefined symbol: cublasSetEnvironmentMode, version libcublas.so.12.
Falling back to relying on system loader. cuOpt functionality may be unavailable.
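The undefined-symbol error above can be checked directly. The following is a minimal sketch, assuming a Linux environment, that uses `ctypes` to ask whether a shared library exports a named symbol. The helper name `exports_symbol` is hypothetical, and the path in the usage comment is a placeholder to fill in; neither comes from CuOPT.

```python
import ctypes
from typing import Optional


def exports_symbol(library: Optional[str], symbol: str) -> Optional[bool]:
    """Report whether `library` exports `symbol`.

    Returns True or False when the library loads, and None when the
    library itself cannot be loaded. Passing None as `library` inspects
    the symbols already visible in the current process (POSIX only).
    """
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        return None
    # ctypes raises AttributeError when a symbol lookup fails,
    # which hasattr() converts into False.
    return hasattr(handle, symbol)


# Usage (placeholder path; point it at the libcublas.so.12 that the
# Python environment actually resolves):
# print(exports_symbol("/path/to/libcublas.so.12", "cublasSetEnvironmentMode"))
```

A result of False for the cuBLAS library that is actually loaded would match the error text; None means the library could not be opened at all.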
Root Cause
Breaking Change in CuOPT 25.12+:
- CuOPT 25.12+ now requires nvidia-nvjitlink-cu12 >= 12.9.79 (based on testing and error analysis)
- Previous CuOPT versions worked with older nvJitLink versions
Databricks Environment Limitation:
- Databricks ML Runtime 16.4 provides nvidia-nvjitlink-cu12 12.4.127
- This is a managed runtime package that users cannot upgrade
- Attempting to upgrade via pip fails due to runtime conflicts
Impact:
- Complete blockage of CuOPT functionality on Databricks
- Affects all Databricks ML Runtime 16.x users
- Affects Databricks Serverless GPU Compute users
Reproduction Steps
On Databricks ML Runtime 16.4:
1. Create a Databricks cluster with ML Runtime 16.4 (GPU)
2. Install CuOPT:
%pip install --extra-index-url=https://pypi.nvidia.com cuopt-server-cu12 cuopt-sh-client
dbutils.library.restartPython()
3. Attempt to import and use CuOPT:
from cuopt import routing
4. Result: RuntimeWarning about failed library load, CuOPT is non-functional
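Because the failure in step 4 surfaces only as a warning, a notebook or job can continue past it unnoticed. The sketch below shows one way to turn any `RuntimeWarning` raised during import into a hard error. The helper name `import_strict` is hypothetical and is not part of CuOPT; the sketch assumes the warning is emitted while the module is being imported, as the report describes.

```python
import importlib
import warnings
from types import ModuleType


def import_strict(module_name: str) -> ModuleType:
    """Import a module and raise ImportError if it emitted a RuntimeWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones that would normally be
        # suppressed as repeats.
        warnings.simplefilter("always")
        module = importlib.import_module(module_name)
    problems = [w for w in caught if issubclass(w.category, RuntimeWarning)]
    if problems:
        details = "; ".join(str(w.message) for w in problems)
        raise ImportError(
            f"{module_name} imported with RuntimeWarning(s): {details}"
        )
    return module


# Usage on the affected cluster (expected to raise rather than continue):
# routing = import_strict("cuopt.routing")
```

This only works in a fresh Python process: a module that is already imported is not re-executed, so no warning is emitted the second time.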
Verify nvJitLink Version:
import subprocess
result = subprocess.run(["pip", "show", "nvidia-nvjitlink-cu12"], capture_output=True, text=True)
print(result.stdout)
Output: the Version field reports 12.4.127 (the full output is listed under Additional Context).
Attempt to Upgrade (Fails):
%pip install --upgrade "nvidia-nvjitlink-cu12>=12.9.79"
Result: Version conflicts with Databricks runtime dependencies; the installation fails or reverts.
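To see every CUDA wheel at once instead of querying packages one by one, the installed distributions can be enumerated with the standard library. This is a sketch only: the function names are hypothetical, and the `nvidia-*-cu12` naming filter is an assumption based on the package names listed in this report.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple


def select_cuda_wheels(pairs: Iterable[Tuple[Optional[str], str]]) -> Dict[str, str]:
    """Keep only distributions named like nvidia-*-cu12, sorted by name."""
    picked = {}
    for name, version in pairs:
        if not name:
            continue  # skip distributions with broken metadata
        normalized = name.lower().replace("_", "-")
        if normalized.startswith("nvidia-") and normalized.endswith("-cu12"):
            picked[normalized] = version
    return dict(sorted(picked.items()))


def installed_cuda_wheels() -> Dict[str, str]:
    """Scan the active Python environment via importlib.metadata."""
    pairs = (
        (dist.metadata["Name"], dist.version)
        for dist in metadata.distributions()
    )
    return select_cuda_wheels(pairs)


if __name__ == "__main__":
    for name, version in installed_cuda_wheels().items():
        print(f"{name}=={version}")
```

Running this before and after the upgrade attempt shows whether pip changed any of the runtime-managed packages.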
Expected Behavior
Option 1: Backward Compatibility
- CuOPT 25.12+ should maintain backward compatibility with nvJitLink 12.4.x
- Or provide a compatibility layer / fallback mechanism
Option 2: Clear Version Requirements
- Document minimum nvJitLink version requirement prominently in installation docs
- Add runtime version check with clear error message
- Provide alternative installation instructions for older environments
Option 3: Version-Specific Packages
- Offer CuOPT builds for different nvJitLink versions
- e.g., cuopt-server-cu12-nvjitlink124 and cuopt-server-cu12-nvjitlink129
Actual Behavior
- CuOPT 25.12+ fails to load and reports this only as a RuntimeWarning at import time
- Error message is cryptic (undefined symbol: cublasSetEnvironmentMode)
- No clear indication that nvJitLink version is the issue
- No workaround available for users on managed environments
Impact Assessment
Severity: Critical - Complete functionality loss
Affected Users:
- All Databricks ML Runtime 16.x users
- Databricks Serverless GPU Compute users
- Other managed GPU environments with nvJitLink 12.4.x
Business Impact:
- Blocks adoption of CuOPT for Databricks routing optimization workloads
- Forces users to choose between:
  - Using Databricks (but no CuOPT)
  - Using CuOPT (but not on Databricks)
- Impacts Databricks Routing Accelerator integration
User Time Impact:
- 2-4 hours wasted per user attempting installation and debugging
- No clear error message makes root cause difficult to identify
Proposed Solutions
Short-Term (Immediate):
- Document the requirement prominently:
  - Add nvJitLink version requirement to installation docs
  - Update release notes for 25.12.0
  - Add compatibility matrix to README
- Improve error message:
  - Add runtime check for nvJitLink version during library load
  - Provide clear error message
- [...]on path in release notes
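The runtime check proposed in the short-term list could look roughly like the sketch below. It is illustrative only and is not CuOPT code: the function names are hypothetical, and the 12.9.79 minimum is the value inferred in this report, not a number confirmed by NVIDIA. Versions are compared as integer tuples so that, for example, 12.10.x sorts above 12.9.x.

```python
from importlib import metadata
from typing import Optional, Tuple

# Minimum version reported in this issue (assumption, not confirmed upstream).
REQUIRED_NVJITLINK = "12.9.79"


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse a purely numeric dotted version such as '12.4.127'."""
    return tuple(int(part) for part in version.split("."))


def nvjitlink_problem(
    installed: Optional[str], required: str = REQUIRED_NVJITLINK
) -> Optional[str]:
    """Return a readable message if `installed` is missing or too old, else None."""
    if installed is None:
        return f"nvidia-nvjitlink-cu12 is not installed; >= {required} is required."
    if version_tuple(installed) < version_tuple(required):
        return (
            f"nvidia-nvjitlink-cu12 {installed} is installed, "
            f"but >= {required} is required. "
            "On managed runtimes this package may not be upgradable."
        )
    return None


def installed_nvjitlink() -> Optional[str]:
    """Look up the installed wheel version, or None if it is absent."""
    try:
        return metadata.version("nvidia-nvjitlink-cu12")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    message = nvjitlink_problem(installed_nvjitlink())
    print(message or "nvJitLink version looks sufficient.")
```

Returning a message instead of raising keeps the check usable both as a load-time guard and as a standalone diagnostic.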
Medium-Term:
- Restore backward compatibility:
  - Investigate if nvJitLink 12.4.x support can be maintained
  - Use runtime detection and conditional code paths if needed
- Version-specific builds:
  - Provide separate builds for different nvJitLink versions
  - Allow users to install appropriate version for their environment
Long-Term:
- Coordinate with platform providers:
  - Work with Databricks to upgrade nvJitLink in ML Runtime 17.0+
  - Establish minimum version requirements for supported platforms
Workarounds
Currently, users have three unsatisfactory options:
Option A: Use OR-Tools (CPU-based)
%pip install ortools
from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp
Pros: Works on all Databricks runtimes
Cons: CPU-only, significantly slower for large problems
Option B: Wait for Databricks ML Runtime 17.0+
Expected to include nvJitLink >= 12.9.79, but release date unknown
Option C: Use older CuOPT version (if available)
CuOPT 25.10.x or earlier may work with nvJitLink 12.4.127 (needs verification)
Detection & Validation
We've developed an automatic detection tool that identifies this issue:
CUDA Healthcheck Tool for Databricks: https://github.com/TavnerJC/cuda-healthcheck-on-databricks
Usage:
%pip install git+https://github.com/TavnerJC/cuda-healthcheck-on-databricks.git
dbutils.library.restartPython()
from cuda_healthcheck import CUDADetector
detector = CUDADetector()
env = detector.detect_environment()
Automatically detects CuOPT incompatibility and provides guidance.
Validation Report: https://github.com/TavnerJC/cuda-healthcheck-on-databricks/blob/main/NOTEBOOK1_VALIDATION_SUCCESS.md
Additional Context
Testing:
- Tested on Databricks Classic ML Runtime 16.4 with NVIDIA A10G
- Confirmed issue on Databricks Serverless GPU Compute
- Issue documented in our use case study: https://github.com/TavnerJC/cuda-healthcheck-on-databricks/blob/main/docs/USE_CASE_ROUTING_OPTIMIZATION.md
Related Issues:
- This issue blocks integration with Databricks Routing Accelerator
- Affects GPU-accelerated VRP (Vehicle Routing Problem) workloads on Databricks
CuOPT Version Info:
import cuopt
print(cuopt.version) # 25.12.0
nvJitLink Info:
pip show nvidia-nvjitlink-cu12
Name: nvidia-nvjitlink-cu12
Version: 12.4.127
Location: /databricks/python3/lib/python3.12/site-packages
Questions for NVIDIA CuOPT Team
- Was the nvJitLink >= 12.9.79 requirement intentional in 25.12.0?
- Can backward compatibility with nvJitLink 12.4.x be restored?
- Is there a CuOPT version that supports nvJitLink 12.4.x?
- Are there plans to provide version-specific builds?
- What is the recommended path forward for Databricks users?
References
- CuOPT GitHub: https://github.com/NVIDIA/cuopt
- CuOPT Documentation: https://docs.nvidia.com/cuopt/
- Databricks ML Runtime 16.4: https://docs.databricks.com/release-notes/runtime/16.4ml.html
- Detection Tool: https://github.com/TavnerJC/cuda-healthcheck-on-databricks
- Use Case Documentation: https://github.com/TavnerJC/cuda-healthcheck-on-databricks/blob/main/docs/USE_CASE_ROUTING_OPTIMIZATION.md
System Information
Databricks ML Runtime 16.4
Python: 3.12.3
CUDA Runtime: 12.6
CUDA Driver: 535.161.07
GPU: NVIDIA A10G
Memory: 23028 MiB
Compute Capability: 8.6
CUDA Components
nvidia-cuda-runtime-cu12: 12.9.79
nvidia-cublas-cu12: 12.9.1.4
nvidia-cusolver-cu12: 11.7.5.82
nvidia-cudnn-cu12: 9.1.0.70
nvidia-nvjitlink-cu12: 12.4.127 <-- THE PROBLEM
CuOPT
cuopt-server-cu12: 25.12.0
cuopt-sh-client: 25.12.0
Request
This issue blocks GPU-accelerated routing optimization for all Databricks users. We kindly request:
- Acknowledgment of this backward compatibility issue
- Guidance on the recommended path forward
- Timeline for a fix or workaround
- Documentation updates to prevent future users from encountering this
Thank you for your consideration and for developing CuOPT!
Submitted by: TavnerJC
Detection Tool: https://github.com/TavnerJC/cuda-healthcheck-on-databricks