[CI] Revert CUDA, PyTorch and ONNX upgrade #18787
Conversation
This reverts commit ac70260.
Summary of Changes: Hello @mshr-h, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request aims to resolve recent OpenCL test failures observed in the CI pipeline, which are believed to stem from an incompatibility introduced by a recent CUDA version upgrade. The changes revert the CUDA base image to a known stable version and downgrade dependent libraries such as PyTorch, ONNX, and their runtimes to restore compatibility and the reliability of the continuous integration tests.
Code Review
This pull request reverts the CUDA and PyTorch versions to fix a CI failure, which is reflected in the changes to docker/Dockerfile.ci_gpu and docker/install/ubuntu_install_onnx.sh. The changes seem correct for the stated purpose. I've added one suggestion to refactor the installation script in docker/install/ubuntu_install_onnx.sh to reduce code duplication and improve maintainability.
| if [ "$PYTHON_VERSION" == "3.9" ]; then | ||
| pip3 install \ | ||
| onnx==1.16.0 \ | ||
| onnxruntime==1.19.2 \ | ||
| onnxoptimizer==0.2.7 | ||
|
|
||
| if [ "$DEVICE" == "cuda" ]; then | ||
| if [ "$DEVICE" == "cuda" ]; then | ||
| pip3 install \ | ||
| torch==2.7.0 \ | ||
| torchvision==0.22.0 \ | ||
| --index-url https://download.pytorch.org/whl/cu118 | ||
| else | ||
| pip3 install \ | ||
| torch==2.7.0 \ | ||
| torchvision==0.22.0 \ | ||
| --extra-index-url https://download.pytorch.org/whl/cpu | ||
| fi | ||
| elif [ "$PYTHON_VERSION" == "3.11" ]; then | ||
| pip3 install \ | ||
| torch==2.10.0 \ | ||
| torchvision==0.25.0 | ||
| onnx==1.17.0 \ | ||
| onnxruntime==1.20.1 \ | ||
| onnxoptimizer==0.2.7 | ||
|
|
||
| if [ "$DEVICE" == "cuda" ]; then | ||
| pip3 install \ | ||
| torch==2.7.0 \ | ||
| torchvision==0.22.0 \ | ||
| --index-url https://download.pytorch.org/whl/cu118 | ||
| else | ||
| pip3 install \ | ||
| torch==2.7.0 \ | ||
| torchvision==0.22.0 \ | ||
| --extra-index-url https://download.pytorch.org/whl/cpu | ||
| fi | ||
| else | ||
| pip3 install \ | ||
| torch==2.10.0 \ | ||
| torchvision==0.25.0 \ | ||
| --extra-index-url https://download.pytorch.org/whl/cpu | ||
| onnx==1.12.0 \ | ||
| onnxruntime==1.12.1 \ | ||
| onnxoptimizer==0.2.7 | ||
|
|
||
| if [ "$DEVICE" == "cuda" ]; then | ||
| pip3 install \ | ||
| torch==2.4.1 \ | ||
| torchvision==0.19.1 | ||
| else | ||
| pip3 install \ | ||
| torch==2.4.1 \ | ||
| torchvision==0.19.1 \ | ||
| --extra-index-url https://download.pytorch.org/whl/cpu | ||
| fi | ||
| fi |
This if/elif/else block contains significant code duplication for installing torch and onnx packages across different Python versions. You can refactor this by defining package version strings in variables based on the Python version, and then using these variables in shared pip3 install commands. This will make the script more readable and easier to maintain in the future.
if [ "$PYTHON_VERSION" == "3.9" ]; then
ONNX_PKGS="onnx==1.16.0 onnxruntime==1.19.2 onnxoptimizer==0.2.7"
elif [ "$PYTHON_VERSION" == "3.11" ]; then
ONNX_PKGS="onnx==1.17.0 onnxruntime==1.20.1 onnxoptimizer==0.2.7"
else
ONNX_PKGS="onnx==1.12.0 onnxruntime==1.12.1 onnxoptimizer==0.2.7"
fi
pip3 install ${ONNX_PKGS}
if [ "$PYTHON_VERSION" == "3.9" ] || [ "$PYTHON_VERSION" == "3.11" ]; then
TORCH_PKGS="torch==2.7.0 torchvision==0.22.0"
TORCH_CUDA_ARGS="--index-url https://download.pytorch.org/whl/cu118"
else
TORCH_PKGS="torch==2.4.1 torchvision==0.19.1"
TORCH_CUDA_ARGS=""
fi
if [ "$DEVICE" == "cuda" ]; then
pip3 install ${TORCH_PKGS} ${TORCH_CUDA_ARGS}
else
pip3 install ${TORCH_PKGS} --extra-index-url https://download.pytorch.org/whl/cpu
fi
Do we know which test was failing? I feel it is important for the CI to be up to date in terms of CUDA/torch versions. So for the case of OpenCL, perhaps we can temporarily skip some of the tests?
I guess all of the OpenCL tests were failing. @tqchen
Agree.
I'm trying to skip all OpenCL tests and see if it passes. https://ci.tlcpack.ai/blue/organizations/jenkins/tvm-gpu/detail/PR-18775/36/pipeline
Yes, I think it is OK to skip the OpenCL tests for now.
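For reference, here is a minimal sketch of how the OpenCL tests could be deselected with pytest's keyword filter. It only illustrates the idea discussed above; the test directory and selector are assumptions, not the exact mechanism used in the CI change.

#!/usr/bin/env bash
set -euo pipefail

# Temporarily deselect OpenCL tests while the CUDA incompatibility is
# investigated. "-k 'not opencl'" skips any test whose id or keywords
# contain "opencl". The directory below is an assumed test location,
# not necessarily the one the CI job runs.
python3 -m pytest tests/python -k "not opencl"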
onnx==1.20.1 \
onnxruntime==1.23.2 \
onnxoptimizer==0.4.2
future \
Let us wait and see if skip works
Closing as the skip works.
With the 20260214-152058-2a448ce4 images, OpenCL tests are failing with a segmentation fault. I can't reproduce it on my local machine, but I guess it's due to the CUDA version upgrade. This PR reverts the upgrade and also downgrades PyTorch to match the CUDA compatibility.
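As an aside, a quick way to sanity-check that the reverted base image and the downgraded PyTorch wheel agree on CUDA versions is sketched below; this is a generic verification snippet run inside the CI container, not part of this PR.

#!/usr/bin/env bash
set -euo pipefail

# Print the CUDA toolkit shipped in the image and the CUDA build the
# installed torch wheel was compiled against; a cu118 wheel, for example,
# should pair with a CUDA 11.8 base image.
nvcc --version | grep release
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"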