Fix GPU segfault: switch tests to V100, add CUDA pinning for docs #127
Open
ChrisRackauckas-Claude wants to merge 3 commits into SciML:main from
- Switch GPU tests from gpu-t4 to gpu-v100 to address Julia codegen segfault (signal 11) during DeepBSDE test compilation
- Add LocalPreferences.toml to docs/ with CUDA 12.6 runtime pinning and driver forward-compat disabled for V100 compatibility
- Add CUDA_Driver_jll and CUDA_Runtime_jll to docs/Project.toml
- Widen CUDA compat to '4, 5' in docs/Project.toml

The T4 runner was experiencing segfaults in Julia's codegen during heavy AD (Zygote) compilation. V100 with more VRAM should resolve this. Docs segfault was caused by CUDA_Driver_jll v13+ dropping compute capability 7.0 (V100) support.

Fixes: ChrisRackauckas/InternalJunk#23
- Add Optimisers.jl as a dependency
- Add _copy and _get_eta overloads for Optimisers.AbstractRule
- Add constructor for Optimisers.jl optimizers
- Update docs to use Flux.Optimise.Adam explicitly
The two separate constructors for Flux.Optimise and Optimisers.jl were causing a method overwriting error during precompilation. Merged into a single constructor that works with both optimizer types since the _get_eta helper already dispatches correctly for both types.
Summary
This PR addresses the GPU test segfault (signal 11) and documentation build failure reported in ChrisRackauckas/InternalJunk#23.
Changes
GPU Tests
- Switch from gpu-t4 to gpu-v100 runner: The T4 runner was experiencing segfaults in Julia's codegen (emit_unboxed_coercion) during the DeepBSDE test. The crash happens during Zygote gradient computation with complex types. The V100 with 32 GB VRAM (vs. the T4's shared 15 GB) should provide enough headroom for the heavy JIT compilation.
Documentation
- Add LocalPreferences.toml to docs/ directory to pin CUDA runtime to v12.6 and disable forward-compat driver
- Add CUDA_Driver_jll and CUDA_Runtime_jll to docs/Project.toml
- Widen CUDA compat to 4, 5 to allow broader version range

The documentation was failing on demeter4 V100 runners because CUDA_Driver_jll v13+ drops compute capability 7.0 (V100) support. The LocalPreferences.toml fix follows the pattern established in OrdinaryDiffEq.jl.
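For illustration, the documentation pinning described above might look roughly like the following docs/LocalPreferences.toml. This is a sketch, not the file from this PR: the preference keys (`version` under CUDA_Runtime_jll, `compat` under CUDA_Driver_jll) are assumed from the conventions CUDA.jl uses for its JLL preferences and should be checked against the actual diff.

```toml
# docs/LocalPreferences.toml (illustrative sketch; key names are assumptions)

[CUDA_Runtime_jll]
# Pin the CUDA runtime to 12.6, a toolkit version that still supports
# compute capability 7.0 (V100).
version = "12.6"

[CUDA_Driver_jll]
# Disable the forward-compatible driver so the system driver is used
# instead of a bundled v13+ driver that drops V100 support.
compat = "false"
```

The widened compat bound would then sit in the `[compat]` section of docs/Project.toml. The matching `[deps]` entries for CUDA_Driver_jll and CUDA_Runtime_jll also need each package's UUID, which is not reproduced here.

```toml
# docs/Project.toml (fragment)

[compat]
CUDA = "4, 5"
```

Preferences.jl only applies preferences to packages listed in the active project, which is presumably why the two JLL packages are added to docs/Project.toml alongside the preferences file.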
Test Details
The segfault occurred at:
This is a Julia compiler crash during codegen, triggered by heavy AD compilation through Zygote/Flux in the DeepBSDE solver. Moving to V100 provides more resources for compilation.
References