Fix GPU segfault: switch tests to V100, add CUDA pinning for docs #127

Open
ChrisRackauckas-Claude wants to merge 3 commits into SciML:main from ChrisRackauckas-Claude:fix-gpu-segfault-v100

@ChrisRackauckas-Claude
Contributor

Summary

This PR addresses the GPU test segfault (signal 11) and documentation build failure reported in ChrisRackauckas/InternalJunk#23.

Changes

GPU Tests

  • Switch from the gpu-t4 runner to gpu-v100: the T4 runner was segfaulting in Julia's codegen (emit_unboxed_coercion) during the DeepBSDE test. The crash happens while Zygote computes gradients through complex types. The V100, with 32GB of VRAM (vs. the T4's shared 15GB), should provide enough headroom for the heavy JIT compilation.

Documentation

  • Add LocalPreferences.toml to the docs/ directory to pin the CUDA runtime to v12.6 and disable the forward-compatible driver
  • Add CUDA_Driver_jll and CUDA_Runtime_jll to docs/Project.toml
  • Widen the CUDA compat entry to 4, 5 to allow a broader version range

The documentation build was failing on the demeter4 V100 runners because CUDA_Driver_jll v13+ drops support for compute capability 7.0 (V100). The LocalPreferences.toml fix follows the pattern established in OrdinaryDiffEq.jl.
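
For readers unfamiliar with this kind of pinning, a docs/LocalPreferences.toml of the sort described above might look roughly like the sketch below. It is an illustration, not the file from this PR: the `version` and `compat` preference keys are assumed from CUDA.jl's documented preference conventions, and the actual contents should be checked against the OrdinaryDiffEq.jl file this PR follows.

```toml
# docs/LocalPreferences.toml (assumed contents, for illustration only)

[CUDA_Runtime_jll]
# Pin the CUDA runtime (toolkit) to 12.6 instead of letting it auto-select.
version = "12.6"

[CUDA_Driver_jll]
# Do not load the bundled forward-compatible driver; use the system driver.
compat = "false"
```

Preferences.jl only applies preferences to packages that are listed in the active project, which is presumably why CUDA_Driver_jll and CUDA_Runtime_jll are also added to docs/Project.toml.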

Test Details

The segfault occurred at:

[1006912] signal 11 (1): Segmentation fault
in expression starting at .../test/DeepBSDE.jl:19
emit_unboxed_coercion at .../julia-release-1-dot-12/src/intrinsics.cpp:394 [inlined]
emit_unbox at .../julia-release-1-dot-12/src/intrinsics.cpp:458

This is a Julia compiler crash during codegen, triggered by heavy AD compilation through Zygote/Flux in the DeepBSDE solver. Moving to V100 provides more resources for compilation.

References

  • Fixes: ChrisRackauckas/InternalJunk#23
  • Related: ChrisRackauckas/InternalJunk#19 (CUDA compatibility pattern)

- Switch GPU tests from gpu-t4 to gpu-v100 to address Julia codegen
  segfault (signal 11) during DeepBSDE test compilation
- Add LocalPreferences.toml to docs/ with CUDA 12.6 runtime pinning
  and driver forward-compat disabled for V100 compatibility
- Add CUDA_Driver_jll and CUDA_Runtime_jll to docs/Project.toml
- Widen CUDA compat to '4, 5' in docs/Project.toml

The T4 runner was experiencing segfaults in Julia's codegen during
heavy AD (Zygote) compilation. V100 with more VRAM should resolve
this. Docs segfault was caused by CUDA_Driver_jll v13+ dropping
compute capability 7.0 (V100) support.

Fixes: ChrisRackauckas/InternalJunk#23

- Add Optimisers.jl as a dependency
- Add _copy and _get_eta overloads for Optimisers.AbstractRule
- Add constructor for Optimisers.jl optimizers
- Update docs to use Flux.Optimise.Adam explicitly

The two separate constructors for Flux.Optimise and Optimisers.jl were
causing a method overwriting error during precompilation. Merged into
a single constructor that works with both optimizer types since the
_get_eta helper already dispatches correctly for both types.