Fix GPU segfault: switch tests to V100, add CUDA pinning for docs by ChrisRackauckas-Claude · Pull Request #127 · SciML/HighDimPDE.jl

ChrisRackauckas-Claude · 2026-03-19T12:43:18Z

Summary

This PR addresses the GPU test segfault (signal 11) and documentation build failure reported in ChrisRackauckas/InternalJunk#23.

Changes

GPU Tests

Switch from the gpu-t4 runner to gpu-v100. The T4 runner was segfaulting in Julia's codegen (emit_unboxed_coercion) during the DeepBSDE test, while Zygote was computing gradients through complex types. The V100, with 32GB of VRAM (vs. the T4's shared 15GB), should provide enough headroom for the heavy JIT compilation.

Documentation

- Add LocalPreferences.toml to the docs/ directory to pin the CUDA runtime to v12.6 and disable the forward-compat driver
- Add CUDA_Driver_jll and CUDA_Runtime_jll to docs/Project.toml
- Widen the CUDA compat entry in docs/Project.toml to '4, 5' to allow a broader version range

The documentation was failing on demeter4 V100 runners because CUDA_Driver_jll v13+ drops compute capability 7.0 (V100) support. The LocalPreferences.toml fix follows the pattern established in OrdinaryDiffEq.jl.
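For illustration, a docs/LocalPreferences.toml along these lines would express the pinning described above. This is a sketch, not the file from the PR, whose contents are not shown on this page. The key names `version` and `compat` are assumptions based on the preferences that CUDA.jl documents for CUDA_Runtime_jll and CUDA_Driver_jll.

```toml
# docs/LocalPreferences.toml (illustrative sketch; key names are assumed, not copied from the PR)

[CUDA_Runtime_jll]
# Pin the CUDA runtime (toolkit) to the 12.6 series
version = "12.6"

[CUDA_Driver_jll]
# Disable the bundled forward-compatibility driver so the system driver is used
compat = "false"
```

Preferences are only picked up for packages that the environment lists as dependencies, which is presumably why the PR also adds CUDA_Driver_jll and CUDA_Runtime_jll to docs/Project.toml.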

Test Details

The segfault occurred at:

[1006912] signal 11 (1): Segmentation fault
in expression starting at .../test/DeepBSDE.jl:19
emit_unboxed_coercion at .../julia-release-1-dot-12/src/intrinsics.cpp:394 [inlined]
emit_unbox at .../julia-release-1-dot-12/src/intrinsics.cpp:458

This is a Julia compiler crash during codegen, triggered by heavy AD compilation through Zygote/Flux in the DeepBSDE solver. Moving to V100 provides more resources for compilation.

References

Fixes: ChrisRackauckas/InternalJunk#23
Related: ChrisRackauckas/InternalJunk#19 (CUDA compatibility pattern)

- Switch GPU tests from gpu-t4 to gpu-v100 to address Julia codegen segfault (signal 11) during DeepBSDE test compilation
- Add LocalPreferences.toml to docs/ with CUDA 12.6 runtime pinning and driver forward-compat disabled for V100 compatibility
- Add CUDA_Driver_jll and CUDA_Runtime_jll to docs/Project.toml
- Widen CUDA compat to '4, 5' in docs/Project.toml

The T4 runner was experiencing segfaults in Julia's codegen during heavy AD (Zygote) compilation. V100 with more VRAM should resolve this. Docs segfault was caused by CUDA_Driver_jll v13+ dropping compute capability 7.0 (V100) support.

Fixes: ChrisRackauckas/InternalJunk#23

- Add Optimisers.jl as a dependency
- Add _copy and _get_eta overloads for Optimisers.AbstractRule
- Add constructor for Optimisers.jl optimizers
- Update docs to use Flux.Optimise.Adam explicitly

The two separate constructors for Flux.Optimise and Optimisers.jl were causing a method overwriting error during precompilation. Merged into a single constructor that works with both optimizer types since the _get_eta helper already dispatches correctly for both types.

ChrisRackauckas added 3 commits March 19, 2026 08:42

Support Optimisers.jl optimizers in DeepSplitting

f422bbc


ChrisRackauckas-Claude wants to merge 3 commits into SciML:main from ChrisRackauckas-Claude:fix-gpu-segfault-v100
