• GPU Model: NVIDIA GeForce RTX 5090 (32 GB)
• CUDA Version: 13.0
Problem:
I am trying to speed up evaluation by running multiple environments in parallel. When I create enough environments to exhaust GPU memory and trigger an out-of-memory (OOM) error, the GPU appears to crash rather than just failing the allocation. After the crash, the GPU no longer shows up in the output of nvidia-smi, and it only reappears after a system reboot.
Questions:
1. Why does creating a large number of subprocess environments (SubprocEnv) crash the GPU instead of only raising an OOM error?
2. What are some strategies for debugging this issue?
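
As a starting point for the second question, here is a minimal sketch of one way to bound the number of environments before launching them, so the OOM condition is never reached. It reads memory figures from nvidia-smi and divides the remaining headroom by a per-environment footprint. The function names, the 2048 MiB reserve, and the per-environment footprint are all hypothetical; the footprint would have to be measured for the actual setup, for example by launching a single environment and comparing memory use before and after.

```python
import subprocess


def query_gpu_memory_mib(index=0):
    """Return (total, used) in MiB for one GPU, or None if the query fails."""
    cmd = [
        "nvidia-smi",
        f"--id={index}",
        "--query-gpu=memory.total,memory.used",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=10
        ).stdout
        # Expected shape of one line: "<total>, <used>"
        total, used = (int(x) for x in out.strip().splitlines()[0].split(","))
    except (OSError, subprocess.SubprocessError, ValueError, IndexError):
        return None
    return total, used


def max_envs(total_mib, used_mib, per_env_mib, reserve_mib=2048):
    """Number of environments that fit in the remaining GPU memory.

    per_env_mib and reserve_mib are assumptions to be measured or tuned.
    """
    free = total_mib - used_mib - reserve_mib
    return max(0, free // per_env_mib)


if __name__ == "__main__":
    mem = query_gpu_memory_mib()
    if mem is None:
        print("Could not query the GPU with nvidia-smi")
    else:
        total, used = mem
        # 1500 MiB per environment is a placeholder, not a measured value.
        print("Environment cap:", max_envs(total, used, per_env_mib=1500))
```

Increasing the environment count step by step up to this cap, while watching memory use, would also show at which point the failure begins.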