Hey team,
I followed the setup from the TPU7X GPT-OSS vLLM inference recipe, but the inference server frequently failed to start (63.64% error rate) with a RESOURCE_EXHAUSTED error:
RESOURCE_EXHAUSTED: Error loading program 'jit__multi_slice': Attempting to reserve 6.14G at the bottom of memory. That was not possible.
Run Details:
I found that lowering the gpu-memory-utilization value from 0.93 to 0.83 lets the server start reliably.
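My working mental model (please correct me if this is wrong): gpu-memory-utilization is the fraction of per-chip HBM that vLLM pre-reserves for weights plus KV cache, so whatever is left over has to hold things like the compiled program's 6.14 GiB reservation. A quick back-of-the-envelope sketch, where the HBM size is a made-up placeholder, not an actual TPU7X spec:

```python
def free_headroom(total_hbm_gib: float, utilization: float) -> float:
    """GiB left outside the fraction vLLM pre-reserves (weights + KV cache)."""
    return total_hbm_gib * (1.0 - utilization)

HBM_GIB = 95.0  # hypothetical per-chip HBM, placeholder only

# At 0.93, headroom sits just above the 6.14 GiB the program tries to reserve,
# which would explain why startup only fails some of the time:
print(f"{free_headroom(HBM_GIB, 0.93):.2f} GiB free at 0.93")  # ~6.65 GiB
print(f"{free_headroom(HBM_GIB, 0.83):.2f} GiB free at 0.83")  # ~16.15 GiB
```

If that model is roughly right, any variation in other allocations could push the 0.93 case over the edge intermittently, but I'd like confirmation of how the accounting actually works on TPU.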
I would like to better understand:
- How gpu-memory-utilization maps to actual memory reservations, and why this error occurs only intermittently rather than on every start.
- The performance impact of lowering gpu-memory-utilization (e.g., a smaller KV cache and reduced batch concurrency).
- Whether there are better mitigation strategies for this issue.
Thanks for your help!