Hey team,
I followed the setup from the TPU7X GPT-OSS vLLM inference recipe, but the inference server frequently failed to start (63.64% error rate) with a RESOURCE_EXHAUSTED error:
RESOURCE_EXHAUSTED: Error loading program 'jit__multi_slice': Attempting to reserve 6.14G at the bottom of memory. That was not possible.
Run Details:
I found that lowering the gpu-memory-utilization value from 0.93 to 0.83 lets the server start reliably.
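My working mental model (please correct me if this is wrong): gpu-memory-utilization is the fraction of per-chip HBM that vLLM pre-reserves for weights plus KV cache, so whatever is left over has to hold things like the compiled program's 6.14 GiB reservation. A quick back-of-the-envelope sketch, where the HBM size is a made-up placeholder, not an actual TPU7X spec:

```python
def free_headroom(total_hbm_gib: float, utilization: float) -> float:
    """GiB left outside the fraction vLLM pre-reserves (weights + KV cache)."""
    return total_hbm_gib * (1.0 - utilization)

HBM_GIB = 95.0  # hypothetical per-chip HBM, placeholder only

# At 0.93, headroom sits just above the 6.14 GiB the program tries to reserve,
# which would explain why startup only fails some of the time:
print(f"{free_headroom(HBM_GIB, 0.93):.2f} GiB free at 0.93")  # ~6.65 GiB
print(f"{free_headroom(HBM_GIB, 0.83):.2f} GiB free at 0.83")  # ~16.15 GiB
```

If that model is roughly right, any variation in other allocations could push the 0.93 case over the edge intermittently, but I'd like confirmation of how the accounting actually works on TPU.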
I would like to better understand:
- How gpu-memory-utilization maps to actual memory reservations, and why this error occurs only intermittently rather than on every start.
- The performance impact of lowering gpu-memory-utilization (e.g., a smaller KV cache and reduced batch concurrency).
- Whether there are better mitigation strategies for this issue.
Thanks for your help!