Frequently getting a RESOURCE_EXHAUSTED error when starting the server from the GPT-OSS vLLM inference recipe #151

@lepan-google

Description

Hey team,

I followed the setup from the TPU7X GPT-OSS vLLM inference recipe, but the server frequently (63.64% of attempts) fails to start with a RESOURCE_EXHAUSTED error:

RESOURCE_EXHAUSTED: Error loading program 'jit__multi_slice': Attempting to reserve 6.14G at the bottom of memory. That was not possible.

Run Details:

I found that reducing the gpu-memory-utilization value from 0.93 to 0.83 allows the server to start as expected.
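For reference, a minimal sketch of the workaround described above, assuming the recipe launches the server via `vllm serve` (the model name and the absence of other recipe flags are assumptions on my part, not taken from the recipe itself):

```shell
# Hypothetical launch command illustrating the mitigation.
# The model name is an assumption; substitute the one from the recipe.
# --gpu-memory-utilization is lowered from the recipe's 0.93 to 0.83,
# which reserves less HBM for the KV cache and lets the server start.
vllm serve openai/gpt-oss-120b \
  --gpu-memory-utilization 0.83
```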

I would like to better understand:

  • How gpu-memory-utilization impacts specific memory allocation and why this error occurs intermittently.
  • The performance impacts of lowering gpu-memory-utilization.
  • Whether there are better mitigation strategies for this issue.

Thanks for your help!
