Description
When the workload is high, some models in the Triton ONNX Runtime backend start to fail, and once a model fails it never succeeds again. Failures look like:
"[StatusCode.INTERNAL] onnx runtime error 6: Non-zero status code returned while running Conv node. Name:'/model.8/cv1/conv/Conv' Status Message: /workspace/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
(see also microsoft/onnxruntime#12288; I'm not the only one facing this problem)
and
"[StatusCode.INTERNAL] onnx runtime error 6: Non-zero status code returned while running Conv node. Name:'/model.1/model/model.0/model.0.1/block/block.3/block.3.0/Conv' Status Message: /workspace/onnxruntime/onnxruntime/core/framework/bfc_arena.cc:368 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 4547406516732812544
When the "Failed to allocate memory" error occurred, I checked memory usage with nvidia-smi: peak usage did not reach 100%, yet all subsequent inferences on that model fail.
The following image shows the Prometheus dashboard: at the moment the model reports "Failed to allocate memory", GRAM usage is actually low.
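(The dashboard reads Triton's standard GPU metrics. As a quick way to spot-check the same numbers without Prometheus, one can scrape the metrics endpoint directly; this sketch assumes the default metrics port 8002.)

```python
# Spot-check Triton's GPU memory metrics without a Prometheus server.
# Assumption: Triton's metrics endpoint is at the default http://localhost:8002/metrics.
import urllib.request

metrics = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()
for line in metrics.splitlines():
    # nv_gpu_memory_used_bytes / nv_gpu_memory_total_bytes are Triton's GPU memory gauges.
    if line.startswith(("nv_gpu_memory_used_bytes", "nv_gpu_memory_total_bytes")):
        print(line)
```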

Triton Information
What version of Triton are you using?
I tried r23.03, r23.05, and r23.06, all with the same problem. r22.07 is OK.
Are you using the Triton container or did you build it yourself?
Triton container
To Reproduce
Steps to reproduce the behavior.
Put ~30 ONNX Runtime models in Triton, enable memory arena shrinkage, and keep running them until one model reports SafeIntOnOverflow or Failed to allocate memory. After that, the model will never succeed again unless you restart Triton. A load-generation sketch is shown below.
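For reference, a minimal sketch of the kind of load loop that triggers this (not my exact client code; the server address, model names, and input name/shape are placeholders to adapt to your own deployment, and each model's config.pbtxt enables the ONNX Runtime backend's memory arena shrinkage option as described above):

```python
# Rough load-generation sketch. Placeholders/assumptions: server at localhost:8000,
# models named model_00..model_29, each taking one FP32 input called "images"
# with shape [1, 3, 640, 640]; adjust these to your actual models.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)
model_names = [f"model_{i:02d}" for i in range(30)]  # placeholder names

while True:
    # Fire requests for all models at once to keep GPU memory pressure high.
    pending = []
    for name in model_names:
        data = np.random.rand(1, 3, 640, 640).astype(np.float32)
        inp = httpclient.InferInput("images", list(data.shape), "FP32")
        inp.set_data_from_numpy(data)
        pending.append((name, client.async_infer(name, inputs=[inp])))
    for name, request in pending:
        try:
            request.get_result()
        except Exception as err:
            # Once SafeIntOnOverflow or "Failed to allocate memory" shows up here,
            # the same model keeps failing until Triton is restarted.
            print(f"{name}: {err}")
```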
Expected behavior
A clear and concise description of what you expected to happen.
SafeIntOnOverflow should not happen; I never saw this error in r22.07.
Failed to allocate memory should only happen if GRAM is really full. And once it happens, the error should not recur after other models finish inferencing and arena shrinkage returns GRAM to the system.
