❓ High Native Memory Usage with Multiple OrtSession Instances - Optimization Options? #718
**Issue Description**

I'm using ONNX Runtime Java for real-time VAD (Voice Activity Detection) inference. Each WebSocket session creates its own `OrtSession`:

```java
public SlieroVadOnnxModel(String modelPath) throws OrtException {
    OrtEnvironment env = OrtEnvironment.getEnvironment(OrtLoggingLevel.ORT_LOGGING_LEVEL_ERROR);
    OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
    opts.setInterOpNumThreads(1);
    opts.setIntraOpNumThreads(1);
    opts.addCPU(true);
    session = env.createSession(modelPath, opts);
}
```

**Problem**

With 200 concurrent sessions, we observe ~4GB of native (off-heap) memory usage, causing OOM errors. We suspect each `OrtSession` holds its own copy of the model in native memory.

Memory breakdown:
**Questions**

**Environment**

Thanks!
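[Editor's note: one mitigation worth sketching here, not from the thread itself. ONNX Runtime documents that concurrent `run()` calls on a single `OrtSession` are thread-safe, so the 200 connections could in principle share one session instead of creating 200. A minimal plain-Java sketch of the sharing pattern; `inferFn` is a stand-in for the real `session.run()` call, and the class and method names are illustrative:]

```java
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Sketch: share ONE model instance across all WebSocket connections
// instead of creating one OrtSession per connection. `inferFn` stands
// in for an actual session.run() call (hypothetical wiring).
class SharedVadModel {
    private final Function<float[], Float> inferFn;
    private final Semaphore permits; // cap concurrent native inference calls

    SharedVadModel(Function<float[], Float> inferFn, int maxConcurrency) {
        this.inferFn = inferFn;
        this.permits = new Semaphore(maxConcurrency);
    }

    float speechProbability(float[] chunk) throws InterruptedException {
        permits.acquire(); // avoid swamping the intra-op thread pool
        try {
            return inferFn.apply(chunk);
        } finally {
            permits.release();
        }
    }
}
```

The semaphore bounds how many threads hit the native runtime at once, which matters once hundreds of connections share a single session.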
Hi,
I believe there are three broad architectural ways to handle this:
If each session consumes only 20 MB of memory, this is actually quite good. RAM is cheap, so if your app works properly and requires only 4GB of RAM per 200 connections, I believe this is fine; PyTorch typically requires more RAM. Another looming issue is that IDEALLY each VAD instance should have a SEPARATE CPU thread to work with, to avoid deadlocks. So if all 200 connections are active all the time, it may cause some congestion. Probably a better way would be to serve the connections from a shared pool of 10-20 instances via RPC, each one with a dedicated thread / CPU / processor. An issue for us with Python was that creating a VAD instance is not cheap and takes some time.
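[Editor's note: the "pool of 10-20 instances, each with a dedicated thread" idea could be sketched as follows in plain Java; the names are illustrative and the actual model call is passed in as a `Callable`. Pinning each connection to one single-threaded worker also keeps a stateful model instance confined to a single thread:]

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: N model workers, each with its own dedicated thread, serving
// many connections. Each connection is pinned to one worker so a
// stateful model instance is only ever touched by a single thread.
class VadWorkerPool {
    private final ExecutorService[] workers;

    VadWorkerPool(int n) {
        workers = new ExecutorService[n];
        for (int i = 0; i < n; i++) {
            workers[i] = Executors.newSingleThreadExecutor(); // one thread per model
        }
    }

    // Route a connection to a fixed worker by connection id.
    Future<Float> submit(int connectionId, Callable<Float> inference) {
        return workers[Math.floorMod(connectionId, workers.length)].submit(inference);
    }

    void shutdown() {
        for (ExecutorService w : workers) w.shutdown();
    }
}
```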
We did not dig into this direction. I would experiment with setting 0 or 1 for threads and would fiddle with turning various optimizations on and off, just out of curiosity. Some testing is required. I am not sure that the Java and Python runtimes behave identically here.
I cannot really comment on how ONNX Runtime works in Java, but there is a different important thing to consider. You see, the model is NOT STATELESS, i.e. it trades a low compute and memory footprint for keeping "memory" about the previous chunk passed to the model. With ONNX this happens in silero-vad/src/silero_vad/utils_vad.py, lines 78 to 85 at commit 6979fbd. I.e. some part of the previous audio is appended as context, and the model state is passed back into the model. Also pay attention to silero-vad/src/silero_vad/utils_vad.py, lines 51 to 55 at commit 6979fbd.
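[Editor's note: to make the stateful interface concrete, if the heavy session is shared, the per-stream data that must stay separate is the recurrent state tensor and the trailing audio context, mirroring what the referenced Python helper does. A hedged Java sketch; the 2×1×128 state and 64-sample context sizes are assumptions based on the Silero VAD v5 ONNX interface for 16 kHz input, so verify them against your model, and the class itself is illustrative rather than taken from the repo:]

```java
// Per-connection state for a stateful VAD model: the session (weights)
// can be shared, but each audio stream keeps its own recurrent state
// and audio context between calls.
class VadStreamState {
    float[] state = new float[2 * 1 * 128]; // recurrent state fed back each call (assumed shape)
    float[] context = new float[64];        // tail of the previous chunk (assumed size)

    // Prepend the saved context to the new chunk, then remember this
    // chunk's tail for the next call. Assumes chunk.length >= 64, as
    // with the usual 512-sample chunks at 16 kHz.
    float[] withContext(float[] chunk) {
        float[] input = new float[context.length + chunk.length];
        System.arraycopy(context, 0, input, 0, context.length);
        System.arraycopy(chunk, 0, input, context.length, chunk.length);
        System.arraycopy(chunk, chunk.length - context.length, context, 0, context.length);
        return input;
    }

    // Zero everything when a stream ends or restarts.
    void reset() {
        java.util.Arrays.fill(state, 0f);
        java.util.Arrays.fill(context, 0f);
    }
}
```

On each inference call the `state` array would be sent as a model input and overwritten with the state output, while `withContext` prepares the audio input, keeping 200 lightweight state objects instead of 200 full sessions.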