❓ High Native Memory Usage with Multiple OrtSession Instances - Optimization Options? #718
**Issue Description**

I'm using ONNX Runtime Java for real-time VAD (Voice Activity Detection) inference. Each WebSocket session creates its own `OrtSession`:

```java
public SlieroVadOnnxModel(String modelPath) throws OrtException {
    OrtEnvironment env = OrtEnvironment.getEnvironment(OrtLoggingLevel.ORT_LOGGING_LEVEL_ERROR);
    OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
    opts.setInterOpNumThreads(1);
    opts.setIntraOpNumThreads(1);
    opts.addCPU(true);
    session = env.createSession(modelPath, opts);
}
```

**Problem**

With 200 concurrent sessions, we observe ~4GB of native (off-heap) memory usage, causing OOM errors. We suspect each `OrtSession` holds its own copy of the model in native memory.

Memory breakdown:
**Questions**

**Environment**

Thanks!
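[Editor's note: one mitigation worth sketching here, not from the thread itself. ONNX Runtime documents that concurrent `run()` calls on a single `OrtSession` are thread-safe, so the 200 connections could in principle share one session instead of creating 200. A minimal plain-Java sketch of the sharing pattern; `inferFn` is a stand-in for the real `session.run()` call, and the class and method names are illustrative:]

```java
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Sketch: share ONE model instance across all WebSocket connections
// instead of creating one OrtSession per connection. `inferFn` stands
// in for an actual session.run() call (hypothetical wiring).
class SharedVadModel {
    private final Function<float[], Float> inferFn;
    private final Semaphore permits; // cap concurrent native inference calls

    SharedVadModel(Function<float[], Float> inferFn, int maxConcurrency) {
        this.inferFn = inferFn;
        this.permits = new Semaphore(maxConcurrency);
    }

    float speechProbability(float[] chunk) throws InterruptedException {
        permits.acquire(); // avoid swamping the intra-op thread pool
        try {
            return inferFn.apply(chunk);
        } finally {
            permits.release();
        }
    }
}
```

The semaphore bounds how many threads hit the native runtime at once, which matters once hundreds of connections share a single session.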
Hi,
I believe there are three broad architectural ways to handle this:
If each session consumes only 20 MB of memory, this is actually quite good. RAM is cheap, so if your app works properly and requires only 4GB of RAM per 200 connections, I believe this is fine; PyTorch typically requires more RAM. Another looming issue is that IDEALLY each VAD instance should have a SEPARATE CPU thread to work with, to avoid deadlocks. So if all 200 connections are active all the time, it may cause some congestion. Probably a better way would be to serve the connections from a shared pool of 10-20 instances via RPC, each one with a dedicated thread / CPU / processor. An issue for us with Python was that creating a VAD instance is not cheap and takes some time.
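[Editor's note: the "pool of 10-20 instances, each with a dedicated thread" idea could be sketched as follows in plain Java; the names are illustrative and the actual model call is passed in as a `Callable`. Pinning each connection to one single-threaded worker also keeps a stateful model instance confined to a single thread:]

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: N model workers, each with its own dedicated thread, serving
// many connections. Each connection is pinned to one worker so a
// stateful model instance is only ever touched by a single thread.
class VadWorkerPool {
    private final ExecutorService[] workers;

    VadWorkerPool(int n) {
        workers = new ExecutorService[n];
        for (int i = 0; i < n; i++) {
            workers[i] = Executors.newSingleThreadExecutor(); // one thread per model
        }
    }

    // Route a connection to a fixed worker by connection id.
    Future<Float> submit(int connectionId, Callable<Float> inference) {
        return workers[Math.floorMod(connectionId, workers.length)].submit(inference);
    }

    void shutdown() {
        for (ExecutorService w : workers) w.shutdown();
    }
}
```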
We did not dig into this direction. I would experiment with setting 0 or 1 for threads and would fiddle with turning various optimizations on and off, just out of curiosity. Some testing is required. I am not sure that the Java and Python runtimes behave identically here.
I cannot really comment on how ONNX Runtime works in Java, but there is a different important thing to consider. You see, the model is NOT STATELESS, i.e. it trades a low compute and memory footprint for keeping "memory" about the previous chunk passed to the model. With ONNX this happens in silero-vad/src/silero_vad/utils_vad.py, lines 78 to 85 at commit 6979fbd. I.e. some part of the previous audio is appended as context, and the model state is passed back into the model. Also pay attention to silero-vad/src/silero_vad/utils_vad.py, lines 51 to 55 at commit 6979fbd.
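[Editor's note: to make the stateful interface concrete, if the heavy session is shared, the per-stream data that must stay separate is the recurrent state tensor and the trailing audio context, mirroring what the referenced Python helper does. A hedged Java sketch; the 2×1×128 state and 64-sample context sizes are assumptions based on the Silero VAD v5 ONNX interface for 16 kHz input, so verify them against your model, and the class itself is illustrative rather than taken from the repo:]

```java
// Per-connection state for a stateful VAD model: the session (weights)
// can be shared, but each audio stream keeps its own recurrent state
// and audio context between calls.
class VadStreamState {
    float[] state = new float[2 * 1 * 128]; // recurrent state fed back each call (assumed shape)
    float[] context = new float[64];        // tail of the previous chunk (assumed size)

    // Prepend the saved context to the new chunk, then remember this
    // chunk's tail for the next call. Assumes chunk.length >= 64, as
    // with the usual 512-sample chunks at 16 kHz.
    float[] withContext(float[] chunk) {
        float[] input = new float[context.length + chunk.length];
        System.arraycopy(context, 0, input, 0, context.length);
        System.arraycopy(chunk, 0, input, context.length, chunk.length);
        System.arraycopy(chunk, chunk.length - context.length, context, 0, context.length);
        return input;
    }

    // Zero everything when a stream ends or restarts.
    void reset() {
        java.util.Arrays.fill(state, 0f);
        java.util.Arrays.fill(context, 0f);
    }
}
```

On each inference call the `state` array would be sent as a model input and overwritten with the state output, while `withContext` prepares the audio input, keeping 200 lightweight state objects instead of 200 full sessions.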