Description
Hi,
I'm tuning CosyVoice2 performance with Triton Inference Server and I would like some clarification about the instance_group setting in config.pbtxt.
What exactly is the role of instance_group? Does increasing it allow more inference requests to run in parallel? How does it interact with dynamic batching?
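For concreteness, this is the kind of stanza I'm asking about — the count, GPU index, and queue-delay values below are just placeholders to illustrate the question, not my actual configuration:

```
instance_group [
  {
    count: 2        # number of model instances (placeholder value)
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 1000   # placeholder value
}
```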
For which components of the model (cosyvoice2, audio_tokenizer, speaker_embedding, tensorrt_llm, token2wav) is it useful to increase instance_group? Should multiple instances be configured for all of them, or only for specific components (e.g. tensorrt_llm or token2wav)?
What is the relationship between the number of simultaneous inference requests and instance_group?
If I want to support N concurrent TTS requests, should the instance count scale proportionally with N?
Any best practices for configuring this for low latency and stable streaming under moderate concurrency would be very helpful.
Thanks!