Description
Hi,
I'm tuning CosyVoice2 performance with Triton Inference Server and I would like some clarification about the instance_group setting in config.pbtxt.
What exactly is the role of instance_group? Does increasing it allow more inference requests to run in parallel? How does it interact with dynamic batching?
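For concreteness, this is the kind of stanza I'm asking about — the count, GPU index, and queue-delay values below are just placeholders to illustrate the question, not my actual configuration:

```
instance_group [
  {
    count: 2        # number of model instances (placeholder value)
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 1000   # placeholder value
}
```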
For which components of the model (cosyvoice2, audio_tokenizer, speaker_embedding, tensorrt_llm, token2wav) is it useful to increase instance_group? Should multiple instances be configured for all of them, or only for specific components (e.g. tensorrt_llm or token2wav)?
What is the relationship between the number of simultaneous inference requests and instance_group?
If I want to support N concurrent TTS requests, should the instance count scale proportionally with N?
Any best practices for configuring this for low latency and stable streaming under moderate concurrency would be very helpful.
Thanks!