Question 1: Yes, and no. You are correct that the emotion you write is "fixed". My idea with {seg} was to make this dynamic, in the sense that you can feed the actual TTS content to the text interpreter model dynamically. The dynamic part is the TTS text, not the emotion: whatever you write before or after {seg} is fixed for all the generated segments. And no, because it's not ALL the text in the "TTS Text" node that constitutes a "seg". A segment is one full generation by the model in one go; you can see in the console that if it stopped and started generating again, that is another segment. And the text can be segmented in some ways:
1- by configuring the chunks (if the text reaches t…

Question 2: Yes, correct. A is the voice reference to be cloned; B is the emotion from another voice, not the voice itself.

Question 3: Yes, all your "B"s need to be a "character" so you can clone their emotions. But no, you don't need to record your emotion for every phrase. The emotion can be anything; it does not need to match the TTS phrase. Your "Angry_Sarah" can be any person saying absolutely anything angrily. The model should clone that angriness from that person and apply it to your "A" character. How good is this? You tell me. IndexTTS2 is very interesting on paper... in practice, I'm not so sure.

Question 4: You can control emotion_alpha with tags. For now I have not added emotional control from text (as opposed to vectors) because it would look cluttered... but I could try to add it in the future. To be honest, in my experience, using an emotion from an audio reference (from a character B, like you said) works much better than using vectors. You might be better served by having a character audio to clone emotions from than by using vectors.

I hope this makes everything clearer.
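To illustrate the {seg} behavior described in Question 1, here is a minimal sketch (not the actual TTS-Audio-Suite code; the function name and prompt wording are hypothetical) of how a template with a {seg} placeholder gets expanded once per segment, keeping the text before and after the placeholder fixed:

```python
# Illustrative sketch only: how a "{seg}" template might be expanded.
# The fixed wrapper text is repeated for every segment; only the TTS
# text substituted into the placeholder changes.

def expand_template(template: str, segments: list[str]) -> list[str]:
    """Fill the {seg} placeholder with each TTS segment in turn."""
    return [template.replace("{seg}", seg) for seg in segments]

# Hypothetical usage: two segments produced by two separate generations.
prompts = expand_template(
    "Describe the emotion of this line: {seg}",
    ["Hello there.", "How dare you!"],
)
# Each prompt keeps the fixed wrapper; only the segment text is dynamic.
```

This matches the behavior described above: the emotion instruction around {seg} is static, while each console-visible generation contributes its own segment text.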
Also, worth linking to the guide: https://github.com/diodiogod/TTS-Audio-Suite/blob/main/docs/IndexTTS2_Emotion_Control_Guide.md