Question 1: Yes, and no. You are correct that the emotion you write is "fixed". My idea with {seg} was to make this dynamic, in the sense that you can feed the actual TTS content to the text interpreter model dynamically. The dynamic part is the TTS text, not the emotion: whatever you write before or after {seg} is fixed for all the generated segments. And no, because it's not ALL the text in the "TTS Text" node that constitutes a "seg". A segment is one full generation by the model in one go; you can see in the console that if it stopped and started generating again, that is another segment. And the text can be segmented in some ways:
1- by configuring the chunks (if the text reaches t…

Question 2: Yes, correct. A is the voice reference to be cloned; B is the emotion from another voice, not the voice itself.

Question 3: Yes, all your "B"s need to be a "character" so you can clone their emotions. But no, you don't need to record your emotion for every phrase. The emotion can be anything; it does not need to match the TTS phrase. Your "Angry_Sarah" can be any person saying absolutely anything angrily. The model should clone that angriness from that person and apply it to your "A" character. How good is this? You tell me. IndexTTS2 is very interesting on paper... in practice, I'm not so sure.

Question 4: You can control emotion_alpha with tags. For now I have not added emotional control from text (as opposed to vectors) because it would look cluttered... but I could try to add it in the future. To be honest, in my experience, using an emotion from an audio reference (from a character B, like you said) works much better than using vectors. You might be better served by having a character audio to clone emotions from than by using vectors.

I hope this makes everything clearer.
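To illustrate the {seg} behavior described in Question 1, here is a minimal sketch (not the actual TTS-Audio-Suite code; the function name and prompt wording are hypothetical) of how a template with a {seg} placeholder gets expanded once per segment, keeping the text before and after the placeholder fixed:

```python
# Illustrative sketch only: how a "{seg}" template might be expanded.
# The fixed wrapper text is repeated for every segment; only the TTS
# text substituted into the placeholder changes.

def expand_template(template: str, segments: list[str]) -> list[str]:
    """Fill the {seg} placeholder with each TTS segment in turn."""
    return [template.replace("{seg}", seg) for seg in segments]

# Hypothetical usage: two segments produced by two separate generations.
prompts = expand_template(
    "Describe the emotion of this line: {seg}",
    ["Hello there.", "How dare you!"],
)
# Each prompt keeps the fixed wrapper; only the segment text is dynamic.
```

This matches the behavior described above: the emotion instruction around {seg} is static, while each console-visible generation contributes its own segment text.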
Also, worth linking to the guide: https://github.com/diodiogod/TTS-Audio-Suite/blob/main/docs/IndexTTS2_Emotion_Control_Guide.md