Since csm is about emotional intelligence, and the underlying llama model generates the 32 audio codebooks based not just on the text to vocalize but also on the codebooks of the prior audio (to determine what prosody etc. is relevant in the current context), doesn't it follow that fine-tuning must always include at least one prior turn as context?
The current lora.py script expects only a single speaker per entry, which seems to go against the architecture of csm. By fine-tuning it this way, aren't you making it less capable of sounding empathetic, the very thing it was built for?
If so, any thoughts on adding turn-taking to fine-tuning? A rough sketch of the kind of training entry I have in mind is below.
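To make the request concrete, here is a minimal sketch of a multi-turn training example where prior turns are prepended as context and only the target turn contributes to the loss. The names (`Turn`, `tokenize_turn`, `build_training_example`) are placeholders I made up for illustration, not the actual lora.py or csm API:

```python
import torch
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Turn:
    speaker: int          # speaker id, e.g. 0 or 1
    text: str             # text of the utterance
    audio: torch.Tensor   # waveform (or pre-encoded codebook frames) for the turn

def build_training_example(
    context: List[Turn],
    target: Turn,
    tokenize_turn: Callable[[Turn], Tuple[torch.Tensor, torch.Tensor]],
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Concatenate prior turns with the target turn into one sequence.

    `tokenize_turn` stands in for whatever function converts a
    (speaker, text, audio) triple into interleaved text/audio frames plus
    an attention-style mask. Context frames are attended to but masked
    out of the loss, so the model conditions on the prior turn's prosody
    while only being trained on the target turn.
    """
    tokens, masks, loss_masks = [], [], []
    for turn in context:
        t, m = tokenize_turn(turn)
        tokens.append(t)
        masks.append(m)
        loss_masks.append(torch.zeros_like(m))  # context turns: no loss
    t, m = tokenize_turn(target)
    tokens.append(t)
    masks.append(m)
    loss_masks.append(m.clone())                # target turn: compute loss

    return (
        torch.cat(tokens, dim=0),
        torch.cat(masks, dim=0),
        torch.cat(loss_masks, dim=0),
    )
```

The dataset format would then pair each target utterance with the preceding turn(s) of the conversation, rather than treating every clip as an isolated single-speaker entry.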