
Fine tuning doesn't use context? #39

@tommedema

Since csm is about emotional intelligence, and the underlying llama model generates the 32 audio codebooks based not just on the text to vocalize but also on the codebooks of the prior audio (to determine what prosody etc. is relevant in the current context), doesn't it follow that fine-tuning must always include at least one prior turn as context?

The current lora.py script expects only a single speaker per entry, which seems to go against the architecture of csm. By fine-tuning it this way, aren't you making it less capable of sounding empathetic, the very thing it was built for?

If so, any thoughts on adding turn-taking to fine-tuning?
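For concreteness, here is a minimal sketch of what a context-aware training sample might look like: prior turns are concatenated before the target turn, and the loss mask covers only the target. All names here (`Turn`, `build_training_sample`) and the toy token values are hypothetical illustrations, not part of the actual lora.py script.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Turn:
    speaker: int             # speaker id for this turn
    text_tokens: List[int]   # tokenized transcript (toy values)
    audio_tokens: List[int]  # flattened audio codebook tokens (toy values)

def build_training_sample(turns: List[Turn]) -> Tuple[List[int], List[bool]]:
    """Concatenate prior turns as context; mark only the final turn for loss.

    Earlier turns supply the prosodic/emotional context the model would
    condition on; only the last turn's tokens contribute to the loss.
    """
    input_ids: List[int] = []
    loss_mask: List[bool] = []
    for i, turn in enumerate(turns):
        seq = turn.text_tokens + turn.audio_tokens
        input_ids.extend(seq)
        loss_mask.extend([i == len(turns) - 1] * len(seq))
    return input_ids, loss_mask

# Usage: one prior turn of context followed by the target turn.
context = Turn(speaker=0, text_tokens=[1, 2], audio_tokens=[10, 11, 12])
target = Turn(speaker=1, text_tokens=[3, 4], audio_tokens=[20, 21, 22])
ids, mask = build_training_sample([context, target])
```

In this sketch the context turn's five tokens get a `False` mask while the target turn's five tokens get `True`, so a trainer would still learn only the target speaker's voice while conditioning on the preceding turn.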
