Hi @davidbrowne17, thank you for spinning this up. I really appreciate the work to make finetuning CSM possible.
I have a quick question about the finetuning behavior. In `forward_and_loss`, it looks like the tokens for all N codebooks are predicted directly from the backbone hidden state, each with its own codebook-specific head.
However, this differs from the original CSM model, where the remaining N-1 codebooks (all except the first) are generated auto-regressively by a separate decoder transformer, again with codebook-specific heads.
Is this difference intentional? I can see the obvious savings in compute and complexity, but I'm curious whether you have any data points on the performance of this simplified architecture.
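For reference, here is a rough PyTorch sketch of the two schemes as I understand them. All names, shapes, and the decoder configuration are made up for illustration; this is not the actual code from this repo or from Sesame's release, just the conceptual difference I'm asking about.

```python
import torch
import torch.nn as nn

# Illustrative shapes only -- not the repo's real dimensions.
B, T, D = 2, 10, 1024          # batch, audio frames, backbone hidden size
N_CODEBOOKS, VOCAB = 32, 2051  # number of RVQ codebooks, codebook vocab size

h = torch.randn(B, T, D)       # backbone hidden state per audio frame

# (a) What forward_and_loss appears to do: one head per codebook, all N
#     codebooks predicted in parallel from the same backbone hidden state.
parallel_heads = nn.ModuleList([nn.Linear(D, VOCAB) for _ in range(N_CODEBOOKS)])
parallel_logits = torch.stack([head(h) for head in parallel_heads], dim=2)  # (B, T, N, VOCAB)

# (b) The original CSM behavior as I understand it: codebook 0 comes from the
#     backbone, then a smaller decoder transformer fills in codebooks 1..N-1
#     auto-regressively within each frame, conditioning on the codebooks
#     already produced for that frame. (The decoder config here is a stand-in,
#     and greedy argmax replaces sampling to keep the sketch short.)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True), num_layers=4
)
embed = nn.Embedding(VOCAB, D)
c0_head = nn.Linear(D, VOCAB)
cb_heads = nn.ModuleList([nn.Linear(D, VOCAB) for _ in range(N_CODEBOOKS - 1)])

c0 = c0_head(h).argmax(-1)                    # (B, T) first codebook from the backbone
seq = torch.stack([h, embed(c0)], dim=2)      # (B, T, 2, D) per-frame decoder input
for i in range(N_CODEBOOKS - 1):
    out = decoder(seq.flatten(0, 1))          # run the decoder over each frame's sequence
    ci = cb_heads[i](out[:, -1]).argmax(-1)   # next codebook token for every frame
    seq = torch.cat([seq, embed(ci).view(B, T, 1, D)], dim=2)
```

If (a) is indeed what the training code does, I'd expect it to trade some audio quality for the reduced compute, which is why I'm asking about any comparisons you may have run.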
Thank you for all your work!