
Commit 87187b1

[chunked loss] align teacher and student logit shape (#634)
## Summary

In rare cases the teacher and student models do not have the same vocab size even though their vocabularies are actually identical (for example, Qwen models, where the teacher's logits are padded for training efficiency). In those cases we pad the student logits to match the teacher's.

## Testing Done

make test

- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
1 parent 3a5845b commit 87187b1

File tree

1 file changed (+15, −0 lines)


src/liger_kernel/chunked_loss/fused_linear_distillation.py

Lines changed: 15 additions & 0 deletions
@@ -115,6 +115,21 @@ def _compute_loss(
         student_logits_chunk /= temperature
         teacher_logits_chunk /= temperature
 
+        # If the teacher and student vocab sizes differ, pad the student logits to match the teacher's.
+        # This only applies when they share exactly the same vocab and tokenizer, and the teacher
+        # logits are merely padded for training efficiency, e.g.
+        # https://huggingface.co/Qwen/Qwen1.5-72B-Chat/discussions/1#662883f568adf59b07b176d2
+        teacher_vocab_size = teacher_weight.shape[0]
+        student_vocab_size = student_weight.shape[0]
+        if teacher_vocab_size > student_vocab_size:
+            pad_size = teacher_vocab_size - student_vocab_size
+            pad_tensor = torch.zeros(
+                (*student_logits_chunk.shape[:-1], pad_size),
+                dtype=student_logits_chunk.dtype,
+                device=student_logits_chunk.device,
+            )
+            student_logits_chunk = torch.cat([student_logits_chunk, pad_tensor], dim=-1)
+
         hard_loss /= full_target.shape[0]
 
         soft_loss = distillation_loss_fn(student_logits_chunk, teacher_logits_chunk, **loss_kwargs)
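For illustration, here is a minimal, self-contained sketch of the same zero-padding step outside the chunked-loss kernel. The helper name `pad_student_logits_to_teacher` and the vocab sizes are assumptions for the example, not part of this commit.

```python
import torch


def pad_student_logits_to_teacher(student_logits: torch.Tensor, teacher_vocab_size: int) -> torch.Tensor:
    """Right-pad the vocab (last) dimension of the student logits with zeros to match the teacher."""
    student_vocab_size = student_logits.shape[-1]
    if teacher_vocab_size <= student_vocab_size:
        return student_logits
    pad_size = teacher_vocab_size - student_vocab_size
    pad_tensor = torch.zeros(
        (*student_logits.shape[:-1], pad_size),
        dtype=student_logits.dtype,
        device=student_logits.device,
    )
    return torch.cat([student_logits, pad_tensor], dim=-1)


# Illustrative shapes only: a teacher whose output head is padded to a larger vocab.
student_logits = torch.randn(4, 1000)  # (chunk_size, student_vocab_size)
teacher_logits = torch.randn(4, 1024)  # (chunk_size, padded teacher_vocab_size)
student_logits = pad_student_logits_to_teacher(student_logits, teacher_logits.shape[-1])
assert student_logits.shape == teacher_logits.shape  # shapes now align for the distillation loss
```

Padding with zeros leaves the student's existing logits untouched; the distillation loss then receives matching (chunk_size, vocab) shapes for both models.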
