Adds gradient cap for teacher-student distillation (#91)
This PR adds a gradient cap to the teacher-student distillation setup.
The goal is to prevent excessively large gradients from destabilizing training.
📌 Changes
- Introduced a clipping mechanism that caps gradients during backpropagation in the distillation process (a sketch of the idea follows below).
- Helps improve training stability, especially in early iterations.
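For illustration, here is a minimal sketch of what a gradient cap in a distillation step can look like in PyTorch. The names (`student`, `teacher`, `max_grad_norm`) and the KL-based distillation loss are assumptions for the example, not the exact code in this PR.

```python
# Illustrative sketch of gradient capping in a teacher-student distillation step.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, optimizer, inputs,
                      temperature=2.0, max_grad_norm=1.0):
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    student_logits = student(inputs)

    # Soft-target distillation loss: KL divergence between softened distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    optimizer.zero_grad()
    loss.backward()

    # Gradient cap: clip the global norm of the student's gradients so a single
    # oversized update cannot destabilize training, especially early on.
    torch.nn.utils.clip_grad_norm_(student.parameters(), max_norm=max_grad_norm)

    optimizer.step()
    return loss.item()
```

Clipping the global gradient norm (rather than individual elements) preserves the direction of the update while bounding its magnitude, which is typically why it is preferred for stabilizing early training.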
---------
Co-authored-by: alessandro.assirelli <[email protected]>