Description
Hi,
Currently it does not seem possible to use MultipleNegativesRankingLoss together with the SentenceTransformerTrainer.
This is the training script I follow (I implemented the same steps for my real-world dataset):
https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/ms_marco/train_bi-encoder_mnrl.py
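For context, the core of that setup looks roughly like this in my code (the model name, batch size, paths, and data below are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distilbert-base-uncased")

# Placeholder data; in practice these are (query, relevant passage) pairs.
train_examples = [
    InputExample(texts=["a query", "a relevant passage"]),
    InputExample(texts=["another query", "another relevant passage"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=256)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
    evaluation_steps=1000,  # measured in training steps, i.e. in batches
    # evaluator=dev_evaluator,  # in my real setup, an InformationRetrievalEvaluator
    output_path="output/mnrl-fit",
)
```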
This is an issue because, when I train with MultipleNegativesRankingLoss using the traditional model.fit, the trainer does not return an evaluation loss. I know there is a way to get evaluations, but it is either too frequent or forces a smaller batch size, and training then overfits. Let me explain:
- In model.fit(), evaluation_steps is defined in terms of training steps (iterations). Since each step = one batch, changing the batch size directly changes the effective evaluation frequency.
- With larger batch sizes (which are recommended for MNRL to stabilize gradients and reduce overfitting), the number of steps per epoch decreases. This means that for the same evaluation_steps value, evaluations become less frequent.
- With smaller batch sizes, you get more frequent evaluations, but the signal from MNRL becomes noisier and training may overfit or become unstable.
So the trade-off is not only about choosing between frequent vs. infrequent evaluations, but also about balancing it with the batch size that MNRL requires.
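To make the coupling concrete, here is a toy calculation (the numbers are made up):

```python
# With model.fit(), evaluation_steps counts training steps, and one step is one
# batch, so the evaluation frequency per epoch depends directly on the batch size.
num_train_examples = 500_000
evaluation_steps = 1_000  # fixed value passed to model.fit()

for batch_size in (32, 256):
    steps_per_epoch = num_train_examples // batch_size
    evals_per_epoch = steps_per_epoch / evaluation_steps
    print(f"batch_size={batch_size}: {evals_per_epoch:.1f} evaluations per epoch")

# batch_size=32: 15.6 evaluations per epoch
# batch_size=256: 2.0 evaluations per epoch
```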
This makes it very difficult to tune experiments reliably:
- If I prioritize large batches (good for MNRL), I lose the evaluation signal.
- If I prioritize evaluation frequency, I am forced into small batches that hurt training quality and efficiency.
A native solution in SentenceTransformerTrainer that decouples evaluation frequency from batch size — and integrates eval loss logging with callbacks — would make MNRL training much more practical and reproducible.
An even more frustrating fact is that with model.fit, I cannot properly use early-stopping callbacks or track metrics through MLflow (my preference) or any other logging framework.
I am just training blindly without any feedback signal from the validation set, which is very limiting for real-world experiments.
It would be very helpful if there were:
- A way to integrate MultipleNegativesRankingLoss into the SentenceTransformerTrainer’s evaluation loop.
- Native support for logging the eval loss so that callbacks (early stopping, best checkpoint, etc.) and external loggers (e.g. MLflow, WandB) can be used; roughly what I have in mind is sketched below.
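Concretely, this is roughly what I would like to be able to write. The dataset, model, and hyperparameters below are just an example, and getting an eval loss for MNRL is the part I could not achieve:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from transformers import EarlyStoppingCallback

model = SentenceTransformer("distilbert-base-uncased")
loss = MultipleNegativesRankingLoss(model)

# Example (anchor, positive) pair dataset; my real data has the same shape.
dataset = load_dataset("sentence-transformers/all-nli", "pair")
train_dataset = dataset["train"]
eval_dataset = dataset["dev"]

args = SentenceTransformerTrainingArguments(
    output_dir="output/mnrl-trainer",
    per_device_train_batch_size=256,    # large batches, as MNRL prefers
    eval_strategy="steps",              # eval frequency no longer tied to batch size
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # requires an eval loss to be logged
    report_to="mlflow",                 # or "wandb"
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```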
This gap makes it hard to run reproducible and well-controlled experiments — right now the only option is to “just train and hope” with MNRL, which is a bit heartbreaking 🙂
Would it make sense to add an example or official support for this in the examples/ directory?
I would be happy to contribute an implementation (a custom collator plus compute_loss integration for the HF Trainer, roughly along the lines of the sketch below) if the maintainers think this fits.
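To illustrate, a rough sketch of the kind of integration I mean; the MNRLTrainer class, the pair collator it assumes, and the "anchor"/"positive" batch keys are hypothetical, not an existing API:

```python
from transformers import Trainer
from sentence_transformers.losses import MultipleNegativesRankingLoss


class MNRLTrainer(Trainer):
    """Plain HF Trainer that computes MNRL on tokenized (anchor, positive) batches."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # self.model is the SentenceTransformer passed via Trainer(model=...)
        self.mnrl_loss = MultipleNegativesRankingLoss(self.model)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # A hypothetical pair collator would put two tokenized feature dicts
        # into each batch: one for the anchors, one for the positives.
        features = [inputs["anchor"], inputs["positive"]]
        loss = self.mnrl_loss(features, labels=None)
        return (loss, None) if return_outputs else loss
```

Because the HF Trainer reuses compute_loss during evaluation, the same code path could in principle produce an eval loss for early stopping and for external loggers.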
Thanks a lot!