
How to set max_length with HF models#57

Open
MathVast wants to merge 17 commits into experimaestro:master from MathVast:master

Conversation

@MathVast
Collaborator

I have implemented some changes related to the max_length option in the CrossScorer class, but between the first two commits and the last one I noticed they contradict each other (even though the first two commits only made explicit a behavior that was already happening, since the tokenizer_options were commented out and thus never used).

Long story short, I think it is necessary to find a way to explicitly set, somewhere, the maximum input length that cross-scorers (and perhaps other models too) can handle. Personally, I think it should be set when the tokenizer and the encoder are initialized, by copying the corresponding HF config value (either max_position_embeddings or max_length, I'm not sure which).
As an example, right now, even though I'm working with a MiniLM model with a maximum input size of 512, the max_length is the maximum system size for the encoder (not very convenient) and 4096 for the tokenizer (set inside the HFTokenizer class).

Then at inference, this can be further specified if we want to have shorter inputs.

mat_vast added 4 commits September 29, 2025 11:30
…ax_length attribute (used only for setting tokenizer_options but the corresponding line is commented out as of right now)
…fig instead of maximum system size (dangerous imo) + add max length in the forward
@bpiwowar
Collaborator

Yes, agreed — I think the best place is at the tokenizer level, and this is somewhat implemented already, I think (with options that can be overwritten).

@MathVast
Collaborator Author

Should this PR also include HFCrossScorer? https://github.com/experimaestro/experimaestro-ir/blob/master/src/xpmir/neural/huggingface.py#L19

We could imagine removing the max_length parameter and instead reading it from the model's config:

self.config = AutoConfig.from_pretrained(self.hf_id)
self.max_length = self.config.max_position_embeddings 

For encoders at least, I believe this attribute is present in every config.json file and should be a good starting point for setting the model's maximum input length (also, since we want to set it through the tokenizer, this value will be double-checked anyway in the forward pass).
Edge cases would be more recent architectures, but I guess those don't fit in the scope of the models supported by the HFCrossScorer class anyway.
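The edge case mentioned above (configs that lack max_position_embeddings) could be handled with a simple fallback read. This is a minimal sketch; `max_length_from_config` and the default of 512 are illustrative assumptions, and the SimpleNamespace objects stand in for what `AutoConfig.from_pretrained(hf_id)` would return:

```python
from types import SimpleNamespace

def max_length_from_config(config, default: int = 512) -> int:
    """Read the model's positional limit from its HF config.
    Most encoder configs expose max_position_embeddings; fall back
    to a default for architectures that do not."""
    return getattr(config, "max_position_embeddings", default)

# Stand-ins for AutoConfig.from_pretrained(...) results:
bert_like = SimpleNamespace(max_position_embeddings=512)
no_limit = SimpleNamespace()  # a config without the attribute

assert max_length_from_config(bert_like) == 512
assert max_length_from_config(no_limit) == 512
```

With a real checkpoint this would be called as `max_length_from_config(AutoConfig.from_pretrained(self.hf_id))`.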
