
How to set max_length with HF models#57

Open
MathVast wants to merge 17 commits into experimaestro:master from MathVast:master

Conversation

@MathVast
Collaborator

I have implemented some changes related to the max_length option in the CrossScorer class, but between the first two commits and the last one I noticed they contradict each other (even though the first two commits only made explicit a behavior that was already happening, since the tokenizer_options were commented out and thus never used).

Long story short, I think it is necessary to find a way to explicitly set, somewhere, the maximum input length that cross-scorers (and perhaps other models too) can handle. Personally, I think it should be set when the tokenizer and the encoder are initialized, by copying the corresponding HF config value (either max_position_embeddings or max_length, I'm not sure which).
As an example, right now, even though I'm working with a MiniLM model with a maximum input size of 512, the max_length is the maximum system size for the encoder (not very convenient) and 4096 for the tokenizer (set inside the HFTokenizer class).

Then at inference, this can be further specified if we want to have shorter inputs.

mat_vast added 4 commits September 29, 2025 11:30
…ax_length attribute (used only for setting tokenizer_options but the corresponding line is commented out as of right now)
…fig instead of maximum system size (dangerous imo) + add max length in the forward
@bpiwowar
Collaborator

Yes, agreed — I think the best place is at the tokenizer level, and this is somewhat implemented already, I think (with options that can be overwritten).

@MathVast
Collaborator Author

Should this PR also include HFCrossScorer? https://github.com/experimaestro/experimaestro-ir/blob/master/src/xpmir/neural/huggingface.py#L19

We could imagine removing the max_length parameter and instead reading it from the model's config:

self.config = AutoConfig.from_pretrained(self.hf_id)
self.max_length = self.config.max_position_embeddings 

For encoders at least, I believe this attribute is present in every config.json file and should be a good starting point for setting the model's maximum input length (also, since we want to set it through the tokenizer, this value will be double-checked anyway in the forward pass).
Edge cases would be more recent architectures, but I guess those don't fit in the scope of the models supported by the HFCrossScorer class anyway.
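The edge case mentioned above (configs that lack max_position_embeddings) could be handled with a simple fallback read. This is a minimal sketch; `max_length_from_config` and the default of 512 are illustrative assumptions, and the SimpleNamespace objects stand in for what `AutoConfig.from_pretrained(hf_id)` would return:

```python
from types import SimpleNamespace

def max_length_from_config(config, default: int = 512) -> int:
    """Read the model's positional limit from its HF config.
    Most encoder configs expose max_position_embeddings; fall back
    to a default for architectures that do not."""
    return getattr(config, "max_position_embeddings", default)

# Stand-ins for AutoConfig.from_pretrained(...) results:
bert_like = SimpleNamespace(max_position_embeddings=512)
no_limit = SimpleNamespace()  # a config without the attribute

assert max_length_from_config(bert_like) == 512
assert max_length_from_config(no_limit) == 512
```

With a real checkpoint this would be called as `max_length_from_config(AutoConfig.from_pretrained(self.hf_id))`.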
