How to set max_length with HF models#57
MathVast wants to merge 17 commits into experimaestro:master
Conversation
…ax_length attribute (used only for setting tokenizer_options but the corresponding line is commented out as of right now)
…fig instead of maximum system size (dangerous imo) + add max length in the forward
Yes, agreed – I think the best place is at the tokenizer level, and this is somehow already implemented I think (with options that can be overwritten).
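As a minimal sketch of the "options that can be overwritten" idea mentioned above (the names below are hypothetical, not the actual xpmir API): tokenizer-level defaults are merged with call-site overrides, so `max_length` can be set once on the tokenizer and still be specialized per call.

```python
# Hypothetical sketch: tokenizer-level defaults that call sites can override,
# mirroring the "options that can be overwritten" behavior discussed above.

DEFAULT_TOKENIZER_OPTIONS = {"truncation": True, "max_length": 512}

def tokenizer_options(**overrides):
    # Call-site options take precedence over the tokenizer-level defaults.
    return {**DEFAULT_TOKENIZER_OPTIONS, **overrides}

print(tokenizer_options())                # {'truncation': True, 'max_length': 512}
print(tokenizer_options(max_length=128))  # {'truncation': True, 'max_length': 128}
```

With this shape, the tokenizer owns a sensible default while inference code can still request shorter inputs without touching the tokenizer configuration.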
|
Should this PR also include HFCrossScorer? https://github.com/experimaestro/experimaestro-ir/blob/master/src/xpmir/neural/huggingface.py#L19 We could imagine removing the
For encoders at least, I believe this attribute is present in every config.json file and should be the starting point for setting the model's maximum input length (also, since we want to set it through the tokenizer, this value will be double-checked in the forward anyway).
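For reference, a sketch of how that starting value could be read out of a model's `config.json` (the JSON below is an illustrative excerpt with MiniLM-like values, not a real checkpoint's config):

```python
import json

# Illustrative excerpt of a BERT-family config.json (values are examples,
# chosen to resemble a MiniLM-style model).
config_json = """
{
  "model_type": "bert",
  "hidden_size": 384,
  "max_position_embeddings": 512
}
"""

config = json.loads(config_json)
# Use max_position_embeddings as the starting point for the maximum input
# length; the tokenizer-side value can then double-check it in the forward.
model_max_length = config["max_position_embeddings"]
print(model_max_length)  # 512
```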
…e at the level of the Tokenizer only
… contains missing triplets)
I have implemented some changes related to the `max_length` option in the CrossScorer class, but between the first two commits and the last one, I noticed they contradict each other (even though the first two commits were just making explicit a behavior that was happening anyway, since the tokenizer_options were commented out and thus never used).
Long story short, I think it is necessary to find a way to explicitly set, somewhere, the maximum length of the inputs the cross-scorers (and perhaps other models too) can handle. Personally, I think it should be set at the initialization of the tokenizer and of the encoder by copying the corresponding HF config value (either `max_position_embeddings` or `max_length`, I'm not sure which).
As an example, right now, even though I'm working with a MiniLM model with a maximum input size of 512, the `max_length` is the maximum system size for the encoder (not very convenient) and 4096 for the tokenizer (set inside the `HFTokenizer` class). At inference, this can then be further restricted if we want shorter inputs.
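A minimal sketch of the initialization-time copy proposed above (the class and method names are hypothetical stand-ins, not the actual `HFTokenizer`/encoder classes): the limit is taken from the HF config once at construction, and inference may request shorter, but never longer, inputs.

```python
from typing import Optional

class Tokenizer:
    """Hypothetical stand-in for an HFTokenizer-like class."""

    def __init__(self, model_max_length: int):
        # Copied from the HF config (e.g. max_position_embeddings) instead of
        # a hard-coded default such as 4096 or the maximum system size.
        self.model_max_length = model_max_length

    def effective_length(self, requested: Optional[int] = None) -> int:
        # At inference, callers may shorten inputs but never exceed the model limit.
        if requested is None:
            return self.model_max_length
        return min(requested, self.model_max_length)

# MiniLM-like limit taken from the model's config.
tok = Tokenizer(model_max_length=512)
print(tok.effective_length())      # 512
print(tok.effective_length(128))   # 128
print(tok.effective_length(4096))  # 512, clamped to the model limit
```

This keeps a single source of truth (the HF config) while still allowing per-request truncation, which is the behavior the commits above were converging toward.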