Commit e40bef2
* tokenizers respect padding: true with non-null max_length
This commit changes tokenizer padding to match both the behavior
described in the docs and the behavior of the Python library.
Before this commit, passing either
    { padding: true, max_length: 512 }
or
    { padding: 'max_length', max_length: 512 }
always padded every output to 512 tokens.
After this change,
    { padding: true, max_length: 512 }
pads the outputs to the length of the longest encoding
or to max_length, whichever is shorter.
This commit also adds a test to prevent regressions.
* Revamp tokenizer padding/truncation test suite
* Fix tokenization padding/truncation logic
* nit
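
The padding rule described in the first item above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the code from this commit: `resolvePadLength` is a hypothetical helper name, it takes the token counts of the encodings rather than the encodings themselves, and the fallback used when `padding: 'max_length'` is given without a `max_length` is an assumption made for the sketch.

```javascript
// Hypothetical helper (not part of the library): returns the length that
// every encoding should be padded to, or null when padding is disabled.
function resolvePadLength(lengths, { padding = false, max_length = null } = {}) {
  // Length of the longest encoding among the inputs.
  const longest = Math.max(...lengths);

  if (padding === 'max_length') {
    // Always pad to max_length. Assumption for this sketch only: fall back
    // to the longest encoding when no max_length is given.
    return max_length ?? longest;
  }

  if (padding === true) {
    // Pad to the longest encoding or to max_length, whichever is shorter.
    return max_length === null ? longest : Math.min(longest, max_length);
  }

  // padding: false (or anything else) means no padding.
  return null;
}

// Usage with encodings of 7, 12 and 9 tokens:
console.log(resolvePadLength([7, 12, 9], { padding: true, max_length: 512 }));         // 12
console.log(resolvePadLength([7, 12, 9], { padding: 'max_length', max_length: 512 })); // 512
console.log(resolvePadLength([7, 600], { padding: true, max_length: 512 }));           // 512
```

The first call shows the new behavior (pad to the longest encoding), while the second shows that `padding: 'max_length'` still pads to the full 512 tokens.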
---------
Co-authored-by: Joshua Lochner <[email protected]>
1 parent 06ebe86 commit e40bef2
2 files changed: +280 -108 lines changed