-
Notifications
You must be signed in to change notification settings - Fork 717
Fix preference of tokenizer_config.json and remove doLowerCase from TokenizerConfig #3785
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix preference of tokenizer_config.json and remove doLowerCase from TokenizerConfig #3785
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3785 +/- ##
============================================
+ Coverage 60.17% 62.76% +2.58%
- Complexity 6172 6459 +287
============================================
Files 704 704
Lines 34631 34657 +26
Branches 3740 3752 +12
============================================
+ Hits 20839 21752 +913
+ Misses 12207 11250 -957
- Partials 1585 1655 +70 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@@ -61,13 +62,15 @@ public final class HuggingFaceTokenizer extends NativeResource<Long> implements | |||
private boolean cleanupTokenizationSpaces; | |||
private boolean stripAccents; | |||
private boolean addPrefixSpace; | |||
private final Map<String, String> options; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need keep options as a member variable, we can pass it down to applyConfig()
|
||
private HuggingFaceTokenizer( | ||
long handle, | ||
Map<String, String> options, | ||
TokenizerConfig config, | ||
PadTokenResolver.PadInfo padInfo) { | ||
super(handle); | ||
this.options = options != null ? new HashMap<>(options) : new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need create an empty HashMap, just check null in applyConfig()
438d0f1
to
719b633
Compare
719b633
to
8479de8
Compare
Fix preference of tokenizer_config.json and remove doLowerCase from TokenizerConfig
Description
This PR improves the HuggingFace tokenizer configuration handling by:
tokenizer_config.json
values, allowing runtime overrides of config file settingsdoLowerCase
parameter is now handled exclusively through options, simplifying the configuration modelmodelMaxLength
via options with proper fallback to config valuesChanges
applyConfig()
method: Only applies config values when not explicitly set in optionsmodel_max_length
Backward Compatibility
This change is backward compatible. Existing code will continue to work as before, but now has additional flexibility to override config file values at runtime.
Edge Cases
modelMaxLength
is set in both options and config, options take precedenceReference Discussion - #3730