@Soha-Agarwal Soha-Agarwal commented Aug 29, 2025

Fix preference of tokenizer_config.json and remove doLowerCase from TokenizerConfig

Description

This PR improves the HuggingFace tokenizer configuration handling by:

  1. Fixing configuration precedence: Options now take priority over tokenizer_config.json values, allowing runtime overrides of config file settings
  2. Removing doLowerCase from TokenizerConfig: The doLowerCase parameter is now handled exclusively through options, simplifying the configuration model
  3. Adding modelMaxLength support: Users can now set modelMaxLength via options with proper fallback to config values

Changes

  • Enhanced applyConfig() method: Only applies config values when not explicitly set in options
  • Improved parameter precedence: Options → TokenizerConfig → Defaults
  • Better modelMaxLength handling: Supports runtime override of config's model_max_length

Backward Compatibility

This change is backward compatible. Existing code will continue to work as before, but now has additional flexibility to override config file values at runtime.

Edge Cases

  • When modelMaxLength is set in both options and config, options take precedence
  • Config values are only applied if the corresponding option key is not present
  • Maintains existing default behavior when neither options nor config specify values
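The precedence rule and edge cases above can be illustrated with a minimal sketch. This is not the code from the PR: `PrecedenceSketch`, `resolveModelMaxLength`, and the default of 512 are hypothetical names and values chosen for illustration, and the `Integer` parameter stands in for whatever `model_max_length` value was parsed from tokenizer_config.json (null when absent).

```java
import java.util.Map;

/** Hypothetical helper illustrating Options -> TokenizerConfig -> Defaults. */
final class PrecedenceSketch {

    /** Assumed default for illustration only; the library's real default may differ. */
    static final int DEFAULT_MAX_LENGTH = 512;

    private PrecedenceSketch() {}

    /**
     * Resolves the effective max length.
     *
     * @param options runtime options, may be null
     * @param configValue value read from the config file, null if the file does not set it
     */
    static int resolveModelMaxLength(Map<String, String> options, Integer configValue) {
        // 1. An explicitly present option key always wins.
        if (options != null && options.containsKey("modelMaxLength")) {
            return Integer.parseInt(options.get("modelMaxLength"));
        }
        // 2. Otherwise fall back to the config file value, if there is one.
        if (configValue != null) {
            return configValue;
        }
        // 3. Neither source specified a value: keep the default.
        return DEFAULT_MAX_LENGTH;
    }
}
```

With this shape, passing `modelMaxLength=128` in options alongside a config value of 1024 yields 128, an empty options map yields 1024, and no options with no config value yields the assumed default.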

Reference Discussion - #3730

@Soha-Agarwal Soha-Agarwal requested review from zachgk and a team as code owners August 29, 2025 22:28
@codecov-commenter


Codecov Report

❌ Patch coverage is 13.33333% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.76%. Comparing base (a664d8b) to head (c42c173).
⚠️ Report is 27 commits behind head on master.

Files with missing lines | Patch % | Lines
...l/huggingface/tokenizers/HuggingFaceTokenizer.java | 13.33% | 9 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3785      +/-   ##
============================================
+ Coverage     60.17%   62.76%   +2.58%     
- Complexity     6172     6459     +287     
============================================
  Files           704      704              
  Lines         34631    34657      +26     
  Branches       3740     3752      +12     
============================================
+ Hits          20839    21752     +913     
+ Misses        12207    11250     -957     
- Partials       1585     1655      +70     


@@ -61,13 +62,15 @@ public final class HuggingFaceTokenizer extends NativeResource<Long> implements
    private boolean cleanupTokenizationSpaces;
    private boolean stripAccents;
    private boolean addPrefixSpace;
    private final Map<String, String> options;

We don't need to keep options as a member variable; we can pass it down to applyConfig().


    private HuggingFaceTokenizer(
            long handle,
            Map<String, String> options,
            TokenizerConfig config,
            PadTokenResolver.PadInfo padInfo) {
        super(handle);
        this.options = options != null ? new HashMap<>(options) : new HashMap<>();

We don't need to create an empty HashMap; just check for null in applyConfig().
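
One possible reading of the two review comments, sketched under assumptions: the options map is used only during construction, so it is passed to `applyConfig()` as a parameter and null-checked there instead of being copied into a field. `TokenizerSketch` and `ConfigSketch` are hypothetical stand-ins rather than the real `HuggingFaceTokenizer` and `TokenizerConfig`, and the option keys `stripAccents` and `addPrefixSpace` are assumed to match the field names.

```java
import java.util.Map;

/** Hypothetical stand-in for the parsed tokenizer_config.json; not the real TokenizerConfig. */
final class ConfigSketch {
    final boolean stripAccents;
    final boolean addPrefixSpace;

    ConfigSketch(boolean stripAccents, boolean addPrefixSpace) {
        this.stripAccents = stripAccents;
        this.addPrefixSpace = addPrefixSpace;
    }
}

/** Hypothetical stand-in for the tokenizer; only the option/config wiring is shown. */
final class TokenizerSketch {
    private boolean stripAccents;
    private boolean addPrefixSpace;

    TokenizerSketch(Map<String, String> options, ConfigSketch config) {
        // Explicit options are applied first; the map is never stored or copied.
        if (options != null) {
            if (options.containsKey("stripAccents")) {
                stripAccents = Boolean.parseBoolean(options.get("stripAccents"));
            }
            if (options.containsKey("addPrefixSpace")) {
                addPrefixSpace = Boolean.parseBoolean(options.get("addPrefixSpace"));
            }
        }
        applyConfig(config, options);
    }

    /** Fills in config values only for keys the caller did not set; tolerates null arguments. */
    private void applyConfig(ConfigSketch config, Map<String, String> options) {
        if (config == null) {
            return;
        }
        if (options == null || !options.containsKey("stripAccents")) {
            stripAccents = config.stripAccents;
        }
        if (options == null || !options.containsKey("addPrefixSpace")) {
            addPrefixSpace = config.addPrefixSpace;
        }
    }

    boolean isStripAccents() {
        return stripAccents;
    }

    boolean isAddPrefixSpace() {
        return addPrefixSpace;
    }
}
```

Because the null check lives inside `applyConfig()`, callers that pass no options skip the allocation entirely, and the class carries no extra state after construction.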

@Soha-Agarwal Soha-Agarwal force-pushed the hf-tokenizer-config-support branch 2 times, most recently from 438d0f1 to 719b633 Compare September 2, 2025 00:02
@Soha-Agarwal Soha-Agarwal marked this pull request as draft September 2, 2025 00:02
@Soha-Agarwal Soha-Agarwal force-pushed the hf-tokenizer-config-support branch from 719b633 to 8479de8 Compare September 2, 2025 00:05
@Soha-Agarwal Soha-Agarwal marked this pull request as ready for review September 2, 2025 00:09