Conversation

@cjluo-nv cjluo-nv commented Sep 29, 2025

What does this PR do?

Bug fix

Overview: ?

Newer Qwen models such as Qwen3 and Qwen2.5-VL no longer require tokenizer modification.

Summary by CodeRabbit

  • Refactor

    • Removed Qwen-specific tokenizer overrides in evaluation and quantization examples; pad/eos tokens now follow the tokenizer’s defaults for more consistent behavior.
  • Chores

    • Simplified example utilities by removing special-case handling for Qwen tokenizers across LLM eval, MMLU, PTQ, and Windows accuracy benchmark scripts.
  • Notes

    • Users running Qwen-based models may see changed padding/end-of-sequence behavior and should verify outputs with their tokenizer’s default settings.
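One consequence worth verifying: a tokenizer that ships without a pad token will now surface that directly instead of being silently patched. The following is a minimal sketch of the usual defensive check, not part of this PR; `StubTokenizer` is a hypothetical stand-in for a Hugging Face tokenizer, and the token strings are illustrative only.

```python
# Hypothetical stand-in for a Hugging Face tokenizer; token strings
# are illustrative, not the actual Qwen special tokens.
class StubTokenizer:
    def __init__(self, pad_token=None, eos_token="<|im_end|>"):
        self.pad_token = pad_token
        self.eos_token = eos_token


def ensure_pad_token(tokenizer):
    # Common fallback pattern: pad with eos when no pad token is defined.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


tok = ensure_pad_token(StubTokenizer())
print(tok.pad_token)  # falls back to the eos token
```

If your Qwen tokenizer already defines a pad token, this check is a no-op; it only matters for checkpoints that previously relied on the removed override.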

@cjluo-nv cjluo-nv requested review from a team as code owners September 29, 2025 15:45

coderabbitai bot commented Sep 29, 2025

Walkthrough

Removed Qwen-specific pad/eos token adjustments across multiple tokenizer initialization paths. Affected scripts continue initializing models and tokenizers without inferring model type from checkpoint directories. No public APIs changed.

Changes

LLM Eval: model init cleanup
  Files: examples/llm_eval/gen_model_answer.py, examples/llm_eval/mmlu.py
  Deleted logic that inferred "qwen" from checkpoint_dir and set tokenizer.pad_token/eos_token. Retained existing model/tokenizer initialization paths.

Tokenizer utils: remove Qwen overrides
  Files: examples/llm_eval/quantization_utils.py, examples/windows/accuracy_benchmark/quantization_utils.py, examples/llm_ptq/example_utils.py
  Removed special-casing for QWen/Qwen tokenizers that set pad/eos via a hardcoded id. Padding/eos now rely on tokenizer defaults. Other initialization remains unchanged.
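For context, the removed special-casing looked roughly like the before/after sketch below. This is reconstructed from the change summary, not the actual diff; the function names, `checkpoint_dir` parameter, and the hardcoded token are all illustrative.

```python
class StubTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer."""

    def __init__(self):
        self.pad_token = None
        self.eos_token = "<|endoftext|>"


def get_tokenizer_legacy(tokenizer, checkpoint_dir):
    # Removed behavior: infer "qwen" from the checkpoint path and
    # force pad/eos (the real code used a hardcoded token id).
    if "qwen" in checkpoint_dir.lower():
        tokenizer.pad_token = "<|extra_0|>"  # illustrative override
        tokenizer.eos_token = "<|endoftext|>"
    return tokenizer


def get_tokenizer_current(tokenizer, checkpoint_dir):
    # Current behavior: trust the tokenizer's own defaults.
    return tokenizer
```

The current path simply drops the branch, so Qwen checkpoints go through the same initialization as every other model.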

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I twitch my ears at tokens’ tune,
No special carrots for Qwen’s spoon.
Defaults nibble, simple and neat,
One path forward, fewer feet.
Hop hop—configs trimmed with care,
Clean meadows, lighter air. 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 40.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve it.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title “Remove Qwen tokenizer modification” succinctly and accurately captures the primary change, namely the removal of special-case logic for Qwen tokenizers across multiple files, without extraneous details or vague terms.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 615f3c0 and c238ab7.

📒 Files selected for processing (5)
  • examples/llm_eval/gen_model_answer.py (0 hunks)
  • examples/llm_eval/mmlu.py (0 hunks)
  • examples/llm_eval/quantization_utils.py (0 hunks)
  • examples/llm_ptq/example_utils.py (0 hunks)
  • examples/windows/accuracy_benchmark/quantization_utils.py (0 hunks)
💤 Files with no reviewable changes (5)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality


codecov bot commented Sep 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.85%. Comparing base (615f3c0) to head (c238ab7).
⚠️ Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #390      +/-   ##
==========================================
- Coverage   73.86%   73.85%   -0.01%     
==========================================
  Files         171      171              
  Lines       17629    17629              
==========================================
- Hits        13021    13020       -1     
- Misses       4608     4609       +1     

☔ View full report in Codecov by Sentry.

@kevalmorabia97 kevalmorabia97 merged commit 340eb7a into main Oct 4, 2025
27 checks passed
@kevalmorabia97 kevalmorabia97 deleted the chenjiel/remove_qwen_token branch October 4, 2025 12:11
kevalmorabia97 pushed a commit that referenced this pull request Oct 4, 2025