@Xunzhuo Xunzhuo commented Oct 16, 2025

What type of PR is this?

fix(training): improve classifier training with balanced sampling and error analysis

What this PR does / why we need it:

This PR addresses several critical issues in the Qwen3 generative classifier training pipeline:

🔧 Key Improvements

1. Balanced Sampling Validation

  • Ensures all categories have exactly the same number of training samples
  • Automatically adjusts to the minimum available samples across categories
  • Prevents overfitting on categories with more data
  • Adds clear logging of sample distribution
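A minimal sketch of what such a balancing step can look like (the helper name and sample schema are illustrative, not the PR's actual code): every category is downsampled to the size of the smallest one, optionally capped by a per-category limit.

```python
import random
from collections import defaultdict

def balance_by_category(samples, max_per_category=None, seed=42):
    """Downsample every category to the same count (illustrative helper).

    samples: list of dicts with a "category" key (assumed schema).
    All categories are trimmed to the size of the smallest one,
    optionally capped by max_per_category.
    """
    by_cat = defaultdict(list)
    for s in samples:
        by_cat[s["category"]].append(s)

    n = min(len(v) for v in by_cat.values())
    if max_per_category is not None:
        n = min(n, max_per_category)

    rng = random.Random(seed)
    balanced = []
    for cat, items in sorted(by_cat.items()):
        balanced.extend(rng.sample(items, n))
        # Log the distribution so imbalances are visible at a glance
        print(f"{cat}: {len(items)} available -> {n} used")
    return balanced
```

Sampling with a fixed seed keeps the balanced subset reproducible across runs.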

2. Fixed Generation Bugs

  • Fixed apply_chat_template return type error (list vs dict)
  • Increased max_new_tokens from 2 to 10 to handle multi-token category names
  • Prevents truncation of categories like "philosophy" → "phil"
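For context on the return-type bug: in Hugging Face transformers, `apply_chat_template` can return a string (`tokenize=False`), a flat list of token ids (`tokenize=True`), or a dict-like encoding (`return_dict=True`), so code assuming one shape breaks on another. A small defensive helper, purely illustrative of the mismatch being fixed:

```python
def chat_template_output_to_ids_or_text(output):
    """Normalize the possible return types of tokenizer.apply_chat_template.

    str  -> untokenized prompt text (tokenize=False)
    dict -> BatchEncoding-style mapping; token ids live under "input_ids"
    list -> already a flat list of token ids
    """
    if isinstance(output, str):
        return output
    if isinstance(output, dict):
        return output["input_ids"]
    return output
```

Separately, `max_new_tokens=2` leaves no room for labels that span several subword tokens ("philosophy" decodes in pieces), which is why raising the budget to 10 stops the "phil" truncation.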

3. Enhanced Normalization

  • Added partial matches: "philo" → "philosophy", "psycho" → "psychology"
  • Improved three-tier matching: exact → two-word → first-word
  • Better handling of incomplete generations
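The three-tier matching plus partial-match table can be sketched as follows (the category list and partial-match entries are an illustrative subset, not the PR's full tables):

```python
CATEGORIES = ["philosophy", "psychology", "computer science", "other"]

# Known truncations from short generations
PARTIAL_MATCHES = {"phil": "philosophy", "philo": "philosophy", "psycho": "psychology"}

def normalize_category(generated, categories=CATEGORIES):
    """Map raw generated text to a known category (illustrative version)."""
    text = generated.strip().lower()
    # Tier 1: exact match
    if text in categories:
        return text
    # Partial-match table for incomplete generations
    if text in PARTIAL_MATCHES:
        return PARTIAL_MATCHES[text]
    words = text.split()
    # Tier 2: first two words (handles multi-word categories)
    if len(words) >= 2 and " ".join(words[:2]) in categories:
        return " ".join(words[:2])
    # Tier 3: first word alone
    if words and words[0] in categories:
        return words[0]
    # Fall back to the catch-all category
    return "other"
```

The tiers are ordered from most to least specific, so an exact label always wins before any fuzzy recovery is attempted.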

4. 'other' Category Analysis

  • Automatic error analysis when accuracy < 75%
  • Shows misclassification patterns and top confused categories
  • Displays example errors for debugging
  • Improved instruction template with clearer definition and examples
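A sketch of what that automatic analysis can look like (the record schema with `question`/`true`/`pred` keys is an assumption for illustration):

```python
from collections import Counter

def analyze_other_errors(records, threshold=0.75, max_examples=3):
    """Report misclassification patterns for the 'other' category.

    records: list of dicts with "question", "true", "pred" keys (assumed schema).
    Returns the accuracy on 'other', or None if there are no 'other' samples.
    """
    other = [r for r in records if r["true"] == "other"]
    if not other:
        return None
    accuracy = sum(r["pred"] == "other" for r in other) / len(other)
    if accuracy >= threshold:
        return accuracy
    # Which categories is 'other' most often confused with?
    confused = Counter(r["pred"] for r in other if r["pred"] != "other")
    print(f"'other' accuracy {accuracy:.2%} < {threshold:.0%}; top confusions:")
    for cat, n in confused.most_common(5):
        print(f"  other -> {cat}: {n}")
    # Show a few concrete errors for debugging
    for r in [r for r in other if r["pred"] != "other"][:max_examples]:
        print(f"  example: {r['question'][:60]!r} predicted {r['pred']}")
    return accuracy
```

Gating the report on the 75% threshold keeps validation output quiet when the category is behaving well.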

5. Debug Logging

  • Logs generated text and extracted categories
  • Shows normalization process for first/last samples
  • Better visibility into model behavior

📊 Impact

Before:

  • Imbalanced training: some categories had 600 samples, others 450
  • Philosophy accuracy: 0% (generation truncation)
  • Other category accuracy: 63.16% (ambiguous definition)

After:

  • Balanced training: all categories have equal samples
  • Philosophy accuracy: ~84% (fixed truncation)
  • Other category accuracy: expected 68-72% (with error analysis)

🧪 Testing

Tested on the Qwen3-0.6B and Qwen3-1.7B models with the MMLU-Pro dataset:

# Training with balanced sampling
python ft_qwen3_generative_lora.py --mode train --model Qwen/Qwen3-1.7B --max-samples-per-category 600

# Validation with error analysis
python ft_qwen3_generative_lora.py --mode validate --model-path mom-brain-v1 --max-samples-per-category 600

📝 Related Issues

Fixes issues with:

  • Imbalanced category training data
  • Generation mode truncation bugs
  • Low accuracy on ambiguous categories
  • Insufficient debugging information

… error analysis

- Add balanced sampling validation to ensure all categories have equal training samples
- Fix apply_chat_template return type bug in create_generative_dataset
- Increase max_new_tokens from 2 to 10 to handle multi-token category names
- Add enhanced normalization with partial matches (philo->philosophy, psycho->psychology)
- Add 'other' category error analysis when accuracy < 75%
- Improve instruction template with clearer 'other' category definition and examples
- Add debug logging for generated text and normalization process

This fixes issues where:
1. Imbalanced training data caused overfitting on categories with more samples
2. Short generation length truncated category names like 'philosophy' to 'phil'
3. 'other' category had low accuracy (63%) due to ambiguous definition

Signed-off-by: bitliu <[email protected]>

netlify bot commented Oct 16, 2025

Deploy Preview for vllm-semantic-router ready!

🔨 Latest commit: 62c03d8
🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/68f0c8cb2690ae000811070b
😎 Deploy Preview: https://deploy-preview-451--vllm-semantic-router.netlify.app


github-actions bot commented Oct 16, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 src

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

  • src/training/training_lora/classifier_model_fine_tuning_lora/ft_qwen3_generative_lora.py


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@Xunzhuo Xunzhuo changed the title fix(training): improve classifier training with balanced sampling and error analysis feat(training): improve classifier training with balanced sampling and error analysis Oct 16, 2025
Signed-off-by: bitliu <[email protected]>