@Xunzhuo Xunzhuo commented Oct 16, 2025

What type of PR is this?

fix(training): improve classifier training with balanced sampling and error analysis

What this PR does / why we need it:

This PR addresses several critical issues in the Qwen3 generative classifier training pipeline:

🔧 Key Improvements

1. Balanced Sampling Validation

  • Ensures all categories have exactly the same number of training samples
  • Automatically adjusts to the minimum available samples across categories
  • Prevents overfitting on categories with more data
  • Adds clear logging of sample distribution
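A minimal sketch of what such a balancing step can look like (the helper name and sample schema are illustrative, not the PR's actual code): every category is downsampled to the size of the smallest one, optionally capped by a per-category limit.

```python
import random
from collections import defaultdict

def balance_by_category(samples, max_per_category=None, seed=42):
    """Downsample every category to the same count (illustrative helper).

    samples: list of dicts with a "category" key (assumed schema).
    All categories are trimmed to the size of the smallest one,
    optionally capped by max_per_category.
    """
    by_cat = defaultdict(list)
    for s in samples:
        by_cat[s["category"]].append(s)

    n = min(len(v) for v in by_cat.values())
    if max_per_category is not None:
        n = min(n, max_per_category)

    rng = random.Random(seed)
    balanced = []
    for cat, items in sorted(by_cat.items()):
        balanced.extend(rng.sample(items, n))
        # Log the distribution so imbalances are visible at a glance
        print(f"{cat}: {len(items)} available -> {n} used")
    return balanced
```

Sampling with a fixed seed keeps the balanced subset reproducible across runs.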

2. Fixed Generation Bugs

  • Fixed apply_chat_template return type error (list vs dict)
  • Increased max_new_tokens from 2 to 10 to handle multi-token category names
  • Prevents truncation of categories like "philosophy" → "phil"
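For context on the return-type bug: in Hugging Face transformers, `apply_chat_template` can return a string (`tokenize=False`), a flat list of token ids (`tokenize=True`), or a dict-like encoding (`return_dict=True`), so code assuming one shape breaks on another. A small defensive helper, purely illustrative of the mismatch being fixed:

```python
def chat_template_output_to_ids_or_text(output):
    """Normalize the possible return types of tokenizer.apply_chat_template.

    str  -> untokenized prompt text (tokenize=False)
    dict -> BatchEncoding-style mapping; token ids live under "input_ids"
    list -> already a flat list of token ids
    """
    if isinstance(output, str):
        return output
    if isinstance(output, dict):
        return output["input_ids"]
    return output
```

Separately, `max_new_tokens=2` leaves no room for labels that span several subword tokens ("philosophy" decodes in pieces), which is why raising the budget to 10 stops the "phil" truncation.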

3. Enhanced Normalization

  • Added partial matches: "philo" → "philosophy", "psycho" → "psychology"
  • Improved three-tier matching: exact → two-word → first-word
  • Better handling of incomplete generations
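The three-tier matching plus partial-match table can be sketched as follows (the category list and partial-match entries are an illustrative subset, not the PR's full tables):

```python
CATEGORIES = ["philosophy", "psychology", "computer science", "other"]

# Known truncations from short generations
PARTIAL_MATCHES = {"phil": "philosophy", "philo": "philosophy", "psycho": "psychology"}

def normalize_category(generated, categories=CATEGORIES):
    """Map raw generated text to a known category (illustrative version)."""
    text = generated.strip().lower()
    # Tier 1: exact match
    if text in categories:
        return text
    # Partial-match table for incomplete generations
    if text in PARTIAL_MATCHES:
        return PARTIAL_MATCHES[text]
    words = text.split()
    # Tier 2: first two words (handles multi-word categories)
    if len(words) >= 2 and " ".join(words[:2]) in categories:
        return " ".join(words[:2])
    # Tier 3: first word alone
    if words and words[0] in categories:
        return words[0]
    # Fall back to the catch-all category
    return "other"
```

The tiers are ordered from most to least specific, so an exact label always wins before any fuzzy recovery is attempted.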

4. 'other' Category Analysis

  • Automatic error analysis when accuracy < 75%
  • Shows misclassification patterns and top confused categories
  • Displays example errors for debugging
  • Improved instruction template with clearer definition and examples
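A sketch of what that automatic analysis can look like (the record schema with `question`/`true`/`pred` keys is an assumption for illustration):

```python
from collections import Counter

def analyze_other_errors(records, threshold=0.75, max_examples=3):
    """Report misclassification patterns for the 'other' category.

    records: list of dicts with "question", "true", "pred" keys (assumed schema).
    Returns the accuracy on 'other', or None if there are no 'other' samples.
    """
    other = [r for r in records if r["true"] == "other"]
    if not other:
        return None
    accuracy = sum(r["pred"] == "other" for r in other) / len(other)
    if accuracy >= threshold:
        return accuracy
    # Which categories is 'other' most often confused with?
    confused = Counter(r["pred"] for r in other if r["pred"] != "other")
    print(f"'other' accuracy {accuracy:.2%} < {threshold:.0%}; top confusions:")
    for cat, n in confused.most_common(5):
        print(f"  other -> {cat}: {n}")
    # Show a few concrete errors for debugging
    for r in [r for r in other if r["pred"] != "other"][:max_examples]:
        print(f"  example: {r['question'][:60]!r} predicted {r['pred']}")
    return accuracy
```

Gating the report on the 75% threshold keeps validation output quiet when the category is behaving well.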

5. Debug Logging

  • Logs generated text and extracted categories
  • Shows normalization process for first/last samples
  • Better visibility into model behavior

📊 Impact

Before:

  • Imbalanced training: some categories had 600 samples, others 450
  • Philosophy accuracy: 0% (generation truncation)
  • Other category accuracy: 63.16% (ambiguous definition)

After:

  • Balanced training: all categories have equal samples
  • Philosophy accuracy: ~84% (fixed truncation)
  • Other category accuracy: expected 68-72% (with error analysis)

🧪 Testing

Tested on the Qwen3-0.6B and Qwen3-1.7B models with the MMLU-Pro dataset:

# Training with balanced sampling
python ft_qwen3_generative_lora.py --mode train --model Qwen/Qwen3-1.7B --max-samples-per-category 600

# Validation with error analysis
python ft_qwen3_generative_lora.py --mode validate --model-path mom-brain-v1 --max-samples-per-category 600

📝 Related Issues

Fixes issues with:

  • Imbalanced category training data
  • Generation mode truncation bugs
  • Low accuracy on ambiguous categories
  • Insufficient debugging information

… error analysis

- Add balanced sampling validation to ensure all categories have equal training samples
- Fix apply_chat_template return type bug in create_generative_dataset
- Increase max_new_tokens from 2 to 10 to handle multi-token category names
- Add enhanced normalization with partial matches (philo->philosophy, psycho->psychology)
- Add 'other' category error analysis when accuracy < 75%
- Improve instruction template with clearer 'other' category definition and examples
- Add debug logging for generated text and normalization process

This fixes issues where:
1. Imbalanced training data caused overfitting on categories with more samples
2. Short generation length truncated category names like 'philosophy' to 'phil'
3. 'other' category had low accuracy (63%) due to ambiguous definition

Signed-off-by: bitliu <[email protected]>

netlify bot commented Oct 16, 2025

Deploy Preview for vllm-semantic-router ready!

🔨 Latest commit: 62c03d8
🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/68f0c8cb2690ae000811070b
😎 Deploy Preview: https://deploy-preview-451--vllm-semantic-router.netlify.app


github-actions bot commented Oct 16, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 src

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

  • src/training/training_lora/classifier_model_fine_tuning_lora/ft_qwen3_generative_lora.py


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@Xunzhuo Xunzhuo changed the title fix(training): improve classifier training with balanced sampling and error analysis feat(training): improve classifier training with balanced sampling and error analysis Oct 16, 2025
Signed-off-by: bitliu <[email protected]>