feat: use decoder only model for mcp classification server #427
Conversation
Signed-off-by: Huamin Chen <[email protected]>
✅ Deploy Preview for vllm-semantic-router ready!
👥 vLLM Semantic Team Notification: The following members have been identified for the changed files in this PR and have been automatically assigned.
Pull Request Overview
This PR implements a new approach to classification using a Small Language Model (SLM) - specifically Qwen3 0.6B with instruction fine-tuning - to replace the existing BERT-based approach for multilingual support. The change introduces generative classification where the model generates category labels as text rather than using a classification head.
- Adds comprehensive GPU management utilities for multi-GPU environments
- Implements a new generative classification approach using Qwen3 with LoRA fine-tuning
- Updates existing classification code to support GPU selection functionality
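As a rough illustration of the generative-label flow described above, the sketch below prompts the fine-tuned model and reads the generated text back as the category. The prompt wording and decoding details are assumptions rather than the PR's exact code; the model ID is taken from the PR description.

```python
# Minimal sketch: classify by letting the fine-tuned causal LM generate the label text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "llm-semantic-router/qwen3_generative_classifier_r16"  # from the PR description

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

prompt = "Classify the question into one category.\nQ: What is the derivative of x^2?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=5)

# The generated continuation is the category label itself, e.g. "math".
label = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
).strip()
print(label)
```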
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
src/training/training_lora/common_lora_utils.py | Adds GPU management utilities and deprecates existing device info function |
src/training/training_lora/classifier_model_fine_tuning_lora/ft_qwen3_generative_lora.py | New implementation of generative classification using Qwen3 model |
src/training/training_lora/classifier_model_fine_tuning_lora/ft_linear_lora.py | Updates to support GPU selection functionality |
return "cpu", -1 | ||
|
||
torch.cuda.set_device(best_gpu_id) | ||
os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id) |
Copilot AI · Oct 14, 2025
Setting CUDA_VISIBLE_DEVICES after CUDA initialization may not take effect. This environment variable should be set before importing torch or calling any CUDA functions. Consider moving this to the beginning of the process or using torch.cuda.set_device() only.
os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id) |
REQUIRED_CATEGORIES = [
    "biology",
    "business",
    "chemistry",
    "computer science",
    "economics",
    "engineering",
    "health",
    "history",
    "law",
    "math",
    "other",
    "philosophy",
    "physics",
    "psychology",
]
Copilot AI · Oct 14, 2025
[nitpick] Consider moving REQUIRED_CATEGORIES to a configuration file or constant module to avoid duplication across files and make it easier to maintain category changes centrally.
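One way to centralize the labels, in the spirit of this nitpick, is a small shared constants module; the module name and path below are illustrative.

```python
# src/training/training_lora/category_constants.py (hypothetical module)
# Single source of truth for the labels shared by the generative and linear
# LoRA fine-tuning scripts.

REQUIRED_CATEGORIES: list[str] = [
    "biology", "business", "chemistry", "computer science", "economics",
    "engineering", "health", "history", "law", "math",
    "other", "philosophy", "physics", "psychology",
]

# Consumers would import rather than redefine the list:
# from category_constants import REQUIRED_CATEGORIES
```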
fp16=False,  # Disable fp16 to avoid gradient issues
gradient_checkpointing=False,  # Disable to avoid gradient issues
Copilot AI · Oct 14, 2025
The comments mention avoiding 'gradient issues' but don't specify what issues. Consider documenting the specific problems encountered (e.g., NaN gradients, instability) to help future developers understand the reasoning.
Suggested change:
- fp16=False, # Disable fp16 to avoid gradient issues
- gradient_checkpointing=False, # Disable to avoid gradient issues
+ fp16=False, # Disable fp16 to avoid gradient issues (NaN gradients and training instability observed with fp16 enabled)
+ gradient_checkpointing=False, # Disable to avoid gradient issues (enabling caused NaN gradients and unstable loss during training)
per_device_train_batch_size=1,  # Minimal batch size to fit in memory
per_device_eval_batch_size=1,
gradient_accumulation_steps=16,  # Effective batch size = 1 * 16 = 16
Copilot AI · Oct 14, 2025
[nitpick] Using batch size of 1 with gradient accumulation may be inefficient compared to using a larger batch size directly if memory allows. Consider making batch size configurable or dynamically determining optimal batch size based on available GPU memory.
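A rough sketch of the memory-based variant the comment hints at; the thresholds are assumptions for a ~0.6B-parameter model with LoRA, not measured values.

```python
# Illustrative heuristic: derive a per-device batch size from free GPU memory.
import torch

def suggest_batch_size(device_index: int = 0) -> int:
    if not torch.cuda.is_available():
        return 1
    free_bytes, _total = torch.cuda.mem_get_info(device_index)
    free_gib = free_bytes / (1024 ** 3)
    if free_gib > 40:
        return 8
    if free_gib > 16:
        return 4
    if free_gib > 8:
        return 2
    return 1
```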
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
torch.cuda.set_device(best_gpu_id)
os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)

return f"cuda:{best_gpu_id}", best_gpu_id
Copilot AI · Oct 14, 2025
Setting CUDA_VISIBLE_DEVICES after torch.cuda.set_device() may not have the expected effect since PyTorch has already initialized CUDA. Consider setting CUDA_VISIBLE_DEVICES before any CUDA operations or use torch.cuda.set_device() exclusively.
Suggested change:
- torch.cuda.set_device(best_gpu_id)
- os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)
- return f"cuda:{best_gpu_id}", best_gpu_id
+ # Set CUDA_VISIBLE_DEVICES before any CUDA operations
+ os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)
+ torch.cuda.set_device(0)
+ logger.info(f"Auto-selected GPU {best_gpu_id} (now visible as cuda:0): {torch.cuda.get_device_name(0)}")
+ return "cuda:0", best_gpu_id
if gpu_id is not None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    logger.info(f"Set CUDA_VISIBLE_DEVICES={gpu_id}")
Copilot AI · Oct 14, 2025
Setting CUDA_VISIBLE_DEVICES after importing torch may not work as expected. The environment variable should be set before any CUDA initialization. Consider moving this logic to the top of the script or using torch.cuda.set_device() instead.
if args.gpu_id is not None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_id)
    print(f"INFO: Set CUDA_VISIBLE_DEVICES={args.gpu_id}")
Copilot AI · Oct 14, 2025
Setting CUDA_VISIBLE_DEVICES after argument parsing may be too late if torch has already been imported and initialized CUDA. Consider setting this at the very beginning of the script or using the set_gpu_device() utility function instead.
per_device_train_batch_size=1,  # Minimal batch size to fit in memory
per_device_eval_batch_size=1,
gradient_accumulation_steps=16,  # Effective batch size = 1 * 16 = 16
Copilot AI · Oct 14, 2025
Using batch size of 1 with gradient accumulation of 16 may be inefficient for GPU utilization. Consider testing with larger batch sizes (e.g., 2 or 4) and correspondingly smaller gradient accumulation steps to better utilize GPU memory and compute.
Suggested change:
- per_device_train_batch_size=1, # Minimal batch size to fit in memory
- per_device_eval_batch_size=1,
- gradient_accumulation_steps=16, # Effective batch size = 1 * 16 = 16
+ per_device_train_batch_size=4, # Increased batch size for better GPU utilization
+ per_device_eval_batch_size=4,
+ gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
Copilot AI · Oct 14, 2025
The GPU device selection logic is inconsistent with the earlier gpu_id handling. Consider using the set_gpu_device() utility function consistently throughout the code instead of manually setting device strings.
Suggested change:
- # Move to GPU
- device = "cuda" if torch.cuda.is_available() else "cpu"
+ # Move to GPU using set_gpu_device utility
+ device = set_gpu_device(gpu_id)
@OneZero-Y PTAL, thanks
Signed-off-by: Huamin Chen <[email protected]>
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/training/training_lora/classifier_model_fine_tuning_lora/ft_qwen3_generative_lora.py:1
- Setting the CUDA device inside a loop can cause issues in multi-threaded environments or when using multiple processes. Consider setting the device once outside the loop or using device-specific memory queries without changing the global device state.
"""
per_device_train_batch_size=4,  # Increased batch size for better GPU utilization
per_device_eval_batch_size=4,
gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
Copilot AI · Oct 14, 2025
The comment states 'Increased batch size' but the batch size of 4 appears to be quite small for modern GPU training. Consider making these values configurable parameters rather than hardcoded values to allow for different GPU memory configurations.
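Exposing the values as flags instead of hardcoding them might look like the following sketch; the argument names are illustrative and not part of the script's existing CLI.

```python
# Hypothetical CLI flags feeding the TrainingArguments batch settings.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--per-device-batch-size", type=int, default=4,
                    help="Per-device train/eval batch size")
parser.add_argument("--effective-batch-size", type=int, default=16,
                    help="Target effective batch size after accumulation")
args = parser.parse_args()

# Guard against a zero accumulation count when batch size exceeds the target.
grad_accum_steps = max(1, args.effective_batch_size // args.per_device_batch_size)
```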
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
Copilot AI · Oct 14, 2025
Using float16 based only on CUDA availability may not be optimal for all GPU types. Some older GPUs don't support efficient float16 operations. Consider checking for specific GPU capabilities or making the dtype configurable.
Signed-off-by: Huamin Chen <[email protected]>
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
    use_fp16 = (
        compute_capability[0] >= 7
    )  # Volta and newer support efficient FP16
except:
Copilot AI · Oct 14, 2025
Using bare 'except:' is discouraged as it catches all exceptions including system exits. Use 'except Exception:' to catch only actual exceptions.
Suggested change:
- except:
+ except Exception:
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from common_lora_utils import (
Copilot AI · Oct 14, 2025
Using sys.path.append with relative path manipulation is fragile and error-prone. Consider using proper Python packaging with relative imports or setuptools entry points.
Suggested change:
- sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
- from common_lora_utils import (
+ from ..common_lora_utils import (
gradient_accumulation_steps=16
// batch_size,  # Maintain effective batch size of 16
Copilot AI · Oct 14, 2025
Integer division could result in 0 gradient accumulation steps if batch_size > 16, which would cause training to fail. Add a minimum value check: max(1, 16 // batch_size).
Suggested change:
- gradient_accumulation_steps=16
- // batch_size, # Maintain effective batch size of 16
+ gradient_accumulation_steps=max(1, 16 // batch_size), # Maintain effective batch size of 16
Signed-off-by: Huamin Chen <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
""" | ||
Get device information and capabilities. | ||
DEPRECATED: Use set_gpu_device() and get_all_gpu_info() for better multi-GPU support. |
Copilot AI · Oct 14, 2025
Consider adding a deprecation warning in the function implementation to alert users during runtime, not just in the docstring.
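A runtime warning alongside the docstring note could be added roughly as follows; the function name and body below are placeholders, not the module's actual implementation.

```python
# Sketch: surface the deprecation at call time, not only in the docstring.
import warnings

def get_device_info():
    warnings.warn(
        "get_device_info() is deprecated; use set_gpu_device() and "
        "get_all_gpu_info() for multi-GPU support.",
        DeprecationWarning,
        stacklevel=2,
    )
    # ... existing device-detection logic would follow here ...
```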
if len(predictions.shape) == 3:
    pred_tokens = np.argmax(predictions, axis=-1)
else:
    pred_tokens = predictions
Copilot AI · Oct 14, 2025
Variable 'predictions' is used without checking if it exists in the else clause. This could cause a NameError if the condition len(predictions.shape) == 3 is False and predictions was not defined earlier.
lr_scheduler_type="cosine",
fp16=False,  # Disable fp16 to avoid gradient issues
gradient_checkpointing=False,  # Disable to avoid gradient issues
dataloader_num_workers=0,
Copilot AI · Oct 14, 2025
Setting dataloader_num_workers=0 disables multiprocessing for data loading, which may reduce training performance. Consider allowing this to be configurable or using a small positive value like 2-4.
Q: {question}
A:"""

def classify(self, text: str, with_probabilities: bool = False) -> dict[str, Any]:
Copilot AI · Oct 14, 2025
The return type annotation uses dict[str, Any], which is less type-safe. Consider defining a TypedDict or dataclass for the return type to improve type safety and documentation.
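A TypedDict along the lines the comment describes might look like this; the field names are assumptions based on the classify()/with_probabilities API shown above, not the server's actual schema.

```python
from typing import TypedDict

class ClassificationResult(TypedDict, total=False):
    category: str                     # predicted label, e.g. "physics"
    confidence: float                 # probability of the predicted label
    probabilities: dict[str, float]   # per-category distribution, if requested

# classify() could then be annotated as:
# def classify(self, text: str, with_probabilities: bool = False) -> ClassificationResult:
```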
return result

def _calculate_entropy(self, probabilities: list[float]) -> float:
Copilot AI · Oct 14, 2025
The function parameter uses list[float] but could be more flexible by accepting any sequence type. Consider using Sequence[float] from the typing module for better type compatibility.
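With the broader parameter type the helper might read as follows; the Shannon-entropy body is an assumption about what _calculate_entropy computes.

```python
import math
from typing import Sequence

def _calculate_entropy(self, probabilities: Sequence[float]) -> float:
    # Accepts any sequence (list, tuple, NumPy array) of probabilities.
    return -sum(p * math.log(p) for p in probabilities if p > 0)
```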
Signed-off-by: Huamin Chen <[email protected]>
great!! a big step towards more possibilities
What type of PR is this?
Use an SLM (i.e., Qwen3 0.6B with instruct fine-tuning) for classification, replacing the BERT-based approach, to support multilingual inputs.
The fine tuned model is at https://huggingface.co/llm-semantic-router/qwen3_generative_classifier_r16
Demo
Start router
Start MCP Server
cd examples/mcp-classifier-server
python server_generative.py --http --port 8090 --device cpu --model-path llm-semantic-router/qwen3_generative_classifier_r16
Test
Log
Router
MCP Server
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #374
Release Notes: Yes/No