feat: use decoder only model for mcp classification server #427
Conversation
Signed-off-by: Huamin Chen <[email protected]>
✅ Deploy Preview for vllm-semantic-router ready!
👥 vLLM Semantic Team Notification: The following members have been identified for the changed files in this PR and have been automatically assigned.
Pull Request Overview
This PR implements a new approach to classification using a Small Language Model (SLM) - specifically Qwen3 0.6B with instruction fine-tuning - to replace the existing BERT-based approach for multilingual support. The change introduces generative classification where the model generates category labels as text rather than using a classification head.
- Adds comprehensive GPU management utilities for multi-GPU environments
- Implements a new generative classification approach using Qwen3 with LoRA fine-tuning
- Updates existing classification code to support GPU selection functionality
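As a rough illustration of the generative-label flow described above, the sketch below prompts the fine-tuned model and reads the generated text back as the category. The prompt wording and decoding details are assumptions rather than the PR's exact code; the model ID is taken from the PR description.

```python
# Minimal sketch: classify by letting the fine-tuned causal LM generate the label text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "llm-semantic-router/qwen3_generative_classifier_r16"  # from the PR description

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

prompt = "Classify the question into one category.\nQ: What is the derivative of x^2?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=5)

# The generated continuation is the category label itself, e.g. "math".
label = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
).strip()
print(label)
```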
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
src/training/training_lora/common_lora_utils.py | Adds GPU management utilities and deprecates existing device info function |
src/training/training_lora/classifier_model_fine_tuning_lora/ft_qwen3_generative_lora.py | New implementation of generative classification using Qwen3 model |
src/training/training_lora/classifier_model_fine_tuning_lora/ft_linear_lora.py | Updates to support GPU selection functionality |
return "cpu", -1 | ||
|
||
torch.cuda.set_device(best_gpu_id) | ||
os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id) |
Copilot AI · Oct 14, 2025
Setting CUDA_VISIBLE_DEVICES after CUDA initialization may not take effect. This environment variable should be set before importing torch or calling any CUDA functions. Consider moving this to the beginning of the process or using torch.cuda.set_device() only.
os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id) |
REQUIRED_CATEGORIES = [
    "biology",
    "business",
    "chemistry",
    "computer science",
    "economics",
    "engineering",
    "health",
    "history",
    "law",
    "math",
    "other",
    "philosophy",
    "physics",
    "psychology",
]
Copilot AI · Oct 14, 2025
[nitpick] Consider moving REQUIRED_CATEGORIES to a configuration file or constant module to avoid duplication across files and make it easier to maintain category changes centrally.
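One way to centralize the labels, in the spirit of this nitpick, is a small shared constants module; the module name and path below are illustrative.

```python
# src/training/training_lora/category_constants.py (hypothetical module)
# Single source of truth for the labels shared by the generative and linear
# LoRA fine-tuning scripts.

REQUIRED_CATEGORIES: list[str] = [
    "biology", "business", "chemistry", "computer science", "economics",
    "engineering", "health", "history", "law", "math",
    "other", "philosophy", "physics", "psychology",
]

# Consumers would import rather than redefine the list:
# from category_constants import REQUIRED_CATEGORIES
```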
fp16=False,  # Disable fp16 to avoid gradient issues
gradient_checkpointing=False,  # Disable to avoid gradient issues
Copilot AI · Oct 14, 2025
The comments mention avoiding 'gradient issues' but don't specify what issues. Consider documenting the specific problems encountered (e.g., NaN gradients, instability) to help future developers understand the reasoning.
Suggested change:
- fp16=False, # Disable fp16 to avoid gradient issues
- gradient_checkpointing=False, # Disable to avoid gradient issues
+ fp16=False, # Disable fp16 to avoid gradient issues (NaN gradients and training instability observed with fp16 enabled)
+ gradient_checkpointing=False, # Disable to avoid gradient issues (enabling caused NaN gradients and unstable loss during training)
per_device_train_batch_size=1,  # Minimal batch size to fit in memory
per_device_eval_batch_size=1,
gradient_accumulation_steps=16,  # Effective batch size = 1 * 16 = 16
Copilot AI · Oct 14, 2025
[nitpick] Using batch size of 1 with gradient accumulation may be inefficient compared to using a larger batch size directly if memory allows. Consider making batch size configurable or dynamically determining optimal batch size based on available GPU memory.
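A rough sketch of the memory-based variant the comment hints at; the thresholds are assumptions for a ~0.6B-parameter model with LoRA, not measured values.

```python
# Illustrative heuristic: derive a per-device batch size from free GPU memory.
import torch

def suggest_batch_size(device_index: int = 0) -> int:
    if not torch.cuda.is_available():
        return 1
    free_bytes, _total = torch.cuda.mem_get_info(device_index)
    free_gib = free_bytes / (1024 ** 3)
    if free_gib > 40:
        return 8
    if free_gib > 16:
        return 4
    if free_gib > 8:
        return 2
    return 1
```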
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
torch.cuda.set_device(best_gpu_id)
os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)

return f"cuda:{best_gpu_id}", best_gpu_id
Copilot AI · Oct 14, 2025
Setting CUDA_VISIBLE_DEVICES after torch.cuda.set_device() may not have the expected effect since PyTorch has already initialized CUDA. Consider setting CUDA_VISIBLE_DEVICES before any CUDA operations or use torch.cuda.set_device() exclusively.
Suggested change:
- torch.cuda.set_device(best_gpu_id)
- os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)
- return f"cuda:{best_gpu_id}", best_gpu_id
+ # Set CUDA_VISIBLE_DEVICES before any CUDA operations
+ os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)
+ torch.cuda.set_device(0)
+ logger.info(f"Auto-selected GPU {best_gpu_id} (now visible as cuda:0): {torch.cuda.get_device_name(0)}")
+ return "cuda:0", best_gpu_id
if gpu_id is not None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    logger.info(f"Set CUDA_VISIBLE_DEVICES={gpu_id}")
Copilot AI · Oct 14, 2025
Setting CUDA_VISIBLE_DEVICES after importing torch may not work as expected. The environment variable should be set before any CUDA initialization. Consider moving this logic to the top of the script or using torch.cuda.set_device() instead.
if args.gpu_id is not None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_id)
    print(f"INFO: Set CUDA_VISIBLE_DEVICES={args.gpu_id}")
Copilot AI · Oct 14, 2025
Setting CUDA_VISIBLE_DEVICES after argument parsing may be too late if torch has already been imported and initialized CUDA. Consider setting this at the very beginning of the script or using the set_gpu_device() utility function instead.
per_device_train_batch_size=1,  # Minimal batch size to fit in memory
per_device_eval_batch_size=1,
gradient_accumulation_steps=16,  # Effective batch size = 1 * 16 = 16
Copilot AI · Oct 14, 2025
Using batch size of 1 with gradient accumulation of 16 may be inefficient for GPU utilization. Consider testing with larger batch sizes (e.g., 2 or 4) and correspondingly smaller gradient accumulation steps to better utilize GPU memory and compute.
Suggested change:
- per_device_train_batch_size=1, # Minimal batch size to fit in memory
- per_device_eval_batch_size=1,
- gradient_accumulation_steps=16, # Effective batch size = 1 * 16 = 16
+ per_device_train_batch_size=4, # Increased batch size for better GPU utilization
+ per_device_eval_batch_size=4,
+ gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
Copilot AI · Oct 14, 2025
The GPU device selection logic is inconsistent with the earlier gpu_id handling. Consider using the set_gpu_device() utility function consistently throughout the code instead of manually setting device strings.
Suggested change:
- # Move to GPU
- device = "cuda" if torch.cuda.is_available() else "cpu"
+ # Move to GPU using set_gpu_device utility
+ device = set_gpu_device(gpu_id)
@OneZero-Y PTAL, thanks
Signed-off-by: Huamin Chen <[email protected]>
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/training/training_lora/classifier_model_fine_tuning_lora/ft_qwen3_generative_lora.py:1
- Setting the CUDA device inside a loop can cause issues in multi-threaded environments or when using multiple processes. Consider setting the device once outside the loop or using device-specific memory queries without changing the global device state.
"""
per_device_train_batch_size=4,  # Increased batch size for better GPU utilization
per_device_eval_batch_size=4,
gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
Copilot AI · Oct 14, 2025
The comment states 'Increased batch size' but the batch size of 4 appears to be quite small for modern GPU training. Consider making these values configurable parameters rather than hardcoded values to allow for different GPU memory configurations.
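Exposing the values as flags instead of hardcoding them might look like the following sketch; the argument names are illustrative and not part of the script's existing CLI.

```python
# Hypothetical CLI flags feeding the TrainingArguments batch settings.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--per-device-batch-size", type=int, default=4,
                    help="Per-device train/eval batch size")
parser.add_argument("--effective-batch-size", type=int, default=16,
                    help="Target effective batch size after accumulation")
args = parser.parse_args()

# Guard against a zero accumulation count when batch size exceeds the target.
grad_accum_steps = max(1, args.effective_batch_size // args.per_device_batch_size)
```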
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
Copilot AI · Oct 14, 2025
Using float16 based only on CUDA availability may not be optimal for all GPU types. Some older GPUs don't support efficient float16 operations. Consider checking for specific GPU capabilities or making the dtype configurable.
Signed-off-by: Huamin Chen <[email protected]>
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
    use_fp16 = (
        compute_capability[0] >= 7
    )  # Volta and newer support efficient FP16
except:
Copilot AI · Oct 14, 2025
Using bare 'except:' is discouraged as it catches all exceptions including system exits. Use 'except Exception:' to catch only actual exceptions.
Suggested change:
- except:
+ except Exception:
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from common_lora_utils import (
Copilot AI · Oct 14, 2025
Using sys.path.append with relative path manipulation is fragile and error-prone. Consider using proper Python packaging with relative imports or setuptools entry points.
Suggested change:
- sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
- from common_lora_utils import (
+ from ..common_lora_utils import (
gradient_accumulation_steps=16
// batch_size,  # Maintain effective batch size of 16
Copilot AI · Oct 14, 2025
Integer division could result in 0 gradient accumulation steps if batch_size > 16, which would cause training to fail. Add a minimum value check: max(1, 16 // batch_size).
Suggested change:
- gradient_accumulation_steps=16
- // batch_size, # Maintain effective batch size of 16
+ gradient_accumulation_steps=max(1, 16 // batch_size), # Maintain effective batch size of 16
Signed-off-by: Huamin Chen <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
""" | ||
Get device information and capabilities. | ||
DEPRECATED: Use set_gpu_device() and get_all_gpu_info() for better multi-GPU support. |
Copilot AI · Oct 14, 2025
Consider adding a deprecation warning in the function implementation to alert users during runtime, not just in the docstring.
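A runtime warning alongside the docstring note could be added roughly as follows; the function name and body below are placeholders, not the module's actual implementation.

```python
# Sketch: surface the deprecation at call time, not only in the docstring.
import warnings

def get_device_info():
    warnings.warn(
        "get_device_info() is deprecated; use set_gpu_device() and "
        "get_all_gpu_info() for multi-GPU support.",
        DeprecationWarning,
        stacklevel=2,
    )
    # ... existing device-detection logic would follow here ...
```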
if len(predictions.shape) == 3:
    pred_tokens = np.argmax(predictions, axis=-1)
else:
    pred_tokens = predictions
Copilot AI · Oct 14, 2025
Variable 'predictions' is used without checking if it exists in the else clause. This could cause a NameError if the condition len(predictions.shape) == 3 is False and predictions was not defined earlier.
lr_scheduler_type="cosine",
fp16=False,  # Disable fp16 to avoid gradient issues
gradient_checkpointing=False,  # Disable to avoid gradient issues
dataloader_num_workers=0,
Copilot AI · Oct 14, 2025
Setting dataloader_num_workers=0 disables multiprocessing for data loading, which may reduce training performance. Consider allowing this to be configurable or using a small positive value like 2-4.
Q: {question}
A:"""

def classify(self, text: str, with_probabilities: bool = False) -> dict[str, Any]:
Copilot AI · Oct 14, 2025
The return type annotation uses dict[str, Any], which is less type-safe. Consider defining a TypedDict or dataclass for the return type to improve type safety and documentation.
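A TypedDict along the lines the comment describes might look like this; the field names are assumptions based on the classify()/with_probabilities API shown above, not the server's actual schema.

```python
from typing import TypedDict

class ClassificationResult(TypedDict, total=False):
    category: str                     # predicted label, e.g. "physics"
    confidence: float                 # probability of the predicted label
    probabilities: dict[str, float]   # per-category distribution, if requested

# classify() could then be annotated as:
# def classify(self, text: str, with_probabilities: bool = False) -> ClassificationResult:
```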
return result

def _calculate_entropy(self, probabilities: list[float]) -> float:
Copilot AI · Oct 14, 2025
The function parameter uses list[float] but could be more flexible by accepting any sequence type. Consider using Sequence[float] from the typing module for better type compatibility.
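With the broader parameter type the helper might read as follows; the Shannon-entropy body is an assumption about what _calculate_entropy computes.

```python
import math
from typing import Sequence

def _calculate_entropy(self, probabilities: Sequence[float]) -> float:
    # Accepts any sequence (list, tuple, NumPy array) of probabilities.
    return -sum(p * math.log(p) for p in probabilities if p > 0)
```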
Signed-off-by: Huamin Chen <[email protected]>
great!! a big step towards more possibilities
What type of PR is this?
Use an SLM (i.e., Qwen3 0.6B with instruct fine-tuning) for classification, replacing the BERT-based approach, to support multilingual inputs.
The fine tuned model is at https://huggingface.co/llm-semantic-router/qwen3_generative_classifier_r16
Demo
Start router
Start MCP Server
cd examples/mcp-classifier-server
python server_generative.py --http --port 8090 --device cpu --model-path llm-semantic-router/qwen3_generative_classifier_r16
Test
Log
Router
MCP Server
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #374
Release Notes: Yes/No