
Conversation

@rootfs (Collaborator) commented Oct 14, 2025

What type of PR is this?

Use an SLM (Qwen3 0.6B with instruct fine-tuning) for classification, replacing the BERT-based approach, to enable multilingual support.

The fine-tuned model is published at https://huggingface.co/llm-semantic-router/qwen3_generative_classifier_r16
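The idea in a nutshell: instead of a BERT classification head, the decoder-only model is prompted with the question and the category is read off the generated label tokens; a softmax over the candidate label tokens gives the probability distribution used for entropy analysis. Below is a minimal sketch of that pattern, not the server's exact implementation (which lives in examples/mcp-classifier-server/server_generative.py); the prompt template and single-token scoring are simplifying assumptions.

```python
# Sketch only: score the next-token logits against each category's first label
# token and softmax over them. Prompt format and single-token scoring are
# assumptions, not the exact server_generative.py logic.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3-0.6B"
ADAPTER = "llm-semantic-router/qwen3_generative_classifier_r16"
CATEGORIES = [
    "biology", "business", "chemistry", "computer science", "economics",
    "engineering", "health", "history", "law", "math", "other",
    "philosophy", "physics", "psychology",
]

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, dtype=torch.float32), ADAPTER
).eval()

def classify(question: str) -> tuple[str, float]:
    inputs = tokenizer(f"Q: {question}\nA:", return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]  # next-token logits
    # First token id of each label, as it would be generated with a leading space
    label_ids = [tokenizer.encode(" " + c, add_special_tokens=False)[0] for c in CATEGORIES]
    probs = torch.softmax(next_logits[label_ids], dim=-1)
    return CATEGORIES[int(probs.argmax())], float(probs.max())

print(classify("What is the derivative of f(x) = x^3 + 2x^2 - 5x + 7?"))
```

The Q:/A: prompt shape matches the template visible in the review diff further down; because all 14 labels start with distinct first tokens, scoring just the first label token is enough to separate them.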

Demo

Start router

CONFIG_FILE=./config/config-mcp-classifier-example.yaml make run-router

Start MCP Server

cd examples/mcp-classifier-server
python server_generative.py --http --port 8090 --device cpu --model-path llm-semantic-router/qwen3_generative_classifier_r16

Test

make test-auto-prompt-reasoning
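For a manual check without the make target, the snippet below sends the same request body that appears in the router log; the gateway URL is a placeholder assumption for whatever Envoy listener config-mcp-classifier-example.yaml exposes, so adjust it to your setup.

```python
# Hypothetical manual test; mirrors the request body shown in the router log.
# GATEWAY is an assumed placeholder for the Envoy listener address.
import json
import urllib.request

GATEWAY = "http://127.0.0.1:8801"  # assumption: replace with your Envoy listener

payload = {
    "model": "auto",  # "auto" triggers the router's model selection
    "messages": [
        {"role": "system", "content": "You are a professional math teacher. "
         "Explain math concepts clearly and show step-by-step solutions to problems."},
        {"role": "user", "content": "What is the derivative of f(x) = x^3 + 2x^2 - 5x + 7?"},
    ],
}
req = urllib.request.Request(
    f"{GATEWAY}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```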

Log

Router

{"level":"info","ts":"2025-10-14T23:22:32.59534356Z","caller":"observability/logging.go:141","msg":"Starting metrics server on :9190"}
{"level":"info","ts":"2025-10-14T23:22:32.59568014Z","caller":"candle-binding/semantic-router.go:276","msg":"Initializing BERT similarity model: sentence-transformers/all-MiniLM-L6-v2"}
{"level":"info","ts":"2025-10-14T23:22:32.720935164Z","caller":"observability/logging.go:141","msg":"Category descriptions: []"}
{"level":"info","ts":"2025-10-14T23:22:32.721027709Z","caller":"observability/logging.go:141","msg":"Semantic cache is disabled"}
{"level":"info","ts":"2025-10-14T23:22:32.7210412Z","caller":"observability/logging.go:141","msg":"Tools database is disabled"}
{"level":"info","ts":"2025-10-14T23:22:32.723748191Z","caller":"observability/logging.go:141","msg":"Auto-discovered classification tool: classify_text - Classify text into categories using a fine-tuned generative model and provide intelligent routing recommendations. Categories: biology, business, chemistry, computer science, economics, engineering, health, history, law, math, other, philosophy, physics, psychology. Returns: class index, confidence, recommended model, and reasoning flag. Optionally returns full probability distribution (from softmax) for entropy analysis."}
{"level":"info","ts":"2025-10-14T23:22:32.723782703Z","caller":"observability/logging.go:141","msg":"Successfully initialized MCP category classifier with tool 'classify_text'"}
{"level":"info","ts":"2025-10-14T23:22:32.723792674Z","caller":"observability/logging.go:141","msg":"Loading category mapping from MCP server..."}
{"level":"info","ts":"2025-10-14T23:22:32.724677807Z","caller":"observability/logging.go:141","msg":"Loaded 14 categories with 14 system prompts from MCP server: [biology business chemistry computer science economics engineering health history law math other philosophy physics psychology]"}
{"level":"info","ts":"2025-10-14T23:22:32.724716539Z","caller":"observability/logging.go:141","msg":"Successfully loaded 14 categories from MCP server"}
{"level":"info","ts":"2025-10-14T23:22:32.72472763Z","caller":"observability/logging.go:141","msg":"Successfully initialized MCP category classifier"}
{"level":"info","ts":"2025-10-14T23:22:32.726932621Z","caller":"observability/logging.go:141","msg":"Initializing LoRA models: Intent=models/lora_intent_classifier_bert-base-uncased_model, PII=models/lora_pii_detector_bert-base-uncased_model, Security=models/lora_jailbreak_classifier_bert-base-uncased_model, Architecture=bert"}
Detected BERT token classifier - using BERT naming
{"level":"info","ts":"2025-10-14T23:22:33.539572004Z","caller":"observability/logging.go:141","msg":"LoRA C bindings initialized successfully"}
{"level":"info","ts":"2025-10-14T23:22:33.539650588Z","caller":"observability/logging.go:141","msg":"Category mapping will be loaded from MCP server"}
{"level":"info","ts":"2025-10-14T23:22:33.541668909Z","caller":"observability/logging.go:141","msg":"Auto-discovered classification tool: classify_text - Classify text into categories using a fine-tuned generative model and provide intelligent routing recommendations. Categories: biology, business, chemistry, computer science, economics, engineering, health, history, law, math, other, philosophy, physics, psychology. Returns: class index, confidence, recommended model, and reasoning flag. Optionally returns full probability distribution (from softmax) for entropy analysis."}
{"level":"info","ts":"2025-10-14T23:22:33.54168794Z","caller":"observability/logging.go:141","msg":"Successfully initialized MCP category classifier with tool 'classify_text'"}
{"level":"info","ts":"2025-10-14T23:22:33.54169483Z","caller":"observability/logging.go:141","msg":"Loading category mapping from MCP server..."}
{"level":"info","ts":"2025-10-14T23:22:33.542498778Z","caller":"observability/logging.go:141","msg":"Loaded 14 categories with 14 system prompts from MCP server: [biology business chemistry computer science economics engineering health history law math other philosophy physics psychology]"}
{"level":"info","ts":"2025-10-14T23:22:33.542521669Z","caller":"observability/logging.go:141","msg":"Successfully loaded 14 categories from MCP server"}
{"level":"info","ts":"2025-10-14T23:22:33.54253171Z","caller":"observability/logging.go:141","msg":"Successfully initialized MCP category classifier"}
{"level":"info","ts":"2025-10-14T23:22:33.54254157Z","caller":"observability/logging.go:141","msg":"Router initialization: Using auto-discovered unified classifier"}
{"level":"info","ts":"2025-10-14T23:22:33.542551791Z","caller":"observability/logging.go:141","msg":"No categories configured for reasoning mode"}
{"level":"info","ts":"2025-10-14T23:22:33.542569872Z","caller":"observability/logging.go:141","msg":"Starting vLLM Semantic Router ExtProc with config: ./config/config-mcp-classifier-example.yaml"}
{"level":"info","ts":"2025-10-14T23:22:33.542657957Z","caller":"observability/logging.go:141","msg":"Starting insecure LLM Router ExtProc server on port 50051..."}
{"level":"info","ts":"2025-10-14T23:22:33.542687599Z","caller":"observability/logging.go:141","msg":"Starting Classification API server on port 8080"}
{"level":"info","ts":"2025-10-14T23:22:33.543251653Z","caller":"observability/logging.go:141","msg":"Found global classification service on attempt 1/5"}
{"level":"info","ts":"2025-10-14T23:22:33.543383071Z","caller":"observability/logging.go:141","msg":"System prompt configuration endpoints enabled"}
{"level":"info","ts":"2025-10-14T23:22:33.543404522Z","caller":"observability/logging.go:141","msg":"Classification API server listening on port 8080"}
{"level":"error","ts":"2025-10-14T23:22:33.543483146Z","caller":"observability/logging.go:143","msg":"Classification API server error: listen tcp :8080: bind: address already in use","stacktrace":"github.com/vllm-project/semantic-router/src/semantic-router/pkg/observability.Errorf\n\t/home/ubuntu/rootfs/back/semantic-router.bak/src/semantic-router/pkg/observability/logging.go:143\nmain.main.func4\n\t/home/ubuntu/rootfs/back/semantic-router.bak/src/semantic-router/cmd/main.go:119"}
{"level":"info","ts":"2025-10-14T23:23:10.427195055Z","caller":"observability/logging.go:141","msg":"Started processing a new request"}
{"level":"info","ts":"2025-10-14T23:23:10.428580898Z","caller":"observability/logging.go:141","msg":"Received request headers"}
{"level":"info","ts":"2025-10-14T23:23:10.429048946Z","caller":"observability/logging.go:141","msg":"Received request body {\"model\": \"auto\", \"messages\": [{\"role\": \"system\", \"content\": \"You are a professional math teacher. Explain math concepts clearly and show step-by-step solutions to problems.\"}, {\"role\": \"user\", \"content\": \"What is the derivative of f(x) = x^3 + 2x^2 - 5x + 7?\"}]}"}
{"level":"info","ts":"2025-10-14T23:23:10.42929603Z","caller":"observability/logging.go:141","msg":"Original model: auto"}
{"level":"info","ts":"2025-10-14T23:23:10.429342203Z","caller":"observability/logging.go:141","msg":"Using Auto Model Selection"}
{"level":"info","ts":"2025-10-14T23:23:10.429360834Z","caller":"observability/logging.go:141","msg":"Routing to model: openai/gpt-oss-20b"}
{"level":"info","ts":"2025-10-14T23:23:11.392018076Z","caller":"observability/logging.go:141","msg":"MCP classification result: class=9, confidence=0.9953, entropy_available=true"}
{"level":"info","ts":"2025-10-14T23:23:11.39208404Z","caller":"observability/logging.go:141","msg":"MCP classified as category: math (mmlu=math), reasoning_decision: use=false, confidence=0.796, reason=category_not_in_reasoning_map"}
{"level":"info","ts":"2025-10-14T23:23:11.392098731Z","caller":"observability/logging.go:141","msg":"Entropy-based reasoning decision: category='math', confidence=0.995, use_reasoning=false, reason=category_not_in_reasoning_map, strategy=unknown_category_default"}
{"level":"info","ts":"2025-10-14T23:23:11.392134253Z","caller":"observability/logging.go:141","msg":"Top predicted categories: [{math 0.9952573} {chemistry 0.0029757656} {physics 0.0010777508}]"}
{"level":"info","ts":"2025-10-14T23:23:11.392141833Z","caller":"observability/logging.go:141","msg":"Entropy-based reasoning decision for this query: false on [openai/gpt-oss-20b] model (confidence: 0.796, reason: category_not_in_reasoning_map)"}
{"level":"info","ts":"2025-10-14T23:23:11.392169765Z","caller":"observability/logging.go:141","msg":"Selected endpoint address: 127.0.0.1:8000 for model: openai/gpt-oss-20b"}
{"level":"info","ts":"2025-10-14T23:23:11.392458232Z","caller":"observability/logging.go:141","msg":"Reasoning mode disabled for model: openai/gpt-oss-20b"}
{"level":"info","ts":"2025-10-14T23:23:11.392499495Z","caller":"observability/logging.go:141","msg":"Use new model: openai/gpt-oss-20b"}
{"level":"info","ts":"2025-10-14T23:23:11.392526346Z","caller":"observability/logging.go:137","msg":"routing_decision","selected_model":"openai/gpt-oss-20b","category":"math","routing_latency_ms":963,"event":"routing_decision","reason_code":"auto_routing","request_id":"75504728-3e8d-42b7-9fa4-bebb2fcc8429","original_model":"auto","reasoning_enabled":false,"reasoning_effort":"high","selected_endpoint":"127.0.0.1:8000"}
{"level":"info","ts":"2025-10-14T23:23:14.805695881Z","caller":"observability/logging.go:137","msg":"llm_usage","request_id":"75504728-3e8d-42b7-9fa4-bebb2fcc8429","prompt_tokens":121,"cost":0,"event":"llm_usage","model":"openai/gpt-oss-20b","completion_tokens":301,"total_tokens":422,"completion_latency_ms":4376,"currency":"unknown","pricing":"not_configured"}
{"level":"info","ts":"2025-10-14T23:23:14.805733533Z","caller":"observability/logging.go:141","msg":"Cache updated for request ID: 75504728-3e8d-42b7-9fa4-bebb2fcc8429"}
{"level":"info","ts":"2025-10-14T23:23:14.806250844Z","caller":"observability/logging.go:141","msg":"Stream canceled gracefully"}
^C{"level":"info","ts":"2025-10-14T23:26:04.991521146Z","caller":"observability/logging.go:141","msg":"Received shutdown signal, gracefully stopping server..."}
{"level":"info","ts":"2025-10-14T23:26:04.991573689Z","caller":"observability/logging.go:141","msg":"Received shutdown signal, cleaning up..."}

MCP Server

2025-10-14 23:22:23,894 - __main__ - INFO - Starting Generative Model-Based MCP Classification Server (HTTP mode)
2025-10-14 23:22:23,894 - __main__ - INFO - Loading generative model from: llm-semantic-router/qwen3_generative_classifier_r16
2025-10-14 23:22:23,894 - __main__ - INFO - Using device: cpu
2025-10-14 23:22:23,894 - __main__ - INFO - Detected HuggingFace model: llm-semantic-router/qwen3_generative_classifier_r16
label_mapping.json: 1.17kB [00:00, 9.54MB/s]
2025-10-14 23:22:24,046 - __main__ - INFO - Loading label mapping from: /home/ubuntu/.cache/huggingface/hub/models--llm-semantic-router--qwen3_generative_classifier_r16/snapshots/0f27c62886132c8e22ab88f5c699196083dfe716/label_mapping.json
2025-10-14 23:22:24,046 - __main__ - INFO - Loaded 14 categories: ['biology', 'business', 'chemistry', 'computer science', 'economics', 'engineering', 'health', 'history', 'law', 'math', 'other', 'philosophy', 'physics', 'psychology']
2025-10-14 23:22:24,046 - __main__ - INFO - Loading tokenizer...
tokenizer_config.json: 5.40kB [00:00, 40.5MB/s]
vocab.json: 2.78MB [00:00, 13.5MB/s]
merges.txt: 1.67MB [00:00, 11.0MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 233MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████| 707/707 [00:00<00:00, 10.8MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████| 613/613 [00:00<00:00, 10.1MB/s]
chat_template.jinja: 4.17kB [00:00, 24.3MB/s]
2025-10-14 23:22:25,728 - __main__ - INFO - Loading base model...
`torch_dtype` is deprecated! Use `dtype` instead!
2025-10-14 23:22:26,189 - __main__ - INFO - Loading LoRA weights...
adapter_config.json: 100%|███████████████████████████████████████████████████████████| 927/927 [00:00<00:00, 6.58MB/s]
adapter_model.safetensors: 100%|██████████████████████████████████████████████████| 40.4M/40.4M [00:00<00:00, 238MB/s]
2025-10-14 23:22:27,017 - __main__ - INFO - Prepared category tokens: 14 categories
2025-10-14 23:22:27,017 - __main__ - INFO - Model loaded successfully
2025-10-14 23:22:27,017 - __main__ - INFO - Available categories: biology, business, chemistry, computer science, economics, engineering, health, history, law, math, other, philosophy, physics, psychology
2025-10-14 23:22:27,017 - __main__ - INFO - Base model: Qwen/Qwen3-0.6B
2025-10-14 23:22:27,017 - __main__ - INFO - Model path: llm-semantic-router/qwen3_generative_classifier_r16
2025-10-14 23:22:27,018 - __main__ - INFO - Device: cpu
2025-10-14 23:22:27,018 - __main__ - INFO - Listening on http://0.0.0.0:8090/mcp
2025-10-14 23:22:27,018 - __main__ - INFO - Server is ready at http://0.0.0.0:8090/mcp
2025-10-14 23:22:27,018 - __main__ - INFO - Health check available at http://0.0.0.0:8090/health
2025-10-14 23:22:32,722 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:22:32 +0000] "POST /mcp/initialize HTTP/1.1" 200 294 "-" "Go-http-client/1.1"
2025-10-14 23:22:32,723 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:22:32 +0000] "POST /mcp/tools/list HTTP/1.1" 200 1327 "-" "Go-http-client/1.1"
2025-10-14 23:22:32,723 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:22:32 +0000] "POST /mcp/resources/list HTTP/1.1" 404 269 "-" "Go-http-client/1.1"
2025-10-14 23:22:32,723 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:22:32 +0000] "POST /mcp/prompts/list HTTP/1.1" 404 266 "-" "Go-http-client/1.1"
2025-10-14 23:22:32,723 - __main__ - INFO - Returning 14 categories with 14 system prompts: ['biology', 'business', 'chemistry', 'computer science', 'economics', 'engineering', 'health', 'history', 'law', 'math', 'other', 'philosophy', 'physics', 'psychology']
2025-10-14 23:22:32,724 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:22:32 +0000] "POST /mcp/tools/call HTTP/1.1" 200 6578 "-" "Go-http-client/1.1"
2025-10-14 23:22:33,540 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:22:33 +0000] "POST /mcp/initialize HTTP/1.1" 200 294 "-" "Go-http-client/1.1"
2025-10-14 23:22:33,540 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:22:33 +0000] "POST /mcp/tools/list HTTP/1.1" 200 1327 "-" "Go-http-client/1.1"
2025-10-14 23:22:33,541 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:22:33 +0000] "POST /mcp/resources/list HTTP/1.1" 404 269 "-" "Go-http-client/1.1"
2025-10-14 23:22:33,541 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:22:33 +0000] "POST /mcp/prompts/list HTTP/1.1" 404 266 "-" "Go-http-client/1.1"
2025-10-14 23:22:33,541 - __main__ - INFO - Returning 14 categories with 14 system prompts: ['biology', 'business', 'chemistry', 'computer science', 'economics', 'engineering', 'health', 'history', 'law', 'math', 'other', 'philosophy', 'physics', 'psychology']
2025-10-14 23:22:33,542 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:22:33 +0000] "POST /mcp/tools/call HTTP/1.1" 200 6578 "-" "Go-http-client/1.1"
2025-10-14 23:23:11,386 - __main__ - INFO - Classification result: class=9 (math), confidence=0.995, entropy=0.050, model=openai/gpt-oss-20b, use_reasoning=False
2025-10-14 23:23:11,391 - aiohttp.access - INFO - 127.0.0.1 [14/Oct/2025:23:23:10 +0000] "POST /mcp/tools/call HTTP/1.1" 200 711 "-" "Go-http-client/1.1"
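For reference, the tools/call requests in the access log above can be reproduced directly. The body below follows the standard MCP tools/call shape ({"name": ..., "arguments": ...}), but the exact schema this server accepts is defined in server_generative.py, so treat this as a sketch; the with_probabilities flag mirrors the classify() signature visible in the review below.

```python
# Hypothetical direct call to the classify_text tool on the MCP server above.
# The request shape follows the MCP tools/call convention and is an assumption;
# see examples/mcp-classifier-server/server_generative.py for the real schema.
import json
import urllib.request

body = {
    "name": "classify_text",
    "arguments": {
        "text": "What is the derivative of f(x) = x^3 + 2x^2 - 5x + 7?",
        "with_probabilities": True,  # ask for the full softmax distribution
    },
}
req = urllib.request.Request(
    "http://127.0.0.1:8090/mcp/tools/call",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read()))
```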

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #374

Release Notes: Yes/No


netlify bot commented Oct 14, 2025

Deploy Preview for vllm-semantic-router ready!

  • 🔨 Latest commit: 77e4584
  • 🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/68eee0f31797140007208f5e
  • 😎 Deploy Preview: https://deploy-preview-427--vllm-semantic-router.netlify.app


github-actions bot commented Oct 14, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 Root Directory

Owners: @rootfs, @Xunzhuo
Files changed:

  • examples/mcp-classifier-server/requirements_generative.txt
  • examples/mcp-classifier-server/server_generative.py
  • examples/mcp-classifier-server/README.md
  • examples/mcp-classifier-server/server_embedding.py

📁 src

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

  • src/training/training_lora/classifier_model_fine_tuning_lora/ft_qwen3_generative_lora.py
  • src/training/training_lora/classifier_model_fine_tuning_lora/ft_linear_lora.py
  • src/training/training_lora/common_lora_utils.py
  • src/training/training_lora/pii_model_fine_tuning_lora/pii_bert_finetuning_lora.py
  • src/training/training_lora/prompt_guard_fine_tuning_lora/jailbreak_bert_finetuning_lora.py


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@rootfs rootfs marked this pull request as draft October 14, 2025 18:28
@rootfs rootfs requested a review from Copilot October 14, 2025 18:28
@Copilot (Contributor) left a comment

Pull Request Overview

This PR implements a new approach to classification using a Small Language Model (SLM) - specifically Qwen3 0.6B with instruction fine-tuning - to replace the existing BERT-based approach for multilingual support. The change introduces generative classification where the model generates category labels as text rather than using a classification head.

  • Adds comprehensive GPU management utilities for multi-GPU environments
  • Implements a new generative classification approach using Qwen3 with LoRA fine-tuning
  • Updates existing classification code to support GPU selection functionality

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • src/training/training_lora/common_lora_utils.py: adds GPU management utilities and deprecates the existing device-info function
  • src/training/training_lora/classifier_model_fine_tuning_lora/ft_qwen3_generative_lora.py: new implementation of generative classification using the Qwen3 model
  • src/training/training_lora/classifier_model_fine_tuning_lora/ft_linear_lora.py: updates to support GPU selection


return "cpu", -1

torch.cuda.set_device(best_gpu_id)
os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)
Copilot AI commented Oct 14, 2025

Setting CUDA_VISIBLE_DEVICES after CUDA initialization may not take effect. This environment variable should be set before importing torch or calling any CUDA functions. Consider moving this to the beginning of the process or using torch.cuda.set_device() only.

Suggested change (remove the line):
- os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)
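To illustrate the ordering the comment is pointing at, here is a standalone sketch (flag name and structure are illustrative, not the PR's code): pin the device through the environment before torch ever initializes CUDA, after which cuda:0 refers to the chosen card.

```python
# Illustrative sketch, not code from this PR: CUDA_VISIBLE_DEVICES only takes
# effect if it is set before torch initializes CUDA, so parse it out of argv
# ahead of the torch import.
import os
import sys

if "--gpu-id" in sys.argv:  # hypothetical flag name
    os.environ["CUDA_VISIBLE_DEVICES"] = sys.argv[sys.argv.index("--gpu-id") + 1]

import torch  # CUDA is initialized lazily, after the env var is already set

if torch.cuda.is_available():
    # The selected card is now visible as cuda:0
    print("Using:", torch.cuda.get_device_name(0))
```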


Comment on lines +83 to +98
REQUIRED_CATEGORIES = [
"biology",
"business",
"chemistry",
"computer science",
"economics",
"engineering",
"health",
"history",
"law",
"math",
"other",
"philosophy",
"physics",
"psychology",
]
Copilot AI commented Oct 14, 2025

[nitpick] Consider moving REQUIRED_CATEGORIES to a configuration file or constant module to avoid duplication across files and make it easier to maintain category changes centrally.


Comment on lines +443 to +444
fp16=False, # Disable fp16 to avoid gradient issues
gradient_checkpointing=False, # Disable to avoid gradient issues
Copilot AI commented Oct 14, 2025

The comments mention avoiding 'gradient issues' but don't specify what issues. Consider documenting the specific problems encountered (e.g., NaN gradients, instability) to help future developers understand the reasoning.

Suggested change
- fp16=False, # Disable fp16 to avoid gradient issues
- gradient_checkpointing=False, # Disable to avoid gradient issues
+ fp16=False, # Disable fp16 to avoid gradient issues (NaN gradients and training instability observed with fp16 enabled)
+ gradient_checkpointing=False, # Disable to avoid gradient issues (enabling caused NaN gradients and unstable loss during training)


Comment on lines 431 to 433
per_device_train_batch_size=1, # Minimal batch size to fit in memory
per_device_eval_batch_size=1,
gradient_accumulation_steps=16, # Effective batch size = 1 * 16 = 16
Copilot AI commented Oct 14, 2025

[nitpick] Using batch size of 1 with gradient accumulation may be inefficient compared to using a larger batch size directly if memory allows. Consider making batch size configurable or dynamically determining optimal batch size based on available GPU memory.


@rootfs rootfs requested a review from Copilot October 14, 2025 18:34
@Copilot (Contributor) left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.



Comment on lines +257 to +260
torch.cuda.set_device(best_gpu_id)
os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)

return f"cuda:{best_gpu_id}", best_gpu_id
Copilot AI commented Oct 14, 2025

Setting CUDA_VISIBLE_DEVICES after torch.cuda.set_device() may not have the expected effect since PyTorch has already initialized CUDA. Consider setting CUDA_VISIBLE_DEVICES before any CUDA operations or use torch.cuda.set_device() exclusively.

Suggested change
- torch.cuda.set_device(best_gpu_id)
- os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)
- return f"cuda:{best_gpu_id}", best_gpu_id
+ # Set CUDA_VISIBLE_DEVICES before any CUDA operations
+ os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu_id)
+ torch.cuda.set_device(0)
+ logger.info(f"Auto-selected GPU {best_gpu_id} (now visible as cuda:0): {torch.cuda.get_device_name(0)}")
+ return "cuda:0", best_gpu_id


Comment on lines 344 to 346
if gpu_id is not None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    logger.info(f"Set CUDA_VISIBLE_DEVICES={gpu_id}")
Copilot AI commented Oct 14, 2025

Setting CUDA_VISIBLE_DEVICES after importing torch may not work as expected. The environment variable should be set before any CUDA initialization. Consider moving this logic to the top of the script or using torch.cuda.set_device() instead.


Comment on lines 688 to 690
if args.gpu_id is not None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_id)
    print(f"INFO: Set CUDA_VISIBLE_DEVICES={args.gpu_id}")
Copilot AI commented Oct 14, 2025

Setting CUDA_VISIBLE_DEVICES after argument parsing may be too late if torch has already been imported and initialized CUDA. Consider setting this at the very beginning of the script or using the set_gpu_device() utility function instead.


Comment on lines 431 to 433
per_device_train_batch_size=1, # Minimal batch size to fit in memory
per_device_eval_batch_size=1,
gradient_accumulation_steps=16, # Effective batch size = 1 * 16 = 16
Copilot AI commented Oct 14, 2025

Using batch size of 1 with gradient accumulation of 16 may be inefficient for GPU utilization. Consider testing with larger batch sizes (e.g., 2 or 4) and correspondingly smaller gradient accumulation steps to better utilize GPU memory and compute.

Suggested change
- per_device_train_batch_size=1, # Minimal batch size to fit in memory
- per_device_eval_batch_size=1,
- gradient_accumulation_steps=16, # Effective batch size = 1 * 16 = 16
+ per_device_train_batch_size=4, # Increased batch size for better GPU utilization
+ per_device_eval_batch_size=4,
+ gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16


Comment on lines 378 to 379
# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
Copilot AI commented Oct 14, 2025

The GPU device selection logic is inconsistent with the earlier gpu_id handling. Consider using the set_gpu_device() utility function consistently throughout the code instead of manually setting device strings.

Suggested change
- # Move to GPU
- device = "cuda" if torch.cuda.is_available() else "cpu"
+ # Move to GPU using set_gpu_device utility
+ device = set_gpu_device(gpu_id)

@rootfs (Collaborator, Author) commented Oct 14, 2025

@OneZero-Y PTAL thanks

Signed-off-by: Huamin Chen <[email protected]>
@rootfs rootfs requested a review from Copilot October 14, 2025 19:58
@Copilot (Contributor) left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/training/training_lora/classifier_model_fine_tuning_lora/ft_qwen3_generative_lora.py:1

  • Setting the CUDA device inside a loop can cause issues in multi-threaded environments or when using multiple processes. Consider setting the device once outside the loop or using device-specific memory queries without changing the global device state.


Comment on lines 432 to 434
per_device_train_batch_size=4, # Increased batch size for better GPU utilization
per_device_eval_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
Copilot AI commented Oct 14, 2025

The comment states 'Increased batch size' but the batch size of 4 appears to be quite small for modern GPU training. Consider making these values configurable parameters rather than hardcoded values to allow for different GPU memory configurations.
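A small sketch of what "configurable" could look like here (the flag name is illustrative, not from the PR), keeping the effective batch size at 16 as the existing comments intend:

```python
# Illustrative sketch: expose the per-device batch size as a CLI flag and
# derive gradient accumulation from it, keeping the effective batch size at 16.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=4)  # hypothetical flag
args = parser.parse_args()

training_kwargs = dict(
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=args.batch_size,
    gradient_accumulation_steps=max(1, 16 // args.batch_size),
)
print(training_kwargs)
```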


# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
Copilot AI commented Oct 14, 2025

Using float16 based only on CUDA availability may not be optimal for all GPU types. Some older GPUs don't support efficient float16 operations. Consider checking for specific GPU capabilities or making the dtype configurable.


Signed-off-by: Huamin Chen <[email protected]>
@rootfs rootfs requested a review from Copilot October 14, 2025 20:10
@Copilot (Contributor) left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



use_fp16 = (
    compute_capability[0] >= 7
)  # Volta and newer support efficient FP16
except:
Copilot AI commented Oct 14, 2025

Using bare 'except:' is discouraged as it catches all exceptions including system exits. Use 'except Exception:' to catch only actual exceptions.

Suggested change
- except:
+ except Exception:


Comment on lines 74 to 75
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from common_lora_utils import (
Copilot AI commented Oct 14, 2025

Using sys.path.append with relative path manipulation is fragile and error-prone. Consider using proper Python packaging with relative imports or setuptools entry points.

Suggested change
- sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
- from common_lora_utils import (
+ from ..common_lora_utils import (


Comment on lines 438 to 439
gradient_accumulation_steps=16
// batch_size, # Maintain effective batch size of 16
Copilot AI commented Oct 14, 2025

Integer division could result in 0 gradient accumulation steps if batch_size > 16, which would cause training to fail. Add a minimum value check: max(1, 16 // batch_size).

Suggested change
- gradient_accumulation_steps=16
- // batch_size, # Maintain effective batch size of 16
+ gradient_accumulation_steps=max(1, 16 // batch_size), # Maintain effective batch size of 16


@rootfs rootfs marked this pull request as ready for review October 14, 2025 20:32
@rootfs changed the title from "[WIP] feat: use decoder only model for classification" to "feat: use decoder only model for classification" on Oct 14, 2025
Signed-off-by: Huamin Chen <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
@rootfs changed the title from "feat: use decoder only model for classification" to "feat: use decoder only model for mcp classification server" on Oct 14, 2025
Signed-off-by: Huamin Chen <[email protected]>
@rootfs rootfs requested a review from Copilot October 14, 2025 23:35
@Copilot (Contributor) left a comment

Pull Request Overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.



"""
Get device information and capabilities.
DEPRECATED: Use set_gpu_device() and get_all_gpu_info() for better multi-GPU support.
Copilot AI commented Oct 14, 2025

Consider adding a deprecation warning in the function implementation to alert users during runtime, not just in the docstring.
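A minimal version of what the comment asks for, assuming the function is named get_device_info() (the name is inferred from the docstring; the body here is a placeholder):

```python
# Sketch of a runtime deprecation warning; the function name is an assumption
# taken from the docstring above, and the body is a placeholder.
import warnings

import torch

def get_device_info():
    """DEPRECATED: Use set_gpu_device() and get_all_gpu_info() instead."""
    warnings.warn(
        "get_device_info() is deprecated; use set_gpu_device() and "
        "get_all_gpu_info() for multi-GPU support.",
        DeprecationWarning,
        stacklevel=2,
    )
    return {"cuda_available": torch.cuda.is_available()}  # placeholder body
```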


if len(predictions.shape) == 3:
    pred_tokens = np.argmax(predictions, axis=-1)
else:
    pred_tokens = predictions
Copilot AI commented Oct 14, 2025

Variable 'predictions' is used without checking if it exists in the else clause. This could cause a NameError if the condition len(predictions.shape) == 3 is False and predictions was not defined earlier.


lr_scheduler_type="cosine",
fp16=False, # Disable fp16 to avoid gradient issues
gradient_checkpointing=False, # Disable to avoid gradient issues
dataloader_num_workers=0,
Copilot AI commented Oct 14, 2025

Setting dataloader_num_workers=0 disables multiprocessing for data loading, which may reduce training performance. Consider allowing this to be configurable or using a small positive value like 2-4.


Q: {question}
A:"""

def classify(self, text: str, with_probabilities: bool = False) -> dict[str, Any]:
Copilot AI commented Oct 14, 2025

The return type annotation uses dict[str, Any] which is less type-safe. Consider defining a TypedDict or dataclass for the return type to improve type safety and documentation.
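For illustration, a TypedDict along the lines the comment suggests, with fields inferred from the classification result visible in the server log (exact field names in server_generative.py may differ):

```python
# Illustrative TypedDict; field names are inferred from the logged result
# (class=9 (math), confidence=0.995, model=..., use_reasoning=False) and are
# an assumption, not the server's actual schema.
from typing import TypedDict

class ClassificationResult(TypedDict, total=False):
    class_index: int
    category: str
    confidence: float
    model: str
    use_reasoning: bool
    probabilities: list[float]  # present only when with_probabilities=True

result: ClassificationResult = {
    "class_index": 9,
    "category": "math",
    "confidence": 0.995,
    "model": "openai/gpt-oss-20b",
    "use_reasoning": False,
}
```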



return result

def _calculate_entropy(self, probabilities: list[float]) -> float:
Copilot AI commented Oct 14, 2025

The function parameter uses list[float] but could be more flexible by accepting any sequence type. Consider using Sequence[float] from typing module for better type compatibility.
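As a concrete sketch of the suggestion, Shannon entropy typed over Sequence[float]; the server computes this over the full 14-way softmax distribution, while the three values below are just the top categories from the router log:

```python
# Sketch: entropy helper typed with Sequence[float] as the comment suggests.
# H(p) = -sum(p_i * ln(p_i)); zero-probability entries are skipped.
import math
from typing import Sequence

def calculate_entropy(probabilities: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

# Top-3 probabilities from the router log above (a partial distribution)
print(round(calculate_entropy([0.9952573, 0.0029757656, 0.0010777508]), 3))
```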


Signed-off-by: Huamin Chen <[email protected]>
@Xunzhuo (Member) left a comment

great!! a big step towards more possibilities

@Xunzhuo Xunzhuo merged commit 644541d into vllm-project:main Oct 15, 2025
16 checks passed


Development

Successfully merging this pull request may close these issues.

[Prompt Classification] Implement Out-of-Tree Classification

3 participants