107 changes: 95 additions & 12 deletions examples/mcp-classifier-server/README.md
@@ -2,9 +2,9 @@

Example MCP servers that provide text classification with intelligent routing for the semantic router.

## 📦 Two Implementations
## 📦 Three Implementations

This directory contains **two MCP classification servers**:
This directory contains **three MCP classification servers**:

### 1. **Regex-Based Server** (`server.py`)

@@ -13,17 +13,27 @@ This directory contains **three MCP classification servers**:
- ✅ **No Dependencies** - Just MCP SDK
- 📝 **Best For**: Prototyping, simple rules, low-latency requirements

### 2. **Embedding-Based Server** (`server_embedding.py`) 🆕
### 2. **Embedding-Based Server** (`server_embedding.py`)

- ✅ **High Accuracy** - Semantic understanding with Qwen3-Embedding-0.6B
- ✅ **RAG-Style** - FAISS vector database with similarity search
- ✅ **Flexible** - Handles paraphrases, synonyms, variations
- 📝 **Best For**: Production use, high-accuracy requirements
- 📝 **Best For**: Production use when you have good training examples

### 3. **Generative Model Server** (`server_generative.py`) 🆕

- ✅ **Highest Accuracy** - Fine-tuned Qwen3 generative model
- ✅ **True Probabilities** - Softmax-based probability distributions
- ✅ **Better Generalization** - Learns category patterns, not just examples
- ✅ **Entropy Calculation** - Shannon entropy for uncertainty quantification
- ✅ **HuggingFace Support** - Load models from HuggingFace Hub or local paths
- 📝 **Best For**: Production use with fine-tuned models (70-85% accuracy)

**Choose based on your needs:**

- **Quick start / Testing?** → Use `server.py` (regex-based)
- **Production / Accuracy?** → Use `server_embedding.py` (embedding-based)
- **Production with training examples?** → Use `server_embedding.py` (embedding-based)
- **Production with fine-tuned model?** → Use `server_generative.py` (generative model)

---

@@ -217,10 +227,83 @@ python3 server_embedding.py --http --port 8090

### Comparison

| Feature | Regex (`server.py`) | Embedding (`server_embedding.py`) |
|---------|---------------------|-----------------------------------|
| **Accuracy** | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Speed** | ~1-5ms | ~50-100ms |
| **Memory** | ~10MB | ~600MB |
| **Setup** | Simple | Requires model |
| **Best For** | Prototyping | Production |
| Feature | Regex (`server.py`) | Embedding (`server_embedding.py`) | Generative (`server_generative.py`) |
|---------|---------------------|-----------------------------------|-------------------------------------|
| **Accuracy** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Speed** | ~1-5ms | ~50-100ms | ~100-200ms (GPU) |
| **Memory** | ~10MB | ~600MB | ~2GB (GPU) / ~4GB (CPU) |
| **Setup** | Simple | CSV + embeddings | Fine-tuned model required |
| **Probabilities** | Rule-based | Similarity scores | Softmax (true) |
| **Entropy** | No | Manual calculation | Built-in (Shannon) |
| **Best For** | Prototyping | Examples-based production | Model-based production |

---

## Generative Model Server (`server_generative.py`)

For **production use with a fine-tuned model and the highest accuracy**, use the generative model server.

### Quick Start

**Option 1: Use Pre-trained HuggingFace Model** (Easiest)

```bash
# Server automatically downloads from HuggingFace Hub
python server_generative.py --http --port 8092 --model-path llm-semantic-router/qwen3_generative_classifier_r16
```

**Option 2: Train Your Own Model**

Step 1: Train the model

```bash
cd ../../../src/training/training_lora/classifier_model_fine_tuning_lora/
python ft_qwen3_generative_lora.py --mode train --epochs 8 --lora-rank 16
# Creates: qwen3_generative_classifier_r16/
```

Step 2: Start the server

```bash
cd - # Back to examples/mcp-classifier-server/
python server_generative.py --http --port 8092 --model-path ../../../src/training/training_lora/classifier_model_fine_tuning_lora/qwen3_generative_classifier_r16
```
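Once the server is running, clients speak plain JSON-RPC 2.0 over HTTP. The following is a minimal sketch of the request envelope a client would POST to the server; the tool name `classify_text` and its `text` argument are assumptions for illustration — check the server's `tools/list` response for the actual tool name and argument schema.

```python
import json


def make_classify_request(text, request_id=1):
    """Build a JSON-RPC 2.0 tools/call envelope for the MCP server.

    The tool name "classify_text" is a hypothetical placeholder; the
    real tool name comes from the server's tools/list response.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "classify_text",
            "arguments": {"text": text},
        },
    }


# Serialize and POST this payload to http://localhost:8092/mcp
# (e.g. with aiohttp or curl); the classification result comes back
# in the JSON-RPC "result" field.
payload = json.dumps(make_classify_request("What is the derivative of x^2?"))
```

The same envelope works against any of the three servers, since they all implement the same MCP protocol surface.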

### Features

- **Fine-tuned Qwen3-0.6B** generative model with LoRA
- **Softmax probabilities** from model logits (true probability distribution)
- **Shannon entropy** for uncertainty quantification
- **14 MMLU-Pro categories** (biology, business, chemistry, CS, economics, engineering, health, history, law, math, other, philosophy, physics, psychology)
- **Same MCP protocol** as other servers (drop-in replacement)
- **Highest accuracy** - 70-85% on validation set
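The softmax probabilities and Shannon entropy listed above can be sketched as follows. This is a minimal illustration of the math, not the server's actual implementation; the example logits are made up, and a real classifier would produce one logit per category from the model head.

```python
import math


def softmax(logits):
    """Turn raw logits into a true probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in bits; higher means more uncertain prediction."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical per-category logits from a classifier head
logits = [2.1, 0.3, -1.0]
probs = softmax(logits)
uncertainty = shannon_entropy(probs)
```

A confident prediction concentrates probability mass on one category and yields entropy near zero; a uniform distribution over N categories yields the maximum, log2(N) bits.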

### Why Use Generative Server?

**Advantages over Embedding Server:**

- ✅ True probability distributions (softmax-based, not similarity-based)
- ✅ Better generalization beyond training examples
- ✅ More accurate classification (70-85% vs ~60-70%)
- ✅ Built-in entropy calculation for uncertainty
- ✅ Fine-tuned on task-specific data

**When to Use:**

- You have training data to fine-tune a model
- Need highest accuracy for production
- Want true probability distributions
- Need uncertainty quantification (entropy)
- Can afford 2-4GB memory footprint

### Testing

Test the generative server with sample queries:

```bash
python test_generative.py --model-path qwen3_generative_classifier_r16
```

### Documentation

For detailed documentation, see [README_GENERATIVE.md](README_GENERATIVE.md).
15 changes: 15 additions & 0 deletions examples/mcp-classifier-server/requirements_generative.txt
@@ -0,0 +1,15 @@
# Requirements for Generative Model-Based MCP Classification Server
# server_generative.py

# Core dependencies
torch>=2.0.0
transformers>=4.30.0
peft>=0.4.0
huggingface_hub>=0.16.0

# MCP SDK
mcp>=0.1.0

# HTTP mode (optional)
aiohttp>=3.8.0

39 changes: 35 additions & 4 deletions examples/mcp-classifier-server/server_embedding.py
@@ -592,9 +592,14 @@ async def handle_mcp_request(request):
init_result = {
"protocolVersion": "2024-11-05",
"capabilities": {
"tools": {},
"tools": {}, # We support tools
# Note: We don't support resources or prompts
},
"serverInfo": {
"name": "embedding-classifier",
"version": "1.0.0",
"description": "Embedding-based text classification with semantic similarity",
},
"serverInfo": {"name": "embedding-classifier", "version": "1.0.0"},
}

if request.path.startswith("/mcp/") and request.path != "/mcp":
@@ -648,13 +653,38 @@ async def handle_mcp_request(request):
result = {"jsonrpc": "2.0", "id": request_id, "result": {}}
return web.json_response(result)

# Handle unsupported but valid MCP methods gracefully
elif method in [
"resources/list",
"resources/read",
"prompts/list",
"prompts/get",
]:
# These are valid MCP methods but not implemented in this server
# Return empty results instead of error for better compatibility
logger.debug(
f"Unsupported method called: {method} (returning empty result)"
)

if method == "resources/list":
result_data = {"resources": []}
elif method == "prompts/list":
result_data = {"prompts": []}
else:
result_data = {}

result = {"jsonrpc": "2.0", "id": request_id, "result": result_data}
return web.json_response(result)

else:
# Unknown method - return error with HTTP 200 (per JSON-RPC spec)
logger.warning(f"Unknown method called: {method}")
error = {
"jsonrpc": "2.0",
"id": request_id,
"error": {"code": -32601, "message": f"Method not found: {method}"},
}
return web.json_response(error, status=404)
return web.json_response(error)

except Exception as e:
logger.error(f"Error handling request: {e}", exc_info=True)
@@ -667,7 +697,8 @@ async def handle_mcp_request(request):
),
"error": {"code": -32603, "message": f"Internal error: {str(e)}"},
}
return web.json_response(error, status=500)
# Per JSON-RPC 2.0 spec, return HTTP 200 even for errors
return web.json_response(error)

async def health_check(request):
"""Health check endpoint."""