vllm-project · Xunzhuo · Oct 17, 2025
@@ -0,0 +1,269 @@
+---
+slug: mom-family
+title: "MoM: Specialized Models for Intelligent Routing"
+authors: [Xunzhuo]
+tags: [mom, models, routing, announcement]
+---
+
+![MoM Family](/img/mom-family.png)
+
+**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models)—a family of specialized routing models that power vLLM-SR's intelligent decision-making.
+
+<!-- truncate -->
+
+## Why MoM?
+
+vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources—"What's the weather?" shouldn't cost as much as "Analyze this legal contract."
+
+## MoM System Card
+
+A quick overview of all MoM models:
+
+| Category | Model | Size | Architecture | Base Model | Purpose |
+|----------|-------|------|--------------|------------|---------|
+| **🧠 Intelligent Routing** | mom-brain-flash | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
+| | mom-brain-pro | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
+| | mom-brain-max | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
+| **🔍 Similarity Search** | mom-similarity-flash | Flash | Encoder | ModernBERT | Semantic similarity matching |
+| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
+| | mom-pii-flash | Flash | Encoder | ModernBERT | PII detection & privacy protection |
+| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
+| | mom-expert-science-flash | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
+| | mom-expert-social-flash | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
+| | mom-expert-humanities-flash | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
+| | mom-expert-law-flash | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
+| | mom-expert-generalist-flash | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |
+
+**Key Insights:**
+
+- **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
+- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing
+- **Qwen3** (decoder-only) → Explainable routing decisions + domain-specific problem solving
+- **Flash** models achieve 10,000+ QPS on commodity hardware
+- **SLM Experts** are not routers—they are specialized backend models that solve domain-specific problems
+
+## The Evolution: From Encoder-Only to Mixture-of-Models
+
+### Where We Started: ModernBERT Foundation
+
+vLLM-SR initially built its routing intelligence entirely on **ModernBERT** (encoder-only models):
+
+**Advantages**:
+
+- ⚡ **Blazing fast**: Sub-10ms inference latency
+- 📊 **High throughput**: 10,000+ QPS on commodity hardware
+- 💰 **Cost-effective**: Minimal compute requirements
+- 🎯 **Proven accuracy**: Strong performance on classification tasks
+
+**Limitations**:
+
+- ❌ **Black box decisions**: No explanation for routing choices
+- ❌ **Limited reasoning**: Cannot handle complex, multi-step logic
+- ❌ **Fixed capabilities**: Hard to extend with new behaviors
+- ❌ **No tool integration**: Cannot leverage external tools or APIs
+
+### Why We're Evolving: Decoder-Only Models
+
+As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements:
+
+- **Explainability**: Users need to understand *why* a query was routed to a specific model
+- **Complex reasoning**: Some routing decisions require multi-step analysis
+- **Agentic workflows**: Integration with tool calling, function execution, and external APIs
+- **Advanced techniques**: Reinforcement learning (RL), sophisticated post-training methods
+- **Domain expertise**: Specialized routing for legal, medical, scientific domains
+
+**The Solution**: Expand to decoder-only models while keeping encoder speed where it matters.
+
+### The MoM Architecture: Best of Both Worlds
+
+**Mixture-of-Models (MoM)** is both a philosophy and an architecture:
+
+1. **Backend LLM Architecture** — Route requests to the optimal downstream model (GPT-4, Claude, Llama, etc.)
+2. **Router Internal Design** — The router itself uses multiple specialized models working together
+
+Our MoM approach combines encoder and decoder strengths:
+
+- ⚡ **Encoders (ModernBERT)**: Fast classification (sub-10ms latency) for high-throughput scenarios
+- 🧠 **Decoders (Qwen3)**: Explainable decisions with reasoning for transparency
+- 🎯 **SLM Experts (Qwen3 0.6B)**: Backend problem solving with specialized domain knowledge
+
+This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.
+
+**Key Insight**: Just as vLLM-SR routes to different backend LLMs, the router itself is powered by a mixture of specialized models, each optimized for a specific routing task (security, similarity, intent classification).
+
+## The MoM Model Family
+
+We organize MoM models into **four categories** with up to **three size variants** (Flash, Pro, Max). Only Intelligent Routing currently ships all three; the other categories are Flash-only:
+
+### 🧠 Intelligent Routing
+
+Smart routing models with three size variants:
+
+| Model | Size | Base Model | Purpose |
+|-------|------|------------|---------|
+| **mom-brain-flash** | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
+| **mom-brain-pro** | Pro | Qwen3 0.6B | Balanced performance with reasoning capabilities |
+| **mom-brain-max** | Max | Qwen3 1.7B | Maximum accuracy for complex routing decisions |
+
+**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen3 0.6B and 1.7B (decoder-only) models.
+
+### 🔍 Similarity Search
+
+Semantic similarity and vector search:
+
+| Model | Size | Base Model | Purpose |
+|-------|------|------------|---------|
+| **mom-similarity-flash** | Flash | ModernBERT | Fast semantic similarity matching for route selection |
+
+**Architecture**: Based on ModernBERT (encoder-only) for high-speed embedding generation.
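+
+To make "similarity matching for route selection" concrete, here is a minimal sketch that picks the route whose embedding is closest to the query embedding. Cosine similarity is our assumption for the metric, the three-dimensional vectors are toy values, and `best_route` is a hypothetical helper, not part of vLLM-SR. In practice the vectors would come from an embedding model such as `mom-similarity-flash`.
+
+```python
+import math
+from typing import Dict, List, Tuple
+
+
+def cosine(a: List[float], b: List[float]) -> float:
+    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
+    dot = sum(x * y for x, y in zip(a, b))
+    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
+    return dot / norm if norm else 0.0
+
+
+def best_route(query_vec: List[float], route_vecs: Dict[str, List[float]]) -> Tuple[str, float]:
+    """Return (route name, score) for the most similar route embedding.
+
+    Raises ValueError if route_vecs is empty.
+    """
+    scored = ((name, cosine(query_vec, vec)) for name, vec in route_vecs.items())
+    return max(scored, key=lambda pair: pair[1])
+
+
+# Toy embeddings, for illustration only.
+routes = {"math": [1.0, 0.0, 0.0], "law": [0.0, 1.0, 0.0]}
+name, score = best_route([0.9, 0.1, 0.0], routes)
+```
+
+A real deployment would likely precompute and index the route embeddings rather than scan them linearly.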
+
+### 🔒 Prompt Guardian
+
+Security and safety checks before routing:
+
+| Model | Size | Base Model | Purpose |
+|-------|------|------------|---------|
+| **mom-jailbreak-flash** | Flash | ModernBERT | Jailbreak/attack detection (security) |
+| **mom-pii-flash** | Flash | ModernBERT | PII detection (privacy protection) |
+
+**Architecture**: Both based on ModernBERT (encoder-only) for ultra-fast security checks.
+
+### 🎯 SLM Experts
+
+Specialized small language models deployed as **backend problem solvers**:
+
+| Model | Size | Base Model | Domain | Training Data |
+|-------|------|------------|--------|---------------|
+| **mom-expert-math-flash** | Flash | Qwen3 0.6B | Mathematics | GSM8K, MATH |
+| **mom-expert-science-flash** | Flash | Qwen3 0.6B | Science | ARC-Challenge, OpenBookQA, SciQ |
+| **mom-expert-social-flash** | Flash | Qwen3 0.6B | Social Sciences | CommonsenseQA, StrategyQA |
+| **mom-expert-humanities-flash** | Flash | Qwen3 0.6B | Humanities | TruthfulQA, MMLU-train subset |
+| **mom-expert-law-flash** | Flash | Qwen3 0.6B | Law | MMLU-train law subset + specialized sources |
+| **mom-expert-generalist-flash** | Flash | Qwen3 0.6B | Generalist | Mixed from above domains |
+
+**Architecture**: All based on Qwen3 0.6B (decoder-only) for domain-specific problem solving. Currently only Flash variants are available.
+
+**Purpose**: These models are **not routers**—they are deployed as backend LLMs to solve domain-specific problems. They form part of the Mixture-of-Models backend architecture that vLLM-SR routes to.
+
+## Design Principles
+
+**Safety-First**: Prompt Guardian models (PII, jailbreak detection) run before routing—security at the edge.
+
+**Speed ↔ Capability**: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.
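+
+As a rough illustration of this trade-off, the selection could be written as a small rule. The function name and its two flags are hypothetical; how you actually choose a tier depends on your own SLAs.
+
+```python
+def pick_brain_model(needs_reasoning: bool, prefer_accuracy: bool = False) -> str:
+    """Pick a mom-brain tier from two coarse requirements (illustrative rule only)."""
+    if not needs_reasoning:
+        # Encoder-only tier: fastest, classification without an explanation.
+        return "mom-brain-flash"
+    # Decoder tiers: Pro balances cost and reasoning, Max favors accuracy.
+    return "mom-brain-max" if prefer_accuracy else "mom-brain-pro"
+```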
+
+**Domain Expertise**: SLM Expert models are deployed as backend problem solvers, achieving 15-25% better accuracy on domain-specific tasks vs. generalist LLMs. Math problems are solved by math experts, science questions by science experts, etc.
+
+## How vLLM-SR Uses MoM
+
+MoM operates at **two levels** in vLLM-SR:
+
+### Level 1: Router Internal Architecture (MoM Inside)
+
+The router itself is a mixture of specialized models working together in a pipeline:
+
+1. **Security Check** → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
+2. **Intent Classification** → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
+3. **Similarity Search** → `mom-similarity-flash` finds semantically similar routes
+
+Each stage uses the **right model for the right task**: fast encoders for security checks, reasoning decoders for complex decisions.
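+
+The three stages above can be sketched as a plain function pipeline. This is a simplified illustration, not the vLLM-SR implementation: `route_request`, `RoutingDecision`, and the `toy_*` stand-ins are hypothetical names, the stand-ins use keyword and regex checks instead of real models, and blocking a request outright on a PII hit is our assumption about what "filter" means.
+
+```python
+import re
+from dataclasses import dataclass
+from typing import Callable, Optional
+
+
+@dataclass
+class RoutingDecision:
+    allowed: bool
+    reason: str
+    intent: Optional[str] = None
+    route: Optional[str] = None
+
+
+def route_request(
+    prompt: str,
+    is_jailbreak: Callable[[str], bool],       # stand-in for mom-jailbreak-flash
+    has_pii: Callable[[str], bool],            # stand-in for mom-pii-flash
+    classify_intent: Callable[[str], str],     # stand-in for a mom-brain-* model
+    nearest_route: Callable[[str, str], str],  # stand-in for mom-similarity-flash
+) -> RoutingDecision:
+    # Stage 1: security checks run before any routing work.
+    if is_jailbreak(prompt):
+        return RoutingDecision(False, "blocked: jailbreak detected")
+    if has_pii(prompt):
+        return RoutingDecision(False, "blocked: PII detected")
+    # Stage 2: intent classification.
+    intent = classify_intent(prompt)
+    # Stage 3: similarity search for a matching route.
+    route = nearest_route(prompt, intent)
+    return RoutingDecision(True, "routed", intent, route)
+
+
+# Toy stand-ins so the sketch runs without any model.
+def toy_is_jailbreak(p: str) -> bool:
+    return "ignore previous instructions" in p.lower()
+
+
+def toy_has_pii(p: str) -> bool:
+    # Matches a US SSN-like pattern such as 123-45-6789.
+    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", p) is not None
+
+
+def toy_intent(p: str) -> str:
+    return "math" if any(ch.isdigit() for ch in p) else "general"
+
+
+def toy_route(p: str, intent: str) -> str:
+    return f"route:{intent}"
+
+
+decision = route_request("What is 12 * 7?", toy_is_jailbreak, toy_has_pii, toy_intent, toy_route)
+```
+
+Passing the stages in as callables keeps the pipeline order explicit and lets each stand-in be replaced by a real model call independently.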
+
+### Level 2: Backend LLM Orchestration (MoM Outside)
+
+The router then directs requests to the optimal backend LLM from a mixture of models:
+
+**General-Purpose LLMs**:
+
+- **Simple queries** → Lightweight models (Llama 3.2, Qwen 2.5)
+- **Complex queries** → Premium models (GPT-4, Claude 3.5)
+
+**Domain-Specific SLM Experts** (`mom-expert-*`):
+
+- **Math problems** → `mom-expert-math-flash` (Qwen3 0.6B trained on GSM8K, MATH)
+- **Science questions** → `mom-expert-science-flash` (Qwen3 0.6B trained on ARC-Challenge, OpenBookQA, SciQ)
+- **Social sciences** → `mom-expert-social-flash` (Qwen3 0.6B trained on CommonsenseQA, StrategyQA)
+- **Humanities** → `mom-expert-humanities-flash` (Qwen3 0.6B trained on TruthfulQA, MMLU-train subset)
+- **Legal queries** → `mom-expert-law-flash` (Qwen3 0.6B trained on MMLU-train law subset + specialized sources)
+- **General tasks** → `mom-expert-generalist-flash` (Qwen3 0.6B trained on a mix of the above domains)
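+
+The domain-to-expert mapping in the list above can be expressed as a lookup table. The model names come from this post. The short domain labels and the fallback to the generalist for unknown domains are assumptions made for this sketch, and `pick_expert` is a hypothetical helper.
+
+```python
+from typing import Dict
+
+# Domain label -> SLM expert model name (names as listed in this post).
+EXPERT_BY_DOMAIN: Dict[str, str] = {
+    "math": "mom-expert-math-flash",
+    "science": "mom-expert-science-flash",
+    "social": "mom-expert-social-flash",
+    "humanities": "mom-expert-humanities-flash",
+    "law": "mom-expert-law-flash",
+}
+GENERALIST = "mom-expert-generalist-flash"
+
+
+def pick_expert(domain: str) -> str:
+    """Return the expert for a domain label; unknown labels go to the generalist."""
+    return EXPERT_BY_DOMAIN.get(domain.strip().lower(), GENERALIST)
+```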
+
+This dual-level MoM architecture achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
+
+**The Philosophy**: Mixture-of-Models all the way down—from the router's internal decision-making to the backend LLM pool (including both general-purpose LLMs and specialized SLM experts).
+
+## What's Next: Exploring Frontier Techniques
+
+The move to decoder-only models opens exciting possibilities for vLLM-SR:
+
+### 🤖 Agentic Routing
+
+Decoder models can act as intelligent agents that:
+
+- Dynamically select and orchestrate multiple models
+- Make multi-step routing decisions with tool calling
+- Adapt routing strategies based on feedback
+
+### 🎯 Reinforcement Learning (RL)
+
+Apply RL techniques to optimize routing decisions:
+
+- Learn from user feedback and model performance
+- Discover optimal routing policies through trial and error
+- Continuously improve cost-quality trade-offs
+
+### 🔧 Advanced Post-Training
+
+Leverage cutting-edge post-training methods:
+
+- **Distillation**: Transfer knowledge from large models to efficient routers
+- **Preference learning**: Train on human feedback (RLHF, DPO)
+- **Domain adaptation**: Fine-tune for specific industries or use cases
+
+### 🛠️ Tool Integration
+
+Enable routers to:
+
+- Call external APIs for context-aware routing
+- Query databases for historical routing patterns
+- Integrate with monitoring systems for real-time optimization
+
+**The vision**: vLLM-SR routers that not only classify but *reason*, *learn*, and *adapt*.
+
+## Model Naming Convention
+
+```text
+mom-{category}-{size}
+mom-expert-{domain}-{size}
+```
+
+### Four Categories
+
+1. **Intelligent Routing**: `mom-brain-{flash|pro|max}`
+2. **Similarity Search**: `mom-similarity-{flash}`
+3. **Prompt Guardian**: `mom-{jailbreak|pii}-{flash}`
+4. **SLM Experts**: `mom-expert-{domain}-{flash}` where domain = `{math|science|social|humanities|law|generalist}`
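+
+For tooling that needs to take these names apart, a small parser is enough. The sketch below is not an official utility: `parse_mom_name` and `MomName` are hypothetical names, and the pattern is built only from the convention listed above. It checks the shape of a name, not whether that variant has actually been published (for example, it accepts `mom-similarity-pro`).
+
+```python
+import re
+from typing import NamedTuple, Optional
+
+# Either "expert-{domain}" or a single task keyword, followed by a size.
+_NAME_RE = re.compile(
+    r"^mom-(?:(?P<expert>expert)-(?P<domain>math|science|social|humanities|law|generalist)"
+    r"|(?P<task>brain|similarity|jailbreak|pii))-(?P<size>flash|pro|max)$"
+)
+
+
+class MomName(NamedTuple):
+    family: str            # "brain", "similarity", "jailbreak", "pii", or "expert"
+    domain: Optional[str]  # set only for expert models
+    size: str              # "flash", "pro", or "max"
+
+
+def parse_mom_name(name: str) -> Optional[MomName]:
+    """Split a MoM model name into its parts; return None if it does not match."""
+    m = _NAME_RE.match(name)
+    if m is None:
+        return None
+    family = m.group("expert") or m.group("task")
+    return MomName(family, m.group("domain"), m.group("size"))
+```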
+
+### Three Size Variants
+
+- **flash**: ModernBERT-based (for brain/similarity/guardian, with sub-10ms latency) or Qwen3 0.6B (for experts); the fastest tier
+- **pro**: Qwen3 0.6B (for brain); balanced performance with reasoning
+- **max**: Qwen3 1.7B (for brain); maximum accuracy and capabilities
+
+### Architecture Summary
+
+- **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen3 0.6B/1.7B)
+- **Similarity Search**: Flash (ModernBERT)
+- **Prompt Guardian**: Flash (ModernBERT)
+- **SLM Experts**: Flash only (Qwen3 0.6B) — 6 domain specialists
+
+## Get Started
+
+All MoM models are available on [Hugging Face](https://huggingface.co/LLM-Semantic-Router).
+
+**Resources**:
+
+- [GitHub](https://github.com/vllm-project/semantic-router)
+- [Documentation](https://vllm-semantic-router.com)
+- [Quick Start Guide](https://vllm-semantic-router.com/docs/installation)
+
+---
+
+**vLLM-SR · Route with intent. Think with reason.**
@@ -779,3 +779,107 @@ td, th {
     width: 100% !important;
   }
 }
+
+/* ============================================
+   Blog Page Optimizations - Wider Content
+   ============================================ */
+
+/* Only apply blog optimizations to blog pages - use more specific selectors */
+
+/* Hide blog sidebar (left side posts list) - only on blog pages */
+[class*='blog-wrapper'] aside[class*='blogSidebar'],
+[class*='blog-wrapper'] aside.col--3,
+[class*='blog-wrapper'] .theme-blog-sidebar {
+  display: none !important;
+}
+
+/* Hide table of contents (right side) - only on blog pages, not docs */
+[class*='blog-wrapper'] .theme-doc-toc-desktop,
+[class*='blog-wrapper'] .table-of-contents,
+[class*='blog-wrapper'] div[class*='tableOfContents'],
+[class*='blog-wrapper'] div[class*='tocCollapsible'] {
+  display: none !important;
+}
+
+/* Expand blog content row to full width - only on blog pages */
+[class*='blog-wrapper'] div[class*='blogContainer'] .row,
+[class*='blog-wrapper'] .row {
+  justify-content: center !important;
+  margin: 0 auto !important;
+}
+
+/* Expand blog content column to use full width - only on blog pages */
+[class*='blog-wrapper'] .col--7,
+[class*='blog-wrapper'] div[class*='blogPostContent'] {
+  max-width: 100% !important;
+  flex: 0 0 100% !important;
+  margin: 0 auto !important;
+}
+
+/* Center blog content container - wider layout - only on blog pages */
+[class*='blog-wrapper'] .container,
+[class*='blog-wrapper'] div[class*='blogContainer'] {
+  max-width: 1600px !important;
+  margin: 0 auto !important;
+  padding: 0 3rem !important;
+}
+
+/* Blog post content - centered and wide - only on blog pages */
+[class*='blog-wrapper'] article,
+[class*='blog-wrapper'] article[class*='blogPostItem'],
+[class*='blog-wrapper'] div[class*='blogPostContent'] article {
+  min-width: 60vw !important;
+  max-width: 1400px !important;
+  margin: 2rem auto !important;
+  padding: 3rem 4rem !important;
+  display: block !important;
+}
+
+/* Blog post header - only on blog pages */
+[class*='blog-wrapper'] header[class*='blogPostHeader'] {
+  max-width: 1400px !important;
+  margin: 0 auto !important;
+}
+
+/* Blog list page optimization - only on blog pages */
+[class*='blog-wrapper'] .margin-vert--lg,
+[class*='blog-wrapper'] div[class*='blogListPage'] {
+  max-width: 1400px !important;
+  margin: 2rem auto !important;
+  width: 100% !important;
+}
+
+/* Ensure blog post items are centered - only on blog pages */
+[class*='blog-wrapper'] .blogPostItem,
+[class*='blog-wrapper'] div[class*='blogPostItem'] {
+  max-width: 1400px !important;
+  margin: 0 auto 2rem auto !important;
+}
+
+/* Center blog post content wrapper - only on blog pages */
+[class*='blog-wrapper'] div[class*='blogPostPageContent'] {
+  display: flex !important;
+  justify-content: center !important;
+  width: 100% !important;
+}
+
+/* Responsive adjustments for blog - only on blog pages */
+@media (max-width: 996px) {
+  [class*='blog-wrapper'] article,
+  [class*='blog-wrapper'] article[class*='blogPostItem'] {
+    min-width: auto !important;
+    padding: 2rem 1.5rem !important;
+  }
+
+  [class*='blog-wrapper'] .container,
+  [class*='blog-wrapper'] div[class*='blogContainer'] {
+    padding: 0 1rem !important;
+  }
+}
+
+@media (max-width: 768px) {
+  [class*='blog-wrapper'] article,
+  [class*='blog-wrapper'] article[class*='blogPostItem'] {
+    padding: 1.5rem 1rem !important;
+  }
+}