Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
269 changes: 269 additions & 0 deletions website/blog/2025-10-16-mom-family.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,269 @@
---
slug: mom-family
title: "MoM: Specialized Models for Intelligent Routing"
authors: [Xunzhuo]
tags: [mom, models, routing, announcement]
---

![MoM Family](/img/mom-family.png)

**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models)—a family of specialized routing models that power vLLM-SR's intelligent decision-making.

<!-- truncate -->

## Why MoM?

vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources—"What's the weather?" shouldn't cost as much as "Analyze this legal contract."

## MoM System Card

A quick overview of all MoM models:

| Category | Model | Size | Architecture | Base Model | Purpose |
|----------|-------|------|--------------|------------|---------|
| **🧠 Intelligent Routing** | mom-brain-flash | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
| | mom-brain-pro | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
| | mom-brain-max | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
| **🔍 Similarity Search** | mom-similarity-flash | Flash | Encoder | ModernBERT | Semantic similarity matching |
| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
| | mom-pii-flash | Flash | Encoder | ModernBERT | PII detection & privacy protection |
| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
| | mom-expert-science-flash | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
| | mom-expert-social-flash | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
| | mom-expert-humanities-flash | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
| | mom-expert-law-flash | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
| | mom-expert-generalist-flash | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |

**Key Insights:**

- **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing
- **Qwen3** (decoder-only) → Explainable routing decisions + domain-specific problem solving
- **Flash** models achieve 10,000+ QPS on commodity hardware
- **SLM Experts** are not routers—they are specialized backend models that solve domain-specific problems

## The Evolution: From Encoder-Only to Mixture-of-Models

### Where We Started: ModernBERT Foundation

vLLM-SR initially built its routing intelligence entirely on **ModernBERT** (encoder-only models):

**Advantages**:

- ⚡ **Blazing fast**: Sub-10ms inference latency
- 📊 **High throughput**: 10,000+ QPS on commodity hardware
- 💰 **Cost-effective**: Minimal compute requirements
- 🎯 **Proven accuracy**: Strong performance on classification tasks

**Limitations**:

- ❌ **Black box decisions**: No explanation for routing choices
- ❌ **Limited reasoning**: Cannot handle complex, multi-step logic
- ❌ **Fixed capabilities**: Hard to extend with new behaviors
- ❌ **No tool integration**: Cannot leverage external tools or APIs

### Why We're Evolving: Decoder-Only Models

As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements:

- **Explainability**: Users need to understand *why* a query was routed to a specific model
- **Complex reasoning**: Some routing decisions require multi-step analysis
- **Agentic workflows**: Integration with tool calling, function execution, and external APIs
- **Advanced techniques**: Reinforcement learning (RL), sophisticated post-training methods
- **Domain expertise**: Specialized routing for legal, medical, scientific domains

**The Solution**: Expand to decoder-only models while keeping encoder speed where it matters.

### The MoM Architecture: Best of Both Worlds

**Mixture-of-Models (MoM)** is both a philosophy and an architecture:

1. **Backend LLM Architecture** — Route requests to the optimal downstream model (GPT-4, Claude, Llama, etc.)
2. **Router Internal Design** — The router itself uses multiple specialized models working together

Our MoM approach combines encoder and decoder strengths:

- ⚡ **Encoders (ModernBERT)** — Fast classification (sub-10ms latency) for high-throughput scenarios
- 🧠 **Decoders (Qwen3)** — Explainable decisions with reasoning for transparency
- 🎯 **Domain Agents (Qwen3)** — Expert problem solving with specialized knowledge

This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.

**Key Insight**: Just as vLLM-SR routes to different backend LLMs, the router itself is powered by a mixture of specialized models—each optimized for specific routing tasks (security, similarity, intent classification, domain expertise).

## The MoM Model Family

We organize MoM models into **four categories** with **three size variants** (Flash, Pro, Max):

### 🧠 Intelligent Routing

Smart routing models with three size variants:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-brain-flash** | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
| **mom-brain-pro** | Pro | Qwen3 0.6B | Balanced performance with reasoning capabilities |
| **mom-brain-max** | Max | Qwen3 1.7B | Maximum accuracy for complex routing decisions |

**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen3 0.6B and 1.7B (decoder-only) models.

### 🔍 Similarity Search

Semantic similarity and vector search:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-similarity-flash** | Flash | ModernBERT | Fast semantic similarity matching for route selection |

**Architecture**: Based on ModernBERT (encoder-only) for high-speed embedding generation.

### 🔒 Prompt Guardian

Security and safety checks before routing:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-jailbreak-flash** | Flash | ModernBERT | Jailbreak/attack detection (security) |
| **mom-pii-flash** | Flash | ModernBERT | PII detection (privacy protection) |

**Architecture**: Both based on ModernBERT (encoder-only) for ultra-fast security checks.

### 🎯 SLM Experts

Specialized small language models deployed as **backend problem solvers**:

| Model | Size | Base Model | Domain | Training Data |
|-------|------|------------|--------|---------------|
| **mom-expert-math-flash** | Flash | Qwen3 0.6B | Mathematics | GSM8K, MATH |
| **mom-expert-science-flash** | Flash | Qwen3 0.6B | Science | ARC-Challenge, OpenBookQA, SciQ |
| **mom-expert-social-flash** | Flash | Qwen3 0.6B | Social Sciences | CommonsenseQA, StrategyQA |
| **mom-expert-humanities-flash** | Flash | Qwen3 0.6B | Humanities | TruthfulQA, MMLU-train subset |
| **mom-expert-law-flash** | Flash | Qwen3 0.6B | Law | MMLU-train law subset + specialized sources |
| **mom-expert-generalist-flash** | Flash | Qwen3 0.6B | Generalist | Mixed from above domains |

**Architecture**: All based on Qwen3 0.6B (decoder-only) for domain-specific problem solving. Currently only Flash variants are available.

**Purpose**: These models are **not routers**—they are deployed as backend LLMs to solve domain-specific problems. They form part of the Mixture-of-Models backend architecture that vLLM-SR routes to.

## Design Principles

**Safety-First**: Prompt Guardian models (PII, jailbreak detection) run before routing—security at the edge.

**Speed ↔ Capability**: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.

**Domain Expertise**: SLM Expert models are deployed as backend problem solvers, achieving 15-25% better accuracy on domain-specific tasks vs. generalist LLMs. Math problems are solved by math experts, science questions by science experts, etc.

## How vLLM-SR Uses MoM

MoM operates at **two levels** in vLLM-SR:

### Level 1: Router Internal Architecture (MoM Inside)

The router itself is a mixture of specialized models working together in a pipeline:

1. **Security Check** → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
2. **Intent Classification** → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
3. **Similarity Search** → `mom-similarity-flash` finds semantically similar routes

Each stage uses the **right model for the right task**: fast encoders for security checks, reasoning decoders for complex decisions.

### Level 2: Backend LLM Orchestration (MoM Outside)

The router then directs requests to the optimal backend LLM from a mixture of models:

**General-Purpose LLMs**:

- **Simple queries** → Lightweight models (Llama 3.2, Qwen3 2.5)
- **Complex queries** → Premium models (GPT-4, Claude 3.5)

**Domain-Specific SLM Experts** (`mom-expert-*`):

- **Math problems** → `mom-expert-math-flash` (Qwen3 0.6B trained on GSM8K, MATH)
- **Science questions** → `mom-expert-science-flash` (Qwen3 0.6B trained on ARC, SciQ)
- **Social sciences** → `mom-expert-social-flash` (Qwen3 0.6B on CommonsenseQA, StrategyQA)
- **Humanities** → `mom-expert-humanities-flash` (Qwen3 0.6B on TruthfulQA, MMLU)
- **Legal queries** → `mom-expert-law-flash` (Qwen3 0.6B on MMLU law + specialized sources)
- **General tasks** → `mom-expert-generalist-flash` (Qwen3 0.6B on mixed training)

This dual-level MoM architecture achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).

**The Philosophy**: Mixture-of-Models all the way down—from the router's internal decision-making to the backend LLM pool (including both general-purpose LLMs and specialized SLM experts).

## What's Next: Exploring Frontier Techniques

The move to decoder-only models opens exciting possibilities for vLLM-SR:

### 🤖 Agentic Routing

Decoder models can act as intelligent agents that:

- Dynamically select and orchestrate multiple models
- Make multi-step routing decisions with tool calling
- Adapt routing strategies based on feedback

### 🎯 Reinforcement Learning (RL)

Apply RL techniques to optimize routing decisions:

- Learn from user feedback and model performance
- Discover optimal routing policies through trial and error
- Continuously improve cost-quality trade-offs

### 🔧 Advanced Post-Training

Leverage cutting-edge post-training methods:

- **Distillation**: Transfer knowledge from large models to efficient routers
- **Preference learning**: Train on human feedback (RLHF, DPO)
- **Domain adaptation**: Fine-tune for specific industries or use cases

### 🛠️ Tool Integration

Enable routers to:

- Call external APIs for context-aware routing
- Query databases for historical routing patterns
- Integrate with monitoring systems for real-time optimization

**The vision**: vLLM-SR routers that not only classify but *reason*, *learn*, and *adapt*.

## Model Naming Convention

```text
mom-{category}-{size}
mom-expert-{domain}-{size}
```

### Four Categories

1. **Intelligent Routing**: `mom-brain-{flash|pro|max}`
2. **Similarity Search**: `mom-similarity-{flash}`
3. **Prompt Guardian**: `mom-{jailbreak|pii}-{flash}`
4. **SLM Experts**: `mom-expert-{domain}-{flash}` where domain = `{math|science|social|humanities|law|generalist}`

### Three Size Variants

- **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen3 0.6B (for experts) — fastest, sub-10ms latency
- **pro**: Qwen3 0.6B (for brain) — balanced performance with reasoning
- **max**: Qwen3 1.7B (for brain) — maximum accuracy and capabilities

### Architecture Summary

- **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen3 0.6B/1.7B)
- **Similarity Search**: Flash (ModernBERT)
- **Prompt Guardian**: Flash (ModernBERT)
- **SLM Experts**: Flash only (Qwen3 0.6B) — 6 domain specialists

## Get Started

All MoM models are available on [Hugging Face](https://huggingface.co/LLM-Semantic-Router).

**Resources**:

- [GitHub](https://github.com/vllm-project/semantic-router)
- [Documentation](https://vllm-semantic-router.com)
- [Quick Start Guide](https://vllm-semantic-router.com/docs/installation)

---

**vLLM-SR · Route with intent. Think with reason.**
104 changes: 104 additions & 0 deletions website/src/css/custom.css
Original file line number Diff line number Diff line change
Expand Up @@ -779,3 +779,107 @@ td, th {
width: 100% !important;
}
}

/* ============================================
Blog Page Optimizations - Wider Content
============================================ */

/* Only apply blog optimizations to blog pages - use more specific selectors */

/* Hide blog sidebar (left side posts list) - only on blog pages */
[class*='blog-wrapper'] aside[class*='blogSidebar'],
[class*='blog-wrapper'] aside.col--3,
[class*='blog-wrapper'] .theme-blog-sidebar {
display: none !important;
}

/* Hide table of contents (right side) - only on blog pages, not docs */
[class*='blog-wrapper'] .theme-doc-toc-desktop,
[class*='blog-wrapper'] .table-of-contents,
[class*='blog-wrapper'] div[class*='tableOfContents'],
[class*='blog-wrapper'] div[class*='tocCollapsible'] {
display: none !important;
}

/* Expand blog content row to full width - only on blog pages */
[class*='blog-wrapper'] div[class*='blogContainer'] .row,
[class*='blog-wrapper'] .row {
justify-content: center !important;
margin: 0 auto !important;
}

/* Expand blog content column to use full width - only on blog pages */
[class*='blog-wrapper'] .col--7,
[class*='blog-wrapper'] div[class*='blogPostContent'] {
max-width: 100% !important;
flex: 0 0 100% !important;
margin: 0 auto !important;
}

/* Center blog content container - wider layout - only on blog pages */
[class*='blog-wrapper'] .container,
[class*='blog-wrapper'] div[class*='blogContainer'] {
max-width: 1600px !important;
margin: 0 auto !important;
padding: 0 3rem !important;
}

/* Blog post content - centered and wide - only on blog pages */
[class*='blog-wrapper'] article,
[class*='blog-wrapper'] article[class*='blogPostItem'],
[class*='blog-wrapper'] div[class*='blogPostContent'] article {
min-width: 60vw !important;
max-width: 1400px !important;
margin: 2rem auto !important;
padding: 3rem 4rem !important;
display: block !important;
}

/* Blog post header - only on blog pages */
[class*='blog-wrapper'] header[class*='blogPostHeader'] {
max-width: 1400px !important;
margin: 0 auto !important;
}

/* Blog list page optimization - only on blog pages */
[class*='blog-wrapper'] .margin-vert--lg,
[class*='blog-wrapper'] div[class*='blogListPage'] {
max-width: 1400px !important;
margin: 2rem auto !important;
width: 100% !important;
}

/* Ensure blog post items are centered - only on blog pages */
[class*='blog-wrapper'] .blogPostItem,
[class*='blog-wrapper'] div[class*='blogPostItem'] {
max-width: 1400px !important;
margin: 0 auto 2rem auto !important;
}

/* Center blog post content wrapper - only on blog pages */
[class*='blog-wrapper'] div[class*='blogPostPageContent'] {
display: flex !important;
justify-content: center !important;
width: 100% !important;
}

/* Responsive adjustments for blog - only on blog pages */
@media (max-width: 996px) {
[class*='blog-wrapper'] article,
[class*='blog-wrapper'] article[class*='blogPostItem'] {
min-width: auto !important;
padding: 2rem 1.5rem !important;
}

[class*='blog-wrapper'] .container,
[class*='blog-wrapper'] div[class*='blogContainer'] {
padding: 0 1rem !important;
}
}

@media (max-width: 768px) {
[class*='blog-wrapper'] article,
[class*='blog-wrapper'] article[class*='blogPostItem'] {
padding: 1.5rem 1rem !important;
}
}
Binary file added website/static/img/mom-family.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading