Commit ba207ab

[Feat]: VSR + public LLM/ OpenAI + local llm + istio + LLM-d deployment guide (#643)
Signed-off-by: Sanjeev Rampal <[email protected]>
1 parent 9d2512d commit ba207ab

File tree

8 files changed (+957, -3 lines)


deploy/kubernetes/istio/README.md

Lines changed: 7 additions & 3 deletions
```diff
@@ -1,7 +1,11 @@
-# vLLM Semantic Router as ExtProc server for Istio Gateway
+# vLLM Semantic Router as ExtProc server for Istio Gateway
+
+This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vSR) with Istio Gateway on Kubernetes. Istio Gateway uses Envoy under the covers, so vSR can be used with it. Istio is a common gateway choice when using the Kubernetes Gateway API Inference Extension and the LLM-D project, as well as in common Kubernetes distributions such as Red Hat OpenShift. In our experience, there are low-level differences in how different Envoy-based gateways process the ExtProc protocol to assist with LLM inference; hence this guide and some others cover the specific case of vSR working with an Istio-based gateway.
+
+There are multiple deployment guides in this repo related to vSR + Istio deployments. This document describes deployment of vSR with Istio Gateway and two local LLMs served using vLLM. Additional deployment guides in this repo build on this deployment to add support for integrating LLM-D and to illustrate routing to remote/public cloud LLMs; those topics are covered by follow-up guides ([llm-d guide](../llmd-base/README.md) and [public llm routing guide](../llmd-base/llmd+public-llm/README.md)).
+
+With that background context in mind, this guide describes the vSR + Istio + locally hosted LLMs use case. Afterwards, the reader may optionally follow the additional guides linked above to deploy the more advanced use cases.
 
-This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vsr) with Istio Gateway on Kubernetes. Istio Gateway uses Envoy under the covers so it is possible to use vsr with it. However there are differences between how different Envoy based Gateways process the ExtProc protocol, hence the deployment described here is different from the deployment of vsr alongwith other types of Envoy based Gateways as described in the other guides in this repo. There are multiple architecture options possible to combine Istio Gateway with vsr. This document describes one of the options.
-
 ## Architecture Overview
 
 The deployment consists of:
```
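Once the steps in the guide are complete, a quick way to exercise the routing path is an OpenAI-style chat request sent through the gateway. The sketch below only builds the request; the gateway hostname is an assumption (substitute your Istio gateway address), and the `"auto"` model alias comes from the config's `auto_model_name` default:

```python
# Illustrative smoke test for the vSR + Istio deployment.
# GATEWAY_HOST is an assumed placeholder, not a value from this repo.
import json
import urllib.request

GATEWAY_HOST = "http://inference-gateway.example.local"  # assumed gateway address

def build_chat_request(prompt: str, model: str = "auto") -> urllib.request.Request:
    """Build an OpenAI-compatible /v1/chat/completions request.

    vSR inspects the prompt and rewrites the model name to the
    best-matching backend (e.g. llama3-8b or phi4-mini)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{GATEWAY_HOST}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("What is 2 + 2?")
# urllib.request.urlopen(req)  # uncomment when running inside the cluster
```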

deploy/kubernetes/llmd-base/llmd+public-llm/README.md

Lines changed: 402 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 240 additions & 0 deletions
@@ -0,0 +1,240 @@
```yaml
bert_model:
  model_id: models/all-MiniLM-L12-v2
  threshold: 0.6
  use_cpu: true

semantic_cache:
  enabled: false
  backend_type: "memory" # Options: "memory" or "milvus"
  similarity_threshold: 0.8
  max_entries: 1000 # Only applies to memory backend
  ttl_seconds: 3600
  eviction_policy: "fifo"
  # Embedding model for semantic similarity matching
  # Options: "bert" (fast, 384-dim), "qwen3" (high quality, 1024-dim, 32K context), "gemma" (balanced, 768-dim, 8K context)
  embedding_model: "bert" # Default: BERT (fastest, lowest memory for Kubernetes)
```
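The `semantic_cache` settings can be read as: a cache hit is any cached prompt whose embedding similarity clears `similarity_threshold`, and the memory backend evicts in FIFO order once `max_entries` is reached. A minimal sketch of that behavior (illustrative only, not the router's implementation):

```python
# Sketch of semantic-cache semantics: cosine similarity against cached
# prompt embeddings, a threshold gate, and FIFO eviction.
import math
from collections import OrderedDict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, similarity_threshold=0.8, max_entries=1000):
        self.threshold = similarity_threshold
        self.max_entries = max_entries
        self.entries = OrderedDict()  # embedding tuple -> cached response

    def get(self, embedding):
        for cached_emb, response in self.entries.items():
            if cosine(embedding, cached_emb) >= self.threshold:
                return response  # cache hit: skip the LLM call
        return None

    def put(self, embedding, response):
        if len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)  # FIFO eviction: drop oldest
        self.entries[tuple(embedding)] = response
```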
```yaml
tools:
  enabled: false
  top_k: 3
  similarity_threshold: 0.2
  tools_db_path: "config/tools_db.json"
  fallback_to_empty: true
```
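The tool-selection knobs above suggest the following shape: keep tools whose similarity to the prompt clears `similarity_threshold`, return at most `top_k`, and with `fallback_to_empty` return an empty list rather than an error when nothing matches. A sketch under those assumed semantics:

```python
# Illustrative tool selection per the tools: settings (assumed semantics).
def select_tools(scored_tools, top_k=3, similarity_threshold=0.2,
                 fallback_to_empty=True):
    """scored_tools: list of (tool_name, similarity) pairs."""
    eligible = [(name, s) for name, s in scored_tools if s >= similarity_threshold]
    eligible.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    if not eligible and not fallback_to_empty:
        raise LookupError("no tool matched the prompt")
    return [name for name, _ in eligible[:top_k]]
```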
```yaml
prompt_guard:
  enabled: false # Global default - can be overridden per category with jailbreak_enabled
  use_modernbert: true
  model_id: "models/jailbreak_classifier_modernbert-base_model"
  threshold: 0.7
  use_cpu: true
  jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json"
```
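Combined with the per-category `jailbreak_enabled` / `jailbreak_threshold` overrides shown later in the categories section, the guard gate reduces to a simple check; the function below is an illustrative reading of that logic, not the router's code:

```python
# Sketch of the prompt_guard decision with per-category overrides.
def should_block(score, category_cfg, global_enabled=False, global_threshold=0.7):
    """score: jailbreak classifier confidence for the prompt."""
    enabled = category_cfg.get("jailbreak_enabled", global_enabled)
    threshold = category_cfg.get("jailbreak_threshold", global_threshold)
    return bool(enabled and score >= threshold)
```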
```yaml
# vLLM Endpoints Configuration
# IMPORTANT: 'address' field must be a valid IP address (IPv4 or IPv6)
# Supported formats: 127.0.0.1, 192.168.1.1, ::1, 2001:db8::1
# NOT supported: domain names (example.com), protocol prefixes (http://), paths (/api), ports in address (use 'port' field)
vllm_endpoints:
  - name: "endpoint1"
    address: "10.98.150.102" # Static IPv4 of llama3-8b k8s service
    port: 80
    weight: 1
  - name: "endpoint2"
    address: "10.98.118.242" # Static IPv4 of phi4-mini k8s service
    port: 80
    weight: 1
```
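The `address` constraint above (bare IPv4/IPv6 only; no domains, schemes, paths, or inline ports) can be checked with the standard library before applying the config. This validator is an illustration, not the router's own validation:

```python
# Validate an 'address' value per the vllm_endpoints constraint above.
import ipaddress

def is_valid_endpoint_address(address: str) -> bool:
    try:
        ipaddress.ip_address(address)
        return True
    except ValueError:  # domains, schemes, paths, and host:port all fail here
        return False
```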
```yaml
model_config:
  "llama3-8b":
    # reasoning_family: "" # Set if this model uses a reasoning syntax (e.g. qwen3)
    preferred_endpoints: ["endpoint1"]
    pii_policy:
      allow_by_default: true
  "phi4-mini":
    # reasoning_family: "" # Set if this model uses a reasoning syntax (e.g. qwen3)
    preferred_endpoints: ["endpoint2"]
    pii_policy:
      allow_by_default: true
```
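Reading `model_config` together with `vllm_endpoints`, a model name resolves to a backend address via its `preferred_endpoints`. A sketch of that lookup using the values above (illustrative, not the router's implementation):

```python
# Resolve a model name to a backend address:port via preferred_endpoints.
ENDPOINTS = {
    "endpoint1": {"address": "10.98.150.102", "port": 80},
    "endpoint2": {"address": "10.98.118.242", "port": 80},
}
MODEL_CONFIG = {
    "llama3-8b": {"preferred_endpoints": ["endpoint1"]},
    "phi4-mini": {"preferred_endpoints": ["endpoint2"]},
}

def resolve_backend(model: str) -> str:
    name = MODEL_CONFIG[model]["preferred_endpoints"][0]  # first preference
    ep = ENDPOINTS[name]
    return f'{ep["address"]}:{ep["port"]}'
```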
```yaml
# Classifier configuration
classifier:
  category_model:
    model_id: "models/category_classifier_modernbert-base_model"
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json"
  pii_model:
    model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
    use_modernbert: true
    threshold: 0.7
    use_cpu: true
    pii_mapping_path: "models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json"
```
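The category classifier's `threshold: 0.6` implies that low-confidence predictions should not drive routing. One plausible reading (the exact fallback behavior is an assumption here; a catch-all `"other"` category does exist in the config below):

```python
# Illustrative thresholding of the category classifier's output.
def pick_category(predictions, threshold=0.6, fallback="other"):
    """predictions: dict of category -> confidence."""
    best, confidence = max(predictions.items(), key=lambda kv: kv[1])
    return best if confidence >= threshold else fallback
```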
```yaml
# Categories with new use_reasoning field structure
categories:
  - name: business
    system_prompt: "You are a senior business consultant and strategic advisor with expertise in corporate strategy, operations management, financial analysis, marketing, and organizational development. Provide practical, actionable business advice backed by proven methodologies and industry best practices. Consider market dynamics, competitive landscape, and stakeholder interests in your recommendations."
    # jailbreak_enabled: true # Optional: Override global jailbreak detection per category
    # jailbreak_threshold: 0.8 # Optional: Override global jailbreak threshold per category
    model_scores:
      - model: llama3-8b
        score: 0.8
        use_reasoning: false # Business performs better without reasoning
      - model: phi4-mini
        score: 0.3
        use_reasoning: false # Business performs better without reasoning
  - name: law
    system_prompt: "You are a knowledgeable legal expert with comprehensive understanding of legal principles, case law, statutory interpretation, and legal procedures across multiple jurisdictions. Provide accurate legal information and analysis while clearly stating that your responses are for informational purposes only and do not constitute legal advice. Always recommend consulting with qualified legal professionals for specific legal matters."
    model_scores:
      - model: llama3-8b
        score: 0.4
        use_reasoning: false
  - name: psychology
    system_prompt: "You are a psychology expert with deep knowledge of cognitive processes, behavioral patterns, mental health, developmental psychology, social psychology, and therapeutic approaches. Provide evidence-based insights grounded in psychological research and theory. When discussing mental health topics, emphasize the importance of professional consultation and avoid providing diagnostic or therapeutic advice."
    semantic_cache_enabled: true
    semantic_cache_similarity_threshold: 0.92 # High threshold for psychology - sensitive to nuances
    model_scores:
      - model: llama3-8b
        score: 0.6
        use_reasoning: false
  - name: biology
    system_prompt: "You are a biology expert with comprehensive knowledge spanning molecular biology, genetics, cell biology, ecology, evolution, anatomy, physiology, and biotechnology. Explain biological concepts with scientific accuracy, use appropriate terminology, and provide examples from current research. Connect biological principles to real-world applications and emphasize the interconnectedness of biological systems."
    model_scores:
      - model: llama3-8b
        score: 0.9
        use_reasoning: false
  - name: chemistry
    system_prompt: "You are a chemistry expert specializing in chemical reactions, molecular structures, and laboratory techniques. Provide detailed, step-by-step explanations."
    model_scores:
      - model: llama3-8b
        score: 0.6
        use_reasoning: false # Set true to enable reasoning for complex chemistry
  - name: history
    system_prompt: "You are a historian with expertise across different time periods and cultures. Provide accurate historical context and analysis."
    model_scores:
      - model: llama3-8b
        score: 0.7
        use_reasoning: false
  - name: other
    system_prompt: "You are a helpful and knowledgeable assistant. Provide accurate, helpful responses across a wide range of topics."
    semantic_cache_enabled: true
    semantic_cache_similarity_threshold: 0.75 # Lower threshold for general chat - less sensitive
    model_scores:
      - model: llama3-8b
        score: 0.7
        use_reasoning: false
  - name: health
    system_prompt: "You are a health and medical information expert with knowledge of anatomy, physiology, diseases, treatments, preventive care, nutrition, and wellness. Provide accurate, evidence-based health information while emphasizing that your responses are for educational purposes only and should never replace professional medical advice, diagnosis, or treatment. Always encourage users to consult healthcare professionals for medical concerns and emergencies."
    semantic_cache_enabled: true
    semantic_cache_similarity_threshold: 0.95 # High threshold for health - very sensitive to word changes
    model_scores:
      - model: llama3-8b
        score: 0.5
        use_reasoning: false
  - name: economics
    system_prompt: "You are an economics expert with deep understanding of microeconomics, macroeconomics, econometrics, financial markets, monetary policy, fiscal policy, international trade, and economic theory. Analyze economic phenomena using established economic principles, provide data-driven insights, and explain complex economic concepts in accessible terms. Consider both theoretical frameworks and real-world applications in your responses."
    model_scores:
      - model: llama3-8b
        score: 1.0
        use_reasoning: false
  - name: math
    system_prompt: "You are a mathematics expert. Provide step-by-step solutions, show your work clearly, and explain mathematical concepts in an understandable way."
    model_scores:
      - model: phi4-mini
        score: 1.0
        use_reasoning: false # Set true to enable reasoning for complex math
  - name: physics
    system_prompt: "You are a physics expert with deep understanding of physical laws and phenomena. Provide clear explanations with mathematical derivations when appropriate."
    model_scores:
      - model: llama3-8b
        score: 0.7
        use_reasoning: false # Set true to enable reasoning for physics
  - name: computer science
    system_prompt: "You are a computer science expert with knowledge of algorithms, data structures, programming languages, and software engineering. Provide clear, practical solutions with code examples when helpful."
    model_scores:
      - model: llama3-8b
        score: 0.6
        use_reasoning: false
  - name: philosophy
    system_prompt: "You are a philosophy expert with comprehensive knowledge of philosophical traditions, ethical theories, logic, metaphysics, epistemology, political philosophy, and the history of philosophical thought. Engage with complex philosophical questions by presenting multiple perspectives, analyzing arguments rigorously, and encouraging critical thinking. Draw connections between philosophical concepts and contemporary issues while maintaining intellectual honesty about the complexity and ongoing nature of philosophical debates."
    model_scores:
      - model: llama3-8b
        score: 0.5
        use_reasoning: false
  - name: engineering
    system_prompt: "You are an engineering expert with knowledge across multiple engineering disciplines including mechanical, electrical, civil, chemical, software, and systems engineering. Apply engineering principles, design methodologies, and problem-solving approaches to provide practical solutions. Consider safety, efficiency, sustainability, and cost-effectiveness in your recommendations. Use technical precision while explaining concepts clearly, and emphasize the importance of proper engineering practices and standards."
    model_scores:
      - model: llama3-8b
        score: 0.7
        use_reasoning: false
```
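Each category's `model_scores` ranks the candidate models; routing can be read as "highest score wins, carrying its `use_reasoning` flag, with `default_model` as the fallback for unmatched categories." A sketch under those assumed semantics, using two categories from the config above:

```python
# Illustrative routing over model_scores (not the router's implementation).
CATEGORIES = {
    "business": [("llama3-8b", 0.8, False), ("phi4-mini", 0.3, False)],
    "math": [("phi4-mini", 1.0, False)],
}
DEFAULT_MODEL = "llama3-8b"

def route(category):
    """Return (model, use_reasoning) for a classified category."""
    scores = CATEGORIES.get(category)
    if not scores:
        return DEFAULT_MODEL, False  # fallback to default_model
    model, _, use_reasoning = max(scores, key=lambda t: t[1])
    return model, use_reasoning
```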
```yaml
default_model: "llama3-8b"

# Auto model name for automatic model selection (optional)
# This is the model name that clients should use to trigger automatic model selection
# If not specified, defaults to "MoM" (Mixture of Models)
# For backward compatibility, "auto" is always accepted as an alias
# Example: auto_model_name: "MoM" # or any other name you prefer
# auto_model_name: "MoM"

# Include configured models in /v1/models list endpoint (optional, default: false)
# When false (default): only the auto model name is returned in the /v1/models endpoint
# When true: all models configured in model_config are also included in the /v1/models endpoint
# This is useful for clients that need to discover all available models
# Example: include_config_models_in_list: true
# include_config_models_in_list: false

# Reasoning family configurations
reasoning_families:
  deepseek:
    type: "chat_template_kwargs"
    parameter: "thinking"

  qwen3:
    type: "chat_template_kwargs"
    parameter: "enable_thinking"

  gpt-oss:
    type: "reasoning_effort"
    parameter: "reasoning_effort"
  gpt:
    type: "reasoning_effort"
    parameter: "reasoning_effort"
```
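The two `reasoning_families` types above translate naturally into different request fields: `chat_template_kwargs` families toggle a boolean template flag, while `reasoning_effort` families set an effort level. The merge logic below is illustrative; the field names follow the config:

```python
# Sketch of mapping a reasoning family onto OpenAI-style request fields.
REASONING_FAMILIES = {
    "deepseek": {"type": "chat_template_kwargs", "parameter": "thinking"},
    "qwen3": {"type": "chat_template_kwargs", "parameter": "enable_thinking"},
    "gpt-oss": {"type": "reasoning_effort", "parameter": "reasoning_effort"},
    "gpt": {"type": "reasoning_effort", "parameter": "reasoning_effort"},
}

def reasoning_fields(family, enabled=True, effort="high"):
    cfg = REASONING_FAMILIES[family]
    if cfg["type"] == "chat_template_kwargs":
        return {"chat_template_kwargs": {cfg["parameter"]: enabled}}
    return {cfg["parameter"]: effort}  # reasoning_effort style
```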
```yaml
# Global default reasoning effort level
default_reasoning_effort: high

# Gateway route cache clearing
clear_route_cache: true # Enable for some gateways such as Istio

# API Configuration
api:
  batch_classification:
    max_batch_size: 100
    concurrency_threshold: 5
    max_concurrency: 8
    metrics:
      enabled: true
      detailed_goroutine_tracking: true
      high_resolution_timing: false
      sample_rate: 1.0
      duration_buckets:
        [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
      size_buckets: [1, 2, 5, 10, 20, 50, 100, 200]
```
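`duration_buckets` and `size_buckets` are Prometheus-style histogram upper bounds: an observation is counted under the first bucket whose bound is greater than or equal to the value, with an implicit `+Inf` bucket at the end. A small sketch of that assignment (the router's metrics library does this internally):

```python
# Which duration_bucket a latency observation falls into.
import bisect

DURATION_BUCKETS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1,
                    0.25, 0.5, 1, 2.5, 5, 10, 30]

def bucket_for(duration_seconds):
    i = bisect.bisect_left(DURATION_BUCKETS, duration_seconds)
    return DURATION_BUCKETS[i] if i < len(DURATION_BUCKETS) else float("inf")
```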
```yaml
# Observability Configuration
observability:
  tracing:
    enabled: true # Enable distributed tracing for docker-compose stack
    provider: "opentelemetry" # Provider: opentelemetry, openinference, openllmetry
    exporter:
      type: "otlp" # Export spans to Jaeger (via OTLP gRPC)
      endpoint: "jaeger:4317" # Jaeger collector inside compose network
      insecure: true # Use insecure connection (no TLS)
    sampling:
      type: "always_on" # Sampling: always_on, always_off, probabilistic
      rate: 1.0 # Sampling rate for probabilistic (0.0-1.0)
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "development"
```
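The `sampling` block supports `always_on`, `always_off`, and `probabilistic` (where `rate` only matters in the probabilistic case). A minimal sketch of that head-sampling decision, illustrative rather than the OpenTelemetry SDK's actual sampler:

```python
# Head-sampling decision per the sampling: block above.
import random

def should_sample(sampling_type="always_on", rate=1.0, rng=random.random):
    if sampling_type == "always_on":
        return True
    if sampling_type == "always_off":
        return False
    return rng() < rate  # probabilistic: sample with probability `rate`
```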
