Commit ca9d2f7

📖 add user guides for metrics & update getting started docs
1 parent a57ab3c commit ca9d2f7

File tree

6 files changed: +1409 −6 lines changed

docs/getting-started/installation.md

Lines changed: 14 additions & 1 deletion
@@ -38,9 +38,13 @@ pip install vllm-judge
 ```
 
 This installs the essential dependencies:
+
 - `httpx` - Async HTTP client
+
 - `pydantic` - Data validation
+
 - `tenacity` - Retry logic
+
 - `click` - CLI interface
 
 ### Optional Features
@@ -54,8 +58,11 @@ pip install vllm-judge[api]
 ```
 
 This adds:
+
 - `fastapi` - Web framework
+
 - `uvicorn` - ASGI server
+
 - `websockets` - WebSocket support
 
 #### Jinja2 Templates
@@ -148,4 +155,10 @@ conda activate vllm-judge
 
 # Install vLLM Judge
 pip install vllm-judge
-```
+```
+
+## 🎉 Next Steps
+
+Congratulations! You've successfully installed vLLM Judge and are ready to run some evals. Here's what to explore next:
+
+- **[Quick Start](quickstart.md)** - Get up and running with vLLM Judge in 5 minutes!

docs/getting-started/quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -246,4 +246,4 @@ Congratulations! You've learned the basics of vLLM Judge. Here's what to explore
 1. **[Basic Evaluation Guide](../guide/basic-evaluation.md)** - Deep dive into evaluation options
 2. **[Using Metrics](../guide/metrics.md)** - Explore all pre-built metrics
 3. **[Template Variables](../guide/templates.md)** - Advanced templating features
-4. **[API Server](../api/server.md)** - Deploy Judge as a service
+<!-- 4. **[API Server](../api/server.md)** - Deploy Judge as a service -->

docs/guide/basic-evaluation.md

Lines changed: 347 additions & 0 deletions
@@ -0,0 +1,347 @@
# Basic Evaluation Guide

This guide covers the fundamental evaluation capabilities of vLLM Judge, progressing from simple to advanced usage.

## Understanding the Universal Interface

vLLM Judge uses a single `evaluate()` method that adapts to your needs:

```python
result = await judge.evaluate(
    response="...",  # What to evaluate
    criteria="...",  # What to evaluate for
    # Optional parameters to control evaluation
)
```

The method automatically determines the evaluation type based on what you provide.

## Level 1: Simple Criteria-Based Evaluation

The simplest form - just provide text and criteria:

```python
# Basic evaluation
result = await judge.evaluate(
    response="The Earth is the third planet from the Sun.",
    criteria="scientific accuracy"
)

# Multiple criteria
result = await judge.evaluate(
    response="Dear customer, thank you for your feedback...",
    criteria="professionalism, empathy, and clarity"
)
```

**What happens behind the scenes:**

- Judge creates a prompt asking to evaluate the response based on your criteria
- The LLM provides a score (typically 1-10) and reasoning
- You get a structured result with `decision`, `reasoning`, and `score`

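The shape of that structured result can be sketched with a minimal stand-in class (this is illustrative only; the actual result class in vllm-judge may differ, though the field names follow the guide above):

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class EvaluationResult:
    """Minimal stand-in for the result object; the real class may differ."""
    decision: Union[str, int, float]  # numeric score or category label
    reasoning: str                    # the judge's explanation
    score: Optional[float]            # numeric score, or None for pure classification

# A result like the scientific-accuracy example above might look like:
result = EvaluationResult(decision=8, reasoning="Accurate and concise.", score=8.0)

# Downstream code can branch on the fields directly:
passed = result.score is not None and result.score >= 7.0
```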
## Level 2: Adding Structure with Scales and Rubrics

### Numeric Scales

Control the scoring range:

```python
# 5-point scale
result = await judge.evaluate(
    response="The product works as advertised.",
    criteria="review helpfulness",
    scale=(1, 5)
)

# 100-point scale for fine-grained scoring
result = await judge.evaluate(
    response=essay_text,
    criteria="writing quality",
    scale=(0, 100)
)
```

### String Rubrics

Provide evaluation guidance as text:

```python
result = await judge.evaluate(
    response="I hate this product!",
    criteria="sentiment analysis",
    rubric="Classify as 'positive', 'neutral', or 'negative' based on emotional tone"
)
# Result: decision="negative", score=None
```

### Detailed Rubrics

Define specific score meanings:

```python
result = await judge.evaluate(
    response=code_snippet,
    criteria="code quality",
    scale=(1, 10),
    rubric={
        10: "Production-ready, follows all best practices",
        8: "High quality with minor improvements possible",
        6: "Functional but needs refactoring",
        4: "Works but has significant issues",
        2: "Barely functional with major problems",
        1: "Broken or completely incorrect"
    }
)
```

## Level 3: Comparison Evaluations

Compare two responses by providing a dictionary:

```python
# Compare two responses
result = await judge.evaluate(
    response={
        "a": "Python is great for beginners due to its simple syntax.",
        "b": "Python's intuitive syntax makes it ideal for newcomers."
    },
    criteria="clarity and informativeness"
)
# Result: decision="response_a" or "response_b"

# With additional context
result = await judge.evaluate(
    response={
        "a": customer_response_1,
        "b": customer_response_2
    },
    criteria="helpfulness and professionalism",
    context="Customer asked about refund policy"
)
```

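Since a comparison decision comes back as `"response_a"` or `"response_b"`, a small helper can map it back to the winning text. This helper is hypothetical (not part of vllm-judge) and assumes decisions follow the `response_<key>` pattern shown above:

```python
def pick_winner(responses: dict, decision: str) -> str:
    """Map a decision like 'response_a' back to the original response text.

    Hypothetical convenience helper; assumes the 'response_<key>' naming
    convention shown in the comparison example above.
    """
    key = decision.removeprefix("response_")  # 'response_a' -> 'a'
    return responses[key]

responses = {
    "a": "Python is great for beginners due to its simple syntax.",
    "b": "Python's intuitive syntax makes it ideal for newcomers.",
}
winner = pick_winner(responses, "response_b")
```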
## Level 4: Adding Context and Examples

### Providing Context

Add context to improve evaluation accuracy:

```python
result = await judge.evaluate(
    response="Just use the default settings.",
    criteria="helpfulness",
    context="User asked how to configure advanced security settings"
)
# Low score due to dismissive response to specific question
```

### Few-Shot Examples

Guide the evaluation with examples:

```python
result = await judge.evaluate(
    response="Your code has a bug on line 5.",
    criteria="constructive feedback quality",
    scale=(1, 10),
    examples=[
        {
            "response": "This doesn't work. Fix it.",
            "score": 2,
            "reasoning": "Too vague and dismissive"
        },
        {
            "response": "Line 5 has a syntax error. Try adding a closing parenthesis.",
            "score": 8,
            "reasoning": "Specific, actionable, and helpful"
        }
    ]
)
```

## Level 5: Custom System Prompts

Take full control of the evaluator's persona:

```python
# Expert evaluator
result = await judge.evaluate(
    response=medical_advice,
    criteria="medical accuracy and safety",
    system_prompt="""You are a licensed medical professional reviewing
    health information for accuracy and potential harm. Be extremely
    cautious about unsafe advice."""
)

# Specific domain expert
result = await judge.evaluate(
    response=legal_document,
    criteria="legal compliance",
    system_prompt="""You are a corporate lawyer specializing in GDPR
    compliance. Evaluate for regulatory adherence."""
)
```

## Understanding Output Types

### Numeric Scores

When you provide a scale, you get numeric scoring:

```python
result = await judge.evaluate(
    response="Great product!",
    criteria="review quality",
    scale=(1, 5)
)
# decision: 4 (numeric)
# score: 4.0
# reasoning: "Brief but positive..."
```

### Classifications

Without a scale but with a category rubric:

```python
result = await judge.evaluate(
    response="This might be considered offensive.",
    criteria="content moderation",
    rubric="Classify as 'safe', 'warning', or 'unsafe'"
)
# decision: "warning" (string)
# score: None
# reasoning: "Contains potentially sensitive content..."
```

### Binary Decisions

For yes/no evaluations:

```python
result = await judge.evaluate(
    response=user_message,
    criteria="spam detection",
    rubric="Determine if this is 'spam' or 'not spam'"
)
# decision: "not spam"
# score: None
```

### Mixed Evaluation

You can request both classification and scoring:

```python
result = await judge.evaluate(
    response=essay,
    criteria="academic quality",
    rubric="""
    Grade the essay:
    - 'A' (90-100): Exceptional work
    - 'B' (80-89): Good work
    - 'C' (70-79): Satisfactory
    - 'D' (60-69): Below average
    - 'F' (0-59): Failing

    Provide both letter grade and numeric score.
    """
)
# decision: "B"
# score: 85.0
# reasoning: "Well-structured argument with minor issues..."
```

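With a grade-band rubric like the one above, you can sanity-check that the returned letter grade and numeric score agree. The helper below is illustrative only (not part of vllm-judge); the bands are copied from the rubric:

```python
# Grade bands from the rubric above, as (low, high) inclusive ranges.
GRADE_BANDS = {
    "A": (90, 100),
    "B": (80, 89),
    "C": (70, 79),
    "D": (60, 69),
    "F": (0, 59),
}

def grade_matches_score(decision: str, score: float) -> bool:
    """Return True when the letter grade and numeric score fall in the same band."""
    low, high = GRADE_BANDS[decision]
    return low <= score <= high

# Checking the mixed-evaluation example above (decision="B", score=85.0):
consistent = grade_matches_score("B", 85.0)
```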
## Common Patterns

### Quality Assurance

```python
async def qa_check(response: str, threshold: float = 7.0):
    """Check if response meets quality threshold."""
    result = await judge.evaluate(
        response=response,
        criteria="helpfulness, accuracy, and professionalism",
        scale=(1, 10)
    )

    passed = result.score >= threshold
    return {
        "passed": passed,
        "score": result.score,
        "feedback": result.reasoning,
        "improve": None if passed else "Consider improving: " + result.reasoning
    }
```

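To exercise a pattern like `qa_check` without a live model server, the judge can be stubbed out. The stub below is purely illustrative (the real `Judge` client comes from vllm-judge and calls an actual LLM):

```python
import asyncio
from types import SimpleNamespace

class StubJudge:
    """Illustrative stand-in that returns a fixed result instead of calling an LLM."""
    async def evaluate(self, **kwargs):
        return SimpleNamespace(
            decision=8,
            score=8.0,
            reasoning="Clear, accurate, and professional.",
        )

judge = StubJudge()

async def qa_check(response: str, threshold: float = 7.0):
    """Same pattern as above, run against the stub."""
    result = await judge.evaluate(
        response=response,
        criteria="helpfulness, accuracy, and professionalism",
        scale=(1, 10),
    )
    passed = result.score >= threshold
    return {
        "passed": passed,
        "score": result.score,
        "feedback": result.reasoning,
        "improve": None if passed else "Consider improving: " + result.reasoning,
    }

outcome = asyncio.run(qa_check("Thanks for reaching out! Here's how to fix it..."))
```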
### A/B Testing

```python
async def compare_models(prompt: str, response_a: str, response_b: str):
    """Compare two model responses."""
    result = await judge.evaluate(
        response={"a": response_a, "b": response_b},
        criteria="helpfulness, accuracy, and clarity",
        context=f"User prompt: {prompt}"
    )

    return {
        "winner": result.decision,
        "reason": result.reasoning,
        "prompt": prompt
    }
```

### Multi-Aspect Evaluation

```python
async def comprehensive_evaluation(content: str):
    """Evaluate content on multiple dimensions."""
    aspects = {
        "accuracy": "factual correctness",
        "clarity": "ease of understanding",
        "completeness": "thoroughness of coverage",
        "engagement": "interesting and engaging presentation"
    }

    results = {}
    for aspect, criteria in aspects.items():
        result = await judge.evaluate(
            response=content,
            criteria=criteria,
            scale=(1, 10)
        )
        results[aspect] = {
            "score": result.score,
            "feedback": result.reasoning
        }

    # Calculate overall score
    avg_score = sum(r["score"] for r in results.values()) / len(results)
    results["overall"] = avg_score

    return results
```

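The loop above awaits each aspect in turn; if your judge client can handle concurrent calls, the per-aspect evaluations can be issued together with `asyncio.gather`. The sketch below uses a stubbed judge (illustrative only; the real async client comes from vllm-judge, and concurrent safety is an assumption you should verify):

```python
import asyncio
from types import SimpleNamespace

class StubJudge:
    """Illustrative stand-in for the real vllm-judge async client."""
    async def evaluate(self, **kwargs):
        return SimpleNamespace(score=8.0, reasoning="Solid coverage.")

judge = StubJudge()

async def comprehensive_evaluation(content: str):
    aspects = {
        "accuracy": "factual correctness",
        "clarity": "ease of understanding",
        "completeness": "thoroughness of coverage",
    }
    # Issue all per-aspect evaluations at once instead of awaiting each in turn.
    scores = await asyncio.gather(*(
        judge.evaluate(response=content, criteria=criteria, scale=(1, 10))
        for criteria in aspects.values()
    ))
    results = {
        aspect: {"score": r.score, "feedback": r.reasoning}
        for aspect, r in zip(aspects, scores)
    }
    # Average over the aspect entries only (computed before "overall" is added).
    results["overall"] = sum(v["score"] for v in results.values()) / len(aspects)
    return results

outcome = asyncio.run(comprehensive_evaluation("sample content"))
```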
## 💡 Best Practices

- **Be specific with your criteria.** "Technical accuracy for a beginner audience" guides the judge better than "quality".

- **Design rubrics carefully:**
    - Make score distinctions clear and meaningful
    - Avoid overlapping descriptions
    - Include specific indicators for each level

- **Use a system prompt to control the persona** when domain expertise matters.

- **Provide context** when the evaluation depends on understanding the situation or question that prompted the response.

## Next Steps

- Learn about [Using Pre-built Metrics](metrics.md) for common evaluation tasks
- Explore [Template Variables](templates.md) for dynamic evaluations

<!-- - Understand [Batch Processing](batch.md) for high-volume evaluation
- Discover [Advanced Usage](advanced.md) patterns and techniques -->
