
Commit 1cf2a4f

batch cookbook v2
1 parent 5461617 commit 1cf2a4f

3 files changed: +61 -11 lines changed

docs.json

Lines changed: 1 addition & 0 deletions
@@ -842,6 +842,7 @@
 "group": "Use Cases",
 "pages": [
 "guides/use-cases",
+"guides/use-cases/run-batch-evals",
 "guides/use-cases/few-shot-prompting",
 "guides/use-cases/enforcing-json-schema-with-anyscale-and-together",
 "guides/use-cases/emotions-with-gpt-4o",

guides/use-cases.mdx

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ title: Overview
 ---

 <CardGroup>
+<Card title="How to Run Structured Output Evals on Portkey at Scale" href="/guides/use-cases/run-batch-evals" />
 <Card title="Few-Shot Prompting" href="/guides/use-cases/few-shot-prompting" />
 <Card title="Enforcing JSON Schema with Anyscale & Together" href="/guides/use-cases/enforcing-json-schema-with-anyscale-and-together" />
 <Card title="Detecting Emotions with GPT-4o" href="/guides/use-cases/emotions-with-gpt-4o" />

guides/use-cases/run-batch-evals.mdx

Lines changed: 59 additions & 11 deletions
@@ -1,9 +1,7 @@
 ---
-title: "How to Strctured Output Evals with Portkey"
-## Why Evaluations Matter
+title: "How to Run Structured Output Evals at Scale"
 ---

-
 Building production-ready AI applications requires more than just choosing the right model and infrastructure. Just like traditional software development relies on comprehensive testing, AI applications need rigorous evaluation to ensure they perform reliably in the real world.

 Consider a customer support chatbot for a food delivery service. If it incorrectly processes a refund request, your company bears the financial cost. If it misunderstands a complaint about food allergies, the consequences could be even more severe. These high-stakes scenarios demand systematic evaluation before deployment.
@@ -33,16 +31,9 @@ We'll work through a practical example: an AI agent that analyzes Amazon product
 ### Our Evaluation Framework
 Using real Amazon review data, we'll build evaluations that check:
 1. **Format compliance**: Does the output match our required JSON schema `{"sentiment": "positive"}`?
-2. **PII detection**: Deny Request with PII using Guardrails
-3. **Sentiment classification accuracy**: Do the classifications match human labels?
-
-By the end of this guide, you'll have a reusable framework for building evaluations specific to your own AI applications.
-
-
-
-


+By the end of this guide, you'll have a reusable framework for building evaluations specific to your own AI applications.



@@ -501,13 +492,70 @@ print("\nDetailed Results:")
 print(tabulate(display_df, headers='keys', tablefmt='simple', showindex=False))
 ```

+
 ## Understanding the Results

+The script outputs a CSV with your evaluation results. Here's what matters:
+
+- **status_code 246**: Request passed JSON schema validation ✅
+- **status_code 446**: Request failed JSON schema validation ❌
+
+Example output:
+```
+Summary:
+Total requests: 10
+Successful: 8
+Schema Validation Passed: 8
+Schema Validation Failed: 2
+
+Sentiment Distribution:
+positive 5
+negative 3
+parse_error 2
+```
+
+## What to Do Next

+<Steps>
+<Step title="Fix Schema Failures">
+If you see 446 status codes, check:
+- Is your prompt output format clear?
+- Did you include enough examples?
+- Is the JSON structure in your prompt exactly right?
+</Step>

+<Step title="Scale Up">
+Change `length: 10` to `length: 100` or `length: 1000` to test more reviews
+</Step>
+
+<Step title="Add More Guardrails">
+Create guardrails for:
+- Checking if negative reviews mention refunds
+- Validating positive reviews aren't too short
+- Flagging reviews that need human review
+</Step>
+</Steps>

 ## Conclusion

+You now have:
+- A working evaluation pipeline
+- Automatic JSON validation
+- Batch processing for scale
+- Clear pass/fail metrics
+
+Start with 100 reviews, fix any issues, then scale to thousands.
+
+## Resources
+
+- [Portkey Docs](https://docs.portkey.ai) - API reference
+- [Discord](https://discord.gg/portkey) - Get help
+
+
+---
+
+
+
 ### Complete Script

 Here's the complete script you can save and run:
