---
title: "How to Run Structured Output Evals at Scale"
---

Building production-ready AI applications requires more than just choosing the right model and infrastructure. Just like traditional software development relies on comprehensive testing, AI applications need rigorous evaluation to ensure they perform reliably in the real world.

Consider a customer support chatbot for a food delivery service. If it incorrectly processes a refund request, your company bears the financial cost. If it misunderstands a complaint about food allergies, the consequences could be even more severe. These high-stakes scenarios demand systematic evaluation before deployment.

We'll work through a practical example: an AI agent that analyzes Amazon product reviews.

### Our Evaluation Framework

Using real Amazon review data, we'll build evaluations that check:
1. **Format compliance**: Does the output match our required JSON schema `{"sentiment": "positive"}`?
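
The guardrail performs this schema check at the gateway on every request. Purely to illustrate what "format compliance" means here, the sketch below reproduces the check locally with the `jsonschema` package; the exact schema is an assumption based on the example above.

```python
# Local sketch of the format-compliance check (the schema is assumed).
import json

from jsonschema import ValidationError, validate

# Assumed schema: a single "sentiment" field restricted to known labels.
SENTIMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative"]}
    },
    "required": ["sentiment"],
    "additionalProperties": False,
}

def is_format_compliant(raw_output: str) -> bool:
    """True if the model output parses as JSON and matches the schema."""
    try:
        validate(instance=json.loads(raw_output), schema=SENTIMENT_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(is_format_compliant('{"sentiment": "positive"}'))  # True
print(is_format_compliant("The sentiment is positive."))  # False
```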

By the end of this guide, you'll have a reusable framework for building evaluations specific to your own AI applications.

```python
# ... final lines of the evaluation script (the complete script appears below)
print("\nDetailed Results:")
print(tabulate(display_df, headers='keys', tablefmt='simple', showindex=False))
```

## Understanding the Results

The script prints a summary and outputs a CSV with your evaluation results. Here's what matters:

- **status_code 246**: Request passed JSON schema validation ✅
- **status_code 446**: Request failed JSON schema validation ❌

Example output:
```
Summary:
Total requests: 10
Successful: 8
Schema Validation Passed: 8
Schema Validation Failed: 2

Sentiment Distribution:
positive 5
negative 3
parse_error 2
```
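
To slice the results yourself, here's a minimal sketch with pandas; the filename and the `status_code` column name are assumptions based on the output above, so adjust them to match your script.

```python
# Minimal sketch: tally schema-validation outcomes from the results CSV.
# "eval_results.csv" and the column names are assumed; match your script.
import pandas as pd

df = pd.read_csv("eval_results.csv")

passed = int((df["status_code"] == 246).sum())
failed = int((df["status_code"] == 446).sum())
print(f"Schema Validation Passed: {passed}")
print(f"Schema Validation Failed: {failed}")

# Look at the failing rows first when debugging prompts.
print(df[df["status_code"] == 446].head())
```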

## What to Do Next

<Steps>
<Step title="Fix Schema Failures">
If you see 446 status codes, check:
- Is the output format in your prompt clearly specified?
- Did you include enough examples?
- Is the JSON structure in your prompt exactly right?
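
As a reference point, here's a hypothetical prompt builder that addresses all three checks; the wording and the function name are illustrative, not taken from the original script.

```python
# Hypothetical prompt that pins down the output format (illustrative only).
def build_prompt(review_text: str) -> str:
    return (
        "Classify the sentiment of this Amazon review.\n"
        "Respond with ONLY a JSON object, no extra text, exactly in this form:\n"
        '{"sentiment": "positive"}\n'
        'The "sentiment" value must be "positive" or "negative".\n\n'
        f"Review: {review_text}"
    )
```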
</Step>

<Step title="Scale Up">
Change `length: 10` to `length: 100` or `length: 1000` to test more reviews.
</Step>

<Step title="Add More Guardrails">
Create guardrails that:
- Check whether negative reviews mention refunds
- Validate that positive reviews aren't too short
- Flag reviews that need human review
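
These would be custom checks layered on top of the schema guardrail. Purely as an illustration of the underlying logic (this is not Portkey's guardrail API), a sketch:

```python
# Illustrative check logic only; not Portkey's guardrail API.
def mentions_refund(review_text: str, sentiment: str) -> bool:
    """Flag negative reviews that bring up refunds."""
    return sentiment == "negative" and "refund" in review_text.lower()

def too_short_positive(review_text: str, sentiment: str, min_words: int = 5) -> bool:
    """Flag positive reviews that are suspiciously short."""
    return sentiment == "positive" and len(review_text.split()) < min_words

def needs_human_review(review_text: str, sentiment: str) -> bool:
    """Route edge cases to a person."""
    return mentions_refund(review_text, sentiment) or too_short_positive(review_text, sentiment)
```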
</Step>
</Steps>

## Conclusion

You now have:
- A working evaluation pipeline
- Automatic JSON validation
- Batch processing for scale
- Clear pass/fail metrics

Start with 100 reviews, fix any issues, then scale to thousands.

## Resources

- [Portkey Docs](https://docs.portkey.ai) - API reference
- [Discord](https://discord.gg/portkey) - Get help

---

### Complete Script

Here's the complete script you can save and run: