|
| 1 | +# Evaluation System Foundation |
| 2 | + |
| 3 | +1. **Evaluation Framework**: You can now systematically test your Vapi voice assistants with the new [`Eval`](https://api.vapi.ai/api#:~:text=Eval) system. Create comprehensive test scenarios to validate assistant behavior, conversation flow, and tool usage through mock conversations. |
| 4 | + |
| 5 | +2. **Mock Conversation Builder**: Design test conversations using [`Eval.messages`](https://api.vapi.ai/api#:~:text=Eval.messages) with support for multiple message types: |
| 6 | + - [`ChatEvalUserMessageMock`](https://api.vapi.ai/api#:~:text=ChatEvalUserMessageMock): Simulate user inputs and questions |
| 7 | + - [`ChatEvalSystemMessageMock`](https://api.vapi.ai/api#:~:text=ChatEvalSystemMessageMock): Inject system messages mid-conversation |
| 8 | + - [`ChatEvalToolResponseMessageMock`](https://api.vapi.ai/api#:~:text=ChatEvalToolResponseMessageMock): Mock tool responses for consistent testing |
| 9 | + - [`ChatEvalAssistantMessageEvaluation`](https://api.vapi.ai/api#:~:text=ChatEvalAssistantMessageEvaluation): Define evaluation checkpoints |
| 10 | + |
| 11 | +3. **Evaluation Types**: Currently focused on `chat.mockConversation` type evaluations, with the framework designed to support additional evaluation methods in future releases. |
| 12 | + |
| 13 | +4. **Evaluation Management**: Organize your tests with [`CreateEvalDTO`](https://api.vapi.ai/api#:~:text=CreateEvalDTO) and [`UpdateEvalDTO`](https://api.vapi.ai/api#:~:text=UpdateEvalDTO): |
| 14 | + - `name`: Descriptive names up to 80 characters (e.g., "Customer Support Flow Validation") |
| 15 | + - `description`: Detailed descriptions up to 500 characters explaining the test purpose |
| 16 | + - `messages`: The complete mock conversation flow |
| 17 | + |
| 18 | +5. **Evaluation Endpoints**: Access your evaluations through the new [`/eval`](https://api.vapi.ai/api#:~:text=/eval) endpoint family: |
| 19 | + - `GET /eval`: List all evaluations with pagination support |
| 20 | + - `POST /eval`: Create new evaluations |
| 21 | + - `GET /eval/{id}`: Retrieve specific evaluation details |
| 22 | + - `PUT /eval/{id}`: Update existing evaluations |
| 23 | + |
| 24 | +6. **Judge Plan Architecture**: Define how assistant responses are validated using [`AssistantMessageJudgePlan`](https://api.vapi.ai/api#:~:text=AssistantMessageJudgePlan) with three evaluation methods: |
| 25 | + - **Exact Match**: [`AssistantMessageJudgePlanExact`](https://api.vapi.ai/api#:~:text=AssistantMessageJudgePlanExact) for precise content and tool call validation |
| 26 | + - **Regex Pattern**: [`AssistantMessageJudgePlanRegex`](https://api.vapi.ai/api#:~:text=AssistantMessageJudgePlanRegex) for flexible pattern-based evaluation |
| 27 | + - **AI Judge**: [`AssistantMessageJudgePlanAI`](https://api.vapi.ai/api#:~:text=AssistantMessageJudgePlanAI) for intelligent evaluation using LLM-as-a-judge |
| 28 | + |
| 29 | +<Info> |
| 30 | + This is the foundation release for the evaluation system. Evaluation execution and results processing will be available in upcoming releases. Start designing your test scenarios now to be ready for full evaluation capabilities. |
| 31 | +</Info> |
| 32 | + |
| 33 | +## Testing Capabilities |
| 34 | +<CardGroup cols={2}> |
| 35 | + <Card title="Mock Conversations" icon="message-square"> |
| 36 | + Create realistic test scenarios with user messages, system prompts, and expected assistant responses for comprehensive flow validation. |
| 37 | + </Card> |
| 38 | + <Card title="Tool Call Testing" icon="wrench"> |
| 39 | + Validate that your assistant calls the right tools with correct parameters using <code>ChatEvalAssistantMessageMockToolCall</code>. |
| 40 | + </Card> |
| 41 | + <Card title="Flexible Validation" icon="check-circle"> |
| 42 | + Choose from exact matching, regex patterns, or AI-powered evaluation to suit different testing needs and complexity levels. |
| 43 | + </Card> |
| 44 | + <Card title="Evaluation Organization" icon="folder"> |
| 45 | + Organize tests with descriptive names and detailed documentation to maintain clear testing workflows across your team. |
| 46 | + </Card> |
| 47 | +</CardGroup> |
0 commit comments