8 changes: 7 additions & 1 deletion deepeval/prompt/utils.py
@@ -256,7 +256,13 @@ def build_node(field_list: List[OutputSchemaField]) -> Dict[str, Any]:
field_type = (
field.type.value if hasattr(field.type, "value") else field.type
)
field_schema = {"type": map_type(field.type)}
normalized_type = (
SchemaDataType(field_type)
if not isinstance(field_type, SchemaDataType)
else field_type
)

field_schema = {"type": map_type(normalized_type)}

# Add description if available
if field.description:
39 changes: 39 additions & 0 deletions docs/docs/evaluation-prompts.mdx
@@ -405,3 +405,42 @@ There are **TWO** output settings you can associate with a prompt:

- `output_type`: The type of output to use for generation (for example, `OutputType.SCHEMA`).
- `output_schema`: The schema of type `BaseModel` of the output, if `output_type` is `OutputType.SCHEMA`.
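
For instance, a minimal sketch of pushing a prompt with both output settings could look like the following; the `OutputType` import path and the exact `push()` keyword names are assumptions based on the bullets above:

```python
from pydantic import BaseModel

from deepeval.prompt import Prompt
from deepeval.prompt.api import OutputType  # assumed import path

class AnswerSchema(BaseModel):
    answer: str

prompt = Prompt(alias="YOUR-PROMPT-ALIAS")
prompt.push(
    text="Answer the user's question.",
    output_type=OutputType.SCHEMA,  # enum member referenced in the bullet above
    output_schema=AnswerSchema,     # a `BaseModel` subclass describing the output
)
```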

### Tools

The tools in a prompt specify the tools your agent has access to. Tools are identified by their name, so each tool name must be unique.

```python
from deepeval.prompt import Prompt, Tool
from deepeval.prompt.api import ToolMode
from pydantic import BaseModel

class ToolInputSchema(BaseModel):
result: str
confidence: float

prompt = Prompt(alias="YOUR-PROMPT-ALIAS")
tool = Tool(
name="ExploreTool",
description="Tool used for browsing the internet",
mode=ToolMode.STRICT,
structured_schema=ToolInputSchema,
)

prompt.push(
text="This is a prompt with a tool",
tools=[tool]
)

# You can also update an existing tool by using the new tool in the push / update method:
tool2 = Tool(
name="ExploreTool", # Must have the same name to update a tool
description="Tool used for browsing the internet",
mode=ToolMode.ALLOW_ADDITIONAL,
structured_schema=ToolInputSchema,
)

prompt.update(
tools=[tool2]
)
```
170 changes: 162 additions & 8 deletions docs/docs/metrics-custom.mdx
@@ -9,7 +9,8 @@ sidebar_label: Do it yourself
</head>

import MetricTagsDisplayer from '@site/src/components/MetricTagsDisplayer';
import { Timeline, TimelineItem } from '@site/src/components/Timeline';
import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";

<MetricTagsDisplayer custom={true} usesLLMs={false} />

@@ -31,26 +32,42 @@ There are many ways one can implement an LLM evaluation metric. Here is a [great

## Rules To Follow When Creating A Custom Metric

<Timeline>
<TimelineItem title="Inherit the `BaseMetric` class"></TimelineItem>
</Timeline>

### 1. Inherit the `BaseMetric` class

To begin, create a class that inherits from `deepeval`'s `BaseMetric` class:

<Tabs groupId="single-multi-turns">

<TabItem value="single-turn" label="Single-Turn">

```python
from deepeval.metrics import BaseMetric

class CustomMetric(BaseMetric):
...
```

This is important because the `BaseMetric` class will help `deepeval` acknowledge your custom metric during evaluation.
This is important because the `BaseMetric` class helps `deepeval` recognize your custom metric as a single-turn metric during evaluation.

</TabItem>
<TabItem value="multi-turn" label="Multi-Turn">

```python
from deepeval.metrics import BaseConversationalMetric

class CustomConversationalMetric(BaseConversationalMetric):
...
```

This is important because the `BaseConversationalMetric` class helps `deepeval` recognize your custom metric as a multi-turn metric during evaluation.

</TabItem>

</Tabs>

### 2. Implement the `__init__()` method

The `BaseMetric` class gives your custom metric a few properties that you can configure and be displayed post-evaluation, either locally or on Confident AI.
The `BaseMetric` and `BaseConversationalMetric` classes give your custom metric a few properties that you can configure and that are displayed post-evaluation, either locally or on Confident AI.

An example is the `threshold` property, which determines whether the `LLMTestCase` being evaluated has passed or not. Although **the `threshold` property is all you need to make a custom metric functional**, here are some additional properties for those who want even more customizability:

@@ -65,6 +82,10 @@ Don't read too much into the advanced properties for now, we'll go over how they

The `__init__()` method is a great place to set these properties:

<Tabs groupId="single-multi-turns">

<TabItem value="single-turn" label="Single-Turn">

```python
from deepeval.metrics import BaseMetric

@@ -86,6 +107,33 @@ class CustomMetric(BaseMetric):
self.async_mode = async_mode
```

</TabItem>
<TabItem value="multi-turn" label="Multi-Turn">

```python
from typing import Optional

from deepeval.metrics import BaseConversationalMetric

class CustomConversationalMetric(BaseConversationalMetric):
    def __init__(
        self,
        threshold: float = 0.5,
        # Optional (needs a default since it follows a defaulted parameter)
        evaluation_model: Optional[str] = None,
        include_reason: bool = True,
        strict_mode: bool = True,
        async_mode: bool = True
    ):
        self.threshold = threshold
        # Optional
        self.evaluation_model = evaluation_model
        self.include_reason = include_reason
        self.strict_mode = strict_mode
        self.async_mode = async_mode
```

</TabItem>

</Tabs>

### 3. Implement the `measure()` and `a_measure()` methods

The `measure()` and `a_measure()` methods are where all the evaluation happens. In `deepeval`, evaluation is the process of applying a metric to an `LLMTestCase` to generate a score and, optionally, a reason for the score (if you're using an LLM) based on the scoring algorithm.
@@ -114,6 +162,12 @@ Both `measure()` and `a_measure()` **MUST**:

You can also optionally set `self.reason` in the measure methods (if you're using an LLM for evaluation), or wrap everything in a `try` block to catch any exceptions and store them in `self.error`. Here's a hypothetical example:


<Tabs groupId="single-multi-turns">

<TabItem value="single-turn" label="Single-Turn">


```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
@@ -150,6 +204,49 @@ class CustomMetric(BaseMetric):
raise
```

</TabItem>
<TabItem value="multi-turn" label="Multi-Turn">

```python
from deepeval.metrics import BaseConversationalMetric
from deepeval.test_case import ConversationalTestCase

class CustomConversationalMetric(BaseConversationalMetric):
...

def measure(self, test_case: ConversationalTestCase) -> float:
# Although not required, we recommend catching errors
# in a try block
try:
self.score = generate_hypothetical_score(test_case)
if self.include_reason:
self.reason = generate_hypothetical_reason(test_case)
self.success = self.score >= self.threshold
return self.score
except Exception as e:
# set metric error and re-raise it
self.error = str(e)
raise

async def a_measure(self, test_case: ConversationalTestCase) -> float:
# Although not required, we recommend catching errors
# in a try block
try:
self.score = await async_generate_hypothetical_score(test_case)
if self.include_reason:
self.reason = await async_generate_hypothetical_reason(test_case)
self.success = self.score >= self.threshold
return self.score
except Exception as e:
# set metric error and re-raise it
self.error = str(e)
raise
```

</TabItem>

</Tabs>

:::tip

Oftentimes, the blocking part of an LLM evaluation metric stems from the API calls made to your LLM provider (such as OpenAI's API endpoints), so ultimately you'll have to ensure that LLM inference can indeed be made asynchronous.
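
For example, if your scoring logic wraps a synchronous client, one option is to offload it to a worker thread; a minimal sketch, reusing the hypothetical `generate_hypothetical_score()` from the examples above:

```python
import asyncio

async def async_generate_hypothetical_score(test_case):
    # Offload the blocking scoring call to a worker thread so that
    # a_measure() does not block the event loop while waiting on I/O.
    return await asyncio.to_thread(generate_hypothetical_score, test_case)
```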
@@ -174,6 +271,10 @@ You can also [click here to find an example of offloading LLM inference to a sep

Under the hood, `deepeval` calls the `is_successful()` method to determine the status of your metric for a given `LLMTestCase`. We recommend copying and pasting the code below directly as your `is_successful()` implementation:

<Tabs groupId="single-multi-turns">

<TabItem value="single-turn" label="Single-Turn">

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
@@ -185,13 +286,46 @@ class CustomMetric(BaseMetric):
if self.error is not None:
self.success = False
else:
return self.success
try:
self.success = self.score >= self.threshold
except TypeError:
self.success = False
return self.success
```

</TabItem>
<TabItem value="multi-turn" label="Multi-Turn">

```python
from deepeval.metrics import BaseConversationalMetric
from deepeval.test_case import ConversationalTestCase

class CustomConversationalMetric(BaseConversationalMetric):
...

def is_successful(self) -> bool:
if self.error is not None:
self.success = False
else:
try:
self.success = self.score >= self.threshold
except TypeError:
self.success = False
return self.success
```

</TabItem>

</Tabs>

### 5. Name Your Custom Metric

Probably the easiest step: all that's left is to name your custom metric:

<Tabs groupId="single-multi-turns">

<TabItem value="single-turn" label="Single-Turn">

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
Expand All @@ -204,6 +338,26 @@ class CustomMetric(BaseMetric):
return "My Custom Metric"
```

</TabItem>
<TabItem value="multi-turn" label="Multi-Turn">

```python
from deepeval.metrics import BaseConversationalMetric
from deepeval.test_case import ConversationalTestCase

class CustomConversationalMetric(BaseConversationalMetric):
...

@property
def __name__(self):
return "My Custom Metric"
```

</TabItem>

</Tabs>


**Congratulations 🎉!** You've just learnt how to build a custom metric that is 100% integrated with `deepeval`'s ecosystem. In the following section, we'll go through a few real-life examples.
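
As a quick sanity check, a finished custom metric can be run like any built-in one; a minimal sketch with placeholder test case values:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase

# Instantiate the custom single-turn metric defined above
metric = CustomMetric(threshold=0.5)

test_case = LLMTestCase(
    input="What does your custom metric measure?",
    actual_output="It scores test cases according to my own criteria.",
)

evaluate(test_cases=[test_case], metrics=[metric])
```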

## More Examples
12 changes: 6 additions & 6 deletions docs/integrations/models/openrouter.mdx
@@ -1,5 +1,5 @@
---
# id: openrouter
id: openrouter
title: OpenRouter
sidebar_label: OpenRouter
---
@@ -43,7 +43,7 @@ model = OpenRouterModel(
model="openai/gpt-4.1",
api_key="your-openrouter-api-key",
# Optional: override the default OpenRouter endpoint
# base_url="https://openrouter.ai/api/v1",
base_url="https://openrouter.ai/api/v1",
# Optional: pass OpenRouter headers via **kwargs
default_headers={
"HTTP-Referer": "https://your-site.com",
@@ -59,12 +59,12 @@ There are **ZERO** mandatory and **SEVEN** optional parameters when creating an
- [Optional] `model`: A string specifying the OpenRouter model to use. Defaults to `OPENROUTER_MODEL_NAME` if set; otherwise falls back to "openai/gpt-4.1".
- [Optional] `api_key`: A string specifying your OpenRouter API key for authentication. Defaults to `OPENROUTER_API_KEY` if not passed; raises an error at runtime if unset.
- [Optional] `base_url`: A string specifying the base URL for the OpenRouter API endpoint. Defaults to `OPENROUTER_BASE_URL` if set; otherwise falls back to "https://openrouter.ai/api/v1".
- [Optional] `temperature`: A float specifying the model temperature. Defaults to `TEMPERATURE` if not passed; falls back to `0.0` if unset; raises if < 0.
- [Optional] `cost_per_input_token`: A float specifying the cost for each input token for the provided model. Defaults to `OPENROUTER_COST_PER_INPUT_TOKEN` if set.
- [Optional] `cost_per_output_token`: A float specifying the cost for each output token for the provided model. Defaults to `OPENROUTER_COST_PER_OUTPUT_TOKEN` if set.
- [Optional] `temperature`: A float specifying the model temperature. Defaults to `TEMPERATURE` if not passed; falls back to `0.0` if unset.
- [Optional] `cost_per_input_token`: A float specifying the cost for each input token for the provided model. Defaults to `OPENROUTER_COST_PER_INPUT_TOKEN` if not passed; raises an error at runtime if unset.
- [Optional] `cost_per_output_token`: A float specifying the cost for each output token for the provided model. Defaults to `OPENROUTER_COST_PER_OUTPUT_TOKEN` if not passed; raises an error at runtime if unset.
- [Optional] `generation_kwargs`: A dictionary of additional generation parameters forwarded to OpenRouter's `chat.completions.create(...)` call.
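
For example, a sketch of setting the cost parameters explicitly and handing the model to a metric; the `deepeval.models` import path mirrors `deepeval`'s other model integrations and the metric choice is just illustrative:

```python
from deepeval.models import OpenRouterModel
from deepeval.metrics import AnswerRelevancyMetric

model = OpenRouterModel(
    model="openai/gpt-4.1",
    cost_per_input_token=0.000002,   # illustrative values, not OpenRouter's actual pricing
    cost_per_output_token=0.000008,
)

metric = AnswerRelevancyMetric(model=model)
```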

Any additional **kwargs you would like to use for your OpenRouter client can be passed directly to OpenRouterModel(...). These are forwarded to the underlying OpenAI client constructor. We recommend double-checking the parameters and headers supported by your chosen model in the [official OpenRouter docs](https://openrouter.ai/docs).
Any additional `**kwargs` you would like to use for your `OpenRouter` client can be passed directly to `OpenRouterModel(...)`. These are forwarded to the underlying OpenAI client constructor. We recommend double-checking the parameters and headers supported by your chosen model in the [official OpenRouter docs](https://openrouter.ai/docs).

:::tip
Pass headers specific to OpenRouter via kwargs:
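
A sketch of what that can look like, based on the `default_headers` shown earlier (the import path is assumed and the header values are placeholders):

```python
from deepeval.models import OpenRouterModel

model = OpenRouterModel(
    model="openai/gpt-4.1",
    default_headers={
        "HTTP-Referer": "https://your-site.com",  # attribution header recognized by OpenRouter
        "X-Title": "Your App Name",               # display name used by OpenRouter
    },
)
```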
1 change: 1 addition & 0 deletions docs/sidebarIntegrations.js
@@ -24,6 +24,7 @@ module.exports = {
'models/openai',
'models/azure-openai',
'models/ollama',
'models/openrouter',
'models/anthropic',
'models/amazon-bedrock',
'models/gemini',