initial commit

mtullalizardi · mtullalizardi · commit a0d0a2b07695 · 2025-10-06T16:22:25.000-04:00
diff --git a/content/en/llm_observability/evaluations/managed_evaluations.md b/content/en/llm_observability/evaluations/managed_evaluations.md
@@ -134,8 +134,8 @@ Each of these metrics has `ml_app`, `model_server`, `model_provider`, `model_nam
 #### Topic relevancy
 
 This check identifies and flags user inputs that deviate from the configured acceptable input topics. This ensures that interactions stay pertinent to the LLM's designated purpose and scope.
-  
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on Input | Evaluated using LLM | Topic relevancy assesses whether each prompt-response pair remains aligned with the intended subject matter of the Large Language Model (LLM) application. For instance, an e-commerce chatbot receiving a question about a pizza recipe would be flagged as irrelevant.  |
 
@@ -156,7 +156,7 @@ This check identifies instances where the LLM makes a claim that disagrees with
 
 {{< img src="llm_observability/evaluations/hallucination_1.png" alt="A Hallucination evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
 
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on Output | Evaluated using LLM | Hallucination flags any output that disagrees with the context provided to the LLM. |
 
@@ -217,13 +217,13 @@ This check identifies instances where the LLM fails to deliver an appropriate re
 
 {{< img src="llm_observability/evaluations/failure_to_answer_1.png" alt="A Failure to Answer evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
 
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on Output | Evaluated using LLM | Failure To Answer flags whether each prompt-response pair demonstrates that the LLM application has provided a relevant and satisfactory answer to the user's question.  |
 
 ##### Failure to Answer Configuration
 <div class="alert alert-info">Configuring failure to answer evaluation categories is supported if OpenAI or Azure OpenAI is selected as your LLM provider.</div>
-You can configure the Failure to Answer evaluation to use specific categories of failure to answer, listed in the following table. 
+You can configure the Failure to Answer evaluation to use specific categories of failure to answer, listed in the following table.
 
 | Configuration Option | Description | Example(s) |
 |---|---|---|
@@ -245,7 +245,7 @@ Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Belarusian, Bengali, Norwegi
 
 {{< img src="llm_observability/evaluations/language_mismatch_1.png" alt="A Language Mismatch evaluation detected by an open source model in LLM Observability" style="width:100%;" >}}
 
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on Input and Output | Evaluated using Open Source Model | Language Mismatch flags whether each prompt-response pair demonstrates that the LLM application answered the user's question in the same language that the user used.  |
 
@@ -255,7 +255,7 @@ This check helps understand the overall mood of the conversation, gauge user sat
 
 {{< img src="llm_observability/evaluations/sentiment_1.png" alt="A Sentiment evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
 
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on Input and Output | Evaluated using LLM | Sentiment flags the emotional tone or attitude expressed in the text, categorizing it as positive, negative, or neutral.   |
 
@@ -265,7 +265,7 @@ This check evaluates whether your LLM chatbot can successfully carry out a full
 
 {{< img src="llm_observability/evaluations/goal_completeness.png" alt="A Goal Completeness evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
 
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on session | Evaluated using LLM | Goal Completeness assesses whether all user intentions within a multi-turn interaction were successfully resolved. The evaluation identifies resolved and unresolved intentions, providing a completeness score based on the ratio of unresolved to total intentions. |
 
@@ -314,7 +314,7 @@ This check evaluates whether the agent has successfully selected the appropriate
 
 {{< img src="llm_observability/evaluations/tool_selection_failure.png" alt="A tool selection failure detected by the evaluation in LLM Observability" style="width:100%;" >}}
 
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on LLM spans| Evaluated using LLM | Tool Selection verifies that the tools chosen by the LLM align with the user's request and the available tools. The evaluation identifies cases where irrelevant or incorrect tool calls were made.|
 
@@ -339,9 +339,9 @@ def subtract_numbers(a: int, b: int) -> int:
     Subtracts two numbers.
     """
     return a - b
-    
 
-# List of tools available to the agent 
+
+# List of tools available to the agent
 math_tutor_agent = Agent(
     name="Math Tutor",
     handoff_description="Specialist agent for math questions",
@@ -360,21 +360,21 @@ history_tutor_agent = Agent(
 )
 
 # The triage agent decides which specialized agent to hand off the task to — another type of tool selection covered by this evaluation.
-triage_agent = Agent(  
+triage_agent = Agent(
     'openai:gpt-4o',
     model_settings=ModelSettings(temperature=0),
-    instructions='What is the sum of 1 to 10?',  
+    instructions='What is the sum of 1 to 10?',
     handoffs=[math_tutor_agent, history_tutor_agent],
 )
 {{< /code-block >}}
 
 #### Tool argument correctness
 
-This check looks at the arguments provided to a selected tool, and it evaluates whether these arguments match the expected type and make sense given the tool's context. 
+This check looks at the arguments provided to a selected tool, and it evaluates whether these arguments match the expected type and make sense given the tool's context.
 
 {{< img src="llm_observability/evaluations/tool_argument_correctness_error.png" alt="A tool argument correctness error detected by the evaluation in LLM Observability" style="width:100%;" >}}
 
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on LLM spans| Evaluated using LLM | Tool Argument Correctness verifies that the arguments provided to a tool by the LLM are correct and contextually relevant. This evaluation identifies cases where the arguments provided to the tool are incorrect according to the tool schema (for example: the argument is expected to be an integer rather than a string) and are not relevant (for example: the argument is a country, but the model provides the name of a city).|
 
@@ -403,7 +403,7 @@ def subtract_numbers(a: int, b: int) -> int:
     """
     return a - b
 
-    
+
 def multiply_numbers(a: int, b: int) -> int:
     """
     Multiplies two numbers.
@@ -441,7 +441,7 @@ history_tutor_agent = Agent(
 )
 
 # Create the triage agent
-# Note: pydantic_ai handles handoffs differently - you'd typically use result_type 
+# Note: pydantic_ai handles handoffs differently - you'd typically use result_type
 # or custom logic to route between agents
 triage_agent = Agent(
     'openai:gpt-5-nano',
@@ -470,27 +470,27 @@ result = triage_agent.run_sync(
 This check evaluates each input prompt from the user and the response from the LLM application for toxic content. This check identifies and flags toxic content to ensure that interactions remain respectful and safe.
 
 {{< img src="llm_observability/evaluations/toxicity_1.png" alt="A Toxicity evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
-  
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on Input and Output | Evaluated using LLM | Toxicity flags any language or behavior that is harmful, offensive, or inappropriate, including but not limited to hate speech, harassment, threats, and other forms of harmful communication. |
 
 ##### Toxicity configuration
 
 <div class="alert alert-info">Configuring toxicity evaluation categories is supported if OpenAI or Azure OpenAI is selected as your LLM provider.</div>
-You can configure toxicity evaluations to use specific categories of toxicity, listed in the following table. 
+You can configure toxicity evaluations to use specific categories of toxicity, listed in the following table.
 
-| Category | Description | 
+| Category | Description |
 |---|---|
-| Discriminatory Content | Content that discriminates against a particular group, including based on race, gender, sexual orientation, culture, etc.  | 
-| Harassment | Content that expresses, incites, or promotes negative or intrusive behavior toward an individual or group. | 
-| Hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. | 
-| Illicit | Content that asks, gives advice, or instruction on how to commit illicit acts. | 
-| Self Harm | Content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders. | 
-| Sexual | Content that describes or alludes to sexual activity.  | 
-| Violence | Content that discusses death, violence, or physical injury. | 
-| Profanity | Content containing profanity. | 
-| User Dissatisfaction | Content containing criticism towards the model. *This category is only available for evaluating input toxicity.* | 
+| Discriminatory Content | Content that discriminates against a particular group, including based on race, gender, sexual orientation, culture, etc.  |
+| Harassment | Content that expresses, incites, or promotes negative or intrusive behavior toward an individual or group. |
+| Hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. |
+| Illicit | Content that asks, gives advice, or instruction on how to commit illicit acts. |
+| Self Harm | Content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders. |
+| Sexual | Content that describes or alludes to sexual activity.  |
+| Violence | Content that discusses death, violence, or physical injury. |
+| Profanity | Content containing profanity. |
+| User Dissatisfaction | Content containing criticism towards the model. *This category is only available for evaluating input toxicity.* |
 
 The toxicity categories in this table are informed by: [Banko et al. (2020)][14], [Inan et al. (2023)][15], [Ghosh et al. (2024)][16], [Zheng et al. (2024)][17].
 
@@ -500,13 +500,13 @@ This check identifies attempts by unauthorized or malicious authors to manipulat
 
 {{< img src="llm_observability/evaluations/prompt_injection_1.png" alt="A Prompt Injection evaluation detected by an LLM in LLM Observability" style="width:100%;" >}}
 
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on Input | Evaluated using LLM | [Prompt Injection][13] flags any unauthorized or malicious insertion of prompts or cues into the conversation by an external party or user. |
 
 ##### Prompt injection configuration
 <div class="alert alert-info">Configuring prompt injection evaluation categories is supported if OpenAI or Azure OpenAI is selected as your LLM provider.</div>
-You can configure the prompt injection evaluation to use specific categories of prompt injection, listed in the following table. 
+You can configure the prompt injection evaluation to use specific categories of prompt injection, listed in the following table.
 
 | Configuration Option | Description | Example(s) |
 |---|---|---|
@@ -520,8 +520,8 @@ You can configure the prompt injection evaluation to use specific categories of
 This check ensures that sensitive information is handled appropriately and securely, reducing the risk of data breaches or unauthorized access.
 
 {{< img src="llm_observability/evaluations/sensitive_data_scanning_1.png" alt="A Security and Safety evaluation detected by the Sensitive Data Scanner in LLM Observability" style="width:100%;" >}}
-  
-| Evaluation Stage | Evaluation Method | Evaluation Definition | 
+
+| Evaluation Stage | Evaluation Method | Evaluation Definition |
 |---|---|---|
 | Evaluated on Input and Output | Sensitive Data Scanner | Powered by the [Sensitive Data Scanner][4], LLM Observability scans, identifies, and redacts sensitive information within every LLM application's prompt-response pairs. This includes personal information, financial data, health records, or any other data that requires protection due to privacy or security concerns. |