# Evaluate LLMs in Python with DeepEval

* **Platform**: YouTube
* **Channel/Creator**: NeuralNine
* **Duration**: 00:34:08
* **Release Date**: Sep 15, 2025
* **Video Link**: [https://www.youtube.com/watch?v=HAoKJT3af7Y](https://www.youtube.com/watch?v=HAoKJT3af7Y)

> **Disclaimer**: This is a personal summary and interpretation based on a YouTube video. It is not official material and not endorsed by the original creator. All rights remain with the respective creators.

*This document summarizes the key takeaways from the video. I highly recommend watching the full video for visual context and coding demonstrations.*

## Before You Get Started

- I summarize key points to help you learn and review quickly.
- Simply click on the `Ask AI` links to dive into any topic you want.

<!-- LH-BUTTONS:START -->
<!-- auto-generated; do not edit -->
<!-- LH-BUTTONS:END -->

## Introduction to DeepEval

**Summary**: DeepEval is an open-source Python framework for evaluating large language models (LLMs). It uses LLMs as judges to assess outputs against criteria such as correctness or professionalism, which is especially useful when valid responses vary in wording but are semantically equivalent.

**Key Takeaway/Example**: For a prompt like "What is 5 / 2?", responses such as "2.5" or "The result is 2.5" are both valid, so strict string-matching metrics fall short; DeepEval's LLM-judged metrics evaluate such outputs flexibly.

[Ask AI: Introduction to DeepEval](https://alisol.ir/?ai=Introduction%20to%20DeepEval|NeuralNine|Evaluate%20LLMs%20in%20Python%20with%20DeepEval)

## Setting Up DeepEval

**Summary**: Install DeepEval with pip or uv, put an OpenAI API key in a `.env` file, and structure test files with a `test_` prefix so evaluations can be run with commands like `deepeval test run`.

**Key Takeaway/Example**: Use imports like `from deepeval import assert_test` and `from deepeval.metrics import GEval` to start building tests. The framework calls OpenAI models for its judgments and promotes a paid service called Confident AI, but the open-source version works standalone; a minimal test file is sketched below.

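
As a rough sketch of that workflow (the file name, the explicit `load_dotenv()` call via python-dotenv, and the test body are my own assumptions; the GEval metric itself is covered in the next section), a `test_`-prefixed file might look like this:

```python
# test_llm.py -- minimal sketch; file and variable names are assumptions, not from the video.
from dotenv import load_dotenv            # assumes python-dotenv is installed
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

load_dotenv()  # makes OPENAI_API_KEY from the .env file available to DeepEval


def test_division_answer():
    # LLM-judged correctness metric; details are explained in the next section.
    metric = GEval(
        name="correctness",
        criteria="Check if the actual output has the same meaning as the expected output.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5,
    )
    test_case = LLMTestCase(
        input="What is 5 / 2?",
        actual_output="The result is 2.5",  # in a real test this would come from your LLM
        expected_output="2.5",
    )
    assert_test(test_case, [metric])

# Run from the terminal with: deepeval test run test_llm.py
```
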
[Ask AI: Setting Up DeepEval](https://alisol.ir/?ai=Setting%20Up%20DeepEval|NeuralNine|Evaluate%20LLMs%20in%20Python%20with%20DeepEval)

## G-Eval Metric for Correctness

**Summary**: The G-Eval metric lets you define custom evaluation criteria; outputs are scored from 0 to 1 based on how well they meet those criteria, and a threshold such as 0.5 determines whether the test passes.

**Key Takeaway/Example**: Define a metric with criteria like "Check if the actual output has the same meaning as the expected output" and use it in an LLMTestCase. For the input "What is 5 / 2?", "The result is 2.5" scores 1.0 when semantic matches are allowed, but 0 when an exact match is required.

```python
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# LLM-judged correctness metric: the test passes if the score is at least 0.5.
correctness_metric = GEval(
    name="correctness",
    criteria="Check if the actual output is the same as the expected output. If the meaning is the same, that's fine.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.5
)

test_case = LLMTestCase(
    input="What is 5 / 2?",
    actual_output="The result is 2.5",
    expected_output="2.5"
)

assert_test(test_case, [correctness_metric])
```

[Ask AI: G-Eval Metric for Correctness](https://alisol.ir/?ai=G-Eval%20Metric%20for%20Correctness|NeuralNine|Evaluate%20LLMs%20in%20Python%20with%20DeepEval)


## Evaluating Conversations for Professionalism

**Summary**: Use conversational G-Eval to assess entire dialogues for traits like professionalism and politeness, evaluating the turns exchanged between user and assistant.

**Key Takeaway/Example**: A rude reply like "Damn, you really know nothing" fails with a low score, while a polite one like "Yes, Python is an interpreted language" passes; multiple conversations can be evaluated in parallel, as shown in the second snippet below.

```python
from deepeval import evaluate
from deepeval.metrics import ConversationalGEval
from deepeval.test_case import ConversationalTestCase, Turn

# LLM-judged metric applied to the whole conversation rather than a single answer.
professionalism_metric = ConversationalGEval(
    name="professionalism",
    criteria="Determine whether the assistant answered the questions of the user in a professional and polite manner."
)

conversation = ConversationalTestCase(
    actual_turns=[
        Turn(role="user", content="Is Python an interpreted language?"),
        Turn(role="assistant", content="Yes, Python is an interpreted language.")
    ]
)

evaluate([conversation], [professionalism_metric])
```

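Continuing from the block above, the rude example from the video might be set up as a second test case and scored in the same call (the variable names here are my own):

```python
# Second conversation with the impolite reply mentioned above.
rude_conversation = ConversationalTestCase(
    actual_turns=[
        Turn(role="user", content="Is Python an interpreted language?"),
        Turn(role="assistant", content="Damn, you really know nothing.")
    ]
)

# Both conversations are judged in parallel; the polite one should pass the
# professionalism metric and the rude one should fail.
evaluate([conversation, rude_conversation], [professionalism_metric])
```
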
[Ask AI: Evaluating Conversations for Professionalism](https://alisol.ir/?ai=Evaluating%20Conversations%20for%20Professionalism|NeuralNine|Evaluate%20LLMs%20in%20Python%20with%20DeepEval)

## Answer Relevancy Metric

**Summary**: This predefined metric computes relevancy as the ratio of relevant statements to total statements in the output, which is useful for ensuring responses stick to the query without fluff.

**Key Takeaway/Example**: For "What is the capital of France?", the answer "Paris" scores 1.0, but adding irrelevant comments like "This is a good question. It is one of the most beautiful cities in Europe." drops the score to 0.25.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Relevancy = relevant statements / total statements in the actual output.
metric = AnswerRelevancyMetric(threshold=0.5, include_reason=True)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="This is a good question. The capital of France is Paris. It is one of the most beautiful cities in Europe."
)

evaluate([test_case], [metric])
```

[Ask AI: Answer Relevancy Metric](https://alisol.ir/?ai=Answer%20Relevancy%20Metric|NeuralNine|Evaluate%20LLMs%20in%20Python%20with%20DeepEval)

## Faithfulness Metric

**Summary**: Measures how well the output aligns with the provided retrieval context, treating that context as the "truth" regardless of real-world accuracy.

**Key Takeaway/Example**: If the context states "The capital of France is Madrid", answering "Madrid" scores 1.0 for faithfulness even though it is factually wrong; deviating to "Paris" should fail, but may not always do so because of model biases.

```python
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Faithfulness judges the answer against the retrieval context, not against reality.
metric = FaithfulnessMetric(threshold=0.5, include_reason=True)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Madrid",
    retrieval_context=["The capital of France is Madrid."]
)

evaluate([test_case], [metric])
```

[Ask AI: Faithfulness Metric](https://alisol.ir/?ai=Faithfulness%20Metric|NeuralNine|Evaluate%20LLMs%20in%20Python%20with%20DeepEval)

## Working with Datasets

**Summary**: Create an evaluation dataset from ground-truth "goldens" (inputs and expected outputs), generate or simulate LLM responses for them, and evaluate the resulting test cases in parallel.

**Key Takeaway/Example**: Build test cases from a list of goldens and evaluate them against metrics; for example, testing math and geography queries yields high scores when the responses match semantically.

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import LLMTestCase

# Ground-truth inputs and expected outputs.
goldens = [
    Golden(input="What is the capital of France?", expected_output="Paris"),
    Golden(input="What is 12 * 3?", expected_output="36")
]

dataset = EvaluationDataset(goldens=goldens)

# Build one test case per golden; simulate_llm_answer stands in for a real LLM call,
# and correctness_metric is the GEval metric defined earlier.
test_cases = [
    LLMTestCase(
        input=golden.input,
        expected_output=golden.expected_output,
        actual_output=simulate_llm_answer(golden.input)
    ) for golden in dataset.goldens
]

evaluate(test_cases, [correctness_metric])
```

[Ask AI: Working with Datasets](https://alisol.ir/?ai=Working%20with%20Datasets|NeuralNine|Evaluate%20LLMs%20in%20Python%20with%20DeepEval)

## Practical Example: Invoice Data Extraction

**Summary**: Demonstrates evaluating LLMs on a realistic task: extracting data from PDF invoices and using a custom G-Eval metric to score accuracy on fields like date, total, and taxes, especially when notes on the invoice alter the values (e.g., amounts given in thousands).

**Key Takeaway/Example**: Models like GPT-4o-mini often miss nuances such as "all prices in thousands", leading to around 60% accuracy; the video uses pdfplumber for text extraction and Pydantic for data modeling, then evaluates the filled fields.

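
As a loose sketch of the extraction side of that pipeline (the `Invoice` field names, the file name, and the helper function are my own assumptions; the actual LLM call and the custom G-Eval scoring are omitted):

```python
import pdfplumber
from pydantic import BaseModel


# Simplified data model for the invoice fields discussed in the video
# (field names and types are assumptions).
class Invoice(BaseModel):
    date: str
    total: float
    taxes: float


def extract_pdf_text(path: str) -> str:
    """Pull the raw text out of an invoice PDF with pdfplumber."""
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


text = extract_pdf_text("invoice.pdf")  # hypothetical file name
# This text would then be sent to a model such as GPT-4o-mini with instructions to
# fill in the Invoice fields, and the result would be scored against the ground truth
# with a custom GEval metric, as described above.
```
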
[Ask AI: Practical Example: Invoice Data Extraction](https://alisol.ir/?ai=Practical%20Example%3A%20Invoice%20Data%20Extraction|NeuralNine|Evaluate%20LLMs%20in%20Python%20with%20DeepEval)

---

**About the summarizer**

I'm *Ali Sol*, a Backend Developer. Learn more:

- Website: [alisol.ir](https://alisol.ir)
- LinkedIn: [linkedin.com/in/alisolphp](https://www.linkedin.com/in/alisolphp)
