
Commit 9ca3074

docs: hello world & explanation (#2114)
1 parent c7cfb4a commit 9ca3074

5 files changed: +187 additions, -68 deletions

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
# Dataset Preparation for Evaluating AI Systems
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
# Experimentation for Improving AI Systems
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# 📚 Explanation

1. [Metrics for Evaluating AI Systems](metrics.md)
2. [Experimentation for Improving AI Systems](experimentation.md)
3. [Dataset Preparation for Evaluating AI Systems](datasets.md)

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# Metrics for Evaluating AI Applications

## Why Metrics Matter

You can't improve what you don't measure. Metrics are the feedback loop that makes iteration possible.

In AI systems, progress depends on running many experiments, each a hypothesis about how to improve performance. But without a clear, reliable metric, you can't tell the difference between a successful experiment (a positive delta between the new score and the old one) and a failed one.

Metrics give you a compass. They let you quantify improvement, detect regressions, and align optimization efforts with user impact and business value.

## Types of Metrics in AI Applications

### 1. End-to-End Metrics

End-to-end metrics evaluate the overall system performance from the user's perspective, treating the AI application as a black box. These metrics quantify key outcomes users care deeply about, based solely on the system's final outputs.

Examples:

- Answer correctness: Measures whether the answers produced by a Retrieval-Augmented Generation (RAG) system are accurate.
- Citation accuracy: Evaluates whether the references cited by the RAG system are correctly identified and relevant.

Optimizing end-to-end metrics ensures tangible improvements aligned directly with user expectations.

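For instance, citation accuracy can be checked without any model calls once you have the sources a response cites and a ground-truth list of relevant sources. The helper below is an illustrative sketch, not part of the Ragas API; the inputs `cited_sources` and `relevant_sources` are assumptions about how you store this data.

```python
def citation_accuracy(cited_sources: list[str], relevant_sources: list[str]) -> float:
    """Fraction of cited sources that are actually relevant (assumed definition)."""
    if not cited_sources:
        return 0.0
    relevant = set(relevant_sources)
    correct = sum(1 for source in cited_sources if source in relevant)
    return correct / len(cited_sources)


# Two of the three citations point to relevant sources -> ~0.67
print(citation_accuracy(["doc_1", "doc_4", "doc_7"], ["doc_1", "doc_4"]))
```
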
### 2. Component-Level Metrics

Component-level metrics assess the individual parts of an AI system independently. These metrics are immediately actionable and facilitate targeted improvements but do not necessarily correlate directly with end-user satisfaction.

Example:

- Retrieval accuracy: Measures how effectively a RAG system retrieves relevant information. A low retrieval accuracy (e.g., 50%) signals that improving this component can enhance overall system performance. However, improving a component alone doesn't guarantee better end-to-end outcomes.

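One simple way to quantify retrieval accuracy is recall over retrieved document IDs. The snippet below is a sketch with assumed inputs, not a Ragas metric:

```python
def retrieval_recall(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """Share of the known-relevant documents that the retriever actually returned."""
    if not relevant_ids:
        return 0.0
    retrieved = set(retrieved_ids)
    hits = sum(1 for doc_id in relevant_ids if doc_id in retrieved)
    return hits / len(relevant_ids)


# Only one of the two relevant documents was retrieved -> 0.5
print(retrieval_recall(["doc_2", "doc_9"], ["doc_2", "doc_5"]))
```
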
### 3. Business Metrics

Business metrics align AI system performance with organizational objectives and quantify tangible business outcomes. These metrics are typically lagging indicators, calculated after a deployment period (days/weeks/months).

Example:

- Ticket deflection rate: Measures the percentage reduction of support tickets due to the deployment of an AI assistant.

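As a rough sketch, the rate can be computed from ticket counts over comparable periods before and after launch (the before/after framing and the numbers below are illustrative assumptions, not Ragas functionality):

```python
def ticket_deflection_rate(tickets_before: int, tickets_after: int) -> float:
    """Percentage drop in support tickets between two comparable periods."""
    if tickets_before == 0:
        return 0.0
    return (tickets_before - tickets_after) / tickets_before * 100


# 1200 tickets/month before the assistant vs. 900 after -> 25.0
print(ticket_deflection_rate(1200, 900))
```
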
39+
40+
## Types of Metrics in Ragas
41+
42+
In Ragas, we categorize metrics based on the type of output they produce. This classification helps clarify how each metric behaves and how its results can be interpreted or aggregated. The three types are:
43+
44+
### 1. Discrete Metrics
45+
46+
These return a single value from a predefined list of categorical classes. There is no implicit ordering among the classes. Common use cases include classifying outputs into categories such as pass/fail or good/okay/bad.
47+
48+
Example:
49+
```python
50+
from ragas_experimental.metrics import discrete_metric
51+
52+
@discrete_metric(name="response_quality", allowed_values=["pass", "fail"])
53+
def my_metric(predicted: str, expected: str) -> str:
54+
return "pass" if predicted.lower() == expected.lower() else "fail"
55+
56+
```
### 2. Numeric Metrics

These return an integer or float value within a specified range. Numeric metrics support aggregation functions such as mean, sum, or mode, making them useful for statistical analysis.

```python
from ragas_experimental.metrics import numeric_metric

@numeric_metric(name="response_accuracy", allowed_values=(0, 1))
def my_metric(predicted: float, expected: float) -> float:
    # Turn relative error into a score inside the declared (0, 1) range:
    # 1.0 is a perfect match, 0.0 means the prediction is far from the expected value.
    relative_error = abs(predicted - expected) / max(abs(expected), 1e-5)
    return max(0.0, 1.0 - relative_error)
```

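Because numeric metrics return plain numbers, their results are straightforward to aggregate. The snippet below assumes the decorated metric exposes the same `.score(...)` call and `.value` attribute used in the Hello World example later in this commit; treat it as a sketch rather than canonical API usage:

```python
from statistics import mean

# Score a few (predicted, expected) pairs with the metric defined above,
# then aggregate the numeric results.
pairs = [(9.8, 10.0), (7.0, 10.0), (10.0, 10.0)]
scores = [my_metric.score(predicted=p, expected=e).value for p, e in pairs]

print(scores)        # individual scores in the (0, 1) range
print(mean(scores))  # aggregate view across the dataset
```
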
### 3. Ranked Metrics

These evaluate multiple outputs at once and return a ranked list based on a defined criterion. They are useful when the goal is to compare outputs relative to one another.

```python
from ragas_experimental.metrics import ranked_metric

@ranked_metric(name="response_ranking", allowed_values=[0, 1])
def my_metric(responses: list) -> list:
    response_lengths = [len(response) for response in responses]
    sorted_indices = sorted(range(len(response_lengths)), key=lambda i: response_lengths[i])
    return sorted_indices
```

## LLM-based vs. Non-LLM-based Metrics

### Non-LLM-based Metrics

These metrics are deterministic functions evaluating predefined inputs against clear, finite criteria.

Example:

```python
def my_metric(predicted: str, expected: str) -> str:
    return "pass" if predicted.lower() == expected.lower() else "fail"
```

When to use:

- Tasks with strictly defined correct outcomes (e.g., mathematical solutions, deterministic tasks like booking agents updating databases).

### LLM-based Metrics

These leverage LLMs (Large Language Models) to evaluate outcomes, and are typically useful where correctness is nuanced or highly variable.

Example:

```python
# `llm` stands in for whatever LLM client your application uses.
def my_metric(predicted: str, expected: str) -> str:
    response = llm.generate(
        f"Rate the semantic similarity between '{predicted}' and '{expected}' "
        "on a scale of 1-10. Reply with the number only."
    )
    return "pass" if int(response.strip()) > 5 else "fail"
```

When to use:

- Tasks with numerous valid outcomes (e.g., paraphrased correct answers).
- Complex evaluation criteria aligned with human or expert preferences (e.g., distinguishing "deep" vs. "shallow" insights in research reports). Although simpler metrics (length or keyword count) are possible, LLM-based metrics capture nuanced human judgment more effectively.

## Choosing the Right Metrics for Your Application

### 1. Prioritize End-to-End Metrics

Focus first on metrics reflecting overall user satisfaction. Many aspects influence user satisfaction, such as factual correctness, response tone, and explanation depth, but concentrate initially on the few dimensions delivering maximum user value (e.g., answer and citation accuracy in a RAG-based assistant).

### 2. Ensure Interpretability

Design metrics clear enough for the entire team to interpret and reason about. For example:

- Execution accuracy in a text-to-SQL system: Does the generated SQL query return exactly the same result set as the ground-truth query crafted by domain experts? (A sketch of this check follows.)

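A minimal sketch of execution accuracy, assuming both queries run against the same SQLite database (the database path and the queries are illustrative, not from Ragas):

```python
import sqlite3

def execution_accuracy(db_path: str, generated_sql: str, ground_truth_sql: str) -> bool:
    """True if both queries return the same rows, ignoring row order."""
    with sqlite3.connect(db_path) as conn:
        generated_rows = conn.execute(generated_sql).fetchall()
        expected_rows = conn.execute(ground_truth_sql).fetchall()
    return sorted(generated_rows, key=repr) == sorted(expected_rows, key=repr)


# Illustrative usage against a hypothetical sales.db
print(execution_accuracy(
    "sales.db",
    "SELECT region, SUM(amount) FROM orders GROUP BY region",
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region",
))
```
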
### 3. Emphasize Objective Over Subjective Metrics

Prioritize metrics with objective criteria, minimizing subjective judgment. Assess objectivity by having team members independently label the same samples and measuring how often they agree. A high inter-rater agreement (≥ 80%) indicates greater objectivity.

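A quick way to check this is simple percent agreement between two labelers; the snippet below is an illustrative sketch with made-up labels:

```python
def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Share of samples on which two annotators assigned the same label."""
    assert len(labels_a) == len(labels_b), "Both annotators must label the same samples"
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


annotator_1 = ["pass", "pass", "fail", "pass", "fail"]
annotator_2 = ["pass", "fail", "fail", "pass", "fail"]
print(percent_agreement(annotator_1, annotator_2))  # 0.8 -> right at the 80% bar
```
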
### 4. Few Strong Signals over Many Weak Signals

Avoid a proliferation of metrics that provide weak signals and impede clear decision-making. Instead, select fewer metrics offering strong, reliable signals. For instance:

- In a conversational AI, a single metric such as goal accuracy (whether the user's objective for interacting with the AI was met) is a stronger proxy for system performance than multiple weak proxies like coherence or helpfulness.

docs/experimental/index.md

Lines changed: 44 additions & 68 deletions
@@ -40,84 +40,60 @@ cd ragas/experimental && pip install -e .

 ## Hello World 👋

-1. Setup a sample experiment.
+Copy this snippet to a file named `hello_world.py` and run `python hello_world.py`

-```
-ragas hello-world
-```
+```python
+import numpy as np
+from ragas_experimental import experiment, Dataset
+from ragas_experimental.metrics import MetricResult, numeric_metric

-2. Run your first experiment with Ragas CLI.

-```
-ragas evals hello_world/evals.py --dataset test_data --metrics accuracy --name first_experiment
-```
+@numeric_metric(name="accuracy_score", allowed_values=(0, 1))
+def accuracy_score(response: str, expected: str):
+    result = 1 if expected.lower().strip() == response.lower().strip() else 0
+    return MetricResult(result=result, reason=f"Match: {result == 1}")

-```
-Running evaluation: hello_world/evals.py
-Dataset: test_data
-Getting dataset: test_data
-✓ Loaded dataset with 10 rows
-Running experiment: 100%|████████████████████████████████████████| 20/20 [00:00<00:00, 4872.00it/s]
-✓ Completed experiments successfully
-╭────────────────────────── Ragas Evaluation Results ──────────────────────────╮
-│ Experiment: lucid_codd │
-│ Dataset: test_data (10 rows) │
-╰───────────────────────────────────────────────────────────────────────────────╯
-Numerical Metrics
-┏━━━━━━━━━━┳━━━━━━━━━┓
-┃ Metric ┃ Current ┃
-┡━━━━━━━━━━╇━━━━━━━━━┩
-│ accuracy │ 0.100 │
-└──────────┴─────────┘
-✓ Experiment results displayed
-✓ Evaluation completed successfully
-```
+def mock_app_endpoint(**kwargs) -> str:
+    return np.random.choice(["Paris", "4", "Blue Whale", "Einstein", "Python"])

-3. Inspect the results
+@experiment()
+async def run_experiment(row):
+    response = mock_app_endpoint(query=row.get("query"))
+    accuracy = accuracy_score.score(response=response, expected=row.get("expected_output"))
+    return {**row, "response": response, "accuracy": accuracy.value}

-```
-tree hello_world/experiments
+if __name__ == "__main__":
+    import asyncio
+
+    # Create dataset inline
+    dataset = Dataset(name="test_dataset", backend="local/csv", root_dir=".")
+    test_data = [
+        {"query": "What is the capital of France?", "expected_output": "Paris"},
+        {"query": "What is 2 + 2?", "expected_output": "4"},
+        {"query": "What is the largest animal?", "expected_output": "Blue Whale"},
+        {"query": "Who developed the theory of relativity?", "expected_output": "Einstein"},
+        {"query": "What programming language is named after a snake?", "expected_output": "Python"},
+    ]
+
+    for sample in test_data:
+        dataset.append(sample)
+    dataset.save()
+
+    # Run experiment
+    results = asyncio.run(run_experiment.arun(dataset, name="first_experiment"))
 ```

-```
-hello_world/experiments
-└── first_experiment.csv
+View Results

-0 directories, 1 files
 ```
-
-4. View the results in a spreadsheet application.
-
-```
-open hello_world/experiments/first_experiment.csv
+├── datasets
+│   └── test_dataset.csv
+└── experiments
+    └── first_experiment.csv
 ```

-5. Run your second experiment and compare with the first one.
-
-```
-ragas evals hello_world/evals.py --dataset test_data --metrics accuracy --baseline first_experiment
-```
+Open the results in a CSV file

-```
-Running evaluation: hello_world/evals.py
-Dataset: test_data
-Baseline: first_experiment
-Getting dataset: test_data
-✓ Loaded dataset with 10 rows
-Running experiment: 100%|████████████████████████████| 20/20 [00:00<00:00, 4900.46it/s]
-✓ Completed experiments successfully
-Comparing against baseline: first_experiment
-╭────────────────────────── Ragas Evaluation Results ──────────────────────────╮
-│ Experiment: vigilant_brin │
-│ Dataset: test_data (10 rows) │
-│ Baseline: first_experiment │
-╰───────────────────────────────────────────────────────────────────────────────╯
-Numerical Metrics
-┏━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━┓
-┃ Metric ┃ Current ┃ Baseline ┃ Delta ┃ Gate ┃
-┡━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━┩
-│ accuracy │ 0.000 │ 0.000 │ ▼0.000 │ pass │
-└──────────┴─────────┴──────────┴────────┴──────┘
-✓ Comparison completed
-✓ Evaluation completed successfully
-```
+```bash
+open experiments/first_experiment.csv
+```
