You can't improve what you don't measure. Metrics are the feedback loop that makes iteration possible.
In AI systems, progress depends on running many experiments, each a hypothesis about how to improve performance. But without a clear, reliable metric, you can't tell the difference between a successful experiment (a positive delta between the new score and the old one) and a failed one.
Metrics give you a compass. They let you quantify improvement, detect regressions, and align optimization efforts with user impact and business value.
## Types of Metrics in AI Applications
### 1. End-to-End Metrics
End-to-end metrics evaluate the overall system performance from the user's perspective, treating the AI application as a black box. These metrics quantify key outcomes users care deeply about, based solely on the system's final outputs.
Examples:
- Answer correctness: Measures whether the answers a Retrieval-Augmented Generation (RAG) system provides are accurate.
- Citation accuracy: Evaluates whether the references cited by the RAG system are correctly identified and relevant.
Optimizing end-to-end metrics ensures tangible improvements aligned directly with user expectations.
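For instance, citation accuracy can be computed directly from the system's final output. A minimal sketch (the `cited_ids`/`relevant_ids` inputs are illustrative, not a Ragas API):

```python
def citation_accuracy(cited_ids: list[str], relevant_ids: list[str]) -> float:
    """Fraction of cited references that appear in the known-relevant set."""
    if not cited_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cited in cited_ids if cited in relevant) / len(cited_ids)
```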
### 2. Component-Level Metrics
Component-level metrics assess the individual parts of an AI system independently. These metrics are immediately actionable and facilitate targeted improvements but do not necessarily correlate directly with end-user satisfaction.
Example:
- Retrieval accuracy: Measures how effectively a RAG system retrieves relevant information. A low retrieval accuracy (e.g., 50%) signals that improving this component can enhance overall system performance. However, improving a component alone doesn't guarantee better end-to-end outcomes.
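A check like this can be sketched as recall over a labeled set of relevant documents (an illustrative helper, not a Ragas API):

```python
def retrieval_recall(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """Fraction of the known-relevant documents that the retriever returned."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

# retrieval_recall(["d1", "d3"], ["d1", "d2"]) -> 0.5, i.e. the 50% case above
```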
### 3. Business Metrics
Business metrics align AI system performance with organizational objectives and quantify tangible business outcomes. These metrics are typically lagging indicators, calculated after a deployment period (days/weeks/months).
Example:
- Ticket deflection rate: Measures the percentage reduction of support tickets due to the deployment of an AI assistant.
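The underlying arithmetic is simple; a sketch, assuming ticket counts from comparable periods before and after deployment:

```python
def ticket_deflection_rate(tickets_before: int, tickets_after: int) -> float:
    """Percentage reduction in support-ticket volume after deployment."""
    if tickets_before == 0:
        return 0.0
    return 100.0 * (tickets_before - tickets_after) / tickets_before

# e.g. 1,000 tickets/month before launch and 700 after -> 30.0% deflection
```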
## Types of Metrics in Ragas
In Ragas, we categorize metrics based on the type of output they produce. This classification helps clarify how each metric behaves and how its results can be interpreted or aggregated. The three types are:
### 1. Discrete Metrics
These return a single value from a predefined list of categorical classes. There is no implicit ordering among the classes. Common use cases include classifying outputs into categories such as pass/fail or good/okay/bad.
Example (a minimal sketch; the decorator arguments follow the `ragas_experimental` API and may differ across versions):

```python
from ragas_experimental.metrics import discrete_metric

@discrete_metric(name="response_quality", allowed_values=["pass", "fail"])
def response_quality(predicted: str, expected: str) -> str:
    """Label a prediction "pass" when it matches the expected answer (case-insensitive)."""
    return "pass" if predicted.strip().lower() == expected.strip().lower() else "fail"
```
### 2. Numeric Metrics

These return an integer or float value within a specified range. Numeric metrics support aggregation functions such as mean, sum, or mode, making them useful for statistical analysis.
Example (a minimal sketch; the `allowed_values` range here is an assumption):

```python
from ragas_experimental.metrics import numeric_metric

@numeric_metric(name="response_accuracy", allowed_values=(0.0, 1.0))
def response_accuracy(predicted: float, expected: float) -> float:
    """Score in [0, 1]: 1.0 for an exact match, shrinking with relative error."""
    return max(0.0, 1.0 - abs(predicted - expected) / max(abs(expected), 1e-9))
```
### 3. Ranked Metrics

These evaluate multiple outputs at once and return a ranked list based on a defined criterion. They are useful when the goal is to compare outputs relative to one another.
Example (a minimal sketch that ranks responses from shortest to longest; the decorator arguments are assumptions):

```python
from ragas_experimental.metrics import ranked_metric

@ranked_metric(name="response_ranking")
def response_ranking(responses: list[str]) -> list[int]:
    """Return the indices of `responses` ordered from shortest to longest."""
    return sorted(range(len(responses)), key=lambda i: len(responses[i]))
```
## LLM-based Metrics

Any of the above metric types can delegate its judgment to an LLM instead of deterministic rules. The sketch below shows a discrete metric that asks a model to grade semantic similarity (`llm` is assumed to be a pre-configured client whose `generate` call returns a numeric 1-10 score):

```python
from ragas_experimental.metrics import discrete_metric

@discrete_metric(name="semantic_similarity", allowed_values=["pass", "fail"])
def semantic_similarity(predicted: str, expected: str) -> str:
    # `llm` is assumed to be configured elsewhere and to return a 1-10 score.
    response = llm.generate(f"Evaluate semantic similarity between '{predicted}' and '{expected}'")
    return "pass" if response > 5 else "fail"
```
When to use:
- Tasks with numerous valid outcomes (e.g., paraphrased correct answers).
114
+
- Complex evaluation criteria aligned with human or expert preferences (e.g., distinguishing "deep" vs. "shallow" insights in research reports). Although simpler metrics (length or keyword count) are possible, LLM-based metrics capture nuanced human judgment more effectively.
## Choosing the Right Metrics for Your Application
### 1. Prioritize End-to-End Metrics
Focus first on metrics reflecting overall user satisfaction. While many aspects influence user satisfaction, such as factual correctness, response tone, and explanation depth, concentrate initially on the few dimensions delivering maximum user value (e.g., answer and citation accuracy in a RAG-based assistant).
### 2. Ensure Interpretability
Design metrics clear enough for the entire team to interpret and reason about. For example:
- Execution accuracy in a text-to-SQL system: Does the generated SQL query return exactly the same dataset as a ground-truth query crafted by domain experts?
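A sketch of such a check, assuming a `sqlite3`-style DB-API connection and order-insensitive comparison of result rows:

```python
from collections import Counter

def execution_accuracy(conn, generated_sql: str, ground_truth_sql: str) -> bool:
    """True iff both queries return the same multiset of rows."""
    generated_rows = conn.execute(generated_sql).fetchall()
    expected_rows = conn.execute(ground_truth_sql).fetchall()
    return Counter(generated_rows) == Counter(expected_rows)
```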
### 3. Emphasize Objective Over Subjective Metrics
Prioritize metrics with objective criteria, minimizing subjective judgment. Assess objectivity by having team members independently label the same samples and measuring their agreement. A high inter-rater agreement (≥ 80%) indicates greater objectivity.
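A simple way to estimate agreement between two annotators is raw percent agreement (chance-corrected statistics such as Cohen's kappa are a common refinement):

```python
def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Share of samples (in %) on which two annotators chose the same label."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("annotators must label the same, non-empty sample set")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)
```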
### 4. Few Strong Signals over Many Weak Signals
Avoid a proliferation of metrics that provide weak signals and impede clear decision-making. Instead, select fewer metrics offering strong, reliable signals. For instance:
- In a conversational AI, a single metric such as goal accuracy (whether the user's objective for interacting with the AI was met) provides a stronger proxy for system performance than multiple weak proxies like coherence or helpfulness.
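Goal accuracy could itself be framed as a discrete, LLM-judged metric in the style shown earlier (a sketch; `llm` is again assumed to be a pre-configured client, here returning the judge's text):

```python
from ragas_experimental.metrics import discrete_metric

@discrete_metric(name="goal_accuracy", allowed_values=["pass", "fail"])
def goal_accuracy(conversation: str, user_goal: str) -> str:
    """Ask an LLM judge whether the user's stated goal was met."""
    verdict = llm.generate(
        f"Goal: {user_goal}\nConversation:\n{conversation}\n"
        "Was the user's goal achieved? Answer yes or no."
    )
    return "pass" if "yes" in verdict.lower() else "fail"
```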
0 commit comments