
Commit e5fa2de

Critique metrics (#70)
## What
Added support for Aspect Critiques.

## Why
Many aspects, like harmlessness and correctness, can be judged on a binary basis to ensure quality, and these judgments are now possible with ragas. Users can also define their own aspects for evaluation.

## How
Added a simple CoT + self-consistency step algorithm.

## Testing
Added the `harmlessness` metric to tests and ran some exercises to ensure quality.

<img width="922" alt="Screenshot 2023-07-21 at 4 58 36 PM" src="https://github.com/explodinggradients/ragas/assets/25312635/a4f94d3e-3c66-4667-af31-54a0885e7fd5">

---------

Co-authored-by: jjmachan <[email protected]>
1 parent 07c4e7e commit e5fa2de
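As a rough illustration of the CoT + self-consistency step mentioned in the description above (a minimal sketch of the idea, not the ragas implementation; `llm_judge` is a hypothetical helper):

```python
def llm_judge(prompt: str) -> str:
    """Hypothetical LLM call returning a chain-of-thought that ends in 'Yes' or 'No'."""
    raise NotImplementedError

def critique(submission: str, aspect_definition: str, strictness: int = 3) -> int:
    """Sample several chain-of-thought verdicts and take the majority (self-consistency)."""
    prompt = (
        f"Aspect: {aspect_definition}\n"
        f"Submission: {submission}\n"
        "Think step by step, then answer Yes or No."
    )
    verdicts = [llm_judge(prompt).strip().endswith("Yes") for _ in range(strictness)]
    # Binary output: 1 if the majority of sampled judgments say the aspect holds
    return int(sum(verdicts) > strictness / 2)
```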

File tree

9 files changed: +668 −44 lines changed


README.md

Lines changed: 3 additions & 2 deletions
@@ -80,12 +80,13 @@ results = evaluate(dataset)
If you want a more in-depth explanation of core components, check out our [quick-start notebook](./docs/quickstart.ipynb)

## :luggage: Metrics

- Ragas measures your pipeline's performance against two dimensions
+ Ragas measures your pipeline's performance against different dimensions

1. **Faithfulness**: measures the information consistency of the generated answer against the given context. Any claims made in the answer that cannot be deduced from the context are penalized.

2. **Context Relevancy**: measures how relevant the retrieved contexts are to the question. Ideally the context should only contain the information necessary to answer the question. The presence of redundant information in the context is penalized.

3. **Answer Relevancy**: measures how relevant the generated answer is to the question. This does not ensure the factuality of the generated answer; rather, it penalizes the presence of redundant information in the generated answer.

+ 4. **Aspect Critiques**: designed to judge the submission against defined aspects like harmlessness, correctness, etc. You can also define your own aspect and validate the submission against it. The output of aspect critiques is always binary.

- Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.
+ The final `ragas_score` is the harmonic mean of the individual metric scores.

To read more about our metrics, check out the [docs](/docs/metrics.md).

## 🫂 Community
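For illustration, the harmonic-mean combination described above can be computed as follows (the metric values are made up; this is a sketch, not the ragas internals):

```python
from statistics import harmonic_mean

# Hypothetical per-metric scores for one evaluation run
scores = {
    "faithfulness": 0.90,
    "context_relevancy": 0.75,
    "answer_relevancy": 0.85,
}

# ragas_score is described as the harmonic mean of the individual metric scores
ragas_score = harmonic_mean(scores.values())
print(round(ragas_score, 2))  # 0.83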

docs/metrics.md

Lines changed: 28 additions & 0 deletions
@@ -1,5 +1,6 @@
# Metrics

1. `faithfulness`: measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm that includes the creation of statements from the generated answer, followed by verifying each of these statements against the context. The answer is scaled to the (0, 1) range. Higher is better.
```python
from ragas.metrics.factuality import Faithfulness
@@ -41,6 +42,33 @@ results = answer_relevancy.score(dataset)
```

4. `Aspect Critiques`: Critiques are LLM evaluators that evaluate your submission against the provided aspect. Several aspects like `correctness`, `harmfulness`, etc. come predefined with Ragas Critiques (check `SUPPORTED_ASPECTS` for the full list). You can also define your own aspect. The `strictness` parameter is used to ensure a level of self-consistency in the predictions (ideal range 2-4). The output of aspect critiques is always binary, indicating whether the submission adhered to the given aspect definition or not. These scores are not considered for the final `ragas_score` due to their non-continuous nature.

- List of predefined aspects:
`correctness`, `harmfulness`, `coherence`, `conciseness`, `maliciousness`

```python
## check predefined aspects
from ragas.metrics.critique import SUPPORTED_ASPECTS
print(SUPPORTED_ASPECTS)

from ragas.metrics.critique import conciseness
from datasets import Dataset  # HuggingFace Dataset holds the evaluation data

# Dataset({
#     features: ['question','answer'],
#     num_rows: 25
# })
dataset: Dataset

results = conciseness.score(dataset)


## Define your critique
from ragas.metrics.critique import AspectCritique
mycritique = AspectCritique(name="my-critique", definition="Is the submission safe for children?", strictness=2)
```
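A custom critique can presumably then be scored like the predefined ones; a usage sketch mirroring the `conciseness` example above:

```python
# Assumption: custom critiques expose the same .score(dataset) interface
results = mycritique.score(dataset)
```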
## Why is ragas better than scoring using GPT 3.5 directly?
LLMs like GPT 3.5 struggle when it comes to scoring generated text directly. For instance, these models tend to generate only integer scores, and these scores vary when the model is invoked differently. Advanced paradigms and techniques that leverage LLMs to minimize this bias are the solution ragas presents.
<h1 align="center">
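To make the "advanced paradigms" point concrete, here is a rough sketch of the multi-step, statement-level verification idea described under `faithfulness` above; `generate` stands in for an LLM call and is hypothetical, not the ragas API:

```python
def generate(prompt: str) -> str:
    """Hypothetical LLM call; stands in for whatever model the pipeline uses."""
    raise NotImplementedError

def faithfulness_score(question: str, answer: str, context: str) -> float:
    # Step 1: rewrite the generated answer as simple, self-contained statements
    statements = generate(
        f"Question: {question}\nAnswer: {answer}\n"
        "Rewrite the answer as short standalone statements, one per line."
    ).splitlines()

    # Step 2: verify each statement against the retrieved context
    supported = 0
    for statement in statements:
        verdict = generate(
            f"Context: {context}\nStatement: {statement}\n"
            "Can this statement be inferred from the context? Answer Yes or No."
        )
        supported += verdict.strip().lower().startswith("yes")

    # Fraction of supported statements, in the (0, 1) range; higher is better
    return supported / len(statements) if statements else 0.0
```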
