
Commit 8d658a1

feat: Ragas CI/CD (#976)
new feature based on the Ragas reproducibility docs: https://ragas--976.org.readthedocs.build/en/976/howtos/applications/add_to_ci.html#
1 parent a13ab02 commit 8d658a1

File tree

10 files changed: +204 additions, -13 deletions


docs/getstarted/index.md

Lines changed: 1 addition & 1 deletion
@@ -41,4 +41,4 @@ Find out how to evaluate your RAG pipeline using your test set (your own dataset
 :link-type: ref

 Discover how to monitor the performance and quality of your RAG application in production.
-:::
+:::

docs/howtos/applications/add_to_ci.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# Adding to your CI pipeline with Pytest

You can add Ragas evaluations as part of your Continuous Integration pipeline
to keep track of the qualitative performance of your RAG pipeline. Consider these
as part of your end-to-end test suite, which you run before major changes and releases.

Usage is straightforward: the main thing is to set the `in_ci` argument of the
`evaluate()` function to `True`. This runs the Ragas metrics in a special mode that
produces more reproducible scores, at the cost of longer and more expensive runs.

You can write a pytest test as follows:

:::{note}
This dataset is already populated with outputs from a reference RAG pipeline.
When testing your own system, make sure you use outputs from the RAG pipeline
you want to test. For more information on how to build your own dataset, check the
[Building HF `Dataset` with your own Data](./data_preparation.md) docs.
:::

```{code-block} python
:caption: tests/e2e/test_amnesty_e2e.py
:linenos:
import pytest
from datasets import load_dataset

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

def assert_in_range(score: float, value: float, plus_or_minus: float):
    """
    Check if the computed score is within the range of value +/- plus_or_minus
    """
    assert value - plus_or_minus <= score <= value + plus_or_minus


def test_amnesty_e2e():
    # loading the V2 dataset
    amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")["eval"]

    result = evaluate(
        amnesty_qa,
        metrics=[answer_relevancy, faithfulness, context_recall, context_precision],
        in_ci=True,
    )
    assert result["answer_relevancy"] >= 0.9
    assert result["context_recall"] >= 0.95
    assert result["context_precision"] >= 0.95
    assert_in_range(result["faithfulness"], value=0.4, plus_or_minus=0.1)
```

## Using Pytest Markers for Ragas E2E tests

Because these are long-running end-to-end tests, you can leverage [Pytest Markers](https://docs.pytest.org/en/latest/example/markers.html) to tag them. It is recommended to mark Ragas tests with a dedicated tag so that you run them only when needed.

To add a new `ragas_ci` marker to pytest, add the following to your `conftest.py`:
```{code-block} python
:caption: conftest.py
def pytest_configure(config):
    """
    configure pytest
    """
    # add `ragas_ci`
    config.addinivalue_line(
        "markers", "ragas_ci: Set of tests that will be run as part of Ragas CI"
    )
```

Now you can use `ragas_ci` to mark all the tests that are part of Ragas CI.

```{code-block} python
:caption: tests/e2e/test_amnesty_e2e.py
:linenos:
:emphasize-added: 19
import pytest
from datasets import load_dataset

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

def assert_in_range(score: float, value: float, plus_or_minus: float):
    """
    Check if the computed score is within the range of value +/- plus_or_minus
    """
    assert value - plus_or_minus <= score <= value + plus_or_minus


@pytest.mark.ragas_ci
def test_amnesty_e2e():
    # loading the V2 dataset
    amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")["eval"]

    result = evaluate(
        amnesty_qa,
        metrics=[answer_relevancy, faithfulness, context_recall, context_precision],
        in_ci=True,
    )
    assert result["answer_relevancy"] >= 0.9
    assert result["context_recall"] >= 0.95
    assert result["context_precision"] >= 0.95
    assert_in_range(result["faithfulness"], value=0.4, plus_or_minus=0.1)
```
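
With the marker in place, you can select just these tests from the command line with `pytest -m ragas_ci` (or exclude them with `pytest -m "not ragas_ci"`). As a minimal sketch, the same selection can also be triggered from Python via `pytest.main`; the script in the caption is a hypothetical helper and not part of this commit.

```{code-block} python
:caption: run_ragas_ci.py (hypothetical helper, not part of this commit)
import pytest

# programmatic equivalent of `pytest -m ragas_ci`: run only the tests
# carrying the `ragas_ci` marker registered in conftest.py
raise SystemExit(pytest.main(["-m", "ragas_ci"]))
```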

docs/howtos/applications/data_preparation.md

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Prepare data for evaluation
+# Building HF Dataset with your own Data

 This tutorial notebook provides a step-by-step guide on how to prepare data for experimenting and evaluating using ragas.

@@ -27,4 +27,4 @@ data_samples = {
     'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
 }
 dataset = Dataset.from_dict(data_samples)
-```
+```

docs/howtos/applications/index.md

Lines changed: 1 addition & 0 deletions
@@ -12,4 +12,5 @@ compare_llms
 custom_prompts
 use_prompt_adaptation
 tracing
+add_to_ci
 ```

src/ragas/_analytics.py

Lines changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ class EvaluationEvent(BaseEvent):
     evaluation_mode: str
     num_rows: int
     language: str
+    in_ci: bool


 class TestsetGenerationEvent(BaseEvent):

src/ragas/evaluation.py

Lines changed: 27 additions & 4 deletions
@@ -20,7 +20,12 @@
 from ragas.llms import llm_factory
 from ragas.llms.base import BaseRagasLLM, LangchainLLMWrapper
 from ragas.metrics._answer_correctness import AnswerCorrectness
-from ragas.metrics.base import Metric, MetricWithEmbeddings, MetricWithLLM
+from ragas.metrics.base import (
+    Metric,
+    MetricWithEmbeddings,
+    MetricWithLLM,
+    is_reproducable,
+)
 from ragas.metrics.critique import AspectCritique
 from ragas.run_config import RunConfig
 from ragas.utils import get_feature_language
@@ -43,6 +48,7 @@ def evaluate(
     llm: t.Optional[BaseRagasLLM | LangchainLLM] = None,
     embeddings: t.Optional[BaseRagasEmbeddings | LangchainEmbeddings] = None,
     callbacks: Callbacks = None,
+    in_ci: bool = False,
     is_async: bool = True,
     run_config: t.Optional[RunConfig] = None,
     raise_exceptions: bool = True,
@@ -71,7 +77,11 @@
         Lifecycle Langchain Callbacks to run during evaluation. Check the
         [langchain documentation](https://python.langchain.com/docs/modules/callbacks/)
         for more information.
-    is_async: bool, optional
+    in_ci: bool
+        Whether the evaluation is running in CI or not. If set to True then some
+        metrics will be run to increase the reproducibility of the evaluations. This
+        will increase the runtime and cost of evaluations. Default is False.
+    is_async: bool
         Whether to run the evaluation in async mode or not. If set to True then the
         evaluation is run by calling the `metric.ascore` method. In case the llm or
         embeddings does not support async then the evaluation can be run in sync mode
@@ -156,9 +166,12 @@
     binary_metrics = []
     llm_changed: t.List[int] = []
     embeddings_changed: t.List[int] = []
+    reproducable_metrics: t.List[int] = []
     answer_correctness_is_set = -1

+    # loop through the metrics and perform initializations
     for i, metric in enumerate(metrics):
+        # set llm and embeddings if not set
         if isinstance(metric, AspectCritique):
             binary_metrics.append(metric.name)
         if isinstance(metric, MetricWithLLM) and metric.llm is None:
@@ -174,9 +187,15 @@
         if isinstance(metric, AnswerCorrectness):
             if metric.answer_similarity is None:
                 answer_correctness_is_set = i
+        # set reproducibility for metrics if in CI
+        if in_ci and is_reproducable(metric):
+            if metric.reproducibility == 1:  # type: ignore
+                # only set a value if not already set
+                metric.reproducibility = 3  # type: ignore
+                reproducable_metrics.append(i)

-    # initialize all the models in the metrics
-    [m.init(run_config) for m in metrics]
+        # init all the models
+        metric.init(run_config)

     executor = Executor(
         desc="Evaluating",
@@ -248,6 +267,9 @@
             AnswerCorrectness, metrics[answer_correctness_is_set]
         ).answer_similarity = None

+    for i in reproducable_metrics:
+        metrics[i].reproducibility = 1  # type: ignore
+
     # log the evaluation event
     metrics_names = [m.name for m in metrics]
     metric_lang = [get_feature_language(m) for m in metrics]
@@ -259,6 +281,7 @@
             evaluation_mode="",
             num_rows=dataset.shape[0],
             language=metric_lang[0] if len(metric_lang) > 0 else "",
+            in_ci=in_ci,
         )
     )
     return result

src/ragas/metrics/base.py

Lines changed: 9 additions & 6 deletions
@@ -61,13 +61,11 @@ def get_required_columns(
 class Metric(ABC):
     @property
     @abstractmethod
-    def name(self) -> str:
-        ...
+    def name(self) -> str: ...

     @property
     @abstractmethod
-    def evaluation_mode(self) -> EvaluationMode:
-        ...
+    def evaluation_mode(self) -> EvaluationMode: ...

     @abstractmethod
     def init(self, run_config: RunConfig):
@@ -129,8 +127,9 @@ async def ascore(
         return score

     @abstractmethod
-    async def _ascore(self, row: t.Dict, callbacks: Callbacks, is_async: bool) -> float:
-        ...
+    async def _ascore(
+        self, row: t.Dict, callbacks: Callbacks, is_async: bool
+    ) -> float: ...


 @dataclass
@@ -219,4 +218,8 @@ def get_segmenter(
     )


+def is_reproducable(metric: Metric) -> bool:
+    return hasattr(metric, "_reproducibility")
+
+
 ensembler = Ensember()
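
The new `is_reproducable` helper is only an attribute check: any metric object that defines `_reproducibility` is treated as reproducibility-aware by `evaluate()`. A minimal sketch of that behaviour (the `_DemoMetric` class below is purely illustrative and not part of this commit):

```python
from ragas.metrics.base import is_reproducable

class _DemoMetric:
    # illustrative stand-in: defining `_reproducibility` is what the check looks for
    _reproducibility = 1

print(is_reproducable(_DemoMetric()))  # True: `_reproducibility` is defined
print(is_reproducable(object()))       # False: the attribute is missing
```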

tests/conftest.py

Lines changed: 10 additions & 0 deletions
@@ -11,6 +11,16 @@
 from ragas.llms.prompt import PromptValue


+def pytest_configure(config):
+    """
+    configure pytest
+    """
+    # add `ragas_ci` marker
+    config.addinivalue_line(
+        "markers", "ragas_ci: Set of tests that will be run as part of Ragas CI"
+    )
+
+
 class FakeTestLLM(BaseRagasLLM):
     def llm(self):
         return self

tests/e2e/test_amnesty_in_ci.py

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+import pytest
+from datasets import load_dataset
+
+from ragas import evaluate
+from ragas.metrics import (
+    answer_relevancy,
+    faithfulness,
+    context_recall,
+    context_precision,
+)
+
+# loading the V2 dataset
+amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")["eval"]
+
+
+def assert_in_range(score: float, value: float, plus_or_minus: float):
+    """
+    Check if the computed score is within the range of value +/- plus_or_minus
+    """
+    assert value - plus_or_minus <= score <= value + plus_or_minus
+
+
+@pytest.mark.ragas_ci
+def test_amnesty_e2e():
+    result = evaluate(
+        amnesty_qa,
+        metrics=[answer_relevancy, faithfulness, context_recall, context_precision],
+        in_ci=True,
+    )
+    assert result["answer_relevancy"] >= 0.9
+    assert result["context_recall"] >= 0.95
+    assert result["context_precision"] >= 0.95
+    assert_in_range(result["faithfulness"], value=0.4, plus_or_minus=0.1)
+
+
+@pytest.mark.ragas_ci
+def test_assert_in_range():
+    assert_in_range(0.5, value=0.1, plus_or_minus=0.1)

tests/unit/test_analytics.py

Lines changed: 2 additions & 0 deletions
@@ -26,13 +26,15 @@ def test_evaluation_event():
         num_rows=1,
         evaluation_mode="",
         language="english",
+        in_ci=True,
     )

     payload = dict(evaluation_event)
     assert isinstance(payload.get("user_id"), str)
     assert isinstance(payload.get("evaluation_mode"), str)
     assert isinstance(payload.get("metrics"), list)
     assert isinstance(payload.get("language"), str)
+    assert isinstance(payload.get("in_ci"), bool)


 def setup_user_id_filepath(tmp_path, monkeypatch):
