
Commit 7df8cbc

feat: open usage tracking (#52)
Adds the following environment variables for debugging:

```
export RAGAS_DEBUG=True
export __RAGAS_DEBUG_TRACKING=True
export RAGAS_DO_NOT_TRACK=True
```

Fixes #51
1 parent 8e29433 commit 7df8cbc
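
For quick reference, here is a sketch of what each flag toggles, inferred from the diffs in this commit; it is only the Python equivalent of the exports above, not part of the commit itself:

```python
import os

# RAGAS_DEBUG: ragas.utils.get_debug_mode() switches logging to DEBUG via logging.basicConfig.
os.environ["RAGAS_DEBUG"] = "True"

# __RAGAS_DEBUG_TRACKING: ragas._analytics.track() logs the event payload
# instead of POSTing it to USAGE_TRACKING_URL.
os.environ["__RAGAS_DEBUG_TRACKING"] = "True"

# RAGAS_DO_NOT_TRACK: ragas._analytics.do_not_track() turns track() into a no-op.
os.environ["RAGAS_DO_NOT_TRACK"] = "True"
```

All three checks are cached with `lru_cache`, so the variables need to be in place before the first evaluation runs.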

File tree

5 files changed: 129 additions, 16 deletions


Makefile

Lines changed: 1 addition & 1 deletion
```diff
@@ -27,7 +27,7 @@ clean: ## Clean all generated files
 run-ci: format lint type ## Running all CI checks
 run-benchmarks: ## Run benchmarks
 	@echo "Running benchmarks..."
-	@cd $(GIT_ROOT)/tests/benchmarks && python benchmark.py
+	@cd $(GIT_ROOT)/tests/benchmarks && python benchmark_eval.py
 test: ## Run tests
 	@echo "Running tests..."
 	@pytest tests/unit
```

README.md

Lines changed: 14 additions & 12 deletions
```diff
@@ -33,6 +33,7 @@
 <a href="#fire-quickstart">Quickstart</a> |
 <a href="#luggage-metrics">Metrics</a> |
 <a href="#-community">Community</a> |
+<a href="#-open-analytics">Open Analytics</a> |
 <a href="#raising_hand_man-faq">FAQ</a> |
 <a href="https://huggingface.co/explodinggradients">Hugging Face</a>
 <p>
@@ -86,28 +87,29 @@ Ragas measures your pipeline's performance against two dimensions
 Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.

 To read more about our metrics, checkout [docs](/docs/metrics.md).
-## :question: How to use Ragas to improve your pipeline?
-*"Measurement is the first step that leads to control and eventually to improvement" - James Harrington*
+## 🫂 Community
+If you want to get more involved with Ragas, check out our [discord server](https://discord.gg/5djav8GGNZ). It's a fun community where we geek out about LLM, Retrieval, Production issues and more.

-Here we assume that you already have your RAG pipeline ready. When it comes to RAG pipelines, there are mainly two parts - Retriever and generator. A change in any of this should also impact your pipelines's quality.
+## 🔍 Open Analytics
+We track very basic usage metrics to guide us to figure out what our users want, what is working and what's not. As a young startup, we have to be brutally honest about this which is why we are tracking these metrics. But as an Open Startup we open-source all the data we collect. You can read more about this [here](https://github.com/explodinggradients/ragas/issues/49). If you want to take a look at exactly what we track, feel free to check the [code](./src/ragas/_analytics.py)

-1. First, decide one parameter that you're interested in adjusting. for example the number of retrieved documents, K.
-2. Collect a set of sample prompts (min 20) to form your test set.
-3. Run your pipeline using the test set before and after the change. Each time record the prompts with context and generated output.
-4. Run ragas evaluation for each of them to generate evaluation scores.
-5. Compare the scores and you will know how much the change has affected your pipelines' performance.
+You can disable usage-tracking if you want by setting the `RAGAS_DO_NOT_TRACK` flag to true.

-## 🫂 Community
-If you want to get more involved with Ragas, check out our [discord server](https://discord.gg/5djav8GGNZ). It's a fun community where we geek out about LLM, Retrieval, Production issues and more.

 ## :raising_hand_man: FAQ
 1. Why harmonic mean?

 Harmonic mean penalizes extreme values. For example, if your generated answer is fully factually consistent with the context (faithfulness = 1) but is not relevant to the question (relevancy = 0), a simple average would give you a score of 0.5 but a harmonic mean will give you 0.0

+2. How to use Ragas to improve your pipeline?

+*"Measurement is the first step that leads to control and eventually to improvement" - James Harrington*

+Here we assume that you already have your RAG pipeline ready. When it comes to RAG pipelines, there are mainly two parts - Retriever and generator. A change in any of this should also impact your pipelines's quality.

-
-
+1. First, decide one parameter that you're interested in adjusting. for example the number of retrieved documents, K.
+2. Collect a set of sample prompts (min 20) to form your test set.
+3. Run your pipeline using the test set before and after the change. Each time record the prompts with context and generated output.
+4. Run ragas evaluation for each of them to generate evaluation scores.
+5. Compare the scores and you will know how much the change has affected your pipelines' performance.

```
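
If it helps to see the new opt-out flag in isolation, here is a minimal sketch (assuming ragas and its dependencies are installed); because `do_not_track()` caches its result with `lru_cache`, the variable has to be set before the first tracking call:

```python
import os

# Opt out before any tracking code runs; do_not_track() caches its answer.
os.environ["RAGAS_DO_NOT_TRACK"] = "true"

from ragas._analytics import do_not_track

assert do_not_track() is True  # track() is a no-op from here on
```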

src/ragas/_analytics.py

Lines changed: 84 additions & 0 deletions
```diff
@@ -0,0 +1,84 @@
+from __future__ import annotations
+
+import logging
+import os
+import typing as t
+from dataclasses import asdict, dataclass
+from functools import lru_cache, wraps
+
+import requests
+
+from ragas.utils import get_debug_mode
+
+if t.TYPE_CHECKING:
+    P = t.ParamSpec("P")
+    T = t.TypeVar("T")
+    AsyncFunc = t.Callable[P, t.Coroutine[t.Any, t.Any, t.Any]]
+
+logger = logging.getLogger(__name__)
+
+
+USAGE_TRACKING_URL = "https://t.explodinggradients.com"
+RAGAS_DO_NOT_TRACK = "RAGAS_DO_NOT_TRACK"
+RAGAS_DEBUG_TRACKING = "__RAGAS_DEBUG_TRACKING"
+USAGE_REQUESTS_TIMEOUT_SEC = 1
+
+
+@lru_cache(maxsize=1)
+def do_not_track() -> bool:  # pragma: no cover
+    # Returns True if and only if the environment variable is defined and has value True
+    # The function is cached for better performance.
+    return os.environ.get(RAGAS_DO_NOT_TRACK, str(False)).lower() == "true"
+
+
+@lru_cache(maxsize=1)
+def _usage_event_debugging() -> bool:
+    # For BentoML developers only - debug and print event payload if turned on
+    return os.environ.get(RAGAS_DEBUG_TRACKING, str(False)).lower() == "true"
+
+
+def silent(func: t.Callable[P, T]) -> t.Callable[P, T]:  # pragma: no cover
+    # Silent errors when tracking
+    @wraps(func)
+    def wrapper(*args: P.args, **kwargs: P.kwargs) -> t.Any:
+        try:
+            return func(*args, **kwargs)
+        except Exception as err:  # pylint: disable=broad-except
+            if _usage_event_debugging():
+                if get_debug_mode():
+                    logger.error(
+                        "Tracking Error: %s", err, stack_info=True, stacklevel=3
+                    )
+                else:
+                    logger.info("Tracking Error: %s", err)
+            else:
+                logger.debug("Tracking Error: %s", err)
+
+    return wrapper
+
+
+@dataclass
+class BaseEvent:
+    event_type: str
+
+
+@dataclass
+class EvaluationEvent(BaseEvent):
+    metrics: list[str]
+    evaluation_mode: str
+    num_rows: int
+
+
+@silent
+def track(event_properties: BaseEvent):
+    if do_not_track():
+        return
+
+    payload = asdict(event_properties)
+
+    if _usage_event_debugging():
+        # For internal debugging purpose
+        logger.info("Tracking Payload: %s", payload)
+        return
+
+    requests.post(USAGE_TRACKING_URL, json=payload, timeout=USAGE_REQUESTS_TIMEOUT_SEC)
```
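
As a quick sanity check of the new module, here is a sketch of what `track()` does with an `EvaluationEvent` when the debug flag is on (assumes ragas is installed; the metric names are just illustrative strings):

```python
import logging
import os

# Log the payload at INFO level instead of POSTing it to USAGE_TRACKING_URL.
os.environ["__RAGAS_DEBUG_TRACKING"] = "true"
logging.basicConfig(level=logging.INFO)

from ragas._analytics import EvaluationEvent, track

track(
    EvaluationEvent(
        event_type="evaluation",
        metrics=["factuality", "answer_relevancy"],  # illustrative metric names
        evaluation_mode="",
        num_rows=20,
    )
)
# INFO:ragas._analytics:Tracking Payload: {'event_type': 'evaluation', ...}
```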

src/ragas/evaluation.py

Lines changed: 17 additions & 3 deletions
```diff
@@ -6,6 +6,7 @@
 import numpy as np
 from datasets import Dataset, concatenate_datasets

+from ragas._analytics import EvaluationEvent, track
 from ragas.metrics.base import Metric

 EvaluationMode = Enum("EvaluationMode", "generative retrieval grounded")
@@ -17,7 +18,7 @@ def get_evaluation_mode(ds: Dataset):

     possible evaluation types
     1. (q,a,c)
-    2. (q)
+    2. (q,a)
     3. (q,c)
     4. (g,a)
     """
@@ -87,6 +88,17 @@ def evaluate(
     for metric in metrics:
         scores.append(metric.score(dataset).select_columns(metric.name))

+    # log the evaluation event
+    metrics_names = [m.name for m in metrics]
+    track(
+        EvaluationEvent(
+            event_type="evaluation",
+            metrics=metrics_names,
+            evaluation_mode="",
+            num_rows=dataset.shape[0],
+        )
+    )
+
     return Result(scores=concatenate_datasets(scores, axis=1), dataset=dataset)


@@ -117,7 +129,9 @@ def to_pandas(self, batch_size: int | None = None, batched: bool = False):

     def __repr__(self) -> str:
         scores = self.copy()
-        ragas_score = scores.pop("ragas_score")
-        score_strs = [f"'ragas_score': {ragas_score:0.4f}"]
+        score_strs = []
+        if "ragas_score" in scores:
+            ragas_score = scores.pop("ragas_score")
+            score_strs += f"'ragas_score': {ragas_score:0.4f}"
         score_strs.extend([f"'{k}': {v:0.4f}" for k, v in scores.items()])
         return "{" + ", ".join(score_strs) + "}"
```

src/ragas/utils.py

Lines changed: 13 additions & 0 deletions
```diff
@@ -1,12 +1,16 @@
 from __future__ import annotations

+import logging
+import os
 import typing as t
+from functools import lru_cache
 from warnings import warn

 import torch
 from torch import device as Device

 DEVICES = ["cpu", "cuda"]
+DEBUG_ENV_VAR = "RAGAS_DEBUG"


 def device_check(device: t.Literal["cpu", "cuda"] | Device) -> torch.device:
@@ -19,3 +23,12 @@ def device_check(device: t.Literal["cpu", "cuda"] | Device) -> torch.device:
         device = "cpu"

     return torch.device(device)
+
+
+@lru_cache(maxsize=1)
+def get_debug_mode() -> bool:
+    if os.environ.get(DEBUG_ENV_VAR, str(False)).lower() == "true":
+        logging.basicConfig(level=logging.DEBUG)
+        return True
+    else:
+        return False
```
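
And a small sketch of the new helper in use (again assuming ragas and its torch dependency are importable); since the result is cached with `lru_cache`, `RAGAS_DEBUG` has to be set before the first call:

```python
import os

os.environ["RAGAS_DEBUG"] = "true"

from ragas.utils import get_debug_mode

# First call reads RAGAS_DEBUG, switches logging to DEBUG via basicConfig,
# and caches the result for later calls.
assert get_debug_mode() is True
```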
