
Commit 40ad4f7

feat(context managers): added Context Managers to help with tracing (#83)
todo
- [x] greenlight this with shahul about the UI

Added a context manager to group together runs in order to make it easier to visualise what is happening inside ragas.

future steps
- connect the output values along with the context too
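As a rough illustration of the idea only (a hypothetical sketch, not the actual ragas implementation; `trace_group` and its behaviour are made up for this example), a run-grouping context manager can be as simple as:

```python
# Hypothetical sketch: group everything executed inside the `with` block under one named run.
from contextlib import contextmanager


@contextmanager
def trace_group(name: str):
    # stand-ins for opening/closing a parent run in the tracer
    print(f"start group: {name}")
    try:
        yield
    finally:
        print(f"end group: {name}")


# usage: metrics evaluated inside the block would show up grouped under "ragas evaluation"
with trace_group("ragas evaluation"):
    ...  # e.g. evaluate(dataset, metrics=[...])
```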
1 parent 186ccfa commit 40ad4f7

File tree

12 files changed (+504, -223 lines)

Two binary image files (240 KB and 117 KB); previews not rendered.

docs/integrations/langsmith.ipynb

Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "98727749",
6+
"metadata": {},
7+
"source": [
8+
"# Langsmith Integrations\n",
9+
"\n",
10+
"[Langsmith](https://docs.smith.langchain.com/) in a platform for building production-grade LLM applications from the langchain team. It helps you with tracing, debugging and evaluting LLM applications.\n",
11+
"\n",
12+
"The langsmith + ragas integrations offer 2 features\n",
13+
"1. View the traces of ragas `evaluator` \n",
14+
"2. Use ragas metrics in langchain evaluation - (soon)\n",
15+
"\n",
16+
"\n",
17+
"### Tracing ragas metrics\n",
18+
"\n",
19+
"since ragas uses langchain under the hood all you have to do is setup langsmith and your traces will be logged.\n",
20+
"\n",
21+
"to setup langsmith make sure the following env-vars are set (you can read more in the [langsmith docs](https://docs.smith.langchain.com/#quick-start)\n",
22+
"\n",
23+
"```bash\n",
24+
"export LANGCHAIN_TRACING_V2=true\n",
25+
"export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com\n",
26+
"export LANGCHAIN_API_KEY=<your-api-key>\n",
27+
"export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to \"default\"\n",
28+
"```\n",
29+
"\n",
30+
"Once langsmith is setup, just run the evaluations as your normally would"
31+
]
32+
},
33+
{
34+
"cell_type": "code",
35+
"execution_count": 1,
36+
"id": "27947474",
37+
"metadata": {},
38+
"outputs": [
39+
{
40+
"name": "stderr",
41+
"output_type": "stream",
42+
"text": [
43+
"Found cached dataset fiqa (/home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/ragas_eval/1.0.0/3dc7b639f5b4b16509a3299a2ceb78bf5fe98ee6b5fee25e7d5e4d290c88efb8)\n"
44+
]
45+
},
46+
{
47+
"data": {
48+
"application/vnd.jupyter.widget-view+json": {
49+
"model_id": "dc5a62b3aebb45d690d9f0dcc783deea",
50+
"version_major": 2,
51+
"version_minor": 0
52+
},
53+
"text/plain": [
54+
" 0%| | 0/1 [00:00<?, ?it/s]"
55+
]
56+
},
57+
"metadata": {},
58+
"output_type": "display_data"
59+
},
60+
{
61+
"name": "stdout",
62+
"output_type": "stream",
63+
"text": [
64+
"evaluating with [context_relavency]\n"
65+
]
66+
},
67+
{
68+
"name": "stderr",
69+
"output_type": "stream",
70+
"text": [
71+
"100%|████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.90s/it]\n"
72+
]
73+
},
74+
{
75+
"name": "stdout",
76+
"output_type": "stream",
77+
"text": [
78+
"evaluating with [faithfulness]\n"
79+
]
80+
},
81+
{
82+
"name": "stderr",
83+
"output_type": "stream",
84+
"text": [
85+
"100%|████████████████████████████████████████████████████████████| 1/1 [00:21<00:00, 21.01s/it]\n"
86+
]
87+
},
88+
{
89+
"name": "stdout",
90+
"output_type": "stream",
91+
"text": [
92+
"evaluating with [answer_relevancy]\n"
93+
]
94+
},
95+
{
96+
"name": "stderr",
97+
"output_type": "stream",
98+
"text": [
99+
"100%|████████████████████████████████████████████████████████████| 1/1 [00:07<00:00, 7.36s/it]\n"
100+
]
101+
},
102+
{
103+
"data": {
104+
"text/plain": [
105+
"{'ragas_score': 0.1837, 'context_relavency': 0.0707, 'faithfulness': 0.8889, 'answer_relevancy': 0.9403}"
106+
]
107+
},
108+
"execution_count": 1,
109+
"metadata": {},
110+
"output_type": "execute_result"
111+
}
112+
],
113+
"source": [
114+
"from datasets import load_dataset\n",
115+
"from ragas.metrics import context_relevancy, answer_relevancy, faithfulness\n",
116+
"from ragas import evaluate\n",
117+
"\n",
118+
"\n",
119+
"fiqa_eval = load_dataset(\"explodinggradients/fiqa\", \"ragas_eval\")\n",
120+
"\n",
121+
"result = evaluate(\n",
122+
" fiqa_eval[\"baseline\"].select(range(3)), \n",
123+
" metrics=[context_relevancy, faithfulness, answer_relevancy]\n",
124+
")\n",
125+
"\n",
126+
"result"
127+
]
128+
},
129+
{
130+
"cell_type": "markdown",
131+
"id": "0b862b5d",
132+
"metadata": {},
133+
"source": [
134+
"Voila! Now you can head over to your project and see the traces\n",
135+
"\n",
136+
"![](../assets/langsmith-tracing-overview.png)\n",
137+
"this shows the langsmith tracing dashboard overview\n",
138+
"\n",
139+
"![](../assets/langsmith-tracing-faithfullness.png)\n",
140+
"this shows the traces for the faithfullness metrics. As you can see being able to view the reasons why "
141+
]
142+
},
143+
{
144+
"cell_type": "code",
145+
"execution_count": null,
146+
"id": "febeef63",
147+
"metadata": {},
148+
"outputs": [],
149+
"source": [
150+
"\"../assets/langsmith-tracing-overview.png\"\n",
151+
"\"../assets/langsmith-tracing-faithfullness.png\""
152+
]
153+
}
154+
],
155+
"metadata": {
156+
"kernelspec": {
157+
"display_name": "Python 3 (ipykernel)",
158+
"language": "python",
159+
"name": "python3"
160+
},
161+
"language_info": {
162+
"codemirror_mode": {
163+
"name": "ipython",
164+
"version": 3
165+
},
166+
"file_extension": ".py",
167+
"mimetype": "text/x-python",
168+
"name": "python",
169+
"nbconvert_exporter": "python",
170+
"pygments_lexer": "ipython3",
171+
"version": "3.10.12"
172+
}
173+
},
174+
"nbformat": 4,
175+
"nbformat_minor": 5
176+
}
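The notebook above sets the langsmith environment variables in the shell; if you prefer to configure them from inside Python (for example at the top of the notebook), a stdlib-only equivalent sketch is:

```python
# Same variables as the `export` commands in the notebook, set before running the evaluation.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "<your-project>"  # optional, defaults to "default"
```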

docs/metrics.md

Lines changed: 11 additions & 5 deletions
@@ -1,7 +1,8 @@
11
# Metrics
22

3+
### `Faithfulness`
34

4-
1. `faithfulness` : measures the factual consistency of the generated answer against the given context. This is done using a multi step paradigm that includes creation of statements from the generated answer followed by verifying each of these statements against the context. The answer is scaled to (0,1) range. Higher the better.
5+
This measures the factual consistency of the generated answer against the given context. It is done using a multi-step paradigm that includes creating statements from the generated answer and then verifying each of these statements against the context. The score is scaled to the (0,1) range; higher is better.
56
```python
67
from ragas.metrics.factuality import Faithfulness
78
faithfulness = Faithfulness()
@@ -14,8 +15,9 @@ dataset: Dataset
1415

1516
results = faithfulness.score(dataset)
1617
```
18+
### `ContextRelevancy`
1719

18-
2. `context_relevancy`: measures how relevant is the retrieved context to the prompt. This is done using a combination of OpenAI models and cross-encoder models. To improve the score one can try to optimize the amount of information present in the retrieved context.
20+
This measures how relevant the retrieved context is to the prompt. It is done using a combination of OpenAI models and cross-encoder models. To improve the score, one can try to optimize the amount of information present in the retrieved context.
1921
```python
2022
from ragas.metrics.context_relevancy import ContextRelevancy
2123
context_rel = ContextRelevancy(strictness=3)
@@ -28,7 +30,9 @@ dataset: Dataset
2830
results = context_rel.score(dataset)
2931
```
3032

31-
3. `answer_relevancy`: measures how relevant is the generated answer to the prompt. If the generated answer is incomplete or contains redundant information the score will be low. This is quantified by working out the chance of an LLM generating the given question using the generated answer. Values range (0,1), higher the better.
33+
### `AnswerRelevancy`
34+
35+
This measures how relevant the generated answer is to the prompt. If the generated answer is incomplete or contains redundant information, the score will be low. This is quantified by working out the chance of an LLM generating the given question from the generated answer. Values range over (0,1); higher is better.
3236
```python
3337
from ragas.metrics.answer_relevancy import AnswerRelevancy
3438
answer_relevancy = AnswerRelevancy(model_name="t5-small")
@@ -42,7 +46,9 @@ results = answer_relevancy.score(dataset)
4246
```
4347

4448

45-
4. `Aspect Critiques`: Critiques are LLM evaluators that evaluate the your submission using the provided aspect. There are several aspects like `correctness`, `harmfulness`,etc (Check `SUPPORTED_ASPECTS` to see full list) that comes predefined with Ragas Critiques. If you wish to define your own aspect you can also do this. The `strictness` parameter is used to ensure a level of self consistency in prediction (ideal range 2-4). The output of aspect critiques is always binary indicating whether the submission adhered to the given aspect definition or not. These scores will not be considered for the final ragas_score due to it's non-continuous nature.
49+
### `AspectCritique`
50+
51+
Critiques are LLM evaluators that assess your submission against a given aspect. Several aspects like `correctness`, `harmfulness`, etc. come predefined with Ragas Critiques (check `SUPPORTED_ASPECTS` for the full list), and you can also define your own aspect. The `strictness` parameter is used to ensure a level of self-consistency in predictions (ideal range 2-4). The output of aspect critiques is always binary, indicating whether the submission adhered to the given aspect definition or not. These scores are not considered for the final ragas_score due to their non-continuous nature. A usage sketch is shown after the list below.
4652
- List of predefined aspects:
4753
`correctness`,`harmfulness`,`coherence`,`conciseness`,`maliciousness`
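To make this concrete, here is a minimal usage sketch in the same style as the metrics above; the `ragas.metrics.critique` module path and the `AspectCritique(name=..., definition=..., strictness=...)` arguments are assumptions here, so check the ragas source for the exact API:

```python
# Sketch only: module path and constructor arguments are assumed, not taken from this diff.
from ragas.metrics.critique import AspectCritique
from datasets import Dataset

dataset: Dataset  # same Dataset shape used by the other metrics

# a custom aspect defined by a name and a natural-language definition
conciseness = AspectCritique(
    name="conciseness",
    definition="Is the submission concise and to the point?",
    strictness=3,  # ideal range 2-4 for self-consistency
)

results = conciseness.score(dataset)  # binary 0/1 verdict per sample
```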
4854

@@ -76,4 +82,4 @@ LLM like GPT 3.5 struggle when it comes to scoring generated text directly. For
7682
src="./assets/bar-graph.svg">
7783
</h1>
7884

79-
Take a look at our experiments [here](/experiments/assesments/metrics_assesments.ipynb)
85+
Take a look at our experiments [here](/experiments/assesments/metrics_assesments.ipynb)
