Commit c2a64d5

feat: support for multiple n in llms (#197)
- adds support for all the opensource models
- Make `OPENAI_API_KEY` not required when initing the library

fixes: #115 #74
1 parent c477e1e commit c2a64d5

18 files changed: +376, -171 lines changed

.github/workflows/ci.yaml

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ jobs:
         OPTS=(--dist loadfile -n auto)
       fi
       # Now run the unit tests
-      OPENAI_API_KEY="test" pytest tests/unit "${OPTS[@]}"
+      pytest tests/unit "${OPTS[@]}"

  codestyle_check:
    runs-on: ubuntu-latest

docs/getstarted/evaluation.md

Lines changed: 12 additions & 19 deletions
@@ -5,9 +5,10 @@ welcome to the ragas quickstart. We're going to get you up and running with raga
 
 to kick things of lets start with the data
 
-```{note}
-Are you using Azure OpenAI endpoints? Then checkout [this quickstart guide](./guides/quickstart-azure-openai.ipynb)
-```
+:::{note}
+Are you using Azure OpenAI endpoints? Then checkout [this quickstart
+guide](../howtos/customisations/azure-openai.ipynb)
+:::
 
 ```bash
 pip install ragas
@@ -21,22 +22,15 @@ os.environ["OPENAI_API_KEY"] = "your-openai-key"
 ```
 ## The Data
 
-Ragas performs a `ground_truth` free evaluation of your RAG pipelines. This is because for most people building a gold labeled dataset which represents in the distribution they get in production is a very expensive process.
-
-```{note}
-While originally ragas was aimed at `ground_truth` free evaluations there is some aspects of the RAG pipeline that need `ground_truth` in order to measure. We're in the process of building a testset generation features that will make it easier. Checkout [issue#136](https://github.com/explodinggradients/ragas/issues/136) for more details.
-```
+For this tutorial we are going to use an example dataset from one of the baselines we created for the [Financial Opinion Mining and Question Answering (fiqa) Dataset](https://sites.google.com/view/fiqa/). The dataset has the following columns.
 
-Hence to work with ragas all you need are the following data
 - question: `list[str]` - These are the questions your RAG pipeline will be evaluated on.
 - answer: `list[str]` - The answer generated from the RAG pipeline and given to the user.
 - contexts: `list[list[str]]` - The contexts which were passed into the LLM to answer the question.
 - ground_truths: `list[list[str]]` - The ground truth answer to the questions. (only required if you are using context_recall)
 
 Ideally your list of questions should reflect the questions your users give, including those that you have been problematic in the past.
 
-Here we're using an example dataset from on of the baselines we created for the [Financial Opinion Mining and Question Answering (fiqa) Dataset](https://sites.google.com/view/fiqa/) we created.
-
 
 ```{code-block} python
 :caption: import sample dataset
@@ -46,10 +40,10 @@ fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
 fiqa_eval
 ```
 
-```{seealso}
-See [prepare-data](/docs/concepts/prepare_data.md) to learn how to prepare your own custom data for evaluation.
+:::{seealso}
+See [testset generation](./testset_generation.md) to learn how to generate your own synthetic data for evaluation.
+:::
 
-```
 ## Metrics
 
 Ragas provides you with a few metrics to evaluate the different aspects of your RAG systems namely
@@ -78,9 +72,9 @@ here you can see that we are using 4 metrics, but what do the represent?
 4. context_recall: measures the ability of the retriever to retrieve all the necessary information needed to answer the question.
 
 
-```{note}
-by default these metrics are using OpenAI's API to compute the score. If you using this metric make sure you set the environment key `OPENAI_API_KEY` with your API key. You can also try other LLMs for evaluation, check the [llm guide](./guides/llms.ipynb) to learn more
-```
+:::{note}
+by default these metrics are using OpenAI's API to compute the score. If you using this metric make sure you set the environment key `OPENAI_API_KEY` with your API key. You can also try other LLMs for evaluation, check the [llm guide](../howtos/customisations/llms.ipynb) to learn more
+:::
 
 ## Evaluation
 
@@ -91,13 +85,12 @@ Running the evaluation is as simple as calling evaluate on the `Dataset` with th
 from ragas import evaluate
 
 result = evaluate(
-    fiqa_eval["baseline"].select(range(1)),
+    fiqa_eval["baseline"].select(range(3)),  # selecting only 3
     metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
-       harmfulness,
    ],
 )

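Put together, the updated quickstart amounts to roughly the flow below. This is a minimal sketch assembled from the doc diff above (the dataset, metric names, and the `select(range(3))` call come from the diff; everything else is illustrative).

```python
# Minimal sketch of the updated quickstart flow, assembled from the doc diff above.
import os

from datasets import load_dataset

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# The default metrics still call OpenAI, so the key is needed for this path.
os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Example dataset used as the baseline in the docs.
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")

result = evaluate(
    fiqa_eval["baseline"].select(range(3)),  # selecting only 3 rows, as in the docs
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)
print(result)
```
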
docs/howtos/customisations/quickstart-azure-openai.ipynb renamed to docs/howtos/customisations/azure-openai.ipynb

Lines changed: 6 additions & 12 deletions
@@ -5,25 +5,19 @@
    "id": "7c249b40",
    "metadata": {},
    "source": [
-    "# Using Azure OpenAI Endpoints\n"
+    "# Using Azure OpenAI\n",
+    "\n",
+    "This tutorial will show you how to use Azure OpenAI endpoints instead of OpenAI endpoints."
    ]
   },
   {
    "cell_type": "markdown",
    "id": "2e63f667",
    "metadata": {},
    "source": [
-    "<p>\n",
-    "    <a href=\"https://colab.research.google.com/github/explodinggradients/ragas/blob/main/docs/quickstart.ipynb\">\n",
-    "        <img alt=\"Open In Colab\" \n",
-    "        align=\"left\"\n",
-    "        src=\"https://colab.research.google.com/assets/colab-badge.svg\">\n",
-    "    </a>\n",
-    "    <br>\n",
-    "</p>\n",
-    "\n",
-    "\n",
-    "> **Note:** this guide is for folks who are using the Azure OpenAI endpoints. Check the [quickstart guide](../../getstarted/evaluation.md) if your using OpenAI endpoints."
+    ":::{Note}\n",
+    "this guide is for folks who are using the Azure OpenAI endpoints. Check the [evaluation guide](../../getstarted/evaluation.md) if your using OpenAI endpoints.\n",
+    ":::"
    ]
   },
   {

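The notebook diff above only retitles the guide. As a rough, hypothetical illustration of the pattern it documents (plugging an Azure-hosted chat model into the metrics through Langchain), a configuration could look like the sketch below; the deployment name, endpoint, and API version are placeholders, not values from the notebook.

```python
# Hypothetical sketch only; deployment, endpoint, and API version are placeholders.
import os

from langchain.chat_models import AzureChatOpenAI

from ragas.metrics import faithfulness

azure_model = AzureChatOpenAI(
    deployment_name="your-deployment-name",                     # placeholder
    openai_api_base="https://your-resource.openai.azure.com/",  # placeholder
    openai_api_version="2023-05-15",                            # placeholder
    openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
)

# Same wiring pattern the llms notebook in this commit uses for custom models.
faithfulness.llm.langchain_llm = azure_model
```
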
docs/howtos/customisations/index.md

Lines changed: 1 addition & 1 deletion
@@ -4,5 +4,5 @@ How to customize Ragas for your needs
 
 :::{toctree}
 llms.ipynb
-quickstart-azure-openai.ipynb
+azure-openai.ipynb
 :::

docs/howtos/customisations/llms.ipynb

Lines changed: 142 additions & 9 deletions
@@ -12,17 +12,25 @@
     "- [Completion LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.llms)\n",
     "- [Chat based LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.chat_models)\n",
     "\n",
-    "This guide will show you how to use another or LLM API for evaluation.\n",
-    "\n",
-    "> **Note**: If your looking to use Azure OpenAI for evaluation checkout [this guide](./quickstart-azure-openai.ipynb)"
+    "This guide will show you how to use another or LLM API for evaluation."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "43b57fcd-5f3f-4dc5-9ba1-c3b152c501cc",
+   "metadata": {},
+   "source": [
+    ":::{Note}\n",
+    "If your looking to use Azure OpenAI for evaluation checkout [this guide](./azure-openai.ipynb)\n",
+    ":::"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "55f0f9b9",
    "metadata": {},
    "source": [
-    "### Evaluating with GPT4\n",
+    "## Evaluating with GPT4\n",
     "\n",
     "Ragas uses gpt3.5 by default but using gpt4 for evaluation can improve the results so lets use that for the `Faithfulness` metric\n",
     "\n",
@@ -71,7 +79,7 @@
    "source": [
     "from ragas.metrics import faithfulness\n",
     "\n",
-    "faithfulness.llm = gpt4"
+    "faithfulness.llm.langchain_llm = gpt4"
    ]
   },
   {
@@ -100,7 +108,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-      "model_id": "9fb581d4057d4e70a0b70830b2f5f487",
+      "model_id": "6ecc1636c4f84c7292fc9d8675e691c7",
       "version_major": 2,
       "version_minor": 0
      },
@@ -152,13 +160,13 @@
     "name": "stderr",
     "output_type": "stream",
     "text": [
-     "100%|████████████████████████████████████████████████████████████| 2/2 [22:28<00:00, 674.38s/it]\n"
+     "100%|████████████████████████████████████████████████████████████| 1/1 [07:10<00:00, 430.26s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
-      "{'faithfulness': 0.7237}"
+      "{'faithfulness': 0.8867}"
     ]
    },
    "execution_count": 5,
@@ -170,7 +178,132 @@
     "# evaluate\n",
     "from ragas import evaluate\n",
     "\n",
-    "result = evaluate(fiqa_eval[\"baseline\"], metrics=[faithfulness])\n",
+    "result = evaluate(\n",
+    "    fiqa_eval[\"baseline\"].select(range(5)),  # showing only 5 for demonstration \n",
+    "    metrics=[faithfulness]\n",
+    ")\n",
+    "\n",
+    "result"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f490031e-fb73-4170-8762-61cadb4031e6",
+   "metadata": {},
+   "source": [
+    "## Evaluating with Open-Source LLMs\n",
+    "\n",
+    "You can also use any of the Open-Source LLM for evaluating. Ragas support most the the deployment methods like [HuggingFace TGI](https://python.langchain.com/docs/integrations/llms/huggingface_textgen_inference), [Anyscale](https://python.langchain.com/docs/integrations/llms/anyscale), [vLLM](https://python.langchain.com/docs/integrations/llms/vllm) and many [more](https://python.langchain.com/docs/integrations/llms/) through Langchain. \n",
+    "\n",
+    "When it comes to selecting open-source language models, there are some rules of thumb to follow, given that the quality of evaluation metrics depends heavily on the model's quality:\n",
+    "\n",
+    "1. Opt for models with more than 7 billion parameters. This choice ensures a minimum level of quality in the results for ragas metrics. Models like Llama-2 or Mistral can be an excellent starting point.\n",
+    "2. Always prioritize finetuned models over base models. Finetuned models tend to follow instructions more effectively, which can significantly improve their performance.\n",
+    "3. If your project focuses on a specific domain, such as science or finance, prioritize models that have been pre-trained on a larger volume of tokens from your domain of interest. For instance, if you are working with research data, consider models pre-trained on a substantial number of tokens from platforms like arXiv or Semantic Scholar.\n",
+    "\n",
+    ":::{note}\n",
+    "Choosing the right Open-Source LLM for evaluation can by tricky. You can also fine-tune these models to get even better performance on Ragas meterics. If you need some help/advice on that feel free to [talk to us](https://calendly.com/shahules/30min)\n",
+    ":::\n",
+    "\n",
+    "In this example we are going to use [vLLM](https://github.com/vllm-project/vllm) for hosting a `HuggingFaceH4/zephyr-7b-alpha`. Checkout the [quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) for more details on how to get started with vLLM."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "85e313f2-e45c-4551-ab20-4e526e098740",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# start the vLLM server\n",
+    "!python -m vllm.entrypoints.openai.api_server \\\n",
+    "    --model HuggingFaceH4/zephyr-7b-alpha \\\n",
+    "    --host 0.0.0.0 \\\n",
+    "    --port 8080"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c9ddf74a-9830-4e1a-a4dd-7e5ec17a71e4",
+   "metadata": {},
+   "source": [
+    "Now lets create an Langchain llm instance. Because vLLM can run in OpenAI compatibilitiy mode, we can use the `ChatOpenAI` class like this."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2fd4adf3-db15-4c95-bf7c-407266517214",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.chat_models import ChatOpenAI\n",
+    "\n",
+    "inference_server_url = \"http://localhost:8080/v1\"\n",
+    "\n",
+    "chat = ChatOpenAI(\n",
+    "    model=\"HuggingFaceH4/zephyr-7b-alpha\",\n",
+    "    openai_api_key=\"no-key\",\n",
+    "    openai_api_base=inference_server_url,\n",
+    "    max_tokens=5,\n",
+    "    temperature=0,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2dd7932a-7933-4de8-a6af-2830457e02a0",
+   "metadata": {},
+   "source": [
+    "Now lets import all the metrics you want to use and change the llm."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "20882d05-1b54-4d17-88a0-f7ada2d6a576",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ragas.metrics import (\n",
+    "    context_precision,\n",
+    "    answer_relevancy,\n",
+    "    faithfulness,\n",
+    "    context_recall,\n",
+    ")\n",
+    "from ragas.metrics.critique import harmfulness\n",
+    "\n",
+    "# change the LLM\n",
+    "\n",
+    "faithfulness.llm.langchain_llm = chat\n",
+    "answer_relevancy.llm.langchain_llm = chat\n",
+    "context_precision.llm.langchain_llm = chat\n",
+    "context_recall.llm.langchain_llm = chat\n",
+    "harmfulness.llm.langchain_llm = chat"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "58a610f2-19e5-40ec-bb7d-760c1d608a85",
+   "metadata": {},
+   "source": [
+    "Now you can run the evaluations with and analyse the results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d8858300-7985-4c79-8d03-c671afd645ac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# evaluate\n",
+    "from ragas import evaluate\n",
+    "\n",
+    "result = evaluate(\n",
+    "    fiqa_eval[\"baseline\"].select(range(5)),  # showing only 5 for demonstration \n",
+    "    metrics=[faithfulness]\n",
+    ")\n",
     "\n",
     "result"
    ]

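Condensed into a single script, the notebook cells added above look roughly like the sketch below. It assumes a vLLM server is already serving `HuggingFaceH4/zephyr-7b-alpha` on port 8080, as shown in the notebook.

```python
# Condensed version of the notebook cells added above.
# Assumes the vLLM server is already running, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model HuggingFaceH4/zephyr-7b-alpha --host 0.0.0.0 --port 8080
from datasets import load_dataset
from langchain.chat_models import ChatOpenAI

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)
from ragas.metrics.critique import harmfulness

# vLLM exposes an OpenAI-compatible API, so ChatOpenAI works as the client.
chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",
    openai_api_base="http://localhost:8080/v1",
    max_tokens=5,
    temperature=0,
)

# Point every metric at the self-hosted model.
for metric in (faithfulness, answer_relevancy, context_precision, context_recall, harmfulness):
    metric.llm.langchain_llm = chat

fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
result = evaluate(
    fiqa_eval["baseline"].select(range(5)),  # showing only 5 for demonstration
    metrics=[faithfulness],
)
print(result)
```
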
src/ragas/exceptions.py

Lines changed: 7 additions & 0 deletions
@@ -9,3 +9,10 @@ class RagasException(Exception):
     def __init__(self, message: str):
         self.message = message
         super().__init__(message)
+
+
+class OpenAIKeyNotFound(RagasException):
+    message: str = "OpenAI API key not found! Seems like your trying to use Ragas metrics with OpenAI endpoints. Please set 'OPENAI_API_KEY' environment variable"  # noqa
+
+    def __init__(self):
+        super().__init__(self.message)

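A small usage sketch for the new exception: the hypothetical helper below assumes the key check runs when a metric's `init_model()` is called, which is the pattern the `answer_relevance.py` change further down follows.

```python
# Hypothetical helper showing how callers can surface the new error early.
from ragas.exceptions import OpenAIKeyNotFound


def check_metric_ready(metric) -> None:
    """Fail fast if a metric needs an OpenAI key that is not configured."""
    try:
        # Metrics that depend on OpenAI now validate the key in init_model()
        # instead of requiring OPENAI_API_KEY at import time.
        metric.init_model()
    except OpenAIKeyNotFound as err:
        raise SystemExit(f"Set OPENAI_API_KEY before evaluating: {err}") from err
```
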
src/ragas/metrics/answer_correctness.py

Lines changed: 0 additions & 3 deletions
@@ -51,9 +51,6 @@ def __post_init__(self: t.Self):
         if self.faithfulness is None:
             self.faithfulness = Faithfulness(llm=self.llm, batch_size=self.batch_size)
 
-    def init_model(self: t.Self):
-        pass
-
     def _score_batch(
         self: t.Self,
         dataset: Dataset,

src/ragas/metrics/answer_relevance.py

Lines changed: 11 additions & 9 deletions
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+import os
 import typing as t
 from dataclasses import dataclass
 
@@ -10,8 +11,8 @@
 from langchain.embeddings.base import Embeddings
 from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
 
+from ragas.exceptions import OpenAIKeyNotFound
 from ragas.metrics.base import EvaluationMode, MetricWithLLM
-from ragas.metrics.llms import generate
 
 if t.TYPE_CHECKING:
     from langchain.callbacks.manager import CallbackManager
@@ -57,13 +58,16 @@ class AnswerRelevancy(MetricWithLLM):
     embeddings: Embeddings | None = None
 
     def __post_init__(self: t.Self):
-        self.temperature = 0.2 if self.strictness > 0 else 0
-
         if self.embeddings is None:
-            self.embeddings = OpenAIEmbeddings()  # type: ignore
+            oai_key = os.getenv("OPENAI_API_KEY", "no-key")
+            self.embeddings = OpenAIEmbeddings(openai_api_key=oai_key)  # type: ignore
+
+    def init_model(self):
+        super().init_model()
 
-    def init_model(self: t.Self):
-        pass
+        if isinstance(self.embeddings, OpenAIEmbeddings):
+            if self.embeddings.openai_api_key == "no-key":
+                raise OpenAIKeyNotFound
 
     def _score_batch(
         self: t.Self,
@@ -80,11 +84,9 @@ def _score_batch(
             human_prompt = QUESTION_GEN.format(answer=ans)
             prompts.append(ChatPromptTemplate.from_messages([human_prompt]))
 
-        results = generate(
+        results = self.llm.generate(
             prompts,
-            self.llm,
             n=self.strictness,
-            temperature=self.temperature,
             callbacks=batch_group,
         )
         results = [[i.text for i in r] for r in results.generations]

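The change above boils down to a deferred key check: read `OPENAI_API_KEY` with a "no-key" placeholder when the metric is constructed and only enforce it in `init_model()`. The "multiple n" half of the commit also shows up here, since `_score_batch` now calls `self.llm.generate(prompts, n=self.strictness, callbacks=batch_group)` so each prompt returns `strictness` completions. Below is a standalone sketch of the deferred-check pattern; the class and names are illustrative, not ragas internals.

```python
# Illustrative stand-in for the deferred key check introduced above.
import os


class LazyOpenAIConfig:
    """Read the key at construction time but only enforce it when the model is initialised."""

    def __init__(self) -> None:
        self.openai_api_key = os.getenv("OPENAI_API_KEY", "no-key")

    def init_model(self) -> None:
        if self.openai_api_key == "no-key":
            raise RuntimeError(
                "OpenAI API key not found! Please set the 'OPENAI_API_KEY' environment variable"
            )


# Constructing no longer requires the key; only init_model() does.
config = LazyOpenAIConfig()
```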