|
12 | 12 | "- [Completion LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.llms)\n", |
13 | 13 | "- [Chat based LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.chat_models)\n", |
14 | 14 | "\n", |
15 | | - "This guide will show you how to use another or LLM API for evaluation.\n", |
16 | | - "\n", |
17 | | - "> **Note**: If your looking to use Azure OpenAI for evaluation checkout [this guide](./quickstart-azure-openai.ipynb)" |
| 15 | + "This guide will show you how to use another or LLM API for evaluation." |
| 16 | + ] |
| 17 | + }, |
| 18 | + { |
| 19 | + "cell_type": "markdown", |
| 20 | + "id": "43b57fcd-5f3f-4dc5-9ba1-c3b152c501cc", |
| 21 | + "metadata": {}, |
| 22 | + "source": [ |
| 23 | + ":::{Note}\n", |
| 24 | + "If your looking to use Azure OpenAI for evaluation checkout [this guide](./azure-openai.ipynb)\n", |
| 25 | + ":::" |
18 | 26 | ] |
19 | 27 | }, |
20 | 28 | { |
21 | 29 | "cell_type": "markdown", |
22 | 30 | "id": "55f0f9b9", |
23 | 31 | "metadata": {}, |
24 | 32 | "source": [ |
25 | | - "### Evaluating with GPT4\n", |
| 33 | + "## Evaluating with GPT4\n", |
26 | 34 | "\n", |
27 | 35 | "Ragas uses gpt3.5 by default but using gpt4 for evaluation can improve the results so lets use that for the `Faithfulness` metric\n", |
28 | 36 | "\n", |
|
71 | 79 | "source": [ |
72 | 80 | "from ragas.metrics import faithfulness\n", |
73 | 81 | "\n", |
74 | | - "faithfulness.llm = gpt4" |
| 82 | + "faithfulness.llm.langchain_llm = gpt4" |
75 | 83 | ] |
76 | 84 | }, |
77 | 85 | { |
|
100 | 108 | { |
101 | 109 | "data": { |
102 | 110 | "application/vnd.jupyter.widget-view+json": { |
103 | | - "model_id": "9fb581d4057d4e70a0b70830b2f5f487", |
| 111 | + "model_id": "6ecc1636c4f84c7292fc9d8675e691c7", |
104 | 112 | "version_major": 2, |
105 | 113 | "version_minor": 0 |
106 | 114 | }, |
|
152 | 160 | "name": "stderr", |
153 | 161 | "output_type": "stream", |
154 | 162 | "text": [ |
155 | | - "100%|████████████████████████████████████████████████████████████| 2/2 [22:28<00:00, 674.38s/it]\n" |
| 163 | + "100%|████████████████████████████████████████████████████████████| 1/1 [07:10<00:00, 430.26s/it]\n" |
156 | 164 | ] |
157 | 165 | }, |
158 | 166 | { |
159 | 167 | "data": { |
160 | 168 | "text/plain": [ |
161 | | - "{'faithfulness': 0.7237}" |
| 169 | + "{'faithfulness': 0.8867}" |
162 | 170 | ] |
163 | 171 | }, |
164 | 172 | "execution_count": 5, |
|
170 | 178 | "# evaluate\n", |
171 | 179 | "from ragas import evaluate\n", |
172 | 180 | "\n", |
173 | | - "result = evaluate(fiqa_eval[\"baseline\"], metrics=[faithfulness])\n", |
| 181 | + "result = evaluate(\n", |
| 182 | + " fiqa_eval[\"baseline\"].select(range(5)), # showing only 5 for demonstration \n", |
| 183 | + " metrics=[faithfulness]\n", |
| 184 | + ")\n", |
| 185 | + "\n", |
| 186 | + "result" |
| 187 | + ] |
| 188 | + }, |
| 189 | + { |
| 190 | + "cell_type": "markdown", |
| 191 | + "id": "f490031e-fb73-4170-8762-61cadb4031e6", |
| 192 | + "metadata": {}, |
| 193 | + "source": [ |
| 194 | + "## Evaluating with Open-Source LLMs\n", |
| 195 | + "\n", |
| 196 | + "You can also use any of the Open-Source LLM for evaluating. Ragas support most the the deployment methods like [HuggingFace TGI](https://python.langchain.com/docs/integrations/llms/huggingface_textgen_inference), [Anyscale](https://python.langchain.com/docs/integrations/llms/anyscale), [vLLM](https://python.langchain.com/docs/integrations/llms/vllm) and many [more](https://python.langchain.com/docs/integrations/llms/) through Langchain. \n", |
| 197 | + "\n", |
| 198 | + "When it comes to selecting open-source language models, there are some rules of thumb to follow, given that the quality of evaluation metrics depends heavily on the model's quality:\n", |
| 199 | + "\n", |
| 200 | + "1. Opt for models with more than 7 billion parameters. This choice ensures a minimum level of quality in the results for ragas metrics. Models like Llama-2 or Mistral can be an excellent starting point.\n", |
| 201 | + "2. Always prioritize finetuned models over base models. Finetuned models tend to follow instructions more effectively, which can significantly improve their performance.\n", |
| 202 | + "3. If your project focuses on a specific domain, such as science or finance, prioritize models that have been pre-trained on a larger volume of tokens from your domain of interest. For instance, if you are working with research data, consider models pre-trained on a substantial number of tokens from platforms like arXiv or Semantic Scholar.\n", |
| 203 | + "\n", |
| 204 | + ":::{note}\n", |
| 205 | + "Choosing the right Open-Source LLM for evaluation can by tricky. You can also fine-tune these models to get even better performance on Ragas meterics. If you need some help/advice on that feel free to [talk to us](https://calendly.com/shahules/30min)\n", |
| 206 | + ":::\n", |
| 207 | + "\n", |
| 208 | + "In this example we are going to use [vLLM](https://github.com/vllm-project/vllm) for hosting a `HuggingFaceH4/zephyr-7b-alpha`. Checkout the [quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) for more details on how to get started with vLLM." |
| 209 | + ] |
| 210 | + }, |
| 211 | + { |
| 212 | + "cell_type": "code", |
| 213 | + "execution_count": null, |
| 214 | + "id": "85e313f2-e45c-4551-ab20-4e526e098740", |
| 215 | + "metadata": {}, |
| 216 | + "outputs": [], |
| 217 | + "source": [ |
| 218 | + "# start the vLLM server\n", |
| 219 | + "!python -m vllm.entrypoints.openai.api_server \\\n", |
| 220 | + " --model HuggingFaceH4/zephyr-7b-alpha \\\n", |
| 221 | + " --host 0.0.0.0 \\\n", |
| 222 | + " --port 8080" |
| 223 | + ] |
| 224 | + }, |
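| | + {
| | + "cell_type": "markdown",
| | + "id": "9a1b2c3d-4e5f-4a6b-8c7d-0e1f2a3b4c5d",
| | + "metadata": {},
| | + "source": [
| | + ":::{note}\n",
| | + "The command above blocks the cell while the server is running, so you may prefer to launch it in a separate terminal or background process before continuing with the notebook.\n",
| | + ":::"
| | + ]
| | + },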
| 225 | + { |
| 226 | + "cell_type": "markdown", |
| 227 | + "id": "c9ddf74a-9830-4e1a-a4dd-7e5ec17a71e4", |
| 228 | + "metadata": {}, |
| 229 | + "source": [ |
| 230 | + "Now lets create an Langchain llm instance. Because vLLM can run in OpenAI compatibilitiy mode, we can use the `ChatOpenAI` class like this." |
| 231 | + ] |
| 232 | + }, |
| 233 | + { |
| 234 | + "cell_type": "code", |
| 235 | + "execution_count": null, |
| 236 | + "id": "2fd4adf3-db15-4c95-bf7c-407266517214", |
| 237 | + "metadata": {}, |
| 238 | + "outputs": [], |
| 239 | + "source": [ |
| 240 | + "from langchain.chat_models import ChatOpenAI\n", |
| 241 | + "\n", |
| 242 | + "inference_server_url = \"http://localhost:8080/v1\"\n", |
| 243 | + "\n", |
| 244 | + "chat = ChatOpenAI(\n", |
| 245 | + " model=\"HuggingFaceH4/zephyr-7b-alpha\",\n", |
| 246 | + " openai_api_key=\"no-key\",\n", |
| 247 | + " openai_api_base=inference_server_url,\n", |
| 248 | + " max_tokens=5,\n", |
| 249 | + " temperature=0,\n", |
| 250 | + ")" |
| 251 | + ] |
| 252 | + }, |
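| | + {
| | + "cell_type": "markdown",
| | + "id": "5e6f7a8b-9c0d-4e1f-a2b3-c4d5e6f7a8b9",
| | + "metadata": {},
| | + "source": [
| | + "Before pointing Ragas at the model, it can help to sanity-check that the endpoint responds. A minimal sketch (assuming the vLLM server started above is reachable):"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "id": "1f2e3d4c-5b6a-4978-8695-a4b3c2d1e0f9",
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "from langchain.schema import HumanMessage\n",
| | + "\n",
| | + "# one round-trip through the OpenAI-compatible endpoint\n",
| | + "chat([HumanMessage(content=\"Hello!\")])"
| | + ]
| | + },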
| 253 | + { |
| 254 | + "cell_type": "markdown", |
| 255 | + "id": "2dd7932a-7933-4de8-a6af-2830457e02a0", |
| 256 | + "metadata": {}, |
| 257 | + "source": [ |
| 258 | + "Now lets import all the metrics you want to use and change the llm." |
| 259 | + ] |
| 260 | + }, |
| 261 | + { |
| 262 | + "cell_type": "code", |
| 263 | + "execution_count": null, |
| 264 | + "id": "20882d05-1b54-4d17-88a0-f7ada2d6a576", |
| 265 | + "metadata": {}, |
| 266 | + "outputs": [], |
| 267 | + "source": [ |
| 268 | + "from ragas.metrics import (\n", |
| 269 | + " context_precision,\n", |
| 270 | + " answer_relevancy,\n", |
| 271 | + " faithfulness,\n", |
| 272 | + " context_recall,\n", |
| 273 | + ")\n", |
| 274 | + "from ragas.metrics.critique import harmfulness\n", |
| 275 | + "\n", |
| 276 | + "# change the LLM\n", |
| 277 | + "\n", |
| 278 | + "faithfulness.llm.langchain_llm = chat\n", |
| 279 | + "answer_relevancy.llm.langchain_llm = chat\n", |
| 280 | + "context_precision.llm.langchain_llm = chat\n", |
| 281 | + "context_recall.llm.langchain_llm = chat\n", |
| 282 | + "harmfulness.llm.langchain_llm = chat" |
| 283 | + ] |
| 284 | + }, |
| 285 | + { |
| 286 | + "cell_type": "markdown", |
| 287 | + "id": "58a610f2-19e5-40ec-bb7d-760c1d608a85", |
| 288 | + "metadata": {}, |
| 289 | + "source": [ |
| 290 | + "Now you can run the evaluations with and analyse the results." |
| 291 | + ] |
| 292 | + }, |
| 293 | + { |
| 294 | + "cell_type": "code", |
| 295 | + "execution_count": null, |
| 296 | + "id": "d8858300-7985-4c79-8d03-c671afd645ac", |
| 297 | + "metadata": {}, |
| 298 | + "outputs": [], |
| 299 | + "source": [ |
| 300 | + "# evaluate\n", |
| 301 | + "from ragas import evaluate\n", |
| 302 | + "\n", |
| 303 | + "result = evaluate(\n", |
| 304 | + " fiqa_eval[\"baseline\"].select(range(5)), # showing only 5 for demonstration \n", |
| 305 | + " metrics=[faithfulness]\n", |
| 306 | + ")\n", |
174 | 307 | "\n", |
175 | 308 | "result" |
176 | 309 | ] |
|