|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# Opik by Comet\n", |
| 7 | + "# Comet Opik\n", |
8 | 8 | "\n", |
9 | 9 | "In this notebook, we will showcase how to use Opik with Ragas for monitoring and evaluation of RAG (Retrieval-Augmented Generation) pipelines.\n", |
10 | 10 | "\n", |
|
13 | 13 | "1. Using Ragas metrics to score traces\n", |
14 | 14 | "2. Using the Ragas `evaluate` function to score a dataset\n", |
15 | 15 | "\n", |
| 16 | + "<center><img src=\"https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/opik-project-dashboard.png\" alt=\"Comet Opik project dashboard screenshot with list of traces and spans\" width=\"600\" style=\"border: 0.5px solid #ddd;\"/></center>\n", |
| 17 | + "\n", |
16 | 18 | "## Setup\n", |
17 | 19 | "\n", |
18 | | - "[Comet](https://www.comet.com/site?utm_medium=github&utm_source=ragas&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm&utm_medium=github&utm_source=ragas&utm_campaign=opik) and grab you API Key.\n", |
| 20 | + "[Comet](https://www.comet.com/site?utm_medium=docs&utm_source=ragas&utm_campaign=opik) provides a hosted version of the Opik platform, [simply create an account](https://www.comet.com/signup?from=llm&utm_medium=docs&utm_source=ragas&utm_campaign=opik) and grab you API Key.\n", |
19 | 21 | "\n", |
20 | | - "> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/self_hosting_opik?utm_medium=github&utm_source=ragas&utm_campaign=opik/) for more information." |
| 22 | + "> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/self_hosting_opik?utm_medium=docs&utm_source=ragas&utm_campaign=opik/) for more information." |
21 | 23 | ] |
22 | 24 | }, |
23 | 25 | { |
24 | 26 | "cell_type": "code", |
25 | | - "execution_count": null, |
| 27 | + "execution_count": 1, |
26 | 28 | "metadata": {}, |
27 | 29 | "outputs": [], |
28 | 30 | "source": [ |
|
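The elided setup cells above install and configure the dependencies. If you are reproducing the notebook, a minimal credential setup might look like the sketch below; the `OPIK_API_KEY` environment variable name is an assumption about the Opik SDK, and the OpenAI-backed Ragas metrics additionally need `OPENAI_API_KEY`, so verify both against your installed versions.

```python
import os
import getpass

# Assumption: the Opik SDK reads its API key from OPIK_API_KEY, and the
# default OpenAI-backed Ragas metrics read OPENAI_API_KEY.
for var in ("OPIK_API_KEY", "OPENAI_API_KEY"):
    if var not in os.environ:
        os.environ[var] = getpass.getpass(f"{var}: ")
```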
44 | 46 | }, |
45 | 47 | { |
46 | 48 | "cell_type": "code", |
47 | | - "execution_count": 1, |
| 49 | + "execution_count": 2, |
48 | 50 | "metadata": {}, |
49 | 51 | "outputs": [], |
50 | 52 | "source": [ |
|
63 | 65 | }, |
64 | 66 | { |
65 | 67 | "cell_type": "code", |
66 | | - "execution_count": 1, |
| 68 | + "execution_count": 3, |
67 | 69 | "metadata": {}, |
68 | 70 | "outputs": [], |
69 | 71 | "source": [ |
|
97 | 99 | }, |
98 | 100 | { |
99 | 101 | "cell_type": "code", |
100 | | - "execution_count": 2, |
| 102 | + "execution_count": 4, |
101 | 103 | "metadata": {}, |
102 | 104 | "outputs": [], |
103 | 105 | "source": [ |
|
126 | 128 | }, |
127 | 129 | { |
128 | 130 | "cell_type": "code", |
129 | | - "execution_count": 3, |
| 131 | + "execution_count": 5, |
130 | 132 | "metadata": {}, |
131 | 133 | "outputs": [], |
132 | 134 | "source": [ |
|
138 | 140 | }, |
139 | 141 | { |
140 | 142 | "cell_type": "code", |
141 | | - "execution_count": 4, |
| 143 | + "execution_count": 6, |
142 | 144 | "metadata": {}, |
143 | 145 | "outputs": [ |
144 | 146 | { |
145 | 147 | "name": "stdout", |
146 | 148 | "output_type": "stream", |
147 | 149 | "text": [ |
148 | | - "Answer Relevancy score: 0.9616931041269692\n" |
| 150 | + "Answer Relevancy score: 1.0\n" |
149 | 151 | ] |
150 | 152 | } |
151 | 153 | ], |
152 | 154 | "source": [ |
153 | 155 | "import asyncio\n", |
154 | 156 | "from ragas.integrations.opik import OpikTracer\n", |
| 157 | + "from ragas.dataset_schema import SingleTurnSample\n", |
155 | 158 | "\n", |
156 | 159 | "\n", |
157 | 160 | "# Define the scoring function\n", |
158 | | - "def compute_metric(opik_tracer, metric, row):\n", |
| 161 | + "def compute_metric(metric, row):\n", |
| 162 | + " row = SingleTurnSample(**row)\n", |
| 163 | + "\n", |
| 164 | + " opik_tracer = OpikTracer()\n", |
| 165 | + "\n", |
159 | 166 | " async def get_score(opik_tracer, metric, row):\n", |
160 | | - " score = await metric.ascore(row, callbacks=[opik_tracer])\n", |
| 167 | + " score = await metric.single_turn_ascore(row, callbacks=[OpikTracer()])\n", |
161 | 168 | " return score\n", |
162 | 169 | "\n", |
163 | 170 | " # Run the async function using the current event loop\n", |
164 | 171 | " loop = asyncio.get_event_loop()\n", |
| 172 | + "\n", |
165 | 173 | " result = loop.run_until_complete(get_score(opik_tracer, metric, row))\n", |
166 | 174 | " return result\n", |
167 | 175 | "\n", |
168 | 176 | "\n", |
169 | 177 | "# Score a simple example\n", |
170 | 178 | "row = {\n", |
171 | | - " \"question\": \"What is the capital of France?\",\n", |
172 | | - " \"answer\": \"Paris\",\n", |
173 | | - " \"contexts\": [\"Paris is the capital of France.\", \"Paris is in France.\"],\n", |
| 179 | + " \"user_input\": \"What is the capital of France?\",\n", |
| 180 | + " \"response\": \"Paris\",\n", |
| 181 | + " \"retrieved_contexts\": [\"Paris is the capital of France.\", \"Paris is in France.\"],\n", |
174 | 182 | "}\n", |
175 | 183 | "\n", |
176 | | - "opik_tracer = OpikTracer()\n", |
177 | | - "score = compute_metric(opik_tracer, answer_relevancy_metric, row)\n", |
| 184 | + "score = compute_metric(answer_relevancy_metric, row)\n", |
178 | 185 | "print(\"Answer Relevancy score:\", score)" |
179 | 186 | ] |
180 | 187 | }, |
|
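A caveat on the cell above: inside Jupyter, `loop.run_until_complete(...)` can raise `RuntimeError: This event loop is already running`, because the notebook kernel already owns a running loop. A common workaround, assuming you can install the third-party `nest_asyncio` package, is to patch the loop once before scoring:

```python
# Jupyter kernels already run an asyncio event loop, so calling
# run_until_complete() on it raises a RuntimeError. nest_asyncio patches
# the loop to tolerate re-entrant use (pip install nest_asyncio).
import nest_asyncio

nest_asyncio.apply()
```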
186 | 193 | "\n", |
187 | 194 | "#### Score traces\n", |
188 | 195 | "\n", |
189 | | - "You can score traces by using the `get_current_trace` function to get the current trace and then calling the `log_feedback_score` function.\n", |
| 196 | + "You can score traces by using the `update_current_trace` function to get the current trace and passing the feedback scores to that function.\n", |
190 | 197 | "\n", |
191 | 198 | "The advantage of this approach is that the scoring span is added to the trace allowing for a more fine-grained analysis of the RAG pipeline. It will however run the Ragas metric calculation synchronously and so might not be suitable for production use-cases." |
192 | 199 | ] |
193 | 200 | }, |
194 | 201 | { |
195 | 202 | "cell_type": "code", |
196 | | - "execution_count": 5, |
| 203 | + "execution_count": 7, |
197 | 204 | "metadata": {}, |
198 | 205 | "outputs": [ |
199 | 206 | { |
|
202 | 209 | "'Paris'" |
203 | 210 | ] |
204 | 211 | }, |
205 | | - "execution_count": 5, |
| 212 | + "execution_count": 7, |
206 | 213 | "metadata": {}, |
207 | 214 | "output_type": "execute_result" |
208 | 215 | } |
209 | 216 | ], |
210 | 217 | "source": [ |
211 | 218 | "from opik import track\n", |
212 | | - "from opik.opik_context import get_current_trace\n", |
| 219 | + "from opik.opik_context import update_current_trace\n", |
213 | 220 | "\n", |
214 | 221 | "\n", |
215 | 222 | "@track\n", |
|
227 | 234 | "@track(name=\"Compute Ragas metric score\", capture_input=False)\n", |
228 | 235 | "def compute_rag_score(answer_relevancy_metric, question, answer, contexts):\n", |
229 | 236 | " # Define the score function\n", |
230 | | - " row = {\"question\": question, \"answer\": answer, \"contexts\": contexts}\n", |
| 237 | + " row = {\"user_input\": question, \"response\": answer, \"retrieved_contexts\": contexts}\n", |
231 | 238 | " score = compute_metric(answer_relevancy_metric, row)\n", |
232 | 239 | " return score\n", |
233 | 240 | "\n", |
|
238 | 245 | " contexts = retrieve_contexts(question)\n", |
239 | 246 | " answer = answer_question(question, contexts)\n", |
240 | 247 | "\n", |
241 | | - " trace = get_current_trace()\n", |
242 | 248 | " score = compute_rag_score(answer_relevancy_metric, question, answer, contexts)\n", |
243 | | - " trace.log_feedback_score(\"answer_relevancy\", round(score, 4), category_name=\"ragas\")\n", |
| 249 | + " update_current_trace(\n", |
| 250 | + " feedback_scores=[{\"name\": \"answer_relevancy\", \"value\": round(score, 4)}]\n", |
| 251 | + " )\n", |
244 | 252 | "\n", |
245 | 253 | " return answer\n", |
246 | 254 | "\n", |
|
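Since the synchronous metric call above sits on the request path, one option for production is to compute the score on a worker thread and let the `OpikTracer` callback log the evaluation as its own trace, as in the first approach. The following is only a sketch under that assumption, not something the notebook itself does; the score is not attached to the original RAG trace here.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from ragas.dataset_schema import SingleTurnSample
from ragas.integrations.opik import OpikTracer

executor = ThreadPoolExecutor(max_workers=2)


def score_in_background(metric, row):
    # Worker threads have no running event loop, so asyncio.run() is safe here.
    sample = SingleTurnSample(**row)
    coro = metric.single_turn_ascore(sample, callbacks=[OpikTracer()])
    return executor.submit(asyncio.run, coro)


future = score_in_background(
    answer_relevancy_metric,
    {
        "user_input": "What is the capital of France?",
        "response": "Paris",
        "retrieved_contexts": ["Paris is the capital of France.", "Paris is in France."],
    },
)
print("Answer Relevancy (non-blocking):", future.result())
```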
261 | 269 | }, |
262 | 270 | { |
263 | 271 | "cell_type": "code", |
264 | | - "execution_count": 6, |
| 272 | + "execution_count": 8, |
265 | 273 | "metadata": {}, |
266 | 274 | "outputs": [ |
267 | | - { |
268 | | - "name": "stderr", |
269 | | - "output_type": "stream", |
270 | | - "text": [ |
271 | | - "passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`\n" |
272 | | - ] |
273 | | - }, |
274 | 275 | { |
275 | 276 | "data": { |
276 | 277 | "application/vnd.jupyter.widget-view+json": { |
277 | | - "model_id": "985d2e27ce8a48daad673666e6e6e953", |
| 278 | + "model_id": "07abcf96a39b4fd183756d5dc3b617c9", |
278 | 279 | "version_major": 2, |
279 | 280 | "version_minor": 0 |
280 | 281 | }, |
281 | 282 | "text/plain": [ |
282 | | - "Evaluating: 0%| | 0/9 [00:00<?, ?it/s]" |
| 283 | + "Evaluating: 0%| | 0/6 [00:00<?, ?it/s]" |
283 | 284 | ] |
284 | 285 | }, |
285 | 286 | "metadata": {}, |
|
289 | 290 | "name": "stdout", |
290 | 291 | "output_type": "stream", |
291 | 292 | "text": [ |
292 | | - "{'context_precision': 1.0000, 'faithfulness': 0.8250, 'answer_relevancy': 0.9755}\n" |
| 293 | + "{'context_precision': 1.0000, 'faithfulness': 0.7375, 'answer_relevancy': 0.9889}\n" |
293 | 294 | ] |
294 | 295 | } |
295 | 296 | ], |
|
301 | 302 | "\n", |
302 | 303 | "fiqa_eval = load_dataset(\"explodinggradients/fiqa\", \"ragas_eval\")\n", |
303 | 304 | "\n", |
| 305 | + "# Reformat the dataset to match the schema expected by the Ragas evaluate function\n", |
| 306 | + "dataset = fiqa_eval[\"baseline\"].select(range(3))\n", |
| 307 | + "\n", |
| 308 | + "dataset = dataset.map(\n", |
| 309 | + " lambda x: {\n", |
| 310 | + " \"user_input\": x[\"question\"],\n", |
| 311 | + " \"reference\": x[\"ground_truths\"][0],\n", |
| 312 | + " \"retrieved_contexts\": x[\"contexts\"],\n", |
| 313 | + " }\n", |
| 314 | + ")\n", |
| 315 | + "\n", |
304 | 316 | "opik_tracer_eval = OpikTracer(tags=[\"ragas_eval\"], metadata={\"evaluation_run\": True})\n", |
305 | 317 | "\n", |
306 | 318 | "result = evaluate(\n", |
307 | | - " fiqa_eval[\"baseline\"].select(range(3)),\n", |
| 319 | + " dataset,\n", |
308 | 320 | " metrics=[context_precision, faithfulness, answer_relevancy],\n", |
309 | 321 | " callbacks=[opik_tracer_eval],\n", |
310 | 322 | ")\n", |
|
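To see per-question scores rather than only the aggregate dictionary printed above, the object returned by `evaluate` can be converted to a pandas DataFrame via its `to_pandas()` helper (present in recent Ragas versions; the exact column names follow the sample schema, so check them in your installed version):

```python
# Per-row inputs and metric scores; handy for spotting which of the three
# samples dragged the aggregate faithfulness down.
df = result.to_pandas()
print(df[["user_input", "context_precision", "faithfulness", "answer_relevancy"]])
```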