diff --git a/third_party/Phoenix/README.md b/third_party/Phoenix/README.md index ccd6956b..f1646006 100644 --- a/third_party/Phoenix/README.md +++ b/third_party/Phoenix/README.md @@ -6,7 +6,7 @@ | GitHub | - Community + Community

@@ -26,8 +26,9 @@ Phoenix runs practically anywhere, including your Jupyter notebook, local machin The latest Phoenix + Mistral AI docs can be found [here](https://docs.arize.com/phoenix/tracing/integrations-tracing/mistralai). ## Examples -- [Tracing a Mistral AI application](./examples/arize_phoenix_tracing.ipynb) -- [Evaluating a Mistral RAG pipeline](./examples/arize_phoenix_evaluate_rag.ipynb) +- [Tracing a Mistral AI application](third_party/Phoenix/arize_phoenix_tracing.ipynb) +- [Evaluating a Mistral RAG pipeline](third_party/Phoenix/arize_phoenix_evaluate_rag.ipynb) +- [Evaluating a Python agent workflow](third_party/Phoenix/analytical_agent_workflow.ipynb) ## See it in action @@ -35,5 +36,5 @@ The latest Phoenix + Mistral AI docs can be found [here](https://docs.arize.com/ ## Other Resources -- 🀝 [Join the Phoenix community](https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q) +- 🀝 [Join the Phoenix community](https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg#/shared-invite/email) - πŸ› [Submit an issue or feature request](https://github.com/Arize-ai/phoenix/issues) \ No newline at end of file diff --git a/third_party/Phoenix/analytical_agent_workflow.ipynb b/third_party/Phoenix/analytical_agent_workflow.ipynb new file mode 100644 index 00000000..c96709a9 --- /dev/null +++ b/third_party/Phoenix/analytical_agent_workflow.ipynb @@ -0,0 +1,866 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "JYV9wLKiS6Dl" + }, + "source": [ + "# Data analysis multi-agent workflow\n", + "\n", + "![image info](../../images/agent_demo2.png)\n", + "\n", + "You can also use multiple agents in a workflow. Here is an example:\n", + "\n", + "1. Data Analysis Planning:\n", + "\n", + " The planning agent writes a comprehensive data analysis plan, outlining the steps required to analyze the data.\n", + "\n", + "2. Code Generation and Execution:\n", + "\n", + " For each step in the analysis plan, the Python agent generates the corresponding code.\n", + "The Python agent then executes the generated code to perform the specified analysis.\n", + "\n", + "3. Analysis Report Summarization:\n", + "\n", + " Based on the results of the executed code, the summarization agent writes an analysis report.\n", + "The report summarizes the findings and insights derived from the data analysis.\n", + "\n", + "\n", + "## Install dependencies\n", + "\n", + "First, we will install the Python SDK and set our API key."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "idrU7lh3jJSG", + "outputId": "34b9da8e-6ed8-4912-c6df-bfef0e6e7a9c" + }, + "outputs": [], + "source": [ + "!pip install mistralai==1.0.0" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "YyLxYRmKjKQf" + }, + "outputs": [], + "source": [ + "import os\n", + "from mistralai import Mistral\n", + "import re\n", + "from getpass import getpass\n", + "\n", + "if not (api_key := os.getenv(\"MISTRAL_API_KEY\")):\n", + " api_key = getpass(\"πŸ”‘ Enter your Mistral API key: \")\n", + "os.environ[\"MISTRAL_API_KEY\"] = api_key\n", + "\n", + "client = Mistral(api_key=api_key)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V2dVirKpjaID" + }, + "source": [ + "## Agents\n", + "You can create an agent at https://console.mistral.ai/build/agents/new. For this notebook, we will use mistral-large-2407 as the model powering our agents.\n", + "\n", + "Here are the instructions provided to the agents we created:\n", + "\n", + "### Planning agent:\n", + "\n", + "```\n", + "You are a data analytical planning assistant. Given a dataset and its description,\n", + "your task is to provide specific and simple analysis plans, detailed instructions,\n", + "and suggested Python code that can later be given to a separate Python agent to generate\n", + "the Python code for executing the analysis plan.\n", + "Do not create figures.\n", + "\n", + "Return output with the following format:\n", + "\n", + "## Total number of steps:\n", + "\n", + "## Step 1:\n", + "```\n", + "\n", + "### Python agent:\n", + "```\n", + "You are a Python coding assistant that only outputs Python code without any explanations or comments.\n", + "Given an instruction and the suggested Python code, return the correct Python code.\n", + "```\n", + "\n", + "### Summarization agent:\n", + "```\n", + "You are an analysis summarization assistant.\n", + "Given a dataset's description and the analysis results,
provide an analysis report.\n", + "```\n", + "\n", + "### Agent IDs\n", + "Next, we will retrieve the agent IDs from the UI where we created the agents.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "FEl6UMM7kJOU" + }, + "outputs": [], + "source": [ + "planning_agent_id = \"ag:ad73bfd7:20241009:planning-agent:40a0d3e8\"\n", + "summarization_agent_id = \"ag:ad73bfd7:20241009:summarization-agent:3036db8a\"\n", + "python_agent_id = \"ag:ad73bfd7:20240912:python-codegen-agent:0375a7cf\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "csZ3Pw5iTZxU" + }, + "source": [ + "# Analysis Planning" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "FQ2E6s_I5WXe" + }, + "outputs": [], + "source": [ + "def run_analysis_planning_agent(query):\n", + " \"\"\"\n", + " Sends a user query to the planning agent and returns the response.\n", + "\n", + " Args:\n", + " query (str): The user query to be sent to the planning agent.\n", + "\n", + " Returns:\n", + " str: The response content from the planning agent.\n", + " \"\"\"\n", + " print(\"### Run Planning agent\")\n", + " print(f\"User query: {query}\")\n", + " try:\n", + " response = client.agents.complete(\n", + " agent_id= planning_agent_id,\n", + " messages = [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": query\n", + " },\n", + " ]\n", + " )\n", + " result = response.choices[0].message.content\n", + " return result\n", + " except Exception as e:\n", + " print(f\"Request failed: {e}. Please check your request.\")\n", + " return None" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "W3VN-YxmFF1M" + }, + "outputs": [], + "source": [ + "query = \"\"\"\n", + "Load this data: https://raw.githubusercontent.com/fivethirtyeight/data/master/bad-drivers/bad-drivers.csv\n", + "\n", + "The dataset consists of 51 datapoints and has eight columns:\n", + "- State\n", + "- Number of drivers involved in fatal collisions per billion miles\n", + "- Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding\n", + "- Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired\n", + "- Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted\n", + "- Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents\n", + "- Car Insurance Premiums ($)\n", + "- Losses incurred by insurance companies for collisions per insured driver ($)\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "AjUIAUMS5aSY", + "outputId": "5fa82d2f-0b36-461d-a8e8-6fcb77a32d24" + }, + "outputs": [], + "source": [ + "planning_result = run_analysis_planning_agent(query)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4rg10DgI1nQ0", + "outputId": "56b1d835-e591-4cf4-c59c-069f9e70842a" + }, + "outputs": [], + "source": [ + "print(planning_result)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wzMIy7VvTfOv" + }, + "source": [ + "# Generate and execute Python code for each planning step" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "cHaJLA_Yx4sN" + }, + "outputs": [], + "source": [ + "class PythonAgentWorkflow:\n", + " def __init__(self):\n", + " pass\n", + "\n", + " def extract_pattern(self, text, pattern):\n", + " \"\"\"\n", + "
Extracts a pattern from the given text.\n", + "\n", + " Args:\n", + " text (str): The text to search within.\n", + " pattern (str): The regex pattern to search for.\n", + "\n", + " Returns:\n", + " str: The extracted pattern or None if not found.\n", + " \"\"\"\n", + " match = re.search(pattern, text, flags=re.DOTALL)\n", + " if match:\n", + " return match.group(1).strip()\n", + " return None\n", + "\n", + " def extract_step_i(self, planning_result, i, n_step):\n", + " \"\"\"\n", + " Extracts the content of a specific step from the planning result.\n", + "\n", + " Args:\n", + " planning_result (str): The planning result text.\n", + " i (int): The step number to extract.\n", + " n_step (int): The total number of steps.\n", + "\n", + " Returns:\n", + " str: The extracted step content or None if not found.\n", + " \"\"\"\n", + " if i < n_step:\n", + " pattern = rf'## Step {i}:(.*?)## Step {i+1}'\n", + " elif i == n_step:\n", + " pattern = rf'## Step {i}:(.*)'\n", + " else:\n", + " print(f\"Invalid step number {i}. It should be between 1 and {n_step}.\")\n", + " return None\n", + "\n", + " step_i = self.extract_pattern(planning_result, pattern)\n", + " if not step_i:\n", + " print(f\"Failed to extract Step {i} content.\")\n", + " return None\n", + "\n", + " return step_i\n", + "\n", + " def extract_code(self, python_agent_result):\n", + " \"\"\"\n", + " Extracts the Python code block from the response content.\n", + "\n", + " Args:\n", + " python_agent_result (str): The response content from the Python agent.\n", + "\n", + " Returns:\n", + " tuple: A tuple containing the extracted Python code (or None) and a retry flag.\n", + " \"\"\"\n", + " retry = False\n", + " print(\"### Extracting Python code\")\n", + " python_code = self.extract_pattern(python_agent_result, r'```python(.*?)```')\n", + " if not python_code:\n", + " retry = True\n", + " print(\"Python code failed to generate or was in the wrong output format. Setting retry to True.\")\n", + "\n", + " return python_code, retry\n", + "\n", + " def run_python_agent(self, query):\n", + " \"\"\"\n", + " Sends a user query to the Python agent and returns the response.\n", + "\n", + " Args:\n", + " query (str): The user query to be sent to the Python agent.\n", + "\n", + " Returns:\n", + " str: The response content from the Python agent.\n", + " \"\"\"\n", + " print(\"### Run Python agent\")\n", + " print(f\"User query: {query}\")\n", + " try:\n", + " response = client.agents.complete(\n", + " agent_id= python_agent_id,\n", + " messages = [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": query\n", + " },\n", + " ]\n", + " )\n", + " result = response.choices[0].message.content\n", + " return result\n", + "\n", + " except Exception as e:\n", + " print(f\"Request failed: {e}.
Please check your request.\")\n", + " return None\n", + "\n", + " def check_code(self, python_function, state):\n", + " \"\"\"\n", + " Executes the Python function and checks for any errors.\n", + "\n", + " Args:\n", + " python_function (str): The Python function to be executed.\n", + " state (dict): The namespace in which the generated code is executed, shared across steps.\n", + "\n", + " Returns:\n", + " bool: A flag indicating whether the code execution needs to be retried.\n", + "\n", + " Warning:\n", + " This code is designed to run code that’s been generated by a model, which may not be entirely reliable.\n", + " It's strongly recommended to run this in a sandbox environment.\n", + " \"\"\"\n", + " retry = False\n", + " try:\n", + " print(f\"### Python function to run: {python_function}\")\n", + " exec(python_function, state)\n", + " print(\"Code executed successfully.\")\n", + " except Exception as e:\n", + " print(f\"Code failed: {e}\")\n", + " retry = True\n", + " print(\"Setting retry to True\")\n", + " return retry\n", + "\n", + " def process_step(self, planning_result, i, n_step, max_retries, state):\n", + " \"\"\"\n", + " Processes a single step, including retries.\n", + "\n", + " Args:\n", + " planning_result (str): The planning result text.\n", + " i (int): The step number to process.\n", + " n_step (int): The total number of steps.\n", + " max_retries (int): The maximum number of retries.\n", + " state (dict): The namespace in which the generated code is executed.\n", + "\n", + " Returns:\n", + " None\n", + " \"\"\"\n", + "\n", + " retry = True\n", + " j = 0\n", + " while j < max_retries and retry:\n", + " print(f\"TRY # {j}\")\n", + " j += 1\n", + " step_i = self.extract_step_i(planning_result, i, n_step)\n", + " if step_i:\n", + " print(step_i)\n", + " python_agent_result = self.run_python_agent(step_i)\n", + " python_code, retry = self.extract_code(python_agent_result)\n", + " if not retry:\n", + " print(python_code)\n", + " retry = self.check_code(python_code, state)\n", + " return None\n", + "\n", + " def workflow(self, planning_result):\n", + " \"\"\"\n", + " Executes the workflow for processing planning results.\n", + "\n", + " Args:\n", + " planning_result (str): The planning result text.\n", + " \"\"\"\n", + " state = {}\n", + " print(\"### ENTER WORKFLOW\")\n", + " n_step = int(self.extract_pattern(planning_result, r'## Total number of steps:\\s*(\\d+)'))\n", + " for i in range(1, n_step + 1):\n", + " print(f\"STEP # {i}\")\n", + " self.process_step(planning_result, i, n_step, max_retries=2, state=state)\n", + "\n", + "\n", + " print(\"### EXIT WORKFLOW\")\n", + " return None" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "xND0_v3z1wp2", + "outputId": "b433658a-3a10-4230-d6fa-530451a8bead" + }, + "outputs": [], + "source": [ + "import sys\n", + "import io\n", + "\n", + "# See the output of print statements in the console while also capturing it in a variable.\n", + "class Tee(io.StringIO):\n", + " def __init__(self, *args, **kwargs):\n", + " super().__init__(*args, **kwargs)\n", + " self.original_stdout = sys.stdout\n", + "\n", + " def write(self, data):\n", + " self.original_stdout.write(data)\n", + " super().write(data)\n", + "\n", + " def flush(self):\n", + " self.original_stdout.flush()\n", + " super().flush()\n", + "\n", + "# Create an instance of the Tee class\n", + "tee_stream = Tee()\n", + "\n", + "# Redirect stdout to the Tee instance\n", + "sys.stdout = tee_stream\n", + "\n", + "\n", + "Python_agent = PythonAgentWorkflow()\n", + "Python_agent.workflow(planning_result)\n", + "\n", + "# Restore the original
stdout\n", + "sys.stdout = tee_stream.original_stdout\n", + "\n", + "# Get the captured output\n", + "captured_output = tee_stream.getvalue()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gP2y1VHuTlPU" + }, + "source": [ + "# Summarization" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "v6z9ZX7Ck1l6" + }, + "outputs": [], + "source": [ + "response = client.agents.complete(\n", + " agent_id= summarization_agent_id,\n", + " messages = [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": query + captured_output\n", + " },\n", + " ]\n", + ")\n", + "result = response.choices[0].message.content\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "uTJ0WOu5OEYE", + "outputId": "515f673d-9185-401f-cba3-ad60ee8ca75a" + }, + "outputs": [], + "source": [ + "print(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# (Optional) Trace and Evaluate your Agent\n", + "\n", + "Now that your agent is running, you can optionally trace and evaluate it with Arize Phoenix. Phoenix is an open-source framework for tracing and evaluating LLM applications, including agents and RAG pipelines.\n", + "\n", + "Tracing refers to the process of recording the calls made between your application and the LLM. Evaluation can be thought of as the performance testing of your agent. Phoenix provides a UI for you to view traces and evaluations, as well as a suite of evaluation templates.\n", + "\n", + "To start off, create a Phoenix account and get your API key [here](https://phoenix.arize.com)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set up Phoenix" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -q openinference-instrumentation-mistralai arize-phoenix 'arize-phoenix-evals>=0.18.0'" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "if not (api_key := os.getenv(\"PHOENIX_API_KEY\")):\n", + " api_key = getpass(\"πŸ”‘ Enter your Phoenix API key: \")\n", + "os.environ[\"PHOENIX_API_KEY\"] = api_key" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from openinference.instrumentation.mistralai import MistralAIInstrumentor\n", + "from phoenix.otel import register\n", + "import os\n", + "\n", + "# Add Phoenix API Key for tracing\n", + "os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={os.getenv('PHOENIX_API_KEY')}\"\n", + "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n", + "\n", + "# configure the Phoenix tracer\n", + "tracer_provider = register() \n", + "\n", + "# Phoenix provides an openinference package that automatically traces all requests to Mistral\n", + "MistralAIInstrumentor().instrument(tracer_provider=tracer_provider, skip_dep_check=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run your agent" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "planning_result = run_analysis_planning_agent(query)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create an instance of the Tee class\n", + "tee_stream = Tee()\n", + "\n", + "# Redirect stdout to the Tee 
instance\n", + "sys.stdout = tee_stream\n", + "\n", + "\n", + "Python_agent = PythonAgentWorkflow()\n", + "Python_agent.workflow(planning_result)\n", + "\n", + "# Restore the original stdout\n", + "sys.stdout = tee_stream.original_stdout\n", + "\n", + "# Get the captured output\n", + "captured_output = tee_stream.getvalue()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "response = client.agents.complete(\n", + " agent_id= summarization_agent_id,\n", + " messages = [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": query + captured_output\n", + " },\n", + " ]\n", + ")\n", + "result = response.choices[0].message.content" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### View your traces\n", + "You should now be able to view traces in [Phoenix](https://app.phoenix.arize.com)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluate your agent\n", + "\n", + "Now let's evaluate your agent. The flow for batch evaluation is as follows:\n", + "\n", + "1. Export traces from Phoenix\n", + "2. Attach labels to the traces. These can be created using an LLM as a judge, using code-based evaluation, or using a combination of both.\n", + "3. Import the labeled traces into Phoenix." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Export traces from Phoenix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import phoenix as px\n", + "\n", + "spans = px.Client().get_spans_dataframe()\n", + "\n", + "spans.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When it comes to evaluating agents, a good general approach is to break down the steps your agent must complete, and evaluate each step individually.\n", + "\n", + "In this case, we can evaluate:\n", + "1. The code generated by the Python agent\n", + "2. The analysis report written by the summarization agent\n", + "\n", + "We'll evaluate each of these steps individually." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Evaluate Code Generation\n", + "\n", + "Phoenix has a [built-in LLM Judge template that can be used to evaluate Code Generation Agents](https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/code-generation-eval). 
We'll use that template here.\n", + "\n", + "The template requires two columns to be added to the dataframe:\n", + "- output: The code generated by the agent\n", + "- input: The original user query\n", + "\n", + "We already have the input, so we just need to extract the generated code from the `attributes.llm.output_messages` column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "code_gen_spans = spans[spans['attributes.llm.invocation_parameters'].str.contains(python_agent_id)].copy()\n", + "code_gen_spans.loc[:,'input'] = code_gen_spans['attributes.input.value']\n", + "# extract_code returns a (code, retry) tuple, so keep only the extracted code\n", + "code_gen_spans.loc[:,'output'] = code_gen_spans['attributes.llm.output_messages'].apply(lambda x: PythonAgentWorkflow().extract_code(x[0]['message.content'])[0])\n", + "code_gen_spans.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from phoenix.evals import (CODE_READABILITY_PROMPT_TEMPLATE, \n", + " CODE_READABILITY_PROMPT_RAILS_MAP, \n", + " llm_classify, \n", + " MistralAIModel)\n", + "\n", + "import nest_asyncio\n", + "nest_asyncio.apply()\n", + "\n", + "eval_model = MistralAIModel(api_key=os.getenv(\"MISTRAL_API_KEY\"))\n", + "\n", + "code_gen_evals = llm_classify(\n", + " model=eval_model,\n", + " template=CODE_READABILITY_PROMPT_TEMPLATE,\n", + " dataframe=code_gen_spans,\n", + " concurrency=20,\n", + " provide_explanation=True,\n", + " rails=list(CODE_READABILITY_PROMPT_RAILS_MAP.values())\n", + ")\n", + "code_gen_evals['score'] = code_gen_evals['label'].apply(lambda x: 1 if x == \"readable\" else 0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This leaves us with a dataframe of evaluations. We'll add this back into Phoenix later on." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "code_gen_evals.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Evaluate Summarization\n", + "\n", + "We'll evaluate the summarization agent in the same way, this time using a different prebuilt evaluation template."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Filter the spans to only include those from our summarization agent\n", + "summarization_spans = spans[spans['attributes.llm.invocation_parameters'].str.contains(summarization_agent_id)].copy()\n", + "summarization_spans = summarization_spans.assign(\n", + " input=summarization_spans['attributes.input.value'],\n", + " output=summarization_spans['attributes.llm.output_messages']\n", + ")\n", + "summarization_spans.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import phoenix.evals.default_templates as templates\n", + "\n", + "summarization_evals = llm_classify(\n", + " model=eval_model,\n", + " template=templates.SUMMARIZATION_PROMPT_TEMPLATE,\n", + " dataframe=summarization_spans,\n", + " concurrency=20,\n", + " provide_explanation=True,\n", + " rails=list(templates.SUMMARIZATION_PROMPT_RAILS_MAP.values())\n", + ")\n", + "summarization_evals['score'] = summarization_evals['label'].apply(lambda x: 1 if x == \"good\" else 0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "summarization_evals.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Import labeled traces into Phoenix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from phoenix.trace import SpanEvaluations\n", + "\n", + "px.Client().log_evaluations(\n", + " SpanEvaluations(eval_name=\"Code Quality\", dataframe=code_gen_evals),\n", + " SpanEvaluations(eval_name=\"Summarization\", dataframe=summarization_evals)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can now view your evaluations in the Phoenix UI:\n", + "\n", + "![image info](../../third_party/Phoenix/images/phoenix-agent-summarization-eval.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Congratulations! 
You've now successfully evaluated your agent.\n", + "\n", + "Check out [Phoenix](https://phoenix.arize.com) for more!\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/third_party/Phoenix/arize_phoenix_evaluate_rag.ipynb b/third_party/Phoenix/arize_phoenix_evaluate_rag.ipynb index bfaa1623..d4a2f72c 100644 --- a/third_party/Phoenix/arize_phoenix_evaluate_rag.ipynb +++ b/third_party/Phoenix/arize_phoenix_evaluate_rag.ipynb @@ -55,14 +55,14 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ - "!pip install -qq arize-phoenix gcsfs nest_asyncio openinference-instrumentation-llama_index\n", - "!pip install -q llama-index-embeddings-mistralai\n", - "!pip install -q llama-index-llms-mistralai\n", - "!pip install -qq \"mistralai>=1.0.0\"" + "!pip install -qq \"arize-phoenix>=7.2.0\" \"arize-phoenix-evals>=0.18.0\" gcsfs==2024.10.0 nest_asyncio==1.6.0 openinference-instrumentation-llama_index==3.0.4\n", + "!pip install -q llama-index-embeddings-mistralai==0.3.0 llama-index-readers-file==0.4.0\n", + "!pip install -q llama-index-llms-mistralai==0.3.0\n", + "!pip install -qq mistralai==1.2.5" ] }, { @@ -102,7 +102,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -116,16 +116,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, simply start the phoenix application and instrument LlamaIndex." + "During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. Phoenix is a free-to-use open-source platform. You can get an API key from the [Phoenix website](https://phoenix.arize.com/). Alternatively, see here for details on [how to run Phoenix locally](https://docs.arize.com/phoenix/deployment/environments)." 
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ - "session =px.launch_app()" + "if not (api_key := os.getenv(\"PHOENIX_API_KEY\")):\n", + " api_key = getpass(\"πŸ”‘ Enter your Phoenix API key: \")\n", + "os.environ[\"PHOENIX_API_KEY\"] = api_key" ] }, { @@ -141,7 +143,10 @@ "metadata": {}, "outputs": [], "source": [ - "tracer_provider = register()\n", + "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n", + "os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={os.getenv('PHOENIX_API_KEY')}\"\n", + "\n", + "tracer_provider = register(project_name='Mistral-RAG')\n", "LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)" ] }, @@ -161,7 +166,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -178,7 +183,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -205,7 +210,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -214,7 +219,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "metadata": {}, "outputs": [], "source": [ @@ -241,16 +246,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the phoenix application." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"phoenix URL\", session.url)" + "Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the [Phoenix application](https://app.phoenix.arize.com/)." 
] }, { @@ -273,7 +269,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -311,8 +307,8 @@ "source": [ "from phoenix.session.evaluation import get_retrieved_documents\n", "\n", - "retrieved_documents_df = get_retrieved_documents(px.Client())\n", - "retrieved_documents_df" + "retrieved_documents_df = get_retrieved_documents(px.Client(), project_name='Mistral-RAG')\n", + "print(retrieved_documents_df.head())" ] }, { @@ -334,7 +330,9 @@ " run_evals,\n", ")\n", "\n", - "relevance_evaluator = RelevanceEvaluator(MistralAIModel)\n", + "eval_model = MistralAIModel(api_key=os.getenv(\"MISTRAL_API_KEY\"))\n", + "\n", + "relevance_evaluator = RelevanceEvaluator(eval_model)\n", "\n", "retrieved_documents_relevance_df = run_evals(\n", " evaluators=[relevance_evaluator],\n", @@ -369,7 +367,7 @@ "documents_with_relevance_df = pd.concat(\n", " [retrieved_documents_df, retrieved_documents_relevance_df.add_prefix(\"eval_\")], axis=1\n", ")\n", - "documents_with_relevance_df" + "print(documents_with_relevance_df.head())" ] }, { @@ -389,8 +387,8 @@ "source": [ "from phoenix.session.evaluation import get_qa_with_reference\n", "\n", - "qa_with_reference_df = get_qa_with_reference(px.Client())\n", - "qa_with_reference_df" + "qa_with_reference_df = get_qa_with_reference(px.Client(), project_name='Mistral-RAG')\n", + "print(qa_with_reference_df.head())" ] }, { @@ -413,8 +411,8 @@ " run_evals,\n", ")\n", "\n", - "qa_evaluator = QAEvaluator(MistralAIModel())\n", - "hallucination_evaluator = HallucinationEvaluator(MistralAIModel())\n", + "qa_evaluator = QAEvaluator(eval_model)\n", + "hallucination_evaluator = HallucinationEvaluator(eval_model)\n", "\n", "qa_correctness_eval_df, hallucination_eval_df = run_evals(\n", " evaluators=[qa_evaluator, hallucination_evaluator],\n", @@ -430,7 +428,7 @@ "metadata": {}, "outputs": [], "source": [ - "qa_correctness_eval_df.head()" + "print(qa_correctness_eval_df.head())" ] }, { @@ -439,7 +437,7 @@ "metadata": {}, "outputs": [], "source": [ - "hallucination_eval_df.head()" + "print(hallucination_eval_df.head())" ] }, { @@ -471,15 +469,6 @@ "We now have sent all our evaluations to Phoenix. Let's go to the Phoenix application and view the results! Since we've sent all the evals to Phoenix, we can analyze the results together to make a determination on whether or not poor retrieval or irrelevant context has an effect on the LLM's ability to generate the correct response." 
] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"phoenix URL\", session.url)" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -515,7 +504,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.14" + "version": "3.11.11" } }, "nbformat": 4, diff --git a/third_party/Phoenix/arize_phoenix_tracing.ipynb b/third_party/Phoenix/arize_phoenix_tracing.ipynb index 78e497f7..96a2b0a6 100644 --- a/third_party/Phoenix/arize_phoenix_tracing.ipynb +++ b/third_party/Phoenix/arize_phoenix_tracing.ipynb @@ -45,12 +45,12 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ - "!pip install -q arize-phoenix jsonschema openinference-instrumentation-mistralai\n", - "!pip install -qU mistralai " + "!pip install -q arize-phoenix==5.2.2 jsonschema==4.23.0 openinference-instrumentation-mistralai==1.0.0\n", + "!pip install -qU mistralai==1.2.5" ] }, { @@ -91,7 +91,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -106,7 +106,19 @@ "source": [ "## 3. Run Phoenix in the Background\n", "\n", - "Launch Phoenix as a background session to collect the trace data emitted by your instrumented Mistral client. For details on how to self-host Phoenix or connect to a remote Phoenix instance, see the [Phoenix documentation](https://docs.arize.com/phoenix/quickstart)." + "Phoenix is a free-to-use open-source platform. You can get an API key from the [Phoenix website](https://phoenix.arize.com/). Alternatively, see here for details on [how to run Phoenix locally](https://docs.arize.com/phoenix/deployment/environments)." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "if not (api_key := os.getenv(\"PHOENIX_API_KEY\")):\n", + " api_key = getpass(\"πŸ”‘ Enter your Phoenix API key: \")\n", + "os.environ[\"PHOENIX_API_KEY\"] = api_key" ] }, { @@ -115,7 +127,9 @@ "metadata": {}, "outputs": [], "source": [ - "session = px.launch_app()\n", + "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n", + "os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={os.getenv('PHOENIX_API_KEY')}\"\n", + "\n", "tracer_provider = register()" ] }, @@ -130,7 +144,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -148,7 +162,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -181,7 +195,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -274,16 +288,7 @@ "source": [ "## 6. View traces in Phoenix\n", "\n", - "You should now be able to view traces in Phoenix." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(f\"Current Phoenix URL: {session.url}\")" + "You should now be able to view traces in [Phoenix](https://app.phoenix.arize.com/)." 
] }, { diff --git a/third_party/Phoenix/images/phoenix-agent-eval.png b/third_party/Phoenix/images/phoenix-agent-eval.png new file mode 100644 index 00000000..1f577af5 Binary files /dev/null and b/third_party/Phoenix/images/phoenix-agent-eval.png differ diff --git a/third_party/Phoenix/images/phoenix-agent-summarization-eval.png b/third_party/Phoenix/images/phoenix-agent-summarization-eval.png new file mode 100644 index 00000000..8d3314cd Binary files /dev/null and b/third_party/Phoenix/images/phoenix-agent-summarization-eval.png differ