Commit 517be52

docs: add llamaindex tutorial (#5402)
# Description

Closes #<issue_number>

**Type of change**

- Bug fix (non-breaking change which fixes an issue)
- New feature (non-breaking change which adds functionality)
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
- Refactor (change restructuring the codebase without changing functionality)
- Improvement (change adding some improvement to an existing functionality)
- Documentation update

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
1 parent 78575e6 commit 517be52

File tree: 3 files changed, +341 -0 lines changed

Lines changed: 339 additions & 0 deletions
@@ -0,0 +1,339 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 🕵🏻‍♀️ Create a RAG system expert in a GitHub repository and log your predictions in Argilla\n",
"\n",
"In this tutorial, we'll show you how to create a RAG system that can answer questions about a specific GitHub repository. As an example, we will target the [Argilla repository](https://github.com/argilla-io/argilla). The RAG system will focus on the repository's docs, as that's where most of the natural language information about the repository can be found.\n",
"\n",
"This tutorial includes the following steps:\n",
"\n",
"- Setting up the Argilla callback handler for LlamaIndex.\n",
"- Initializing a GitHub client.\n",
"- Creating an index with a specific set of files from the GitHub repository of our choice.\n",
"- Creating a RAG system out of the Argilla repository, asking questions, and automatically logging the answers to Argilla.\n",
"\n",
"This tutorial is based on the [Github Repository Reader](https://docs.llamaindex.ai/en/stable/examples/data_connectors/GithubRepositoryReaderDemo/) made by LlamaIndex."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting started\n",
"\n",
"### Deploy the Argilla server\n",
"\n",
"If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy it by following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up the environment\n",
"\n",
"To complete this tutorial, you need to install this integration and a third-party library via pip.\n",
"\n",
"!!! note\n",
"    Check the integration GitHub repository [here](https://github.com/argilla-io/argilla-llama-index)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"argilla-llama-index\"\n",
"!pip install \"llama-index-readers-github==0.1.9\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make the required imports:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core import (\n",
"    Settings,\n",
"    VectorStoreIndex,\n",
"    set_global_handler,\n",
")\n",
"from llama_index.llms.openai import OpenAI\n",
"from llama_index.readers.github import (\n",
"    GithubClient,\n",
"    GithubRepositoryReader,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to set the OpenAI API key and the GitHub token. The OpenAI API key is required to run queries using GPT models, while the GitHub token ensures you have access to the repository you're using. Although the GitHub token might not be necessary for public repositories, it is still recommended."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n",
"openai_api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"\n",
"os.environ[\"GITHUB_TOKEN\"] = \"ghp_...\"\n",
"github_token = os.getenv(\"GITHUB_TOKEN\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set Argilla's LlamaIndex handler\n",
"\n",
"To easily log your data into Argilla within your LlamaIndex workflow, you only need a single step: call the Argilla global handler for LlamaIndex before running queries with your LLM. This ensures that the predictions obtained with LlamaIndex are automatically logged to the Argilla instance. The handler takes the following arguments:\n",
"\n",
"- `dataset_name`: The name of the dataset. If the dataset does not exist, it will be created with the specified name. Otherwise, it will be updated.\n",
"- `api_url`: The URL to connect to the Argilla instance.\n",
"- `api_key`: The API key to authenticate with the Argilla instance.\n",
"- `number_of_retrievals`: The number of retrieved documents to be logged. Defaults to 0.\n",
"- `workspace_name`: The name of the workspace to log the data into. Defaults to the first available workspace.\n",
"\n",
"> For more information about the credentials, check the documentation for [users](https://docs.argilla.io/latest/how_to_guides/user/) and [workspaces](https://docs.argilla.io/latest/how_to_guides/workspace/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set_global_handler(\n",
"    \"argilla\",\n",
"    dataset_name=\"github_query_model\",\n",
"    api_url=\"http://localhost:6900\",\n",
"    api_key=\"argilla.apikey\",\n",
"    number_of_retrievals=2,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieve the data from GitHub\n",
"\n",
"First, we need to initialize the GitHub client, which will include the GitHub token for repository access."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"github_client = GithubClient(github_token=github_token, verbose=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before creating our `GithubRepositoryReader` instance, we need to enable nested event loops. Since the Jupyter kernel already operates on an event loop, applying `nest_asyncio` allows the reader to run its own asynchronous calls inside that loop until the repository is fully read."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's create a `GithubRepositoryReader` instance with the necessary repository details. In this case, we'll target the `main` branch of the `argilla` repository. Since we are interested in the documentation, we will include only the `argilla/docs/` directory, excluding images, JSON files, and notebook files."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"documents = GithubRepositoryReader(\n",
"    github_client=github_client,\n",
"    owner=\"argilla-io\",\n",
"    repo=\"argilla\",\n",
"    use_parser=False,\n",
"    verbose=False,\n",
"    filter_directories=(\n",
"        [\"argilla/docs/\"],\n",
"        GithubRepositoryReader.FilterType.INCLUDE,\n",
"    ),\n",
"    filter_file_extensions=(\n",
"        [\n",
"            \".png\",\n",
"            \".jpg\",\n",
"            \".jpeg\",\n",
"            \".gif\",\n",
"            \".svg\",\n",
"            \".ico\",\n",
"            \".json\",\n",
"            \".ipynb\",  # Erase this line if you want to include notebooks\n",
"        ],\n",
"        GithubRepositoryReader.FilterType.EXCLUDE,\n",
"    ),\n",
").load_data(branch=\"main\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create the index and make some queries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's create a LlamaIndex index out of these documents so that we can start querying the RAG system."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# LLM settings\n",
"Settings.llm = OpenAI(\n",
"    model=\"gpt-3.5-turbo\", temperature=0.8, openai_api_key=openai_api_key\n",
")\n",
"\n",
"# Load the data and create the index\n",
"index = VectorStoreIndex.from_documents(documents)\n",
"\n",
"# Create the query engine\n",
"query_engine = index.as_query_engine()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = query_engine.query(\"How do I create a Dataset in Argilla?\")\n",
"response"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The generated response will be automatically logged in our Argilla instance. Check it out! From Argilla, you can quickly review your predictions and annotate them, so you can combine both synthetic data and human feedback.\n",
"\n",
"![Argilla UI](../../assets/images/community/integrations/llamaindex_rag_1.png)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's ask a couple more questions to see the overall behavior of the RAG chatbot. Remember that the answers are automatically logged into your Argilla instance."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: How can I list the available datasets?\n",
"Answer: You can list all the datasets available in a workspace by utilizing the `datasets` attribute of the `Workspace` class. Additionally, you can determine the number of datasets in a workspace by using `len(workspace.datasets)`. To list the datasets, you can iterate over them and print out each dataset. Remember that dataset settings are not preloaded when listing datasets, and if you need to work with settings, you must load them explicitly for each dataset.\n",
"----------------------------\n",
"Question: Which are the user credentials?\n",
"Answer: The user credentials in Argilla consist of a username, password, and API key.\n",
"----------------------------\n",
"Question: Can I use markdown in Argilla?\n",
"Answer: Yes, you can use Markdown in Argilla.\n",
"----------------------------\n",
"Question: Could you explain how to annotate datasets in Argilla?\n",
"Answer: To annotate datasets in Argilla, users can manage their data annotation projects by setting up `Users`, `Workspaces`, `Datasets`, and `Records`. By deploying Argilla on the Hugging Face Hub or with `Docker`, installing the Python SDK with `pip`, and creating the first project, users can get started in just 5 minutes. The tool allows for interacting with data in a more engaging way through features like quick labeling with filters, AI feedback suggestions, and semantic search, enabling users to focus on training models and monitoring their performance effectively.\n",
"----------------------------\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"questions = [\n",
"    \"How can I list the available datasets?\",\n",
"    \"Which are the user credentials?\",\n",
"    \"Can I use markdown in Argilla?\",\n",
"    \"Could you explain how to annotate datasets in Argilla?\",\n",
"]\n",
"\n",
"answers = []\n",
"\n",
"for question in questions:\n",
"    answers.append(query_engine.query(question))\n",
"\n",
"for question, answer in zip(questions, answers):\n",
"    print(f\"Question: {question}\")\n",
"    print(f\"Answer: {answer}\")\n",
"    print(\"----------------------------\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
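As an aside on the notebook's `filter_directories` / `filter_file_extensions` tuples: a file is kept only if it lies under an INCLUDE-d directory and does not carry an EXCLUDE-d extension. The sketch below mimics that behavior with the standard library only; `keep_file`, `INCLUDE_DIRS`, and `EXCLUDE_EXTS` are hypothetical names for illustration and not part of the LlamaIndex API.

```python
# Stdlib-only sketch of the INCLUDE/EXCLUDE filter semantics used above.
# `keep_file` is a hypothetical helper, not a GithubRepositoryReader method.
import os

INCLUDE_DIRS = ["argilla/docs/"]
EXCLUDE_EXTS = [".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".json", ".ipynb"]

def keep_file(path: str) -> bool:
    # Kept only when inside an included directory...
    in_docs = any(path.startswith(d) for d in INCLUDE_DIRS)
    # ...and its extension is not on the exclusion list.
    _, ext = os.path.splitext(path)
    return in_docs and ext not in EXCLUDE_EXTS

print(keep_file("argilla/docs/index.md"))          # kept
print(keep_file("argilla/docs/assets/logo.png"))   # dropped: excluded extension
print(keep_file("argilla/src/argilla/client.py"))  # dropped: outside argilla/docs/
```

The real reader applies equivalent checks while walking the git tree, so narrowing `INCLUDE_DIRS` is the cheapest way to shrink the indexed corpus.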

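The `nest_asyncio.apply()` step in the notebook exists because a running event loop, like the one the Jupyter kernel keeps alive, cannot be driven synchronously from inside itself. A stdlib-only sketch of the failure it works around (the coroutine names here are illustrative, not from the tutorial):

```python
# Demonstrates why nest_asyncio is needed: re-entering an already-running
# event loop, as a blocking reader call would do inside Jupyter, raises.
import asyncio

async def fetch():
    return "repo contents"

async def kernel_cell():
    # Stands in for a Jupyter cell: a coroutine already running on the loop.
    loop = asyncio.get_running_loop()
    coro = fetch()
    try:
        # Synchronously driving the running loop from within it is forbidden.
        loop.run_until_complete(coro)
    except RuntimeError as err:
        return f"RuntimeError: {err}"
    finally:
        coro.close()  # avoid a "never awaited" warning

print(asyncio.run(kernel_cell()))  # the nested call fails with RuntimeError
```

`nest_asyncio.apply()` patches the loop so such nested calls succeed, which is what lets `GithubRepositoryReader.load_data` run to completion inside the kernel.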
argilla/mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -187,5 +187,7 @@ nav:
   - How to contribute?: community/contributor.md
   - Issue dashboard: community/popular_issues.md
   - Changelog: community/changelog.md
+  - Integrations:
+      - LlamaIndex: community/integrations/llamaindex_rag_github.ipynb
   - UI Demo ↗:
       - https://demo.argilla.io/sign-in?auth=ZGVtbzoxMjM0NTY3OA==
