Merge pull request #364 from junefish/langchain-prefix

zackproser · web-flow · commit 632c4b4b2069 · 2024-06-18T11:14:27.000-04:00
Create example notebook for using ID prefixes with LangChain
diff --git a/learn/generation/langchain/managing_rag_docs_langchain.ipynb b/learn/generation/langchain/managing_rag_docs_langchain.ipynb
@@ -0,0 +1,389 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+       "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/experimental/merge-namespaces/merge-namespaces.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/experimental/merge-namespaces/merge-namespaces.ipynb)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Managing RAG Documents with LangChain"
+      ],
+      "metadata": {
+        "id": "6pAOU9GazGY2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "When upserting documents with LangChain's [`PineconeVectorStore`](https://api.python.langchain.com/en/latest/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html#langchain_pinecone.vectorstores.PineconeVectorStore) method, by default the vector IDs generated are random UUIDs. As a best practice when [managing RAG documents](https://docs.pinecone.io/guides/data/manage-rag-documents), ID prefixes should be used.\n",
+        "\n",
+        "This notebook gives an example of specifying ID prefixes when upserting to a Pinecone index with LangChain."
+      ],
+      "metadata": {
+        "id": "ZdCSDz10zbB4"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "## Setup"
+      ],
+      "metadata": {
+        "id": "7AdAv8Jg4U27"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "BSUyXksS3bkk",
+        "outputId": "559433f2-ed32-48f3-fbad-22be3a286a2f"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m974.6/974.6 kB\u001b[0m \u001b[31m8.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.2/2.2 MB\u001b[0m \u001b[31m16.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m315.6/315.6 kB\u001b[0m \u001b[31m16.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m215.9/215.9 kB\u001b[0m \u001b[31m17.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m325.5/325.5 kB\u001b[0m \u001b[31m20.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m22.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m125.2/125.2 kB\u001b[0m \u001b[31m15.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.2/49.2 kB\u001b[0m \u001b[31m6.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m145.0/145.0 kB\u001b[0m \u001b[31m15.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.6/75.6 kB\u001b[0m \u001b[31m9.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m7.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[?25h"
+          ]
+        }
+      ],
+      "source": [
+        "%pip install --upgrade --quiet  \\\n",
+        "    langchain-pinecone \\\n",
+        "    langchain-openai \\\n",
+        "    langchain \\\n",
+        "    langchain-community \\\n",
+        "    pinecone-notebooks"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import getpass\n",
+        "openai_api_key = getpass.getpass(prompt='Enter your OpenAI API key:')"
+      ],
+      "metadata": {
+        "id": "tjnKw_Qf36KK",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "e46b4c3a-5105-4ba9-8a95-5864965d92e1"
+      },
+      "execution_count": 11,
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Enter your OpenAI API key:··········\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Chunk the file"
+      ],
+      "metadata": {
+        "id": "6ulCGTfj4OkW"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "filepath = \"/content/sample_data/state_of_the_union.txt\""
+      ],
+      "metadata": {
+        "id": "VUNVpZiO30tc"
+      },
+      "execution_count": 16,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from langchain_community.document_loaders import TextLoader\n",
+        "from langchain_openai import OpenAIEmbeddings\n",
+        "from langchain_text_splitters import CharacterTextSplitter\n",
+        "\n",
+        "loader = TextLoader(filepath)\n",
+        "documents = loader.load()\n",
+        "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
+        "docs = text_splitter.split_documents(documents)\n",
+        "\n",
+        "embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)"
+      ],
+      "metadata": {
+        "id": "d_gZtjGO3j2d"
+      },
+      "execution_count": 17,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Connect to Pinecone"
+      ],
+      "metadata": {
+        "id": "TAdv4PJj4Y0x"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from pinecone_notebooks.colab import Authenticate\n",
+        "\n",
+        "Authenticate()"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 247
+        },
+        "id": "i0dIZL6d4MOK",
+        "outputId": "3bc6cbb4-aa99-4fa2-9e58-2d569fee7771"
+      },
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "<script type=\"text/javascript\" src=\"https://connect.pinecone.io/embed.js\"></script>"
+            ]
+          },
+          "metadata": {}
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import os\n",
+        "pinecone_api_key = os.environ.get(\"PINECONE_API_KEY\")\n",
+        "\n",
+        "import time\n",
+        "\n",
+        "from pinecone import Pinecone, ServerlessSpec\n",
+        "\n",
+        "pc = Pinecone(api_key=pinecone_api_key)"
+      ],
+      "metadata": {
+        "id": "P7voylv74ezR"
+      },
+      "execution_count": 18,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import time\n",
+        "\n",
+        "index_name = \"langchain-id-test\" # change to match an existing index\n",
+        "\n",
+        "existing_indexes = [index_info[\"name\"] for index_info in pc.list_indexes()]\n",
+        "\n",
+        "if index_name not in existing_indexes:\n",
+        "    pc.create_index(\n",
+        "        name=index_name,\n",
+        "        dimension=1536,\n",
+        "        metric=\"cosine\",\n",
+        "        spec=ServerlessSpec(cloud=\"aws\", region=\"us-east-1\"),\n",
+        "    )\n",
+        "    while not pc.describe_index(index_name).status[\"ready\"]:\n",
+        "        time.sleep(1)\n",
+        "\n",
+        "index = pc.Index(index_name)"
+      ],
+      "metadata": {
+        "id": "XWddh1Lq4h5Z"
+      },
+      "execution_count": 19,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Upsert data"
+      ],
+      "metadata": {
+        "id": "x4KaMq5P4tjU"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Generate IDs with the specified prefix"
+      ],
+      "metadata": {
+        "id": "1lKRfEYJ0GKj"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "prefix = \"sotu\" # change to reflect your document\n",
+        "\n",
+        "ids = []\n",
+        "for i in range(len(docs)):\n",
+        "  ids.append(prefix+\"#\"+str(i))"
+      ],
+      "metadata": {
+        "id": "nrONNB070LRj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Upsert to Pinecone"
+      ],
+      "metadata": {
+        "id": "yx5yPTIj0M_H"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from langchain_pinecone import PineconeVectorStore\n",
+        "\n",
+        "docsearch = PineconeVectorStore.from_documents(docs, embeddings, index_name=index_name)\n",
+        "\n",
+        "vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)\n",
+        "vectorstore.add_documents(docs, ids=ids) # prints IDs of upserted vectors"
+      ],
+      "metadata": {
+        "id": "2OhSEwVY4ut6",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "23f077c4-a80d-4bf0-9555-027038e5d152"
+      },
+      "execution_count": 26,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "['sotu#0',\n",
+              " 'sotu#1',\n",
+              " 'sotu#2',\n",
+              " 'sotu#3',\n",
+              " 'sotu#4',\n",
+              " 'sotu#5',\n",
+              " 'sotu#6',\n",
+              " 'sotu#7',\n",
+              " 'sotu#8',\n",
+              " 'sotu#9',\n",
+              " 'sotu#10',\n",
+              " 'sotu#11',\n",
+              " 'sotu#12',\n",
+              " 'sotu#13',\n",
+              " 'sotu#14',\n",
+              " 'sotu#15',\n",
+              " 'sotu#16',\n",
+              " 'sotu#17',\n",
+              " 'sotu#18',\n",
+              " 'sotu#19',\n",
+              " 'sotu#20',\n",
+              " 'sotu#21',\n",
+              " 'sotu#22',\n",
+              " 'sotu#23',\n",
+              " 'sotu#24',\n",
+              " 'sotu#25',\n",
+              " 'sotu#26',\n",
+              " 'sotu#27',\n",
+              " 'sotu#28',\n",
+              " 'sotu#29',\n",
+              " 'sotu#30',\n",
+              " 'sotu#31',\n",
+              " 'sotu#32',\n",
+              " 'sotu#33',\n",
+              " 'sotu#34',\n",
+              " 'sotu#35',\n",
+              " 'sotu#36',\n",
+              " 'sotu#37',\n",
+              " 'sotu#38',\n",
+              " 'sotu#39',\n",
+              " 'sotu#40',\n",
+              " 'sotu#41']"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 26
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "query = \"What did the president say about Ketanji Brown Jackson\"\n",
+        "docs = docsearch.similarity_search(query)\n",
+        "print(docs[0].page_content)"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "l3KkioA24z6-",
+        "outputId": "627683e5-e8c9-4b35-b799-50967ce7f61e"
+      },
+      "execution_count": 24,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
+            "\n",
+            "Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
+            "\n",
+            "One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
+            "\n",
+            "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n"
+          ]
+        }
+      ]
+    }
+  ]
+}