From a4ba044ec618913a6d09196d58c106c6ecbe0a67 Mon Sep 17 00:00:00 2001 From: Christophe Bornet Date: Thu, 25 Jul 2024 00:20:21 +0200 Subject: [PATCH 1/3] Add a tutorial for GraphVectorStore --- docs/docs/tutorials/graph_vectorstore.ipynb | 562 ++++++++++++++++++++ docs/docs/tutorials/index.mdx | 1 + 2 files changed, 563 insertions(+) create mode 100644 docs/docs/tutorials/graph_vectorstore.ipynb diff --git a/docs/docs/tutorials/graph_vectorstore.ipynb b/docs/docs/tutorials/graph_vectorstore.ipynb new file mode 100644 index 0000000000000..216e14e2f0833 --- /dev/null +++ b/docs/docs/tutorials/graph_vectorstore.ipynb @@ -0,0 +1,562 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Build a Tech Support Bot from an existing Knowledge Base\n", + "\n", + "## Preliminaries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -q langchain-community beautifulsoup4 markdownify python-dotenv" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load the Astra Documentation into GraphVectorStore\n", + "\n", + "First, we'll crawl the DataStax documentation. LangChain includes a `SiteMapLoader` but it loads all of the pages into memory simultaneously, which makes it impossible to index larger sites from small environments (such as CoLab). So, we'll scrape the sitemap ourselves and iterate over the URLs, allowing us to process documents in batches and flush them to Astra DB. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scrape the URLs from the Site Maps\n", + "First, we use Beautiful Soup to parse the XML content of each sitemap and get the list of URLs.\n", + "We also add a few extra URLs for external sites that are also useful to include in the index." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "from bs4 import BeautifulSoup\n", + "\n", + "# Use sitemaps to crawl the content\n", + "SITEMAPS = [\n", + " \"https://docs.datastax.com/en/sitemap-astra-db-vector.xml\",\n", + " \"https://docs.datastax.com/en/sitemap-cql.xml\",\n", + " \"https://docs.datastax.com/en/sitemap-dev-app-drivers.xml\",\n", + " \"https://docs.datastax.com/en/sitemap-glossary.xml\",\n", + " \"https://docs.datastax.com/en/sitemap-astra-db-serverless.xml\",\n", + "]\n", + "\n", + "# Additional URLs to crawl for content.\n", + "EXTRA_URLS = [\"https://github.com/jbellis/jvector\"]\n", + "\n", + "SITE_PREFIX = \"astra\"\n", + "\n", + "\n", + "def load_pages(sitemap_url):\n", + " r = requests.get(\n", + " sitemap_url,\n", + " headers={\n", + " # Astra docs only return a sitemap with a user agent set.\n", + " \"User-Agent\": \"Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 \"\n", + " \"Firefox/58.0\",\n", + " },\n", + " timeout=30,\n", + " )\n", + " xml = r.text\n", + "\n", + " soup = BeautifulSoup(xml, features=\"xml\")\n", + " url_tags = soup.find_all(\"url\")\n", + " for url in url_tags:\n", + " yield (url.find(\"loc\").text)\n", + "\n", + "\n", + "# For maintenance purposes, we could check only the new articles since a given time.\n", + "URLS = [url for sitemap_url in SITEMAPS for url in load_pages(sitemap_url)] + EXTRA_URLS\n", + "len(URLS)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load the content from each URL\n", + "Next, we create the code to load each page. This performs the following steps:\n", + "\n", + "1. Parses the HTML with BeautifulSoup\n", + "2. Locates the \"content\" of the HTML using an appropriate selector based on the URL\n", + "3. 
Find the link (`<a>`) tags in the content and collect the absolute URLs (for creating edges).\n", + "\n", + "Adding the URLs of these references to the metadata allows the graph store to create edges between the document." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import AsyncIterator, Iterable\n", + "\n", + "from langchain_community.document_loaders import AsyncHtmlLoader\n", + "from langchain_community.graph_vectorstores.extractors import HtmlInput, HtmlLinkExtractor\n", + "from langchain_core.documents import Document\n", + "from langchain_core.graph_vectorstores.links import add_links\n", + "from markdownify import MarkdownConverter\n", + "\n", + "markdown_converter = MarkdownConverter(heading_style=\"ATX\")\n", + "html_link_extractor = HtmlLinkExtractor()\n", + "\n", + "\n", + "def select_content(soup: BeautifulSoup, url: str) -> BeautifulSoup:\n", + " if url.startswith(\"https://docs.datastax.com/en/\"):\n", + " return soup.select_one(\"article.doc\")\n", + " if url.startswith(\"https://github.com\"):\n", + " return soup.select_one(\"article.entry-content\")\n", + " return soup\n", + "\n", + "\n", + "async def load_pages(urls: Iterable[str]) -> AsyncIterator[Document]:\n", + " loader = AsyncHtmlLoader(\n", + " urls,\n", + " requests_per_second=4,\n", + " # Astra docs require a user agent\n", + " header_template={\n", + " \"User-Agent\": \"Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 \"\n", + " \"Firefox/58.0\"\n", + " },\n", + " )\n", + " async for html in loader.alazy_load():\n", + " url = html.metadata[\"source\"]\n", + "\n", + " # Use the URL as the doc ID.\n", + " html.id = url\n", + "\n", + " # Apply the selectors while loading. 
This reduces the size of\n", + " # the document as early as possible for reduced memory usage.\n", + " soup = BeautifulSoup(html.page_content, \"html.parser\")\n", + " content = select_content(soup, url)\n", + "\n", + " # Extract HTML links from the content.\n", + " add_links(html, html_link_extractor.extract_one(HtmlInput(content, url)))\n", + "\n", + " # Convert the content to markdown\n", + " html.page_content = markdown_converter.convert_soup(content)\n", + "\n", + " yield html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialize Environment\n", + "Before we initialize the Graph Store and write the documents we need to set some environment variables.\n", + "In colab, this will prompt you for input. When running locally, this will load from `.env`." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Not in colab. Loading '.env' (see 'env.template' for example)\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " # (Option 1) - Set the environment variables from getpass.\n", + " print(\"In colab. Using getpass/input for environment variables.\")\n", + " import getpass\n", + " import os\n", + "\n", + " os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter OpenAI API Key: \")\n", + " os.environ[\"ASTRA_DB_DATABASE_ID\"] = input(\"Enter Astra DB Database ID: \")\n", + " os.environ[\"ASTRA_DB_APPLICATION_TOKEN\"] = getpass.getpass(\n", + " \"Enter Astra DB Application Token: \"\n", + " )\n", + "\n", + " keyspace = input(\"Enter Astra DB Keyspace (Empty for default): \")\n", + " if keyspace:\n", + " os.environ[\"ASTRA_DB_KEYSPACE\"] = keyspace\n", + " else:\n", + " os.environ.pop(\"ASTRA_DB_KEYSPACE\", None)\n", + "else:\n", + " print(\"Not in colab. 
Loading '.env' (see 'env.template' for example)\")\n", + " import dotenv\n", + "\n", + " dotenv.load_dotenv()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialize Cassio and GraphVectorStore\n", + "With the environment variables set we initialize the Cassio library for talking to Cassandra / Astra DB.\n", + "We also create the `GraphVectorStore`." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "SITE_PREFIX = \"astra_docs\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "answer = input(\"Drop Tables? [(Y)es/(N)o]\")\n", + "if answer.lower() in [\"y\", \"yes\"]:\n", + " import cassio\n", + "\n", + " cassio.init(auto=True)\n", + " from cassio.config import check_resolve_keyspace, check_resolve_session\n", + "\n", + " session = check_resolve_session()\n", + " keyspace = check_resolve_keyspace()\n", + " session.execute(f\"DROP TABLE IF EXISTS {keyspace}.{SITE_PREFIX}_nodes\")\n", + " session.execute(f\"DROP TABLE IF EXISTS {keyspace}.{SITE_PREFIX}_targets\")\n", + "else:\n", + " # Handle no / \"wrong\" input\n", + " pass" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import cassio\n", + "from langchain_openai import OpenAIEmbeddings\n", + "from langchain_community.graph_vectorstores import CassandraGraphVectorStore\n", + "\n", + "cassio.init(auto=True)\n", + "embeddings = OpenAIEmbeddings()\n", + "graph_vectorstore = CassandraGraphVectorStore(\n", + " embeddings,\n", + " node_table=f\"{SITE_PREFIX}_nodes\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load the Documents\n", + "Finally, we fetch pages and write them to the graph store in batches of 50." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "not_found = 0\n", + "found = 0\n", + "BATCH_SIZE = 50\n", + "\n", + "docs = []\n", + "async for doc in load_pages(URLS):\n", + " if doc.page_content.startswith(\"\\n# Page Not Found\"):\n", + " not_found += 1\n", + " continue\n", + "\n", + " docs.append(doc)\n", + " found += 1\n", + "\n", + " if len(docs) >= BATCH_SIZE:\n", + " graph_vectorstore.add_documents(docs)\n", + " docs.clear()\n", + "\n", + "if docs:\n", + " graph_vectorstore.add_documents(docs)\n", + "print(f\"{not_found} (of {not_found + found}) URLs were not found\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Create and execute the RAG Chains" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.prompts import ChatPromptTemplate\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_openai import ChatOpenAI\n", + "\n", + "llm = ChatOpenAI(model=\"gpt-4o\")\n", + "\n", + "template = \"\"\"You are a helpful technical support bot. You should provide complete answers explaining the options the user has available to address their problem. Answer the question based only on the following context:\n", + "{context}\n", + "\n", + "Question: {question}\n", + "\"\"\" # noqa: E501\n", + "prompt = ChatPromptTemplate.from_template(template)\n", + "\n", + "\n", + "def format_docs(docs):\n", + " return \"\\n\\n\".join(\n", + " f\"From {doc.metadata['content_id']}: {doc.page_content}\" for doc in docs\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll use the following question. This is an interesting question because the ideal answer should be concise and in-depth, based on how the vector indexing is actually implemented." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "QUESTION = \"What vector indexing algorithms does Astra use?\"" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import Markdown, display\n", + "\n", + "\n", + "# Helper method to render markdown in responses to a chain.\n", + "def run_and_render(chain, question):\n", + " result = chain.invoke(question)\n", + " display(Markdown(result))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Vector-Only Retrieval" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Depth 0 doesn't traverses edges and is equivalent to vector similarity only.\n", + "vector_retriever = graph_vectorstore.as_retriever(search_kwargs={\"depth\": 0})\n", + "\n", + "vector_rag_chain = (\n", + " {\"context\": vector_retriever | format_docs, \"question\": RunnablePassthrough()}\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_and_render(vector_rag_chain, QUESTION)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Graph Traversal Retrieval" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Depth 1 does vector similarity and then traverses 1 level of edges.\n", + "graph_retriever = graph_vectorstore.as_retriever(search_kwargs={\"depth\": 1})\n", + "\n", + "graph_rag_chain = (\n", + " {\"context\": graph_retriever | format_docs, \"question\": RunnablePassthrough()}\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_and_render(graph_rag_chain, 
QUESTION)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## MMR Graph Traversal" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "mmr_graph_retriever = graph_vectorstore.as_retriever(\n", + " search_type=\"mmr_traversal\",\n", + " search_kwargs={\n", + " \"k\": 4,\n", + " \"fetch_k\": 10,\n", + " \"depth\": 2,\n", + " # \"score_threshold\": 0.2,\n", + " },\n", + ")\n", + "\n", + "mmr_graph_rag_chain = (\n", + " {\"context\": mmr_graph_retriever | format_docs, \"question\": RunnablePassthrough()}\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "Astra DB Serverless uses the JVector vector search engine to construct a graph-based index. JVector is part of the DiskANN family and is designed to facilitate approximate nearest neighbor (ANN) search, which is crucial for handling high-dimensional vector spaces efficiently.\n", + "\n", + "Here are the key aspects of JVector and its indexing algorithms:\n", + "\n", + "1. **Graph-Based Index**: JVector constructs a single-layer graph with nonblocking concurrency control. This allows for scalable and efficient search operations.\n", + "\n", + "2. **Incremental Updates**: JVector supports incremental construction and updates to the index, making it suitable for dynamic datasets.\n", + "\n", + "3. **Two-Pass Search**: JVector employs a two-pass search strategy:\n", + " - **First Pass**: Uses lossily compressed representations of vectors stored in memory to quickly narrow down candidates.\n", + " - **Second Pass**: Uses more accurate representations read from disk to refine the search results.\n", + "\n", + "4. 
**Compression Techniques**: JVector supports various vector compression techniques to optimize memory usage and performance:\n", + " - **Product Quantization (PQ)**: A method that compresses vectors by splitting them into subspaces and quantizing each subspace separately.\n", + " - **Binary Quantization (BQ)**: Another compression method, although it is generally less effective than PQ for most embedding models.\n", + " - **Fused ADC (Asymmetric Distance Computation)**: Combines PQ with efficient distance computation methods to enhance search speed.\n", + "\n", + "5. **DiskANN Architecture**: JVector builds on the DiskANN design, allowing it to handle larger-than-memory indexes by storing additional data on disk.\n", + "\n", + "6. **High-Dimensional Optimization**: JVector uses the Panama Vector API (SIMD) to optimize ANN indexing and search operations, ensuring high performance even with large datasets.\n", + "\n", + "In summary, Astra DB Serverless leverages the JVector engine, which employs a graph-based index with advanced compression and search optimization techniques to provide efficient vector search capabilities." 
+ ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "run_and_render(mmr_graph_rag_chain, QUESTION)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Check Retrieval Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Set the question and see what documents each technique retrieves.\n", + "for i, doc in enumerate(vector_retriever.invoke(QUESTION)):\n", + " print(f\"Vector [{i}]: {doc.metadata['content_id']}\")\n", + "\n", + "for i, doc in enumerate(graph_retriever.invoke(QUESTION)):\n", + " print(f\"Graph [{i}]: {doc.metadata['content_id']}\")\n", + "\n", + "for i, doc in enumerate(mmr_graph_retriever.invoke(QUESTION)):\n", + " print(f\"MMR Graph [{i}]: {doc.metadata['content_id']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Conclusion\n", + "With vector only we retrieved chunks from the Astra documentation explaining that it used JVector.\n", + "Since it didn't follow the link to [JVector on GitHub](https://github.com/jbellis/jvector) it didn't actually answer the question.\n", + "\n", + "The graph retrieval started with the same set of chunks, but it followed the edge to the documents we loaded from GitHub.\n", + "This allowed the LLM to read in more depth how JVector is implemented, which allowed it to answer the question more clearly and with more detail." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "agent-framework-aiP65pJh-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/docs/tutorials/index.mdx b/docs/docs/tutorials/index.mdx index a4e1840bf9744..0b6468f3d129c 100644 --- a/docs/docs/tutorials/index.mdx +++ b/docs/docs/tutorials/index.mdx @@ -15,6 +15,7 @@ New to LangChain or to LLM app development in general? Read this material to qui ## Working with external knowledge - [Build a Retrieval Augmented Generation (RAG) Application](/docs/tutorials/rag) - [Build a Conversational RAG Application](/docs/tutorials/qa_chat_history) +- [Build a Tech Support Bot from an existing Knowledge Base](/docs/tutorials/graph_vectorstore) - [Build a Question/Answering system over SQL data](/docs/tutorials/sql_qa) - [Build a Query Analysis System](/docs/tutorials/query_analysis) - [Build a local RAG application](/docs/tutorials/local_rag) From 8d73b211ea651d03aa0de252c62576b60d84706d Mon Sep 17 00:00:00 2001 From: Christophe Bornet Date: Mon, 29 Jul 2024 15:58:30 +0200 Subject: [PATCH 2/3] Update tutorial --- docs/docs/tutorials/graph_vectorstore.ipynb | 33 ++++++++++++--------- docs/docs/tutorials/index.mdx | 2 +- 2 files changed, 20 insertions(+), 15 deletions(-) diff --git a/docs/docs/tutorials/graph_vectorstore.ipynb b/docs/docs/tutorials/graph_vectorstore.ipynb index 216e14e2f0833..6416be21237ed 100644 --- a/docs/docs/tutorials/graph_vectorstore.ipynb +++ b/docs/docs/tutorials/graph_vectorstore.ipynb @@ -4,7 +4,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Build a Tech Support Bot from an existing Knowledge Base\n", + "# Build a RAG application with graph links between 
documents\n", + "\n", + "This tutorial shows how to create links between documents in a GraphVectorStore and use it to get more relevant responses when querying.\n", + "It contains 2 main sections:\n", + "* [Preparation: load the Astra Documentation into GraphVectorStore](#preparation-load-the-astra-documentation-into-graphvectorstore)\n", + "* [Create and execute the RAG Chains](#create-and-execute-the-rag-chains)\n", "\n", "## Preliminaries" ] @@ -22,16 +27,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Load the Astra Documentation into GraphVectorStore\n", + "## Preparation: load the Astra Documentation into GraphVectorStore\n", "\n", - "First, we'll crawl the DataStax documentation. LangChain includes a `SiteMapLoader` but it loads all of the pages into memory simultaneously, which makes it impossible to index larger sites from small environments (such as CoLab). So, we'll scrape the sitemap ourselves and iterate over the URLs, allowing us to process documents in batches and flush them to Astra DB. " + "First, we'll crawl the DataStax documentation. At the moment, `SiteMapLoader` loads all of the pages into memory simultaneously, which makes it impossible to index larger sites from small environments (such as CoLab). So, we'll scrape the sitemap ourselves and iterate over the URLs, allowing us to process documents in batches and flush them to Astra DB. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Scrape the URLs from the Site Maps\n", + "### Scrape the URLs from the Site Maps\n", "First, we use Beautiful Soup to parse the XML content of each sitemap and get the list of URLs.\n", "We also add a few extra URLs for external sites that are also useful to include in the index." ] @@ -87,7 +92,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Load the content from each URL\n", + "### Load the content from each URL\n", "Next, we create the code to load each page. This performs the following steps:\n", "\n", "1. 
Parses the HTML with BeautifulSoup\n", @@ -157,7 +162,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Initialize Environment\n", + "### Initialize Environment\n", "Before we initialize the Graph Store and write the documents we need to set some environment variables.\n", "In colab, this will prompt you for input. When running locally, this will load from `.env`." ] @@ -206,7 +211,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Initialize Cassio and GraphVectorStore\n", + "### Initialize Cassio and GraphVectorStore\n", "With the environment variables set we initialize the Cassio library for talking to Cassandra / Astra DB.\n", "We also create the `GraphVectorStore`." ] @@ -264,7 +269,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Load the Documents\n", + "### Load the Documents\n", "Finally, we fetch pages and write them to the graph store in batches of 50." ] }, @@ -300,7 +305,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Create and execute the RAG Chains" + "## Create and execute the RAG Chains" ] }, { @@ -365,7 +370,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Vector-Only Retrieval" + "### Vector-Only Retrieval" ] }, { @@ -398,7 +403,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Graph Traversal Retrieval" + "### Graph Traversal Retrieval" ] }, { @@ -431,7 +436,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## MMR Graph Traversal" + "### MMR Graph Traversal" ] }, { @@ -505,7 +510,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Check Retrieval Results" + "### Check Retrieval Results" ] }, { @@ -529,7 +534,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Conclusion\n", + "## Conclusion\n", "With vector only we retrieved chunks from the Astra documentation explaining that it used JVector.\n", "Since it didn't follow the link to [JVector on GitHub](https://github.com/jbellis/jvector) it didn't actually answer the 
question.\n", "\n", diff --git a/docs/docs/tutorials/index.mdx b/docs/docs/tutorials/index.mdx index 0b6468f3d129c..81d31e7a47c53 100644 --- a/docs/docs/tutorials/index.mdx +++ b/docs/docs/tutorials/index.mdx @@ -15,7 +15,7 @@ New to LangChain or to LLM app development in general? Read this material to qui ## Working with external knowledge - [Build a Retrieval Augmented Generation (RAG) Application](/docs/tutorials/rag) - [Build a Conversational RAG Application](/docs/tutorials/qa_chat_history) -- [Build a Tech Support Bot from an existing Knowledge Base](/docs/tutorials/graph_vectorstore) +- [Build a RAG application with graph links between documents](/docs/tutorials/graph_vectorstore) - [Build a Question/Answering system over SQL data](/docs/tutorials/sql_qa) - [Build a Query Analysis System](/docs/tutorials/query_analysis) - [Build a local RAG application](/docs/tutorials/local_rag) From 485853d0f23759dd25bef8bd7ba2b3ab916041d1 Mon Sep 17 00:00:00 2001 From: Christophe Bornet Date: Mon, 29 Jul 2024 16:40:39 +0200 Subject: [PATCH 3/3] Update tutorial --- docs/docs/tutorials/graph_vectorstore.ipynb | 142 +++++--------------- 1 file changed, 32 insertions(+), 110 deletions(-) diff --git a/docs/docs/tutorials/graph_vectorstore.ipynb b/docs/docs/tutorials/graph_vectorstore.ipynb index 6416be21237ed..d618edfc02847 100644 --- a/docs/docs/tutorials/graph_vectorstore.ipynb +++ b/docs/docs/tutorials/graph_vectorstore.ipynb @@ -7,8 +7,9 @@ "# Build a RAG application with graph links between documents\n", "\n", "This tutorial shows how to create links between documents in a GraphVectorStore and use it to get more relevant responses when querying.\n", + "\n", "It contains 2 main sections:\n", - "* [Preparation: load the Astra Documentation into GraphVectorStore](#preparation-load-the-astra-documentation-into-graphvectorstore)\n", + "* [Preparation: load the DataStax Astra Documentation into 
GraphVectorStore](#preparation-load-the-datastax-astra-documentation-into-graphvectorstore)\n", "* [Create and execute the RAG Chains](#create-and-execute-the-rag-chains)\n", "\n", "## Preliminaries" @@ -27,7 +28,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Preparation: load the Astra Documentation into GraphVectorStore\n", + "## Preparation: load the DataStax Astra Documentation into GraphVectorStore\n", "\n", "First, we'll crawl the DataStax documentation. At the moment, `SiteMapLoader` loads all of the pages into memory simultaneously, which makes it impossible to index larger sites from small environments (such as CoLab). So, we'll scrape the sitemap ourselves and iterate over the URLs, allowing us to process documents in batches and flush them to Astra DB. " ] @@ -38,7 +39,7 @@ "source": [ "### Scrape the URLs from the Site Maps\n", "First, we use Beautiful Soup to parse the XML content of each sitemap and get the list of URLs.\n", - "We also add a few extra URLs for external sites that are also useful to include in the index." + "We also add a few extra URLs for external sites that are useful to include in the index." ] }, { @@ -62,8 +63,6 @@ "# Additional URLs to crawl for content.\n", "EXTRA_URLS = [\"https://github.com/jbellis/jvector\"]\n", "\n", - "SITE_PREFIX = \"astra\"\n", - "\n", "\n", "def load_pages(sitemap_url):\n", " r = requests.get(\n", @@ -97,9 +96,9 @@ "\n", "1. Parses the HTML with BeautifulSoup\n", "2. Locates the \"content\" of the HTML using an appropriate selector based on the URL\n", - "3. Find the link (`<a>`) tags in the content and collect the absolute URLs (for creating edges).\n", + "3. Uses an `HtmlLinkExtractor` to find the link (`<a>`) tags in the content and collect the absolute URLs (for creating edges).\n", "\n", - "Adding the URLs of these references to the metadata allows the graph store to create edges between the document." 
+ "Adding the URLs of these references to the metadata allows the graph store to create edges between the documents." ] }, { @@ -163,48 +162,36 @@ "metadata": {}, "source": [ "### Initialize Environment\n", - "Before we initialize the Graph Store and write the documents we need to set some environment variables.\n", - "In colab, this will prompt you for input. When running locally, this will load from `.env`." + "Before we initialize the Graph Store and write the documents we need to set some environment variables." ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Not in colab. Loading '.env' (see 'env.template' for example)\n" - ] - } - ], + "outputs": [], "source": [ + "import getpass\n", "import os\n", + "from dotenv import load_dotenv\n", "\n", - "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", - " # (Option 1) - Set the environment variables from getpass.\n", - " print(\"In colab. Using getpass/input for environment variables.\")\n", - " import getpass\n", - " import os\n", + "load_dotenv()\n", "\n", + "if \"OPENAI_API_KEY\" not in os.environ:\n", " os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter OpenAI API Key: \")\n", + " \n", + "if \"ASTRA_DB_DATABASE_ID\" not in os.environ:\n", " os.environ[\"ASTRA_DB_DATABASE_ID\"] = input(\"Enter Astra DB Database ID: \")\n", + " \n", + "if \"ASTRA_DB_APPLICATION_TOKEN\" not in os.environ:\n", " os.environ[\"ASTRA_DB_APPLICATION_TOKEN\"] = getpass.getpass(\n", " \"Enter Astra DB Application Token: \"\n", - " )\n", - "\n", + ")\n", + " \n", + "if \"ASTRA_DB_KEYSPACE\" not in os.environ:\n", " keyspace = input(\"Enter Astra DB Keyspace (Empty for default): \")\n", " if keyspace:\n", - " os.environ[\"ASTRA_DB_KEYSPACE\"] = keyspace\n", - " else:\n", - " os.environ.pop(\"ASTRA_DB_KEYSPACE\", None)\n", - "else:\n", - " print(\"Not in colab. 
Loading '.env' (see 'env.template' for example)\")\n", - " import dotenv\n", - "\n", - " dotenv.load_dotenv()" + " os.environ[\"ASTRA_DB_KEYSPACE\"] = keyspace" ] }, { @@ -212,46 +199,15 @@ "metadata": {}, "source": [ "### Initialize Cassio and GraphVectorStore\n", - "With the environment variables set we initialize the Cassio library for talking to Cassandra / Astra DB.\n", + "With the environment variables set, we initialize the Cassio library for talking to Cassandra / Astra DB.\n", "We also create the `GraphVectorStore`." ] }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "SITE_PREFIX = \"astra_docs\"" - ] - }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "answer = input(\"Drop Tables? [(Y)es/(N)o]\")\n", - "if answer.lower() in [\"y\", \"yes\"]:\n", - " import cassio\n", - "\n", - " cassio.init(auto=True)\n", - " from cassio.config import check_resolve_keyspace, check_resolve_session\n", - "\n", - " session = check_resolve_session()\n", - " keyspace = check_resolve_keyspace()\n", - " session.execute(f\"DROP TABLE IF EXISTS {keyspace}.{SITE_PREFIX}_nodes\")\n", - " session.execute(f\"DROP TABLE IF EXISTS {keyspace}.{SITE_PREFIX}_targets\")\n", - "else:\n", - " # Handle no / \"wrong\" input\n", - " pass" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], "source": [ "import cassio\n", "from langchain_openai import OpenAIEmbeddings\n", @@ -261,7 +217,7 @@ "embeddings = OpenAIEmbeddings()\n", "graph_vectorstore = CassandraGraphVectorStore(\n", " embeddings,\n", - " node_table=f\"{SITE_PREFIX}_nodes\",\n", + " node_table=f\"astra_docs_nodes\",\n", ")" ] }, @@ -310,7 +266,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -344,7 +300,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": {}, "outputs": 
[], "source": [ @@ -353,7 +309,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -441,7 +397,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -465,43 +421,9 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/markdown": [ - "Astra DB Serverless uses the JVector vector search engine to construct a graph-based index. JVector is part of the DiskANN family and is designed to facilitate approximate nearest neighbor (ANN) search, which is crucial for handling high-dimensional vector spaces efficiently.\n", - "\n", - "Here are the key aspects of JVector and its indexing algorithms:\n", - "\n", - "1. **Graph-Based Index**: JVector constructs a single-layer graph with nonblocking concurrency control. This allows for scalable and efficient search operations.\n", - "\n", - "2. **Incremental Updates**: JVector supports incremental construction and updates to the index, making it suitable for dynamic datasets.\n", - "\n", - "3. **Two-Pass Search**: JVector employs a two-pass search strategy:\n", - " - **First Pass**: Uses lossily compressed representations of vectors stored in memory to quickly narrow down candidates.\n", - " - **Second Pass**: Uses more accurate representations read from disk to refine the search results.\n", - "\n", - "4. 
**Compression Techniques**: JVector supports various vector compression techniques to optimize memory usage and performance:\n", - " - **Product Quantization (PQ)**: A method that compresses vectors by splitting them into subspaces and quantizing each subspace separately.\n", - " - **Binary Quantization (BQ)**: Another compression method, although it is generally less effective than PQ for most embedding models.\n", - " - **Fused ADC (Asymmetric Distance Computation)**: Combines PQ with efficient distance computation methods to enhance search speed.\n", - "\n", - "5. **DiskANN Architecture**: JVector builds on the DiskANN design, allowing it to handle larger-than-memory indexes by storing additional data on disk.\n", - "\n", - "6. **High-Dimensional Optimization**: JVector uses the Panama Vector API (SIMD) to optimize ANN indexing and search operations, ensuring high performance even with large datasets.\n", - "\n", - "In summary, Astra DB Serverless leverages the JVector engine, which employs a graph-based index with advanced compression and search optimization techniques to provide efficient vector search capabilities." - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "run_and_render(mmr_graph_rag_chain, QUESTION)" ] @@ -521,13 +443,13 @@ "source": [ "# Set the question and see what documents each technique retrieves.\n", "for i, doc in enumerate(vector_retriever.invoke(QUESTION)):\n", - " print(f\"Vector [{i}]: {doc.metadata['content_id']}\")\n", + " print(f\"Vector [{i}]: {doc.id}\")\n", "\n", "for i, doc in enumerate(graph_retriever.invoke(QUESTION)):\n", - " print(f\"Graph [{i}]: {doc.metadata['content_id']}\")\n", + " print(f\"Graph [{i}]: {doc.id}\")\n", "\n", "for i, doc in enumerate(mmr_graph_retriever.invoke(QUESTION)):\n", - " print(f\"MMR Graph [{i}]: {doc.metadata['content_id']}\")" + " print(f\"MMR Graph [{i}]: {doc.id}\")" ] }, {
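As a rough illustration of what the `depth` search parameter compared in the retrieval check controls, the following is a toy sketch in plain Python. It does not use the GraphVectorStore API: the `expand` function, the `LINKS` mapping, and the document ids are all hypothetical, and the seed list stands in for the results of a vector similarity search.

```python
from collections import deque


def expand(seeds, links, depth):
    """Return ids reachable from `seeds` within `depth` hops, in discovery order.

    Hypothetical helper: `seeds` plays the role of the vector-search hits and
    `links` maps a document id to the ids it links to.
    """
    found = list(dict.fromkeys(seeds))  # de-duplicate while keeping order
    visited = set(found)
    frontier = deque((doc_id, 0) for doc_id in found)
    while frontier:
        doc_id, hops = frontier.popleft()
        if hops == depth:
            continue  # depth budget used up; do not follow further edges
        for target in links.get(doc_id, []):
            if target not in visited:
                visited.add(target)
                found.append(target)
                frontier.append((target, hops + 1))
    return found


# Made-up link graph: a docs page links to a GitHub README, which links onward.
LINKS = {
    "astra/indexing": ["github/jvector"],
    "github/jvector": ["github/jvector/docs"],
}
SEEDS = ["astra/indexing"]

for depth in (0, 1, 2):
    print(depth, expand(SEEDS, LINKS, depth))
```

With `depth=0` only the seed documents come back, which mirrors the vector-only retriever; each extra level of depth lets the traversal follow one more hop of links from what it has already found.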