Commit c905676

[RAG-75] docs: add quickstart notebook (#151)
Add quickstart notebook
1 parent 113c543 commit c905676

1 file changed: +310 -0 lines changed

examples/notebooks/quickstart.ipynb

Lines changed: 310 additions & 0 deletions
@@ -0,0 +1,310 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/datastax/ragstack-ai/blob/main/examples/notebooks/quickstart.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quickstart with RAGStack\n",
    "\n",
    "This notebook demonstrates how to set up a simple RAG pipeline with RAGStack. At the end of this notebook, you will have a fully functioning Question/Answer model that can answer questions using your supplied documents.\n",
    "\n",
    "A RAG pipeline requires, at minimum, a vector store, an embedding model, and an LLM. In this tutorial, you will use an Astra DB vector store, an OpenAI embedding model, an OpenAI LLM, and LangChain to orchestrate it all together."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "You will need a vector-enabled Astra database and an OpenAI account.\n",
    "\n",
    "* Create an [Astra vector database](https://docs.datastax.com/en/astra-serverless/docs/getting-started/create-db-choices.html).\n",
    "* Create an [OpenAI account](https://openai.com/).\n",
    "* Within your database, create an [Astra DB Access Token](https://docs.datastax.com/en/astra-serverless/docs/manage/org/manage-tokens.html) with Database Administrator permissions.\n",
    "* Get your Astra DB endpoint:\n",
    "  * `https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com`\n",
    "\n",
    "See the [Prerequisites](https://docs.datastax.com/en/ragstack/docs/prerequisites.html) page for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "`ragstack-ai` includes all the packages you need to build a RAG pipeline.\n",
    "\n",
    "`datasets` is used to import a sample dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install -q ragstack-ai datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "nbmake": {
     "post_cell_execute": [
      "import string\n",
      "import random\n",
      "collection = ''.join(random.choice(string.ascii_lowercase) for _ in range(8))\n"
     ]
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "# Enter your settings for Astra DB and OpenAI:\n",
    "keys = [\"ASTRA_DB_APPLICATION_TOKEN\", \"ASTRA_DB_API_ENDPOINT\", \"OPENAI_API_KEY\"]\n",
    "for key in keys:\n",
    "    if key not in os.environ:\n",
    "        os.environ[key] = getpass(f\"Enter {key}: \")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": [
     "skip-execution"
    ]
   },
   "outputs": [],
   "source": [
    "# Collections are where documents are stored. Example: test\n",
    "collection = input(\"Collection: \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create RAG Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Embedding Model and Vector Store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Astra vector store configured\n"
     ]
    }
   ],
   "source": [
    "from langchain.vectorstores.astradb import AstraDB\n",
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "\n",
    "# Configure your embedding model and vector store\n",
    "embedding = OpenAIEmbeddings()\n",
    "vstore = AstraDB(\n",
    "    collection_name=collection,\n",
    "    embedding=embedding,\n",
    "    token=os.getenv(\"ASTRA_DB_APPLICATION_TOKEN\"),\n",
    "    api_endpoint=os.getenv(\"ASTRA_DB_API_ENDPOINT\"),\n",
    ")\n",
    "print(\"Astra vector store configured\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "An example entry:\n",
      "{'author': 'aristotle', 'quote': 'Love well, be loved and do something of value.', 'tags': 'love;ethics'}\n"
     ]
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "# Load a sample dataset\n",
    "philo_dataset = load_dataset(\"datastax/philosopher-quotes\")[\"train\"]\n",
    "print(\"An example entry:\")\n",
    "print(philo_dataset[16])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.schema import Document\n",
    "\n",
    "# Construct a set of documents from your data. Documents can be used as inputs to your vector store.\n",
    "docs = []\n",
    "for entry in philo_dataset:\n",
    "    metadata = {\"author\": entry[\"author\"]}\n",
    "    if entry[\"tags\"]:\n",
    "        # Add metadata tags to the metadata dictionary\n",
    "        for tag in entry[\"tags\"].split(\";\"):\n",
    "            metadata[tag] = \"y\"\n",
    "    # Create a LangChain document with the quote and metadata tags\n",
    "    doc = Document(page_content=entry[\"quote\"], metadata=metadata)\n",
    "    docs.append(doc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "nbmake": {
     "post_cell_execute": [
      "assert len(inserted_ids) > 0"
     ]
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Inserted 450 documents.\n"
     ]
    }
   ],
   "source": [
    "# Create embeddings by inserting your documents into the vector store.\n",
    "inserted_ids = vstore.add_documents(docs)\n",
    "print(f\"\\nInserted {len(inserted_ids)} documents.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check your collection to verify the documents are embedded.\n",
    "print(vstore.astra_db.collection(collection).find())"
   ]
  },
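  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check, you can also query the vector store directly with a similarity search, without involving an LLM. This is a minimal sketch: the query string is only an example, and it relies on the standard LangChain `similarity_search` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: sanity-check retrieval with a direct similarity search (no LLM).\n",
    "# The query string is only an example.\n",
    "results = vstore.similarity_search(\"What do philosophers say about love?\", k=3)\n",
    "for doc in results:\n",
    "    print(f\"{doc.metadata['author']}: {doc.page_content}\")"
   ]
  },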
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic Retrieval\n",
    "\n",
    "Retrieve context from your vector database, and pass it to the model with a prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'In the given context, philosophers are most concerned with truth and knowledge.'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.prompts import ChatPromptTemplate\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.schema.output_parser import StrOutputParser\n",
    "from langchain.schema.runnable import RunnablePassthrough\n",
    "\n",
    "retriever = vstore.as_retriever(search_kwargs={\"k\": 3})\n",
    "\n",
    "prompt_template = \"\"\"\n",
    "Answer the question based only on the supplied context. If you don't know the answer, say you don't know the answer.\n",
    "Context: {context}\n",
    "Question: {question}\n",
    "Your answer:\n",
    "\"\"\"\n",
    "prompt = ChatPromptTemplate.from_template(prompt_template)\n",
    "model = ChatOpenAI()\n",
    "\n",
    "chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    "    | prompt\n",
    "    | model\n",
    "    | StrOutputParser()\n",
    ")\n",
    "\n",
    "chain.invoke(\"In the given context, what subject are philosophers most concerned with?\")"
   ]
  },
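  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because each document was stored with metadata tags, you can also restrict retrieval to matching documents. The sketch below assumes the `filter` search kwarg accepted by the Astra DB vector store; the author value and question are only examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: restrict retrieval with a metadata filter (assumes the vector\n",
    "# store's `filter` search kwarg; the author value is only an example).\n",
    "filtered_retriever = vstore.as_retriever(\n",
    "    search_kwargs={\"k\": 3, \"filter\": {\"author\": \"aristotle\"}}\n",
    ")\n",
    "filtered_chain = (\n",
    "    {\"context\": filtered_retriever, \"question\": RunnablePassthrough()}\n",
    "    | prompt\n",
    "    | model\n",
    "    | StrOutputParser()\n",
    ")\n",
    "filtered_chain.invoke(\"What is the key to happiness?\")"
   ]
  },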
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add your questions here!\n",
    "# chain.invoke(\"<your question>\")"
   ]
  },
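  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cleanup\n",
    "\n",
    "When you are done experimenting, you can drop the collection to free up resources in your database. This sketch assumes the vector store's `delete_collection` method; skip it if you want to keep your data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "skip-execution"
    ]
   },
   "outputs": [],
   "source": [
    "# Optional cleanup: drop the collection and all of its documents.\n",
    "# Skip this cell if you want to keep the collection.\n",
    "vstore.delete_collection()"
   ]
  },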
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You now have a fully functioning RAG pipeline! Note that there are several different ways to accomplish this, depending on your input data format, vector store, embedding, model, output type, and more. There are also more advanced RAG techniques that leverage new ingestion, retrieval, and generation patterns.\n",
    "\n",
    "RAG is a powerful solution that works in tandem with the capabilities of LLMs. Check out our other examples for ideas on how you can build innovative solutions using RAGStack!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
