|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Retrieval Augmented Generative Question Answer Using OCI OpenSearch as Retriever\n", |
| 8 | + "\n", |
| 9 | + "In this tutorial, we will walk through the steps to set up a retrieval-augmented generative QA using OCI OpenSearch as retriever.\n", |
| 10 | + "\n", |
| 11 | + "### Prerequesites\n", |
| 12 | + "- You have a Running Instance of OCI Search.\n", |
| 13 | + "- OpenSearch version has to be at least 2.8.0.\n", |
| 14 | + "- You need to install langchain, opensearch-py, and oracle-ads.\n", |
| 15 | + "\n", |
| 16 | + "To check how to spin up an instance of OCI search, see [Search and visualize data using OCI Search Service with OpenSearch](https://docs.oracle.com/en/learn/oci-opensearch/index.html#introduction)" |
| 17 | + ] |
| 18 | + }, |
| 19 | + { |
| 20 | + "cell_type": "code", |
| 21 | + "execution_count": null, |
| 22 | + "metadata": {}, |
| 23 | + "outputs": [], |
| 24 | + "source": [ |
| 25 | + "!pip install --upgrade oracle-ads langchain opensearch-py" |
| 26 | + ] |
| 27 | + }, |
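| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Before calling OCI services, `oracle-ads` needs to know how to authenticate. The cell below is a minimal sketch that assumes API keys are configured in `~/.oci/config`; inside an OCI Data Science notebook session, resource principal authentication is an alternative." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "import ads\n", |
| | + " \n", |
| | + "# Use API keys from ~/.oci/config (assumes the DEFAULT profile is configured).\n", |
| | + "# Inside an OCI Data Science notebook session you can use\n", |
| | + "# ads.set_auth(\"resource_principal\") instead.\n", |
| | + "ads.set_auth(\"api_key\")" |
| | + ] |
| | + }, |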
| 28 | + { |
| 29 | + "cell_type": "markdown", |
| 30 | + "metadata": {}, |
| 31 | + "source": [ |
| 32 | + "### Step 1: Load and Split your Documents Into Chunks\n", |
| 33 | + "Let's say you're looking to create a search engine that enables users to search through documentation stored as markdown files. Actually, it does not matter what file format your documentation are in as Langchain offers support for various types of document loaders. In this tutorial, we will just use markdown file as an example." |
| 34 | + ] |
| 35 | + }, |
| 36 | + { |
| 37 | + "cell_type": "code", |
| 38 | + "execution_count": 1, |
| 39 | + "metadata": {}, |
| 40 | + "outputs": [], |
| 41 | + "source": [ |
| 42 | + "import fsspec\n", |
| 43 | + "from langchain.text_splitter import MarkdownHeaderTextSplitter\n", |
| 44 | + "\n", |
| 45 | + "with fsspec.open(\n", |
| 46 | + " \"https://raw.githubusercontent.com/oracle-samples/oci-data-science-ai-samples/main/distributed_training/Tensorboard.md\",\n", |
| 47 | + " \"r\"\n", |
| 48 | + ") as f:\n", |
| 49 | + " report = f.read()\n", |
| 50 | + " \n", |
| 51 | + " \n", |
| 52 | + "headers_to_split_on = [\n", |
| 53 | + " (\"#\", \"Header 1\"),\n", |
| 54 | + " (\"##\", \"Header 2\"),\n", |
| 55 | + " (\"###\", \"Header 3\"),\n", |
| 56 | + "]\n", |
| 57 | + "\n", |
| 58 | + "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n", |
| 59 | + "md_header_splits = markdown_splitter.split_text(report)\n", |
| 60 | + "texts = [text.page_content for text in md_header_splits]" |
| 61 | + ] |
| 62 | + }, |
| 63 | + { |
| 64 | + "cell_type": "code", |
| 65 | + "execution_count": 2, |
| 66 | + "metadata": {}, |
| 67 | + "outputs": [ |
| 68 | + { |
| 69 | + "name": "stdout", |
| 70 | + "output_type": "stream", |
| 71 | + "text": [ |
| 72 | + "Number of documents: 4\n", |
| 73 | + "First document:\n", |
| 74 | + "TensorBoard helps visualizing your experiments. You bring up a ``TensorBoard`` session on your workstation and point to\n", |
| 75 | + "the directory that contains the TensorBoard logs. \n", |
| 76 | + "`OCI` = Oracle Cloud Infrastructure\n", |
| 77 | + "`DT` = Distributed Training\n", |
| 78 | + "`ADS` = Oracle Accelerated Data Science Library\n", |
| 79 | + "`OCIR` = Oracle Cloud Infrastructure Registry\n" |
| 80 | + ] |
| 81 | + } |
| 82 | + ], |
| 83 | + "source": [ |
| 84 | + "print(f\"Number of documents: {len(texts)}\")\n", |
| 85 | + "print(f\"First document:\\n{texts[0]}\")" |
| 86 | + ] |
| 87 | + }, |
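| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As noted above, LangChain also supports other formats and splitters. The sketch below shows how a plain-text document could be chunked instead; `docs.txt` is a hypothetical local file, and the chunk sizes are illustrative." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", |
| | + " \n", |
| | + "# Hypothetical plain-text file; any LangChain document loader works here too.\n", |
| | + "with open(\"docs.txt\") as f:\n", |
| | + "    raw_text = f.read()\n", |
| | + " \n", |
| | + "# Split into ~1000-character chunks with 100 characters of overlap,\n", |
| | + "# preferring paragraph and sentence boundaries.\n", |
| | + "splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n", |
| | + "plain_text_chunks = splitter.split_text(raw_text)" |
| | + ] |
| | + }, |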
| 88 | + { |
| 89 | + "cell_type": "markdown", |
| 90 | + "metadata": {}, |
| 91 | + "source": [ |
| 92 | + "### Step 2: Embed your Documents\n", |
| 93 | + "\n", |
| 94 | + "You can use oracle-ads to access the GenerativeAI embedding models. The embedding models returns embedding vectors of length 1024. oracle-ads is an open source library. It speeds up common data science activities by providing tools that automate and simplify common data science tasks. Additionally, provides data scientists a friendly pythonic interface to OCI services. Check [oracle-ads github](https://github.com/oracle/accelerated-data-science) for more information." |
| 95 | + ] |
| 96 | + }, |
| 97 | + { |
| 98 | + "cell_type": "code", |
| 99 | + "execution_count": null, |
| 100 | + "metadata": {}, |
| 101 | + "outputs": [], |
| 102 | + "source": [ |
| 103 | + "from ads.llm import GenerativeAIEmbeddings\n", |
| 104 | + " \n", |
| 105 | + "oci_embedings = GenerativeAIEmbeddings(\n", |
| 106 | + " compartment_id=\"ocid1.compartment.oc1.######\",\n", |
| 107 | + " client_kwargs=dict(service_endpoint=\"https://generativeai.aiservice.us-chicago-1.oci.oraclecloud.com\") # this can be omitted after Generative AI service is GA.\n", |
| 108 | + ")\n", |
| 109 | + "embeddings = oci_embedings.embed_documents(texts=texts)" |
| 110 | + ] |
| 111 | + }, |
| 112 | + { |
| 113 | + "cell_type": "code", |
| 114 | + "execution_count": 4, |
| 115 | + "metadata": {}, |
| 116 | + "outputs": [ |
| 117 | + { |
| 118 | + "name": "stdout", |
| 119 | + "output_type": "stream", |
| 120 | + "text": [ |
| 121 | + "Number of embeddings: 4\n", |
| 122 | + "Embedding dimensions: 1024\n" |
| 123 | + ] |
| 124 | + } |
| 125 | + ], |
| 126 | + "source": [ |
| 127 | + "print(f\"Number of embeddings: {len(embeddings)}\")\n", |
| 128 | + "print(f\"Embedding dimensions: {len(embeddings[0])}\")" |
| 129 | + ] |
| 130 | + }, |
| 131 | + { |
| 132 | + "cell_type": "markdown", |
| 133 | + "metadata": {}, |
| 134 | + "source": [ |
| 135 | + "### Step 3: Create an Index for your Documents\n", |
| 136 | + "\n", |
| 137 | + "First connect to your OCI search cluster. We can use the opensearchpy library to connect to the OpenSearch cluster." |
| 138 | + ] |
| 139 | + }, |
| 140 | + { |
| 141 | + "cell_type": "code", |
| 142 | + "execution_count": 6, |
| 143 | + "metadata": {}, |
| 144 | + "outputs": [], |
| 145 | + "source": [ |
| 146 | + "# Connect to the opensearch cluster.\n", |
| 147 | + "from opensearchpy import OpenSearch\n", |
| 148 | + " \n", |
| 149 | + "# Create a connection to your OpenSearch cluster\n", |
| 150 | + "es = OpenSearch(\n", |
| 151 | + " ['https://####'], # Replace with your OpenSearch endpoint URL\n", |
| 152 | + " http_auth=('username', 'password'), # Replace with your credentials\n", |
| 153 | + " verify_certs=False, # Set to True if you want to verify SSL certificates\n", |
| 154 | + " timeout=30\n", |
| 155 | + ")" |
| 156 | + ] |
| 157 | + }, |
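| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Optionally, verify that the client can reach the cluster before creating any index. This quick sanity check uses standard `opensearch-py` calls." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# ping() returns True if the cluster is reachable;\n", |
| | + "# info() returns basic cluster metadata such as the version.\n", |
| | + "print(es.ping())\n", |
| | + "print(es.info())" |
| | + ] |
| | + }, |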
| 158 | + { |
| 159 | + "cell_type": "markdown", |
| 160 | + "metadata": {}, |
| 161 | + "source": [ |
| 162 | + "First, you must create a k-NN index and set the ``index.knn`` parameter to true. This settings tells the plugin to generate native library indexes specifically tailored for k-NN searches. \n", |
| 163 | + "\n", |
| 164 | + "Next, you must add one or more fields of the knn_vector data type. This example creates an index with one ``knn_vector``: ``embedding_vector`` and one ``text``: ``text``. \n", |
| 165 | + "\n", |
| 166 | + "The knn_vector uses Lucene fields that specify the configuration of the k-NN search algorithms. It employs the Hierarchical Navigable Small Worlds [HNSW](https://www.pinecone.io/learn/series/faiss/hnsw/) algorithm for super fast search and fantastic recall and consine similarity to measure distance. \n", |
| 167 | + "\n", |
| 168 | + "- ``efSearch`` controls how many entry points will be explored between layers during the search. A higher value of ef_search typically results in a more thorough and potentially higher-quality search, but increased computational cost. \n", |
| 169 | + "\n", |
| 170 | + "- ``efConstruction`` controls how many entry points will be explored when building the index. A higher value of \"ef_constructions\" typically results in a higher-quality graph structure but may also increase the computational cost of building the index.\n", |
| 171 | + "\n", |
| 172 | + "The ``dimension`` field defines the size of the embedding vector. In our case, we are using embedding vectors returned from the genAI embedding model, which is of length 1024. \n", |
| 173 | + "\n", |
| 174 | + "See [documentation](https://opensearch.org/docs/2.8/search-plugins/knn/knn-index#method-definitions) for more details on parameters' definitions. You\n", |
| 175 | + "\n", |
| 176 | + "**Note**: The Lucene engine can support dimension up to 1,024." |
| 177 | + ] |
| 178 | + }, |
| 179 | + { |
| 180 | + "cell_type": "code", |
| 181 | + "execution_count": 7, |
| 182 | + "metadata": {}, |
| 183 | + "outputs": [ |
| 184 | + { |
| 185 | + "data": { |
| 186 | + "text/plain": [ |
| 187 | + "{'acknowledged': True, 'shards_acknowledged': True, 'index': 'tensorboard'}" |
| 188 | + ] |
| 189 | + }, |
| 190 | + "execution_count": 7, |
| 191 | + "metadata": {}, |
| 192 | + "output_type": "execute_result" |
| 193 | + } |
| 194 | + ], |
| 195 | + "source": [ |
| 196 | + "INDEX_NAME = \"tensorboard\"\n", |
| 197 | + "VECTOR_1_NAME = \"embedding_vector\"\n", |
| 198 | + "VECTOR_2_NAME = \"text\"\n", |
| 199 | + " \n", |
| 200 | + "body = {\n", |
| 201 | + " # Index setting: https://opensearch.org/docs/2.11/search-plugins/knn/knn-index\n", |
| 202 | + " \"settings\": {\"index\": {\"knn\": \"true\", \"knn.algo_param.ef_search\": 100}},\n", |
| 203 | + " # Explicit mapping: https://opensearch.org/docs/2.11/field-types/index/#explicit-mapping\n", |
| 204 | + " \"mappings\": { \n", |
| 205 | + " \"properties\": {\n", |
| 206 | + " VECTOR_1_NAME: {\n", |
| 207 | + " # Supported field types: https://opensearch.org/docs/2.11/field-types/supported-field-types/index/\n", |
| 208 | + " \"type\": \"knn_vector\", \n", |
| 209 | + " \"dimension\": 1024,\n", |
| 210 | + " # Method definition: https://opensearch.org/docs/2.11/search-plugins/knn/knn-index#method-definitions\n", |
| 211 | + " \"method\": { \n", |
| 212 | + " \"name\": \"hnsw\",\n", |
| 213 | + " \"space_type\": \"cosinesimil\",\n", |
| 214 | + " \"engine\": \"lucene\",\n", |
| 215 | + " \"parameters\": {\"ef_construction\": 128, \"m\": 24},\n", |
| 216 | + " },\n", |
| 217 | + " },\n", |
| 218 | + " VECTOR_2_NAME: {\n", |
| 219 | + " \"type\": \"text\"\n", |
| 220 | + " },\n", |
| 221 | + " }\n", |
| 222 | + " },\n", |
| 223 | + "}\n", |
| 224 | + "response = es.indices.create(INDEX_NAME, body=body)\n", |
| 225 | + "response" |
| 226 | + ] |
| 227 | + }, |
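| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "You can confirm that the index exists and inspect its settings and mappings:" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Confirm the index was created and review its settings and mappings.\n", |
| | + "print(es.indices.exists(INDEX_NAME))\n", |
| | + "es.indices.get(INDEX_NAME)" |
| | + ] |
| | + }, |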
| 228 | + { |
| 229 | + "cell_type": "markdown", |
| 230 | + "metadata": {}, |
| 231 | + "source": [ |
| 232 | + "### Step 4: Insert the Embedding Vectors for your Documents\n", |
| 233 | + "Now let's populate the index using the embedding vectors calculated from your documents using Cohere Embedding Models. " |
| 234 | + ] |
| 235 | + }, |
| 236 | + { |
| 237 | + "cell_type": "code", |
| 238 | + "execution_count": 8, |
| 239 | + "metadata": {}, |
| 240 | + "outputs": [], |
| 241 | + "source": [ |
| 242 | + "i = 0\n", |
| 243 | + "# insert each row one-at-a-time to the document index\n", |
| 244 | + "for text, embed in zip(texts, embeddings):\n", |
| 245 | + " \n", |
| 246 | + " try:\n", |
| 247 | + " \n", |
| 248 | + " body = {\n", |
| 249 | + " VECTOR_1_NAME: embed,\n", |
| 250 | + " VECTOR_2_NAME: text,\n", |
| 251 | + " }\n", |
| 252 | + " response = es.index(index=INDEX_NAME, body=body)\n", |
| 253 | + " except Exception as e:\n", |
| 254 | + " print(f\"[ERROR]: {e}\")\n", |
| 255 | + " continue\n", |
| 256 | + " i += 1" |
| 257 | + ] |
| 258 | + }, |
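| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Newly indexed documents only become searchable after the index refreshes. You can force a refresh and confirm the document count:" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Make the newly indexed documents searchable immediately,\n", |
| | + "# then verify that all of them made it into the index.\n", |
| | + "es.indices.refresh(index=INDEX_NAME)\n", |
| | + "print(es.count(index=INDEX_NAME)[\"count\"])" |
| | + ] |
| | + }, |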
| 259 | + { |
| 260 | + "cell_type": "markdown", |
| 261 | + "metadata": {}, |
| 262 | + "source": [ |
| 263 | + "\n", |
| 264 | + "A new query coming in, first calcualte the embedding vector and then conduct a semantic search.\n", |
| 265 | + "\n", |
| 266 | + "- `k`: the number of neighbors the search will return\n", |
| 267 | + "- `size`: (required) how many results the query actually returns. The plugin returns k amount of results for each shard (and each segment) and size amount of results for the entire query. The plugin supports a maximum k value of 10,000." |
| 268 | + ] |
| 269 | + }, |
| 270 | + { |
| 271 | + "cell_type": "code", |
| 272 | + "execution_count": 15, |
| 273 | + "metadata": {}, |
| 274 | + "outputs": [ |
| 275 | + { |
| 276 | + "name": "stdout", |
| 277 | + "output_type": "stream", |
| 278 | + "text": [ |
| 279 | + "It is required that ``tensorboard`` is installed in a dedicated conda environment or virtual environment. Prepare an\n", |
| 280 | + "environment yaml file for creating conda environment with following command - \n", |
| 281 | + "**tensorboard-dep.yaml**: \n", |
| 282 | + "```yaml\n", |
| 283 | + "dependencies:\n", |
| 284 | + "- python=3.8\n", |
| 285 | + "- pip\n", |
| 286 | + "- pip:\n", |
| 287 | + "- ocifs\n", |
| 288 | + "- tensorboard\n", |
| 289 | + "name: tensorboard\n", |
| 290 | + "``` \n", |
| 291 | + "Create the conda environment from the yaml file generated in the preceeding step \n", |
| 292 | + "```bash\n", |
| 293 | + "conda env create -f tensorboard-dep.yaml\n", |
| 294 | + "``` \n", |
| 295 | + "This will create a conda environment called tensorboard. Activate the conda environment by running - \n", |
| 296 | + "```bash\n", |
| 297 | + "conda activate tensorboard\n", |
| 298 | + "``` \n", |
| 299 | + "**Using TensorBoard Logs:** \n", |
| 300 | + "To launch a TensorBoard session on your local workstation, run - \n", |
| 301 | + "```bash\n", |
| 302 | + "export OCIFS_IAM_KEY=api_key\n", |
| 303 | + "tensorboard --logdir oci://my-bucket@my-namespace/path/to/logs\n", |
| 304 | + "``` \n", |
| 305 | + "`OCIFS_IAM_KEY=api_key` - If you are using resource principal, set `resource_principal` \n", |
| 306 | + "This will bring up TensorBoard app on your workstation. Access TensorBoard at ``http://localhost:6006/`` \n", |
| 307 | + "**Note**: The logs take some initial time (few minutes) to reflect on the tensorboard dashboard.\n" |
| 308 | + ] |
| 309 | + } |
| 310 | + ], |
| 311 | + "source": [ |
| 312 | + "query_vector = oci_embedings.embed_query(text=\"how to set up tensorboard in oci?\")\n", |
| 313 | + "query = {\n", |
| 314 | + " \"size\": 2,\n", |
| 315 | + " \"query\": {\"knn\": {VECTOR_1_NAME: {\"vector\": query_vector, \"k\": 2}}},\n", |
| 316 | + "}\n", |
| 317 | + " \n", |
| 318 | + "response = es.search(body=query, index=INDEX_NAME) # the same as before\n", |
| 319 | + "print(response[\"hits\"][\"hits\"][0]['_source']['text'])" |
| 320 | + ] |
| 321 | + }, |
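| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "### Step 6: Generate an Answer from the Retrieved Context\n", |
| | + "\n", |
| | + "To complete the retrieval-augmented QA loop, pass the retrieved documents to a large language model as context. The cell below is a minimal sketch that assumes access to the OCI Generative AI text-generation model through `ads.llm.GenerativeAI`, the LangChain-compatible LLM wrapper in `oracle-ads`; the compartment OCID is a placeholder and the prompt format is illustrative." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "from ads.llm import GenerativeAI\n", |
| | + " \n", |
| | + "# Sketch: same placeholder compartment and endpoint as the embedding model.\n", |
| | + "llm = GenerativeAI(\n", |
| | + "    compartment_id=\"ocid1.compartment.oc1.######\",\n", |
| | + "    client_kwargs=dict(service_endpoint=\"https://generativeai.aiservice.us-chicago-1.oci.oraclecloud.com\"),\n", |
| | + ")\n", |
| | + " \n", |
| | + "# Concatenate the retrieved documents into a single context block.\n", |
| | + "question = \"how to set up tensorboard in oci?\"\n", |
| | + "context = \"\\n\\n\".join(\n", |
| | + "    hit[\"_source\"][VECTOR_2_NAME] for hit in response[\"hits\"][\"hits\"]\n", |
| | + ")\n", |
| | + "prompt = (\n", |
| | + "    \"Answer the question using only the context below.\\n\\n\"\n", |
| | + "    f\"Context:\\n{context}\\n\\nQuestion: {question}\\nAnswer:\"\n", |
| | + ")\n", |
| | + "# invoke() runs a single completion (use predict() on older LangChain versions).\n", |
| | + "print(llm.invoke(prompt))" |
| | + ] |
| | + } |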
| 322 | + ], |
| 323 | + "metadata": { |
| 324 | + "kernelspec": { |
| 325 | + "display_name": "Python 3 (ipykernel)", |
| 326 | + "language": "python", |
| 327 | + "name": "python3" |
| 328 | + }, |
| 329 | + "language_info": { |
| 330 | + "codemirror_mode": { |
| 331 | + "name": "ipython", |
| 332 | + "version": 3 |
| 333 | + }, |
| 334 | + "file_extension": ".py", |
| 335 | + "mimetype": "text/x-python", |
| 336 | + "name": "python", |
| 337 | + "nbconvert_exporter": "python", |
| 338 | + "pygments_lexer": "ipython3", |
| 339 | + "version": "3.8.18" |
| 340 | + } |
| 341 | + }, |
| 342 | + "nbformat": 4, |
| 343 | + "nbformat_minor": 4 |
| 344 | +} |