
Commit 7e5b0af

Merge pull request #384 from pinecone-io/bulk-import-notebook
Add bulk import notebook
2 parents 37a8fdd + 54cce0e commit 7e5b0af

File tree

1 file changed

+333 -0 lines changed

docs/pinecone-bulk-import.ipynb

Lines changed: 333 additions & 0 deletions
@@ -0,0 +1,333 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "5ePFLZDtbWB9"
},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/pinecone-bulk-import.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/pinecone-bulk-import.ipynb)\n",
"\n",
"# Pinecone Bulk Import"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lKAHnDD0Zeiw"
},
"source": [
"## Scenario: Ingesting Embedded Parquet Data from S3 to Pinecone\n",
"\n",
"In this scenario, you are tasked with ingesting pre-generated vector embeddings, stored as Parquet files in an S3 bucket, into a Pinecone index. The embeddings have been precomputed by a third-party vendor and are ready to be indexed for vector similarity search or other downstream tasks.\n",
"\n",
"### Problem Overview\n",
"The goal is to move the data from S3 into Pinecone so that it can be used for downstream tasks such as semantic search, recommendations, and anomaly detection.\n",
"\n",
"### Solution steps\n",
"1. **Access the S3 bucket**: Access the S3 bucket where the Parquet files are stored. These files contain the embeddings and metadata needed for indexing.\n",
" \n",
"2. **Read and extract embeddings**: Once the Parquet files are accessible, extract the embeddings and any necessary metadata (e.g., unique document IDs or other attributes).\n",
" \n",
"3. **Upload embeddings to Pinecone**: Upload the embeddings to a Pinecone index, associating each embedding with its identifier. This lets the embeddings be efficiently queried or analyzed later.\n",
"\n",
"This approach lets you efficiently transfer embeddings stored as Parquet files in S3 into Pinecone to support vector search. See the official [Understanding Imports documentation](https://docs.pinecone.io/guides/data/understanding-imports) for additional information.\n"
]
},
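{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the expected file layout concrete, here is a minimal sketch of building a tiny Parquet file with the columns the import expects: `id`, `values` (the embedding), and `metadata` (JSON-formatted metadata, written as a JSON string in this sketch). The file name, dimension, and sample rows are hypothetical, and `pandas`, `numpy`, and `pyarrow` are assumed to be installed; this cell is not part of the import workflow itself.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: build a tiny Parquet file with the expected columns\n",
"import json\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"dim = 1536  # must match the index dimension\n",
"sample = pd.DataFrame({\n",
"    \"id\": [\"doc-0\", \"doc-1\"],  # unique identifiers\n",
"    \"values\": [np.random.rand(dim).tolist() for _ in range(2)],  # embedding vectors\n",
"    \"metadata\": [json.dumps({\"source\": \"vendor\"})] * 2  # JSON-formatted metadata\n",
"})\n",
"\n",
"# Writing Parquet requires pyarrow (or fastparquet) to be installed\n",
"sample.to_parquet(\"example-import.parquet\", index=False)\n",
"sample.head()"
]
},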
{
"cell_type": "markdown",
"metadata": {
"id": "azHQh9CugZHU"
},
"source": [
"## Install required libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gcofp6aAwlgR"
},
"outputs": [],
"source": [
"!pip install pinecone-client\n",
"!pip install pinecone_notebooks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"id": "LC6v4kqda7dN"
},
"outputs": [],
"source": [
"from pinecone import Pinecone, ServerlessSpec\n",
"import time\n",
"import os\n",
"from datetime import datetime\n",
"import json"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UYm71QsCEwfD"
},
"source": [
"## Get Pinecone API key"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BIh83-IXwXgU"
},
"outputs": [],
"source": [
"from pinecone_notebooks.colab import Authenticate\n",
"Authenticate()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"id": "xyfAuSi5bGoN"
},
"outputs": [],
"source": [
"api_key = os.getenv('PINECONE_API_KEY')\n",
"\n",
"# Configure Pinecone client\n",
"pc = Pinecone(api_key=api_key)\n",
"\n",
"# Get cloud and region settings\n",
"cloud = os.getenv('PINECONE_CLOUD', 'aws')\n",
"region = os.getenv('PINECONE_REGION', 'us-east-1')\n",
"\n",
"# Define serverless specifications\n",
"spec = ServerlessSpec(cloud=cloud, region=region)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wPrCU2PabgTg"
},
"source": [
"## Create a serverless index\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7TA1uqEQbLiT"
},
"outputs": [],
"source": [
"\n",
"index_name = \"pinecone-bulk-import\"\n",
"dimension = 1536\n",
"\n",
"if not pc.has_index(index_name):\n",
"    pc.create_index(\n",
"        name=index_name,\n",
"        dimension=dimension,\n",
"        metric=\"cosine\",\n",
"        spec=spec  # serverless spec defined above (cloud/region from environment)\n",
"    )\n",
"\n",
"index = pc.Index(name=index_name)\n",
"\n",
"print(f\"Index '{index_name}' is ready.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3552NYEDcBos"
},
"source": [
"## Start import task\n",
"\n",
"This sample dataset contains:\n",
"\n",
"* **Dimensions**: 1536\n",
"* **Rows**: 10,000\n",
"* **Files**: 10 Parquet files\n",
"* **Size per file**: ~12.58 MB\n",
"* **Total size**: ~125.8 MB\n",
"\n",
"Each file contains:\n",
"\n",
"* **id**: Unique identifier\n",
"* **values**: Embedding vectors\n",
"* **metadata**: JSON-formatted metadata dictionary\n",
"\n",
"***Note***: *This task may take 10 minutes or more to complete. Each import request can import up to 1 TB of data or 100,000,000 records into a maximum of 100 namespaces, whichever limit is reached first.*"
]
},
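{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before starting an import, it can help to sanity-check that your files match the layout described above. The cell below is a hypothetical sketch: it assumes a local Parquet file named `example-import.parquet` (for example, the one written in the earlier sketch) and that `pandas` with `pyarrow` is installed. It is not required for the import itself.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: verify column names and embedding dimension in one file\n",
"import pandas as pd\n",
"\n",
"df = pd.read_parquet(\"example-import.parquet\")\n",
"\n",
"print(\"Columns:\", list(df.columns))  # expect ['id', 'values', 'metadata']\n",
"print(\"Rows:\", len(df))\n",
"print(\"Dimension:\", len(df[\"values\"].iloc[0]))  # expect 1536, matching the index"
]
},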
{
"cell_type": "markdown",
"metadata": {
"id": "pwVvY9fRlZYj"
},
"source": [
"## Specify AWS S3 folder and start task"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FV8DGtmtnKpj"
},
"outputs": [],
"source": [
"root = \"s3://dev-bulk-import-datasets-pub/10k-1536/\"\n",
"op = index.start_import(uri=root, error_mode=\"CONTINUE\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CUoMQXImncaU"
},
"source": [
"## Check the status of the import"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ARJlKVtmpY73"
},
"outputs": [],
"source": [
"# Index stats update as imported records become available in the index\n",
"index.describe_index_stats()\n"
]
},
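{
"cell_type": "markdown",
"metadata": {},
"source": [
"To wait for the import to finish programmatically, a simple polling loop is sketched below. It assumes the object returned by `describe_import` exposes a `status` field with values such as `Pending`, `InProgress`, `Completed`, `Failed`, or `Cancelled`; adjust the terminal states to match your SDK version.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: poll the import operation until it reaches a terminal state\n",
"import time\n",
"\n",
"terminal_states = {\"Completed\", \"Failed\", \"Cancelled\"}\n",
"\n",
"while True:\n",
"    description = index.describe_import(op.id)\n",
"    status = description.status\n",
"    print(f\"Import {op.id} status: {status}\")\n",
"    if status in terminal_states:\n",
"        break\n",
"    time.sleep(30)  # the import can take 10 minutes or more\n"
]
},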
{
"cell_type": "markdown",
"metadata": {
"id": "Qq9qL3hRcEWv"
},
"source": [
"## List import operations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "K_ig20UBbPeu"
},
"outputs": [],
"source": [
"imports = list(index.list_imports())\n",
"if imports:\n",
"    for i in imports:\n",
"        print(i)\n",
"else:\n",
"    print(\"No imports found in the index.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "els_rMBhcFTa"
},
"source": [
"## Describe a specific import"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OXgTNgVAbRps"
},
"outputs": [],
"source": [
"# Import IDs are strings; use an ID returned by start_import or list_imports\n",
"index.describe_import(\"1\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "79NM6VDtcME7"
},
"source": [
"## Cancel the import (if needed)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "87M3vXgsvbxs"
},
"outputs": [],
"source": [
"# Check the operation status and cancel the import if it is still running\n",
"op_status = index.describe_import(op.id).status\n",
"print(f\"Operation status: {op_status}\")\n",
"\n",
"# Only imports that are still Pending or InProgress can be cancelled\n",
"if op_status in ['Pending', 'InProgress']:\n",
"    try:\n",
"        cancel_response = index.cancel_import(op.id)\n",
"        print(f\"Import operation {op.id} cancelled.\")\n",
"    except Exception as e:\n",
"        print(f\"Error cancelling import: {e}\")\n",
"else:\n",
"    print(f\"Cannot cancel operation {op.id} because its status is: {op_status}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a1euqHZocS1F"
},
"source": [
"## Delete the index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jofVBQHycWxt"
},
"outputs": [],
"source": [
"pc.delete_index(index_name)\n",
"print(f\"Index '{index_name}' deleted.\")"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
