
Commit 7e5b0af

Merge pull request #384 from pinecone-io/bulk-import-notebook
Add bulk import notebook
2 parents 37a8fdd + 54cce0e commit 7e5b0af

File tree

1 file changed

+333 -0 lines changed

docs/pinecone-bulk-import.ipynb

Lines changed: 333 additions & 0 deletions
@@ -0,0 +1,333 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "5ePFLZDtbWB9"
},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/pinecone-bulk-import.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/pinecone-bulk-import.ipynb)\n",
"\n",
"# Pinecone Bulk Import"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lKAHnDD0Zeiw"
},
"source": [
"## Scenario: Ingesting Embedded Parquet Data from S3 to Pinecone\n",
"\n",
"In this scenario, you are tasked with ingesting pre-generated vector embeddings, stored as Parquet files in an S3 bucket, into a Pinecone index. The embeddings have been precomputed by a third-party vendor and are ready to be indexed for vector similarity search or other downstream tasks.\n",
"\n",
"### Problem Overview\n",
"The goal is to move the data from S3 into Pinecone so that it can be used for downstream tasks such as semantic search, recommendations, and anomaly detection.\n",
"\n",
"### Solution steps\n",
"1. **Access the S3 bucket**: Access the S3 bucket where the Parquet files are stored. These files contain the embeddings and metadata needed for indexing.\n",
" \n",
"2. **Read and extract embeddings**: Once the Parquet files are accessible, extract the embeddings and any necessary metadata (e.g., unique document IDs or other attributes).\n",
" \n",
"3. **Upload embeddings to Pinecone**: Upload the embeddings to a Pinecone index, associating each embedding with its identifier. This lets the embeddings be efficiently queried or analyzed later.\n",
"\n",
"This approach lets you efficiently transfer embeddings stored as Parquet files in S3 into Pinecone to support vector search. See the official [Understanding Imports documentation](https://docs.pinecone.io/guides/data/understanding-imports) for additional information.\n"
]
},
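{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the expected file layout concrete, here is a minimal sketch of building a tiny Parquet file with the columns the import expects: `id`, `values` (the embedding), and `metadata` (JSON-formatted metadata, written as a JSON string in this sketch). The file name, dimension, and sample rows are hypothetical, and `pandas`, `numpy`, and `pyarrow` are assumed to be installed; this cell is not part of the import workflow itself.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: build a tiny Parquet file with the expected columns\n",
"import json\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"dim = 1536  # must match the index dimension\n",
"sample = pd.DataFrame({\n",
"    \"id\": [\"doc-0\", \"doc-1\"],  # unique identifiers\n",
"    \"values\": [np.random.rand(dim).tolist() for _ in range(2)],  # embedding vectors\n",
"    \"metadata\": [json.dumps({\"source\": \"vendor\"})] * 2  # JSON-formatted metadata\n",
"})\n",
"\n",
"# Writing Parquet requires pyarrow (or fastparquet) to be installed\n",
"sample.to_parquet(\"example-import.parquet\", index=False)\n",
"sample.head()"
]
},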
{
"cell_type": "markdown",
"metadata": {
"id": "azHQh9CugZHU"
},
"source": [
"## Install required libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gcofp6aAwlgR"
},
"outputs": [],
"source": [
"!pip install pinecone-client\n",
"!pip install pinecone_notebooks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"id": "LC6v4kqda7dN"
},
"outputs": [],
"source": [
"from pinecone import Pinecone, ServerlessSpec\n",
"import time\n",
"import os\n",
"from datetime import datetime\n",
"import json"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UYm71QsCEwfD"
},
"source": [
"## Get Pinecone API key"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BIh83-IXwXgU"
},
"outputs": [],
"source": [
"from pinecone_notebooks.colab import Authenticate\n",
"Authenticate()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"id": "xyfAuSi5bGoN"
},
"outputs": [],
"source": [
"api_key = os.getenv('PINECONE_API_KEY')\n",
"\n",
"# Configure Pinecone client\n",
"pc = Pinecone(api_key=api_key)\n",
"\n",
"# Get cloud and region settings\n",
"cloud = os.getenv('PINECONE_CLOUD', 'aws')\n",
"region = os.getenv('PINECONE_REGION', 'us-east-1')\n",
"\n",
"# Define serverless specifications\n",
"spec = ServerlessSpec(cloud=cloud, region=region)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wPrCU2PabgTg"
},
"source": [
"## Create a serverless index\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7TA1uqEQbLiT"
},
"outputs": [],
"source": [
"\n",
"index_name = \"pinecone-bulk-import\"\n",
"dimension = 1536\n",
"\n",
"if not pc.has_index(index_name):\n",
"    pc.create_index(\n",
"        name=index_name,\n",
"        dimension=dimension,\n",
"        metric=\"cosine\",\n",
"        spec=spec  # serverless spec defined above (cloud/region from environment)\n",
"    )\n",
"\n",
"index = pc.Index(name=index_name)\n",
"\n",
"print(f\"Index '{index_name}' is ready.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3552NYEDcBos"
},
"source": [
"## Start import task\n",
"\n",
"This sample dataset contains:\n",
"\n",
"* **Dimensions**: 1536\n",
"* **Rows**: 10,000\n",
"* **Files**: 10 Parquet files\n",
"* **Size per file**: ~12.58 MB\n",
"* **Total size**: ~125.8 MB\n",
"\n",
"Each file contains:\n",
"\n",
"* **id**: Unique identifier\n",
"* **values**: Embedding vectors\n",
"* **metadata**: JSON-formatted metadata dictionary\n",
"\n",
"***Note***: *This task may take 10 minutes or more to complete. Each import request can import up to 1 TB of data or 100,000,000 records into a maximum of 100 namespaces, whichever limit is reached first.*"
]
},
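{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before starting an import, it can help to sanity-check that your files match the layout described above. The cell below is a hypothetical sketch: it assumes a local Parquet file named `example-import.parquet` (for example, the one written in the earlier sketch) and that `pandas` with `pyarrow` is installed. It is not required for the import itself.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: verify column names and embedding dimension in one file\n",
"import pandas as pd\n",
"\n",
"df = pd.read_parquet(\"example-import.parquet\")\n",
"\n",
"print(\"Columns:\", list(df.columns))  # expect ['id', 'values', 'metadata']\n",
"print(\"Rows:\", len(df))\n",
"print(\"Dimension:\", len(df[\"values\"].iloc[0]))  # expect 1536, matching the index"
]
},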
{
"cell_type": "markdown",
"metadata": {
"id": "pwVvY9fRlZYj"
},
"source": [
"## Specify AWS S3 folder and start task"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FV8DGtmtnKpj"
},
"outputs": [],
"source": [
"root = \"s3://dev-bulk-import-datasets-pub/10k-1536/\"\n",
"op = index.start_import(uri=root, error_mode=\"CONTINUE\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CUoMQXImncaU"
},
"source": [
"## Check the status of the import"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ARJlKVtmpY73"
},
"outputs": [],
"source": [
"# Index stats update as imported records become available in the index\n",
"index.describe_index_stats()\n"
]
},
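{
"cell_type": "markdown",
"metadata": {},
"source": [
"To wait for the import to finish programmatically, a simple polling loop is sketched below. It assumes the object returned by `describe_import` exposes a `status` field with values such as `Pending`, `InProgress`, `Completed`, `Failed`, or `Cancelled`; adjust the terminal states to match your SDK version.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: poll the import operation until it reaches a terminal state\n",
"import time\n",
"\n",
"terminal_states = {\"Completed\", \"Failed\", \"Cancelled\"}\n",
"\n",
"while True:\n",
"    description = index.describe_import(op.id)\n",
"    status = description.status\n",
"    print(f\"Import {op.id} status: {status}\")\n",
"    if status in terminal_states:\n",
"        break\n",
"    time.sleep(30)  # the import can take 10 minutes or more\n"
]
},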
{
"cell_type": "markdown",
"metadata": {
"id": "Qq9qL3hRcEWv"
},
"source": [
"## List import operations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "K_ig20UBbPeu"
},
"outputs": [],
"source": [
"imports = list(index.list_imports())\n",
"if imports:\n",
"    for i in imports:\n",
"        print(i)\n",
"else:\n",
"    print(\"No imports found in the index.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "els_rMBhcFTa"
},
"source": [
"## Describe a specific import"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OXgTNgVAbRps"
},
"outputs": [],
"source": [
"# Import IDs are strings; use an ID returned by start_import or list_imports\n",
"index.describe_import(\"1\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "79NM6VDtcME7"
},
"source": [
"## Cancel the import (if needed)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "87M3vXgsvbxs"
},
"outputs": [],
"source": [
"# Check the operation status and cancel the import if it is still running\n",
"op_status = index.describe_import(op.id).status\n",
"print(f\"Operation status: {op_status}\")\n",
"\n",
"# Only imports that are still Pending or InProgress can be cancelled\n",
"if op_status in ['Pending', 'InProgress']:\n",
"    try:\n",
"        cancel_response = index.cancel_import(op.id)\n",
"        print(f\"Import operation {op.id} cancelled.\")\n",
"    except Exception as e:\n",
"        print(f\"Error cancelling import: {e}\")\n",
"else:\n",
"    print(f\"Cannot cancel operation {op.id} because its status is: {op_status}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a1euqHZocS1F"
},
"source": [
"## Delete the index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jofVBQHycWxt"
},
"outputs": [],
"source": [
"pc.delete_index(index_name)\n",
"print(f\"Index '{index_name}' deleted.\")"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
