Commit 8a14ecd

Improve language and grammar
1 parent dfa948e commit 8a14ecd

1 file changed

+9
-10
lines changed

examples/vector_databases/duckdb/duckdb-sql-with-openai-embeddings.ipynb

Lines changed: 9 additions & 10 deletions
@@ -194,7 +194,7 @@
 "id": "222d1038",
 "metadata": {},
 "source": [
-"*Note on performance:* The above function, will run a call to OpenAI's embeddings API for every single row. Depending on your dataset size, this might be quite slow. For larger datasets, consider [upgrading this function](https://lukaszrogalski.substack.com/p/python-udfs-in-duckdb) to work with aggregated data and pass in multiple sentences (batches) to the OpenAI embeddings call."
+"*Note on performance:* The above function will run a call to OpenAI's embeddings API for every single row. Depending on your dataset size, this might be slow. For larger datasets, consider [upgrading this function](https://lukaszrogalski.substack.com/p/python-udfs-in-duckdb) to work with aggregated data and pass in multiple sentences (batches) to the OpenAI embeddings call."
 ]
 },
 {
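Review note on the hunk above: the batching idea in the performance note can be sketched in plain Python. This is only a minimal illustration, not the notebook's code. The names `batched`, `embed_in_batches`, and the `embed_batch` callable are hypothetical; `embed_batch` stands in for whatever function sends a list of sentences to the embeddings endpoint in one request and returns one vector per sentence.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed `texts` with one call per batch instead of one call per row.

    `embed_batch` is a hypothetical stand-in for the real API call; it is
    assumed to return exactly one vector per input text, in input order.
    """
    embeddings: List[List[float]] = []
    for chunk in batched(texts, batch_size):
        vectors = embed_batch(chunk)
        if len(vectors) != len(chunk):
            raise ValueError("embed_batch must return one vector per input text")
        embeddings.extend(vectors)
    return embeddings
```

With 400 rows and a batch size of 100, this makes 4 calls rather than 400. A suitable batch size depends on the provider's request limits, which this sketch does not model.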
@@ -222,7 +222,7 @@
 "source": [
 "### Generating Embeddings\n",
 "\n",
-"With the embedding function in place, we can now use it to generate and write embeddings into our table via SQL. The below query should run on every row in the table, calling the openai embedding UDF we previously defined. On 400 rows, it should take around 2 minutes to complete."
+"With the embedding function in place, we can now use it to generate and store embeddings in our table via SQL. The query below runs on every row in the table, calling the OpenAI embedding UDF we defined earlier. On a dataset of about 400 rows, it typically completes in around 2 minutes."
 ]
 },
 {
@@ -273,12 +273,11 @@
 "id": "35ee611a",
 "metadata": {},
 "source": [
-"Now that we have embeddings for each paper, we can use them to perform a semantic similarity search. \n",
+"Now that we have embeddings for each paper, we can use them to perform a semantic similarity search.\n",
 "\n",
-"To do this, we can use an array distance function native to DuckDB such as array_cosine_similarity that computes the cosine similarity between two vectors.\n",
+"To achieve this, we can use a native DuckDB array distance function such as `array_cosine_similarity`, which computes the cosine similarity between two vectors.\n",
 "\n",
-"Below we define a query that uses our embed_openai function to generate an embedding for a query, and then uses the array_cosine_similarity function to compute the similarity between the query embedding and each of the paper embeddings.\n",
-"\n"
+"The query below demonstrates how to generate an embedding for a search term using our `embed_openai` function, and then apply array_cosine_similarity to compare the query embedding with each of the paper embeddings."
 ]
 },
 {
@@ -402,11 +401,11 @@
 "id": "8fb6dbd9",
 "metadata": {},
 "source": [
-"While the above search query works fine on 400 rows, it can eventually get much slower as the dataset grows into hundreds of thousands. Without an index, DuckDB will compare a query embedding with all document embeddings to find the most similar one.\n",
+"While the search query above works well on a dataset of 400 rows, it will become much slower as the data grows into hundreds of thousands of rows. Without an index, DuckDB must compare the query embedding against all document embeddings to find the most similar results.\n",
 "\n",
-"In order to speed up the vector search, we can use ANN (Approximate Nearest Neighbor) with [HNSW (Hierarchical Navigable Small World)](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world), supported via DuckDB's vector [similarity search extension](https://duckdb.org/2024/05/03/vector-similarity-search-vss.html).\n",
+"To speed up vector search, we can use ANN (Approximate Nearest Neighbor) with [HNSW (Hierarchical Navigable Small World)](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world), available through DuckDB’s [vector similarity search extension](https://duckdb.org/2024/05/03/vector-similarity-search-vss.html).\n",
 "\n",
-"Let's try that out."
+"Let’s give it a try."
 ]
 },
 {
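Review note on the two hunks above: for readers who want to see what the un-indexed search amounts to, here is a rough pure-Python sketch of cosine similarity and a linear scan over all document vectors. It is an illustration of the concept only and says nothing about how DuckDB implements `array_cosine_similarity` internally. The function names `cosine_similarity` and `brute_force_top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def brute_force_top_k(
    query: Sequence[float],
    documents: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Score every document against the query and keep the k best.

    Each query touches all documents, so the cost grows linearly with
    the number of rows. An ANN index such as HNSW avoids the full scan
    at the price of approximate results.
    """
    scored = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(documents)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The linear scan is exact, which makes it a useful baseline for checking the results an approximate index returns on a small sample.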
@@ -560,7 +559,7 @@
 "source": [
 "## Conclusion\n",
 "\n",
-"In this cookbook, we explored how to integrate OpenAI’s embedding calls as a reusable UDF in DuckDB. This approach proves especially powerful when you want to store and query embeddings directly alongside your data. By doing so, you unlock new opportunities for combining advanced data analysis with retrieval tasks, all through DuckDB’s simple and familiar SQL interface."
+"In this cookbook, we explored how to integrate OpenAI’s embedding calls as a reusable UDF in DuckDB. This approach is especially powerful when storing and querying embeddings directly alongside your data. By combining embeddings with DuckDB’s familiar SQL interface, you unlock new possibilities for advanced data analysis and retrieval, all within a simple, efficient workflow."
 ]
 }
 ],
