Commit 3dc5f66

better documentation
1 parent 2e75807 commit 3dc5f66

4 files changed, +149 -27 lines changed

README.md

Lines changed: 62 additions & 13 deletions
@@ -78,24 +78,24 @@ Now you can query for similar items:
7878
await vec.search([1.0, 9.0])
7979
```
8080

81-
[<Record id=UUID('9b567b36-209e-4240-aa93-f8e7e74277cd') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,
82-
<Record id=UUID('2d49fd73-3db1-4061-81f3-a4ed7529eb61') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]
81+
[<Record id=UUID('1c35a4c3-8a04-4d3e-b74d-511585db052d') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,
82+
<Record id=UUID('71e70e23-65fd-4555-be30-bc40710654be') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]
8383

8484
You can specify the number of records to return.
8585

8686
``` python
8787
await vec.search([1.0, 9.0], limit=1)
8888
```
8989

90-
[<Record id=UUID('9b567b36-209e-4240-aa93-f8e7e74277cd') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]
90+
[<Record id=UUID('1c35a4c3-8a04-4d3e-b74d-511585db052d') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]
9191

9292
You can also specify a filter on the metadata as a simple dictionary
9393

9494
``` python
9595
await vec.search([1.0, 9.0], limit=1, filter={"action": "jump"})
9696
```
9797

98-
[<Record id=UUID('9b567b36-209e-4240-aa93-f8e7e74277cd') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]
98+
[<Record id=UUID('1c35a4c3-8a04-4d3e-b74d-511585db052d') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]
9999

100100
You can also specify a list of filter dictionaries, where an item is
101101
returned if it matches any dict
@@ -104,8 +104,8 @@ returned if it matches any dict
104104
await vec.search([1.0, 9.0], limit=2, filter=[{"action": "jump"}, {"animal": "fox"}])
105105
```
106106

107-
[<Record id=UUID('9b567b36-209e-4240-aa93-f8e7e74277cd') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,
108-
<Record id=UUID('2d49fd73-3db1-4061-81f3-a4ed7529eb61') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]
107+
[<Record id=UUID('1c35a4c3-8a04-4d3e-b74d-511585db052d') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,
108+
<Record id=UUID('71e70e23-65fd-4555-be30-bc40710654be') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]
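
The "matches any dict" rule described above can be illustrated with a small, library-independent sketch. It assumes that a filter dict matches when every key/value pair in it equals the corresponding top-level metadata entry; the function name `matches_filter` is hypothetical and is not part of the client, whose server-side matching may differ in detail.

```python
def matches_filter(metadata, filter):
    """Return True if metadata satisfies a filter dict or any dict in a list.

    Assumption: a single dict matches when all of its key/value pairs are
    present in the metadata with equal values (top-level keys only).
    """
    # Normalize a single dict to a one-element list of dicts.
    filters = [filter] if isinstance(filter, dict) else list(filter)
    return any(
        all(key in metadata and metadata[key] == value for key, value in f.items())
        for f in filters
    )


fox = {"animal": "fox"}
jumping_fox = {"action": "jump", "animal": "fox"}

# A single dict: only the record with action == "jump" qualifies.
single = [matches_filter(m, {"action": "jump"}) for m in (jumping_fox, fox)]

# A list of dicts: a record qualifies if it matches any of them.
either = [
    matches_filter(m, [{"action": "jump"}, {"animal": "fox"}])
    for m in (jumping_fox, fox)
]
```

Under these assumptions both example records pass the list filter, which agrees with the two-record output shown above.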
109109

110110
You can access the fields as follows
111111

@@ -114,7 +114,7 @@ records = await vec.search([1.0, 9.0], limit=1, filter={"action": "jump"})
114114
records[0][client.SEARCH_RESULT_ID_IDX]
115115
```
116116

117-
UUID('9b567b36-209e-4240-aa93-f8e7e74277cd')
117+
UUID('1c35a4c3-8a04-4d3e-b74d-511585db052d')
118118

119119
``` python
120120
records[0][client.SEARCH_RESULT_METADATA_IDX]
@@ -172,26 +172,75 @@ By default, we setup indexes to query your data by the uuid and the
172172
metadata.
173173

174174
If you have many rows, you also need to setup an index on the embedding.
175-
You can create an ivfflat index with the following command after the
176-
table has been populated.
175+
You can create a timescale-vector index on the table with:
177176

178177
``` python
179-
await vec.create_embedding_index(client.IvfflatIndex())
178+
await vec.create_embedding_index(client.TimescaleVectorIndex())
179+
```
180+
181+
You can drop the index with:
182+
183+
``` python
184+
await vec.drop_embedding_index()
180185
```
181186

182-
Please note it is very important to do this only after you have data in
183-
the table.
187+
While we recommend the timescale-vector index type, we also have 2 more
188+
index types available: - The pgvector ivfflat index - The pgvector hnsw
189+
index
184190

185-
You can drop the index with the following command.
191+
Usage examples below:
186192

187193
``` python
194+
await vec.create_embedding_index(client.IvfflatIndex())
195+
await vec.drop_embedding_index()
196+
await vec.create_embedding_index(client.HNSWIndex())
188197
await vec.drop_embedding_index()
189198
```
190199

200+
Please note it is very important to create the ivfflat index only after you
201+
have data in the table.
202+
191203
Please note the community is actively working on new indexing methods
192204
for embeddings. As they become available, we will add them to our client
193205
as well.
194206

207+
### Time-partitioning
208+
209+
In many use cases where you have many embeddings, time is an important
210+
component associated with the embeddings. For example, when embedding
211+
news stories you often search by time as well as similarity
212+
(e.g. stories related to bitcoin in the past week, or stories about
213+
Clinton in November 2016).
214+
215+
Yet, traditionally, searching by two components “similarity” and “time”
216+
is challenging for approximate nearest neighbor (ANN) indexes and makes the
217+
similarity-search index less effective.
218+
219+
One approach to solving this is partitioning the data by time and
220+
creating ANN indexes on each partition individually. Then, during search
221+
you can: - Step 1: filter out partitions that don’t match the time
222+
predicate - Step 2: perform the similarity search on all matching
223+
partitions - Step 3: combine all the results from each partition in step
224+
2, rerank, and filter out results by time.
225+
226+
Step 1 makes the search a lot more efficient by filtering out whole
227+
swaths of data in one go.
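
The three steps above can be sketched in plain Python. This is a minimal in-memory illustration, not the library's implementation: the helper names (`partition_key`, `build_partitions`, `search`) are hypothetical, partitions are simple dictionary buckets, and a brute-force cosine-distance scan stands in for a per-partition ANN index.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def partition_key(ts, interval):
    # Number of whole intervals between the epoch and ts.
    return int((ts - EPOCH) / interval)


def build_partitions(records, interval):
    # Each record is (timestamp, contents, embedding).
    partitions = defaultdict(list)
    for rec in records:
        partitions[partition_key(rec[0], interval)].append(rec)
    return partitions


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search(partitions, interval, query, start, end, limit):
    # Step 1: visit only partitions that can overlap [start, end).
    first, last = partition_key(start, interval), partition_key(end, interval)
    candidates = []
    for key in range(first, last + 1):
        # Step 2: similarity search inside each matching partition
        # (brute force here; a real system would query an ANN index).
        for rec in partitions.get(key, []):
            candidates.append((cosine_distance(query, rec[2]), rec))
    # Step 3: combine, filter by the exact time predicate, rerank, truncate.
    in_range = [(d, r) for d, r in candidates if start <= r[0] < end]
    in_range.sort(key=lambda pair: pair[0])
    return [r for _, r in in_range[:limit]]


interval = timedelta(hours=6)
records = [
    (datetime(2023, 1, 1, 1), "old story", [1.0, 9.0]),
    (datetime(2023, 1, 2, 3), "recent close match", [1.0, 8.5]),
    (datetime(2023, 1, 2, 4), "recent far match", [9.0, 1.0]),
]
partitions = build_partitions(records, interval)
results = search(
    partitions, interval, [1.0, 9.0],
    start=datetime(2023, 1, 2), end=datetime(2023, 1, 3), limit=1,
)
```

In this sketch the "old story" record is never scored, even though its embedding is the closest to the query, because its partition is skipped in step 1.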
228+
229+
Timescale-vector supports time partitioning using TimescaleDB’s
230+
hypertables. To use this feature, simply indicate the length in time for
231+
each partition when creating the client:
232+
233+
``` python
234+
from datetime import timedelta
235+
```
236+
237+
``` python
238+
vec = client.Async(service_url, "data_table_with_time_partition", 2, time_partition_interval=timedelta(hours=6))
239+
240+
id = uuid.uuid1()
241+
await vec.upsert([(id, {"key": "val"}, "the brown fox", [1.0, 1.2])])
242+
```
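
The example above uses `uuid.uuid1()`, and the `create_hypertable` call in this commit partitions on `id` through a `uuid_timestamp` function, which suggests that the partition time is read from the timestamp embedded in a version-1 UUID. The following sketch shows how that timestamp can be recovered in Python. The helper `uuid1_to_datetime` is hypothetical and is not part of the client; it only illustrates the UUID v1 layout.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version-1 UUID timestamps count from the start of the Gregorian calendar.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_to_datetime(u: uuid.UUID) -> datetime:
    """Return the UTC creation time embedded in a version-1 UUID."""
    if u.version != 1:
        raise ValueError("only version-1 UUIDs carry a timestamp")
    # u.time is the number of 100-nanosecond intervals since GREGORIAN_EPOCH.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


record_id = uuid.uuid1()
created_at = uuid1_to_datetime(record_id)
```

A random `uuid.uuid4()` carries no timestamp, so this helper rejects it; that is one reason the time-partitioned example generates its ids with `uuid.uuid1()`.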
243+
195244
## Development
196245

197246
Please note that this project is developed with

nbs/00_vector.ipynb

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -575,7 +575,11 @@
575575
" IMMUTABLE PARALLEL SAFE\n",
576576
" RETURNS NULL ON NULL INPUT;\n",
577577
"\n",
578-
" SELECT create_hypertable('{table_name}', 'id', time_partitioning_func=>'public.uuid_timestamp', chunk_time_interval => '{chunk_time_interval} seconds'::interval);\n",
578+
" SELECT create_hypertable('{table_name}', \n",
579+
" 'id', \n",
580+
" if_not_exists=> true, \n",
581+
" time_partitioning_func=>'public.uuid_timestamp', \n",
582+
" chunk_time_interval => '{chunk_time_interval} seconds'::interval);\n",
579583
" '''.format(\n",
580584
" table_name=self._quote_ident(self.table_name), \n",
581585
" chunk_time_interval=str(self.time_partition_interval.total_seconds()),\n",

nbs/index.ipynb

Lines changed: 77 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -197,8 +197,8 @@
197197
{
198198
"data": {
199199
"text/plain": [
200-
"[<Record id=UUID('9b567b36-209e-4240-aa93-f8e7e74277cd') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,\n",
201-
" <Record id=UUID('2d49fd73-3db1-4061-81f3-a4ed7529eb61') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]"
200+
"[<Record id=UUID('1c35a4c3-8a04-4d3e-b74d-511585db052d') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,\n",
201+
" <Record id=UUID('71e70e23-65fd-4555-be30-bc40710654be') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]"
202202
]
203203
},
204204
"execution_count": null,
@@ -226,7 +226,7 @@
226226
{
227227
"data": {
228228
"text/plain": [
229-
"[<Record id=UUID('9b567b36-209e-4240-aa93-f8e7e74277cd') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]"
229+
"[<Record id=UUID('1c35a4c3-8a04-4d3e-b74d-511585db052d') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]"
230230
]
231231
},
232232
"execution_count": null,
@@ -254,7 +254,7 @@
254254
{
255255
"data": {
256256
"text/plain": [
257-
"[<Record id=UUID('9b567b36-209e-4240-aa93-f8e7e74277cd') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]"
257+
"[<Record id=UUID('1c35a4c3-8a04-4d3e-b74d-511585db052d') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]"
258258
]
259259
},
260260
"execution_count": null,
@@ -282,8 +282,8 @@
282282
{
283283
"data": {
284284
"text/plain": [
285-
"[<Record id=UUID('9b567b36-209e-4240-aa93-f8e7e74277cd') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,\n",
286-
" <Record id=UUID('2d49fd73-3db1-4061-81f3-a4ed7529eb61') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]"
285+
"[<Record id=UUID('1c35a4c3-8a04-4d3e-b74d-511585db052d') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,\n",
286+
" <Record id=UUID('71e70e23-65fd-4555-be30-bc40710654be') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]"
287287
]
288288
},
289289
"execution_count": null,
@@ -311,7 +311,7 @@
311311
{
312312
"data": {
313313
"text/plain": [
314-
"UUID('9b567b36-209e-4240-aa93-f8e7e74277cd')"
314+
"UUID('1c35a4c3-8a04-4d3e-b74d-511585db052d')"
315315
]
316316
},
317317
"execution_count": null,
@@ -496,7 +496,7 @@
496496
"\n",
497497
"By default, we setup indexes to query your data by the uuid and the metadata.\n",
498498
"\n",
499-
"If you have many rows, you also need to setup an index on the embedding. You can create an ivfflat index with the following command after the table has been populated."
499+
"If you have many rows, you also need to setup an index on the embedding. You can create a timescale-vector index on the table with:"
500500
]
501501
},
502502
{
@@ -505,17 +505,36 @@
505505
"metadata": {},
506506
"outputs": [],
507507
"source": [
508-
"await vec.create_embedding_index(client.IvfflatIndex())"
508+
"await vec.create_embedding_index(client.TimescaleVectorIndex())"
509509
]
510510
},
511511
{
512512
"attachments": {},
513513
"cell_type": "markdown",
514514
"metadata": {},
515515
"source": [
516-
"Please note it is very important to do this only after you have data in the table. \n",
516+
"You can drop the index with:"
517+
]
518+
},
519+
{
520+
"cell_type": "code",
521+
"execution_count": null,
522+
"metadata": {},
523+
"outputs": [],
524+
"source": [
525+
"await vec.drop_embedding_index()"
526+
]
527+
},
528+
{
529+
"attachments": {},
530+
"cell_type": "markdown",
531+
"metadata": {},
532+
"source": [
533+
"While we recommend the timescale-vector index type, we also have 2 more index types available:\n",
534+
"- The pgvector ivfflat index\n",
535+
"- The pgvector hnsw index\n",
517536
"\n",
518-
"You can drop the index with the following command."
537+
"Usage examples below:"
519538
]
520539
},
521540
{
@@ -524,9 +543,20 @@
524543
"metadata": {},
525544
"outputs": [],
526545
"source": [
546+
"await vec.create_embedding_index(client.IvfflatIndex())\n",
547+
"await vec.drop_embedding_index()\n",
548+
"await vec.create_embedding_index(client.HNSWIndex())\n",
527549
"await vec.drop_embedding_index()"
528550
]
529551
},
552+
{
553+
"attachments": {},
554+
"cell_type": "markdown",
555+
"metadata": {},
556+
"source": [
557+
"Please note it is very important to create the ivfflat index only after you have data in the table. "
558+
]
559+
},
530560
{
531561
"attachments": {},
532562
"cell_type": "markdown",
@@ -535,12 +565,47 @@
535565
"Please note the community is actively working on new indexing methods for embeddings. As they become available, we will add them to our client as well."
536566
]
537567
},
568+
{
569+
"attachments": {},
570+
"cell_type": "markdown",
571+
"metadata": {},
572+
"source": [
573+
"### Time-partitioning\n",
574+
"\n",
575+
"In many use cases where you have many embeddings, time is an important component associated with the embeddings. For example, when embedding news stories you often search by time as well as similarity (e.g. stories related to bitcoin in the past week, or stories about Clinton in November 2016). \n",
576+
"\n",
577+
"Yet, traditionally, searching by two components \"similarity\" and \"time\" is challenging for approximate nearest neighbor (ANN) indexes and makes the similarity-search index less effective.\n",
578+
"\n",
579+
"One approach to solving this is partitioning the data by time and creating ANN indexes on each partition individually. Then, during search you can:\n",
580+
"- Step 1: filter out partitions that don't match the time predicate\n",
581+
"- Step 2: perform the similarity search on all matching partitions\n",
582+
"- Step 3: combine all the results from each partition in step 2, rerank, and filter out results by time.\n",
583+
"\n",
584+
"Step 1 makes the search a lot more efficient by filtering out whole swaths of data in one go.\n",
585+
"\n",
586+
"Timescale-vector supports time partitioning using TimescaleDB's hypertables. To use this feature, simply indicate the length in time for each partition when creating the client:"
587+
]
588+
},
538589
{
539590
"cell_type": "code",
540591
"execution_count": null,
541592
"metadata": {},
542593
"outputs": [],
543-
"source": []
594+
"source": [
595+
"from datetime import timedelta"
596+
]
597+
},
598+
{
599+
"cell_type": "code",
600+
"execution_count": null,
601+
"metadata": {},
602+
"outputs": [],
603+
"source": [
604+
"vec = client.Async(service_url, \"data_table_with_time_partition\", 2, time_partition_interval=timedelta(hours=6))\n",
605+
"\n",
606+
"id = uuid.uuid1()\n",
607+
"await vec.upsert([(id, {\"key\": \"val\"}, \"the brown fox\", [1.0, 1.2])])\n"
608+
]
544609
},
545610
{
546611
"attachments": {},

timescale_vector/client.py

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -464,7 +464,11 @@ def get_create_query(self):
464464
IMMUTABLE PARALLEL SAFE
465465
RETURNS NULL ON NULL INPUT;
466466
467-
SELECT create_hypertable('{table_name}', 'id', time_partitioning_func=>'public.uuid_timestamp', chunk_time_interval => '{chunk_time_interval} seconds'::interval);
467+
SELECT create_hypertable('{table_name}',
468+
'id',
469+
if_not_exists=> true,
470+
time_partitioning_func=>'public.uuid_timestamp',
471+
chunk_time_interval => '{chunk_time_interval} seconds'::interval);
468472
'''.format(
469473
table_name=self._quote_ident(self.table_name),
470474
chunk_time_interval=str(self.time_partition_interval.total_seconds()),
