feat: add NumPy array support for vector representations (#586)
* chore: update example code to use NDArray for vectors
* feat: add numpy array support for vector representations
* feat: refactor vector decoding with DtypeRegistry for NumPy
* test: add unit tests for numpy array support
* Update docs to reflect NDArray support with NumPy dtypes
* feat: add NDArray support for `Vector[T]` with `list[T]` fallback, optimize pgvector queries
* feat: update engine value encoding to return ndarray directly
File: docs/docs/core/data_types.mdx (3 additions, 2 deletions)
@@ -36,16 +36,17 @@ This is the list of all basic types supported by CocoIndex:
36
36
| LocalDatetime | Date and time without timezone |`cocoindex.LocalDateTime`|`datetime.datetime`|
37
37
| OffsetDatetime | Date and time with a timezone offset |`cocoindex.OffsetDateTime`|`datetime.datetime`|
38
38
| TimeDelta | A duration of time |`datetime.timedelta`|`datetime.timedelta`|
39
-
| Vector[*T*, *Dim*?]|*T* must be basic type. *Dim* is a positive integer and optional. |`cocoindex.Vector[T]` or `cocoindex.Vector[T, Dim]`|`list[T]`|
40
39
| Json ||`cocoindex.Json`| Any data convertible to JSON by `json` package |
40
+
| Vector[*T*, *Dim*?]|*T* can be a basic type or a numeric type. *Dim* is a positive integer and optional. |`cocoindex.Vector[T]` or `cocoindex.Vector[T, Dim]`|`numpy.typing.NDArray[T]` or `list[T]`|
41
41
42
42
Values of all data types can be represented by values in Python's native types (as described under the Native Python Type column).
43
43
However, the underlying execution engine and some storage systems (like Postgres) have finer distinctions for some types, specifically:
44
44
45
45
* *Float32* and *Float64* for `float`, with different precision.
46
46
* *LocalDateTime* and *OffsetDateTime* for `datetime.datetime`, with different timezone awareness.
47
-
* *Vector* has optional dimension information.
48
47
* *Range* and *Json* provide a clear tag for the type, to clearly distinguish the type in CocoIndex.
48
+
* *Vector* holds elements of type *T*. If *T* is numeric (e.g., `np.float32` or `np.float64`), it's represented as `NDArray[T]`; otherwise, as `list[T]`.
49
+
* *Vector* also has optional dimension information.
49
50
50
51
The native Python type is always more permissive and can represent a superset of possible values.
51
52
* Only when you annotate the return type of a custom function, you should use the specific type,
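The `NDArray[T]` versus `list[T]` distinction described in this hunk can be illustrated with a small sketch. It uses only NumPy; `to_vector_value` is a hypothetical helper written for this illustration and is not part of CocoIndex, whose actual decoding logic is not shown in this diff.

```python
import numpy as np
from numpy.typing import NDArray


def to_vector_value(values, dtype=None):
    """Hypothetical helper: numeric dtype -> NDArray, otherwise a plain list."""
    if dtype is not None and np.issubdtype(dtype, np.number):
        # Numeric element types are held as a NumPy array with that dtype.
        return np.asarray(values, dtype=dtype)
    # Non-numeric element types fall back to a Python list.
    return list(values)


# Numeric vector: represented as NDArray[np.float32].
embedding: NDArray[np.float32] = to_vector_value([0.1, 0.2, 0.3], np.float32)

# Non-numeric vector: represented as list[str].
labels: list[str] = to_vector_value(["a", "b"])
```

Here `embedding` is a one-dimensional `float32` array, while `labels` stays a regular list.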
File: docs/docs/getting_started/quickstart.md (11 additions, 6 deletions)
@@ -154,11 +154,11 @@ The goal of transforming your data is usually to query against it.
154
154
Once you already have your index built, you can directly access the transformed data in the target database.
155
155
CocoIndex also provides utilities for you to do this more seamlessly.
156
156
157
-
In this example, we'll use the [`psycopg` library](https://www.psycopg.org/) to connect to the database and run queries.
158
-
Please make sure it's installed:
157
+
In this example, we'll use the [`psycopg` library](https://www.psycopg.org/) along with pgvector to connect to the database and run queries on vector data.
158
+
Please make sure the required packages are installed:
159
159
160
160
```bash
161
-
pip install psycopg[binary,pool]
161
+
pip install numpy psycopg[binary,pool] pgvector
162
162
```
163
163
164
164
### Step 4.1: Extract common transformations
@@ -169,8 +169,11 @@ i.e. they should use exactly the same embedding model and parameters.
SELECT filename, text, embedding <=> %s::vector AS distance
225
+
SELECT filename, text, embedding <=> %s AS distance
221
226
FROM {table_name} ORDER BY distance LIMIT %s
222
227
""", (query_vector, top_k))
223
228
return [
@@ -236,7 +241,7 @@ There're two CocoIndex-specific logic:
236
241
237
242
2. Evaluate the transform flow defined above with the input query, to get the embedding.
238
243
It's done by the `eval()` method of the transform flow `text_to_embedding`.
239
-
The return type of this method is `list[float]` as declared in the `text_to_embedding()` function (`cocoindex.DataSlice[list[float]]`).
244
+
The return type of this method is `NDArray[np.float32]` as declared in the `text_to_embedding()` function (`cocoindex.DataSlice[NDArray[np.float32]]`).
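For readers unfamiliar with the `<=>` operator in the query above: in pgvector it computes cosine distance, i.e. one minus cosine similarity. The sketch below reproduces that quantity in NumPy for `NDArray[np.float32]` inputs. The function name `cosine_distance` and the sample vectors are assumptions made for illustration; nothing here is taken from the CocoIndex code base.

```python
import numpy as np
from numpy.typing import NDArray


def cosine_distance(a: NDArray[np.float32], b: NDArray[np.float32]) -> float:
    """Cosine distance: 1 - cosine similarity of two non-zero vectors."""
    # Promote to float64 so the illustration is not dominated by rounding.
    a64 = a.astype(np.float64)
    b64 = b.astype(np.float64)
    similarity = np.dot(a64, b64) / (np.linalg.norm(a64) * np.linalg.norm(b64))
    return float(1.0 - similarity)


# Illustrative vectors (assumed values, not real embeddings).
query_vector = np.array([1.0, 0.0], dtype=np.float32)
same_direction = np.array([2.0, 0.0], dtype=np.float32)
orthogonal = np.array([0.0, 3.0], dtype=np.float32)
```

Vectors pointing the same way have a distance near 0 and orthogonal vectors a distance near 1, which is why the query orders by ascending `distance` to return the closest matches first.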