Commit 853bb91

docs(types): update docs for data type
1 parent 7f2bd18 commit 853bb91

3 files changed, +49 -34 lines changed

docs/docs/core/custom_function.mdx

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ Notes:
 
 * The `cocoindex.op.function()` function decorator also takes optional parameters.
 See [Parameters for custom functions](#parameters-for-custom-functions) for details.
-* Types of arugments and the return value must be annotated, so that CocoIndex will have information about data types of the operation's output fields.
+* Types of arguments and the return value must be annotated, so that CocoIndex will have information about data types of the operation's output fields.
 See [Data Types](/docs/core/data_types) for supported types.
 
 </TabItem>
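
The note touched by this hunk says annotations are how type information reaches the framework. A minimal sketch of that idea, using only the standard library, might look like the following. The `require_annotations` decorator is hypothetical and is not `cocoindex.op.function()`; it only shows how a decorator can read annotations and reject a function that lacks them.

```python
import inspect
from typing import Callable, get_type_hints


def require_annotations(fn: Callable) -> Callable:
    """Hypothetical stand-in decorator: reject functions lacking type annotations."""
    hints = get_type_hints(fn)
    params = inspect.signature(fn).parameters
    # Collect every argument that has no annotation.
    missing = [name for name in params if name not in hints]
    if "return" not in hints:
        missing.append("return")
    if missing:
        raise TypeError(f"{fn.__name__}: missing annotations for {missing}")
    # Keep the resolved annotations on the function, as a framework might.
    fn.resolved_types = hints
    return fn


@require_annotations
def word_count(text: str) -> int:
    """Annotated argument and return value, so the decorator accepts it."""
    return len(text.split())
```

Here `word_count.resolved_types` holds the argument and return types, which is the kind of information the docs say CocoIndex needs in order to type the operation's output fields.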

docs/docs/core/data_types.mdx

Lines changed: 27 additions & 29 deletions
@@ -9,36 +9,49 @@ In CocoIndex, all data processed by the flow have a type determined when the flo
 
 This makes schema of data processed by CocoIndex clear, and easily determine the schema of your index.
 
-## Data Types
+## Data Types
+
+You don't need to spell out data types in CocoIndex, when you define the flow using existing operations (source, function, etc).
+These operations decide data types of fields produced by them based on the spec and input data types.
+All you need to do is to make sure the data passed to functions and storage targets are accepted by them.
+
+When you define [custom functions](/docs/core/custom_function), you need to specify the data types of arguments and return values.
 
 ### Basic Types
 
 This is the list of all basic types supported by CocoIndex:
 
-| Type | Description |Type in Python | Original Type in Python |
+| Type | Description | Specific Python Type | Native Python Type |
 |------|-------------|---------------|-------------------------|
 | Bytes | | `bytes` | `bytes` |
 | Str | | `str` | `str` |
 | Bool | | `bool` | `bool` |
 | Int64 | | `int` | `int` |
-| Float32 | | `cocoindex.typing.Float32` |`float` |
-| Float64 | | `cocoindex.typing.Float64` |`float` |
-| Range | | `cocoindex.typing.Range` | `tuple[int, int]` |
+| Float32 | | `cocoindex.Float32` |`float` |
+| Float64 | | `cocoindex.Float64` |`float` |
+| Range | | `cocoindex.Range` | `tuple[int, int]` |
 | Uuid | | `uuid.UUId` | `uuid.UUID` |
 | Date | | `datetime.date` | `datetime.date` |
 | Time | | `datetime.time` | `datetime.time` |
-| LocalDatetime | Date and time without timezone | `cocoindex.typing.LocalDateTime` | `datetime.datetime` |
-| OffsetDatetime | Date and time with a timezone offset | `cocoindex.typing.OffsetDateTime` | `datetime.datetime` |
-| Vector[*type*, *N*?] | |`Annotated[list[type], cocoindex.typing.Vector(dim=N)]` | `list[type]` |
-| Json | | `cocoindex.typing.Json` | Any type convertible to JSON by `json` package |
+| LocalDatetime | Date and time without timezone | `cocoindex.LocalDateTime` | `datetime.datetime` |
+| OffsetDatetime | Date and time with a timezone offset | `cocoindex.OffsetDateTime` | `datetime.datetime` |
+| Vector[*T*, *Dim*?] | *T* must be basic type. *Dim* is a positive integer and optional. |`cocoindex.Vector[T]` or `cocoindex.Vector[T, Dim]` | `list[T]` |
+| Json | | `cocoindex.Json` | Any data convertible to JSON by `json` package |
+
+Values of all data types can be represented by values in Python's native types (as described under the Native Python Type column).
+However, the underlying execution engine and some storage system (like Postgres) has finer distinctions for some types, specifically:
 
-For some types, CocoIndex Python SDK provides annotated types with finer granularity than Python's original type, e.g.
 * *Float32* and *Float64* for `float`, with different precision.
 * *LocalDateTime* and *OffsetDateTime* for `datetime.datetime`, with different timezone awareness.
-* *Vector* has dimension information.
+* *Vector* has optional dimension information.
+* *Range* and *Json* provide a clear tag for the type, to clearly distinguish the type in CocoIndex.
 
-When defining [custom functions](/docs/core/custom_function), use the specific types as type annotations for arguments and return values.
-So CocoIndex will have information about the specific type.
+The native Python type is always more permissive and can represent a superset of possible values.
+* Only when you annotate the return type of a custom function, you should use the specific type,
+so that CocoIndex will have information about the precise type to be used in the execution engine and storage system.
+* For all other purposes, e.g. to provide annotation for argument types of a custom function, or used internally in your custom function,
+you can choose whatever to use.
+The native Python type is usually simpler.
 
 ### Struct Type
 
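
The "finer distinctions" described in this hunk can be seen with the standard library alone. The sketch below does not use CocoIndex; the helper names `to_float32` and `to_float64` are illustrative. It shows that one Python `float` survives 8-byte storage unchanged but not 4-byte storage, and that one `datetime.datetime` type covers values both with and without a timezone offset.

```python
import struct
from datetime import datetime, timedelta, timezone


def to_float32(x: float) -> float:
    """Round-trip a Python float through 4-byte IEEE 754 storage."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_float64(x: float) -> float:
    """Round-trip a Python float through 8-byte IEEE 754 storage."""
    return struct.unpack("<d", struct.pack("<d", x))[0]


value = 0.1
print(to_float64(value) == value)  # True: Python floats are already 64-bit
print(to_float32(value) == value)  # False: precision is lost in 32 bits

# Both values below have the native type datetime.datetime.
local = datetime(2025, 1, 1, 12, 0)  # no timezone information
offset = datetime(2025, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=8)))
print(local.tzinfo is None, offset.utcoffset())
```

This is why the native type alone cannot tell an engine or a storage system which of the two representations to use.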
@@ -94,9 +107,7 @@ LTable is a Table type whose row order is preserved. LTable has no key column.
 In Python, a LTable type is represented by `list[R]`, where `R` is a dataclass representing a row.
 For example, you can use `list[Person]` to represent a LTable with 3 columns: `first_name` (Str), `last_name` (Str), `dob` (Date).
 
-## Index Types
-
-### Key Types
+## Key Types
 
 Currently, the following types are key types
 
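
The `list[Person]` example in the context lines above can be written out as follows. The `Person` definition itself is outside this hunk, so the dataclass below is a reconstruction from the three columns named in the text, and the sample rows are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Person:
    """One row; field types follow the columns named in the docs."""
    first_name: str  # Str
    last_name: str   # Str
    dob: date        # Date


# An LTable value is an ordered list of rows (sample data, not from the docs).
people: list[Person] = [
    Person("Ada", "Lovelace", date(1815, 12, 10)),
    Person("Alan", "Turing", date(1912, 6, 23)),
]
```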
@@ -108,16 +119,3 @@ Currently, the following types are key types
 - Uuid
 - Date
 - Struct with all fields being key types
-
-### Vector Type
-
-Users can create vector index on fields with `vector` types.
-A vector index also needs to be configured with a similarity metric, and the index is only effective when this metric is used during retrieval.
-
-Following metrics are supported:
-
-| Metric Name | Description | Similarity Order |
-|-------------|-------------|------------------|
-| CosineSimilarity | [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | Larger is more similar |
-| L2Distance | [L2 distance (a.k.a. Euclidean distance)](https://en.wikipedia.org/wiki/Euclidean_distance) | Smaller is more similar |
-| InnerProduct | [Inner product](https://en.wikipedia.org/wiki/Inner_product_space) | Larger is more similar |
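
For the last entry in the key types list kept by this hunk ("Struct with all fields being key types"), a plain-Python sketch of such a struct could look like this. `EventKey` and its fields are hypothetical. `frozen=True` is used here only so the value is hashable in ordinary Python; whether CocoIndex requires it is not stated in the diff.

```python
import uuid
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EventKey:
    """Hypothetical composite key: both fields are key types from the list (Uuid, Date)."""
    event_id: uuid.UUID
    day: date


key = EventKey(uuid.UUID(int=1), date(2025, 1, 1))
# Equal field values give equal keys, so the struct can identify a row.
rows = {key: "first event"}
```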

docs/docs/core/flow_def.mdx

Lines changed: 21 additions & 4 deletions
@@ -1,6 +1,7 @@
 ---
 title: Flow Definition
 description: Define a CocoIndex flow, by specifying source, transformations and storages, and connect input/output data of them.
+toc_max_heading_level: 4
 ---
 
 import Tabs from '@theme/Tabs';
@@ -281,16 +282,32 @@ The target storage is managed by CocoIndex, i.e. it'll be created by [CocoIndex
 The `name` for the same storage should remain stable across different runs.
 If it changes, CocoIndex will treat it as an old storage removed and a new one created, and perform setup changes and reindexing accordingly.
 
-#### Storage Indexes
+## Storage Indexes
 
 Many storage supports indexes, to boost efficiency in retrieving data.
 CocoIndex provides a common way to configure indexes for various storages.
 
-* *Primary key*. `primary_key_fields` (`Sequence[str]`): the fields to be used as primary key. Types of the fields must be supported as key fields. See [Key Types](data_types#key-types) for more details.
-* *Vector index*. `vector_indexes` (`Sequence[VectorIndexDef]`): the fields to create vector index. `VectorIndexDef` has the following fields:
+### Primary Key
+
+*Primary key* is specified by `primary_key_fields` (`Sequence[str]`).
+Types of the fields must be key types. See [Key Types](data_types#key-types) for more details.
+
+### Vector Index
+
+*Vector index* is specified by `vector_indexes` (`Sequence[VectorIndexDef]`). `VectorIndexDef` has the following fields:
+
 * `field_name`: the field to create vector index.
-* `metric`: the similarity metric to use. See [Vector Type](data_types#vector-type) for more details about supported similarity metrics.
+* `metric`: the similarity metric to use.
+
+#### Similarity Metrics
+
+Following metrics are supported:
 
+| Metric Name | Description | Similarity Order |
+|-------------|-------------|------------------|
+| CosineSimilarity | [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | Larger is more similar |
+| L2Distance | [L2 distance (a.k.a. Euclidean distance)](https://en.wikipedia.org/wiki/Euclidean_distance) | Smaller is more similar |
+| InnerProduct | [Inner product](https://en.wikipedia.org/wiki/Inner_product_space) | Larger is more similar |
 
 ## Miscellaneous
 
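
The three metrics in the table moved by this hunk, together with their "Similarity Order" column, can be sketched in plain Python. This is not how CocoIndex or any storage backend computes them. The `METRICS` mapping and the `rank` helper are illustrative names, and the code assumes equal-length, non-zero vectors.

```python
import math
from typing import Sequence

Vec = Sequence[float]


def inner_product(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Vec, b: Vec) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Vec, b: Vec) -> float:
    # Undefined for zero vectors (division by zero); callers must avoid them.
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norms


# Metric name -> (function, whether a larger score means "more similar").
METRICS = {
    "CosineSimilarity": (cosine_similarity, True),
    "L2Distance": (l2_distance, False),
    "InnerProduct": (inner_product, True),
}


def rank(query: Vec, candidates: Sequence[Vec], metric: str) -> list:
    """Order candidates from most to least similar under the given metric."""
    fn, larger_is_closer = METRICS[metric]
    return sorted(candidates, key=lambda c: fn(query, c), reverse=larger_is_closer)
```

Because the three metrics can order the same candidates differently, retrieval should rank with the same metric the index was configured with.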