Commit 22cab1d

docs(types): revise doc for cocoindex types to make it more clear (#693)
1 parent 81b0ab2 commit 22cab1d

1 file changed

+103
-61
lines changed

docs/docs/core/data_types.mdx

Lines changed: 103 additions & 61 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,7 @@
11
---
22
title: Data Types
33
description: Data Types in CocoIndex
4+
toc_max_heading_level: 4
45
---
56

67
# Data Types in CocoIndex
@@ -11,56 +12,97 @@ This makes schema of data processed by CocoIndex clear, and easily determine the
1112

1213
## Data Types
1314

14-
You don't need to spell out data types in CocoIndex, when you define the flow using existing operations (source, function, etc).
15-
These operations decide data types of fields produced by them based on the spec and input data types.
16-
All you need to do is to make sure the data passed to functions and targets are accepted by them.
15+
CocoIndex is an engine written in Rust and designed to be used from different languages, and its data is always serializable, so it defines a type system independent of any specific programming language.
1716

18-
When you define [custom functions](/docs/core/custom_function), you need to specify the data types of arguments and return values.
17+
CocoIndex automatically infers data types of the output created by CocoIndex sources and functions.
18+
You don't need to spell out any data type explicitly when you define the flow.
19+
All you need to do is to make sure the data passed to functions and targets are compatible with them.
20+
21+
Each type in the CocoIndex type system is mapped to one or more types in Python.
22+
When you define a [custom function](/docs/core/custom_function), you need to annotate the data types of arguments and return values.
23+
24+
* For return values, type annotation is required, because it provides the ground truth for the type of the custom function's output.
25+
* For arguments, type annotation is only used to enable the conversion from data values already existing in the CocoIndex engine to Python values.
26+
Type annotation is optional for basic types. When not specified, CocoIndex will use the *default Python type* for the argument.
27+
Type annotation is required for arguments of struct types and table types.
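The annotation rules above can be illustrated in plain Python. This is only a sketch: `Summary`, `summarize` and `has_return_annotation` are hypothetical names, and the check shown here is not how CocoIndex itself inspects functions. It shows a function whose return type is annotated with a struct-like dataclass while a basic-typed argument is left unannotated.

```python
import inspect
from dataclasses import dataclass


@dataclass
class Summary:
    """A struct-like return type with two named fields (hypothetical)."""
    word_count: int
    first_word: str


def summarize(text, max_words: int = 100) -> Summary:
    # `text` carries no annotation: for a basic type such as Str, the
    # default Python type (`str`) is what the function would receive.
    words = text.split()[:max_words]
    return Summary(word_count=len(words), first_word=words[0] if words else "")


def has_return_annotation(fn) -> bool:
    """Hypothetical helper: True when `fn` declares a return type."""
    return inspect.signature(fn).return_annotation is not inspect.Signature.empty
```

A function without a return annotation, such as `lambda x: x`, would fail this check, which mirrors the rule that the return type must always be spelled out.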
1928

2029
### Basic Types
2130

22-
This is the list of all basic types supported by CocoIndex:
23-
24-
| Type | Description | Specific Python Type | Native Python Type |
25-
|------|-------------|---------------|-------------------------|
26-
| Bytes | | `bytes` | `bytes` |
27-
| Str | | `str` | `str` |
28-
| Bool | | `bool` | `bool` |
29-
| Int64 | | `int` | `int` |
30-
| Float32 | | `cocoindex.Float32` |`float` |
31-
| Float64 | | `cocoindex.Float64` |`float` |
32-
| Range | | `cocoindex.Range` | `tuple[int, int]` |
33-
| Uuid | | `uuid.UUId` | `uuid.UUID` |
34-
| Date | | `datetime.date` | `datetime.date` |
35-
| Time | | `datetime.time` | `datetime.time` |
36-
| LocalDatetime | Date and time without timezone | `cocoindex.LocalDateTime` | `datetime.datetime` |
37-
| OffsetDatetime | Date and time with a timezone offset | `cocoindex.OffsetDateTime` | `datetime.datetime` |
38-
| TimeDelta | A duration of time | `datetime.timedelta` | `datetime.timedelta` |
39-
| Json | | `cocoindex.Json` | Any data convertible to JSON by `json` package |
40-
| Vector[*T*, *Dim*?] | *T* can be a basic type or a numeric type. *Dim* is a positive integer and optional. | `cocoindex.Vector[T]` or `cocoindex.Vector[T, Dim]` | `numpy.typing.NDArray[T]` or `list[T]` |
41-
| Union[*T1*, *T2*, ...] | *T1*, *T2*, ... are any basic types | `T1 | T2 | ...` | `T1 | T2 | ...` |
42-
43-
Values of all data types can be represented by values in Python's native types (as described under the Native Python Type column).
44-
However, the underlying execution engine has finer distinctions for some types, specifically:
45-
46-
* *Float32* and *Float64* for `float`, with different precision.
47-
* *LocalDateTime* and *OffsetDateTime* for `datetime.datetime`, with different timezone awareness.
48-
* *Range* and *Json* provide a clear tag for the type, to clearly distinguish the type in CocoIndex.
49-
* *Vector* holds elements of type *T*. If *T* is numeric (e.g., `np.float32` or `np.float64`), it's represented as `NDArray[T]`; otherwise, as `list[T]`.
50-
* *Vector* also has optional dimension information.
51-
52-
The native Python type is always more permissive and can represent a superset of possible values.
53-
* Only when you annotate the return type of a custom function, you should use the specific type,
54-
so that CocoIndex will have information about the precise type to be used in the execution engine and target.
55-
* For all other purposes, e.g. to provide annotation for argument types of a custom function, or used internally in your custom function,
56-
you can choose whatever to use.
57-
The native Python type is usually simpler.
31+
#### Primitive Types
32+
33+
Primitive types are basic types that are not composed of other types.
34+
This is the list of all primitive types supported by CocoIndex:
35+
36+
| CocoIndex Type | Python Types | Convertible to | Explanation |
37+
|------|-------------|--------------|----------------|
38+
| *Bytes* | `bytes` | | |
39+
| *Str* | `str` | | |
40+
| *Bool* | `bool` | | |
41+
| *Int64* | `cocoindex.Int64`, `int`, `numpy.int64` | | |
42+
| *Float32* | `cocoindex.Float32`, `numpy.float32` | *Float64* | |
43+
| *Float64* | `cocoindex.Float64`, `float`, `numpy.float64` | | |
44+
| *Range* | `cocoindex.Range` | | |
45+
| *Uuid* | `uuid.UUID` | | |
46+
| *Date* | `datetime.date` | | |
47+
| *Time* | `datetime.time` | | |
48+
| *LocalDatetime* | `cocoindex.LocalDateTime` | *OffsetDatetime* | without timezone |
49+
| *OffsetDatetime* | `cocoindex.OffsetDateTime`, `datetime.datetime` | | with timezone |
50+
| *TimeDelta* | `datetime.timedelta` | | |
51+
52+
Notes:
53+
54+
* For some CocoIndex types, we support multiple Python types. You can annotate with any of these Python types.
55+
The first one is the *default Python type*, which means CocoIndex will create a value with this type when you don't annotate the type in function arguments.
56+
57+
* All Python types starting with `cocoindex.` are type aliases exported by CocoIndex. They're annotated types based on certain Python types:
58+
59+
* `cocoindex.Int64`: `int`
60+
* `cocoindex.Float64`: `float`
61+
* `cocoindex.Float32`: `float`
62+
* `cocoindex.Range`: `tuple[int, int]`, i.e. a start offset (inclusive) and an end offset (exclusive)
63+
* `cocoindex.OffsetDateTime`: `datetime.datetime`
64+
* `cocoindex.LocalDateTime`: `datetime.datetime`
65+
66+
These aliases provide a non-ambiguous way to represent a specific type in CocoIndex, given their base Python types can represent a superset of possible values.
67+
68+
* When we say a CocoIndex type is *convertible to* another type, it means Python types for the second type can also be used to bind to a value of the first type.
69+
For example, *Float32* is convertible to *Float64*, so you can bind a value of *Float32* to a Python value of `float` or `np.float64` types.
70+
For *LocalDatetime*, when you use `cocoindex.OffsetDateTime` or `datetime.datetime` as the annotation to bind its value, the timezone will be set to UTC.
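The notes above can be sketched with the standard library alone. This is an illustration, not CocoIndex's implementation: `as_offset_datetime` and `slice_by_range` are hypothetical helpers, and applying a range to a string is an assumption about how the offsets are used. The first helper mirrors the described behavior of binding a timezone-less value with the timezone set to UTC; the second shows the half-open (inclusive start, exclusive end) convention of `tuple[int, int]`.

```python
import datetime


def as_offset_datetime(local: datetime.datetime) -> datetime.datetime:
    """Attach UTC to a naive datetime; leave timezone-aware values unchanged."""
    if local.tzinfo is None:
        return local.replace(tzinfo=datetime.timezone.utc)
    return local


def slice_by_range(text: str, rng: tuple[int, int]) -> str:
    """Apply a (start inclusive, end exclusive) range to a string."""
    start, end = rng
    return text[start:end]
```

Because the end offset is exclusive, the pair matches Python's own slicing convention, so `(0, 5)` selects the first five characters.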
71+
72+
73+
#### Json Type
74+
75+
*Json* type can hold any data convertible to JSON by `json` package.
76+
In Python, it's represented by `cocoindex.Json`.
77+
It's useful to hold data without fixed schema known at flow definition time.
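As a rough guide to what "convertible to JSON by `json` package" means, the following sketch tests a value with `json.dumps`. The helper name `is_json_convertible` is hypothetical and is not part of CocoIndex.

```python
import json
from typing import Any


def is_json_convertible(value: Any) -> bool:
    """Return True when the standard `json` package can serialize `value`."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        # TypeError: unsupported type (e.g. a set); ValueError: circular reference.
        return False
    return True
```

Nested dicts, lists, strings, numbers, booleans and `None` pass this check, while values such as sets do not.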
78+
79+
80+
#### Vector Types
81+
82+
A vector type is a collection of elements of the same basic type.
83+
Optionally, it can have a fixed dimension. It is denoted as *Vector[Type]* or *Vector[Type, Dim]*, e.g. *Vector[Float32]* or *Vector[Float32, 384]*.
84+
85+
It supports the following Python types:
86+
87+
* `cocoindex.Vector[T]` or `cocoindex.Vector[T, typing.Literal[Dim]]`, e.g. `cocoindex.Vector[cocoindex.Float32]` or `cocoindex.Vector[cocoindex.Float32, 384]`
88+
* The underlying Python type is `numpy.typing.NDArray[T]` where `T` is a numpy numeric type (`numpy.int64`, `numpy.float32` or `numpy.float64`), or `list[T]` otherwise
89+
* `numpy.typing.NDArray[T]` where `T` is a numpy numeric type
90+
* `list[T]`
91+
92+
93+
#### Union Types
94+
95+
A union type is a type that can represent values in one of multiple basic types.
96+
It is denoted as *Type1* | *Type2* | ..., e.g. *Int64* | *Float32* | *Float64*.
97+
98+
The Python type is `T1 | T2 | ...`, e.g. `cocoindex.Int64 | cocoindex.Float32 | cocoindex.Float64`, `int | float` (equivalent to `cocoindex.Int64 | cocoindex.Float64`)
99+
58100

59101
### Struct Types
60102

61-
A Struct has a bunch of fields, each with a name and a type.
103+
A *Struct* has a bunch of fields, each with a name and a type.
62104

63-
In Python, a Struct type is represented by either a [dataclass](https://docs.python.org/3/library/dataclasses.html)
105+
In Python, a *Struct* type is represented by either a [dataclass](https://docs.python.org/3/library/dataclasses.html)
64106
or a [NamedTuple](https://docs.python.org/3/library/typing.html#typing.NamedTuple), with all fields annotated with a specific type.
65107
Both options define a structured type with named fields, but they differ slightly:
66108

@@ -93,22 +135,22 @@ Choose `dataclass` for mutable objects or when you need additional methods, and
93135

94136
### Table Types
95137

96-
A Table type models a collection of rows, each with multiple columns.
138+
A *Table* type models a collection of rows, each with multiple columns.
97139
Each column of a table has a specific type.
98140

99-
We have two specific types of Table types: KTable and LTable.
141+
We have two specific types of *Table* types: *KTable* and *LTable*.
100142

101143
#### KTable
102144

103-
KTable is a Table type whose first column serves as the key.
104-
The row order of a KTable is not preserved.
145+
*KTable* is a *Table* type whose first column serves as the key.
146+
The row order of a *KTable* is not preserved.
105147
Type of the first column (key column) must be a [key type](#key-types).
106148

107-
In Python, a KTable type is represented by `dict[K, V]`.
108-
The `V` should be a struct type, either a `dataclass` or `NamedTuple`, representing the value fields of each row.
109-
For example, you can use `dict[str, Person]` or `dict[str, PersonTuple]` to represent a KTable, with 4 columns: key (Str), `first_name` (Str), `last_name` (Str), `dob` (Date).
149+
In Python, a *KTable* type is represented by `dict[K, V]`.
150+
The `V` should be a *Struct* type, either a `dataclass` or `NamedTuple`, representing the value fields of each row.
151+
For example, you can use `dict[str, Person]` or `dict[str, PersonTuple]` to represent a *KTable*, with 4 columns: key (*Str*), `first_name` (*Str*), `last_name` (*Str*), `dob` (*Date*).
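The four-column shape described above can be sketched as follows. The `Person` fields come from the column list in the text; the sample rows and the `to_rows` helper are illustrative assumptions, not part of CocoIndex.

```python
import dataclasses
import datetime
from dataclasses import dataclass


@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date


# A KTable-shaped value: the dict key is the key column,
# and the struct's fields are the remaining columns.
people: dict[str, Person] = {
    "ada": Person("Ada", "Lovelace", datetime.date(1815, 12, 10)),
    "alan": Person("Alan", "Turing", datetime.date(1912, 6, 23)),
}


def to_rows(table: dict[str, Person]) -> list[tuple]:
    """Flatten each entry into (key, first_name, last_name, dob)."""
    return [(key, *dataclasses.astuple(value)) for key, value in table.items()]
```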
110152

111-
Note that if you want to use a struct as the key, you need to ensure the struct is immutable. For `dataclass`, annotate it with `@dataclass(frozen=True)`. For `NamedTuple`, immutability is built-in.
153+
Note that if you want to use a *Struct* as the key, you need to ensure its value in Python is immutable. For `dataclass`, annotate it with `@dataclass(frozen=True)`. For `NamedTuple`, immutability is built-in.
112154
For example:
113155

114156
```python
@@ -127,20 +169,20 @@ Then you can use `dict[PersonKey, Person]` or `dict[PersonKeyTuple, PersonTuple]
127169

128170
#### LTable
129171

130-
LTable is a Table type whose row order is preserved. LTable has no key column.
172+
*LTable* is a *Table* type whose row order is preserved. *LTable* has no key column.
131173

132-
In Python, a LTable type is represented by `list[R]`, where `R` is a dataclass representing a row.
133-
For example, you can use `list[Person]` to represent a LTable with 3 columns: `first_name` (Str), `last_name` (Str), `dob` (Date).
174+
In Python, an *LTable* type is represented by `list[R]`, where `R` is a dataclass representing a row.
175+
For example, you can use `list[Person]` to represent an *LTable* with 3 columns: `first_name` (*Str*), `last_name` (*Str*), `dob` (*Date*).
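A minimal sketch of the `list[Person]` shape, with sample rows and a hypothetical `first_names` helper added for illustration. Since the value is an ordinary Python list, rows stay in insertion order.

```python
import datetime
from dataclasses import dataclass


@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date


# An LTable-shaped value: no key column, and row order is the list order.
people: list[Person] = [
    Person("Grace", "Hopper", datetime.date(1906, 12, 9)),
    Person("Alan", "Turing", datetime.date(1912, 6, 23)),
]


def first_names(rows: list[Person]) -> list[str]:
    """Read one column while keeping the row order."""
    return [row.first_name for row in rows]
```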
134176

135177
## Key Types
136178

137179
Currently, the following types are key types
138180

139-
- Bytes
140-
- Str
141-
- Bool
142-
- Int64
143-
- Range
144-
- Uuid
145-
- Date
146-
- Struct with all fields being key types (using `@dataclass(frozen=True)` or `NamedTuple`)
181+
- *Bytes*
182+
- *Str*
183+
- *Bool*
184+
- *Int64*
185+
- *Range*
186+
- *Uuid*
187+
- *Date*
188+
- *Struct* with all fields being key types (using `@dataclass(frozen=True)` or `NamedTuple`)
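The immutability requirement for struct keys lines up with a Python rule: dict keys must be hashable, and a regular (non-frozen) dataclass is not. The sketch below uses hypothetical names (`DocKey`, `DocKeyTuple`, `MutableKey`, `is_hashable`) to show the difference.

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass(frozen=True)
class DocKey:
    """Frozen dataclass: immutable and hashable, so usable as a dict key."""
    source: str
    page: int


class DocKeyTuple(NamedTuple):
    """NamedTuple: immutable by construction."""
    source: str
    page: int


@dataclass
class MutableKey:
    """Regular dataclass: defines __eq__ without __hash__, so it is unhashable."""
    source: str
    page: int


def is_hashable(value) -> bool:
    """Return True when `value` can be used as a dict key."""
    try:
        hash(value)
    except TypeError:
        return False
    return True
```

Two frozen keys with equal fields hash alike and compare equal, so a lookup with a freshly constructed `DocKey("a", 1)` finds the entry stored under an earlier `DocKey("a", 1)`.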
