
Commit 250d041

Merge branch 'main' into test-suites
2 parents 539fb21 + d03a584 commit 250d041

File tree

30 files changed

+583
-177
lines changed


.gitignore

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,3 +18,5 @@ dist/
1818

1919
# Output of `cocoindex eval`
2020
examples/**/eval_*
21+
22+
/.vscode

.vscode/settings.json

Lines changed: 0 additions & 9 deletions
This file was deleted.

README.md

Lines changed: 1 addition & 0 deletions
@@ -139,6 +139,7 @@ It defines an index flow like this:
139139
| [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a Dockerized FastAPI setup |
140140
| [Product Recommendation](examples/product_recommendation) | Build real-time product recommendations with LLM and graph database|
141141
| [Image Search with Vision API](examples/image_search) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend|
142+
| [Paper Metadata](examples/paper_metadata) | Index papers in PDF files, and build metadata tables for each paper |
142143

143144
More coming and stay tuned 👀!
144145

docs/docs/core/cli.mdx

Lines changed: 0 additions & 6 deletions
@@ -47,12 +47,6 @@ CocoIndex CLI supports the following global options:
4747
* `--version`: Show the CocoIndex version and exit.
4848
* `--help`: Show the main help message and exit.
4949

50-
:::caution Deprecated Usage
51-
52-
The old method of invoking the CLI using `python main.py cocoindex ...` via the `@cocoindex.main_fn()` decorator is now deprecated. Please remove `@cocoindex.main_fn()` from your scripts and use the standalone cocoindex command as described.
53-
54-
:::
55-
5650
## Subcommands
5751

5852
The following subcommands are available:

docs/docs/core/data_types.mdx

Lines changed: 103 additions & 61 deletions
@@ -1,6 +1,7 @@
11
---
22
title: Data Types
33
description: Data Types in CocoIndex
4+
toc_max_heading_level: 4
45
---
56

67
# Data Types in CocoIndex
@@ -11,56 +12,97 @@ This makes schema of data processed by CocoIndex clear, and easily determine the
1112

1213
## Data Types
1314

14-
You don't need to spell out data types in CocoIndex, when you define the flow using existing operations (source, function, etc).
15-
These operations decide data types of fields produced by them based on the spec and input data types.
16-
All you need to do is to make sure the data passed to functions and targets are accepted by them.
15+
CocoIndex is an engine written in Rust and designed to be used from different languages, and its data is always serializable. It therefore defines a type system independent of any specific programming language.
1716

18-
When you define [custom functions](/docs/core/custom_function), you need to specify the data types of arguments and return values.
17+
CocoIndex automatically infers data types of the output created by CocoIndex sources and functions.
18+
You don't need to spell out any data type explicitly when you define the flow.
19+
All you need to do is to make sure the data passed to functions and targets are compatible with them.
20+
21+
Each type in the CocoIndex type system is mapped to one or multiple types in Python.
22+
When you define a [custom function](/docs/core/custom_function), you need to annotate the data types of arguments and return values.
23+
24+
* For return values, type annotation is required, because it provides the ground truth that defines the output type of the custom function.
25+
* For arguments, type annotation is only used to enable the conversion from data values already existing in the CocoIndex engine to Python values.
26+
Type annotation is optional for basic types. When not specified, CocoIndex will use the *default Python type* for the argument.
27+
Type annotation is required for arguments of struct types and table types.
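
The annotation rules above can be illustrated with plain Python. This is a minimal sketch, not CocoIndex's own code: the `Person` dataclass and the `missing_return_annotation` helper are hypothetical names introduced here, and the example assumes only that annotations can be read with the standard `inspect` module.

```python
import inspect
from dataclasses import dataclass


@dataclass
class Person:
    first_name: str
    last_name: str


def full_name(person: Person, separator=" ") -> str:
    """The struct argument is annotated; the basic-type argument is left bare."""
    return f"{person.first_name}{separator}{person.last_name}"


def no_annotation(x):
    """A function that would not satisfy the 'return annotation required' rule."""
    return x


def missing_return_annotation(fn) -> bool:
    """Hypothetical check: True when `fn` declares no return annotation."""
    return inspect.signature(fn).return_annotation is inspect.Signature.empty
```

Here `full_name` follows the rules described above (annotated return value, annotated struct argument, unannotated basic argument), while `no_annotation` has nothing to derive an output type from.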
1928

2029
### Basic Types
2130

22-
This is the list of all basic types supported by CocoIndex:
23-
24-
| Type | Description | Specific Python Type | Native Python Type |
25-
|------|-------------|---------------|-------------------------|
26-
| Bytes | | `bytes` | `bytes` |
27-
| Str | | `str` | `str` |
28-
| Bool | | `bool` | `bool` |
29-
| Int64 | | `int` | `int` |
30-
| Float32 | | `cocoindex.Float32` |`float` |
31-
| Float64 | | `cocoindex.Float64` |`float` |
32-
| Range | | `cocoindex.Range` | `tuple[int, int]` |
33-
| Uuid | | `uuid.UUId` | `uuid.UUID` |
34-
| Date | | `datetime.date` | `datetime.date` |
35-
| Time | | `datetime.time` | `datetime.time` |
36-
| LocalDatetime | Date and time without timezone | `cocoindex.LocalDateTime` | `datetime.datetime` |
37-
| OffsetDatetime | Date and time with a timezone offset | `cocoindex.OffsetDateTime` | `datetime.datetime` |
38-
| TimeDelta | A duration of time | `datetime.timedelta` | `datetime.timedelta` |
39-
| Json | | `cocoindex.Json` | Any data convertible to JSON by `json` package |
40-
| Vector[*T*, *Dim*?] | *T* can be a basic type or a numeric type. *Dim* is a positive integer and optional. | `cocoindex.Vector[T]` or `cocoindex.Vector[T, Dim]` | `numpy.typing.NDArray[T]` or `list[T]` |
41-
| Union[*T1*, *T2*, ...] | *T1*, *T2*, ... are any basic types | `T1 | T2 | ...` | `T1 | T2 | ...` |
42-
43-
Values of all data types can be represented by values in Python's native types (as described under the Native Python Type column).
44-
However, the underlying execution engine has finer distinctions for some types, specifically:
45-
46-
* *Float32* and *Float64* for `float`, with different precision.
47-
* *LocalDateTime* and *OffsetDateTime* for `datetime.datetime`, with different timezone awareness.
48-
* *Range* and *Json* provide a clear tag for the type, to clearly distinguish the type in CocoIndex.
49-
* *Vector* holds elements of type *T*. If *T* is numeric (e.g., `np.float32` or `np.float64`), it's represented as `NDArray[T]`; otherwise, as `list[T]`.
50-
* *Vector* also has optional dimension information.
51-
52-
The native Python type is always more permissive and can represent a superset of possible values.
53-
* Only when you annotate the return type of a custom function, you should use the specific type,
54-
so that CocoIndex will have information about the precise type to be used in the execution engine and target.
55-
* For all other purposes, e.g. to provide annotation for argument types of a custom function, or used internally in your custom function,
56-
you can choose whatever to use.
57-
The native Python type is usually simpler.
31+
#### Primitive Types
32+
33+
Primitive types are basic types that are not composed of other types.
34+
This is the list of all primitive types supported by CocoIndex:
35+
36+
| CocoIndex Type | Python Types | Convertible to | Explanation |
37+
|------|-------------|--------------|----------------|
38+
| *Bytes* | `bytes` | | |
39+
| *Str* | `str` | | |
40+
| *Bool* | `bool` | | |
41+
| *Int64* | `cocoindex.Int64`, `int`, `numpy.int64` | | |
42+
| *Float32* | `cocoindex.Float32`, `numpy.float32` | *Float64* | |
43+
| *Float64* | `cocoindex.Float64`, `float`, `numpy.float64` | | |
44+
| *Range* | `cocoindex.Range` | | |
45+
| *Uuid* | `uuid.UUID` | | |
46+
| *Date* | `datetime.date` | | |
47+
| *Time* | `datetime.time` | | |
48+
| *LocalDatetime* | `cocoindex.LocalDateTime` | *OffsetDatetime* | without timezone |
49+
| *OffsetDatetime* | `cocoindex.OffsetDateTime`, `datetime.datetime` | | with timezone |
50+
| *TimeDelta* | `datetime.timedelta` | | |
51+
52+
Notes:
53+
54+
* For some CocoIndex types, we support multiple Python types. You can annotate with any of these Python types.
55+
The first one is the *default Python type*, which means CocoIndex will create a value with this type when you don't annotate the type in function arguments.
56+
57+
* All Python types starting with `cocoindex.` are type aliases exported by CocoIndex. They're annotated types based on certain Python types:
58+
59+
* `cocoindex.Int64`: `int`
60+
* `cocoindex.Float64`: `float`
61+
* `cocoindex.Float32`: `float`
62+
* `cocoindex.Range`: `tuple[int, int]`, i.e. a start offset (inclusive) and an end offset (exclusive)
63+
* `cocoindex.OffsetDateTime`: `datetime.datetime`
64+
* `cocoindex.LocalDateTime`: `datetime.datetime`
65+
66+
These aliases provide a non-ambiguous way to represent a specific type in CocoIndex, given their base Python types can represent a superset of possible values.
67+
68+
* When we say a CocoIndex type is *convertible to* another type, it means Python types for the second type can also be used to bind to a value of the first type.
69+
For example, *Float32* is convertible to *Float64*, so you can bind a value of *Float32* to a Python value of `float` or `np.float64` types.
70+
For *LocalDatetime*, when you use `cocoindex.OffsetDateTime` or `datetime.datetime` as the annotation to bind its value, the timezone will be set to UTC.
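
The notes above can be sketched with the standard library alone. The aliases below are hypothetical stand-ins built with `typing.Annotated`; the real `cocoindex.Float32` and `cocoindex.Float64` definitions may carry different metadata. The `bind_local_as_offset` helper is likewise an assumption about what "the timezone will be set to UTC" means, namely attaching UTC to a naive value.

```python
from datetime import datetime, timezone
from typing import Annotated, get_args, get_type_hints

# Hypothetical stand-ins for the exported aliases; both are based on `float`.
Float32 = Annotated[float, "Float32"]
Float64 = Annotated[float, "Float64"]


def double_score(raw: Float32) -> Float64:
    """The base type is `float` either way; the alias only tags the precision."""
    return raw * 2.0


# `include_extras=True` keeps the Annotated metadata so the tag can be read back.
hints = get_type_hints(double_score, include_extras=True)
raw_base, raw_tag = get_args(hints["raw"])


def bind_local_as_offset(value: datetime) -> datetime:
    """Attach UTC to a naive datetime; leave timezone-aware values untouched."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value
```

The tag is what removes the ambiguity: `float` alone could mean either precision, while the annotated alias names exactly one.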
71+
72+
73+
#### Json Type
74+
75+
*Json* type can hold any data convertible to JSON by `json` package.
76+
In Python, it's represented by `cocoindex.Json`.
77+
It's useful to hold data without fixed schema known at flow definition time.
78+
79+
80+
#### Vector Types
81+
82+
A vector type is a collection of elements of the same basic type.
83+
Optionally, it can have a fixed dimension. It is written as *Vector[Type]* or *Vector[Type, Dim]*, e.g. *Vector[Float32]* or *Vector[Float32, 384]*.
84+
85+
It supports the following Python types:
86+
87+
* `cocoindex.Vector[T]` or `cocoindex.Vector[T, typing.Literal[Dim]]`, e.g. `cocoindex.Vector[cocoindex.Float32]` or `cocoindex.Vector[cocoindex.Float32, 384]`
88+
* The underlying Python type is `numpy.typing.NDArray[T]` where `T` is a numpy numeric type (`numpy.int64`, `numpy.float32` or `numpy.float64`), or `list[T]` otherwise
89+
* `numpy.typing.NDArray[T]` where `T` is a numpy numeric type
90+
* `list[T]`
91+
92+
93+
#### Union Types
94+
95+
A union type is a type that can represent values in one of multiple basic types.
96+
It is written as *Type1* | *Type2* | ..., e.g. *Int64* | *Float32* | *Float64*.
97+
98+
The Python type is `T1 | T2 | ...`, e.g. `cocoindex.Int64 | cocoindex.Float32 | cocoindex.Float64`, `int | float` (equivalent to `cocoindex.Int64 | cocoindex.Float64`)
99+
58100

59101
### Struct Types
60102

61-
A Struct has a bunch of fields, each with a name and a type.
103+
A *Struct* has a bunch of fields, each with a name and a type.
62104

63-
In Python, a Struct type is represented by either a [dataclass](https://docs.python.org/3/library/dataclasses.html)
105+
In Python, a *Struct* type is represented by either a [dataclass](https://docs.python.org/3/library/dataclasses.html)
64106
or a [NamedTuple](https://docs.python.org/3/library/typing.html#typing.NamedTuple), with all fields annotated with a specific type.
65107
Both options define a structured type with named fields, but they differ slightly:
66108

@@ -93,22 +135,22 @@ Choose `dataclass` for mutable objects or when you need additional methods, and
93135

94136
### Table Types
95137

96-
A Table type models a collection of rows, each with multiple columns.
138+
A *Table* type models a collection of rows, each with multiple columns.
97139
Each column of a table has a specific type.
98140

99-
We have two specific types of Table types: KTable and LTable.
141+
We have two specific types of *Table* types: *KTable* and *LTable*.
100142

101143
#### KTable
102144

103-
KTable is a Table type whose first column serves as the key.
104-
The row order of a KTable is not preserved.
145+
*KTable* is a *Table* type whose first column serves as the key.
146+
The row order of a *KTable* is not preserved.
105147
Type of the first column (key column) must be a [key type](#key-types).
106148

107-
In Python, a KTable type is represented by `dict[K, V]`.
108-
The `V` should be a struct type, either a `dataclass` or `NamedTuple`, representing the value fields of each row.
109-
For example, you can use `dict[str, Person]` or `dict[str, PersonTuple]` to represent a KTable, with 4 columns: key (Str), `first_name` (Str), `last_name` (Str), `dob` (Date).
149+
In Python, a *KTable* type is represented by `dict[K, V]`.
150+
The `V` should be a *Struct* type, either a `dataclass` or `NamedTuple`, representing the value fields of each row.
151+
For example, you can use `dict[str, Person]` or `dict[str, PersonTuple]` to represent a *KTable*, with 4 columns: key (*Str*), `first_name` (*Str*), `last_name` (*Str*), `dob` (*Date*).
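
As a rough illustration of the four-column shape described above, the sketch below builds a `dict[str, Person]` and flattens it into rows. The `Person` fields follow the column names given in the text; the sample data and the `to_rows` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Person:
    first_name: str
    last_name: str
    dob: date


# The dict key plays the role of the key column; Person holds the value fields.
people: dict[str, Person] = {
    "ada": Person("Ada", "Lovelace", date(1815, 12, 10)),
    "alan": Person("Alan", "Turing", date(1912, 6, 23)),
}


def to_rows(table: dict[str, Person]) -> list[tuple[str, str, str, date]]:
    """Flatten each entry into (key, first_name, last_name, dob)."""
    return [(key, p.first_name, p.last_name, p.dob) for key, p in table.items()]
```

Since the row order of a *KTable* is not preserved, code consuming such rows should look entries up by key rather than by position.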
110152

111-
Note that if you want to use a struct as the key, you need to ensure the struct is immutable. For `dataclass`, annotate it with `@dataclass(frozen=True)`. For `NamedTuple`, immutability is built-in.
153+
Note that if you want to use a *Struct* as the key, you need to ensure its value in Python is immutable. For `dataclass`, annotate it with `@dataclass(frozen=True)`. For `NamedTuple`, immutability is built-in.
112154
For example:
113155

114156
```python
@@ -127,20 +169,20 @@ Then you can use `dict[PersonKey, Person]` or `dict[PersonKeyTuple, PersonTuple]
127169

128170
#### LTable
129171

130-
LTable is a Table type whose row order is preserved. LTable has no key column.
172+
*LTable* is a *Table* type whose row order is preserved. *LTable* has no key column.
131173

132-
In Python, a LTable type is represented by `list[R]`, where `R` is a dataclass representing a row.
133-
For example, you can use `list[Person]` to represent a LTable with 3 columns: `first_name` (Str), `last_name` (Str), `dob` (Date).
174+
In Python, an *LTable* type is represented by `list[R]`, where `R` is a dataclass representing a row.
175+
For example, you can use `list[Person]` to represent an *LTable* with 3 columns: `first_name` (*Str*), `last_name` (*Str*), `dob` (*Date*).
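
A small sketch of the `list[Person]` shape, assuming a `Person` dataclass with the three fields named above. The sample data and the `column` helper are hypothetical and only show that rows keep their order and have no key column.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Person:
    first_name: str
    last_name: str
    dob: date


# Rows stay in the order they were appended; there is no key column.
people: list[Person] = [
    Person("Alan", "Turing", date(1912, 6, 23)),
    Person("Ada", "Lovelace", date(1815, 12, 10)),
]


def column(rows: list[Person], name: str) -> list:
    """Read one column from every row, preserving row order."""
    return [getattr(row, name) for row in rows]
```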
134176

135177
## Key Types
136178

137179
Currently, the following types are key types
138180

139-
- Bytes
140-
- Str
141-
- Bool
142-
- Int64
143-
- Range
144-
- Uuid
145-
- Date
146-
- Struct with all fields being key types (using `@dataclass(frozen=True)` or `NamedTuple`)
181+
- *Bytes*
182+
- *Str*
183+
- *Bool*
184+
- *Int64*
185+
- *Range*
186+
- *Uuid*
187+
- *Date*
188+
- *Struct* with all fields being key types (using `@dataclass(frozen=True)` or `NamedTuple`)
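
The requirement for `@dataclass(frozen=True)` on struct keys follows from how Python hashes dataclasses, which the sketch below demonstrates. The field names on `PersonKey` and the `MutableKey` and `is_hashable` names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonKey:
    id_kind: str
    id: str


@dataclass
class MutableKey:
    id_kind: str
    id: str


def is_hashable(value) -> bool:
    """Return True when `value` can be used as a dict key."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


# A frozen dataclass gets a field-based hash, so equal keys find the same entry.
lookup = {PersonKey("passport", "X123"): "Ada"}
```

A regular dataclass defines `__eq__` without `__hash__`, so Python marks it unhashable and it cannot serve as a `dict` key.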
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
# Postgres database address for cocoindex
2+
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
3+
4+
OPENAI_API_KEY=

examples/paper_metadata/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
.env

examples/paper_metadata/README.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
1+
# Build embedding index from PDF files and query with natural language
2+
[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
3+
4+
5+
In this example, we will build a bunch of tables for papers in PDF files, including:
6+
7+
- Metadata (title, authors, abstract) for each paper.
8+
- Author-to-paper mapping, for author-based query.
9+
- Embeddings for titles and abstract chunks, for semantic search.
10+
11+
We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.
12+
13+
## Steps
14+
### Indexing Flow
15+
16+
1. We will ingest a list of papers in PDF.
17+
2. For each file, we:
18+
- Extract the first page of the paper.
19+
- Convert the first page to Markdown.
20+
- Extract metadata (title, authors, abstract) from the first page.
21+
- Split the abstract into chunks, and compute embeddings for each chunk.
22+
3. We will export to the following tables in Postgres with PGVector:
23+
- Metadata (title, authors, abstract) for each paper.
24+
- Author-to-paper mapping, for author-based query.
25+
- Embeddings for titles and abstract chunks, for semantic search.
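
To make the table shapes in the steps above concrete, here is a minimal sketch in plain Python. It is not the example's actual `main.py`: the `PaperMetadata` dataclass, the `author_rows` helper, and the naive word-based `chunk_abstract` splitter are all hypothetical, and the embedding step is left out.

```python
from dataclasses import dataclass


@dataclass
class PaperMetadata:
    title: str
    authors: list[str]
    abstract: str


def author_rows(filename: str, meta: PaperMetadata) -> list[tuple[str, str]]:
    """One (author, filename) row per author, for author-based lookup."""
    return [(author, filename) for author in meta.authors]


def chunk_abstract(abstract: str, max_words: int = 50) -> list[str]:
    """Naive splitter: groups of at most `max_words` whitespace-separated words."""
    words = abstract.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
```

Each chunk returned by `chunk_abstract` would then be embedded and stored alongside the paper's filename.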
26+
27+
28+
## Prerequisite
29+
30+
1. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
31+
32+
2. Install dependencies:
33+
34+
```bash
35+
pip install -e .
36+
```
37+
3. Create a `.env` file from `.env.example`, and fill `OPENAI_API_KEY`.
38+
39+
## Run
40+
41+
Update the index, which also sets up the tables on the first run:
42+
43+
```bash
44+
cocoindex update --setup main.py
45+
```
46+
47+
You can also run the command with `-L`, which will watch for file changes and update the index automatically.
48+
49+
```bash
50+
cocoindex update --setup -L main.py
51+
```
52+
53+
## CocoInsight
54+
CocoInsight (free beta for now) helps you troubleshoot index generation and understand the data lineage of the pipeline. It connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:
55+
56+
```
57+
cocoindex server -ci main.py
58+
```
59+
60+
Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
