53 changes: 31 additions & 22 deletions docs/docs/core/basics.md
@@ -1,6 +1,6 @@
---
title: Basics
description: CocoIndex Basics
description: "CocoIndex basic concepts: indexing flow, data, operations, data updates, etc."
---

# CocoIndex Basics
@@ -9,7 +9,7 @@ An **index** is a collection of data stored in a way that is easy for retrieval.

CocoIndex is an ETL framework for building indexes from specified data sources, a.k.a. indexing. It also offers utilities for users to retrieve data from the indexes.

## Indexing Flow
## Indexing flow

An indexing flow extracts data from specified data sources, applies specified transformations, and puts the transformed data into specified storage for later retrieval.

@@ -36,7 +36,7 @@ An **operation** in an indexing flow defines a step in the flow. An operation is
* **Action**, which defines the behavior of the operation, e.g. *import*, *transform*, *for each*, *collect* and *export*.
See [Flow Definition](flow_def) for more details for each action.

* Some actions (i.e. "import", "transform" and "export") require an **Operation Spec**, which describes the specific behavior of the operation, e.g. a source to import from, a function describing the transformation behavior, a storage to export to as an index.
* Some actions (i.e. "import", "transform" and "export") require an **Operation Spec**, which describes the specific behavior of the operation, e.g. a source to import from, a function describing the transformation behavior, a target storage to export to (as an index).
* Each operation spec has an **operation type**, e.g. `LocalFile` (data source), `SplitRecursively` (function), `SentenceTransformerEmbed` (function), `Postgres` (storage).
* CocoIndex framework maintains a set of supported operation types. Users can also implement their own.

@@ -60,31 +60,40 @@ This shows schema and example data for the indexing flow:

![Data Example](data_example.svg)

### Life Cycle of an Indexing Flow
### Life cycle of an indexing flow

An indexing flow, once set up, maintains a long-lived relationship between source data and indexes. This means:
An indexing flow, once set up, maintains a long-lived relationship between data source and data in target storage. This means:

1. The target storage created by the flow remains available for querying at any time

2. As source data changes (new data added, existing data updated or deleted), data in the target storage is updated to reflect those changes,
   at a pace determined by the update mode:

* **One time update**: Once triggered, CocoIndex updates the target data to reflect the version of source data up to the current moment.
* **Live update**: CocoIndex continuously watches the source data and updates the target data accordingly.

See more details in the [build / update target data](flow_methods#build--update-target-data) section.

3. CocoIndex intelligently manages these updates by:
* Determining which parts of the target data need to be recomputed
* Reusing existing computations where possible
* Only reprocessing the minimum necessary data

1. The indexes created by the flow remain available for querying at any time
2. When source data changes, the indexes are automatically updated to reflect those changes
3. CocoIndex intelligently manages these updates by:
- Determining which parts of the index need to be recomputed
- Reusing existing computations where possible
- Only reprocessing the minimum necessary data

You can think of an indexing flow as similar to formulas in a spreadsheet:

- In a spreadsheet, you define formulas that transform input cells into output cells
- When input values change, the spreadsheet automatically recalculates affected outputs
- You focus on defining the transformation logic, not managing updates
* In a spreadsheet, you define formulas that transform input cells into output cells
* When input values change, the spreadsheet recalculates affected outputs
* You focus on defining the transformation logic, not managing updates

CocoIndex works the same way, but with more powerful capabilities:

- Instead of flat tables, CocoIndex models data in nested data structures, making it more natural to model complex data
- Instead of simple cell-level formulas, you have operations like "for each" to apply the same formula across rows without repeating yourself
* Instead of flat tables, CocoIndex models data in nested data structures, making it more natural to model complex data
* Instead of simple cell-level formulas, you have operations like "for each" to apply the same formula across rows without repeating yourself

This means when writing your flow operations, you can treat source data as if it were static - focusing purely on defining the transformation logic. CocoIndex takes care of maintaining the dynamic relationship between sources and indexes behind the scenes.
This means when writing your flow operations, you can treat source data as if it were static - focusing purely on defining the transformation logic. CocoIndex takes care of maintaining the dynamic relationship between sources and target data behind the scenes.
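
The spreadsheet analogy can be made concrete with a toy model. The sketch below is not CocoIndex's implementation: `IncrementalIndex`, its `update()` method and the content-hash cache are hypothetical names, used only to illustrate recomputing the rows whose source content changed and reusing the rest.

```python
import hashlib
from typing import Callable, Dict


class IncrementalIndex:
    """Toy model of incremental updates (hypothetical, not the CocoIndex API)."""

    def __init__(self, transform: Callable[[str], str]):
        self.transform = transform
        self.fingerprints: Dict[str, str] = {}  # source key -> content hash
        self.target: Dict[str, str] = {}        # source key -> transformed value
        self.recomputed = 0                     # number of transform calls so far

    def update(self, source: Dict[str, str]) -> None:
        # Drop target rows whose source rows were deleted.
        for key in list(self.target):
            if key not in source:
                del self.target[key]
                del self.fingerprints[key]
        # Recompute only new or changed rows; reuse everything else.
        for key, content in source.items():
            fingerprint = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self.fingerprints.get(key) == fingerprint:
                continue
            self.target[key] = self.transform(content)
            self.fingerprints[key] = fingerprint
            self.recomputed += 1


index = IncrementalIndex(transform=str.upper)
index.update({"a.txt": "hello", "b.txt": "world"})
# Second run: "a.txt" is unchanged, "b.txt" changed, "c.txt" is new.
index.update({"a.txt": "hello", "b.txt": "world!", "c.txt": "new"})
```

The flow author only supplies `transform`; deciding what to recompute is left to the surrounding machinery, which is the division of labor the paragraph above describes.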

### Internal Storage
### Internal storage

As an indexing flow is long-lived, it needs to store intermediate data to keep track of the states.
CocoIndex uses internal storage for this purpose.
@@ -94,9 +103,9 @@ See [Initialization](initialization) for configuring its location, and `cocoinde

## Retrieval

There are two ways to retrieve data from indexes built by an indexing flow:
There are two ways to retrieve data from target storage built by an indexing flow:

* Query the underlying index storage directly for maximum flexibility.
* Use CocoIndex *query handlers* for a more convenient experience with built-in tooling support (e.g. CocoInsight) to understand query performance against the index.
* Query the underlying target storage directly for maximum flexibility.
* Use CocoIndex *query handlers* for a more convenient experience with built-in tooling support (e.g. CocoInsight) to understand query performance against the target data.

Query handlers are tied to specific indexing flows. They accept query inputs, transform them by defined operations, and retrieve matching data from the index storage that was created by the flow.
Query handlers are tied to specific indexing flows. They accept query inputs, transform them by defined operations, and retrieve matching data from the target storage that was created by the flow.
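
As a rough illustration of the query handler idea, the sketch below applies a transformation to the query and then ranks stored rows by similarity. Everything here is an assumption made for the example: `make_query_handler`, `toy_embed` and the in-memory `rows` list are hypothetical and are not part of the CocoIndex API.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_query_handler(
    embed: Callable[[str], Vector], rows: List[Dict]
) -> Callable[..., List[Tuple[str, float]]]:
    """Build a handler that embeds the query, then ranks the stored rows."""

    def handler(query: str, top_k: int = 1) -> List[Tuple[str, float]]:
        query_vector = embed(query)
        scored = [
            (row["text"], cosine_similarity(query_vector, row["embedding"]))
            for row in rows
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

    return handler


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding function: vowel counts."""
    return [float(text.lower().count(ch)) for ch in "aeiou"]


# Stand-in for target storage: rows embedded with the same function.
rows = [{"text": t, "embedding": toy_embed(t)} for t in ["banana", "queue", "igloo"]]
search = make_query_handler(toy_embed, rows)
results = search("bandana", top_k=1)
```

The point of the sketch is that the query goes through the same transformation as the indexed data, so both sides are comparable at retrieval time.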
110 changes: 75 additions & 35 deletions docs/docs/core/flow_def.mdx
@@ -1,14 +1,15 @@
---
title: Flow Definition
description: CocoIndex Flow Definition
description: Define a CocoIndex flow by specifying sources, transformations and storages, and connecting their input/output data.
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# CocoIndex Flow Definition

In CocoIndex, to define an indexing flow, you provide a function to construct the flow, by adding operations and connecting them with fields.
In CocoIndex, to define an indexing flow, you provide a function that imports source data, transforms it, and puts the results into target storage (sinks).
You connect input/output of these operations with fields of data scopes.

## Entry Point

@@ -43,7 +44,7 @@ demo_flow = cocoindex.flow.add_flow_def("DemoFlow", demo_flow_def)
```

In both cases, `demo_flow` will be an object of type `cocoindex.Flow`.
See [Flow Methods](/docs/core/flow_methods) for more details on it.
See [Flow Running](/docs/core/flow_methods) for more details on it.

</TabItem>
</Tabs>
@@ -52,7 +53,7 @@ See [Flow Methods](/docs/core/flow_methods) for more details on it.

The `FlowBuilder` object is the starting point to construct a flow.

### Import From Source
### Import from source

`FlowBuilder` provides an `add_source()` method to import data from external sources.
A *source spec* needs to be provided for any import operation, to describe the source and parameters related to the source.
@@ -64,7 +65,7 @@ Import must happen at the top level, and the field created by import must be in
```python
@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))
    ......
```

@@ -74,17 +75,56 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
`add_source()` returns a `DataSlice`. Once external data sources are imported, you can further transform them using methods exposed by these data objects, as discussed in the following sections.

We'll describe different data objects in the next few sections.
Note that the actual value of data is not available at the time when we define the flow: it's only available at runtime.

:::note

The actual value of data is not available at the time when we define the flow: it's only available at runtime.
In a flow definition, you can use a data representation as input for operations, but not access the actual value.

:::
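
One way to picture this note is a placeholder object that records operations while the flow is being defined and only touches real values later. The sketch below is a simplified mental model, not how CocoIndex implements `DataSlice`; `SymbolicSlice` and `evaluate()` are hypothetical names.

```python
from typing import Any, Callable, List, Optional


class SymbolicSlice:
    """Toy placeholder for data that only exists at runtime (hypothetical)."""

    def __init__(self, steps: Optional[List[Callable[[Any], Any]]] = None):
        self._steps = list(steps or [])

    def transform(self, fn: Callable[[Any], Any]) -> "SymbolicSlice":
        # Definition time: record the step, do not run it.
        return SymbolicSlice(self._steps + [fn])

    def evaluate(self, value: Any) -> Any:
        # Runtime: apply the recorded steps to a real value.
        for step in self._steps:
            value = step(value)
        return value


calls: List[str] = []


def shout(text: str) -> str:
    calls.append(text)  # lets us observe when the function actually runs
    return text.upper()


# "Flow definition": only the shape of the computation is described here.
content = SymbolicSlice()
summary = content.transform(str.strip).transform(shout)
calls_at_definition = len(calls)

# "Runtime": actual values flow through the recorded steps.
result = summary.evaluate("  hello  ")
```

While `summary` is being built, `shout` has not been called and there is no value to inspect; it only runs once `evaluate()` is given real data.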

#### Refresh interval

You can provide a `refresh_interval` argument.
When present, in the [live update mode](/docs/core/flow_methods#live-update), the data source will be refreshed at the specified interval.

<Tabs>
<TabItem value="python" label="Python" default>

The `refresh_interval` argument is of type `datetime.timedelta`. For example, this refreshes the data source every 1 minute:

```python
@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        DemoSourceSpec(...), refresh_interval=datetime.timedelta(minutes=1))
    ......
```

</TabItem>
</Tabs>

:::info

In live update mode, for each refresh, CocoIndex will traverse the data source to figure out the changes,
and only perform transformations on changed source keys.

:::
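
The traversal step can be sketched as a comparison of two snapshots of the source. This is an illustrative assumption, not CocoIndex's actual change detection: `diff_snapshots` and the per-key version marker (for example a modification time) are hypothetical.

```python
from typing import Dict, NamedTuple, Set


class Changes(NamedTuple):
    added: Set[str]
    updated: Set[str]
    deleted: Set[str]


def diff_snapshots(previous: Dict[str, int], current: Dict[str, int]) -> Changes:
    """Compare two traversals of a source, each mapping source key -> version marker."""
    added = set(current.keys() - previous.keys())
    deleted = set(previous.keys() - current.keys())
    updated = {
        key
        for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    }
    return Changes(added, updated, deleted)


previous = {"a.md": 1, "b.md": 1, "c.md": 1}
current = {"a.md": 1, "b.md": 2, "d.md": 1}
changes = diff_snapshots(previous, current)
# Only the keys in changes.added and changes.updated need to be transformed again;
# keys in changes.deleted need their target rows removed.
```

Here `a.md` is untouched and would be skipped entirely on this refresh.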

## Data Scope

A **data scope** represents data for a certain unit, e.g. the top level scope (involving all data for a flow), for a document, or for a chunk.
A data scope has a bunch of fields and collectors, and users can add new fields and collectors to it.

### Get or Add a Field

Get or add a field of a data scope (which is a data slice). Note that you cannot override an existing field.
You can get or add a field of a data scope (which is a data slice).

:::note

You cannot override an existing field.

:::
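
The no-override rule behaves like a write-once mapping. The toy class below only illustrates that behavior; `ToyScope` is hypothetical, and the exception type is an assumption rather than what CocoIndex actually raises.

```python
class ToyScope:
    """Write-once mapping of field names to values (hypothetical illustration)."""

    def __init__(self):
        self._fields = {}

    def __getitem__(self, name):
        return self._fields[name]

    def __setitem__(self, name, value):
        # Adding a new field is fine; replacing an existing one is rejected.
        if name in self._fields:
            raise ValueError(f"field {name!r} already exists and cannot be overridden")
        self._fields[name] = value


scope = ToyScope()
scope["documents"] = "imported documents"
```

Assigning `scope["documents"]` a second time would raise, so a derived value has to go into a new field name instead.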

<Tabs>
<TabItem value="python" label="Python" default>
@@ -95,20 +135,20 @@ Getting and setting a field of a data scope is done by the `[]` operator with a
@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):

    # Add "documents" to the top-level data scope.
    data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))

    # Each row of "documents" is a child scope.
    with data_scope["documents"].row() as document:

        # Get "content" from the document scope, transform, and add "summary" to scope.
        document["summary"] = document["content"].transform(DemoFunctionSpec(...))
```

</TabItem>
</Tabs>

### Add a Collector
### Add a collector

See [Data Collector](#data-collector) below for more details.

@@ -132,17 +172,17 @@ Other arguments can be passed in as positional arguments or keyword arguments, a
```python
@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...
    data_scope["field2"] = data_scope["field1"].transform(
        DemoFunctionSpec(...),
        arg1, arg2, ..., key0=kwarg0, key1=kwarg1, ...)
    ...
```

</TabItem>
</Tabs>

### For Each Row
### For each row

If the data slice has `Table` type, you can call the `row()` method to obtain a child scope representing each row, to apply operations on each row.

@@ -161,7 +201,7 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
</TabItem>
</Tabs>

### Get a Sub Field
### Get a sub field

If the data slice has `Struct` type, you can obtain a data slice on a specific sub field of it, similar to getting a field of a data scope.

@@ -192,14 +232,14 @@ For example,
```python
@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...
    demo_collector = data_scope.add_collector()
    with data_scope["documents"].row() as document:
        ...
        demo_collector.collect(id=cocoindex.GeneratedField.UUID,
                               filename=document["filename"],
                               summary=document["summary"])
    ...
```

</TabItem>
@@ -228,13 +268,13 @@ Export must happen at the top level of a flow, i.e. not within any child scopes
```python
@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...
    demo_collector = data_scope.add_collector()
    ...
    demo_collector.export(
        "demo_storage", DemoStorageSpec(...),
        primary_key_fields=["field1"],
        vector_index=[("field2", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```

</TabItem>