2 changes: 1 addition & 1 deletion docs/docs/core/basics.md
@@ -101,4 +101,4 @@ As an indexing flow is long-lived, it needs to store intermediate data to keep t
CocoIndex uses internal storage for this purpose.

Currently, CocoIndex uses Postgres database as the internal storage.
See [Settings](settings#databaseconnectionspec) for configuring its location, and `cocoindex setup` CLI command (see [CocoIndex CLI](cli)) creates tables for the internal storage.
See [Settings](settings#databaseconnectionspec) for configuring its location. The internal storage is managed by CocoIndex; see [Setup / drop flow](/docs/core/flow_methods#setup--drop-flow) for more details.
12 changes: 6 additions & 6 deletions docs/docs/core/flow_def.mdx
@@ -256,7 +256,7 @@ Export must happen at the top level of a flow, i.e. not within any child scopes
* `target_spec`: the target spec as the export target.
* `setup_by_user` (optional):
whether the export target is setup by user.
By default, CocoIndex is managing the target setup (surfaced by the `cocoindex setup` CLI subcommand), e.g. create related tables/collections/etc. with compatible schema, and update them upon change.
By default, CocoIndex manages the target setup (see [Setup / drop flow](/docs/core/flow_methods#setup--drop-flow)), e.g. creating related tables/collections/etc. with a compatible schema, and updating them upon change.
If `True`, the export target will be managed by users, and users are responsible for creating the target and updating it upon change.
* Fields to configure [storage indexes](#storage-indexes). `primary_key_fields` is required, and all others are optional.

@@ -278,7 +278,7 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
</TabItem>
</Tabs>

The target is managed by CocoIndex, i.e. it'll be created by [CocoIndex CLI](/docs/core/cli) when you run `cocoindex setup`, and the data will be automatically updated (including stale data removal) when updating the index.
The target is managed by CocoIndex, i.e. it'll be created or dropped when you [setup / drop the flow](/docs/core/flow_methods#setup--drop-flow), and the data will be automatically updated (including stale data removal) when updating the index.
The `name` for the same target should remain stable across different runs.
If it changes, CocoIndex will treat it as an old target removed and a new one created, and perform setup changes and reindexing accordingly.

@@ -370,11 +370,11 @@ flow_builder.declare(

CocoIndex manages an auth registry. It's an in-memory key-value store, mainly to store authentication information for a backend.

Operation spec is the default way to configure a backend. But it has the following limitations:
Operation spec is the default way to configure a persistent backend. But it has the following limitations:

* The spec isn't supposed to contain secret information, and it's frequently shown in various places, e.g. `cocoindex show`.
* Once an operation is removed after flow definition code change, the spec is also gone.
But we still need to be able to drop the backend (e.g. a table) by `cocoindex setup` or `cocoindex drop`.
But we still need to be able to drop the backend (e.g. a table) when you [setup / drop the flow](/docs/core/flow_methods#setup--drop-flow).

Auth registry is introduced to solve the problems above. It works as follows:

@@ -423,5 +423,5 @@ Note that CocoIndex backends use the key of an auth entry to identify the backen
* Keep the key stable.
If the key doesn't change, it's considered to be the same backend (even if the underlying way to connect/authenticate changes).

* If a key is no longer referenced in any operation spec, keep it until the next `cocoindex setup` or `cocoindex drop`,
so that when cocoindex will be able to perform cleanups.
* If a key is no longer referenced in any operation spec, keep it until the next flow setup / drop action,
so that CocoIndex will be able to clean up the backends.
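
The reason for keeping an unreferenced key around can be illustrated with a small sketch. This is not CocoIndex code: `plan_cleanup` and its arguments are hypothetical names, and the sketch simply assumes that a stale backend can only be dropped while its auth entry is still present in the registry.

```python
def plan_cleanup(
    existing_backends: set[str],
    referenced_keys: set[str],
    auth_registry: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Split stale backends into droppable and unreachable ones (hypothetical helper).

    A backend is stale when it exists but no operation spec references its key.
    It is treated as droppable only if its auth entry is still registered.
    """
    stale = sorted(existing_backends - referenced_keys)
    droppable = [key for key in stale if key in auth_registry]
    unreachable = [key for key in stale if key not in auth_registry]
    return droppable, unreachable


# The key "pg_old" is no longer referenced, but its entry was kept in the registry.
registry = {"pg_main": "connection info A", "pg_old": "connection info B"}
print(plan_cleanup({"pg_main", "pg_old"}, {"pg_main"}, registry))

# Removing the entry too early leaves the stale backend unreachable for cleanup.
del registry["pg_old"]
print(plan_cleanup({"pg_main", "pg_old"}, {"pg_main"}, registry))
```
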
86 changes: 85 additions & 1 deletion docs/docs/core/flow_methods.mdx
@@ -34,6 +34,71 @@ It creates a `demo_flow` object in `cocoindex.Flow` type.
</TabItem>
</Tabs>

## Setup / drop flow

For a flow, its persistent backends need to be ready before it can run, including:

* [Internal storage](/docs/core/basics#internal-storage) for CocoIndex.
* Backend entities for targets exported by the flow, e.g. a table (in relational databases), a collection (in some vector databases), etc.

The desired state of the backends for a flow is derived based on the flow definition itself.
CocoIndex supports two types of actions to manage the persistent backends automatically:

* *Setup* a flow, which will change the backends owned by the flow to the desired state, e.g. create new tables for a new flow, drop an existing table if the corresponding target is gone, add a new column to a target table if a new field is collected, etc. It's a no-op if the backends are already in the desired state.

* *Drop* a flow, which will drop all backends owned by the flow. It's a no-op if there are no existing backends owned by the flow (e.g. never setup or already dropped).
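
The two actions can be pictured as reconciling a current state against a desired state. The sketch below is a toy model only, not how CocoIndex implements it: `ToyBackendStore` and the table/column names are made up for illustration, and it covers just the cases listed above (create table, drop table, add column).

```python
from dataclasses import dataclass, field


@dataclass
class ToyBackendStore:
    """Hypothetical stand-in for the backends owned by one flow (not a CocoIndex API)."""

    # Maps a table name to the set of columns it currently has.
    tables: dict[str, set[str]] = field(default_factory=dict)

    def setup(self, desired: dict[str, set[str]]) -> list[str]:
        """Move the store to the desired state and return the actions taken."""
        actions: list[str] = []
        # Drop tables whose target is gone from the flow definition.
        for name in sorted(set(self.tables) - set(desired)):
            del self.tables[name]
            actions.append(f"drop table {name}")
        for name, columns in sorted(desired.items()):
            if name not in self.tables:
                self.tables[name] = set(columns)
                actions.append(f"create table {name}")
                continue
            # Add columns for newly collected fields.
            for column in sorted(columns - self.tables[name]):
                self.tables[name].add(column)
                actions.append(f"add column {name}.{column}")
        return actions

    def drop(self) -> list[str]:
        """Drop everything owned by the flow and return the actions taken."""
        actions = [f"drop table {name}" for name in sorted(self.tables)]
        self.tables.clear()
        return actions


store = ToyBackendStore()
print(store.setup({"doc_embeddings": {"id", "text"}}))  # first setup creates the table
print(store.setup({"doc_embeddings": {"id", "text"}}))  # second setup is a no-op
print(store.drop())                                     # drop removes the table
print(store.drop())                                     # dropping again is a no-op
```

Both methods return an empty list when nothing needs to change, which mirrors the no-op behavior described above.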

### CLI

`cocoindex setup` subcommand will setup all flows.
`cocoindex update` and `cocoindex server` also provide a `--setup` option to setup the flow if needed before performing the main action of updating or starting the server.

`cocoindex drop` subcommand will drop all flows.

### Library API

<Tabs>
<TabItem value="python" label="Python">

`Flow` provides the following APIs to setup / drop individual flows:

* `setup(report_to_stdout: bool = False)`: Setup the flow.
* `drop(report_to_stdout: bool = False)`: Drop the flow.

For example:

```python
demo_flow.setup(report_to_stdout=True)
demo_flow.drop(report_to_stdout=True)
```

We also provide the following asynchronous versions of the APIs:

* `setup_async(report_to_stdout: bool = False)`: Setup the flow asynchronously.
* `drop_async(report_to_stdout: bool = False)`: Drop the flow asynchronously.

For example:

```python
await demo_flow.setup_async(report_to_stdout=True)
await demo_flow.drop_async(report_to_stdout=True)
```
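
One possible use of the asynchronous versions is to set up several flows from a single event loop. The sketch below uses a hypothetical `StandInFlow` class that only mimics the `setup_async()` signature listed above, so it runs without CocoIndex or a database. Whether real flows can safely be set up concurrently is an assumption that this page does not confirm.

```python
import asyncio


class StandInFlow:
    """Hypothetical stand-in that only mimics the `setup_async()` method named above."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.is_set_up = False

    async def setup_async(self, report_to_stdout: bool = False) -> None:
        await asyncio.sleep(0)  # Yield control, as real I/O would.
        self.is_set_up = True
        if report_to_stdout:
            print(f"setup done for {self.name}")


async def setup_flows_concurrently(flows: list[StandInFlow]) -> None:
    # gather() schedules all coroutines and waits until every one has finished.
    await asyncio.gather(*(flow.setup_async(report_to_stdout=True) for flow in flows))


flows = [StandInFlow("flow_a"), StandInFlow("flow_b")]
asyncio.run(setup_flows_concurrently(flows))
```
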


Besides, CocoIndex also provides APIs to setup / drop all flows at once:

* `setup_all_flows(report_to_stdout: bool = False)`: Setup all flows.
* `drop_all_flows(report_to_stdout: bool = False)`: Drop all flows.

For example:

```python
cocoindex.setup_all_flows(report_to_stdout=True)
cocoindex.drop_all_flows(report_to_stdout=True)
```

</TabItem>
</Tabs>

## Build / update target data

@@ -71,12 +136,18 @@ Once it's done, the target data is fresh up to the moment when the function is c
cocoindex update main.py
```

With a `--setup` option, it will also setup the flow first if needed.

```sh
cocoindex update --setup main.py
```

#### Library API

<Tabs>
<TabItem value="python" label="Python">

The `update()` async method creates/updates data in the target.
The `Flow.update()` method creates/updates data in the target.

Once the function returns, the target data is fresh up to the moment when the function is called.

@@ -85,6 +156,13 @@ stats = demo_flow.update()
print(stats)
```

`update_async()` is the asynchronous version of `update()`.

```python
stats = await demo_flow.update_async()
print(stats)
```

</TabItem>
</Tabs>

@@ -111,6 +189,12 @@ cocoindex update main.py -L
If there's at least one data source with change capture mechanism enabled, it will keep running until aborted (e.g. by `Ctrl-C`).
Otherwise, it falls back to the same behavior as one time update, and will finish after a one-time update is done.

With a `--setup` option, it will also setup the flow first if needed.

```sh
cocoindex update main.py -L --setup
```
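
If the CLI is driven from a script (e.g. in CI), the command line can be assembled programmatically. The helper below is a hypothetical convenience, not part of CocoIndex. It uses only the `update` subcommand and the `-L` and `--setup` options shown above, and it only builds the argument list without running anything.

```python
def build_update_command(app_file: str, live: bool = False, setup: bool = False) -> list[str]:
    """Build an argument list for `cocoindex update` (hypothetical helper)."""
    command = ["cocoindex", "update"]
    if setup:
        # Ask the CLI to setup the flow first if needed.
        command.append("--setup")
    command.append(app_file)
    if live:
        # Keep running and apply changes as they arrive (live update mode).
        command.append("-L")
    return command


print(" ".join(build_update_command("main.py", live=True, setup=True)))
```

The resulting list can be passed to `subprocess.run()`. The examples above place options both before and after the file argument; this sketch picks one order.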

#### Library API

<Tabs>
21 changes: 5 additions & 16 deletions docs/docs/getting_started/quickstart.md
@@ -121,27 +121,16 @@ Specify the database URL by environment variable:
export COCOINDEX_DATABASE_URL="postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"
```

### Step 3.1: Setup the index pipeline

We need to setup the index:

```bash
cocoindex setup quickstart.py
```

Enter `yes` and it will automatically create a few tables in the database.

Now we have tables needed by this CocoIndex flow.

### Step 3.2: Build the index

Now we're ready to build the index:

```bash
cocoindex update quickstart.py
cocoindex update --setup quickstart.py
```

It will run for a few seconds and output the following statistics:
If this is the first time you run it for this flow, CocoIndex will automatically create its persistent backends (tables in the database).
CocoIndex will ask you to confirm the action; enter `yes` to proceed.

CocoIndex will run for a few seconds and populate the target table with data as declared by the flow. It will output the following statistics:

```
documents: 3 added, 0 removed, 0 updated
7 changes: 5 additions & 2 deletions python/cocoindex/__init__.py
@@ -10,7 +10,8 @@
from .flow import FlowBuilder, DataScope, DataSlice, Flow, transform_flow
from .flow import flow_def
from .flow import EvaluateAndDumpOptions, GeneratedField
from .flow import update_all_flows_async, FlowLiveUpdater, FlowLiveUpdaterOptions
from .flow import FlowLiveUpdater, FlowLiveUpdaterOptions
from .flow import update_all_flows_async, setup_all_flows, drop_all_flows
from .lib import init, start_server, stop, main_fn
from .llm import LlmSpec, LlmApiType
from .index import VectorSimilarityMetric, VectorIndexDef, IndexOptions
@@ -40,9 +41,11 @@
"flow_def",
"EvaluateAndDumpOptions",
"GeneratedField",
"update_all_flows_async",
"FlowLiveUpdater",
"FlowLiveUpdaterOptions",
"update_all_flows_async",
"setup_all_flows",
"drop_all_flows",
# Lib
"init",
"start_server",