Commit 14da9da

feat(setup-api): update and officially expose setup API (#681)
* feat(setup-api): update and officially expose setup API
* feat(setup): `cocoindex server` supports `--setup`
* docs: add new setup/drop APIs
1 parent 559abf6 commit 14da9da

8 files changed, +225 -113 lines changed

docs/docs/core/basics.md

Lines changed: 1 addition & 1 deletion
@@ -101,4 +101,4 @@ As an indexing flow is long-lived, it needs to store intermediate data to keep t
101101
CocoIndex uses internal storage for this purpose.
102102

103103
Currently, CocoIndex uses Postgres database as the internal storage.
104-
See [Settings](settings#databaseconnectionspec) for configuring its location, and `cocoindex setup` CLI command (see [CocoIndex CLI](cli)) creates tables for the internal storage.
104+
See [Settings](settings#databaseconnectionspec) for configuring its location. The internal storage is managed by CocoIndex; see [Setup / drop flow](/docs/core/flow_methods#setup--drop-flow) for more details.

docs/docs/core/flow_def.mdx

Lines changed: 6 additions & 6 deletions
@@ -256,7 +256,7 @@ Export must happen at the top level of a flow, i.e. not within any child scopes
256256
* `target_spec`: the target spec as the export target.
257257
* `setup_by_user` (optional):
258258
whether the export target is setup by user.
259-
By default, CocoIndex is managing the target setup (surfaced by the `cocoindex setup` CLI subcommand), e.g. create related tables/collections/etc. with compatible schema, and update them upon change.
259+
By default, CocoIndex is managing the target setup (see [Setup / drop flow](/docs/core/flow_methods#setup--drop-flow)), e.g. create related tables/collections/etc. with compatible schema, and update them upon change.
260260
If `True`, the export target will be managed by users, and users are responsible for creating the target and updating it upon change.
261261
* Fields to configure [storage indexes](#storage-indexes). `primary_key_fields` is required, and all others are optional.
262262

@@ -278,7 +278,7 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
278278
</TabItem>
279279
</Tabs>
280280

281-
The target is managed by CocoIndex, i.e. it'll be created by [CocoIndex CLI](/docs/core/cli) when you run `cocoindex setup`, and the data will be automatically updated (including stale data removal) when updating the index.
281+
The target is managed by CocoIndex, i.e. it'll be created or dropped when you [set up / drop the flow](/docs/core/flow_methods#setup--drop-flow), and the data will be automatically updated (including stale data removal) when updating the index.
282282
The `name` for the same target should remain stable across different runs.
283283
If it changes, CocoIndex will treat it as an old target removed and a new one created, and perform setup changes and reindexing accordingly.
284284

@@ -370,11 +370,11 @@ flow_builder.declare(
370370

371371
CocoIndex manages an auth registry. It's an in-memory key-value store, mainly to store authentication information for a backend.
372372

373-
Operation spec is the default way to configure a backend. But it has the following limitations:
373+
Operation spec is the default way to configure a persistent backend. But it has the following limitations:
374374

375375
* The spec isn't supposed to contain secret information, and it's frequently shown in various places, e.g. `cocoindex show`.
376376
* Once an operation is removed after flow definition code change, the spec is also gone.
377-
But we still need to be able to drop the backend (e.g. a table) by `cocoindex setup` or `cocoindex drop`.
377+
But we still need to be able to drop the backend (e.g. a table) when you [set up / drop the flow](/docs/core/flow_methods#setup--drop-flow).
378378

379379
Auth registry is introduced to solve the problems above. It works as follows:
380380

@@ -423,5 +423,5 @@ Note that CocoIndex backends use the key of an auth entry to identify the backen
423423
* Keep the key stable.
424424
If the key doesn't change, it's considered to be the same backend (even if the underlying way to connect/authenticate change).
425425

426-
* If a key is no longer referenced in any operation spec, keep it until the next `cocoindex setup` or `cocoindex drop`,
427-
so that when cocoindex will be able to perform cleanups.
426+
* If a key is no longer referenced in any operation spec, keep it until the next flow setup / drop action,
427+
so that cocoindex will be able to clean up the backends.

docs/docs/core/flow_methods.mdx

Lines changed: 85 additions & 1 deletion
@@ -34,6 +34,71 @@ It creates a `demo_flow` object in `cocoindex.Flow` type.
3434
</TabItem>
3535
</Tabs>
3636

37+
## Setup / drop flow
38+
39+
For a flow, its persistent backends need to be ready before it can run, including:
40+
41+
* [Internal storage](/docs/core/basics#internal-storage) for CocoIndex.
42+
* Backend entities for targets exported by the flow, e.g. a table (in relational databases), a collection (in some vector databases), etc.
43+
44+
The desired state of the backends for a flow is derived based on the flow definition itself.
45+
CocoIndex supports two types of actions to manage the persistent backends automatically:
46+
47+
* *Setup* a flow, which will change the backends owned by the flow to the desired state, e.g. create new tables for a new flow, drop an existing table if the corresponding target is gone, add a new column to a target table if a new field is collected, etc. It's a no-op if the backends are already in the desired state.
48+
49+
* *Drop* a flow, which will drop all backends owned by the flow. It's a no-op if there are no existing backends owned by the flow (e.g. never set up or already dropped).
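
The two actions above can be pictured with a small toy model. This is only a sketch of the desired-state idea, not CocoIndex's implementation: the `ToyBackends` class and the backend names `internal_storage` and `doc_embeddings` are hypothetical, and real backends involve schemas and databases rather than a set of names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBackends:
    """Hypothetical stand-in for the backends owned by one flow."""

    existing: set[str] = field(default_factory=set)

    def setup(self, desired: set[str]) -> list[str]:
        """Move `existing` to `desired` and return the actions taken."""
        actions = [f"create {name}" for name in sorted(desired - self.existing)]
        actions += [f"drop {name}" for name in sorted(self.existing - desired)]
        self.existing = set(desired)
        return actions

    def drop(self) -> list[str]:
        """Drop everything the flow owns and return the actions taken."""
        actions = [f"drop {name}" for name in sorted(self.existing)]
        self.existing = set()
        return actions


backends = ToyBackends()
# First setup creates everything the flow definition asks for.
first = backends.setup({"internal_storage", "doc_embeddings"})
# Same desired state again: nothing to do.
second = backends.setup({"internal_storage", "doc_embeddings"})
# A target was removed from the flow definition: its backend is dropped.
third = backends.setup({"internal_storage"})
# Drop removes what is left; dropping again is a no-op.
dropped = backends.drop()
dropped_again = backends.drop()
print(first, second, third, dropped, dropped_again)
```

In this model, both actions are idempotent: repeating either one against an unchanged state yields an empty action list.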
50+
51+
### CLI
52+
53+
`cocoindex setup` subcommand will setup all flows.
54+
`cocoindex update` and `cocoindex server` also provide a `--setup` option to setup the flow if needed before performing the main action of updating or starting the server.
55+
56+
`cocoindex drop` subcommand will drop all flows.
57+
58+
### Library API
59+
60+
<Tabs>
61+
<TabItem value="python" label="Python">
62+
63+
`Flow` provides the following APIs to setup / drop individual flows:
64+
65+
* `setup(report_to_stdout: bool = False)`: Setup the flow.
66+
* `drop(report_to_stdout: bool = False)`: Drop the flow.
67+
68+
For example:
69+
70+
```python
71+
demo_flow.setup(report_to_stdout=True)
72+
demo_flow.drop(report_to_stdout=True)
73+
```
74+
75+
We also provide the following asynchronous versions of the APIs:
76+
77+
* `setup_async(report_to_stdout: bool = False)`: Setup the flow asynchronously.
78+
* `drop_async(report_to_stdout: bool = False)`: Drop the flow asynchronously.
79+
80+
For example:
81+
82+
```python
83+
await demo_flow.setup_async(report_to_stdout=True)
84+
await demo_flow.drop_async(report_to_stdout=True)
85+
```
86+
87+
88+
Besides, CocoIndex also provides APIs to setup / drop all flows at once:
89+
90+
* `setup_all_flows(report_to_stdout: bool = False)`: Setup all flows.
91+
* `drop_all_flows(report_to_stdout: bool = False)`: Drop all flows.
92+
93+
For example:
94+
95+
```python
96+
cocoindex.setup_all_flows(report_to_stdout=True)
97+
cocoindex.drop_all_flows(report_to_stdout=True)
98+
```
99+
100+
</TabItem>
101+
</Tabs>
37102

38103
## Build / update target data
39104

@@ -71,12 +136,18 @@ Once it's done, the target data is fresh up to the moment when the function is c
71136
cocoindex update main.py
72137
```
73138

139+
With a `--setup` option, it will also setup the flow first if needed.
140+
141+
```sh
142+
cocoindex update --setup main.py
143+
```
144+
74145
#### Library API
75146

76147
<Tabs>
77148
<TabItem value="python" label="Python">
78149

79-
The `update()` async method creates/updates data in the target.
150+
The `Flow.update()` method creates/updates data in the target.
80151

81152
Once the function returns, the target data is fresh up to the moment when the function is called.
82153

@@ -85,6 +156,13 @@ stats = demo_flow.update()
85156
print(stats)
86157
```
87158

159+
`update_async()` is the asynchronous version of `update()`.
160+
161+
```python
162+
stats = await demo_flow.update_async()
163+
print(stats)
164+
```
165+
88166
</TabItem>
89167
</Tabs>
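
A library-side counterpart of `cocoindex update --setup` might look like the sketch below. It relies only on the `setup(report_to_stdout=...)` and `update()` methods documented in this commit; the helper `setup_then_update`, the `SupportsSetupAndUpdate` protocol, and the `RecordingFlow` stub are hypothetical names introduced for illustration, and the stub stands in for a real `cocoindex.Flow` so the example runs without a database.

```python
from typing import Any, Protocol


class SupportsSetupAndUpdate(Protocol):
    """Anything exposing the two Flow methods used here (hypothetical protocol)."""

    def setup(self, report_to_stdout: bool = False) -> None: ...

    def update(self) -> Any: ...


def setup_then_update(flow: SupportsSetupAndUpdate, report_to_stdout: bool = False) -> Any:
    """Bring the flow's backends to the desired state, then run a one-time update."""
    flow.setup(report_to_stdout=report_to_stdout)
    return flow.update()


class RecordingFlow:
    """Stub used only for this demonstration; it records the call order."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def setup(self, report_to_stdout: bool = False) -> None:
        self.calls.append("setup")

    def update(self) -> str:
        self.calls.append("update")
        return "stats placeholder"


stub = RecordingFlow()
result = setup_then_update(stub)
print(stub.calls, result)
```

With a real flow, the call would presumably be `setup_then_update(demo_flow, report_to_stdout=True)`; whether setup prompts for confirmation in that case is not stated in this commit.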
90168

@@ -111,6 +189,12 @@ cocoindex update main.py -L
111189
If there's at least one data source with change capture mechanism enabled, it will keep running until aborted (e.g. by `Ctrl-C`).
112190
Otherwise, it falls back to the same behavior as one time update, and will finish after a one-time update is done.
113191

192+
With a `--setup` option, it will also setup the flow first if needed.
193+
194+
```sh
195+
cocoindex update main.py -L --setup
196+
```
197+
114198
#### Library API
115199

116200
<Tabs>

docs/docs/getting_started/quickstart.md

Lines changed: 5 additions & 16 deletions
@@ -121,27 +121,16 @@ Specify the database URL by environment variable:
121121
export COCOINDEX_DATABASE_URL="postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"
122122
```
123123

124-
### Step 3.1: Setup the index pipeline
125-
126-
We need to setup the index:
127-
128-
```bash
129-
cocoindex setup quickstart.py
130-
```
131-
132-
Enter `yes` and it will automatically create a few tables in the database.
133-
134-
Now we have tables needed by this CocoIndex flow.
135-
136-
### Step 3.2: Build the index
137-
138124
Now we're ready to build the index:
139125
140126
```bash
141-
cocoindex update quickstart.py
127+
cocoindex update --setup quickstart.py
142128
```
143129
144-
It will run for a few seconds and output the following statistics:
130+
If this is the first time you run it for this flow, CocoIndex will automatically create its persistent backends (tables in the database).
131+
CocoIndex will ask you to confirm the action; enter `yes` to proceed.
132+
133+
CocoIndex will run for a few seconds and populate the target table with data as declared by the flow. It will output the following statistics:
145134
146135
```
147136
documents: 3 added, 0 removed, 0 updated

python/cocoindex/__init__.py

Lines changed: 5 additions & 2 deletions
@@ -10,7 +10,8 @@
1010
from .flow import FlowBuilder, DataScope, DataSlice, Flow, transform_flow
1111
from .flow import flow_def
1212
from .flow import EvaluateAndDumpOptions, GeneratedField
13-
from .flow import update_all_flows_async, FlowLiveUpdater, FlowLiveUpdaterOptions
13+
from .flow import FlowLiveUpdater, FlowLiveUpdaterOptions
14+
from .flow import update_all_flows_async, setup_all_flows, drop_all_flows
1415
from .lib import init, start_server, stop, main_fn
1516
from .llm import LlmSpec, LlmApiType
1617
from .index import VectorSimilarityMetric, VectorIndexDef, IndexOptions
@@ -40,9 +41,11 @@
4041
"flow_def",
4142
"EvaluateAndDumpOptions",
4243
"GeneratedField",
43-
"update_all_flows_async",
4444
"FlowLiveUpdater",
4545
"FlowLiveUpdaterOptions",
46+
"update_all_flows_async",
47+
"setup_all_flows",
48+
"drop_all_flows",
4649
# Lib
4750
"init",
4851
"start_server",
