docs/docs/core/flow_def.mdx
This `@cocoindex.flow_def` decorator declares this function as a CocoIndex flow.
It takes two arguments:

* `flow_builder`: a `FlowBuilder` object to help build the flow.
* `data_scope`: a `DataScope` object, representing the top-level data scope. Any data created by the flow should be added to it.

Alternatively, for more flexibility (e.g. you want to do this conditionally or generate a dynamic name), you can explicitly call the `cocoindex.open_flow()` method:
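The code sample for this call is not shown in this excerpt; below is a minimal sketch, assuming the `open_flow(name, flow_def_fn)` signature and an illustrative flow name:

```python
import cocoindex

def demo_flow_def(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Build the flow here, e.g. add sources to `data_scope` (elided in this sketch).
    ...

# Open the flow explicitly, e.g. conditionally or with a dynamically generated name.
demo_flow = cocoindex.open_flow("DemoFlow", demo_flow_def)
```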
If the data slice has `Struct` type, you can obtain a data slice on a specific sub field of it, similar to getting a field of a data scope.
A **data collector** can be added from a specific data scope, and it collects multiple entries.
Call its `collect()` method to collect a specific entry, which can have multiple fields.
Each field has a name as specified by the argument name, and a value in one of the following representations:

* A `DataSlice`.
* An enum `cocoindex.GeneratedField.UUID` indicating its value is a UUID automatically generated by the engine.
  The UUID will remain stable when other collected input values are unchanged.
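As a hedged sketch of how `collect()` is typically used inside a flow (the flow name, source fields `documents`/`chunks`, and the field names `text`/`embedding` are illustrative assumptions, not part of the API):

```python
import cocoindex

@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Source and transformations are elided; `documents`/`chunks` are illustrative.
    doc_embeddings = data_scope.add_collector()
    with data_scope["documents"].row() as doc:
        with doc["chunks"].row() as chunk:
            doc_embeddings.collect(
                id=cocoindex.GeneratedField.UUID,  # engine-generated UUID, stable across runs
                text=chunk["text"],
                embedding=chunk["embedding"],
            )
```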
:::note
A *target spec* needs to be provided for any export operation, to describe the target.
Export must happen at the top level of a flow, i.e. not within any child scopes created by "for each row". It takes the following arguments:

* `name`: the name to identify the export target.
* `target_spec`: the target spec as the export target.
* `attachments` (optional): additional attachments for the export target.
  Different targets support different attachments.
  For example, `Postgres` supports `PostgresSqlCommand`, which can be used to run arbitrary SQL statements for the export target, after the target is created or before the target is dropped.
* `setup_by_user` (optional): whether the export target is set up by the user.
  By default, CocoIndex manages the target setup (see [Setup / drop flow](/docs/core/flow_methods#setupdrop-flow)), e.g. it creates related tables/collections/etc. with a compatible schema, and updates them upon change.
  If `True`, the export target will be managed by users, and users are responsible for creating the target and updating it upon change.
* Fields to configure [storage indexes](#storage-indexes). `primary_key_fields` is required, and all others are optional.
<Tabs>
<TabItem value="python" label="Python" default>
App namespaces can be used to organize flows across different environments (e.g., dev, staging, production).
In the code, you can call `flow.get_app_namespace()` to get the app namespace, and use it to name certain backends. It takes the following arguments:

* `trailing_delimiter` (optional): a string to append to the app namespace when it's not empty.

e.g. when the current app namespace is `Staging`, `flow.get_app_namespace(trailing_delimiter='.')` will return `Staging.`.
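The `trailing_delimiter` behavior can be illustrated with a plain-Python stand-in (a hypothetical helper mirroring the documented semantics, not the real `Flow` method):

```python
def get_app_namespace(app_namespace: str, trailing_delimiter: str = "") -> str:
    """Return the app namespace, appending the delimiter only when non-empty."""
    if not app_namespace:
        return ""
    return app_namespace + trailing_delimiter

print(get_app_namespace("Staging", trailing_delimiter="."))  # Staging.
print(get_app_namespace("", trailing_delimiter="."))         # (empty string)
```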
It will use `Staging__doc_embeddings` as the collection name if the current app namespace is `Staging`.
CocoIndex processes data in parallel to maximize throughput, but unconstrained parallelism can overwhelm your system.
Processing too many items simultaneously can lead to:

* **Memory exhaustion**: Large datasets loaded concurrently can consume excessive RAM
* **Resource contention**: Too many parallel operations competing for CPU, disk I/O, or network bandwidth
* **System instability**: High concurrency can cause timeouts, crashes, or degraded performance

To prevent these issues, CocoIndex provides concurrency controls that limit how many data items are processed simultaneously.

#### Concurrency Options

You can control processing concurrency using these options:

* `max_inflight_rows`: Limits the maximum number of data rows being processed concurrently
* `max_inflight_bytes`: Limits the total memory footprint of data being processed concurrently (measured in bytes)

When these limits are reached, CocoIndex will pause loading new data until some of the current processing completes, ensuring your system remains stable.

#### Where to Apply Concurrency Controls

These concurrency options can be configured at different levels:

* **Source level** via [`FlowBuilder.add_source()`](#import-from-source): Controls how many rows from a data source are processed simultaneously. This prevents overwhelming your system when ingesting large datasets.

  You can also set global limits across all sources and flows using [`GlobalExecutionOptions`](/docs/core/settings#globalexecutionoptions) or the environment variables [`COCOINDEX_SOURCE_MAX_INFLIGHT_ROWS`](/docs/core/settings#list-of-environment-variables)/[`COCOINDEX_SOURCE_MAX_INFLIGHT_BYTES`](/docs/core/settings#list-of-environment-variables).
  When both global and per-source limits are specified, both limits are enforced independently - a new row can only be processed if there's available capacity in both the global budget (shared across all sources) and the per-source budget (specific to that source).

* **Row iteration level** via [`DataSlice.row()`](#for-each-row): Provides fine-grained control over parallel processing within nested data structures, allowing you to tune concurrency at any level of your data hierarchy.
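A hedged sketch showing both levels in one flow definition (the flow name, `LocalFile` source path, and the specific limit values are illustrative assumptions):

```python
import cocoindex

@cocoindex.flow_def(name="ThrottledFlow")
def throttled_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Source level: cap how many source rows (and how many bytes) are in flight.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="docs"),  # illustrative source
        max_inflight_rows=64,
        max_inflight_bytes=256 * 1024 * 1024,  # 256 MiB
    )
    # Row iteration level: fine-grained limit within nested row processing.
    with data_scope["documents"].row(max_inflight_rows=16) as doc:
        ...  # per-row transformations elided
```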
:::note
It's usually used for targets, where key stability is important for backend cleanup.
Operation spec is the default way to configure sources, functions and targets. But it has the following limitations:

* The spec isn't supposed to contain secret information, and it's frequently shown in various places, e.g. `cocoindex show`.
* For targets, once an operation is removed after a flow definition code change, the spec is also gone.
  But we still need to be able to drop the persistent backend (e.g. a table) when we [setup / drop the flow](/docs/core/flow_methods#setupdrop-flow).

The auth registry is introduced to solve the problems above.

#### Auth Entry

An auth entry is an entry in the auth registry with an explicit key.

* You can create a new *auth entry* with a key and a value.
* You can reference the entry by the key, and pass it as part of the spec for certain operations, e.g. `Neo4j` takes a `connection` field in the form of an auth entry reference.
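A hedged sketch of registering and referencing an auth entry (the registration function name `add_auth_entry`, the connection class, and all field values are assumptions for illustration):

```python
import os
import cocoindex

# Register an auth entry under an explicit key; the returned reference
# can be passed into operation specs instead of the raw secret.
neo4j_conn = cocoindex.add_auth_entry(
    "my_neo4j_conn",
    cocoindex.targets.Neo4jConnection(
        uri="bolt://localhost:7687",
        user="neo4j",
        password=os.environ["NEO4J_PASSWORD"],  # keep secrets out of the spec
    ),
)
# Then pass `neo4j_conn` to a spec field that takes an auth entry
# reference, e.g. the `connection` field of the Neo4j target.
```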
docs/docs/targets/postgres.md
CocoIndex automatically strips U+0000 (NUL) characters from strings before exporting them to Postgres.
The spec takes the following fields:

* `database` ([auth reference](/docs/core/flow_def#auth-registry) to `DatabaseConnectionSpec`, optional): The connection to the Postgres database.
  See [DatabaseConnectionSpec](/docs/core/settings#databaseconnectionspec) for its specific fields.
  If not provided, will use the same database as the [internal storage](/docs/core/basics#internal-storage).

* `table_name` (`str`, optional): The name of the table to store to. If unspecified, will use the table name `[${AppNamespace}__]${FlowName}__${TargetName}`, e.g. `DemoFlow__doc_embeddings` or `Staging__DemoFlow__doc_embeddings`.

* `schema` (`str`, optional): The PostgreSQL schema to create the table in. If unspecified, the table will be created in the default schema (usually `public`). When specified, `table_name` must also be explicitly specified. CocoIndex will automatically create the schema if it doesn't exist.
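The default table name pattern above can be computed mechanically; a minimal plain-Python sketch (the helper name is hypothetical, not part of the CocoIndex API):

```python
def default_table_name(flow_name: str, target_name: str, app_namespace: str = "") -> str:
    """Build `[${AppNamespace}__]${FlowName}__${TargetName}` as described above."""
    prefix = f"{app_namespace}__" if app_namespace else ""
    return f"{prefix}{flow_name}__{target_name}"

print(default_table_name("DemoFlow", "doc_embeddings"))             # DemoFlow__doc_embeddings
print(default_table_name("DemoFlow", "doc_embeddings", "Staging"))  # Staging__DemoFlow__doc_embeddings
```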
## Attachments

### PostgresSqlCommand

Execute arbitrary Postgres SQL during flow setup, with an optional SQL statement to undo it when the attachment or target is removed.

This attachment is useful for capabilities not natively modeled by the target spec, such as creating specialized indexes, triggers, or grants.

Fields:

* `name` (`str`, required): An identifier for this attachment on the target. Unique within the target.
* `setup_sql` (`str`, required): SQL to execute during setup.
* `teardown_sql` (`str`, optional): SQL to execute on removal/drop.

Notes about `setup_sql` and `teardown_sql`:

* Multiple statements are allowed in both `setup_sql` and `teardown_sql`. Use `;` to separate them.
* Both `setup_sql` and `teardown_sql` are expected to be idempotent, e.g. use statements like `CREATE ... IF NOT EXISTS` and `DROP ... IF EXISTS`.
* The `setup_sql` is expected to have an "upsert" behavior. If you update `setup_sql`, the updated `setup_sql` will be executed during setup.
* The `teardown_sql` is saved by CocoIndex, so it'll be executed when the attachment no longer exists. If you update `teardown_sql`, the updated `teardown_sql` will be saved and executed (instead of the previous one) during teardown.