
Commit 5f4c2f4

docs(postgres-sql-cmd): add docs for PostgresSqlCommand attachment (#1146)
1 parent c0827ca commit 5f4c2f4

2 files changed: +72 -30 lines changed

docs/docs/core/flow_def.mdx

Lines changed: 27 additions & 27 deletions
@@ -30,8 +30,8 @@ This `@cocoindex.flow_def` decorator declares this function as a CocoIndex flow

It takes two arguments:

-* `flow_builder`: a `FlowBuilder` object to help build the flow.
-* `data_scope`: a `DataScope` object, representing the top-level data scope. Any data created by the flow should be added to it.
+* `flow_builder`: a `FlowBuilder` object to help build the flow.
+* `data_scope`: a `DataScope` object, representing the top-level data scope. Any data created by the flow should be added to it.

Alternatively, for more flexibility (e.g. you want to do this conditionally or generate dynamic name), you can explicitly call the `cocoindex.open_flow()` method:
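For context, a minimal sketch of a flow definition using these two arguments (the `LocalFile` source, its path, and the `documents` field name are illustrative assumptions, not part of this commit):

```python
import cocoindex

@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Use `flow_builder` to import data, and attach whatever the flow creates
    # to the top-level `data_scope`.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files")
    )
```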

@@ -210,7 +210,6 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
</TabItem>
</Tabs>

-
### Get a sub field

If the data slice has `Struct` type, you can obtain a data slice on a specific sub field of it, similar to getting a field of a data scope.
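As a quick illustration (the slice and field names are hypothetical), getting a sub field mirrors the data-scope indexing syntax:

```python
# `doc` is a data slice of Struct type; indexing it yields a data slice
# for the `title` sub field.
title = doc["title"]
```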
@@ -224,8 +223,8 @@ A **data collector** can be added from a specific data scope, and it collects mu
Call its `collect()` method to collect a specific entry, which can have multiple fields.
Each field has a name as specified by the argument name, and a value in one of the following representations:

-* A `DataSlice`.
-* An enum `cocoindex.GeneratedField.UUID` indicating its value is an UUID automatically generated by the engine.
+* A `DataSlice`.
+* An enum `cocoindex.GeneratedField.UUID` indicating its value is an UUID automatically generated by the engine.
The uuid will remain stable when other collected input values are unchanged.

:::note
@@ -267,13 +266,16 @@ A *target spec* needs to be provided for any export operation, to describe the t

Export must happen at the top level of a flow, i.e. not within any child scopes created by "for each row". It takes the following arguments:

-* `name`: the name to identify the export target.
-* `target_spec`: the target spec as the export target.
-* `setup_by_user` (optional):
+* `name`: the name to identify the export target.
+* `target_spec`: the target spec as the export target.
+* `attachments` (optional): additional attachments for the export target.
+  Different targets support different attachments.
+  For example, `Postgres` supports `PostgresSqlCommand`, which can be used to configure arbitrary SQL statements for the export target, after the target is created or before the target is dropped.
+* `setup_by_user` (optional):
whether the export target is setup by user.
By default, CocoIndex is managing the target setup (see [Setup / drop flow](/docs/core/flow_methods#setupdrop-flow)), e.g. create related tables/collections/etc. with compatible schema, and update them upon change.
If `True`, the export target will be managed by users, and users are responsible for creating the target and updating it upon change.
-* Fields to configure [storage indexes](#storage-indexes). `primary_key_fields` is required, and all others are optional.
+* Fields to configure [storage indexes](#storage-indexes). `primary_key_fields` is required, and all others are optional.

<Tabs>
<TabItem value="python" label="Python" default>
@@ -334,7 +336,7 @@ to organize flows across different environments (e.g., dev, staging, production)

In the code, You can call `flow.get_app_namespace()` to get the app namespace, and use it to name certain backends. It takes the following arguments:

-* `trailing_delimiter` (optional): a string to append to the app namespace when it's not empty.
+* `trailing_delimiter` (optional): a string to append to the app namespace when it's not empty.

e.g. when the current app namespace is `Staging`, `flow.get_app_namespace(trailing_delimiter='.')` will return `Staging.`.
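For instance, a small sketch of deriving a backend name from the app namespace (the table name is illustrative):

```python
# With app namespace "Staging" this evaluates to "Staging__doc_embeddings";
# with an empty app namespace it is just "doc_embeddings".
table_name = flow.get_app_namespace(trailing_delimiter="__") + "doc_embeddings"
```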

@@ -364,31 +366,31 @@ It will use `Staging__doc_embeddings` as the collection name if the current app
CocoIndex processes data in parallel to maximize throughput, but unconstrained parallelism can overwhelm your system.
Processing too many items simultaneously can lead to:

-- **Memory exhaustion**: Large datasets loaded concurrently can consume excessive RAM
-- **Resource contention**: Too many parallel operations competing for CPU, disk I/O, or network bandwidth
-- **System instability**: High concurrency can cause timeouts, crashes, or degraded performance
+* **Memory exhaustion**: Large datasets loaded concurrently can consume excessive RAM
+* **Resource contention**: Too many parallel operations competing for CPU, disk I/O, or network bandwidth
+* **System instability**: High concurrency can cause timeouts, crashes, or degraded performance

To prevent these issues, CocoIndex provides concurrency controls that limit how many data items are processed simultaneously.

#### Concurrency Options

You can control processing concurrency using these options:

-* `max_inflight_rows`: Limits the maximum number of data rows being processed concurrently
-* `max_inflight_bytes`: Limits the total memory footprint of data being processed concurrently (measured in bytes)
+* `max_inflight_rows`: Limits the maximum number of data rows being processed concurrently
+* `max_inflight_bytes`: Limits the total memory footprint of data being processed concurrently (measured in bytes)

When these limits are reached, CocoIndex will pause loading new data until some of the current processing completes, ensuring your system remains stable.

#### Where to Apply Concurrency Controls

These concurrency options can be configured at different levels:

-* **Source level** via [`FlowBuilder.add_source()`](#import-from-source): Controls how many rows from a data source are processed simultaneously. This prevents overwhelming your system when ingesting large datasets.
+* **Source level** via [`FlowBuilder.add_source()`](#import-from-source): Controls how many rows from a data source are processed simultaneously. This prevents overwhelming your system when ingesting large datasets.

You can also set global limits across all sources and flows using [`GlobalExecutionOptions`](/docs/core/settings#globalexecutionoptions) or environment variables [`COCOINDEX_SOURCE_MAX_INFLIGHT_ROWS`](/docs/core/settings#list-of-environment-variables)/[`COCOINDEX_SOURCE_MAX_INFLIGHT_BYTES`](/docs/core/settings#list-of-environment-variables).
When both global and per-source limits are specified, both limits are enforced independently - a new row can only be processed if there's available capacity in both the global budget (shared across all sources) and the per-source budget (specific to that source).

-* **Row iteration level** via [`DataSlice.row()`](#for-each-row): Provides fine-grained control over parallel processing within nested data structures, allowing you to tune concurrency at any level of your data hierarchy.
+* **Row iteration level** via [`DataSlice.row()`](#for-each-row): Provides fine-grained control over parallel processing within nested data structures, allowing you to tune concurrency at any level of your data hierarchy.

:::note
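A sketch of passing these options at both levels (the source type, field names, and limit values are assumptions):

```python
# Source level: cap how many rows / how much data from this source is in flight at once.
data_scope["documents"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="markdown_files"),
    max_inflight_rows=256,
    max_inflight_bytes=64 * 1024 * 1024,  # 64 MiB
)

# Row iteration level: cap concurrency of the nested per-row processing.
with data_scope["documents"].row(max_inflight_rows=16) as doc:
    ...
```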

@@ -440,19 +442,18 @@ It's usually used for targets, where key stability is important for backend clea

Operation spec is the default way to configure sources, functions and targets. But it has the following limitations:

-* The spec isn't supposed to contain secret information, and it's frequently shown in various places, e.g. `cocoindex show`.
-* For targets, once an operation is removed after flow definition code change, the spec is also gone.
+* The spec isn't supposed to contain secret information, and it's frequently shown in various places, e.g. `cocoindex show`.
+* For targets, once an operation is removed after flow definition code change, the spec is also gone.
But we still need to be able to drop the persistent backend (e.g. a table) when [setup / drop flow](/docs/core/flow_methods#setupdrop-flow).

Auth registry is introduced to solve the problems above.

-
#### Auth Entry

An auth entry is an entry in the auth registry with an explicit key.

-* You can create new *auth entry* by a key and a value.
-* You can reference the entry by the key, and pass it as part of spec for certain operations. e.g. `Neo4j` takes `connection` field in the form of auth entry reference.
+* You can create new *auth entry* by a key and a value.
+* You can reference the entry by the key, and pass it as part of spec for certain operations. e.g. `Neo4j` takes `connection` field in the form of auth entry reference.

<Tabs>
<TabItem value="python" label="Python" default>
@@ -471,7 +472,7 @@ my_graph_conn = cocoindex.add_auth_entry(

Then reference it when building a spec that takes an auth entry:

-* You can either reference by the `AuthEntryReference[T]` object directly:
+* You can either reference by the `AuthEntryReference[T]` object directly:

```python
demo_collector.export(
@@ -480,7 +481,7 @@ Then reference it when building a spec that takes an auth entry:
)
```

-* You can also reference it by the key string, using `cocoindex.ref_auth_entry()` function:
+* You can also reference it by the key string, using `cocoindex.ref_auth_entry()` function:

```python
demo_collector.export(
@@ -493,10 +494,10 @@ Then reference it when building a spec that takes an auth entry:

Note that CocoIndex backends use the key of an auth entry to identify the backend.

-* Keep the key stable.
+* Keep the key stable.
If the key doesn't change, it's considered to be the same backend (even if the underlying way to connect/authenticate changes).

-* If a key is no longer referenced in any operation spec, keep it until the next flow setup / drop action,
+* If a key is no longer referenced in any operation spec, keep it until the next flow setup / drop action,
so that CocoIndex will be able to clean up the backends.

#### Transient Auth Entry
@@ -518,7 +519,6 @@ flow_builder.add_source(
)
```

-
</TabItem>
</Tabs>

docs/docs/targets/postgres.md

Lines changed: 45 additions & 3 deletions
@@ -41,15 +41,57 @@ CocoIndex automatically strips U+0000 (NUL) characters from strings before expor

The spec takes the following fields:

-* `database` ([auth reference](/docs/core/flow_def#auth-registry) to `DatabaseConnectionSpec`, optional): The connection to the Postgres database.
+* `database` ([auth reference](/docs/core/flow_def#auth-registry) to `DatabaseConnectionSpec`, optional): The connection to the Postgres database.
See [DatabaseConnectionSpec](/docs/core/settings#databaseconnectionspec) for its specific fields.
If not provided, will use the same database as the [internal storage](/docs/core/basics#internal-storage).

-* `table_name` (`str`, optional): The name of the table to store to. If unspecified, will use the table name `[${AppNamespace}__]${FlowName}__${TargetName}`, e.g. `DemoFlow__doc_embeddings` or `Staging__DemoFlow__doc_embeddings`.
+* `table_name` (`str`, optional): The name of the table to store to. If unspecified, will use the table name `[${AppNamespace}__]${FlowName}__${TargetName}`, e.g. `DemoFlow__doc_embeddings` or `Staging__DemoFlow__doc_embeddings`.

-* `schema` (`str`, optional): The PostgreSQL schema to create the table in. If unspecified, the table will be created in the default schema (usually `public`). When specified, `table_name` must also be explicitly specified. CocoIndex will automatically create the schema if it doesn't exist.
+* `schema` (`str`, optional): The PostgreSQL schema to create the table in. If unspecified, the table will be created in the default schema (usually `public`). When specified, `table_name` must also be explicitly specified. CocoIndex will automatically create the schema if it doesn't exist.
+
+## Attachments
+
+### PostgresSqlCommand
+
+Execute arbitrary Postgres SQL during flow setup, with an optional SQL to undo it when the attachment or target is removed.
+
+This attachment is useful for capabilities not natively modeled by the target spec, such as creating specialized indexes, triggers, or grants.
+
+Fields:
+
+* `name` (`str`, required): An identifier for this attachment on the target. Unique within the target.
+* `setup_sql` (`str`, required): SQL to execute during setup.
+* `teardown_sql` (`str`, optional): SQL to execute on removal/drop.
+
+Notes about `setup_sql` and `teardown_sql`:
+
+* Multiple statements are allowed in both `setup_sql` and `teardown_sql`. Use `;` to separate them.
+* Both `setup_sql` and `teardown_sql` are expected to be idempotent, e.g. use statements like `CREATE ... IF NOT EXISTS` and `DROP ... IF EXISTS`.
+* The `setup_sql` is expected to have an "upsert" behavior. If you update `setup_sql`, the updated `setup_sql` will be executed during setup.
+* The `teardown_sql` is saved by CocoIndex, so it'll be executed when the attachment no longer exists. If you update `teardown_sql`, the updated `teardown_sql` will be saved and executed (instead of the previous one) during teardown.
+
+Example (create a custom index):
+
+```py
+collector.export(
+    "doc_embeddings",
+    cocoindex.targets.Postgres(table_name="doc_embeddings"),
+    primary_key_fields=["id"],
+    attachments=[
+        cocoindex.targets.PostgresSqlCommand(
+            name="fts",
+            setup_sql=(
+                "CREATE INDEX IF NOT EXISTS doc_embeddings_text_fts "
+                "ON doc_embeddings USING GIN (to_tsvector('english', text));"
+            ),
+            teardown_sql="DROP INDEX IF EXISTS doc_embeddings_text_fts;",
+        )
+    ],
+)
+```

## Example
+
<ExampleButton
href="https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding"
text="Text Embedding Example with Postgres"
