Merged
4 changes: 2 additions & 2 deletions README.md
@@ -62,8 +62,8 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
     with data_scope["documents"].row() as doc:
         # Split the document into chunks, put into `chunks` field
         doc["chunks"] = doc["content"].transform(
-            cocoindex.functions.SplitRecursively(
-                language="markdown", chunk_size=300, chunk_overlap=100))
+            cocoindex.functions.SplitRecursively(),
+            language="markdown", chunk_size=300, chunk_overlap=100)

         # Transform data of each chunk
         with doc["chunks"].row() as chunk:
8 changes: 7 additions & 1 deletion docs/docs/core/flow_def.mdx
@@ -122,14 +122,20 @@ A data slice has a certain data type, and it's the input for most operations.
`transform()` method transforms the data slice by a function, which creates another data slice.
A *function spec* needs to be provided for any transform operation, to describe the function and parameters related to the function.

The function takes one or more data arguments.
The first argument is the data slice to be transformed, i.e. the slice on which `transform()` is called.
Other arguments can be passed as positional or keyword arguments, after the function spec.

<Tabs>
<TabItem value="python" label="Python" default>

```python
@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...
    data_scope["field1"] = data_scope["documents"].transform(DemoFunctionSpec(...))
    data_scope["field2"] = data_scope["field1"].transform(
        DemoFunctionSpec(...),
        arg1, arg2, ..., key0=kwarg0, key1=kwarg1, ...)
    ...
```
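
The calling convention described above (spec first, then extra data arguments) can be illustrated with a toy model that does not use CocoIndex at all. Everything in this sketch is hypothetical: `ToySlice`, `ToyRepeatSpec`, and `build()` are made-up names that only mimic the shape of the call and say nothing about how CocoIndex implements `transform()`.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyRepeatSpec:
    """Hypothetical function spec: it identifies and configures the function."""

    separator: str = " "

    def build(self) -> Callable[..., str]:
        # The returned function receives the data arguments at call time.
        def repeat(text: str, times: int = 1) -> str:
            return self.separator.join([text] * times)

        return repeat


class ToySlice:
    """Hypothetical stand-in for a data slice holding a single value."""

    def __init__(self, value: Any):
        self.value = value

    def transform(self, spec: Any, *args: Any, **kwargs: Any) -> "ToySlice":
        # The slice itself is the first data argument; everything after the
        # spec is forwarded as extra positional or keyword arguments.
        fn = spec.build()
        return ToySlice(fn(self.value, *args, **kwargs))


result = ToySlice("ab").transform(ToyRepeatSpec(separator="-"), times=3)
print(result.value)
```

In this toy, `separator` plays the role of a spec field, while `times` is a data argument passed after the spec, mirroring how this change moves `language`, `chunk_size` and `chunk_overlap` out of the `SplitRecursively` spec.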

4 changes: 2 additions & 2 deletions docs/docs/getting_started/quickstart.md
@@ -78,8 +78,8 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
     with data_scope["documents"].row() as doc:
         # Split the document into chunks, put into `chunks` field
         doc["chunks"] = doc["content"].transform(
-            cocoindex.functions.SplitRecursively(
-                language="markdown", chunk_size=300, chunk_overlap=100))
+            cocoindex.functions.SplitRecursively(),
+            language="markdown", chunk_size=300, chunk_overlap=100)

         # Transform data of each chunk
         with doc["chunks"].row() as chunk:
9 changes: 3 additions & 6 deletions docs/docs/ops/functions.md
@@ -11,15 +11,12 @@ description: CocoIndex Built-in Functions
It tries to split at higher-level boundaries. If each chunk is still too large, it tries at the next level of boundaries.
For example, for a Markdown file, it identifies boundaries in this order: level-1 sections, level-2 sections, level-3 sections, paragraphs, sentences, etc.
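
The boundary-by-boundary idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the actual `SplitRecursively` implementation: the function name, the regular expressions in `MARKDOWN_PATTERNS`, and the character-based size limit are all choices made for this example, and the sketch neither merges small neighbouring pieces nor produces overlap.

```python
import re
from typing import List

# Assumed boundary patterns, from coarse to fine (illustrative only).
MARKDOWN_PATTERNS = [
    r"\n(?=# )",      # level-1 sections
    r"\n(?=## )",     # level-2 sections
    r"\n\n",          # paragraphs
    r"(?<=[.!?]) ",   # sentences
]


def split_recursively(text: str, chunk_size: int, patterns: List[str]) -> List[str]:
    """Split `text` into pieces of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        # Small enough: keep as one chunk (drop whitespace-only pieces).
        return [text] if text.strip() else []
    if not patterns:
        # No boundaries left: fall back to fixed-size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks: List[str] = []
    for piece in re.split(patterns[0], text):
        # Pieces that are still too large are retried at the next level.
        chunks.extend(split_recursively(piece, chunk_size, patterns[1:]))
    return chunks


text = "# A\nshort.\n# B\nFirst sentence here. Second sentence here."
chunks = split_recursively(text, 25, MARKDOWN_PATTERNS)
for chunk in chunks:
    print(repr(chunk))
```

With a limit of 25 characters, section `A` fits as a whole, while section `B` is too long and ends up split at the sentence boundary.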

-The spec takes the following fields:
-
-* `chunk_size` (type: `int`, required): The maximum size of each chunk, in bytes.
-* `chunk_overlap` (type: `int`, required): The maximum overlap size between adjacent chunks, in bytes.
-* `language` (type: `str`, optional): The language of the document. Currently it supports `markdown`, `python` and `javascript`. If unspecified, will treat it as plain text.
-
 Input data:

 * `text` (type: `str`, required): The text to split.
+* `chunk_size` (type: `int`, required): The maximum size of each chunk, in bytes.
+* `chunk_overlap` (type: `int`, optional): The maximum overlap size between adjacent chunks, in bytes.
+* `language` (type: `str`, optional): The language of the document. Currently it supports `markdown`, `python` and `javascript`. If unspecified, will treat it as plain text.

Return type: `Table`, each row represents a chunk, with the following sub fields:
