diff --git a/README.md b/README.md
index 6d6b40643..5af880307 100644
--- a/README.md
+++ b/README.md
@@ -62,8 +62,8 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
     with data_scope["documents"].row() as doc:
         # Split the document into chunks, put into `chunks` field
         doc["chunks"] = doc["content"].transform(
-            cocoindex.functions.SplitRecursively(
-                language="markdown", chunk_size=300, chunk_overlap=100))
+            cocoindex.functions.SplitRecursively(),
+            language="markdown", chunk_size=300, chunk_overlap=100)

         # Transform data of each chunk
         with doc["chunks"].row() as chunk:
diff --git a/docs/docs/core/flow_def.mdx b/docs/docs/core/flow_def.mdx
index f86756406..9f4e5d7cd 100644
--- a/docs/docs/core/flow_def.mdx
+++ b/docs/docs/core/flow_def.mdx
@@ -122,6 +122,10 @@ A data slice has a certain data type, and it's the input for most operations.
 `transform()` method transforms the data slice by a function, which creates another data slice.
 A *function spec* needs to be provided for any transform operation, to describe the function and parameters related to the function.

+The function takes one or multiple data arguments.
+The first argument is the data slice to be transformed, and the `transform()` method is applied from it.
+Other arguments can be passed in as positional arguments or keyword arguments, after the function spec.
+
@@ -129,7 +133,9 @@ A *function spec* needs to be provided for any transform operation, to describe
 @cocoindex.flow_def(name="DemoFlow")
 def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
     ...
-    data_scope["field1"] = data_scope["documents"].transform(DemoFunctionSpec(...))
+    data_scope["field2"] = data_scope["field1"].transform(
+        DemoFunctionSpec(...),
+        arg1, arg2, ..., key0=kwarg0, key1=kwarg1, ...)
     ...
 ```
diff --git a/docs/docs/getting_started/quickstart.md b/docs/docs/getting_started/quickstart.md
index 2557d47a4..77792cc10 100644
--- a/docs/docs/getting_started/quickstart.md
+++ b/docs/docs/getting_started/quickstart.md
@@ -78,8 +78,8 @@ def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoind
     with data_scope["documents"].row() as doc:
         # Split the document into chunks, put into `chunks` field
         doc["chunks"] = doc["content"].transform(
-            cocoindex.functions.SplitRecursively(
-                language="markdown", chunk_size=300, chunk_overlap=100))
+            cocoindex.functions.SplitRecursively(),
+            language="markdown", chunk_size=300, chunk_overlap=100)

         # Transform data of each chunk
         with doc["chunks"].row() as chunk:
diff --git a/docs/docs/ops/functions.md b/docs/docs/ops/functions.md
index ac9df9d36..4a5002da6 100644
--- a/docs/docs/ops/functions.md
+++ b/docs/docs/ops/functions.md
@@ -11,15 +11,12 @@ description: CocoIndex Built-in Functions
 It tries to split at higher-level boundaries. If each chunk is still too large, it tries at the next level of boundaries. For example, for a Markdown file, it identifies boundaries in this order: level-1 sections, level-2 sections, level-3 sections, paragraphs, sentences, etc.

-The spec takes the following fields:
-
-* `chunk_size` (type: `int`, required): The maximum size of each chunk, in bytes.
-* `chunk_overlap` (type: `int`, required): The maximum overlap size between adjacent chunks, in bytes.
-* `language` (type: `str`, optional): The language of the document. Currently it supports `markdown`, `python` and `javascript`. If unspecified, will treat it as plain text.
-
 Input data:

 * `text` (type: `str`, required): The text to split.
+* `chunk_size` (type: `int`, required): The maximum size of each chunk, in bytes.
+* `chunk_overlap` (type: `int`, optional): The maximum overlap size between adjacent chunks, in bytes.
+* `language` (type: `str`, optional): The language of the document. Currently it supports `markdown`, `python` and `javascript`. If unspecified, will treat it as plain text.

 Return type: `Table`, each row represents a chunk, with the following sub fields: