Commit 67bbd48

Merge branch 'cocoindex-io:main' into feat-anthropic-dataflow
2 parents cfb4bb4 + a98fba4 commit 67bbd48

32 files changed: +1752 -414 lines changed

.github/ISSUE_TEMPLATE/💡-feature-request.md

Lines changed: 3 additions & 7 deletions
@@ -11,14 +11,10 @@ assignees: ''
 **What is the use case?**
 
 **Describe the solution you'd like**
-A clear and concise description of what you want to happen.
-
-**Describe alternatives you've considered**
-A clear and concise description of any alternative solutions or features you've considered.
 
 **Additional context**
-Add any other context or screenshots about the feature request here.
+
 
 ---
-[Contributing Guide](https://cocoindex.io/docs/about/contributing)
-For changes that takes more than a day, we recommend you to leave a comment on the issue like **`I'm working on it`** or **`Can I work on this issue?`** to avoid duplicating work.
+❤️ Contributors, please refer to 📙[Contributing Guide](https://cocoindex.io/docs/about/contributing).
+Unless the PR can be sent immediately (e.g. just a few lines of code), we recommend you to leave a comment on the issue like **`I'm working on it`** or **`Can I work on this issue?`** to avoid duplicating work. Our [Discord server](https://discord.com/invite/zpA9S2DR7s) is always open and friendly.

.vscode/settings.json

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+{
+    "cSpell.words": [
+        "cocoindex",
+        "reindexing",
+        "timedelta"
+    ]
+}

docs/docs/about/contributing.md

Lines changed: 13 additions & 3 deletions
@@ -23,11 +23,20 @@ We tag issues with the ["good first issue"](https://github.com/cocoindex-io/coco
 ## Start hacking! Setting Up Development Environment
 Following the steps below to get cocoindex build on latest codebase locally - if you are making changes to cocoindex funcionality and want to test it out.
 
-- Install Rust toolchain: [docs](https://rust-lang.org/tools/install)
+- 🦀 [Install Rust](https://rust-lang.org/tools/install)
+
+If you don't have Rust installed, run
+```bash
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+```
+Already have Rust? Make sure it's up to date
+```bash
+rustup update
+```
 
-- (Optional) Setup Python virtual environment:
+- (Recommended) Setup Python virtual environment:
 ```bash
-virtualenv --python=$(which python3.12) .venv
+python3 -m venv .venv
 ```
 Activate the virtual environment, before any installings / buildings / runnings:
 
@@ -51,6 +60,7 @@ Following the steps below to get cocoindex build on latest codebase locally - if
 ```
 
 ## Submit Your Code
+CocoIndex is committed to the highest standards of code quality. Please ensure your code is thoroughly tested before submitting a PR.
 
 To submit your code:
 

docs/docs/core/flow_def.mdx

Lines changed: 70 additions & 54 deletions
@@ -49,12 +49,55 @@ See [Flow Running](/docs/core/flow_methods) for more details on it.
 </TabItem>
 </Tabs>
 
-## Flow Builder
+## Data Scope
+
+A **data scope** represents data for a certain unit, e.g. the top level scope (involving all data for a flow), for a document, or for a chunk.
+A data scope has a bunch of fields and collectors, and users can add new fields and collectors to it.
+
+### Get or Add a Field
+
+You can get or add a field of a data scope (which is a data slice).
+
+:::note
 
-The `FlowBuilder` object is the starting point to construct a flow.
+You cannot override an existing field.
+
+:::
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+Getting and setting a field of a data scope is done by the `[]` operator with a field name:
+
+```python
+@cocoindex.flow_def(name="DemoFlow")
+def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
+
+    # Add "documents" to the top-level data scope.
+    data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))
+
+    # Each row of "documents" is a child scope.
+    with data_scope["documents"].row() as document:
+
+        # Get "content" from the document scope, transform, and add "summary" to scope.
+        document["summary"] = field1_row["content"].transform(DemoFunctionSpec(...))
+```
+
+</TabItem>
+</Tabs>
+
+### Add a collector
+
+See [Data Collector](#data-collector) below for more details.
+
+## Data Slice
+
+A **data slice** references a subset of data belonging to a data scope, e.g. a specific field from a data scope.
+A data slice has a certain data type, and it's the input for most operations.
 
 ### Import from source
 
+To get the initial data slice, we need to start from importing data from a source.
 `FlowBuilder` provides a `add_source()` method to import data from external sources.
 A *source spec* needs to be provided for any import operation, to describe the source and parameters related to the source.
 Import must happen at the top level, and the field created by import must be in the top-level struct.
@@ -72,10 +115,6 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
 </TabItem>
 </Tabs>
 
-`add_source()` returns a `DataSlice`. Once external data sources are imported, you can further transform them using methods exposed by these data objects, as discussed in the following sections.
-
-We'll describe different data objects in next few sections.
-
 :::note
 
 The actual value of data is not available at the time when we define the flow: it's only available at runtime.
@@ -111,51 +150,6 @@ and only perform transformations on changed source keys.
 
 :::
 
-## Data Scope
-
-A **data scope** represents data for a certain unit, e.g. the top level scope (involving all data for a flow), for a document, or for a chunk.
-A data scope has a bunch of fields and collectors, and users can add new fields and collectors to it.
-
-### Get or Add a Field
-
-You can get or add a field of a data scope (which is a data slice).
-
-:::note
-
-You cannot override an existing field.
-
-:::
-
-<Tabs>
-<TabItem value="python" label="Python" default>
-
-Getting and setting a field of a data scope is done by the `[]` operator with a field name:
-
-```python
-@cocoindex.flow_def(name="DemoFlow")
-def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
-
-    # Add "documents" to the top-level data scope.
-    data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))
-
-    # Each row of "documents" is a child scope.
-    with data_scope["documents"].row() as document:
-
-        # Get "content" from the document scope, transform, and add "summary" to scope.
-        document["summary"] = field1_row["content"].transform(DemoFunctionSpec(...))
-```
-
-</TabItem>
-</Tabs>
-
-### Add a collector
-
-See [Data Collector](#data-collector) below for more details.
-
-## Data Slice
-
-A **data slice** references a subset of data belonging to a data scope, e.g. a specific field from a data scope.
-A data slice has a certain data type, and it's the input for most operations.
 
 ### Transform
 
@@ -164,7 +158,7 @@ A *function spec* needs to be provided for any transform operation, to describe
 
 The function takes one or multiple data arguments.
 The first argument is the data slice to be transformed, and the `transform()` method is applied from it.
-Other arguments can be passed in as positional arguments or keyword arguments, aftert the function spec.
+Other arguments can be passed in as positional arguments or keyword arguments, after the function spec.
 
 <Tabs>
 <TabItem value="python" label="Python" default>
@@ -300,6 +294,29 @@ CocoIndex provides a common way to configure indexes for various storages.
 
 ## Miscellaneous
 
+### Target Declarations
+
+Most time a target storage is created by calling `export()` method on a collector, and this `export()` call comes with configurations needed for the target storage, e.g. options for storage indexes.
+Occasionally, you may need to specify some configurations for target storage out of the context of any specific data collector.
+
+For example, for graph database targets like `Neo4j`, you may have a data collector to export data to Neo4j relationships, which will create nodes referenced by various relationships in turn.
+These nodes don't directly come from any specific data collector (consider relationships from different data collectors may share the same nodes).
+To specify configurations for these nodes, you can *declare* spec for related node labels.
+
+`FlowBuilder` provides `declare()` method for this purpose, which takes the spec to declare, as provided by various target types.
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+flow_builder.declare(
+    cocoindex.storages.Neo4jDeclarations(...)
+)
+```
+
+</TabItem>
+</Tabs>
+
 ### Auth Registry
 
 CocoIndex manages an auth registry. It's an in-memory key-value store, mainly to store authentication information for a backend.
@@ -310,11 +327,10 @@ Operation spec is the default way to configure a backend. But it has the followi
 * Once an operation is removed after flow definition code change, the spec is also gone.
 But we still need to be able to drop the backend (e.g. a table) by `cocoindex setup` or `cocoindex drop`.
 
-
 Auth registry is introduced to solve the problems above. It works as follows:
 
 * You can create new **auth entry** by a key and a value.
-* You can references the entry by the key, and pass it as part of spec for certain operations. e.g. `Neo4jRelationship` takes `connection` field in the form of auth entry reference.
+* You can references the entry by the key, and pass it as part of spec for certain operations. e.g. `Neo4j` takes `connection` field in the form of auth entry reference.
 
 <Tabs>
 <TabItem value="python" label="Python" default>
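
For readers of this hunk, the auth registry idea (create an entry by key and value, then pass only a reference to it inside a spec) can be illustrated with a toy model. This is a minimal sketch, not CocoIndex's implementation: `ToyAuthRegistry`, `AuthEntryRef`, `add_entry`, `resolve`, the connection values, and the duplicate-key check are all hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AuthEntryRef:
    """Hypothetical handle that stores only the registry key, never the secret."""
    key: str


class ToyAuthRegistry:
    """Illustrative in-memory key-value store; not CocoIndex's implementation."""

    def __init__(self) -> None:
        self._entries: dict[str, Any] = {}

    def add_entry(self, key: str, value: Any) -> AuthEntryRef:
        # Assumption: reject duplicate keys so a reference always means one entry.
        if key in self._entries:
            raise KeyError(f"auth entry already exists: {key}")
        self._entries[key] = value
        return AuthEntryRef(key)

    def resolve(self, ref: AuthEntryRef) -> Any:
        # Look the value up only when a backend actually needs it.
        return self._entries[ref.key]


registry = ToyAuthRegistry()
conn_ref = registry.add_entry(
    "Neo4jConnection",
    {"uri": "bolt://localhost:7687", "user": "neo4j", "password": "example"},
)

# A spec can carry the reference; the credentials stay in the registry.
toy_spec = {"connection": conn_ref, "rel_type": "MENTIONS"}
```

Because the spec holds only the key, the operation that owned the spec can be removed from the flow definition while the entry remains available by key, which is the property the surrounding text says a plain operation spec lacks.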

docs/docs/core/flow_methods.mdx

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ This action has two modes:
 :::info
 
 For both modes, CocoIndex is performing *incremental processing*,
-i.e. we only performs computations and storage mutations on source data that are changed, or the flow has changed.
+i.e. we only perform computations and storage mutations on source data that are changed, or the flow has changed.
 This is to achieve best efficiency.
 
 :::
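
The incremental processing described in this hunk (recompute only when the source data or the flow has changed) can be sketched with a fingerprint cache. This is a toy illustration under stated assumptions, not how CocoIndex tracks changes internally: `fingerprint`, `incremental_update`, the `flow_version` string, and the dictionary-based source and cache are hypothetical.

```python
import hashlib


def fingerprint(content: str, flow_version: str) -> str:
    # Hash the row content together with a flow version string, so that
    # either a data change or a flow change invalidates the cached result.
    return hashlib.sha256(f"{flow_version}\x00{content}".encode("utf-8")).hexdigest()


def incremental_update(source, cache, flow_version, transform):
    """Recompute only rows whose fingerprint changed; return the recomputed keys."""
    recomputed = []
    for key, content in source.items():
        fp = fingerprint(content, flow_version)
        cached = cache.get(key)
        if cached is not None and cached[0] == fp:
            continue  # unchanged row under an unchanged flow: skip
        cache[key] = (fp, transform(content))
        recomputed.append(key)
    # Drop results for rows that no longer exist in the source.
    for key in list(cache.keys() - source.keys()):
        del cache[key]
    return recomputed


cache = {}
source = {"a.md": "hello", "b.md": "world"}
first = incremental_update(source, cache, "v1", str.upper)   # both rows are new
source["b.md"] = "world!"
second = incremental_update(source, cache, "v1", str.upper)  # only b.md changed
third = incremental_update(source, cache, "v2", str.upper)   # flow changed: redo all
```

In this sketch the second call touches only the edited row, while changing the flow version forces every row to be recomputed, mirroring the two triggers named in the text.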
