Update docs for incremental update / change capturing to clarify things. #268
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -38,21 +38,21 @@ It creates a `demo_flow` object in `cocoindex.Flow` type. | |
| ## Build / update target data | ||
|
|
||
| The major goal of a flow is to perform the transformations on source data and build / update data in the target storage (the index). | ||
| This action has two flavors: | ||
| This action has two modes: | ||
|
|
||
| * **One time update.** | ||
| It builds/updates the target data based on source data up to the current moment. | ||
| It's done once the target data is at least as fresh as the source data was when the update started. | ||
| It fits situations where you need to access fresh target data at certain points in time. | ||
|
|
||
| * **Live update.** | ||
| It continuously watches the source data and updates the target data accordingly. | ||
| It continuously captures changes from the source data and updates the target data accordingly. | ||
| It's long-running and only stops when aborted explicitly. | ||
| It fits situations where you need to access fresh target data continuously, most of the time. | ||
|
|
||
| :::info | ||
|
|
||
| For both flavors, CocoIndex is performing updates incrementally. | ||
| For both modes, CocoIndex performs *incremental processing*, | ||
| i.e. it only performs computations and storage mutations for source data that has changed, or when the flow has changed. | ||
| This achieves the best efficiency. | ||
|
|
||
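To make the incremental processing note in this hunk concrete, here is a minimal, self-contained sketch. It is not CocoIndex's implementation: `fingerprint`, `incremental_update`, and the in-memory dicts are hypothetical stand-ins. A target row is recomputed only when its source content or the flow logic has changed, and target rows whose source is gone are removed.

```python
import hashlib


def fingerprint(content: str, flow_version: str) -> str:
    """Hash the source content together with a version tag for the flow logic."""
    return hashlib.sha256(f"{flow_version}\x00{content}".encode()).hexdigest()


def incremental_update(source, target, seen, transform, flow_version):
    """Bring `target` up to date with `source`, touching only what changed.

    source: dict key -> source content
    target: dict key -> transformed content (mutated in place)
    seen:   dict key -> fingerprint recorded by the previous run (mutated in place)
    Returns the sorted list of keys that were recomputed or deleted.
    """
    touched = []
    for key, content in source.items():
        fp = fingerprint(content, flow_version)
        if seen.get(key) != fp:  # new row, changed row, or changed flow
            target[key] = transform(content)
            seen[key] = fp
            touched.append(key)
    for key in list(target):
        if key not in source:  # the source row was deleted
            del target[key]
            seen.pop(key, None)
            touched.append(key)
    return sorted(touched)


source = {"a.md": "hello", "b.md": "world"}
target, seen = {}, {}

first = incremental_update(source, target, seen, str.upper, "v1")
source["b.md"] = "world!"  # one source row changes
second = incremental_update(source, target, seen, str.upper, "v1")
third = incremental_update(source, target, seen, str.title, "v2")  # the flow changes
```

In this toy run the first call processes both rows, the second touches only `b.md`, and the third reprocesses everything because the flow version changed.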
|
|
@@ -90,22 +90,26 @@ CLI equivalence: `cocoindex update -L` | |
|
|
||
| ::: | ||
|
|
||
| Live update is *eligible* for certain data sources, including: | ||
| A data source may enable one or multiple *change capture mechanisms*: | ||
|
|
||
| * Data sources configured with a [refresh interval](flow_def#refresh-interval). | ||
| * Data sources provides a **change stream**. | ||
| * Configured with a [refresh interval](flow_def#refresh-interval), which is generally applicable to all data sources. | ||
| * Specific data sources also provide their own change capture mechanisms. | ||
| See the documentation for specific data sources for details. | ||
|
Member
Author
Can there be an example?
||
|
|
||
| Change capture mechanisms enable CocoIndex to continuously capture changes from the source data and update the target data accordingly, under live update mode. | ||
|
|
||
| To perform live update, you need to create a `cocoindex.FlowLiveUpdater` object using the `cocoindex.Flow` object. | ||
| It takes an optional `cocoindex.FlowLiveUpdaterOptions` option, with the following fields: | ||
|
|
||
| * `live_mode` (type: `bool`, default: `True`): | ||
| Whether to perform live update for eligible data sources. | ||
| Whether to perform live update for data sources with change capture mechanisms. | ||
| It has no effect for data sources without any change capture mechanism. | ||
|
|
||
| * `print_stats` (type: `bool`, default: `False`): Whether to print stats during update. | ||
|
|
||
| For data sources ineligible for live updates, or when the `live_mode` is `False`, | ||
| the `FlowLiveUpdater` only performs a one-time update, i.e. similar to the one-time update (`update()` method) above, | ||
| under a unified interface. | ||
| Note that `cocoindex.FlowLiveUpdater` provides a unified interface for both one-time update and live update. | ||
|
Member
Author
Does it make sense to link?
Member
Author
Which part are you referring to?
||
| It only performs live update when `live_mode` is `True`, and only for sources with change capture mechanisms enabled. | ||
| If a source has multiple change capture mechanisms enabled, all will take effect to trigger updates. | ||
|
|
||
| <Tabs> | ||
| <TabItem value="python" label="Python" default> | ||
|
|
@@ -126,7 +130,7 @@ A `FlowLiveUpdater` object supports the following methods: | |
| * `wait()` (async): Wait for the updater to finish. It only unblocks in one of the following cases: | ||
| * The updater was aborted. | ||
| * A one time update is done, and live update is not enabled: | ||
| either `live_mode` is `False`, or all data sources are ineligible for live updates. | ||
| either `live_mode` is `False`, or all data sources have no change capture mechanisms enabled. | ||
| * `update_stats()`: It returns the stats of the updater. | ||
|
|
||
| <Tabs> | ||
|
|
||
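The rules in this file's diff (live update happens only when `live_mode` is `True` and only for sources with a change capture mechanism, all enabled mechanisms take effect, and `wait()` returns on its own only when nothing runs live) can be sketched as plain decision logic. This is an illustration under stated assumptions, not the library's code: `SourceConfig`, `live_sources`, `wait_finishes_on_its_own`, and the mechanism labels are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional


@dataclass
class SourceConfig:
    """Hypothetical stand-in for how a source is declared in a flow."""
    name: str
    refresh_interval: Optional[timedelta] = None  # generally applicable mechanism
    source_specific: list = field(default_factory=list)  # e.g. polling recent changes

    def mechanisms(self) -> list:
        """All enabled change capture mechanisms; every one of them can trigger updates."""
        found = list(self.source_specific)
        if self.refresh_interval is not None:
            found.insert(0, "refresh_interval")
        return found


def live_sources(live_mode: bool, sources: list) -> list:
    """Names of sources that keep being updated after the initial one-time update."""
    if not live_mode:
        return []
    return [s.name for s in sources if s.mechanisms()]


def wait_finishes_on_its_own(live_mode: bool, sources: list) -> bool:
    """True if waiting would return without an explicit abort."""
    return not live_sources(live_mode, sources)


sources = [
    SourceConfig("static_files"),
    SourceConfig("local_docs", refresh_interval=timedelta(minutes=1)),
    SourceConfig("drive_docs", refresh_interval=timedelta(hours=2),
                 source_specific=["recent_changes_poll"]),
]
```

With these declarations, only `local_docs` and `drive_docs` would be updated live, and switching `live_mode` off reduces the whole run to a one-time update.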
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -59,7 +59,7 @@ The spec takes the following fields: | |
| * `service_account_credential_path` (type: `str`, required): full path to the service account credential file in JSON format. | ||
| * `root_folder_ids` (type: `list[str]`, required): a list of Google Drive folder IDs to import files from. | ||
| * `binary` (type: `bool`, optional): whether reading files as binary (instead of text). | ||
| * `recent_changes_poll_interval` (type: `datetime.timedelta`, optional): when set, this source provides a *change stream* by polling Google Drive for recent modified files periodically. | ||
| * `recent_changes_poll_interval` (type: `datetime.timedelta`, optional): when set, this source provides a *change capture mechanism* by periodically polling Google Drive for recently modified files. | ||
|
|
||
| :::info | ||
|
|
||
|
|
@@ -70,8 +70,8 @@ The spec takes the following fields: | |
| On the other hand, this only detects changes for files that still exist. | ||
| If a file is deleted (or the current account no longer has access to it), the change will not be detected by this change stream. | ||
|
|
||
| So when a source is configured with a change stream, it's still recommended to set a `refresh_interval`, with a larger value. | ||
| So for most changes can be covered by the change stream (with low latency), and remaining changes (files no longer exist or accessible) will still be covered (with a higher latency). | ||
| So when a `GoogleDrive` source enables `recent_changes_poll_interval`, it's still recommended to set a `refresh_interval` with a larger value. | ||
| That way, most changes are covered by polling recent changes (with low latency), and the remaining changes (files that no longer exist or are no longer accessible) are still covered (with higher latency). | ||
|
Member
Author
I wonder if it makes sense to give an example, e.g., every 2 hours, depending on your requirements.
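A sketch of the kind of example suggested above, under stated assumptions. The two interval values are illustrative only (10 seconds is an arbitrary pick, 2 hours comes from the comment), and `poll_recent_changes` and `full_refresh` are hypothetical toy functions over an in-memory dict, not the Google Drive source itself. It shows why the slower refresh is still needed: the recent-changes poll picks up modifications but never sees deletions.

```python
from datetime import timedelta

# Illustrative values: poll recent changes often, do a full refresh rarely.
recent_changes_poll_interval = timedelta(seconds=10)
refresh_interval = timedelta(hours=2)


def poll_recent_changes(drive, index, since):
    """Cheap pass: re-index files that still exist and were modified after `since`."""
    for name, (mtime, content) in drive.items():
        if mtime > since:
            index[name] = content


def full_refresh(drive, index):
    """Expensive pass: rescan everything, which also drops files that are gone."""
    for name in list(index):
        if name not in drive:
            del index[name]
    for name, (_mtime, content) in drive.items():
        index[name] = content


# name -> (modification time in seconds, content)
drive = {"a.txt": (0, "v1"), "b.txt": (0, "v1")}
index = {}
full_refresh(drive, index)

drive["a.txt"] = (5, "v2")  # modified: visible to the recent-changes poll
del drive["b.txt"]          # deleted: invisible to the recent-changes poll

poll_recent_changes(drive, index, since=0)
after_poll = dict(index)    # the stale entry for b.txt is still present

full_refresh(drive, index)
after_refresh = dict(index)  # the stale entry is gone
```

The modification shows up after one cheap poll, while the deletion is only reflected once the next full refresh runs, so the refresh interval bounds how long a deleted file can linger in the index.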
||
|
|
||
| ::: | ||
|
|
||
|
|
||
Do push events belong here as well?

Different sources have different channels for pushing events, so this belongs with the specific data sources.
We can add more wording here to clarify once we start supporting a real push-event-based change capture mechanism.