diff --git a/docs/docs/tutorials/live_updates.md b/docs/docs/tutorials/live_updates.md new file mode 100644 index 000000000..ffa320e55 --- /dev/null +++ b/docs/docs/tutorials/live_updates.md @@ -0,0 +1,156 @@ +--- +title: Live Updates +description: "Keep your indexes up-to-date with live updates in CocoIndex." +--- + +# Live Updates + +CocoIndex is designed to keep your indexes synchronized with your data sources. This is achieved through a feature called **live updates**, which automatically detects changes in your sources and updates your indexes accordingly. This ensures that your search results and data analysis are always based on the most current information. + +## How Live Updates Work + +Live updates in CocoIndex can be triggered in two main ways: + +1. **Refresh Interval:** You can configure a `refresh_interval` for any data source. CocoIndex will then periodically check the source for any new, updated, or deleted data. This is a simple and effective way to keep your index fresh, especially for sources that don't have a built-in change notification system. + +2. **Change Capture Mechanisms:** Some data sources offer more sophisticated ways to track changes. For example: + * **Amazon S3:** You can configure an SQS queue to receive notifications whenever a file is added, modified, or deleted in your S3 bucket. CocoIndex can listen to this queue and trigger an update instantly. + * **Google Drive:** The Google Drive source can be configured to poll for recent changes, which is more efficient than a full refresh. + +When a change is detected, CocoIndex performs an **incremental update**. This means it only re-processes the data that has been affected by the change, without having to re-index your entire dataset. This makes the update process fast and efficient. 
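To make the idea of an incremental update concrete, here is a minimal plain-Python sketch. It is not CocoIndex's implementation: the `fingerprint` and `diff_snapshots` helpers and the two snapshot dictionaries are hypothetical, and the sketch only assumes that changes can be found by comparing per-file content hashes between two refreshes.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a stable content hash used to detect modifications."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {filename: fingerprint} snapshots and classify the changes."""
    added = sorted(name for name in current if name not in previous)
    deleted = sorted(name for name in previous if name not in current)
    modified = sorted(
        name
        for name in current
        if name in previous and current[name] != previous[name]
    )
    return {"added": added, "modified": modified, "deleted": deleted}


# Hypothetical snapshots of one source taken at two consecutive refreshes.
before = {"a.md": fingerprint("hello"), "b.md": fingerprint("world")}
after = {
    "a.md": fingerprint("hello"),
    "b.md": fingerprint("world!"),
    "c.md": fingerprint("new"),
}

changes = diff_snapshots(before, after)
print(changes)  # only b.md and c.md would need re-processing; a.md is untouched
```

Only the files listed under `added`, `modified`, or `deleted` need any work on the next update, which is why an incremental update can be much cheaper than re-indexing everything.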
+ +Here's an example of how to set up a source with a `refresh_interval`: + +```python +@cocoindex.flow_def(name="LiveUpdateExample") +def live_update_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): + # Source: local files in the 'data' directory + data_scope["documents"] = flow_builder.add_source( + cocoindex.sources.LocalFile(path="data"), + refresh_interval=datetime.timedelta(seconds=5), + ) + # ... +``` + +By setting `refresh_interval` to 5 seconds (a standard-library `datetime.timedelta`, so the snippet needs `import datetime`), we're telling CocoIndex to check for changes in the `data` directory every 5 seconds. + +## Implementing Live Updates + +You can enable live updates using either the CocoIndex CLI or the Python library. + +### Using the CLI + +To start a live update process from the command line, use the `update` command with the `-L` or `--live` flag: + +```bash +cocoindex update -L your_flow_definition_file.py +``` + +This will start a long-running process that continuously monitors your data sources for changes and updates your indexes as changes are detected. You can stop the process by pressing `Ctrl+C`. + +### Using the Python Library + +For more control over the live update process, you can use the `FlowLiveUpdater` class in your Python code. This is particularly useful when you want to integrate CocoIndex into a larger application. + +The `FlowLiveUpdater` can be used as a context manager, which automatically starts the updater when you enter the `with` block and stops it when you exit. The `wait()` method will block until the updater is aborted (e.g., by pressing `Ctrl+C`). + +Here's how you can use `FlowLiveUpdater` to start and manage a live update process: + +```python +import cocoindex + +# Create a FlowLiveUpdater instance +with cocoindex.FlowLiveUpdater(live_update_flow, cocoindex.FlowLiveUpdaterOptions(print_stats=True)) as updater: + print("Live updater started. Press Ctrl+C to stop.") + # The updater runs in the background. + # The wait() method blocks until the updater is stopped. 
+ updater.wait() + +print("Live updater stopped.") +``` + +#### Getting Status Updates + +You can also get status updates from the `FlowLiveUpdater` to monitor the update process. The `next_status_updates()` method blocks until there is a new status update. + +```python +import cocoindex + +updater = cocoindex.FlowLiveUpdater(live_update_flow) +updater.start() + +while True: + updates = updater.next_status_updates() + + if not updates.active_sources: + print("All sources have finished processing.") + break + + for source_name in updates.updated_sources: + print(f"Source '{source_name}' has been updated.") + +updater.wait() +``` + +This allows you to react to updates in your application, for example, by notifying users or triggering downstream processes. + +## Example + +Let's walk through an example of how to set up a live update flow. For the complete, runnable code, see the [live updates example](https://github.com/cocoindex-io/cocoindex/tree/main/examples/live_updates) in the CocoIndex repository. + +### 1. Setting up the Source + +The first step is to define a source and configure a `refresh_interval`. In this example, we'll use a `LocalFile` source to monitor a directory named `data`. 
+ +```python +@cocoindex.flow_def(name="LiveUpdateExample") +def live_update_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): + # Source: local files in the 'data' directory + data_scope["documents"] = flow_builder.add_source( + cocoindex.sources.LocalFile(path="data"), + refresh_interval=datetime.timedelta(seconds=5), + ) + + # Collector + collector = data_scope.add_collector() + with data_scope["documents"].row() as doc: + collector.collect(filename=doc["filename"], content=doc["content"]) + + # Target: Postgres database + collector.export( + "documents_index", + cocoindex.targets.Postgres(), + primary_key_fields=["filename"] + ) +``` + +By setting `refresh_interval` to `datetime.timedelta(seconds=5)` (from the standard-library `datetime` module), we're telling CocoIndex to check for changes in the `data` directory every 5 seconds. + +### 2. Running the Live Updater + +Once the flow is defined, you can use the `FlowLiveUpdater` to start the live update process. + +```python +def main(): + # Initialize CocoIndex + cocoindex.init() + + # Setup the flow + live_update_flow.setup(report_to_stdout=True) + + # Start the live updater + with cocoindex.FlowLiveUpdater(live_update_flow, cocoindex.FlowLiveUpdaterOptions(print_stats=True)) as updater: + print("Live updater started. Watching for changes in the 'data' directory.") + updater.wait() + +if __name__ == "__main__": + main() +``` + +The `FlowLiveUpdater` will run in the background, and the `updater.wait()` call will block until the process is stopped. + +## Conclusion + +Live updates are a powerful CocoIndex feature that keeps your indexes fresh. By using a combination of refresh intervals and source-specific change capture mechanisms, you can build responsive applications that stay in sync with your data. + +For more detailed information on the `FlowLiveUpdater` and other live update options, please refer to the [Run a Flow documentation](https://cocoindex.io/docs/core/flow_methods#live-update). 
diff --git a/docs/sidebars.ts b/docs/sidebars.ts index bf645bdd6..297237437 100644 --- a/docs/sidebars.ts +++ b/docs/sidebars.ts @@ -12,6 +12,14 @@ const sidebars: SidebarsConfig = { 'getting_started/installation', ], }, + { + type: 'category', + label: 'Tutorials', + collapsed: false, + items: [ + 'tutorials/live_updates', + ], + }, { type: 'category', label: 'CocoIndex Core', diff --git a/examples/live_updates/.env b/examples/live_updates/.env new file mode 100644 index 000000000..b8559bcc9 --- /dev/null +++ b/examples/live_updates/.env @@ -0,0 +1 @@ +COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex diff --git a/examples/live_updates/README.md b/examples/live_updates/README.md new file mode 100644 index 000000000..221a62400 --- /dev/null +++ b/examples/live_updates/README.md @@ -0,0 +1,58 @@ +# Applying Live Updates to CocoIndex Flow Example +[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) + +We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful. + +This example demonstrates how to use CocoIndex's live update feature to keep an index synchronized with a local directory. + +## How it Works + +The `main.py` script defines a CocoIndex flow that: + +1. **Sources** data from a local directory named `data`. It uses a `refresh_interval` of 5 seconds to check for changes. +2. **Collects** the `filename` and `content` of each file. +3. **Exports** the collected data to a Postgres database table. + +The script then starts a `FlowLiveUpdater`, which runs in the background and continuously monitors the `data` directory for changes. + +## Running the Example + +1. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have it installed. + +2. **Install the dependencies:** + + ```bash + pip install -e . + ``` + +3. 
**Run the example:** + + You can run the live update example in two ways: + + **Option 1: Using the Python script** + + This method uses CocoIndex [Library API](https://cocoindex.io/docs/core/flow_methods#library-api-2) to perform live updates. + + ```bash + python main.py + ``` + + **Option 2: Using the CocoIndex CLI** + + This method is useful for managing your indexes from the command line, through CocoIndex [CLI](https://cocoindex.io/docs/core/flow_methods#cli-2). + + ```bash + cocoindex update main.py -L --setup + ``` + +4. **Test the live updates:** + + While the script is running, you can try adding, modifying, or deleting files in the `data` directory. You will see the changes reflected in the logs as CocoIndex updates the index. + +## Cleaning Up + +To remove the database table created by this example, you can run: + +```bash +cocoindex drop main.py +``` diff --git a/examples/live_updates/data/bizarre_animals.md b/examples/live_updates/data/bizarre_animals.md new file mode 100644 index 000000000..013e7a730 --- /dev/null +++ b/examples/live_updates/data/bizarre_animals.md @@ -0,0 +1,21 @@ +In the spirit of Project Zeta’s innovative chaos, here’s a collection of absurdly true facts about the weirdest animals you’ve never heard of: + +1. **Tardigrade (Water Bear)**: This microscopic beast can survive outer space, radiation, and being boiled alive. It once crashed a team meeting by stowing away in Bob’s coffee mug and demanding admin access to the server. + +2. **Aye-Aye**: A Madagascar primate with a creepy long finger it uses to tap trees for grubs. It tried to “debug” our codebase by tapping the keyboard, resulting in 47 nested for-loops. + +3. **Saiga Antelope**: This goofy-nosed critter looks like it’s auditioning for a sci-fi flick. Its sneezes are so powerful they once blew out the office Wi-Fi during a sprint review. + +4. 
**Glaucus Atlanticus (Blue Dragon Sea Slug)**: This tiny ocean dragon steals venom from jellyfish and uses it like a borrowed superpower. It infiltrated our water cooler and left behind a sparkly, toxic trail. + +5. **Pink Fairy Armadillo**: A palm-sized digger that looks like a cotton candy tank. It burrowed into the office carpet, mistaking it for a desert, and now we have a “no armadillos” policy. + +6. **Dumbo Octopus**: A deep-sea octopus with ear-like fins, flapping around like it’s late for a Zoom call. It once rewired our projector to display memes of itself across the office. + +7. **Jerboa**: A hopping desert rodent with kangaroo vibes. It stole the team’s snacks and leaped over three cubicles before anyone noticed, earning the codename "Snack Bandit." + +8. **Mantis Shrimp**: This crustacean sees more colors than our graphic designer and punches harder than a failing CI pipeline. It shattered a monitor when we tried to pair-program with it. + +9. **Okapi**: A zebra-giraffe hybrid that looks like a Photoshop error. It wandered into our sprint planning and suggested we pivot to a “forest-themed” microservices architecture. + +10. **Blobfish**: The ocean’s saddest-looking blob, voted “Most Likely to Crash a Stand-Up” by the team. Its mere presence caused our morale bot to send 200 crying emojis. diff --git a/examples/live_updates/data/chunk_norris.md b/examples/live_updates/data/chunk_norris.md new file mode 100644 index 000000000..89952641e --- /dev/null +++ b/examples/live_updates/data/chunk_norris.md @@ -0,0 +1,19 @@ +# Chuck Norris Project Facts +Date: 2025-07-20 +Author: Anonymous (because Chuck Norris knows who you are) + +Here are some totally true facts about Chuck Norris's involvement in Project Omega: + +1. Chuck Norris doesn't write code; he stares at the computer until it writes itself out of fear. +2. The project deadline was yesterday, but time rescheduled itself to accommodate Chuck Norris. +3. 
Chuck Norris's code never has bugs—just "features" that are too scared to misbehave. +4. When the database crashed, Chuck Norris roundhouse-kicked the server, and it apologized. +5. The team tried to use Agile, but Chuck Norris declared, "I am the only methodology you need." +6. Version control? Chuck Norris is the only version that matters. +7. The project scope expanded because Chuck Norris added "world domination" as a deliverable. +8. When the CI/CD pipeline failed, Chuck Norris rebuilt it with a single grunt. +9. The codebase is 100% documented because no one dares ask Chuck Norris, "What does this do?" +10. Chuck Norris doesn't deploy to production; production deploys to Chuck Norris. + +Last updated: 2025-07-20 06:36 AM MST +Note: If you modify this file, Chuck Norris will know... and he’ll find you. diff --git a/examples/live_updates/main.py b/examples/live_updates/main.py new file mode 100644 index 000000000..0dad80156 --- /dev/null +++ b/examples/live_updates/main.py @@ -0,0 +1,55 @@ +import datetime + +import cocoindex +from dotenv import load_dotenv + + +# Define the flow +@cocoindex.flow_def(name="LiveUpdates") +def live_update_flow( + flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope +) -> None: + # Source: local files in the 'data' directory + data_scope["documents"] = flow_builder.add_source( + cocoindex.sources.LocalFile(path="data"), + refresh_interval=datetime.timedelta(seconds=5), + ) + + # Collector + collector = data_scope.add_collector() + with data_scope["documents"].row() as doc: + collector.collect( + filename=doc["filename"], + content=doc["content"], + ) + + # Target: Postgres database + collector.export( + "documents_index", + cocoindex.targets.Postgres(), + primary_key_fields=["filename"], + ) + + +def main() -> None: + # Setup the flow + live_update_flow.setup(report_to_stdout=True) + + # Start the live updater + print("Starting live updater...") + with cocoindex.FlowLiveUpdater( + live_update_flow, 
cocoindex.FlowLiveUpdaterOptions(print_stats=True) + ) as updater: + print("Live updater started. Watching for changes in the 'data' directory.") + print("Try adding, modifying, or deleting files in the 'data' directory.") + print("Press Ctrl+C to stop.") + try: + updater.wait() + except KeyboardInterrupt: # handle graceful shutdown + print("Stopping live updater...") + + +if __name__ == "__main__": + load_dotenv() + cocoindex.init() + main() diff --git a/examples/live_updates/pyproject.toml b/examples/live_updates/pyproject.toml new file mode 100644 index 000000000..05f0bd2e5 --- /dev/null +++ b/examples/live_updates/pyproject.toml @@ -0,0 +1,12 @@ +[project] +name = "live-updates-example" +version = "0.1.0" +description = "Simple example for cocoindex: perform live updates based on local markdown files." +requires-python = ">=3.11" +dependencies = [ + "cocoindex>=0.1.70", + "python-dotenv>=1.1.0", +] + +[tool.setuptools] +packages = []