Commit b536dce

example(custom-target): add custom_output_files example for custom target (#822)
* example: setup example `custom_output_files` w/o custom target yet
* example: add `LocalFileTargetExecutor` with reasonable interface
* example: fix example for `custom_output_files`
* example: fix field name
* example: add `.gitignore`
* example: clean up
* example(custom-output-files): update comments
* docs: update `README` and `pyproject.toml` for `custom_output_files`
* example: update comments

---------

Co-authored-by: Jiangzhou He <[email protected]>
1 parent 3bd85bb commit b536dce

8 files changed (+229 -0 lines changed)


README.md

Lines changed: 1 addition & 0 deletions
@@ -185,6 +185,7 @@ It defines an index flow like this:
 | [Image Search with Vision API](examples/image_search) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend|
 | [Face Recognition](examples/face_recognition) | Recognize faces in images and build embedding index |
 | [Paper Metadata](examples/paper_metadata) | Index papers in PDF files, and build metadata tables for each paper |
+| [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |

 More coming and stay tuned 👀!

examples/custom_output_files/.env

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
output_html/
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Build text embedding and semantic search 🔍
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cocoindex-io/cocoindex/blob/main/examples/text_embedding/Text_Embedding.ipynb)
[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)

In this example, we build an index flow that loads markdown files from a local directory, converts them to HTML, and saves the results to another local directory, powered by [CocoIndex Custom Targets](https://cocoindex.io/docs/custom_ops/custom_targets).

We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.

## Steps

### Indexing Flow

1. We ingest a list of local markdown files from the `data/` directory.
2. We convert each file to HTML using [markdown-it-py](https://markdown-it-py.readthedocs.io/).
3. We save the HTML files to a local directory, `output_html/`.

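As a rough illustration of step 3, the sketch below shows how a source filename maps to an output path. It assumes the same naming rule as the `mutate` method in `main.py`, which appends `.html` to the source filename inside the target directory. `output_path` is a hypothetical helper written for this illustration, not part of CocoIndex.

```python
import os


def output_path(directory: str, filename: str) -> str:
    """Return the path of the HTML file generated for `filename`.

    Hypothetical helper: it mirrors the naming rule used in `mutate`,
    which appends ".html" to the source filename instead of replacing
    the existing extension.
    """
    return os.path.join(directory, filename) + ".html"


if __name__ == "__main__":
    # Show where the output for a source file named "notes.md" would go.
    print(output_path("output_html", "notes.md"))
```

Under this rule, a source file named `notes.md` is written as `notes.md.html` in `output_html/`, so the original extension stays visible in the output name.
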
## Prerequisite

[Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.

## Run

Install dependencies:

```bash
pip install -e .
```

Update the target:

```bash
cocoindex update --setup main.py
```

You can add new files to the `data/` directory, or delete or update existing files.
Each time you run the `update` command, CocoIndex re-processes only the files that have changed and keeps the target in sync with the source.

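To make the sync behavior more concrete, here is a minimal, dependency-free sketch of how one batch of changes could be applied to the output directory. It follows the same idea as the `mutate` method in `main.py` (a `None` value means "delete", anything else means "write"), but it does not call the CocoIndex API. `apply_mutations` is a hypothetical helper, and writing with UTF-8 is an assumption made for this sketch.

```python
import os
import tempfile
from typing import Optional


def apply_mutations(directory: str, mutations: dict[str, Optional[str]]) -> None:
    """Apply a batch of changes to `directory`.

    Hypothetical helper, not a CocoIndex API. A value of None deletes the
    output file for that key; any other value creates or overwrites it.
    """
    for filename, html in mutations.items():
        full_path = os.path.join(directory, filename) + ".html"
        if html is None:
            try:
                os.remove(full_path)
            except FileNotFoundError:
                pass  # Already gone: deleting twice is not an error.
        else:
            # Assumption for this sketch: always write UTF-8.
            with open(full_path, "w", encoding="utf-8") as f:
                f.write(html)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        batch = {"a.md": "<p>new</p>", "b.md": None}
        apply_mutations(out_dir, batch)
        apply_mutations(out_dir, batch)  # Replaying the batch changes nothing.
        print(sorted(os.listdir(out_dir)))
```

Because both branches are idempotent, replaying the same batch leaves the directory in the same state, which is the property the docstrings in `main.py` recommend.
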
You can also run the `update` command in live mode, which keeps the target in sync with the source in real time:

```bash
cocoindex update --setup -L main.py
```

## CocoInsight

I used CocoInsight (free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline.
It connects to your local CocoIndex server with zero pipeline data retention. Run the following command to start CocoInsight:

```
cocoindex server -ci main.py
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
In the spirit of Project Zeta’s innovative chaos, here’s a collection of absurdly true facts about the weirdest animals you’ve never heard of:

1. **Tardigrade (Water Bear)**: This microscopic beast can survive outer space, radiation, and being boiled alive. It once crashed a team meeting by stowing away in Bob’s coffee mug and demanding admin access to the server.

2. **Aye-Aye**: A Madagascar primate with a creepy long finger it uses to tap trees for grubs. It tried to “debug” our codebase by tapping the keyboard, resulting in 47 nested for-loops.

3. **Saiga Antelope**: This goofy-nosed critter looks like it’s auditioning for a sci-fi flick. Its sneezes are so powerful they once blew out the office Wi-Fi during a sprint review.

4. **Glaucus Atlanticus (Blue Dragon Sea Slug)**: This tiny ocean dragon steals venom from jellyfish and uses it like a borrowed superpower. It infiltrated our water cooler and left behind a sparkly, toxic trail.

5. **Pink Fairy Armadillo**: A palm-sized digger that looks like a cotton candy tank. It burrowed into the office carpet, mistaking it for a desert, and now we have a “no armadillos” policy.

6. **Dumbo Octopus**: A deep-sea octopus with ear-like fins, flapping around like it’s late for a Zoom call. It once rewired our projector to display memes of itself across the office.

7. **Jerboa**: A hopping desert rodent with kangaroo vibes. It stole the team’s snacks and leaped over three cubicles before anyone noticed, earning the codename "Snack Bandit."

8. **Mantis Shrimp**: This crustacean sees more colors than our graphic designer and punches harder than a failing CI pipeline. It shattered a monitor when we tried to pair-program with it.

9. **Okapi**: A zebra-giraffe hybrid that looks like a Photoshop error. It wandered into our sprint planning and suggested we pivot to a “forest-themed” microservices architecture.

10. **Blobfish**: The ocean’s saddest-looking blob, voted “Most Likely to Crash a Stand-Up” by the team. Its mere presence caused our morale bot to send 200 crying emojis.
Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
# Chuck Norris Project Facts
2+
Date: 2025-07-20
3+
Author: Anonymous (because Chuck Norris knows who you are)
4+
5+
Here are some totally true facts about Chuck Norris's involvement in Project Omega:
6+
7+
1. Chuck Norris doesn't write code; he stares at the computer until it writes itself out of fear.
8+
2. The project deadline was yesterday, but time rescheduled itself to accommodate Chuck Norris.
9+
3. Chuck Norris's code never has bugs—just "features" that are too scared to misbehave.
10+
4. When the database crashed, Chuck Norris roundhouse-kicked the server, and it apologized.
11+
5. The team tried to use Agile, but Chuck Norris declared, "I am the only methodology you need."
12+
6. Version control? Chuck Norris is the only version that matters.
13+
7. The project scope expanded because Chuck Norris added "world domination" as a deliverable.
14+
8. When the CI/CD pipeline failed, Chuck Norris rebuilt it with a single grunt.
15+
9. The codebase is 100% documented because no one dares ask Chuck Norris, "What does this do?"
16+
10. Chuck Norris doesn't deploy to production; production deploys to Chuck Norris.
17+
18+
Last updated: 2025-07-20 06:36 AM MST
19+
Note: If you modify this file, Chuck Norris will know... and he’ll find you.
Lines changed: 123 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,123 @@
1+
from datetime import timedelta
2+
import os
3+
import dataclasses
4+
5+
import cocoindex
6+
from markdown_it import MarkdownIt
7+
8+
_markdown_it = MarkdownIt("gfm-like")
9+
10+
11+
class LocalFileTarget(cocoindex.op.TargetSpec):
12+
"""Represents the custom target spec."""
13+
14+
# The directory to save the HTML files.
15+
directory: str
16+
17+
18+
@dataclasses.dataclass
19+
class LocalFileTargetValues:
20+
"""Represents value fields of exported data. Used in `mutate` method below."""
21+
22+
html: str
23+
24+
25+
@cocoindex.op.target_connector(spec_cls=LocalFileTarget)
26+
class LocalFileTargetConnector:
27+
@staticmethod
28+
def get_persistent_key(spec: LocalFileTarget, target_name: str) -> str:
29+
"""Use the directory path as the persistent key for this target."""
30+
return spec.directory
31+
32+
@staticmethod
33+
def describe(key: str) -> str:
34+
"""(Optional) Return a human-readable description of the target."""
35+
return f"Local directory {key}"
36+
37+
@staticmethod
38+
def apply_setup_change(
39+
key: str, previous: LocalFileTarget | None, current: LocalFileTarget | None
40+
) -> None:
41+
"""
42+
Apply setup changes to the target.
43+
44+
Best practice: keep all actions idempotent.
45+
"""
46+
47+
# Create the directory if it didn't exist.
48+
if previous is None and current is not None:
49+
os.makedirs(current.directory, exist_ok=True)
50+
51+
# Delete the directory with its contents if it no longer exists.
52+
if previous is not None and current is None:
53+
if os.path.isdir(previous.directory):
54+
for filename in os.listdir(previous.directory):
55+
if filename.endswith(".html"):
56+
os.remove(os.path.join(previous.directory, filename))
57+
os.rmdir(previous.directory)
58+
59+
@staticmethod
60+
def prepare(spec: LocalFileTarget) -> LocalFileTarget:
61+
"""
62+
(Optional) Prepare for execution. To run common operations before applying any mutations.
63+
The returned value will be passed as the first element of tuples in `mutate` method.
64+
65+
If not provided, will directly pass the spec to `mutate` method.
66+
"""
67+
return spec
68+
69+
@staticmethod
70+
def mutate(
71+
*all_mutations: tuple[LocalFileTarget, dict[str, LocalFileTargetValues | None]],
72+
) -> None:
73+
"""
74+
Mutate the target.
75+
76+
The first element of the tuple is the target spec.
77+
The second element is a dictionary of mutations:
78+
- The key is the filename, and the value is the mutation.
79+
- If the value is `None`, the file will be removed.
80+
Otherwise, the file will be written with the content.
81+
82+
Best practice: keep all actions idempotent.
83+
"""
84+
for spec, mutations in all_mutations:
85+
for filename, mutation in mutations.items():
86+
full_path = os.path.join(spec.directory, filename) + ".html"
87+
if mutation is None:
88+
try:
89+
os.remove(full_path)
90+
except FileNotFoundError:
91+
pass
92+
else:
93+
with open(full_path, "w") as f:
94+
f.write(mutation.html)
95+
96+
97+
@cocoindex.op.function()
98+
def markdown_to_html(text: str) -> str:
99+
return _markdown_it.render(text)
100+
101+
102+
@cocoindex.flow_def(name="CustomOutputFiles")
103+
def custom_output_files(
104+
flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
105+
) -> None:
106+
"""
107+
Define an example flow that exports markdown files to HTML files.
108+
"""
109+
data_scope["documents"] = flow_builder.add_source(
110+
cocoindex.sources.LocalFile(path="data", included_patterns=["*.md"]),
111+
refresh_interval=timedelta(seconds=5),
112+
)
113+
114+
output_html = data_scope.add_collector()
115+
with data_scope["documents"].row() as doc:
116+
doc["html"] = doc["content"].transform(markdown_to_html)
117+
output_html.collect(filename=doc["filename"], html=doc["html"])
118+
119+
output_html.export(
120+
"OutputHtml",
121+
LocalFileTarget(directory="output_html"),
122+
primary_key_fields=["filename"],
123+
)
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
[project]
2+
name = "custom-output-files"
3+
version = "0.1.0"
4+
description = "Simple example for cocoindex: convert markdown files to HTML files and save them to a local directory."
5+
requires-python = ">=3.11"
6+
dependencies = ["cocoindex>=0.1.74", "markdown-it-py[linkify,plugins]"]
7+
8+
[tool.setuptools]
9+
packages = []
