diff --git a/docs/docs/examples/examples/manual_extraction.md b/docs/docs/examples/examples/manual_extraction.md
index 8e6f081a8..0b340fde0 100644
--- a/docs/docs/examples/examples/manual_extraction.md
+++ b/docs/docs/examples/examples/manual_extraction.md
@@ -10,31 +10,108 @@ sidebar_custom_props:
tags: [structured-data-extraction, data-mapping]
---
-import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';
+import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/components/GitHubButton';
+## Overview
+This example shows how to extract structured data from Python manuals (PDF files) using an LLM served by Ollama.
+
+## Flow Overview
+
+
+- For each PDF file:
+  - Parse it to markdown.
+  - Extract structured data from the markdown using an LLM.
+  - Add a summary to the module info.
+  - Collect the data.
+- Export the data to a table.
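
The steps above can be sketched in plain Python, independent of CocoIndex. Every callable in this sketch (`parse_to_markdown`, `extract_module_info`, `summarize`) is a hypothetical stand-in for the real parser, LLM call, and summary function. It only illustrates how data moves through the flow, not how CocoIndex executes it.

```python
from typing import Callable


def run_flow(
    files: dict[str, bytes],
    parse_to_markdown: Callable[[bytes], str],
    extract_module_info: Callable[[str], dict],
    summarize: Callable[[dict], dict],
) -> list[dict]:
    """Hypothetical stand-in for the flow: produces one row per input file."""
    rows = []
    for filename, content in files.items():
        markdown = parse_to_markdown(content)        # parse to markdown
        module_info = extract_module_info(markdown)  # extract structured data
        module_summary = summarize(module_info)      # add a summary
        rows.append(                                 # collect the data
            {
                "filename": filename,
                "module_info": module_info,
                "module_summary": module_summary,
            }
        )
    return rows  # the real flow exports these rows to a table


# Toy stand-ins so the sketch runs without a PDF parser or an LLM.
rows = run_flow(
    {"array.pdf": b"# array\nclass array\n"},
    parse_to_markdown=lambda content: content.decode("utf-8"),
    extract_module_info=lambda md: {
        "title": md.splitlines()[0].lstrip("# "),
        "classes": [ln for ln in md.splitlines() if ln.startswith("class ")],
        "methods": [],
    },
    summarize=lambda info: {
        "num_classes": len(info["classes"]),
        "num_methods": len(info["methods"]),
    },
)
```

Passing the steps in as callables mirrors the way the real flow lets you swap the parser or the LLM provider without changing the overall shape.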
+
## Prerequisites
-### Install Postgres
-If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
+- If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
+
+- [Download](https://ollama.com/download) and install Ollama. Then pull the LLM model you want to use, for example:
+ ```sh
+ ollama pull llama3.2
+ ```
+
+
+
+ CocoIndex has native support for Gemini, Ollama, and LiteLLM. You can choose your preferred LLM provider, and with a local provider such as Ollama you can work completely on-premises.
-### Install ollama
-Ollama allows you to run LLM models on your local machine easily. To get started:
+
-[Download](https://ollama.com/download) and install Ollama.
-Pull your favorite LLM models by the ollama pull command, e.g.
+## Add Source
+Let's add the Python manuals (PDF files) as a source.
+
+```python
+@cocoindex.flow_def(name="ManualExtraction")
+def manual_extraction_flow(
+    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
+):
+    """
+    Define an example flow that extracts manual information from a Markdown.
+    """
+    # binary=True: the PDF files are read as raw bytes rather than text.
+    data_scope["documents"] = flow_builder.add_source(
+        cocoindex.sources.LocalFile(path="manuals", binary=True)
+    )
+    # Collector that gathers the per-document results for export.
+    modules_index = data_scope.add_collector()
```
-ollama pull llama3.2
+
+`flow_builder.add_source` creates a table with the following sub-fields:
+- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
+- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file. Since this flow sets `binary=True`, `content` is `bytes` here.
+
+
+
+## Parse PDF to Markdown
+
+Ollama does not support PDF files directly as input, so we need to convert them to markdown first. To do this, we can plug in a custom function that converts PDF to markdown. Many parsers are available, both commercial and open source, so you can bring your own parser here.
+
+```python
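+import tempfile
+
+import cocoindex
+
+# PDF parsing helpers; these names come from the marker-pdf package.
+from marker.config.parser import ConfigParser
+from marker.converters.pdf import PdfConverter
+from marker.models import create_model_dict
+from marker.output import text_from_rendered
+
+
+class PdfToMarkdown(cocoindex.op.FunctionSpec):
+    """Convert a PDF to markdown."""
+
+
+@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
+class PdfToMarkdownExecutor:
+    """Executor for PdfToMarkdown."""
+
+    spec: PdfToMarkdown
+    _converter: PdfConverter
+
+    def prepare(self):
+        # Heavy setup: build the converter before any document is processed.
+        config_parser = ConfigParser({})
+        self._converter = PdfConverter(
+            create_model_dict(), config=config_parser.generate_config_dict()
+        )
+
+    def __call__(self, content: bytes) -> str:
+        # The converter takes a file path, so write the bytes to a temporary file.
+        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
+            temp_file.write(content)
+            temp_file.flush()
+            text, _, _ = text_from_rendered(self._converter(temp_file.name))
+            return text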
```
+You may wonder why we define a spec plus an executor here instead of using a standalone function. The main reason is that some heavy preparation work (initializing the parser) needs to be done before real data can be processed.
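
The benefit of separating preparation from execution can be illustrated without CocoIndex or a PDF parser. The sketch below assumes the framework calls `prepare` once before the first `__call__`. `ExpensiveParser` and `ParserExecutor` are hypothetical names used only for this illustration.

```python
class ExpensiveParser:
    """Hypothetical stand-in for a parser that is slow to construct."""

    instances_created = 0

    def __init__(self) -> None:
        # Count constructions so we can see how often the setup cost is paid.
        ExpensiveParser.instances_created += 1

    def parse(self, content: bytes) -> str:
        return content.decode("utf-8").upper()


class ParserExecutor:
    """Mimics the prepare/__call__ split: set up once, then process many inputs."""

    _parser: ExpensiveParser

    def prepare(self) -> None:
        # Heavy initialization happens here, a single time.
        self._parser = ExpensiveParser()

    def __call__(self, content: bytes) -> str:
        # Each call reuses the parser created in prepare().
        return self._parser.parse(content)


executor = ParserExecutor()
executor.prepare()
results = [executor(doc) for doc in (b"first", b"second", b"third")]
```

A standalone function would have to either rebuild the parser on every call or rely on module-level state, which is the cost the spec plus executor split avoids.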
+
-## Extract Structured Data from Markdown files
-### 1. Define output
-We are going to extract the following information from the Python Manuals as structured data.
+Plug the function into the flow.
+
+```python
+with data_scope["documents"].row() as doc:
+    doc["markdown"] = doc["content"].transform(PdfToMarkdown())
+```
+
+It transforms each document to markdown.
-So we are going to define the output data class as the following. The goal is to extract and populate `ModuleInfo`.
+
+## Extract Structured Data from Markdown files
+### Define schema
+Let's define the schema `ModuleInfo` using Python dataclasses. We then pass it to the LLM as the output type for the extracted structured data.
``` python
@dataclasses.dataclass
@@ -66,27 +143,9 @@ class ModuleInfo:
    methods: cocoindex.typing.List[MethodInfo]
```
-### 2. Define cocoIndex Flow
-Let's define the cocoIndex flow to extract the structured data from markdowns, which is super simple.
-
-First, let's add Python docs in markdown as a source. We will illustrate how to load PDF a few sections below.
-
-```python
-@cocoindex.flow_def(name="ManualExtraction")
-def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
- data_scope["documents"] = flow_builder.add_source(
- cocoindex.sources.LocalFile(path="markdown_files"))
-
- modules_index = data_scope.add_collector()
-```
-
-`flow_builder.add_source` will create a table with the following sub fields, see [documentation](https://cocoindex.io/docs/ops/sources) here.
-- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
-- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
-
-Then, let's extract the structured data from the markdown files. It is super easy, you just need to provide the LLM spec, and pass down the defined output type.
+### Extract structured data
-CocoIndex provides builtin functions (e.g. ExtractByLlm) that process data using LLM. We provide built-in support for Ollama, which allows you to run LLM models on your local machine easily. You can find the full list of models [here](https://ollama.com/library). We also support OpenAI API. You can find the full documentation and instructions [here](https://cocoindex.io/docs/ai/llm).
+CocoIndex provides built-in functions (e.g. `ExtractByLlm`) that process data using an LLM. This example uses Ollama.
```python
with data_scope["documents"].row() as doc:
@@ -101,71 +160,14 @@ with data_scope["documents"].row() as doc:
instruction="Please extract Python module information from the manual."))
```
-After the extraction, we just need to cherrypick anything we like from the output using the `collect` function from the collector of a data scope defined above.
-
-```python
-modules_index.collect(
- filename=doc["filename"],
- module_info=doc["module_info"],
-)
-```
-
-Finally, let's export the extracted data to a table.
-
-```python
-modules_index.export(
- "modules",
- cocoindex.storages.Postgres(table_name="modules_info"),
- primary_key_fields=["filename"],
-)
-```
-
-### 3. Query and test your index
-🎉 Now you are all set!
-
-Run the following command to setup and update the index.
-```sh
-cocoindex update -L main.py
-```
-You'll see the index updates state in the terminal
-After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell:
-
-```bash
-psql postgres://cocoindex:cocoindex@localhost/cocoindex
-```
-
-And run the SQL query:
-
-```sql
-SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
-```
-
-You can see the structured data extracted from the documents. Here's a screenshot of the extracted module information:
-
-
-### CocoInsight
-CocoInsight is a tool to help you understand your data pipeline and data index.
-CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://www.youtube.com/watch?v=ZnmyoHslBSc).
-
-#### 1. Run the CocoIndex server
-
-```sh
-cocoindex server -ci main.py
-```
-
-to see the CocoInsight dashboard https://cocoindex.io/cocoinsight. It connects to your local CocoIndex server with zero data retention.
-
-
+
-## Add Summary to the data
-Using cocoindex as framework, you can easily add any transformation on the data (including LLM summary), and collect it as part of the data index.
-For example, let's add some simple summary to each module - like number of classes and methods, using simple Python funciton.
+
-We will add a LLM example later.
-
-### 1. Define output
-First, let's add the structure we want as part of the output definition.
+## Add summarization to module info
+With CocoIndex as the framework, you can add any transformation to the data and collect the result as part of the data index. Let's add a simple summary to each module, such as the number of classes and methods, using a plain Python function.
+### Define schema
``` python
@dataclasses.dataclass
class ModuleSummary:
@@ -174,11 +176,8 @@ class ModuleSummary:
    num_methods: int
```
-### 2. Define cocoIndex Flow
-Next, let's define a custom function to summarize the data. You can see detailed documentation [here](https://cocoindex.io/docs/core/custom_function#option-1-by-a-standalone-function)
-
-
-``` python
+### A simple custom function to summarize the data
+```python
@cocoindex.op.function()
def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    """Summarize a Python module."""
@@ -186,89 +185,65 @@ def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )
-```
-
-### 3. Plug in the function into the flow
+```
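
To see what the summary function computes, here is a minimal sketch that uses plain dataclasses, so it runs without CocoIndex. The simplified `ClassInfo`, `MethodInfo`, and `ModuleInfo` definitions and the sample values are assumptions made for this illustration. The real schema has more fields, and the real function carries the `@cocoindex.op.function()` decorator.

```python
import dataclasses


@dataclasses.dataclass
class ClassInfo:
    """Simplified, hypothetical stand-in for the real class schema."""
    name: str


@dataclasses.dataclass
class MethodInfo:
    """Simplified, hypothetical stand-in for the real method schema."""
    name: str


@dataclasses.dataclass
class ModuleInfo:
    """Simplified, hypothetical stand-in for the real module schema."""
    title: str
    classes: list[ClassInfo]
    methods: list[MethodInfo]


@dataclasses.dataclass
class ModuleSummary:
    num_classes: int
    num_methods: int


def summarize_module_plain(module_info: ModuleInfo) -> ModuleSummary:
    """Same counting logic as summarize_module, without the decorator."""
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )


# Sample values invented for this sketch.
sample = ModuleInfo(
    title="array",
    classes=[ClassInfo(name="array")],
    methods=[MethodInfo(name="append"), MethodInfo(name="extend")],
)
summary = summarize_module_plain(sample)
```

Because the function only reads list lengths, it is cheap to run and easy to test in isolation before plugging it into the flow.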
-``` python
+### Plug in the function into the flow
+```python
with data_scope["documents"].row() as doc:
    # ... after the extraction
    doc["module_summary"] = doc["module_info"].transform(summarize_module)
```
-🎉 Now you are all set!
-
-Run the following command to setup and update the index.
-```sh
-cocoindex update --setup main.py
-```
+
-## Extract Structured Data from PDF files
-Ollama does not support PDF files directly as input, so we need to convert them to markdown first.
+
-To do this, we can plugin a custom function to convert PDF to markdown. See the full documentation [here](https://cocoindex.io/docs/core/custom_function).
+## Collect the data
-### 1. Define a function spec
-The function spec of a function configures behavior of a specific instance of the function.
+After the extraction and summarization, we cherry-pick the fields we want from the output using the `collect` function of the collector defined above.
-``` python
-class PdfToMarkdown(cocoindex.op.FunctionSpec):
- """Convert a PDF to markdown."""
+```python
+modules_index.collect(
+    filename=doc["filename"],
+    module_info=doc["module_info"],
+    module_summary=doc["module_summary"],
+)
```
-### 2. Define an executor class
-
-The executor class is a class that implements the function spec. It is responsible for the actual execution of the function.
-
-This class takes PDF content as bytes, saves it to a temporary file, and uses PdfConverter to extract the text content. The extracted text is then returned as a string, converting PDF to markdown format.
-
-It is associated with the function spec by `spec: PdfToMarkdown`.
-
-``` python
-@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
-class PdfToMarkdownExecutor:
- """Executor for PdfToMarkdown."""
-
- spec: PdfToMarkdown
- _converter: PdfConverter
+Finally, let's export the extracted data to a table.
- def prepare(self):
- config_parser = ConfigParser({})
- self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict())
+```python
+modules_index.export(
+    "modules",
+    cocoindex.storages.Postgres(table_name="modules_info"),
+    primary_key_fields=["filename"],
+)
+```
- def __call__(self, content: bytes) -> str:
- with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
- temp_file.write(content)
- temp_file.flush()
- text, _, _ = text_from_rendered(self._converter(temp_file.name))
- return text
+## Query and test your index
+Run the following command to set up and update the index.
+```sh
+cocoindex update -L main.py
```
-You may wonder why we want to define a spec + executor (instead of using a standalone function) here. The main reason is there're some heavy preparation work (initialize the parser) needs to be done before being ready to process real data.
+You'll see the index update state in the terminal.
-### 3. Plugin it to the flow
+After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell:
-``` python
- # Note the binary = True for PDF
- data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="manuals", binary=True))
- modules_index = data_scope.add_collector()
+```bash
+psql postgres://cocoindex:cocoindex@localhost/cocoindex
+```
- with data_scope["documents"].row() as doc:
- # plug in your custom function here
- doc["markdown"] = doc["content"].transform(PdfToMarkdown())
+And run the SQL query:
+```sql
+SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
```
-🎉 Now you are all set!
-
-Run the following command to setup and update the index.
+## CocoInsight
+[CocoInsight](https://www.youtube.com/watch?v=ZnmyoHslBSc) is a tool that helps you understand your data pipeline and data index. It is currently in Early Access (free). Start the CocoIndex server with:
```sh
-cocoindex update --setup main.py
+cocoindex server -ci main.py
```
+Then open the CocoInsight dashboard at `https://cocoindex.io/cocoinsight`. It connects to your local CocoIndex server with zero data retention.
-## Community
-
-We love to hear from the community! You can find us on [Github](https://github.com/cocoindex-io/cocoindex) and [Discord](https://discord.com/invite/zpA9S2DR7s).
-
-If you like this post and our work, please **⭐ star [Cocoindex on Github](https://github.com/cocoindex-io/cocoindex) to support us**. Thank you with a warm coconut hug 🥥🤗.
\ No newline at end of file
diff --git a/docs/static/img/examples/manual_extraction/cover.png b/docs/static/img/examples/manual_extraction/cover.png
index 3d8233117..2b5e77673 100644
Binary files a/docs/static/img/examples/manual_extraction/cover.png and b/docs/static/img/examples/manual_extraction/cover.png differ
diff --git a/docs/static/img/examples/manual_extraction/extraction.png b/docs/static/img/examples/manual_extraction/extraction.png
new file mode 100644
index 000000000..e2d2d0438
Binary files /dev/null and b/docs/static/img/examples/manual_extraction/extraction.png differ
diff --git a/docs/static/img/examples/manual_extraction/flow.png b/docs/static/img/examples/manual_extraction/flow.png
new file mode 100644
index 000000000..af7cf9706
Binary files /dev/null and b/docs/static/img/examples/manual_extraction/flow.png differ
diff --git a/docs/static/img/examples/manual_extraction/summary.png b/docs/static/img/examples/manual_extraction/summary.png
new file mode 100644
index 000000000..22999394f
Binary files /dev/null and b/docs/static/img/examples/manual_extraction/summary.png differ