Merged
41 commits
c4a6601
upgrade docusaurus version
badmonster0 Aug 21, 2025
369ac26
initial checkin
badmonster0 Aug 21, 2025
3b609d9
example documentation for custom targets
badmonster0 Aug 21, 2025
f44135f
Update custom_targets.md
badmonster0 Aug 21, 2025
7b045be
paper indexing
badmonster0 Aug 21, 2025
fda59b1
Update academic_papers_index.md
badmonster0 Aug 21, 2025
98eaa05
add example for knowledge graphs
badmonster0 Aug 21, 2025
7707e41
add examples for photo search / knowledge graph
badmonster0 Aug 21, 2025
b74a1ed
Create multi_format_index.md
badmonster0 Aug 21, 2025
2ddb232
Update multi_format_index.md
badmonster0 Aug 21, 2025
f18b84d
product recommendation example
badmonster0 Aug 21, 2025
145f488
Merge branch 'main' into examples
badmonster0 Aug 21, 2025
84a553a
Create manual_extraction.md
badmonster0 Aug 21, 2025
0ceda04
Create simple_text_embedding.md
badmonster0 Aug 21, 2025
57a61e2
Delete code_index.md
badmonster0 Aug 21, 2025
70e74a2
patient intake form
badmonster0 Aug 21, 2025
ed847f4
Create image_search.md
badmonster0 Aug 21, 2025
8ccf086
visual & images for examples
badmonster0 Aug 22, 2025
b72a49d
Merge branch 'main' into examples
badmonster0 Aug 22, 2025
e483a71
update example for semantic search 101
badmonster0 Aug 22, 2025
9eefa87
compress image
badmonster0 Aug 22, 2025
8966c05
Merge branch 'main' into examples
badmonster0 Aug 22, 2025
c6542bb
tags & images
badmonster0 Aug 22, 2025
b689d9e
Merge branch 'main' into examples
badmonster0 Aug 26, 2025
23b8130
polish codebase example docs
badmonster0 Aug 26, 2025
83a58b7
add flow overview to codebase example
badmonster0 Aug 26, 2025
2600706
add image to illustrate chunks
badmonster0 Aug 26, 2025
6c99025
Merge branch 'main' into examples
badmonster0 Aug 26, 2025
2d76b05
docs: custom target example
badmonster0 Aug 26, 2025
2c9a3ab
Merge branch 'main' into examples
badmonster0 Aug 26, 2025
d687b5d
docs: docs to knowledge graph, add image illustrations, reorganize ex…
badmonster0 Aug 26, 2025
bc33999
Merge branch 'main' into examples
badmonster0 Aug 26, 2025
1de530f
Merge branch 'main' into examples
badmonster0 Aug 27, 2025
78081ea
Merge branch 'main' into examples
badmonster0 Aug 27, 2025
2bb8792
docs: paper metadata extraction example
badmonster0 Aug 27, 2025
c4f23fb
docs: patient form extraction
badmonster0 Aug 27, 2025
0e0e641
Merge branch 'main' into examples
badmonster0 Aug 27, 2025
c13f000
docs: product recommendation example
badmonster0 Aug 27, 2025
9df543d
Merge branch 'main' into examples
badmonster0 Aug 27, 2025
e39fe0f
docs: ollama example
badmonster0 Aug 27, 2025
b5e4ade
Merge branch 'main' into examples
badmonster0 Aug 27, 2025
293 changes: 134 additions & 159 deletions docs/docs/examples/examples/manual_extraction.md
@@ -10,31 +10,108 @@ sidebar_custom_props:
tags: [structured-data-extraction, data-mapping]
---

import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';
import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/components/GitHubButton';

<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/manuals_llm_extraction"/>

## Overview
This example shows how to extract structured data from Python manuals using Ollama.

## Flow Overview
![Flow Overview](/img/examples/manual_extraction/flow.png)

- For each PDF file:
- Parse to markdown.
- Extract structured data from the markdown using LLM.
- Add summary to the module info.
- Collect the data.
- Export the data to a table.


## Prerequisites
### Install Postgres
If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
- If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).

- [Download](https://ollama.com/download) and install Ollama. Then pull the LLM model you want to use, for example:
```sh
ollama pull llama3.2
```

<DocumentationButton href="https://cocoindex.io/docs/ai/llm#ollama" text="Ollama" margin="0 0 16px 0" />

Alternatively, CocoIndex has native support for Gemini, Ollama, and LiteLLM. You can choose your favorite LLM provider and work completely on-premises.

### Install ollama
Ollama allows you to run LLM models on your local machine easily. To get started:
<DocumentationButton href="https://cocoindex.io/docs/ai/llm" text="LLM" margin="0 0 16px 0" />

[Download](https://ollama.com/download) and install Ollama.
Pull your favorite LLM models by the ollama pull command, e.g.
## Add Source
Let's add Python docs as a source.

```python
@cocoindex.flow_def(name="ManualExtraction")
def manual_extraction_flow(
flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
):
"""
Define an example flow that extracts manual information from a Markdown.
"""
data_scope["documents"] = flow_builder.add_source(
cocoindex.sources.LocalFile(path="manuals", binary=True)
)

modules_index = data_scope.add_collector()
```
ollama pull llama3.2

`flow_builder.add_source` will create a table with the following sub fields:
- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
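
To make the row shape concrete, here is a minimal plain-Python sketch that mimics the two sub fields listed above. It does not use CocoIndex: `list_local_files` is a hypothetical helper written for illustration, and the real `LocalFile` source may differ in details such as ordering or path handling.

```python
from pathlib import Path
from typing import Iterator, Union


def list_local_files(
    path: str, binary: bool = False
) -> Iterator[dict[str, Union[str, bytes]]]:
    """Hypothetical stand-in: yield one row per file with `filename` and `content`."""
    root = Path(path)
    for file in sorted(p for p in root.rglob("*") if p.is_file()):
        # `content` is bytes when binary=True, otherwise decoded text
        content = file.read_bytes() if binary else file.read_text(encoding="utf-8")
        # `filename` is relative to the source root, e.g. "dir1/file1.md"
        yield {"filename": file.relative_to(root).as_posix(), "content": content}
```

With `binary=True`, as in the flow above, each row carries raw PDF bytes, which is what a PDF parser needs as input.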

<DocumentationButton href="https://cocoindex.io/docs/ops/sources" text="LocalFile" margin="0 0 16px 0" />

## Parse Markdown

Since the source files are PDFs, we plug in a custom function to convert them to markdown. Many parsers are available, both commercial and open source, so you can bring your own parser here.

```python
class PdfToMarkdown(cocoindex.op.FunctionSpec):
"""Convert a PDF to markdown."""


@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
class PdfToMarkdownExecutor:
"""Executor for PdfToMarkdown."""

spec: PdfToMarkdown
_converter: PdfConverter

def prepare(self):
config_parser = ConfigParser({})
self._converter = PdfConverter(
create_model_dict(), config=config_parser.generate_config_dict()
)

def __call__(self, content: bytes) -> str:
with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
temp_file.write(content)
temp_file.flush()
text, _, _ = text_from_rendered(self._converter(temp_file.name))
return text
```
You may wonder why we define a spec + executor (instead of using a standalone function) here. The main reason is that some heavy preparation work (initializing the parser) needs to be done before the function is ready to process real data.
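
The benefit of separating preparation from execution can be sketched in plain Python, without CocoIndex or a real PDF parser. Everything below is hypothetical: `SlowConverter` stands in for an expensive-to-build parser, and `ConverterExecutor` only imitates the `prepare` / `__call__` split shown above.

```python
class SlowConverter:
    """Hypothetical stand-in for a parser that is expensive to construct."""

    instances_created = 0

    def __init__(self) -> None:
        # Imagine model loading happening here; we only count constructions.
        SlowConverter.instances_created += 1

    def convert(self, content: bytes) -> str:
        return content.decode("utf-8").upper()


class ConverterExecutor:
    """Imitates the executor shape: heavy setup in prepare(), cheap work per call."""

    _converter: SlowConverter

    def prepare(self) -> None:
        # Runs once, before any data is processed.
        self._converter = SlowConverter()

    def __call__(self, content: bytes) -> str:
        # Runs once per document and reuses the prepared converter.
        return self._converter.convert(content)


executor = ConverterExecutor()
executor.prepare()
results = [executor(doc) for doc in (b"first", b"second", b"third")]
```

Three documents are processed, but the converter is built only once. A standalone function would have to either rebuild the converter on every call or rely on a global.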

<DocumentationButton href="https://cocoindex.io/docs/custom_ops/custom_functions" text="Custom Function" margin="0 0 16px 0" />

## Extract Structured Data from Markdown files
### 1. Define output
We are going to extract the following information from the Python Manuals as structured data.
Plug the function into the flow.

```python
with data_scope["documents"].row() as doc:
doc["markdown"] = doc["content"].transform(PdfToMarkdown())
```

It transforms each document to markdown.

So we are going to define the output data class as the following. The goal is to extract and populate `ModuleInfo`.

## Extract Structured Data from Markdown files
### Define schema
Let's define the schema `ModuleInfo` using Python dataclasses, and we can pass it to the LLM to extract the structured data. It's easy to do this with CocoIndex.

``` python
@dataclasses.dataclass
@@ -66,27 +143,9 @@ class ModuleInfo:
methods: cocoindex.typing.List[MethodInfo]
```

### 2. Define cocoIndex Flow
Let's define the cocoIndex flow to extract the structured data from markdowns, which is super simple.

First, let's add Python docs in markdown as a source. We will illustrate how to load PDF a few sections below.

```python
@cocoindex.flow_def(name="ManualExtraction")
def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
data_scope["documents"] = flow_builder.add_source(
cocoindex.sources.LocalFile(path="markdown_files"))

modules_index = data_scope.add_collector()
```

`flow_builder.add_source` will create a table with the following sub fields, see [documentation](https://cocoindex.io/docs/ops/sources) here.
- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file

Then, let's extract the structured data from the markdown files. It is super easy, you just need to provide the LLM spec, and pass down the defined output type.
### Extract structured data

CocoIndex provides builtin functions (e.g. ExtractByLlm) that process data using LLM. We provide built-in support for Ollama, which allows you to run LLM models on your local machine easily. You can find the full list of models [here](https://ollama.com/library). We also support OpenAI API. You can find the full documentation and instructions [here](https://cocoindex.io/docs/ai/llm).
CocoIndex provides built-in functions (e.g. `ExtractByLlm`) that process data using an LLM. This example uses Ollama.

```python
with data_scope["documents"].row() as doc:
@@ -101,71 +160,14 @@ with data_scope["documents"].row() as doc:
instruction="Please extract Python module information from the manual."))
```

After the extraction, we just need to cherrypick anything we like from the output using the `collect` function from the collector of a data scope defined above.

```python
modules_index.collect(
filename=doc["filename"],
module_info=doc["module_info"],
)
```

Finally, let's export the extracted data to a table.

```python
modules_index.export(
"modules",
cocoindex.storages.Postgres(table_name="modules_info"),
primary_key_fields=["filename"],
)
```
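
As a rough mental model of the collect and export steps, here is a toy in-memory collector in plain Python. It is not the CocoIndex implementation: `ToyCollector` is hypothetical, and the choice to keep one row per primary key (a later row replaces an earlier one with the same key) is an assumption made for this sketch.

```python
from typing import Any


class ToyCollector:
    """Hypothetical in-memory stand-in for a collector."""

    def __init__(self) -> None:
        self.rows: list[dict[str, Any]] = []

    def collect(self, **fields: Any) -> None:
        # Each call records one row made of the named fields.
        self.rows.append(fields)

    def export(self, primary_key_fields: list[str]) -> dict[tuple, dict[str, Any]]:
        # Assumption for this sketch: one row per primary key, last write wins.
        table: dict[tuple, dict[str, Any]] = {}
        for row in self.rows:
            key = tuple(row[name] for name in primary_key_fields)
            table[key] = row
        return table


collector = ToyCollector()
collector.collect(filename="a.pdf", module_info={"title": "array"})
collector.collect(filename="b.pdf", module_info={"title": "json"})
collector.collect(filename="a.pdf", module_info={"title": "array (updated)"})
table = collector.export(primary_key_fields=["filename"])
```

Here `table` ends up with two entries, keyed by `filename`, which mirrors why `primary_key_fields=["filename"]` is passed to `export` in the flow.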

### 3. Query and test your index
🎉 Now you are all set!

Run the following command to setup and update the index.
```sh
cocoindex update -L main.py
```
You'll see the index updates state in the terminal
After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell:

```bash
psql postgres://cocoindex:cocoindex@localhost/cocoindex
```

And run the SQL query:

```sql
SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
```

You can see the structured data extracted from the documents. Here's a screenshot of the extracted module information:


### CocoInsight
CocoInsight is a tool to help you understand your data pipeline and data index.
CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://www.youtube.com/watch?v=ZnmyoHslBSc).

#### 1. Run the CocoIndex server

```sh
cocoindex server -ci main.py
```

to see the CocoInsight dashboard https://cocoindex.io/cocoinsight. It connects to your local CocoIndex server with zero data retention.


<DocumentationButton href="https://cocoindex.io/docs/core/functions#extractbyllm" text="ExtractByLlm" margin="0 0 16px 0" />

## Add Summary to the data
Using cocoindex as framework, you can easily add any transformation on the data (including LLM summary), and collect it as part of the data index.
For example, let's add some simple summary to each module - like number of classes and methods, using simple Python funciton.
![ExtractByLlm](/img/examples/manual_extraction/extraction.png)

We will add a LLM example later.

### 1. Define output
First, let's add the structure we want as part of the output definition.
## Add summarization to module info
Using CocoIndex as the framework, you can easily add any transformation to the data and collect it as part of the data index. Let's add a simple summary to each module, such as the number of classes and methods, using a simple Python function.

### Define Schema
``` python
@dataclasses.dataclass
class ModuleSummary:
@@ -174,101 +176,74 @@ class ModuleSummary:
num_methods: int
```

### 2. Define cocoIndex Flow
Next, let's define a custom function to summarize the data. You can see detailed documentation [here](https://cocoindex.io/docs/core/custom_function#option-1-by-a-standalone-function)


``` python
### A simple custom function to summarize the data
```python
@cocoindex.op.function()
def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
"""Summarize a Python module."""
return ModuleSummary(
num_classes=len(module_info.classes),
num_methods=len(module_info.methods),
)
```

### 3. Plug in the function into the flow
```

``` python
### Plug in the function into the flow
```python
with data_scope["documents"].row() as doc:
# ... after the extraction
doc["module_summary"] = doc["module_info"].transform(summarize_module)
```

🎉 Now you are all set!

Run the following command to setup and update the index.
```sh
cocoindex update --setup main.py
```
<DocumentationButton href="https://cocoindex.io/docs/custom_ops/custom_functions" text="Custom Function" margin="0 0 16px 0" />

## Extract Structured Data from PDF files
Ollama does not support PDF files directly as input, so we need to convert them to markdown first.
![Summarize Module](/img/examples/manual_extraction/summary.png)
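
The summarization logic can be tried on its own, outside a flow. The sketch below is a plain-Python approximation: `ModuleInfoStandIn` is a hypothetical, simplified stand-in for `ModuleInfo` (it uses plain strings where the real schema uses nested dataclasses), and `summarize_module_plain` omits the `@cocoindex.op.function()` decorator.

```python
import dataclasses


@dataclasses.dataclass
class ModuleInfoStandIn:
    """Hypothetical simplified stand-in for ModuleInfo."""

    title: str
    classes: list[str]
    methods: list[str]


@dataclasses.dataclass
class ModuleSummary:
    num_classes: int
    num_methods: int


def summarize_module_plain(module_info: ModuleInfoStandIn) -> ModuleSummary:
    """Count classes and methods, mirroring the custom function above."""
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )


info = ModuleInfoStandIn(
    title="array",
    classes=["array", "ArrayType"],
    methods=["append", "extend", "pop"],
)
summary = summarize_module_plain(info)
```

For this input, `summary` holds 2 classes and 3 methods. Keeping the function free of side effects like this makes it easy to unit test before plugging it into the flow.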

To do this, we can plugin a custom function to convert PDF to markdown. See the full documentation [here](https://cocoindex.io/docs/core/custom_function).
## Collect the data

### 1. Define a function spec

The function spec of a function configures behavior of a specific instance of the function.
After the extraction, we cherry-pick the fields we want from the output using the `collect` function of the collector defined on the data scope above.

``` python
class PdfToMarkdown(cocoindex.op.FunctionSpec):
"""Convert a PDF to markdown."""
```python
modules_index.collect(
filename=doc["filename"],
module_info=doc["module_info"],
)
```

### 2. Define an executor class

The executor class is a class that implements the function spec. It is responsible for the actual execution of the function.

This class takes PDF content as bytes, saves it to a temporary file, and uses PdfConverter to extract the text content. The extracted text is then returned as a string, converting PDF to markdown format.

It is associated with the function spec by `spec: PdfToMarkdown`.

``` python
@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
class PdfToMarkdownExecutor:
"""Executor for PdfToMarkdown."""

spec: PdfToMarkdown
_converter: PdfConverter
Finally, let's export the extracted data to a table.

def prepare(self):
config_parser = ConfigParser({})
self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict())
```python
modules_index.export(
"modules",
cocoindex.storages.Postgres(table_name="modules_info"),
primary_key_fields=["filename"],
)
```

def __call__(self, content: bytes) -> str:
with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
temp_file.write(content)
temp_file.flush()
text, _, _ = text_from_rendered(self._converter(temp_file.name))
return text
## Query and test your index
Run the following command to set up and update the index.
```sh
cocoindex update -L main.py
```
You may wonder why we define a spec + executor (instead of using a standalone function) here. The main reason is that some heavy preparation work (initializing the parser) needs to be done before the function is ready to process real data.
You'll see the index update state in the terminal.

### 3. Plugin it to the flow
After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell:

``` python
# Note the binary = True for PDF
data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="manuals", binary=True))
modules_index = data_scope.add_collector()
```bash
psql postgres://cocoindex:cocoindex@localhost/cocoindex
```

with data_scope["documents"].row() as doc:
# plug in your custom function here
doc["markdown"] = doc["content"].transform(PdfToMarkdown())
And run the SQL query:

```sql
SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
```

🎉 Now you are all set!

Run the following command to setup and update the index.
## CocoInsight
[CocoInsight](https://www.youtube.com/watch?v=ZnmyoHslBSc) is a really cool tool to help you understand your data pipeline and data index. It is in Early Access now (Free).

```sh
cocoindex update --setup main.py
cocoindex server -ci main.py
```
The CocoInsight dashboard is at `https://cocoindex.io/cocoinsight`. It connects to your local CocoIndex server with zero data retention.

## Community

We love to hear from the community! You can find us on [GitHub](https://github.com/cocoindex-io/cocoindex) and [Discord](https://discord.com/invite/zpA9S2DR7s).

If you like this post and our work, please **⭐ star [CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) to support us**. Thank you with a warm coconut hug 🥥🤗.
Binary file modified docs/static/img/examples/manual_extraction/cover.png