Commit e39fe0f

docs: ollama example
1 parent 9df543d commit e39fe0f

5 files changed

+134
-159
lines changed

docs/docs/examples/examples/manual_extraction.md

Lines changed: 134 additions & 159 deletions
@@ -10,31 +10,108 @@ sidebar_custom_props:
1010
tags: [structured-data-extraction, data-mapping]
1111
---
1212

13-
import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';
13+
import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/components/GitHubButton';
1414

1515
<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/manuals_llm_extraction"/>
1616

17+
## Overview
18+
This example shows how to extract structured data from Python Manuals using Ollama.
19+
20+
## Flow Overview
21+
![Flow Overview](/img/examples/manual_extraction/flow.png)
22+
23+
- For each PDF file:
24+
- Parse to markdown.
25+
- Extract structured data from the markdown using LLM.
26+
- Add summary to the module info.
27+
- Collect the data.
28+
- Export the data to a table.
29+
1730

1831
## Prerequisites
19-
### Install Postgres
20-
If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
32+
- If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
33+
34+
- [Download](https://ollama.com/download) and install Ollama. Pull your preferred LLM model, for example:
35+
```sh
36+
ollama pull llama3.2
37+
```
38+
39+
<DocumentationButton href="https://cocoindex.io/docs/ai/llm#ollama" text="Ollama" margin="0 0 16px 0" />
40+
41+
Alternatively, CocoIndex has native support for Gemini, Ollama, and LiteLLM. You can choose your favorite LLM provider and work completely on-premises.
2142

22-
### Install ollama
23-
Ollama allows you to run LLM models on your local machine easily. To get started:
43+
<DocumentationButton href="https://cocoindex.io/docs/ai/llm" text="LLM" margin="0 0 16px 0" />
2444

25-
[Download](https://ollama.com/download) and install Ollama.
26-
Pull your favorite LLM models by the ollama pull command, e.g.
45+
## Add Source
46+
Let's add Python docs as a source.
47+
48+
```python
49+
@cocoindex.flow_def(name="ManualExtraction")
50+
def manual_extraction_flow(
51+
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
52+
):
53+
    """
54+
    Define an example flow that extracts manual information from a Markdown.
55+
    """
56+
    data_scope["documents"] = flow_builder.add_source(
57+
        cocoindex.sources.LocalFile(path="manuals", binary=True)
58+
    )
2759
60+
    modules_index = data_scope.add_collector()
2861
```
29-
ollama pull llama3.2
62+
63+
`flow_builder.add_source` will create a table with the following sub fields:
64+
- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
65+
- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
66+
67+
<DocumentationButton href="https://cocoindex.io/docs/ops/sources" text="LocalFile" margin="0 0 16px 0" />
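
To make the row shape concrete, here is a minimal plain-Python sketch of what such a table could look like. It does not use CocoIndex: the helper `list_local_files` is hypothetical and only mimics the `filename` and `content` fields described above, including the effect of the `binary` switch on the content type.

```python
import os
import tempfile


def list_local_files(path: str, binary: bool = False) -> list[dict]:
    """Hypothetical stand-in: one row per file with `filename` and `content`."""
    rows = []
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            full_path = os.path.join(root, name)
            # `filename` is relative to the source directory, e.g. dir1/file1.md
            rel_name = os.path.relpath(full_path, path).replace(os.sep, "/")
            mode = "rb" if binary else "r"
            with open(full_path, mode) as f:
                content = f.read()  # bytes when binary=True, otherwise str
            rows.append({"filename": rel_name, "content": content})
    return rows


# Build a tiny throwaway directory to demonstrate both modes.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "dir1"))
    with open(os.path.join(tmp, "dir1", "file1.md"), "w") as f:
        f.write("# Title\n")
    text_rows = list_local_files(tmp, binary=False)
    binary_rows = list_local_files(tmp, binary=True)
```

PDFs are not text, which is why the source in this example is declared with `binary=True` and the downstream function receives `bytes`.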
68+
69+
## Parse Markdown
70+
71+
To parse the PDFs, we can plug in a custom function that converts PDF to markdown. Many parsers are available, both commercial and open source, so you can bring your own parser here.
72+
73+
```python
74+
class PdfToMarkdown(cocoindex.op.FunctionSpec):
75+
    """Convert a PDF to markdown."""
76+
77+
78+
@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
79+
class PdfToMarkdownExecutor:
80+
    """Executor for PdfToMarkdown."""
81+
82+
    spec: PdfToMarkdown
83+
    _converter: PdfConverter
84+
85+
    def prepare(self):
86+
        config_parser = ConfigParser({})
87+
        self._converter = PdfConverter(
88+
            create_model_dict(), config=config_parser.generate_config_dict()
89+
        )
90+
91+
    def __call__(self, content: bytes) -> str:
92+
        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
93+
            temp_file.write(content)
94+
            temp_file.flush()
95+
            text, _, _ = text_from_rendered(self._converter(temp_file.name))
96+
            return text
3097
```
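
The `__call__` method above bridges in-memory bytes to a converter that is called with a file path. A minimal sketch of that temp-file pattern in isolation, with a hypothetical `count_bytes_at` function standing in for the path-based converter:

```python
import tempfile


def count_bytes_at(path: str) -> int:
    """Hypothetical path-only API standing in for a PDF converter."""
    with open(path, "rb") as f:
        return len(f.read())


def process_bytes(content: bytes) -> int:
    # The temporary file is deleted automatically when the `with` block exits.
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
        temp_file.write(content)
        # Flush so the bytes are on disk before the file is reopened by name.
        temp_file.flush()
        return count_bytes_at(temp_file.name)


size = process_bytes(b"%PDF-1.4 fake")
```

Reopening a `NamedTemporaryFile` by name while it is still open works on POSIX systems; on Windows it generally fails with `delete=True`, so a different cleanup strategy would be needed there.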
98+
You may wonder why we define a spec + executor (instead of using a standalone function) here. The main reason is that some heavy preparation work (initializing the parser) needs to be done before the function is ready to process real data.
3199

100+
<DocumentationButton href="https://cocoindex.io/docs/custom_ops/custom_functions" text="Custom Function" margin="0 0 16px 0" />
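
The benefit of separating preparation from execution can be seen in a framework-free sketch. The classes below are hypothetical and do not use CocoIndex; they only illustrate the idea of doing the expensive setup once in `prepare()` and reusing it for every call.

```python
class ExpensiveConverter:
    """Stand-in for a heavy parser; counts how often it is constructed."""

    instances = 0

    def __init__(self) -> None:
        ExpensiveConverter.instances += 1

    def convert(self, content: bytes) -> str:
        return content.decode("utf-8").upper()


class ConverterExecutor:
    """Mimics the prepare()/__call__() split: set up once, then reuse."""

    def prepare(self) -> None:
        # Heavy initialization happens here, not on every call.
        self._converter = ExpensiveConverter()

    def __call__(self, content: bytes) -> str:
        return self._converter.convert(content)


executor = ConverterExecutor()
executor.prepare()
results = [executor(doc) for doc in (b"a", b"b", b"c")]
```

Three documents are processed, but the converter is constructed only once. A standalone function would have to either rebuild the converter per call or rely on module-level state.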
32101

33-
## Extract Structured Data from Markdown files
34-
### 1. Define output
35-
We are going to extract the following information from the Python Manuals as structured data.
102+
Plug the function into the flow.
103+
104+
```python
105+
with data_scope["documents"].row() as doc:
106+
    doc["markdown"] = doc["content"].transform(PdfToMarkdown())
107+
```
108+
109+
It transforms each document to markdown.
36110

37-
So we are going to define the output data class as the following. The goal is to extract and populate `ModuleInfo`.
111+
112+
## Extract Structured Data from Markdown files
113+
### Define schema
114+
Let's define the schema `ModuleInfo` using Python dataclasses, and we can pass it to the LLM to extract the structured data. It's easy to do this with CocoIndex.
38115

39116
``` python
40117
@dataclasses.dataclass
@@ -66,27 +143,9 @@ class ModuleInfo:
66143
    methods: cocoindex.typing.List[MethodInfo]
67144
```
68145

69-
### 2. Define cocoIndex Flow
70-
Let's define the cocoIndex flow to extract the structured data from markdowns, which is super simple.
71-
72-
First, let's add Python docs in markdown as a source. We will illustrate how to load PDF a few sections below.
73-
74-
```python
75-
@cocoindex.flow_def(name="ManualExtraction")
76-
def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
77-
data_scope["documents"] = flow_builder.add_source(
78-
cocoindex.sources.LocalFile(path="markdown_files"))
79-
80-
modules_index = data_scope.add_collector()
81-
```
82-
83-
`flow_builder.add_source` will create a table with the following sub fields, see [documentation](https://cocoindex.io/docs/ops/sources) here.
84-
- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
85-
- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
86-
87-
Then, let's extract the structured data from the markdown files. It is super easy, you just need to provide the LLM spec, and pass down the defined output type.
146+
### Extract structured data
88147

89-
CocoIndex provides builtin functions (e.g. ExtractByLlm) that process data using LLM. We provide built-in support for Ollama, which allows you to run LLM models on your local machine easily. You can find the full list of models [here](https://ollama.com/library). We also support OpenAI API. You can find the full documentation and instructions [here](https://cocoindex.io/docs/ai/llm).
148+
CocoIndex provides built-in functions (e.g. `ExtractByLlm`) that process data using an LLM. This example uses Ollama.
90149

91150
```python
92151
with data_scope["documents"].row() as doc:
@@ -101,71 +160,14 @@ with data_scope["documents"].row() as doc:
101160
instruction="Please extract Python module information from the manual."))
102161
```
103162

104-
After the extraction, we just need to cherrypick anything we like from the output using the `collect` function from the collector of a data scope defined above.
105-
106-
```python
107-
modules_index.collect(
108-
filename=doc["filename"],
109-
module_info=doc["module_info"],
110-
)
111-
```
112-
113-
Finally, let's export the extracted data to a table.
114-
115-
```python
116-
modules_index.export(
117-
"modules",
118-
cocoindex.storages.Postgres(table_name="modules_info"),
119-
primary_key_fields=["filename"],
120-
)
121-
```
122-
123-
### 3. Query and test your index
124-
🎉 Now you are all set!
125-
126-
Run the following command to setup and update the index.
127-
```sh
128-
cocoindex update -L main.py
129-
```
130-
You'll see the index updates state in the terminal
131-
After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell:
132-
133-
```bash
134-
psql postgres://cocoindex:cocoindex@localhost/cocoindex
135-
```
136-
137-
And run the SQL query:
138-
139-
```sql
140-
SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
141-
```
142-
143-
You can see the structured data extracted from the documents. Here's a screenshot of the extracted module information:
144-
145-
146-
### CocoInsight
147-
CocoInsight is a tool to help you understand your data pipeline and data index.
148-
CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://www.youtube.com/watch?v=ZnmyoHslBSc).
149-
150-
#### 1. Run the CocoIndex server
151-
152-
```sh
153-
cocoindex server -ci main.py
154-
```
155-
156-
to see the CocoInsight dashboard https://cocoindex.io/cocoinsight. It connects to your local CocoIndex server with zero data retention.
157-
158-
163+
<DocumentationButton href="https://cocoindex.io/docs/core/functions#extractbyllm" text="ExtractByLlm" margin="0 0 16px 0" />
159164

160-
## Add Summary to the data
161-
Using cocoindex as framework, you can easily add any transformation on the data (including LLM summary), and collect it as part of the data index.
162-
For example, let's add some simple summary to each module - like number of classes and methods, using simple Python funciton.
165+
![ExtractByLlm](/img/examples/manual_extraction/extraction.png)
163166

164-
We will add a LLM example later.
165-
166-
### 1. Define output
167-
First, let's add the structure we want as part of the output definition.
167+
## Add summarization to module info
168+
Using CocoIndex as a framework, you can easily add any transformation to the data and collect it as part of the data index. Let's add a simple summary to each module, such as the number of classes and methods, using a simple Python function.
168169
170+
### Define Schema
169171
``` python
170172
@dataclasses.dataclass
171173
class ModuleSummary:
@@ -174,101 +176,74 @@ class ModuleSummary:
174176
    num_methods: int
175177
```
176178
177-
### 2. Define cocoIndex Flow
178-
Next, let's define a custom function to summarize the data. You can see detailed documentation [here](https://cocoindex.io/docs/core/custom_function#option-1-by-a-standalone-function)
179-
180-
181-
``` python
179+
### A simple custom function to summarize the data
180+
```python
182181
@cocoindex.op.function()
183182
def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
184183
    """Summarize a Python module."""
185184
    return ModuleSummary(
186185
        num_classes=len(module_info.classes),
187186
        num_methods=len(module_info.methods),
188187
    )
189-
```
190-
191-
### 3. Plug in the function into the flow
188+
```
192189
193-
``` python
190+
### Plug the function into the flow
191+
```python
194192
with data_scope["documents"].row() as doc:
195193
    # ... after the extraction
196194
    doc["module_summary"] = doc["module_info"].transform(summarize_module)
197195
```
198196
199-
🎉 Now you are all set!
200-
201-
Run the following command to setup and update the index.
202-
```sh
203-
cocoindex update --setup main.py
204-
```
197+
<DocumentationButton href="https://cocoindex.io/docs/custom_ops/custom_functions" text="Custom Function" margin="0 0 16px 0" />
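
To see what the summary computes without running the flow, here is a standalone sketch. It omits the `@cocoindex.op.function()` decorator, and the `ModuleInfo` below is a simplified, hypothetical stand-in whose `classes` and `methods` are plain lists of names rather than the nested dataclasses used in the example.

```python
import dataclasses


@dataclasses.dataclass
class ModuleInfo:
    """Simplified stand-in; the schema in the example has richer nested types."""

    title: str
    classes: list[str]
    methods: list[str]


@dataclasses.dataclass
class ModuleSummary:
    num_classes: int
    num_methods: int


def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    """Count the classes and methods extracted for a module."""
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )


info = ModuleInfo(title="array", classes=["array"], methods=["append", "extend"])
summary = summarize_module(info)
```

Because the function is pure and depends only on its input, it can be unit-tested like this before being plugged into the flow.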
205198
206-
## Extract Structured Data from PDF files
207-
Ollama does not support PDF files directly as input, so we need to convert them to markdown first.
199+
![Summarize Module](/img/examples/manual_extraction/summary.png)
208200
209-
To do this, we can plugin a custom function to convert PDF to markdown. See the full documentation [here](https://cocoindex.io/docs/core/custom_function).
201+
## Collect the data
210202
211-
### 1. Define a function spec
212203
213-
The function spec of a function configures behavior of a specific instance of the function.
204+
After the extraction, we cherry-pick the fields we want from the output using the `collect` function of the collector defined on the data scope above.
214205
215-
``` python
216-
class PdfToMarkdown(cocoindex.op.FunctionSpec):
217-
"""Convert a PDF to markdown."""
206+
```python
207+
modules_index.collect(
208+
    filename=doc["filename"],
209+
    module_info=doc["module_info"],
210+
)
218211
```
219212
220-
### 2. Define an executor class
221-
222-
The executor class is a class that implements the function spec. It is responsible for the actual execution of the function.
223-
224-
This class takes PDF content as bytes, saves it to a temporary file, and uses PdfConverter to extract the text content. The extracted text is then returned as a string, converting PDF to markdown format.
225-
226-
It is associated with the function spec by `spec: PdfToMarkdown`.
227-
228-
``` python
229-
@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
230-
class PdfToMarkdownExecutor:
231-
"""Executor for PdfToMarkdown."""
232-
233-
spec: PdfToMarkdown
234-
_converter: PdfConverter
213+
Finally, let's export the extracted data to a table.
235214

236-
def prepare(self):
237-
config_parser = ConfigParser({})
238-
self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict())
215+
```python
216+
modules_index.export(
217+
    "modules",
218+
    cocoindex.storages.Postgres(table_name="modules_info"),
219+
    primary_key_fields=["filename"],
220+
)
221+
```
239222

240-
def __call__(self, content: bytes) -> str:
241-
with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
242-
temp_file.write(content)
243-
temp_file.flush()
244-
text, _, _ = text_from_rendered(self._converter(temp_file.name))
245-
return text
223+
## Query and test your index
224+
Run the following command to set up and update the index.
225+
```sh
226+
cocoindex update -L main.py
246227
```
247-
You may wonder why we want to define a spec + executor (instead of using a standalone function) here. The main reason is there're some heavy preparation work (initialize the parser) needs to be done before being ready to process real data.
228+
You'll see the index update state in the terminal.
248229
249-
### 3. Plugin it to the flow
230+
After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell:
250231
251-
``` python
252-
# Note the binary = True for PDF
253-
data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="manuals", binary=True))
254-
modules_index = data_scope.add_collector()
232+
```bash
233+
psql postgres://cocoindex:cocoindex@localhost/cocoindex
234+
```
255235
256-
with data_scope["documents"].row() as doc:
257-
# plug in your custom function here
258-
doc["markdown"] = doc["content"].transform(PdfToMarkdown())
236+
And run the SQL query:
259237
238+
```sql
239+
SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
260240
```
261241
262-
🎉 Now you are all set!
263-
264-
Run the following command to setup and update the index.
242+
## CocoInsight
243+
[CocoInsight](https://www.youtube.com/watch?v=ZnmyoHslBSc) is a really cool tool to help you understand your data pipeline and data index. It is in Early Access now (Free).
265244
266245
```sh
267-
cocoindex update --setup main.py
246+
cocoindex server -ci main.py
268247
```
248+
The CocoInsight dashboard is at `https://cocoindex.io/cocoinsight`. It connects to your local CocoIndex server with zero data retention.
269249
270-
## Community
271-
272-
We love to hear from the community! You can find us on [Github](https://github.com/cocoindex-io/cocoindex) and [Discord](https://discord.com/invite/zpA9S2DR7s).
273-
274-
If you like this post and our work, please **⭐ star [Cocoindex on Github](https://github.com/cocoindex-io/cocoindex) to support us**. Thank you with a warm coconut hug 🥥🤗.
