To do this, we can plug in a custom function to convert PDF to markdown. Many parsers are available, both commercial and open source, and you can bring your own here.
You may wonder why we define a spec + executor here instead of using a standalone function. The main reason is that some heavy preparation work (initializing the parser) needs to be done before we are ready to process real data.
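The spec + executor split can be sketched in plain Python. This is not the CocoIndex API: `ParserSpec`, `ParserExecutor`, `prepare`, and `load_count` are hypothetical names, used only to show that the expensive setup runs once and is then reused for every input.

```python
import dataclasses


@dataclasses.dataclass
class ParserSpec:
    """Hypothetical spec: configuration only, cheap to create."""
    model_name: str = "demo-parser"


class ParserExecutor:
    """Hypothetical executor: owns the expensive state built from the spec."""

    def __init__(self, spec: ParserSpec):
        self.spec = spec
        self.parser = None
        self.load_count = 0  # counts how often the heavy setup ran

    def prepare(self) -> None:
        # Stand-in for heavy work such as loading parser models.
        self.load_count += 1
        self.parser = lambda data: f"[{self.spec.model_name}] {len(data)} bytes"

    def __call__(self, content: bytes) -> str:
        if self.parser is None:
            raise RuntimeError("prepare() must run before processing data")
        return self.parser(content)


executor = ParserExecutor(ParserSpec())
executor.prepare()  # heavy setup happens once
results = [executor(b"abc"), executor(b"hello")]  # reused for every input
```

A standalone function would have to either rebuild the parser on every call or hide it in a global; keeping it on the executor makes the lifetime explicit.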
We define the output data class as follows. The goal is to extract and populate `ModuleInfo`.
111
+
112
+
## Extract Structured Data from Markdown files
113
+
### Define schema
114
+
Let's define the schema `ModuleInfo` using Python dataclasses. With CocoIndex, we can pass it to the LLM as the output type for the structured data we want to extract.
38
115
39
116
``` python
40
117
@dataclasses.dataclass
@@ -66,27 +143,9 @@ class ModuleInfo:
66
143
    methods: cocoindex.typing.List[MethodInfo]
67
144
```
68
145
69
-
### 2. Define CocoIndex Flow
70
-
Let's define the CocoIndex flow to extract the structured data from the markdown files.
71
-
72
-
First, let's add the Python docs in markdown as a source. We will illustrate how to load PDFs a few sections below.
`flow_builder.add_source` will create a table with the following sub fields (see the [documentation](https://cocoindex.io/docs/ops/sources)):
84
-
- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
85
-
- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
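To make the row shape concrete, here is a minimal plain-Python sketch of a table with these two fields. It does not call CocoIndex; `load_rows` is a hypothetical helper that only mimics the `filename` key and the `str`/`bytes` switch described above.

```python
import pathlib
import tempfile


def load_rows(root: pathlib.Path, binary: bool = False) -> list[dict]:
    """Hypothetical stand-in: one row per markdown file, keyed by relative filename."""
    rows = []
    for path in sorted(root.rglob("*.md")):
        # `content` is bytes when binary=True, otherwise decoded text.
        content = path.read_bytes() if binary else path.read_text(encoding="utf-8")
        rows.append({"filename": path.relative_to(root).as_posix(), "content": content})
    return rows


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "dir1").mkdir()
    (root / "dir1" / "file1.md").write_text("# Title\n", encoding="utf-8")
    text_rows = load_rows(root)
    binary_rows = load_rows(root, binary=True)
```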
86
-
87
-
Then, let's extract the structured data from the markdown files. You only need to provide the LLM spec and pass down the defined output type.
146
+
### Extract structured data
88
147
89
-
CocoIndex provides built-in functions (e.g. `ExtractByLlm`) that process data using an LLM. We provide built-in support for Ollama, which lets you run LLMs on your local machine easily. You can find the full list of models [here](https://ollama.com/library). We also support the OpenAI API. You can find the full documentation and instructions [here](https://cocoindex.io/docs/ai/llm).
148
+
CocoIndex provides built-in functions (e.g. `ExtractByLlm`) that process data using an LLM. This example uses Ollama.
90
149
91
150
```python
92
151
with data_scope["documents"].row() as doc:
@@ -101,71 +160,14 @@ with data_scope["documents"].row() as doc:
101
160
instruction="Please extract Python module information from the manual."))
102
161
```
103
162
104
-
After the extraction, we just need to cherry-pick the fields we want from the output using the `collect` function from the collector of the data scope defined above.
105
-
106
-
```python
107
-
modules_index.collect(
108
-
filename=doc["filename"],
109
-
module_info=doc["module_info"],
110
-
)
111
-
```
112
-
113
-
Finally, let's export the extracted data to a table.
SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
141
-
```
142
-
143
-
You can see the structured data extracted from the documents. Here's a screenshot of the extracted module information:
144
-
145
-
146
-
### CocoInsight
147
-
CocoInsight is a tool to help you understand your data pipeline and data index.
148
-
CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3-minute video tutorial about CocoInsight: [Watch on YouTube](https://www.youtube.com/watch?v=ZnmyoHslBSc).
149
-
150
-
#### 1. Run the CocoIndex server
151
-
152
-
```sh
153
-
cocoindex server -ci main.py
154
-
```
155
-
156
-
to see the CocoInsight dashboard https://cocoindex.io/cocoinsight. It connects to your local CocoIndex server with zero data retention.
First, let's add the structure we want as part of the output definition.
167
+
## Add summarization to module info
168
+
Using CocoIndex as a framework, you can easily add any transformation on the data and collect it as part of the data index. Let's add a simple summary to each module, such as the number of classes and methods, using a simple Python function.
168
169
170
+
### Define Schema
169
171
``` python
170
172
@dataclasses.dataclass
171
173
class ModuleSummary:
@@ -174,101 +176,74 @@ class ModuleSummary:
174
176
    num_methods: int
175
177
```
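The counting logic behind such a summary can be sketched in plain Python. The classes below are trimmed-down, hypothetical stand-ins (`ModuleInfoSketch`, `ModuleSummarySketch`), and the `num_classes` field is assumed from the "number of classes and methods" description; the sketch also leaves out how the function is registered with CocoIndex.

```python
import dataclasses


@dataclasses.dataclass
class ModuleInfoSketch:
    """Hypothetical, trimmed-down stand-in for ModuleInfo."""
    title: str
    classes: list
    methods: list


@dataclasses.dataclass
class ModuleSummarySketch:
    """Hypothetical stand-in for ModuleSummary (num_classes is assumed)."""
    num_classes: int
    num_methods: int


def summarize_module(info: ModuleInfoSketch) -> ModuleSummarySketch:
    """Count the classes and methods extracted for one module."""
    return ModuleSummarySketch(
        num_classes=len(info.classes),
        num_methods=len(info.methods),
    )


summary = summarize_module(
    ModuleInfoSketch(title="array", classes=["array"], methods=["append", "extend"])
)
```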
176
178
177
-
### 2. Define CocoIndex Flow
178
-
Next, let's define a custom function to summarize the data. You can see the detailed documentation [here](https://cocoindex.io/docs/core/custom_function#option-1-by-a-standalone-function).
179
-
180
-
181
-
```python
179
+
### A simple custom function to summarize the data
To do this, we can plug in a custom function to convert PDF to markdown. See the full documentation [here](https://cocoindex.io/docs/core/custom_function).
201
+
## Collect the data
210
202
211
-
### 1. Define a function spec
212
203
213
-
The function spec of a function configures behavior of a specific instance of the function.
204
+
After the extraction, we need to cherry-pick the fields we want from the output using the `collect` function from the collector of the data scope defined above.
214
205
215
-
```python
216
-
class PdfToMarkdown(cocoindex.op.FunctionSpec):
217
-
    """Convert a PDF to markdown."""
206
+
```python
207
+
modules_index.collect(
208
+
filename=doc["filename"],
209
+
module_info=doc["module_info"],
210
+
)
218
211
```
219
212
220
-
### 2. Define an executor class
221
-
222
-
The executor class is a class that implements the function spec. It is responsible for the actual execution of the function.
223
-
224
-
This class takes PDF content as bytes, saves it to a temporary file, and uses PdfConverter to extract the text content. The extracted text is then returned as a string, converting PDF to markdown format.
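The bytes-to-temporary-file step can be sketched independently of any parser. In this minimal sketch, `pdf_bytes_to_markdown` and `fake_convert` are hypothetical names; `fake_convert` stands in for a real path-based converter and does not parse the PDF at all.

```python
import os
import tempfile
from typing import Callable


def pdf_bytes_to_markdown(content: bytes, convert: Callable[[str], str]) -> str:
    """Write PDF bytes to a temporary file and run a path-based converter on it."""
    # delete=False so the file can be reopened by path on every platform.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(content)
        tmp_path = tmp.name
    try:
        return convert(tmp_path)
    finally:
        os.remove(tmp_path)  # always clean up the temporary file


def fake_convert(path: str) -> str:
    """Hypothetical converter: reports the file size instead of parsing the PDF."""
    return f"# Document\n\n{os.path.getsize(path)} bytes"


markdown = pdf_bytes_to_markdown(b"%PDF-1.4 demo", fake_convert)
```

Passing the converter in as an argument keeps the file handling testable without installing a PDF library.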
225
-
226
-
It is associated with the function spec by `spec: PdfToMarkdown`.
Run the following command to set up and update the index.
225
+
```sh
226
+
cocoindex update -L main.py
246
227
```
247
-
You may wonder why we define a spec + executor here instead of using a standalone function. The main reason is that some heavy preparation work (initializing the parser) needs to be done before we are ready to process real data.
228
+
You'll see the index update state in the terminal.
248
229
249
-
### 3. Plugin it to the flow
230
+
After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell:
SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
260
240
```
261
241
262
-
🎉 Now you are all set!
263
-
264
-
Run the following command to set up and update the index.
242
+
## CocoInsight
243
+
[CocoInsight](https://www.youtube.com/watch?v=ZnmyoHslBSc) is a really cool tool to help you understand your data pipeline and data index. It is in Early Access now (Free).
265
244
266
245
```sh
267
-
cocoindex update --setup main.py
246
+
cocoindex server -ci main.py
268
247
```
248
+
The CocoInsight dashboard is at `https://cocoindex.io/cocoinsight`. It connects to your local CocoIndex server with zero data retention.
269
249
270
-
## Community
271
-
272
-
We love to hear from the community! You can find us on [GitHub](https://github.com/cocoindex-io/cocoindex) and [Discord](https://discord.com/invite/zpA9S2DR7s).
273
-
274
-
If you like this post and our work, please **⭐ star [CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) to support us**. Thank you with a warm coconut hug 🥥🤗.