cocoindex-io
diff --git a/‎docs/docs/examples/examples/manual_extraction.md‎
Lines changed: 134 additions & 159 deletions b/‎docs/docs/examples/examples/manual_extraction.md‎
Lines changed: 134 additions & 159 deletions
diff --git a/‎docs/static/img/examples/manual_extraction/cover.png‎
-57.7 KB b/‎docs/static/img/examples/manual_extraction/cover.png‎
-57.7 KB
diff --git a/‎docs/static/img/examples/manual_extraction/extraction.png‎
172 KB b/‎docs/static/img/examples/manual_extraction/extraction.png‎
172 KB
diff --git a/‎docs/static/img/examples/manual_extraction/flow.png‎
90.8 KB b/‎docs/static/img/examples/manual_extraction/flow.png‎
90.8 KB
diff --git a/‎docs/static/img/examples/manual_extraction/summary.png‎
90.8 KB b/‎docs/static/img/examples/manual_extraction/summary.png‎
90.8 KB
@@ -10,31 +10,108 @@ sidebar_custom_props:
 tags: [structured-data-extraction, data-mapping]
 ---
 
-import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';
+import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/components/GitHubButton';
 
 <GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/manuals_llm_extraction"/>
 
+## Overview
+This example shows how to extract structured data from Python Manuals using Ollama.
+
+## Flow Overview
+![Flow Overview](/img/examples/manual_extraction/flow.png)
+
+- For each PDF file:
+    - Parse to markdown.
+    - Extract structured data from the markdown using LLM.
+    - Add summary to the module info.
+    - Collect the data.
+- Export the data to a table.
+
 
 ## Prerequisites
-### Install Postgres
-If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
+- If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
+
+- [Download](https://ollama.com/download) and install Ollama. Pull your favorite LLM models by:
+    ```sh
+    ollama pull llama3.2
+    ```
+
+    <DocumentationButton href="https://cocoindex.io/docs/ai/llm#ollama" text="Ollama" margin="0 0 16px 0" />
+
+    Alternatively, CocoIndex have native support for Gemini, Ollama, LiteLLM. You can choose your favorite LLM provider and work completely on-premises.
 
-### Install ollama
-Ollama allows you to run LLM models on your local machine easily. To get started:
+    <DocumentationButton href="https://cocoindex.io/docs/ai/llm" text="LLM" margin="0 0 16px 0" />
 
-[Download](https://ollama.com/download) and install Ollama.
-Pull your favorite LLM models by the ollama pull command, e.g.
+## Add Source
+Let's add Python docs as a source.
+
+```python
+@cocoindex.flow_def(name="ManualExtraction")
+def manual_extraction_flow(
+    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
+):
+    """
+    Define an example flow that extracts manual information from a Markdown.
+    """
+    data_scope["documents"] = flow_builder.add_source(
+        cocoindex.sources.LocalFile(path="manuals", binary=True)
+    )
 
+    modules_index = data_scope.add_collector()
 ```
-ollama pull llama3.2
+
+`flow_builder.add_source` will create a table with the following sub fields:
+- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
+- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
+
+<DocumentationButton href="https://cocoindex.io/docs/ops/sources" text="LocalFile" margin="0 0 16px 0" />
+
+## Parse Markdown
+
+To do this, we can plugin a custom function to convert PDF to markdown. There are so many different parsers commercially and open source available, you can bring your own parser here.
+
+```python
+class PdfToMarkdown(cocoindex.op.FunctionSpec):
+    """Convert a PDF to markdown."""
+
+
+@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
+class PdfToMarkdownExecutor:
+    """Executor for PdfToMarkdown."""
+
+    spec: PdfToMarkdown
+    _converter: PdfConverter
+
+    def prepare(self):
+        config_parser = ConfigParser({})
+        self._converter = PdfConverter(
+            create_model_dict(), config=config_parser.generate_config_dict()
+        )
+
+    def __call__(self, content: bytes) -> str:
+        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
+            temp_file.write(content)
+            temp_file.flush()
+            text, _, _ = text_from_rendered(self._converter(temp_file.name))
+            return text
 ```
+You may wonder why we want to define a spec + executor (instead of using a standalone function) here. The main reason is there're some heavy preparation work (initialize the parser) needs to be done before being ready to process real data.
 
+<DocumentationButton href="https://cocoindex.io/docs/custom_ops/custom_functions" text="Custom Function" margin="0 0 16px 0" />
 
-## Extract Structured Data from Markdown files
-### 1. Define output
-We are going to extract the following information from the Python Manuals as structured data.
+Plug in the function to the flow.
+
+```python
+with data_scope["documents"].row() as doc:
+    doc["markdown"] = doc["content"].transform(PdfToMarkdown())
+```
+
+It transforms each document to markdown.
 
-So we are going to define the output data class as the following. The goal is to extract and populate `ModuleInfo`.
+
+## Extract Structured Data from Markdown files
+### Define schema
+Let's define the schema `ModuleInfo` using Python dataclasses, and we can pass it to the LLM to extract the structured data. It's easy to do this with CocoIndex.
 
 ``` python
 @dataclasses.dataclass
@@ -66,27 +143,9 @@ class ModuleInfo:
     methods: cocoindex.typing.List[MethodInfo]
 ```
 
-### 2. Define cocoIndex Flow
-Let's define the cocoIndex flow to extract the structured data from markdowns, which is super simple.
-
-First, let's add Python docs in markdown as a source. We will illustrate how to load PDF a few sections below.
-
-```python
-@cocoindex.flow_def(name="ManualExtraction")
-def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
-    data_scope["documents"] = flow_builder.add_source(
-        cocoindex.sources.LocalFile(path="markdown_files")) 
-
-    modules_index = data_scope.add_collector()
-```
-
-`flow_builder.add_source` will create a table with the following sub fields, see [documentation](https://cocoindex.io/docs/ops/sources) here.
-- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
-- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
-
-Then, let's extract the structured data from the markdown files. It is super easy, you just need to provide the LLM spec, and pass down the defined output type.
+### Extract structured data
 
-CocoIndex provides builtin functions (e.g. ExtractByLlm) that process data using LLM. We provide built-in support for Ollama, which allows you to run LLM models on your local machine easily. You can find the full list of models [here](https://ollama.com/library). We also support OpenAI API. You can find the full documentation and instructions [here](https://cocoindex.io/docs/ai/llm).
+CocoIndex provides builtin functions (e.g. ExtractByLlm) that process data using LLM.  This example uses Ollama.
 
 ```python
 with data_scope["documents"].row() as doc:
@@ -101,71 +160,14 @@ with data_scope["documents"].row() as doc:
             instruction="Please extract Python module information from the manual."))
 ```
 
-After the extraction, we just need to cherrypick anything we like from the output using the `collect` function from the collector of a data scope defined above.
-
-```python
-modules_index.collect(
-    filename=doc["filename"],
-    module_info=doc["module_info"],
-)
-```
-
-Finally, let's export the extracted data to a table.
-
-```python
-modules_index.export(
-    "modules",
-    cocoindex.storages.Postgres(table_name="modules_info"),
-    primary_key_fields=["filename"],
-)
-```
-
-### 3. Query and test your index
-🎉 Now you are all set!
-
-Run the following command to setup and update the index.
-```sh
-cocoindex update -L main.py
-```
-You'll see the index updates state in the terminal
-After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell:
-
-```bash
-psql postgres://cocoindex:cocoindex@localhost/cocoindex
-```
-
-And run the SQL query:
-
-```sql
-SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
-```
-
-You can see the structured data extracted from the documents. Here's a screenshot of the extracted module information:
-
-
-### CocoInsight
-CocoInsight is a tool to help you understand your data pipeline and data index.
-CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://www.youtube.com/watch?v=ZnmyoHslBSc).
-
-#### 1. Run the CocoIndex server
-
-```sh
-cocoindex server -ci main.py
-```
-
-to see the CocoInsight dashboard https://cocoindex.io/cocoinsight.  It connects to your local CocoIndex server with zero data retention.
-
-
+<DocumentationButton href="https://cocoindex.io/docs/core/functions#extractbyllm" text="ExtractByLlm" margin="0 0 16px 0" />
 
-## Add Summary to the data
-Using cocoindex as framework, you can easily add any transformation on the data (including LLM summary), and collect it as part of the data index.
-For example, let's add some simple summary to each module - like number of classes and methods, using simple Python funciton.
+![ExtractByLlm](/img/examples/manual_extraction/extraction.png)
 
-We will add a LLM example later.
-
-### 1. Define output
-First, let's add the structure we want as part of the output definition.
+## Add summarization to module info
+Using CocoIndex as framework, you can easily add any transformation on the data, and collect it as part of the data index. Let's add some simple summary to each module - like number of classes and methods, using simple Python function.
 
+### Define Schema
 ``` python
 @dataclasses.dataclass
 class ModuleSummary:
@@ -174,101 +176,74 @@ class ModuleSummary:
     num_methods: int
 ```
 
-### 2. Define cocoIndex Flow
-Next, let's define a custom function to summarize the data. You can see detailed documentation [here](https://cocoindex.io/docs/core/custom_function#option-1-by-a-standalone-function) 
-
-
-``` python
+### A simple custom function to summarize the data
+```python
 @cocoindex.op.function()
 def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
     """Summarize a Python module."""
     return ModuleSummary(
         num_classes=len(module_info.classes),
         num_methods=len(module_info.methods),
     )
-```
-
-### 3. Plug in the function into the flow
+``` 
 
-``` python
+### Plug in the function into the flow
+```python
 with data_scope["documents"].row() as doc:
     # ... after the extraction
     doc["module_summary"] = doc["module_info"].transform(summarize_module)
 ```
 
-🎉 Now you are all set!
-
-Run the following command to setup and update the index.
-```sh
-cocoindex update --setup main.py
-```
+<DocumentationButton href="https://cocoindex.io/docs/custom_ops/custom_functions" text="Custom Function" margin="0 0 16px 0" />
 
-## Extract Structured Data from PDF files
-Ollama does not support PDF files directly as input, so we need to convert them to markdown first.
+![Summarize Module](/img/examples/manual_extraction/summary.png)
 
-To do this, we can plugin a custom function to convert PDF to markdown. See the full documentation [here](https://cocoindex.io/docs/core/custom_function).
+## Collect the data
 
-### 1. Define a function spec
 
-The function spec of a function configures behavior of a specific instance of the function. 
+After the extraction, we need to cherrypick anything we like from the output using the `collect` function from the collector of a data scope defined above.
 
-``` python
-class PdfToMarkdown(cocoindex.op.FunctionSpec):
-    """Convert a PDF to markdown."""
+```python
+modules_index.collect(
+    filename=doc["filename"],
+    module_info=doc["module_info"],
+)
 ```
 
-### 2. Define an executor class
-
-The executor class is a class that implements the function spec. It is responsible for the actual execution of the function. 
-
-This class takes PDF content as bytes, saves it to a temporary file, and uses PdfConverter to extract the text content. The extracted text is then returned as a string, converting PDF to markdown format.
-
-It is associated with the function spec by `spec: PdfToMarkdown`.
-
-``` python
-@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
-class PdfToMarkdownExecutor:
-    """Executor for PdfToMarkdown."""
-
-    spec: PdfToMarkdown
-    _converter: PdfConverter
+Finally, let's export the extracted data to a table.
 
-    def prepare(self):
-        config_parser = ConfigParser({})
-        self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict())
+```python
+modules_index.export(
+    "modules",
+    cocoindex.storages.Postgres(table_name="modules_info"),
+    primary_key_fields=["filename"],
+)
+```
 
-    def __call__(self, content: bytes) -> str:
-        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
-            temp_file.write(content)
-            temp_file.flush()
-            text, _, _ = text_from_rendered(self._converter(temp_file.name))
-            return text
+## Query and test your index
+Run the following command to setup and update the index.
+```sh
+cocoindex update -L main.py
 ```
-You may wonder why we want to define a spec + executor (instead of using a standalone function) here. The main reason is there're some heavy preparation work (initialize the parser) needs to be done before being ready to process real data.
+You'll see the index updates state in the terminal
 
-### 3. Plugin it to the flow
+After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell:
 
-``` python
-    # Note the binary = True for PDF
-    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="manuals", binary=True))
-    modules_index = data_scope.add_collector()
+```bash
+psql postgres://cocoindex:cocoindex@localhost/cocoindex
+```
 
-    with data_scope["documents"].row() as doc:
-        # plug in your custom function here
-        doc["markdown"] = doc["content"].transform(PdfToMarkdown())
+And run the SQL query:
 
+```sql
+SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;
 ```
 
-🎉 Now you are all set!
-
-Run the following command to setup and update the index.
+## CocoInsight
+[CocoInsight](https://www.youtube.com/watch?v=ZnmyoHslBSc) is a really cool tool to help you understand your data pipeline and data index. It is in Early Access now (Free).
 
 ```sh
-cocoindex update --setup main.py
+cocoindex server -ci main.py
 ```
+CocoInsight dashboard is here `https://cocoindex.io/cocoinsight`.  It connects to your local CocoIndex server with zero data retention.
 
-## Community
-
-We love to hear from the community! You can find us on [Github](https://github.com/cocoindex-io/cocoindex) and [Discord](https://discord.com/invite/zpA9S2DR7s).
-
-If you like this post and our work, please **⭐ star [Cocoindex on Github](https://github.com/cocoindex-io/cocoindex) to support us**. Thank you with a warm coconut hug 🥥🤗.