diff --git a/docs/docs/examples/examples/manual_extraction.md b/docs/docs/examples/examples/manual_extraction.md index 8e6f081a8..0b340fde0 100644 --- a/docs/docs/examples/examples/manual_extraction.md +++ b/docs/docs/examples/examples/manual_extraction.md @@ -10,31 +10,108 @@ sidebar_custom_props: tags: [structured-data-extraction, data-mapping] --- -import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton'; +import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/components/GitHubButton'; +## Overview +This example shows how to extract structured data from Python Manuals using Ollama. + +## Flow Overview +![Flow Overview](/img/examples/manual_extraction/flow.png) + +- For each PDF file: + - Parse to markdown. + - Extract structured data from the markdown using LLM. + - Add summary to the module info. + - Collect the data. +- Export the data to a table. + ## Prerequisites -### Install Postgres -If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation). +- If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation). + +- [Download](https://ollama.com/download) and install Ollama. Pull your favorite LLM models by: + ```sh + ollama pull llama3.2 + ``` + + + + Alternatively, CocoIndex have native support for Gemini, Ollama, LiteLLM. You can choose your favorite LLM provider and work completely on-premises. -### Install ollama -Ollama allows you to run LLM models on your local machine easily. To get started: + -[Download](https://ollama.com/download) and install Ollama. -Pull your favorite LLM models by the ollama pull command, e.g. +## Add Source +Let's add Python docs as a source. + +```python +@cocoindex.flow_def(name="ManualExtraction") +def manual_extraction_flow( + flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope +): + """ + Define an example flow that extracts manual information from a Markdown. + """ + data_scope["documents"] = flow_builder.add_source( + cocoindex.sources.LocalFile(path="manuals", binary=True) + ) + modules_index = data_scope.add_collector() ``` -ollama pull llama3.2 + +`flow_builder.add_source` will create a table with the following sub fields: +- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md` +- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file + + + +## Parse Markdown + +To do this, we can plugin a custom function to convert PDF to markdown. There are so many different parsers commercially and open source available, you can bring your own parser here. + +```python +class PdfToMarkdown(cocoindex.op.FunctionSpec): + """Convert a PDF to markdown.""" + + +@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1) +class PdfToMarkdownExecutor: + """Executor for PdfToMarkdown.""" + + spec: PdfToMarkdown + _converter: PdfConverter + + def prepare(self): + config_parser = ConfigParser({}) + self._converter = PdfConverter( + create_model_dict(), config=config_parser.generate_config_dict() + ) + + def __call__(self, content: bytes) -> str: + with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file: + temp_file.write(content) + temp_file.flush() + text, _, _ = text_from_rendered(self._converter(temp_file.name)) + return text ``` +You may wonder why we want to define a spec + executor (instead of using a standalone function) here. The main reason is there're some heavy preparation work (initialize the parser) needs to be done before being ready to process real data. + -## Extract Structured Data from Markdown files -### 1. Define output -We are going to extract the following information from the Python Manuals as structured data. +Plug in the function to the flow. + +```python +with data_scope["documents"].row() as doc: + doc["markdown"] = doc["content"].transform(PdfToMarkdown()) +``` + +It transforms each document to markdown. -So we are going to define the output data class as the following. The goal is to extract and populate `ModuleInfo`. + +## Extract Structured Data from Markdown files +### Define schema +Let's define the schema `ModuleInfo` using Python dataclasses, and we can pass it to the LLM to extract the structured data. It's easy to do this with CocoIndex. ``` python @dataclasses.dataclass @@ -66,27 +143,9 @@ class ModuleInfo: methods: cocoindex.typing.List[MethodInfo] ``` -### 2. Define cocoIndex Flow -Let's define the cocoIndex flow to extract the structured data from markdowns, which is super simple. - -First, let's add Python docs in markdown as a source. We will illustrate how to load PDF a few sections below. - -```python -@cocoindex.flow_def(name="ManualExtraction") -def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): - data_scope["documents"] = flow_builder.add_source( - cocoindex.sources.LocalFile(path="markdown_files")) - - modules_index = data_scope.add_collector() -``` - -`flow_builder.add_source` will create a table with the following sub fields, see [documentation](https://cocoindex.io/docs/ops/sources) here. -- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md` -- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file - -Then, let's extract the structured data from the markdown files. It is super easy, you just need to provide the LLM spec, and pass down the defined output type. +### Extract structured data -CocoIndex provides builtin functions (e.g. ExtractByLlm) that process data using LLM. We provide built-in support for Ollama, which allows you to run LLM models on your local machine easily. You can find the full list of models [here](https://ollama.com/library). We also support OpenAI API. You can find the full documentation and instructions [here](https://cocoindex.io/docs/ai/llm). +CocoIndex provides builtin functions (e.g. ExtractByLlm) that process data using LLM. This example uses Ollama. ```python with data_scope["documents"].row() as doc: @@ -101,71 +160,14 @@ with data_scope["documents"].row() as doc: instruction="Please extract Python module information from the manual.")) ``` -After the extraction, we just need to cherrypick anything we like from the output using the `collect` function from the collector of a data scope defined above. - -```python -modules_index.collect( - filename=doc["filename"], - module_info=doc["module_info"], -) -``` - -Finally, let's export the extracted data to a table. - -```python -modules_index.export( - "modules", - cocoindex.storages.Postgres(table_name="modules_info"), - primary_key_fields=["filename"], -) -``` - -### 3. Query and test your index -🎉 Now you are all set! - -Run the following command to setup and update the index. -```sh -cocoindex update -L main.py -``` -You'll see the index updates state in the terminal -After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell: - -```bash -psql postgres://cocoindex:cocoindex@localhost/cocoindex -``` - -And run the SQL query: - -```sql -SELECT filename, module_info->'title' AS title, module_summary FROM modules_info; -``` - -You can see the structured data extracted from the documents. Here's a screenshot of the extracted module information: - - -### CocoInsight -CocoInsight is a tool to help you understand your data pipeline and data index. -CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3 minute video tutorial about CocoInsight: [Watch on YouTube](https://www.youtube.com/watch?v=ZnmyoHslBSc). - -#### 1. Run the CocoIndex server - -```sh -cocoindex server -ci main.py -``` - -to see the CocoInsight dashboard https://cocoindex.io/cocoinsight. It connects to your local CocoIndex server with zero data retention. - - + -## Add Summary to the data -Using cocoindex as framework, you can easily add any transformation on the data (including LLM summary), and collect it as part of the data index. -For example, let's add some simple summary to each module - like number of classes and methods, using simple Python funciton. +![ExtractByLlm](/img/examples/manual_extraction/extraction.png) -We will add a LLM example later. - -### 1. Define output -First, let's add the structure we want as part of the output definition. +## Add summarization to module info +Using CocoIndex as framework, you can easily add any transformation on the data, and collect it as part of the data index. Let's add some simple summary to each module - like number of classes and methods, using simple Python function. +### Define Schema ``` python @dataclasses.dataclass class ModuleSummary: @@ -174,11 +176,8 @@ class ModuleSummary: num_methods: int ``` -### 2. Define cocoIndex Flow -Next, let's define a custom function to summarize the data. You can see detailed documentation [here](https://cocoindex.io/docs/core/custom_function#option-1-by-a-standalone-function) - - -``` python +### A simple custom function to summarize the data +```python @cocoindex.op.function() def summarize_module(module_info: ModuleInfo) -> ModuleSummary: """Summarize a Python module.""" @@ -186,89 +185,65 @@ def summarize_module(module_info: ModuleInfo) -> ModuleSummary: num_classes=len(module_info.classes), num_methods=len(module_info.methods), ) -``` - -### 3. Plug in the function into the flow +``` -``` python +### Plug in the function into the flow +```python with data_scope["documents"].row() as doc: # ... after the extraction doc["module_summary"] = doc["module_info"].transform(summarize_module) ``` -🎉 Now you are all set! - -Run the following command to setup and update the index. -```sh -cocoindex update --setup main.py -``` + -## Extract Structured Data from PDF files -Ollama does not support PDF files directly as input, so we need to convert them to markdown first. +![Summarize Module](/img/examples/manual_extraction/summary.png) -To do this, we can plugin a custom function to convert PDF to markdown. See the full documentation [here](https://cocoindex.io/docs/core/custom_function). +## Collect the data -### 1. Define a function spec -The function spec of a function configures behavior of a specific instance of the function. +After the extraction, we need to cherrypick anything we like from the output using the `collect` function from the collector of a data scope defined above. -``` python -class PdfToMarkdown(cocoindex.op.FunctionSpec): - """Convert a PDF to markdown.""" +```python +modules_index.collect( + filename=doc["filename"], + module_info=doc["module_info"], +) ``` -### 2. Define an executor class - -The executor class is a class that implements the function spec. It is responsible for the actual execution of the function. - -This class takes PDF content as bytes, saves it to a temporary file, and uses PdfConverter to extract the text content. The extracted text is then returned as a string, converting PDF to markdown format. - -It is associated with the function spec by `spec: PdfToMarkdown`. - -``` python -@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1) -class PdfToMarkdownExecutor: - """Executor for PdfToMarkdown.""" - - spec: PdfToMarkdown - _converter: PdfConverter +Finally, let's export the extracted data to a table. - def prepare(self): - config_parser = ConfigParser({}) - self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict()) +```python +modules_index.export( + "modules", + cocoindex.storages.Postgres(table_name="modules_info"), + primary_key_fields=["filename"], +) +``` - def __call__(self, content: bytes) -> str: - with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file: - temp_file.write(content) - temp_file.flush() - text, _, _ = text_from_rendered(self._converter(temp_file.name)) - return text +## Query and test your index +Run the following command to setup and update the index. +```sh +cocoindex update -L main.py ``` -You may wonder why we want to define a spec + executor (instead of using a standalone function) here. The main reason is there're some heavy preparation work (initialize the parser) needs to be done before being ready to process real data. +You'll see the index updates state in the terminal -### 3. Plugin it to the flow +After the index is built, you have a table with the name `modules_info`. You can query it at any time, e.g., start a Postgres shell: -``` python - # Note the binary = True for PDF - data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="manuals", binary=True)) - modules_index = data_scope.add_collector() +```bash +psql postgres://cocoindex:cocoindex@localhost/cocoindex +``` - with data_scope["documents"].row() as doc: - # plug in your custom function here - doc["markdown"] = doc["content"].transform(PdfToMarkdown()) +And run the SQL query: +```sql +SELECT filename, module_info->'title' AS title, module_summary FROM modules_info; ``` -🎉 Now you are all set! - -Run the following command to setup and update the index. +## CocoInsight +[CocoInsight](https://www.youtube.com/watch?v=ZnmyoHslBSc) is a really cool tool to help you understand your data pipeline and data index. It is in Early Access now (Free). ```sh -cocoindex update --setup main.py +cocoindex server -ci main.py ``` +CocoInsight dashboard is here `https://cocoindex.io/cocoinsight`. It connects to your local CocoIndex server with zero data retention. -## Community - -We love to hear from the community! You can find us on [Github](https://github.com/cocoindex-io/cocoindex) and [Discord](https://discord.com/invite/zpA9S2DR7s). - -If you like this post and our work, please **⭐ star [Cocoindex on Github](https://github.com/cocoindex-io/cocoindex) to support us**. Thank you with a warm coconut hug 🥥🤗. \ No newline at end of file diff --git a/docs/static/img/examples/manual_extraction/cover.png b/docs/static/img/examples/manual_extraction/cover.png index 3d8233117..2b5e77673 100644 Binary files a/docs/static/img/examples/manual_extraction/cover.png and b/docs/static/img/examples/manual_extraction/cover.png differ diff --git a/docs/static/img/examples/manual_extraction/extraction.png b/docs/static/img/examples/manual_extraction/extraction.png new file mode 100644 index 000000000..e2d2d0438 Binary files /dev/null and b/docs/static/img/examples/manual_extraction/extraction.png differ diff --git a/docs/static/img/examples/manual_extraction/flow.png b/docs/static/img/examples/manual_extraction/flow.png new file mode 100644 index 000000000..af7cf9706 Binary files /dev/null and b/docs/static/img/examples/manual_extraction/flow.png differ diff --git a/docs/static/img/examples/manual_extraction/summary.png b/docs/static/img/examples/manual_extraction/summary.png new file mode 100644 index 000000000..22999394f Binary files /dev/null and b/docs/static/img/examples/manual_extraction/summary.png differ