Commit 3d2593f

committed
docs: use document ai as custom parser
1 parent 5491677 commit 3d2593f

5 files changed

+169
-2
lines changed

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
1+
---
2+
title: Bring your own parser as a building block with Google Document AI
3+
description: Use Google Document AI to parse documents, embed the resulting text, and store it in a vector database for semantic search.
4+
sidebar_class_name: hidden
5+
slug: /examples/document_ai
6+
canonicalUrl: '/examples/document_ai'
7+
sidebar_custom_props:
8+
  image: /img/examples/document_ai/cover.png
9+
  tags: [vector-index, custom-building-block]
10+
tags: [vector-index, custom-building-block]
11+
---
12+
import { GitHubButton, DocumentationButton, ExampleButton } from '../../../src/components/GitHubButton';
13+
14+
15+
<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/document_ai" margin="0 0 24px 0" />
16+
17+
![Document AI](/img/examples/document_ai/cover.png)
18+
19+
CocoIndex is a flexible ETL framework with incremental processing. We don't build parsers ourselves; users can bring in any open source or commercial parser that works best for their scenario. In this example, we show how to use **Google Document AI to parse documents**, embed the resulting text, and store it in a vector database for semantic search.
20+
21+
## Set up
22+
- [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
23+
- Configure Project and Processor ID for Document AI API
24+
- See the [official Google Document AI API](https://cloud.google.com/document-ai/docs/try-docai) page, which includes a free live demo.
25+
- Sign in to [Google Cloud Console](https://console.cloud.google.com/), create or open a project, and enable Document AI API.
26+
- ![image.png](/img/examples/document_ai/document_ai.png)
27+
- ![image.png](/img/examples/document_ai/processor.png)
28+
- Update `.env` with `GOOGLE_CLOUD_PROJECT_ID` and `GOOGLE_CLOUD_PROCESSOR_ID`.
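
Before running the flow, it can help to fail fast when the settings above are missing. The sketch below is a hypothetical helper (`missing_settings` is not part of CocoIndex or the Document AI SDK) that checks the two variables named in this section. It assumes the `.env` values have already been loaded into the process environment.

```python
import os

# Variable names taken from the setup steps above.
REQUIRED_SETTINGS = ("GOOGLE_CLOUD_PROJECT_ID", "GOOGLE_CLOUD_PROCESSOR_ID")


def missing_settings(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]
```

Calling `missing_settings()` at startup and raising an error when the returned list is non-empty surfaces a configuration problem before any Document AI request is made.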
29+
30+
31+
## Create your building block to convert PDFs to Markdown
32+
33+
We define a `ToMarkdown` custom function spec, which leverages Google Document AI to parse PDF content:
34+
35+
```python
class ToMarkdown(cocoindex.op.FunctionSpec):
    """Convert a PDF to markdown using Google Document AI."""
```
39+
40+
The corresponding executor class handles API initialization and parsing logic:
41+
42+
```python
import os

import cocoindex
from google.api_core.client_options import ClientOptions
from google.cloud import documentai


@cocoindex.op.executor_class(cache=True, behavior_version=1)
class DocumentAIExecutor:
    """Executor for Google Document AI to parse PDF files."""

    spec: ToMarkdown
    _client: documentai.DocumentProcessorServiceClient
    _processor_name: str

    def prepare(self):
        # Initialize the Document AI client
        project_id = os.environ.get("GOOGLE_CLOUD_PROJECT_ID")
        location = os.environ.get("GOOGLE_CLOUD_LOCATION", "us")
        processor_id = os.environ.get("GOOGLE_CLOUD_PROCESSOR_ID")

        # The API endpoint is regional and must match the processor's location
        opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
        self._client = documentai.DocumentProcessorServiceClient(client_options=opts)
        self._processor_name = self._client.processor_path(project_id, location, processor_id)

    async def __call__(self, content: bytes) -> str:
        """Parse PDF content and convert to markdown text."""
        request = documentai.ProcessRequest(
            name=self._processor_name,
            raw_document=documentai.RawDocument(content=content, mime_type="application/pdf"),
        )
        response = self._client.process_document(request=request)
        return response.document.text
```
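
To make the two strings built in `prepare` concrete, here is a minimal sketch of what they look like. The helper names are hypothetical, and the resource-name layout is written out by hand based on the format Document AI documents for processors, so treat it as an illustration of what `processor_path` is expected to return, not as a replacement for it.

```python
def regional_endpoint(location: str) -> str:
    """Build the regional API endpoint, as the f-string in prepare() does."""
    return f"{location}-documentai.googleapis.com"


def processor_resource_name(project_id: str, location: str, processor_id: str) -> str:
    """Assumed layout of a processor resource name (illustrative only)."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"
```

With the default location `us`, the endpoint becomes `us-documentai.googleapis.com`. A processor created in another region needs `GOOGLE_CLOUD_LOCATION` set to match, since both the endpoint and the resource name depend on it.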
70+
71+
Make sure you configure the `cache` and `behavior_version` parameters for heavy operations like this.
72+
73+
- `cache`: Whether to cache this function's results. When `True`, the executor reuses cached results during reprocessing instead of calling the function again. We recommend setting this to `True` for any computationally intensive function.
74+
75+
- `behavior_version`: The version of the function's behavior. When it changes, the function is re-executed even if caching is enabled. It is required when `cache` is `True`.
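
The interplay between the two parameters can be pictured with a toy model. This is a conceptual sketch only: `VersionedCache` is a hypothetical class, and CocoIndex's real cache is not assumed to work this way internally. It only shows why bumping the version forces a recomputation for unchanged input.

```python
import hashlib


class VersionedCache:
    """Toy cache keyed on (behavior_version, digest of the input bytes)."""

    def __init__(self):
        self._store = {}
        self.calls = 0  # how many times the wrapped function actually ran

    def get_or_compute(self, fn, content: bytes, behavior_version: int):
        key = (behavior_version, hashlib.sha256(content).hexdigest())
        if key not in self._store:
            # Cache miss: either new content or a new behavior version.
            self.calls += 1
            self._store[key] = fn(content)
        return self._store[key]
```

In this model, processing the same bytes twice with the same version runs the function once, while changing the version yields a new key and therefore a fresh call. That is the reason to bump `behavior_version` whenever the parsing logic changes.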
76+
77+
78+
<DocumentationButton url="https://cocoindex.io/docs/custom_ops/custom_functions#option-2-by-a-function-spec-and-an-executor" text="Custom Functions" margin="0 0 16px 0" />
79+
80+
<DocumentationButton url="https://cocoindex.io/docs/custom_ops/custom_functions#parameters-for-custom-functions" text="Parameters for Custom Functions" margin="0 0 16px 0" />
81+
82+
## Define the flow
83+
84+
```python
@cocoindex.flow_def(name="DocumentAiPdfEmbedding")
def pdf_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # The flow body is built up step by step in the sections below.
    ...
```
89+
90+
### Add source & collector
91+
92+
```python
data_scope["documents"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="pdf_files", binary=True)
)

doc_embeddings = data_scope.add_collector()
```
99+
100+
<DocumentationButton url="https://cocoindex.io/docs/ops/sources" text="Source" margin="0 0 16px 0" />
101+
102+
<DocumentationButton url="https://cocoindex.io/docs/ops/collectors" text="Collector" margin="0 0 16px 0" />
103+
104+
### Process each document
105+
106+
```python
with data_scope["documents"].row() as doc:
    doc["markdown"] = doc["content"].transform(ToMarkdown())
    doc["chunks"] = doc["markdown"].transform(
        cocoindex.functions.SplitRecursively(),
        language="markdown",
        chunk_size=2000,
        chunk_overlap=500,
    )
    with doc["chunks"].row() as chunk:
        chunk["embedding"] = chunk["text"].call(text_to_embedding)
        doc_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location=chunk["location"],
            text=chunk["text"],
            embedding=chunk["embedding"],
        )
```
125+
126+
1. Convert each PDF to Markdown using Document AI.
127+
2. Split the Markdown into chunks.
128+
3. Embed each chunk.
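
To give a feel for what `chunk_size=2000` and `chunk_overlap=500` mean, here is a simplified fixed-window chunker. It is not how `SplitRecursively` works (that function splits along the structure of the text); `sliding_chunks` is a hypothetical helper that only illustrates how overlapping windows share text at their boundaries.

```python
def sliding_chunks(text: str, chunk_size: int = 2000, chunk_overlap: int = 500):
    """Split text into (start_offset, chunk) pairs using overlapping windows."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With the values used in the flow, consecutive windows start 1500 characters apart, so the last 500 characters of one chunk reappear at the start of the next. The overlap helps a sentence that straddles a boundary remain searchable in at least one chunk.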
129+
130+
## Export to Postgres
131+
132+
```python
doc_embeddings.export(
    "doc_embeddings",
    cocoindex.storages.Postgres(),
    primary_key_fields=["id"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
```
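
The index above ranks rows by cosine similarity. As a reference for what that metric computes, here is a plain-Python sketch. The function is illustrative only; in practice the database evaluates the metric, and this is not the code path CocoIndex or Postgres uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1, regardless of their magnitudes.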
145+
146+
## End-to-end example
147+
148+
For a step-by-step walkthrough of each indexing stage and the query path, check out this example:
149+
150+
<ExampleButton href="https://cocoindex.io/docs/examples/simple_vector_index" text="Simple Vector Index" margin="0 0 16px 0" />
151+

docs/src/components/GitHubButton/index.tsx

Lines changed: 18 additions & 2 deletions
@@ -1,6 +1,6 @@
11
import type { ReactNode } from 'react';
22
import { FaGithub, FaYoutube } from 'react-icons/fa';
3-
import { MdMenuBook } from 'react-icons/md';
3+
import { MdMenuBook, MdDriveEta } from 'react-icons/md';
44

55
type ButtonProps = {
66
href: string;
@@ -73,4 +73,20 @@ function DocumentationButton({ url, text, margin }: DocumentationButtonProps): R
7373
);
7474
}
7575

76-
export { GitHubButton, YouTubeButton, DocumentationButton };
76+
// ExampleButton as requested
type ExampleButtonProps = {
  href: string;
  text: string;
  margin?: string;
};

function ExampleButton({ href, text, margin }: ExampleButtonProps): ReactNode {
  return (
    <Button href={href} margin={margin}>
      <MdDriveEta style={{ marginRight: '8px', verticalAlign: 'middle', fontSize: '1rem' }} />
      {text}
    </Button>
  );
}

export { GitHubButton, YouTubeButton, DocumentationButton, ExampleButton };