Commit 253d512

docs: patient form extraction example (#919)
1 parent 4f02279 commit 253d512

6 files changed: +141 -116 lines changed

docs/docs/examples/examples/patient_form_extraction.md

Lines changed: 141 additions & 116 deletions
Original file line number | Diff line number | Diff line change
@@ -10,30 +10,103 @@ sidebar_custom_props:
1010
tags: [structured-data-extraction, data-mapping]
1111
---
1212

13-
import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';
13+
import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/components/GitHubButton';
1414

1515
<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/patient_intake_extraction"/>
1616
<YouTubeButton url="https://youtu.be/_mjlwVtnBn0?si=-TBImMyZbnKh-5FB" />
1717

18-
## Prerequisites
19-
### Install Postgres
20-
If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
18+
## Overview
19+
With CocoIndex, you can define a nested schema as Python dataclasses and use an LLM to extract structured data from unstructured data. This example shows how to extract structured data from patient intake forms.
2120

2221
:::info
23-
The extraction quality is highly dependent on the OCR quality. You can use CocoIndex with any commercial parser (or open source ones) that is tailored for your domain for better results. For example, Document AI from Google Cloud and more.
22+
The extraction quality is highly dependent on the OCR quality. For better results, you can use CocoIndex with any commercial or open-source parser that is tailored to your domain, for example Document AI from Google Cloud.
2423
:::
2524

26-
### Google Drive as alternative source (optional)
27-
If you plan to load patient intake forms from Google Drive, you can refer to this [example](https://cocoindex.io/blogs/text-embedding-from-google-drive#enable-google-drive-access-by-service-account) for more details.
25+
## Flow Overview
2826

27+
![Flow overview](/img/examples/patient_form_extraction/flow.png)
2928

30-
## Extract Structured Data from Google Drive
31-
### 1. Define output schema
29+
The flow itself is fairly simple.
30+
1. Import a list of intake forms.
31+
2. For each file:
32+
- Convert the file to Markdown.
33+
- Extract structured data from the Markdown.
34+
3. Export selected fields to tables in Postgres with PGVector.
35+
36+
## Setup
37+
- If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
38+
- [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Create a `.env` file from `.env.example`, and fill in `OPENAI_API_KEY`.
39+
40+
Alternatively, we have native support for Gemini, Ollama, and LiteLLM. You can choose your favorite LLM provider and work completely on-premises.
41+
42+
<DocumentationButton href="https://cocoindex.io/docs/ai/llm" text="LLM" margin="0 0 16px 0" />
43+
44+
45+
## Add source
46+
47+
Add a source that reads the forms from local files.
48+
49+
```python
50+
@cocoindex.flow_def(name="PatientIntakeExtraction")
51+
def patient_intake_extraction_flow(
52+
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
53+
):
54+
    """
55+
    Define a flow that extracts patient information from intake forms.
56+
    """
57+
    data_scope["documents"] = flow_builder.add_source(
58+
        cocoindex.sources.LocalFile(path="data/patient_forms", binary=True)
59+
    )
60+
```
61+
62+
`flow_builder.add_source` will create a table with a few sub-fields, including the `filename` and `content` fields used in the following steps.
63+
64+
<DocumentationButton href="https://cocoindex.io/docs/ops/sources" text="Sources" margin="0 0 16px 0" />
65+
66+
67+
## Parse documents with different formats to Markdown
68+
69+
Define a custom function that parses documents of any supported format into Markdown. Here we use [MarkItDown](https://github.com/microsoft/markitdown) for the conversion. It can also use an LLM, such as `gpt-4o`, during parsing. At present, MarkItDown supports PDF, Word, Excel, images (EXIF metadata and OCR), and more.
70+
71+
```python
72+
class ToMarkdown(cocoindex.op.FunctionSpec):
73+
    """Convert a document to markdown."""
74+
75+
@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
76+
class ToMarkdownExecutor:
77+
    """Executor for ToMarkdown."""
78+
79+
    spec: ToMarkdown
80+
    _converter: MarkItDown
81+
82+
    def prepare(self):
83+
        client = OpenAI()
84+
        self._converter = MarkItDown(llm_client=client, llm_model="gpt-4o")
85+
86+
    def __call__(self, content: bytes, filename: str) -> str:
87+
        suffix = os.path.splitext(filename)[1]
88+
        with tempfile.NamedTemporaryFile(delete=True, suffix=suffix) as temp_file:
89+
            temp_file.write(content)
90+
            temp_file.flush()
91+
            text = self._converter.convert(temp_file.name).text_content
92+
            return text
93+
```
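The temp-file step inside `__call__` can be tried on its own, without CocoIndex or MarkItDown. The following is a minimal sketch under stated assumptions: `bytes_to_text` and `fake_convert` are hypothetical names, and `fake_convert` merely stands in for `MarkItDown.convert(...).text_content`. It only illustrates why the original file extension is kept as the suffix and why `flush()` is called before the path is handed to a converter.

```python
import os
import tempfile
from typing import Callable


def bytes_to_text(content: bytes, filename: str, convert_path: Callable[[str], str]) -> str:
    """Write bytes to a temp file that keeps the original extension, then convert by path."""
    # Path-based converters commonly choose a parser from the extension, so keep it.
    suffix = os.path.splitext(filename)[1]
    with tempfile.NamedTemporaryFile(delete=True, suffix=suffix) as temp_file:
        temp_file.write(content)
        # Flush so the bytes are on disk before another reader opens the path.
        temp_file.flush()
        return convert_path(temp_file.name)


def fake_convert(path: str) -> str:
    """Hypothetical converter: returns the extension plus the decoded file content."""
    with open(path, "rb") as f:
        return f"{os.path.splitext(path)[1]}:{f.read().decode('utf-8')}"


result = bytes_to_text(b"# Intake form", "Patient_Intake_Form_Joe.md", fake_convert)
print(result)
```

Note that reopening a `NamedTemporaryFile` by name while it is still open works on Unix-like systems but may fail on Windows, so this pattern is best treated as POSIX-oriented.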
94+
95+
Next we plug it into the data flow.
96+
97+
```python
98+
with data_scope["documents"].row() as doc:
99+
    doc["markdown"] = doc["content"].transform(ToMarkdown(), filename=doc["filename"])
100+
```
101+
102+
![Markdown](/img/examples/patient_form_extraction/tomarkdown.png)
103+
104+
## Define output schema
32105

33106
We are going to define the patient info schema for structured extraction. One of the best examples to define a patient info schema is probably following the [FHIR standard - Patient Resource](https://build.fhir.org/patient.html#resource).
34107

35108

36-
In this tutorial, we'll define a simplified schema for patient information extraction:
109+
In this tutorial, we'll define a simplified schema as nested dataclasses for patient information extraction:
37110

38111
```python
39112
@dataclasses.dataclass
@@ -105,98 +178,73 @@ class Patient:
105178
    consent_date: datetime.date | None
106179
```
107180

108-
### 2. Define CocoIndex Flow
109-
Let's define the CocoIndex flow to extract the structured data from patient intake forms.
110-
111-
1. Add Google Drive as a source
112-
```python
113-
@cocoindex.flow_def(name="PatientIntakeExtraction")
114-
def patient_intake_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
115-
"""
116-
Define a flow that extracts patient information from intake forms.
117-
"""
118-
credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
119-
root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")
120-
121-
data_scope["documents"] = flow_builder.add_source(
122-
cocoindex.sources.GoogleDrive(
123-
service_account_credential_path=credential_path,
124-
root_folder_ids=root_folder_ids,
125-
binary=True))
126-
127-
patients_index = data_scope.add_collector()
128-
```
181+
A simplified illustration of the nested fields and their definitions:
129182

130-
`flow_builder.add_source` will create a table with a few sub fields. See [documentation](https://cocoindex.io/docs/ops/sources) here.
183+
![Patient Fields](/img/examples/patient_form_extraction/fields.png)
131184

132-
2. Parse documents with different formats to Markdown
133-
134-
Define a custom function to parse documents in any format to Markdown. Here we use [MarkItDown](https://github.com/microsoft/markitdown) to convert the file to Markdown. It also provides options to parse by LLM, like `gpt-4o`.
135-
At present, MarkItDown supports: PDF, Word, Excel, Images (EXIF metadata and OCR), etc. You could find its documentation [here](https://github.com/microsoft/markitdown).
185+
## Extract structured data from Markdown
186+
CocoIndex provides built-in functions (e.g. `ExtractByLlm`) that process data using LLMs. With CocoIndex, you can directly pass the Python dataclass `Patient` to the function, and it will automatically parse the LLM response into the dataclass.
136187

188+
```python
189+
with data_scope["documents"].row() as doc:
190+
    doc["patient_info"] = doc["markdown"].transform(
191+
        cocoindex.functions.ExtractByLlm(
192+
            llm_spec=cocoindex.LlmSpec(
193+
                api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"),
194+
            output_type=Patient,
195+
            instruction="Please extract patient information from the intake form."))
196+
    patients_index.collect(
197+
        filename=doc["filename"],
198+
        patient_info=doc["patient_info"],
199+
    )
200+
```
137201

138-
```python
139-
class ToMarkdown(cocoindex.op.FunctionSpec):
140-
"""Convert a document to markdown."""
202+
<DocumentationButton href="https://cocoindex.io/docs/ops/functions#extractbyllm" text="ExtractByLlm" margin="0 0 16px 0" />
141203

142-
@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
143-
class ToMarkdownExecutor:
144-
"""Executor for ToMarkdown."""
204+
![Extracted](/img/examples/patient_form_extraction/extraction.png)
145205

146-
spec: ToMarkdown
147-
_converter: MarkItDown
206+
After the extraction, we collect all the fields for simplicity, using a collector (`patients_index`) created with `data_scope.add_collector()`. You can also select specific fields and perform data mapping or field-level transformations before collecting them. If you have any questions, feel free to ask us in [Discord](https://discord.com/invite/zpA9S2DR7s).
148207

149-
def prepare(self):
150-
client = OpenAI()
151-
self._converter = MarkItDown(llm_client=client, llm_model="gpt-4o")
152208

153-
def __call__(self, content: bytes, filename: str) -> str:
154-
suffix = os.path.splitext(filename)[1]
155-
with tempfile.NamedTemporaryFile(delete=True, suffix=suffix) as temp_file:
156-
temp_file.write(content)
157-
temp_file.flush()
158-
text = self._converter.convert(temp_file.name).text_content
159-
return text
160-
```
209+
## Export the extracted data to a table
161210

162-
Next we plug it into the data flow.
211+
```python
212+
patients_index.export(
213+
    "patients",
214+
    cocoindex.storages.Postgres(table_name="patients_info"),
215+
    primary_key_fields=["filename"],
216+
)
217+
```
163218

164-
```python
165-
with data_scope["documents"].row() as doc:
166-
doc["markdown"] = doc["content"].transform(ToMarkdown(), filename=doc["filename"])
219+
## Run and Query
220+
### Install dependencies
221+
```bash
222+
pip install -e .
167223
```
168224

169-
3. Extract structured data from Markdown
170-
CocoIndex provides built-in functions (e.g. `ExtractByLlm`) that process data using LLMs. In this example, we use `gpt-4o` from OpenAI to extract structured data from the Markdown. We also provide built-in support for Ollama, which allows you to run LLM models on your local machine easily.
171-
172-
```python
173-
with data_scope["documents"].row() as doc:
174-
doc["patient_info"] = doc["markdown"].transform(
175-
cocoindex.functions.ExtractByLlm(
176-
llm_spec=cocoindex.LlmSpec(
177-
api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"),
178-
output_type=Patient,
179-
instruction="Please extract patient information from the intake form."))
180-
patients_index.collect(
181-
filename=doc["filename"],
182-
patient_info=doc["patient_info"],
183-
)
225+
### Set up and update the index
226+
```sh
227+
cocoindex update --setup main.py
184228
```
229+
You'll see the index update state in the terminal.
185230

186-
After the extraction, we just need to cherrypick anything we like from the output by calling the collect method on the collector defined above.
231+
### Query the output table
232+
After the index is built, you have a table with the name `patients_info`. You can query it at any time, e.g., start a Postgres shell:
187233

188-
4. Export the extracted data to a table.
234+
```bash
235+
psql postgres://cocoindex:cocoindex@localhost/cocoindex
236+
```
189237

190-
```python
191-
patients_index.export(
192-
"patients",
193-
cocoindex.storages.Postgres(table_name="patients_info"),
194-
primary_key_fields=["filename"],
195-
)
196-
```
238+
Then run:
239+
240+
```sql
241+
select * from patients_info;
242+
```
243+
244+
You should see the contents of the `patients_info` table.
197245

198246
## Evaluate
199-
🎉 Now you are all set with the extraction! For mission-critical use cases, it is important to evaluate the quality of the extraction. CocoIndex supports a simple way to evaluate the extraction. There may be some fancier ways to evaluate the extraction, but for now, we'll use a simple approach.
247+
For mission-critical use cases, it is important to evaluate the quality of the extraction. CocoIndex supports a simple way to evaluate the extraction. More updates are coming soon.
200248

201249
1. Dump the extracted data to YAML files.
202250

@@ -223,49 +271,26 @@ Let's define the CocoIndex flow to extract the structured data from patient inta
223271
Double-click any row to see the file-level diff. In my case, `condition` is missing for the `Patient_Intake_Form_Joe.pdf` file.
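The same kind of check can be scripted. The sketch below is a minimal, assumption-laden illustration: it compares a golden record against an extracted record as plain nested dicts and reports which fields are missing or changed. `diff_fields` is a hypothetical helper, the records are made-up sample data, and loading the dumped YAML files into dicts (which needs a YAML parser) is left out.

```python
def diff_fields(golden: dict, extracted: dict, prefix: str = "") -> list[str]:
    """List dotted field paths where the extracted record differs from the golden one."""
    problems: list[str] = []
    for key, expected in golden.items():
        path = f"{prefix}{key}"
        actual = extracted.get(key)
        if actual is None and expected is not None:
            # Field is absent or empty in the extraction.
            problems.append(f"missing: {path}")
        elif isinstance(expected, dict) and isinstance(actual, dict):
            # Nested record: recurse with an extended prefix.
            problems.extend(diff_fields(expected, actual, prefix=f"{path}."))
        elif expected != actual:
            problems.append(f"changed: {path}")
    return problems


# Made-up sample records, for illustration only.
golden = {
    "name": "Joe Doe",
    "condition": "Asthma",
    "contact": {"phone": "555-0100", "email": "joe@example.com"},
}
extracted = {
    "name": "Joe Doe",
    "condition": None,
    "contact": {"phone": "555-0100", "email": "joe@example.org"},
}

report = diff_fields(golden, extracted)
print(report)  # ['missing: condition', 'changed: contact.email']
```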
224272
225273
226-
### Troubleshooting
227-
228-
My original golden file for this record is [this one](https://github.com/cocoindex-io/patient-intake-extraction/blob/main/data/example_forms/Patient_Intake_Form_Joe_Artificial.pdf).
274+
## Troubleshooting
275+
If the extraction is not ideal, this is how I troubleshoot. My original golden file for this record is [this one](https://github.com/cocoindex-io/patient-intake-extraction/blob/main/data/example_forms/Patient_Intake_Form_Joe_Artificial.pdf).
229276
230-
231-
We will troubleshoot in two steps:
277+
We can troubleshoot in two steps:
232278
1. Convert to Markdown
233279
2. Extract structured data from Markdown
234280
235-
In this tutorial, we'll show how to use CocoInsight to troubleshoot this issue.
281+
I also use CocoInsight to help me troubleshoot.
236282
237283
```bash
238284
cocoindex server -ci main.py
239285
```
240286
241-
Go to https://cocoindex.io/cocoinsight. You could see an interactive UI to explore the data.
242-
243-
244-
Click on the `markdown` column for `Patient_Intake_Form_Joe.pdf`, you could see the Markdown content.
287+
Go to `https://cocoindex.io/cocoinsight`. You will see an interactive UI to explore the data.
245288
246289
247-
It is not well understood by LLM extraction. So here we could try a few different models with the Markdown converter/LLM to iterate and see if we can get better results, or needs manual correction.
290+
Click on the `markdown` column for `Patient_Intake_Form_Joe.pdf` to see the Markdown content. From there, we can try a few different models for the Markdown converter or the extraction LLM and iterate to see whether the results improve, or whether the record needs manual correction.
248291
249292
250-
## Query the extracted data
251-
252-
Run following commands to setup and update the index.
253-
```
254-
cocoindex setup main.py
255-
cocoindex update main.py
256-
```
257-
You'll see the index updates state in the terminal.
258-
259-
After the index is built, you have a table with the name `patients_info`. You can query it at any time, e.g., start a Postgres shell:
260-
261-
```bash
262-
psql postgres://cocoindex:cocoindex@localhost/cocoindex
263-
```
264-
265-
The run:
266-
267-
```sql
268-
select * from patients_info;
269-
```
293+
## Connect to other sources
294+
CocoIndex natively supports Google Drive, Amazon S3, Azure Blob Storage, and more.
270295
271-
You could see the patients_info table.
296+
<DocumentationButton href="https://cocoindex.io/docs/ops/sources" text="Sources" margin="0 0 16px 0" />