If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
+## Overview
+With CocoIndex, you can define a nested schema as Python dataclasses and use an LLM to extract structured data from unstructured data. This example shows how to extract structured data from patient intake forms.
:::info
-The extraction quality is highly dependent on the OCR quality. You can use CocoIndex with any commercial parser (or open source ones) that is tailored for your domain for better results. For example, Document AI from Google Cloud and more.
+The extraction quality is highly dependent on the OCR quality. You can use CocoIndex with any commercial or open source parser that is tailored to your domain for better results, for example Document AI from Google Cloud.
:::

-### Google Drive as alternative source (optional)
-If you plan to load patient intake forms from Google Drive, you can refer to this [example](https://cocoindex.io/blogs/text-embedding-from-google-drive#enable-google-drive-access-by-service-account) for more details.
## Parse documents with different formats to Markdown
+
+Define a custom function to parse documents in any format to Markdown. Here we use [MarkItDown](https://github.com/microsoft/markitdown) to convert the file to Markdown. It can also parse with an LLM such as `gpt-4o`. At present, MarkItDown supports PDF, Word, Excel, images (EXIF metadata and OCR), and more.
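
A minimal sketch of such a conversion function, assuming the `markitdown` package is installed. `to_markdown` and its `convert` parameter are hypothetical names used for illustration, not CocoIndex's API; in a real flow the function would be registered as a CocoIndex custom function. MarkItDown converts from a file path, so the bytes are first written to a temporary file that keeps the original extension.

```python
import os
import tempfile
from typing import Callable, Optional


def _markitdown_convert(path: str) -> str:
    # Imported lazily so the rest of the sketch runs without the package.
    from markitdown import MarkItDown

    return MarkItDown().convert(path).text_content


def to_markdown(
    content: bytes,
    filename: str,
    convert: Optional[Callable[[str], str]] = None,
) -> str:
    """Write `content` to a temporary file and convert it to Markdown.

    `convert` is a hypothetical hook that maps a file path to Markdown
    text; it defaults to MarkItDown.
    """
    convert = convert or _markitdown_convert
    # Keep the original extension so the converter can use it as a format hint.
    suffix = os.path.splitext(filename)[1]
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "input" + suffix)
        with open(path, "wb") as f:
            f.write(content)
        return convert(path)
```

Passing the converter as a parameter keeps the file handling testable without MarkItDown and makes it easy to swap in another parser.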
We are going to define the patient info schema for structured extraction. A good reference for a patient info schema is the [FHIR standard - Patient Resource](https://build.fhir.org/patient.html#resource).

-In this tutorial, we'll define a simplified schema for patient information extraction:
+In this tutorial, we'll define a simplified schema as nested dataclasses for patient information extraction:

```python
@dataclasses.dataclass
@@ -105,98 +178,73 @@ class Patient:
    consent_date: datetime.date | None
```

-### 2. Define CocoIndex Flow
-Let's define the CocoIndex flow to extract the structured data from patient intake forms.
2. Parse documents with different formats to Markdown
-
-Define a custom function to parse documents in any format to Markdown. Here we use [MarkItDown](https://github.com/microsoft/markitdown) to convert the file to Markdown. It also provides options to parse by LLM, like `gpt-4o`.
-At present, MarkItDown supports: PDF, Word, Excel, Images (EXIF metadata and OCR), etc. You could find its documentation [here](https://github.com/microsoft/markitdown).
+## Extract structured data from Markdown
+CocoIndex provides built-in functions (e.g. `ExtractByLlm`) that process data using LLMs. You can pass the Python dataclass `Patient` directly to the function, and it will automatically parse the LLM response into the dataclass.
After the extraction, we collect all the fields for simplicity. You can also select specific fields and apply data mapping or field-level transformations before collection. If you have any questions, feel free to ask us in [Discord](https://discord.com/invite/zpA9S2DR7s).
CocoIndex provides built-in functions (e.g. `ExtractByLlm`) that process data using LLMs. In this example, we use `gpt-4o` from OpenAI to extract structured data from the Markdown. We also provide built-in support for Ollama, which allows you to run LLMs on your local machine easily.
🎉 Now you are all set with the extraction! For mission-critical use cases, it is important to evaluate the quality of the extraction. CocoIndex supports a simple way to evaluate the extraction. There may be fancier ways to evaluate the extraction, but for now, we'll use a simple approach.
+For mission-critical use cases, it is important to evaluate the quality of the extraction. CocoIndex supports a simple way to evaluate the extraction. More updates are coming soon.

1. Dump the extracted data to YAML files.
@@ -223,49 +271,26 @@ Let's define the CocoIndex flow to extract the structured data from patient inta
Double click on any row to see the file-level diff. In my case, `condition` is missing for the `Patient_Intake_Form_Joe.pdf` file.
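
The file-level diff can also be approximated in code. The sketch below uses a hypothetical helper (`diff_records` is not part of CocoIndex) that compares an extracted record with its golden record and reports fields that are missing or different. It assumes both records were already loaded into plain dicts, for example from the dumped YAML files; the sample values are made up.

```python
from typing import Any


def diff_records(
    golden: dict[str, Any],
    extracted: dict[str, Any],
    prefix: str = "",
) -> list[str]:
    """Return human-readable differences between two nested records."""
    problems: list[str] = []
    for key, expected in golden.items():
        path = f"{prefix}{key}"
        actual = extracted.get(key)
        if actual == expected:
            continue
        if actual in (None, "", [], {}):
            # The golden record has a value but the extraction does not.
            problems.append(f"missing: {path}")
        elif isinstance(expected, dict) and isinstance(actual, dict):
            # Recurse into nested records and keep a dotted field path.
            problems.extend(diff_records(expected, actual, prefix=f"{path}."))
        else:
            problems.append(
                f"mismatch: {path}: expected {expected!r}, got {actual!r}"
            )
    return problems


# Made-up records for illustration only.
golden = {"name": "Joe Doe", "condition": "asthma", "contact": {"phone": "555-0100"}}
extracted = {"name": "Joe Doe", "condition": None, "contact": {"phone": "555-0199"}}
report = diff_records(golden, extracted)
```

Running this over every dumped file gives a quick list of which records need a closer look before opening the full diff.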

-### Troubleshooting
-
-My original golden file for this record is [this one](https://github.com/cocoindex-io/patient-intake-extraction/blob/main/data/example_forms/Patient_Intake_Form_Joe_Artificial.pdf).
+## Troubleshooting
+If extraction is not ideal, this is how I troubleshoot. My original golden file for this record is [this one](https://github.com/cocoindex-io/patient-intake-extraction/blob/main/data/example_forms/Patient_Intake_Form_Joe_Artificial.pdf).

-
-We will troubleshoot in two steps:
+We could troubleshoot in two steps:
1. Convert to Markdown
2. Extract structured data from Markdown

-In this tutorial, we'll show how to use CocoInsight to troubleshoot this issue.
+I also use CocoInsight to help me troubleshoot.
```bash
cocoindex server -ci main.py
```

-Go to https://cocoindex.io/cocoinsight. You could see an interactive UI to explore the data.
-
-
-Click on the `markdown` column for `Patient_Intake_Form_Joe.pdf`, you could see the Markdown content.
+Go to `https://cocoindex.io/cocoinsight`. You could see an interactive UI to explore the data.

-It is not well understood by LLM extraction. So here we could try a few different models with the Markdown converter/LLM to iterate and see if we can get better results, or needs manual correction.
+Click on the `markdown` column for `Patient_Intake_Form_Joe.pdf` to see the Markdown content. We could try a few different models with the Markdown converter/LLM to iterate and see if we get better results or whether manual correction is needed.

-## Query the extracted data
-
-Run following commands to setup and update the index.
-```
-cocoindex setup main.py
-cocoindex update main.py
-```
-You'll see the index updates state in the terminal.
-
-After the index is built, you have a table with the name `patients_info`. You can query it at any time, e.g., start a Postgres shell: