Commit 3016e50

Add extraction feature (#2419)
1 parent b5eef00 commit 3016e50

9 files changed: +369 -92 lines changed

gemini/use-cases/entity-extraction/README.md

Lines changed: 100 additions & 31 deletions
@@ -2,25 +2,26 @@
22

33
## About
44

5-
This project provides a flexible solution for extracting structured information from
6-
documents using Gemini family of models via the Vertex AI API. It exposes this
7-
functionality through a simple Flask-based web server running on Cloud Run.
5+
This project provides a flexible solution for classifying and extracting structured
6+
information from documents using Gemini family of models via the Vertex AI API. It
7+
exposes this functionality through a simple Flask-based web server running on Cloud Run.
88

99
### When to use this solution
1010

11-
The use of Gemini API for entity extraction is especially useful when the structure of
12-
the document varies from one document type to another or when it is unknown. For more
13-
structured documents like complex forms, we recommend taking a look into
14-
[Document AI](https://cloud.google.com/document-ai/docs/overview), which provides
15-
powerful mechanisms for entity extraction and layout parsing.
11+
The use of Gemini API for document classification and entity extraction is especially
12+
useful when the structure of the document varies from one document type to another or
13+
when it is unknown. For more structured documents like complex forms, we recommend
14+
taking a look into [Document AI](https://cloud.google.com/document-ai/docs/overview),
15+
which provides powerful mechanisms for entity extraction and layout parsing.
1616

1717
## Overview
1818

1919
The core of this project is a Python script that takes a document and a configuration
2020
ID as input. The configuration specifies which Gemini model to use, the name of the
2121
document type, and a JSON schema of the fields to extract. The script then prompts the
2222
Gemini model to extract the requested information from the document and return it as a
23-
JSON object.
23+
JSON object. It also allows you to classify a document based on a description of the
24+
document type.
2425

2526
This is wrapped in a Flask web application, allowing you to easily integrate document
2627
extraction capabilities into your own services via an HTTP API. We provide a script for
@@ -29,7 +30,7 @@ deployment to a service in Cloud Run.
2930
## Features
3031

3132
- **AI-Powered Extraction**: Leverages the multimodal capabilities of Gemini models to
32-
understand and extract data from documents.
33+
understand, classify and extract data from documents.
3334
- **Configurable Schemas**: Easily define different extraction schemas for various
3435
document types (e.g., reports, legal documents) in a central configuration file.
3536
- **JSON Output**: The model is prompted to return structured data in JSON format,
@@ -41,9 +42,9 @@ deployment to a service in Cloud Run.
4142

4243
## Architecture
4344

44-
The current solution includes the ability to extract entities based on the specific
45-
document type and the fields specified in the configuration using the Gemini API
46-
(online).
45+
The current solution includes the ability to classify documents and extract entities
46+
based on the specific document type and the fields specified in the configuration using
47+
the Gemini API (online).
4748

4849
![Current Architecture](./images/current_architecture.png)
4950

@@ -85,18 +86,18 @@ document type and the fields specified in the configuration using the Gemini API
8586

8687
## Testing
8788

88-
A simple test case is provided in `entity_extraction_test.py`.
89+
A simple test case is provided in `document_processing_test.py`.
8990

9091
To run the provided test:
9192

9293
```bash
93-
python entity_extraction_test.py
94+
python document_processing_test.py
9495
```
9596

96-
This will call the `extract_from_document` function with a sample document and assert
97-
that the output matches the expected JSON.
97+
This will call the relevant functions from `document_processing.py` with sample
98+
documents and assert that the outputs match the expected JSON.
9899

99-
**Note:** Running the test will make a live call to the Vertex AI API and may incur
100+
**Note:** Running the test will make live calls to the Vertex AI API and may incur
100101
costs.
101102

102103
## Usage
@@ -130,7 +131,7 @@ curl -X POST https://YOUR-CLOUD-RUN-URL/extract \
130131
-H "Content-Type: application/json" \
131132
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
132133
-d '{
133-
"extract_config_id": "exhibit_2021q1",
134+
"extract_config_id": "form_10_q",
134135
"document_uri": "gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2021Q1_alphabet_earnings_release.pdf"
135136
}'
136137
```
@@ -139,16 +140,70 @@ curl -X POST https://YOUR-CLOUD-RUN-URL/extract \
139140

140141
```json
141142
{
142-
"google_ceo": "Sundar Pichai",
143-
"company_name": "Alphabet Inc."
143+
"year": "2021",
144+
"quarter": "Q1",
145+
"company_name": "Alphabet Inc.",
146+
"ceo": "Sundar Pichai",
147+
"net_income_millions": "17930"
144148
}
145149
```
146150

147-
## Configuration of entities for extraction and prompt
151+
### Sending a Classification Request
152+
153+
You can send a POST request to the `/classify` endpoint with a JSON payload containing
154+
the `document_uri` (a GCS URI for the PDF).
155+
Here is an example using curl. Replace `YOUR-CLOUD-RUN-URL` with the URL that you get
156+
after you deploy the service to Cloud Run.
157+
158+
```bash
159+
curl -X POST https://YOUR-CLOUD-RUN-URL/classify \
160+
-H "Content-Type: application/json" \
161+
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
162+
-d '{
163+
"document_uri": "gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2021Q1_alphabet_earnings_release.pdf"
164+
}'
165+
```
166+
167+
**Expected Response:**
168+
169+
```json
170+
{
171+
"class": "form_10_q"
172+
}
173+
```
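
The same classification request can be built from Python. The following is a minimal sketch using only the standard library; the helper name `build_classify_request` and the `IDENTITY_TOKEN` placeholder are illustrative assumptions, not part of this commit. It builds the request without sending it, so it can be inspected before any live call is made.

```python
import json
import urllib.request


def build_classify_request(
    base_url: str, document_uri: str, identity_token: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /classify endpoint."""
    payload = json.dumps({"document_uri": document_uri}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/classify",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {identity_token}",
        },
        method="POST",
    )


# Placeholder values: substitute your deployed service URL and a real identity
# token (for example, the output of `gcloud auth print-identity-token`).
request = build_classify_request(
    "https://YOUR-CLOUD-RUN-URL",
    "gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/"
    "2021Q1_alphabet_earnings_release.pdf",
    "IDENTITY_TOKEN",
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen(request)` and decode the JSON body of the response.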
174+
175+
### Sending a Classification and Extraction Request
176+
177+
You can send a POST request to the `/classify_and_extract` endpoint with a JSON payload
178+
containing the document_uri. The service will first classify the document and then use
179+
the corresponding extraction configuration.
180+
181+
```bash
182+
curl -X POST https://YOUR-CLOUD-RUN-URL/classify_and_extract \
183+
-H "Content-Type: application/json" \
184+
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
185+
-d '{
186+
"document_uri": "gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2021Q1_alphabet_earnings_release.pdf"
187+
}'
188+
```
189+
190+
**Expected Response:**
191+
192+
```json
193+
{
194+
"year": "2021",
195+
"quarter": "Q1",
196+
"company_name": "Alphabet Inc.",
197+
"ceo": "Sundar Pichai",
198+
"net_income_millions": "17930"
199+
}
200+
```
201+
202+
## Configuration of entities for classification, extraction and prompts
148203

149204
### Entities
150205

151-
The extraction behavior is controlled by the entity extraction configuration file
206+
The classification and extraction behavior is controlled by the configuration file
152207
`config.json`, which holds the configurations for different document types and the
153208
fields to extract. To add a new document type or field, simply add new key-value pairs.
154209

@@ -173,14 +228,28 @@ look for it locally. Example:
173228
CONFIG_PATH="config.json"
174229
```
175230

231+
For classification, you can define the classes under the classification_config key.
232+
The model will use the descriptions to classify the document.
233+
234+
```json
235+
"classification_config": {
236+
"document_mime_type": "application/pdf",
237+
"model": "gemini-2.5-flash",
238+
"classes": {
239+
"class_name_1": "Description of the first document class",
240+
"class_name_2": "Description of the second document class"
241+
}
242+
}
243+
```
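
To see how the class descriptions reach the model, the sketch below renders the classification prompt for the two placeholder classes above. The template text is copied from `CLASSIFY_PROMPT_TEMPLATE` in `document_processing.py` as added in this commit; the helper name `render_classify_prompt` is an assumption made for illustration.

```python
import json

# Copied from document_processing.py in this commit.
CLASSIFY_PROMPT_TEMPLATE = """\
Based on the content of the document, classify it as one of the following classes.
Output as JSON in the following format:
"class": "class_name"


Classes:\n
{classes}
"""


def render_classify_prompt(classes: dict[str, str]) -> str:
    """Fill the template with the class names and descriptions as indented JSON."""
    return CLASSIFY_PROMPT_TEMPLATE.format(classes=json.dumps(classes, indent=4))


example_classes = {
    "class_name_1": "Description of the first document class",
    "class_name_2": "Description of the second document class",
}
prompt = render_classify_prompt(example_classes)
print(prompt)
```

Printing the rendered prompt is a cheap way to review class descriptions before making a billable model call.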
244+
176245
### Prompt
177246

178-
The constant `PROMPT_TEMPLATE` in `entity_extraction.py` is the template for the prompt
179-
sent to the Gemini model. You can customize it to improve extraction accuracy for your
180-
specific use case.
247+
The constants `EXTRACT_PROMPT_TEMPLATE` and `CLASSIFY_PROMPT_TEMPLATE` in
248+
`document_processing.py` are the templates for the prompts sent to the Gemini model.
249+
You can customize them to improve classification and extraction accuracy for your specific use case.
181250

182251
```python
183-
PROMPT_TEMPLATE = """\
252+
EXTRACT_PROMPT_TEMPLATE = """\
184253
Based solely on this {document_name}, extract the following fields.
185254
If the information is missing, write "missing" next to the field.
186255
Output as JSON.
@@ -203,15 +272,15 @@ We are planning to add the following functionalities to this project:
203272
of an evaluation dataset. This feature will provide a robust method for evaluating
204273
model performance and fine-tuning prompts to achieve better results.
205274

206-
- Document classification to detect the type of document instead of requiring from the
207-
user to provide the document type before the step of entity extraction.
208-
209275
Future architecture:
210276
![Future Architecture](./images/future_architecture.png)
211277

212278
## Authors
213279

214-
[Ariel Jassan](https://github.com/arieljassan), [Ben Mizrahi](https://github.com/benmizrahi)
280+
| Authors |
281+
| ---------------------------------------------- |
282+
| [Ariel Jassan](https://github.com/arieljassan) |
283+
| [Ben Mizrahi](https://github.com/benmizrahi) |
215284

216285
## Disclaimer
217286

Lines changed: 35 additions & 8 deletions
@@ -1,11 +1,38 @@
11
{
2-
"exhibit_2021q1": {
3-
"document_name": "Alphabet First Quarter 2021 Results",
4-
"document_mime_type": "application/pdf",
5-
"model": "gemini-2.5-flash",
6-
"fields": {
7-
"google_ceo": "CEO of Google",
8-
"company_name": "Name of the company"
2+
"extraction_configs": {
3+
"form_10_k": {
4+
"document_name": "Annual report of a company",
5+
"document_mime_type": "application/pdf",
6+
"model": "gemini-2.5-flash",
7+
"fields": {
8+
"year": "Year of the report",
9+
"company_name": "Name of the company",
10+
"ceo": "Name of the CEO of the company",
11+
"net_income_millions": "Net income in million dollars"
12+
}
13+
},
14+
15+
"form_10_q": {
16+
"document_name": "Quarter report of a company",
17+
"document_mime_type": "application/pdf",
18+
"model": "gemini-2.5-flash",
19+
"fields": {
20+
"year": "Year of the report",
21+
"quarter": "Quarter of the report (Q1, Q2, Q3, Q4)",
22+
"company_name": "Name of the company",
23+
"ceo": "Name of the CEO of the company",
24+
"net_income_millions": "Net income in million dollars"
25+
}
26+
}
27+
},
28+
29+
"classification_config": {
30+
"document_mime_type": "application/pdf",
31+
"model": "gemini-2.5-flash",
32+
"classes": {
33+
"form_10_k": "Annual report of a company",
34+
"form_10_q": "Quarter report of a company",
35+
"other": "None of the above"
36+
}
937
}
10-
}
1138
}
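
One consequence of this layout is worth checking: `classify_and_extract_document` raises a `ValueError` when the predicted class has no entry under `extraction_configs`, which is the case for `other` in the config above. The sketch below lists such classes; the function name `classes_without_extraction_config` is hypothetical, and the config is trimmed to the keys that matter here.

```python
def classes_without_extraction_config(config: dict) -> list[str]:
    """Return class names that have no matching extraction config ID."""
    extraction_ids = set(config.get("extraction_configs", {}))
    classes = config.get("classification_config", {}).get("classes", {})
    return sorted(name for name in classes if name not in extraction_ids)


# Trimmed copy of the config in this commit (field definitions omitted).
config = {
    "extraction_configs": {
        "form_10_k": {"document_name": "Annual report of a company"},
        "form_10_q": {"document_name": "Quarter report of a company"},
    },
    "classification_config": {
        "classes": {
            "form_10_k": "Annual report of a company",
            "form_10_q": "Quarter report of a company",
            "other": "None of the above",
        }
    },
}

print(classes_without_extraction_config(config))  # ['other']
```

Running a check like this when the config is loaded would make it explicit which classes are classification-only.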

gemini/use-cases/entity-extraction/entity_extraction.py renamed to gemini/use-cases/entity-extraction/document_processing.py

Lines changed: 50 additions & 5 deletions
@@ -12,7 +12,7 @@
1212
# See the License for the specific language governing permissions and
1313
# limitations under the License.
1414

15-
"""Main logic for entity extraction."""
15+
"""Main logic for classification and entity extraction."""
1616

1717
import json
1818
import os
@@ -22,7 +22,7 @@
2222
from google import genai
2323
from google.genai import types
2424

25-
PROMPT_TEMPLATE = """\
25+
EXTRACT_PROMPT_TEMPLATE = """\
2626
Based solely on this {document_name}, extract the following fields.
2727
If the information is missing, write "missing" next to the field.
2828
Output as JSON.
@@ -31,6 +31,15 @@
3131
{fields}
3232
"""
3333

34+
CLASSIFY_PROMPT_TEMPLATE = """\
35+
Based on the content of the document, classify it as one of the following classes.
36+
Output as JSON in the following format:
37+
"class": "class_name"
38+
39+
40+
Classes:\n
41+
{classes}
42+
"""
3443

3544
dotenv.load_dotenv()
3645
project_id = os.environ.get("GEMINI_PROJECT_ID")
@@ -45,11 +54,12 @@
4554

4655

4756
def extract_from_document(extract_config_id: str, document_uri: str) -> str:
48-
extract_config = CONFIGS[extract_config_id]
57+
"""Extract entities from a document."""
58+
extract_config = CONFIGS["extraction_configs"][extract_config_id]
4959

50-
prompt = PROMPT_TEMPLATE.format(
60+
prompt = EXTRACT_PROMPT_TEMPLATE.format(
5161
document_name=extract_config["document_name"],
52-
fields=json.dumps(extract_config["fields"], indent=2),
62+
fields=json.dumps(extract_config["fields"], indent=4),
5363
)
5464

5565
response = client.models.generate_content(
@@ -66,3 +76,38 @@ def extract_from_document(extract_config_id: str, document_uri: str) -> str:
6676
},
6777
)
6878
return response.text
79+
80+
81+
def classify_document(document_uri: str) -> str:
82+
"""Classify a document."""
83+
classification_config = CONFIGS["classification_config"]
84+
85+
prompt = CLASSIFY_PROMPT_TEMPLATE.format(
86+
classes=json.dumps(classification_config["classes"], indent=4),
87+
)
88+
89+
response = client.models.generate_content(
90+
model=classification_config["model"],
91+
contents=[
92+
types.Part.from_uri(
93+
file_uri=document_uri,
94+
mime_type=classification_config["document_mime_type"],
95+
),
96+
prompt,
97+
],
98+
config={
99+
"response_mime_type": "application/json",
100+
},
101+
)
102+
return response.text
103+
104+
105+
def classify_and_extract_document(document_uri: str) -> str:
106+
"""Classify a document and extract entities from it."""
107+
classification_response = classify_document(document_uri)
108+
classification_result = json.loads(classification_response)
109+
document_class = classification_result.get("class")
110+
if not document_class or document_class not in CONFIGS["extraction_configs"]:
111+
raise ValueError("Document classification failed.")
112+
113+
return extract_from_document(document_class, document_uri)
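
The check in `classify_and_extract_document` assumes the model returns a JSON object. A slightly more defensive variant is sketched below as a standalone helper. The name `parse_document_class` and the error messages are assumptions and not part of this commit; the helper reports malformed JSON, non-object responses, and unknown classes as `ValueError`.

```python
import json


def parse_document_class(response_text: str, known_ids: set[str]) -> str:
    """Return the predicted class if it maps to a known extraction config."""
    try:
        result = json.loads(response_text)
    except json.JSONDecodeError as error:
        raise ValueError("Classification response is not valid JSON.") from error

    # The model could return a list or a bare string instead of an object.
    if not isinstance(result, dict):
        raise ValueError("Classification response is not a JSON object.")

    document_class = result.get("class")
    if not isinstance(document_class, str) or document_class not in known_ids:
        raise ValueError(f"No extraction config for class: {document_class!r}")
    return document_class


known_ids = {"form_10_k", "form_10_q"}
print(parse_document_class('{"class": "form_10_q"}', known_ids))  # form_10_q
```

In the module itself, `known_ids` would correspond to the keys of `CONFIGS["extraction_configs"]`.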
