22
33## About
44
5- This project provides a flexible solution for extracting structured information from
6- documents using Gemini family of models via the Vertex AI API. It exposes this
7- functionality through a simple Flask-based web server running on Cloud Run.
5+ This project provides a flexible solution for classifying and extracting structured
6+ information from documents using the Gemini family of models via the Vertex AI API. It
7+ exposes this functionality through a simple Flask-based web server running on Cloud Run.
88
99### When to use this solution
1010
11- The use of Gemini API for entity extraction is especially useful when the structure of
12- the document varies from one document type to another or when it is unknown. For more
13- structured documents like complex forms, we recommend taking a look into
14- [ Document AI] ( https://cloud.google.com/document-ai/docs/overview ) , which provides
15- powerful mechanisms for entity extraction and layout parsing.
11+ Using the Gemini API for document classification and entity extraction is especially
12+ useful when the structure of the document varies from one document type to another or
13+ when it is unknown. For more structured documents like complex forms, we recommend
14+ taking a look at [Document AI](https://cloud.google.com/document-ai/docs/overview),
15+ which provides powerful mechanisms for entity extraction and layout parsing.
1616
1717## Overview
1818
1919The core of this project is a Python script that takes a document and a configuration
2020ID as input. The configuration specifies which Gemini model to use, the name of the
2121document type, and a JSON schema of the fields to extract. The script then prompts the
2222Gemini model to extract the requested information from the document and return it as a
23- JSON object.
23+ JSON object. It also allows you to classify a document based on a description of the
24+ document type.
2425
2526This is wrapped in a Flask web application, allowing you to easily integrate document
2627extraction capabilities into your own services via an HTTP API. We provide a script for
@@ -29,7 +30,7 @@ deployment to a service in Cloud Run.
2930## Features
3031
3132- ** AI-Powered Extraction** : Leverages the multimodal capabilities of Gemini models to
32- understand and extract data from documents.
33+ understand, classify and extract data from documents.
3334- ** Configurable Schemas** : Easily define different extraction schemas for various
3435 document types (e.g., reports, legal documents) in a central configuration file.
3536- ** JSON Output** : The model is prompted to return structured data in JSON format,
@@ -41,9 +42,9 @@ deployment to a service in Cloud Run.
4142
4243## Architecture
4344
44- The current solution includes the ability to extract entities based on the specific
45- document type and the fields specified in the configuration using the Gemini API
46- (online).
45+ The current solution includes the ability to classify documents and extract entities
46+ based on the specific document type and the fields specified in the configuration using
47+ the Gemini API (online).
4748
4849![ Current Architecture] ( ./images/current_architecture.png )
4950
@@ -85,18 +86,18 @@ document type and the fields specified in the configuration using the Gemini API
8586
8687## Testing
8788
88- A simple test case is provided in ` entity_extraction_test .py` .
89+ A simple test case is provided in `document_processing_test.py`.
8990
9091To run the provided test:
9192
9293` ` ` bash
93- python entity_extraction_test .py
94+ python document_processing_test.py
9495` ` `
9596
96- This will call the ` extract_from_document ` function with a sample document and assert
97- that the output matches the expected JSON.
97+ This will call the relevant functions from `document_processing.py` with sample
98+ documents and assert that the outputs match the expected JSON.
9899
99- ** Note:** Running the test will make a live call to the Vertex AI API and may incur
100+ **Note:** Running the test will make live calls to the Vertex AI API and may incur
100101costs.
101102
102103## Usage
@@ -130,7 +131,7 @@ curl -X POST https://YOUR-CLOUD-RUN-URL/extract \
130131-H "Content-Type: application/json" \
131132-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
132133-d '{
133- "extract_config_id": "exhibit_2021q1 ",
134+ "extract_config_id": "form_10_q",
134135 "document_uri": "gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2021Q1_alphabet_earnings_release.pdf"
135136}'
136137` ` `
@@ -139,16 +140,70 @@ curl -X POST https://YOUR-CLOUD-RUN-URL/extract \
139140
140141` ` ` json
141142{
142- " google_ceo" : " Sundar Pichai" ,
143- " company_name" : " Alphabet Inc."
143+ "year": "2021",
144+ "quarter": "Q1",
145+ "company_name": "Alphabet Inc.",
146+ "ceo": "Sundar Pichai",
147+ "net_income_millions": "17930"
144148}
145149` ` `
146150
147- ## Configuration of entities for extraction and prompt
151+ ### Sending a Classification Request
152+
153+ You can send a POST request to the `/classify` endpoint with a JSON payload containing
154+ the `document_uri` (a GCS URI for the PDF).
155+ Here is an example using curl. Replace `YOUR-CLOUD-RUN-URL` with the URL that you get
156+ after you deploy the service to Cloud Run.
157+
158+ ```bash
159+ curl -X POST https://YOUR-CLOUD-RUN-URL/classify \
160+ -H "Content-Type: application/json" \
161+ -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
162+ -d '{
163+ "document_uri": "gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2021Q1_alphabet_earnings_release.pdf"
164+ }'
165+ ```
166+
167+ ** Expected Response:**
168+
169+ ```json
170+ {
171+ "class": "form_10_q"
172+ }
173+ ```
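
The same classification request can also be prepared from Python. The sketch below only builds the request object and uses the standard library; the helper name `build_classify_request` and the `IDENTITY_TOKEN` placeholder are illustrative assumptions, not part of this project.

```python
import json
import urllib.request


def build_classify_request(
    service_url: str, document_uri: str, identity_token: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /classify endpoint.

    Hypothetical helper: it mirrors the curl example, nothing more.
    """
    payload = json.dumps({"document_uri": document_uri}).encode("utf-8")
    return urllib.request.Request(
        url=f"{service_url.rstrip('/')}/classify",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {identity_token}",
        },
        method="POST",
    )


request = build_classify_request(
    "https://YOUR-CLOUD-RUN-URL",
    "gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2021Q1_alphabet_earnings_release.pdf",
    "IDENTITY_TOKEN",  # placeholder; e.g. the output of `gcloud auth print-identity-token`
)
```

Passing `request` to `urllib.request.urlopen` would send it; the response body can then be parsed with `json.load`.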
174+
175+ ### Sending a Classification and Extraction Request
176+
177+ You can send a POST request to the `/classify_and_extract` endpoint with a JSON payload
178+ containing the `document_uri`. The service will first classify the document and then use
179+ the corresponding extraction configuration.
180+
181+ ```bash
182+ curl -X POST https://YOUR-CLOUD-RUN-URL/classify_and_extract \
183+ -H "Content-Type: application/json" \
184+ -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
185+ -d '{
186+ "document_uri": "gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2021Q1_alphabet_earnings_release.pdf"
187+ }'
188+ ```
189+
190+ ** Expected Response:**
191+
192+ ```json
193+ {
194+ "year": "2021",
195+ "quarter": "Q1",
196+ "company_name": "Alphabet Inc.",
197+ "ceo": "Sundar Pichai",
198+ "net_income_millions": "17930"
199+ }
200+ ```
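
The two-step flow described above (classify first, then extract with the matching configuration) could be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the classifier and extractor are passed in as plain callables, and the predicted class name is assumed to double as the extraction configuration ID, as the `form_10_q` examples suggest. The project's actual implementation may differ.

```python
from typing import Callable


def classify_and_extract(
    document_uri: str,
    classify: Callable[[str], str],
    extract: Callable[[str, str], dict],
    known_config_ids: set[str],
) -> dict:
    """Classify the document, then extract using the config named after its class."""
    document_class = classify(document_uri)
    # Assumption: the class name is also the extraction config ID.
    if document_class not in known_config_ids:
        raise ValueError(f"No extraction config for class {document_class!r}")
    return extract(document_class, document_uri)


# Offline stubs standing in for the Gemini-backed calls.
def fake_classify(document_uri: str) -> str:
    return "form_10_q"


def fake_extract(config_id: str, document_uri: str) -> dict:
    return {"config_used": config_id, "document_uri": document_uri}


result = classify_and_extract(
    "gs://bucket/doc.pdf", fake_classify, fake_extract, {"form_10_q"}
)
```

Injecting the two callables keeps the control flow testable without making live calls to the Vertex AI API.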
201+
202+ ## Configuration of entities for classification, extraction and prompts
148203
149204### Entities
150205
151- The extraction behavior is controlled by the entity extraction configuration file
206+ The classification and extraction behavior is controlled by the configuration file
152207` config.json` , which holds the configurations for different document types and the
153208fields to extract. To add a new document type or field, simply add new key-value pairs.
154209
@@ -173,14 +228,28 @@ look for it locally. Example:
173228CONFIG_PATH=" config.json"
174229` ` `
175230
231+ For classification, you can define the classes under the `classification_config` key.
232+ The model will use the descriptions to classify the document.
233+
234+ ```json
235+ "classification_config": {
236+   "document_mime_type": "application/pdf",
237+   "model": "gemini-2.5-flash",
238+   "classes": {
239+     "class_name_1": "Description of the first document class",
240+     "class_name_2": "Description of the second document class"
241+   }
242+ }
243+ ```
176245### Prompt
177246
178- The constant ` PROMPT_TEMPLATE ` in ` entity_extraction.py ` is the template for the prompt
179- sent to the Gemini model. You can customize it to improve extraction accuracy for your
180- specific use case.
247+ The constants `EXTRACT_PROMPT_TEMPLATE` and `CLASSIFY_PROMPT_TEMPLATE` in
248+ `document_processing.py` are the templates for the prompts sent to the Gemini model.
249+ You can customize them to improve classification and extraction accuracy for your specific use case.
181250
182251` ` ` python
183- PROMPT_TEMPLATE = " " " \
252+ EXTRACT_PROMPT_TEMPLATE = """\
184253 Based solely on this {document_name}, extract the following fields.
185254 If the information is missing, write "missing" next to the field.
186255 Output as JSON.
@@ -203,15 +272,15 @@ We are planning to add the following functionalities to this project:
203272 of an evaluation dataset. This feature will provide a robust method for evaluating
204273 model performance and fine-tuning prompts to achieve better results.
205274
206- - Document classification to detect the type of document instead of requiring from the
207- user to provide the document type before the step of entity extraction.
208-
209275Future architecture:
210276
211277
212278## Authors
213279
214- [Ariel Jassan](https://github.com/arieljassan), [Ben Mizrahi](https://github.com/benmizrahi)
280+ | Authors |
281+ | ---------------------------------------------- |
282+ | [Ariel Jassan](https://github.com/arieljassan) |
283+ | [Ben Mizrahi](https://github.com/benmizrahi) |
215284
216285## Disclaimer
217286