Commit de3a6fd

Merge branch 'develop' into feature/max-pages-for-classification
Author: Taniya Mathur
2 parents 939f68d + 109db50 commit de3a6fd

30 files changed: +1417 -1419 lines changed

config_library/pattern-1/lending-package-sample/config.yaml

Lines changed: 102 additions & 0 deletions
@@ -215,3 +215,105 @@ pricing:
         price: '1.5E-6'
       - name: cacheWriteInputTokens
         price: '1.875E-5'
+discovery:
+  output_format:
+    sample_json: |-
+      {
+        "document_class" : "Form-1040",
+        "document_description" : "Brief summary of the document",
+        "groups" : [
+          {
+            "name" : "PersonalInformation",
+            "description" : "Personal information of Tax payer",
+            "attributeType" : "group",
+            "groupAttributes" : [
+              {
+                "name": "FirstName",
+                "dataType" : "string",
+                "description" : "First Name of Taxpayer"
+              },
+              {
+                "name": "Age",
+                "dataType" : "number",
+                "description" : "Age of Taxpayer"
+              }
+            ]
+          },
+          {
+            "name" : "Dependents",
+            "description" : "Dependents of taxpayer",
+            "attributeType" : "list",
+            "listItemTemplate": {
+              "itemAttributes" : [
+                {
+                  "name": "FirstName",
+                  "dataType" : "string",
+                  "description" : "Dependent first name"
+                },
+                {
+                  "name": "Age",
+                  "dataType" : "number",
+                  "description" : "Dependent Age"
+                }
+              ]
+            }
+          }
+        ]
+      }
+  with_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
+      <GROUND_TRUTH_REFERENCE>
+      {ground_truth_json}
+      </GROUND_TRUTH_REFERENCE>
+      Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
+      Image may contain multiple pages, process all pages.
+      Extract all field names including those without values.
+      Do not change the group name and field name from ground truth in the extracted data json.
+      Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Use provided ground truth data as reference to optimize field
+      extraction and ensure consistency with expected document structure and
+      field definitions.
+    max_tokens: '10000'
+  without_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains forms data. Analyze the form line by line.
+      Image may contains multiple pages, process all the pages.
+      Form may contain multiple name value pair in one line.
+      Extract all the names in the form including the name value pair which doesn't have value.
+      Organize them into groups, extract field_name, data_type and field description
+      Field_name should be less than 60 characters, should not have space use '-' instead of space.
+      field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form.
+      Field_name should be unique within the group.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+
+      Group the fields based on the section they are grouped in the form. Group should have attributeType as "group".
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Return the extracted data in JSON format.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Analyze forms line by line to identify field names, data types,
+      and organizational structure. Focus on creating comprehensive blueprints
+      for document processing without extracting actual values.
+    max_tokens: '10000'
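
Reviewer note: the `with_ground_truth` prompt carries a `{ground_truth_json}` placeholder, and both prompts end with "using the below JSON format:", which suggests `sample_json` is appended at request time. The consuming code is not part of this diff, so the following is only a minimal sketch of how such a prompt might be assembled; the function name `build_discovery_prompt` and its parameters are hypothetical.

```python
import json
from typing import Optional


def build_discovery_prompt(
    user_prompt: str,
    sample_json: str,
    ground_truth: Optional[dict] = None,
) -> str:
    """Assemble a discovery prompt from config values (hypothetical helper).

    Assumption: the {ground_truth_json} placeholder is filled with the
    serialized ground truth, and sample_json is appended after the prompt.
    """
    prompt = user_prompt
    if ground_truth is not None:
        # str.replace is used instead of str.format so that any other
        # literal braces in the prompt text are left untouched.
        prompt = prompt.replace(
            "{ground_truth_json}", json.dumps(ground_truth, indent=2)
        )
    return f"{prompt}\n{sample_json}"
```

Using `str.replace` keeps the substitution safe if a prompt author later adds literal JSON braces to the prompt text.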

config_library/pattern-2/bank-statement-sample/config.yaml

Lines changed: 103 additions & 1 deletion
@@ -372,7 +372,7 @@ summarization:
 
 assessment:
   enabled: true
-  validation_enabled: true
+  validation_enabled: false
   image:
     target_height: ''
     target_width: ''
@@ -693,3 +693,105 @@ pricing:
         price: '1.5E-6'
       - name: cacheWriteInputTokens
         price: '1.875E-5'
+discovery:
+  output_format:
+    sample_json: |-
+      {
+        "document_class" : "Form-1040",
+        "document_description" : "Brief summary of the document",
+        "groups" : [
+          {
+            "name" : "PersonalInformation",
+            "description" : "Personal information of Tax payer",
+            "attributeType" : "group",
+            "groupAttributes" : [
+              {
+                "name": "FirstName",
+                "dataType" : "string",
+                "description" : "First Name of Taxpayer"
+              },
+              {
+                "name": "Age",
+                "dataType" : "number",
+                "description" : "Age of Taxpayer"
+              }
+            ]
+          },
+          {
+            "name" : "Dependents",
+            "description" : "Dependents of taxpayer",
+            "attributeType" : "list",
+            "listItemTemplate": {
+              "itemAttributes" : [
+                {
+                  "name": "FirstName",
+                  "dataType" : "string",
+                  "description" : "Dependent first name"
+                },
+                {
+                  "name": "Age",
+                  "dataType" : "number",
+                  "description" : "Dependent Age"
+                }
+              ]
+            }
+          }
+        ]
+      }
+  with_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
+      <GROUND_TRUTH_REFERENCE>
+      {ground_truth_json}
+      </GROUND_TRUTH_REFERENCE>
+      Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
+      Image may contain multiple pages, process all pages.
+      Extract all field names including those without values.
+      Do not change the group name and field name from ground truth in the extracted data json.
+      Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Use provided ground truth data as reference to optimize field
+      extraction and ensure consistency with expected document structure and
+      field definitions.
+    max_tokens: '10000'
+  without_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains forms data. Analyze the form line by line.
+      Image may contains multiple pages, process all the pages.
+      Form may contain multiple name value pair in one line.
+      Extract all the names in the form including the name value pair which doesn't have value.
+      Organize them into groups, extract field_name, data_type and field description
+      Field_name should be less than 60 characters, should not have space use '-' instead of space.
+      field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form.
+      Field_name should be unique within the group.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+
+      Group the fields based on the section they are grouped in the form. Group should have attributeType as "group".
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Return the extracted data in JSON format.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Analyze forms line by line to identify field names, data types,
+      and organizational structure. Focus on creating comprehensive blueprints
+      for document processing without extracting actual values.
+    max_tokens: '10000'
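
Reviewer note: `sample_json` implies a response shape with three top-level keys and groups whose `attributeType` is either `"group"` (fields under `groupAttributes`) or `"list"` (fields under `listItemTemplate.itemAttributes`). Nothing in this diff shows how model output is checked, so the sketch below is only one possible structural check under that assumption; `check_discovery_output` is a hypothetical name.

```python
import json
from typing import List


def check_discovery_output(raw: str) -> List[str]:
    """Return a list of structural problems found in a discovery response.

    Assumption: the response follows the shape of sample_json in the config.
    """
    problems: List[str] = []
    doc = json.loads(raw)

    for key in ("document_class", "document_description", "groups"):
        if key not in doc:
            problems.append(f"missing top-level key '{key}'")

    for group in doc.get("groups", []):
        name = group.get("name", "<unnamed>")
        kind = group.get("attributeType")
        if kind == "group":
            attrs = group.get("groupAttributes")
        elif kind == "list":
            attrs = group.get("listItemTemplate", {}).get("itemAttributes")
        else:
            problems.append(f"{name}: unexpected attributeType {kind!r}")
            continue
        if not isinstance(attrs, list):
            problems.append(f"{name}: missing attribute list")
            continue
        # Every attribute in sample_json carries these three keys.
        for attr in attrs:
            attr_name = attr.get("name", "<unnamed>")
            for field in ("name", "dataType", "description"):
                if field not in attr:
                    problems.append(f"{name}.{attr_name}: missing '{field}'")
    return problems
```

An empty return value means the response matches the sample structure; it says nothing about whether the discovered fields are correct.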

config_library/pattern-2/criteria-validation/config.yaml

Lines changed: 103 additions & 1 deletion
@@ -4,7 +4,7 @@
 notes: Criteria validation configuration for healthcare/insurance prior authorization
 assessment:
   enabled: true
-  validation_enabled: true
+  validation_enabled: false
 criteria_validation:
   model: us.anthropic.claude-3-5-sonnet-20240620-v1:0
   temperature: 0.0
@@ -212,3 +212,105 @@ pricing:
         price: 0.0000032
       - name: cacheReadInputTokens
         price: 0.0000002
+discovery:
+  output_format:
+    sample_json: |-
+      {
+        "document_class" : "Form-1040",
+        "document_description" : "Brief summary of the document",
+        "groups" : [
+          {
+            "name" : "PersonalInformation",
+            "description" : "Personal information of Tax payer",
+            "attributeType" : "group",
+            "groupAttributes" : [
+              {
+                "name": "FirstName",
+                "dataType" : "string",
+                "description" : "First Name of Taxpayer"
+              },
+              {
+                "name": "Age",
+                "dataType" : "number",
+                "description" : "Age of Taxpayer"
+              }
+            ]
+          },
+          {
+            "name" : "Dependents",
+            "description" : "Dependents of taxpayer",
+            "attributeType" : "list",
+            "listItemTemplate": {
+              "itemAttributes" : [
+                {
+                  "name": "FirstName",
+                  "dataType" : "string",
+                  "description" : "Dependent first name"
+                },
+                {
+                  "name": "Age",
+                  "dataType" : "number",
+                  "description" : "Dependent Age"
+                }
+              ]
+            }
+          }
+        ]
+      }
+  with_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
+      <GROUND_TRUTH_REFERENCE>
+      {ground_truth_json}
+      </GROUND_TRUTH_REFERENCE>
+      Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
+      Image may contain multiple pages, process all pages.
+      Extract all field names including those without values.
+      Do not change the group name and field name from ground truth in the extracted data json.
+      Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Use provided ground truth data as reference to optimize field
+      extraction and ensure consistency with expected document structure and
+      field definitions.
+    max_tokens: '10000'
+  without_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains forms data. Analyze the form line by line.
+      Image may contains multiple pages, process all the pages.
+      Form may contain multiple name value pair in one line.
+      Extract all the names in the form including the name value pair which doesn't have value.
+      Organize them into groups, extract field_name, data_type and field description
+      Field_name should be less than 60 characters, should not have space use '-' instead of space.
+      field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form.
+      Field_name should be unique within the group.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+
+      Group the fields based on the section they are grouped in the form. Group should have attributeType as "group".
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Return the extracted data in JSON format.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Analyze forms line by line to identify field names, data types,
+      and organizational structure. Focus on creating comprehensive blueprints
+      for document processing without extracting actual values.
+    max_tokens: '10000'
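
Reviewer note: `top_p`, `temperature`, and `max_tokens` are stored as quoted strings (`'0.1'`, `'1.0'`, `'10000'`) in all three configs, so a consumer would presumably need to convert them before passing them to a model call. How the project actually does this is not shown in the diff; the helper below is a hypothetical sketch of the conversion only.

```python
from typing import Any, Dict


def parse_inference_params(section: Dict[str, Any]) -> Dict[str, Any]:
    """Convert string-typed sampling settings to numbers (hypothetical helper).

    `section` is assumed to be one of the `with_ground_truth` or
    `without_ground_truth` mappings after the YAML has been loaded.
    """
    return {
        "model_id": section["model_id"],
        # float() and int() accept both quoted strings and native numbers,
        # so the helper tolerates configs that drop the quotes.
        "temperature": float(section["temperature"]),
        "top_p": float(section["top_p"]),
        "max_tokens": int(section["max_tokens"]),
    }


params = parse_inference_params(
    {
        "model_id": "us.amazon.nova-pro-v1:0",
        "top_p": "0.1",
        "temperature": "1.0",
        "max_tokens": "10000",
    }
)
```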
