Commit 0820287

Author: Bob Strahan
Merge branch 'develop' into feature/s3-vectorstore
2 parents 7470af2 + 109db50 commit 0820287

30 files changed: +1417 -1419 lines changed

config_library/pattern-1/lending-package-sample/config.yaml

Lines changed: 102 additions & 0 deletions
@@ -215,3 +215,105 @@ pricing:
         price: '1.5E-6'
       - name: cacheWriteInputTokens
         price: '1.875E-5'
+discovery:
+  output_format:
+    sample_json: |-
+      {
+        "document_class" : "Form-1040",
+        "document_description" : "Brief summary of the document",
+        "groups" : [
+          {
+            "name" : "PersonalInformation",
+            "description" : "Personal information of Tax payer",
+            "attributeType" : "group",
+            "groupAttributes" : [
+              {
+                "name": "FirstName",
+                "dataType" : "string",
+                "description" : "First Name of Taxpayer"
+              },
+              {
+                "name": "Age",
+                "dataType" : "number",
+                "description" : "Age of Taxpayer"
+              }
+            ]
+          },
+          {
+            "name" : "Dependents",
+            "description" : "Dependents of taxpayer",
+            "attributeType" : "list",
+            "listItemTemplate": {
+              "itemAttributes" : [
+                {
+                  "name": "FirstName",
+                  "dataType" : "string",
+                  "description" : "Dependent first name"
+                },
+                {
+                  "name": "Age",
+                  "dataType" : "number",
+                  "description" : "Dependent Age"
+                }
+              ]
+            }
+          }
+        ]
+      }
+  with_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
+      <GROUND_TRUTH_REFERENCE>
+      {ground_truth_json}
+      </GROUND_TRUTH_REFERENCE>
+      Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
+      Image may contain multiple pages, process all pages.
+      Extract all field names including those without values.
+      Do not change the group name and field name from ground truth in the extracted data json.
+      Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Use provided ground truth data as reference to optimize field
+      extraction and ensure consistency with expected document structure and
+      field definitions.
+    max_tokens: '10000'
+  without_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains forms data. Analyze the form line by line.
+      Image may contains multiple pages, process all the pages.
+      Form may contain multiple name value pair in one line.
+      Extract all the names in the form including the name value pair which doesn't have value.
+      Organize them into groups, extract field_name, data_type and field description
+      Field_name should be less than 60 characters, should not have space use '-' instead of space.
+      field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form.
+      Field_name should be unique within the group.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+
+      Group the fields based on the section they are grouped in the form. Group should have attributeType as "group".
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Return the extracted data in JSON format.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Analyze forms line by line to identify field names, data types,
+      and organizational structure. Focus on creating comprehensive blueprints
+      for document processing without extracting actual values.
+    max_tokens: '10000'
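
Note on the `sample_json` shape: the sample describes a list of `groups`, where a group with `attributeType` of `group` keeps its fields under `groupAttributes` and a group with `attributeType` of `list` keeps them under `listItemTemplate.itemAttributes`. A minimal Python sketch of walking a response in that shape might look like the following. The helper name `flatten_discovery_fields` is hypothetical and not part of this commit, and the sketch assumes the model output actually matches the sample.

```python
import json

# Copy of the sample_json structure from the discovery config.
SAMPLE = """
{
  "document_class": "Form-1040",
  "document_description": "Brief summary of the document",
  "groups": [
    {
      "name": "PersonalInformation",
      "description": "Personal information of Tax payer",
      "attributeType": "group",
      "groupAttributes": [
        {"name": "FirstName", "dataType": "string", "description": "First Name of Taxpayer"},
        {"name": "Age", "dataType": "number", "description": "Age of Taxpayer"}
      ]
    },
    {
      "name": "Dependents",
      "description": "Dependents of taxpayer",
      "attributeType": "list",
      "listItemTemplate": {
        "itemAttributes": [
          {"name": "FirstName", "dataType": "string", "description": "Dependent first name"},
          {"name": "Age", "dataType": "number", "description": "Dependent Age"}
        ]
      }
    }
  ]
}
"""


def flatten_discovery_fields(doc: dict) -> list[tuple[str, str, str]]:
    """Return (group name, field name, data type) for every field (hypothetical helper)."""
    rows = []
    for group in doc.get("groups", []):
        if group.get("attributeType") == "list":
            # Repeating groups nest their fields one level deeper.
            fields = group.get("listItemTemplate", {}).get("itemAttributes", [])
        else:
            fields = group.get("groupAttributes", [])
        for field in fields:
            rows.append((group["name"], field["name"], field["dataType"]))
    return rows


fields = flatten_discovery_fields(json.loads(SAMPLE))
for row in fields:
    print(row)
```

Flattening both group kinds into one list of rows keeps later checks, such as the prompt's rule that field names be unique within a group, independent of the nesting.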

config_library/pattern-2/bank-statement-sample/config.yaml

Lines changed: 103 additions & 1 deletion
@@ -371,7 +371,7 @@ summarization:
 
 assessment:
   enabled: true
-  validation_enabled: true
+  validation_enabled: false
   image:
     target_height: ''
     target_width: ''
@@ -692,3 +692,105 @@ pricing:
         price: '1.5E-6'
       - name: cacheWriteInputTokens
         price: '1.875E-5'
+discovery:
+  output_format:
+    sample_json: |-
+      {
+        "document_class" : "Form-1040",
+        "document_description" : "Brief summary of the document",
+        "groups" : [
+          {
+            "name" : "PersonalInformation",
+            "description" : "Personal information of Tax payer",
+            "attributeType" : "group",
+            "groupAttributes" : [
+              {
+                "name": "FirstName",
+                "dataType" : "string",
+                "description" : "First Name of Taxpayer"
+              },
+              {
+                "name": "Age",
+                "dataType" : "number",
+                "description" : "Age of Taxpayer"
+              }
+            ]
+          },
+          {
+            "name" : "Dependents",
+            "description" : "Dependents of taxpayer",
+            "attributeType" : "list",
+            "listItemTemplate": {
+              "itemAttributes" : [
+                {
+                  "name": "FirstName",
+                  "dataType" : "string",
+                  "description" : "Dependent first name"
+                },
+                {
+                  "name": "Age",
+                  "dataType" : "number",
+                  "description" : "Dependent Age"
+                }
+              ]
+            }
+          }
+        ]
+      }
+  with_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
+      <GROUND_TRUTH_REFERENCE>
+      {ground_truth_json}
+      </GROUND_TRUTH_REFERENCE>
+      Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
+      Image may contain multiple pages, process all pages.
+      Extract all field names including those without values.
+      Do not change the group name and field name from ground truth in the extracted data json.
+      Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Use provided ground truth data as reference to optimize field
+      extraction and ensure consistency with expected document structure and
+      field definitions.
+    max_tokens: '10000'
+  without_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains forms data. Analyze the form line by line.
+      Image may contains multiple pages, process all the pages.
+      Form may contain multiple name value pair in one line.
+      Extract all the names in the form including the name value pair which doesn't have value.
+      Organize them into groups, extract field_name, data_type and field description
+      Field_name should be less than 60 characters, should not have space use '-' instead of space.
+      field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form.
+      Field_name should be unique within the group.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+
+      Group the fields based on the section they are grouped in the form. Group should have attributeType as "group".
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Return the extracted data in JSON format.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Analyze forms line by line to identify field names, data types,
+      and organizational structure. Focus on creating comprehensive blueprints
+      for document processing without extracting actual values.
+    max_tokens: '10000'
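
Note on the `{ground_truth_json}` placeholder: the `with_ground_truth` prompt embeds the reference JSON between `<GROUND_TRUTH_REFERENCE>` tags and ends with "using the below JSON format:", which suggests the `sample_json` text is appended afterwards. The diff does not show the code that does this, so the following Python sketch is only an assumption about how the pieces could be combined. `build_prompt` and the shortened template are hypothetical.

```python
import json

# Shortened stand-in for the with_ground_truth user_prompt in the config.
USER_PROMPT = (
    "Analyze the data line by line using the provided ground truth as reference.\n"
    "<GROUND_TRUTH_REFERENCE>\n"
    "{ground_truth_json}\n"
    "</GROUND_TRUTH_REFERENCE>\n"
    "Format the extracted groups and fields using the below JSON format:\n"
)


def build_prompt(template: str, ground_truth: dict, sample_json: str) -> str:
    """Fill the placeholder and append the sample output format (hypothetical helper)."""
    # str.replace is used instead of str.format so that literal braces
    # elsewhere in the template are never treated as format fields.
    filled = template.replace("{ground_truth_json}", json.dumps(ground_truth, indent=2))
    # Assumption: the sample JSON follows the final prompt line.
    return filled + sample_json


prompt = build_prompt(USER_PROMPT, {"FirstName": "Jane"}, '{"document_class": "W4"}')
print(prompt)
```

Plain string replacement is a deliberately conservative choice here, since both the ground truth and the sample format are JSON and therefore full of braces.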

config_library/pattern-2/criteria-validation/config.yaml

Lines changed: 103 additions & 1 deletion
@@ -4,7 +4,7 @@
 notes: Criteria validation configuration for healthcare/insurance prior authorization
 assessment:
   enabled: true
-  validation_enabled: true
+  validation_enabled: false
 criteria_validation:
   model: us.anthropic.claude-3-5-sonnet-20240620-v1:0
   temperature: 0.0
@@ -212,3 +212,105 @@ pricing:
         price: 0.0000032
       - name: cacheReadInputTokens
         price: 0.0000002
+discovery:
+  output_format:
+    sample_json: |-
+      {
+        "document_class" : "Form-1040",
+        "document_description" : "Brief summary of the document",
+        "groups" : [
+          {
+            "name" : "PersonalInformation",
+            "description" : "Personal information of Tax payer",
+            "attributeType" : "group",
+            "groupAttributes" : [
+              {
+                "name": "FirstName",
+                "dataType" : "string",
+                "description" : "First Name of Taxpayer"
+              },
+              {
+                "name": "Age",
+                "dataType" : "number",
+                "description" : "Age of Taxpayer"
+              }
+            ]
+          },
+          {
+            "name" : "Dependents",
+            "description" : "Dependents of taxpayer",
+            "attributeType" : "list",
+            "listItemTemplate": {
+              "itemAttributes" : [
+                {
+                  "name": "FirstName",
+                  "dataType" : "string",
+                  "description" : "Dependent first name"
+                },
+                {
+                  "name": "Age",
+                  "dataType" : "number",
+                  "description" : "Dependent Age"
+                }
+              ]
+            }
+          }
+        ]
+      }
+  with_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference.
+      <GROUND_TRUTH_REFERENCE>
+      {ground_truth_json}
+      </GROUND_TRUTH_REFERENCE>
+      Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference.
+      Image may contain multiple pages, process all pages.
+      Extract all field names including those without values.
+      Do not change the group name and field name from ground truth in the extracted data json.
+      Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Use provided ground truth data as reference to optimize field
+      extraction and ensure consistency with expected document structure and
+      field definitions.
+    max_tokens: '10000'
+  without_ground_truth:
+    top_p: '0.1'
+    temperature: '1.0'
+    user_prompt: >-
+      This image contains forms data. Analyze the form line by line.
+      Image may contains multiple pages, process all the pages.
+      Form may contain multiple name value pair in one line.
+      Extract all the names in the form including the name value pair which doesn't have value.
+      Organize them into groups, extract field_name, data_type and field description
+      Field_name should be less than 60 characters, should not have space use '-' instead of space.
+      field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form.
+      Field_name should be unique within the group.
+      Add two fields document_class and document_description.
+      For document_class generate a short name based on the document content like W4, I-9, Paystub.
+      For document_description generate a description about the document in less than 50 words.
+
+      Group the fields based on the section they are grouped in the form. Group should have attributeType as "group".
+      If the group repeats and follows table format, update the attributeType as "list".
+      Do not extract the values.
+      Return the extracted data in JSON format.
+      Format the extracted data using the below JSON format:
+      Format the extracted groups and fields using the below JSON format:
+    model_id: us.amazon.nova-pro-v1:0
+    system_prompt: >-
+      You are an expert in processing forms. Extracting data from images and
+      documents. Analyze forms line by line to identify field names, data types,
+      and organizational structure. Focus on creating comprehensive blueprints
+      for document processing without extracting actual values.
+    max_tokens: '10000'
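
Note on the quoted parameters: `top_p`, `temperature`, and `max_tokens` are written as quoted YAML strings (`'0.1'`, `'1.0'`, `'10000'`) in both discovery blocks, so a consumer would presumably need to convert them before passing them to a model call. The Python sketch below shows one way that conversion could be done. `to_inference_params` is a hypothetical helper, not code from this commit.

```python
# Values copied from the discovery blocks in the config, where they are
# quoted YAML strings rather than numbers.
DISCOVERY_SETTINGS = {
    "model_id": "us.amazon.nova-pro-v1:0",
    "top_p": "0.1",
    "temperature": "1.0",
    "max_tokens": "10000",
}


def to_inference_params(settings: dict) -> dict:
    """Convert string-typed settings into numeric values (hypothetical helper)."""
    return {
        "top_p": float(settings["top_p"]),
        "temperature": float(settings["temperature"]),
        "max_tokens": int(settings["max_tokens"]),
    }


params = to_inference_params(DISCOVERY_SETTINGS)
print(params)
```

Converting at a single point means a malformed value such as an empty string raises a `ValueError` early instead of reaching the model request.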
