Commit 553155a

committed
added discovery config to pattern2 and pattern3
1 parent 8e191b1 commit 553155a

2 files changed

+216
-0
lines changed

patterns/pattern-2/template.yaml

Lines changed: 108 additions & 0 deletions
@@ -815,6 +815,114 @@ Resources:
815815
format: textarea
816816
description: Task prompt for LLM evaluation - supports placeholders {DOCUMENT_CLASS}, {ATTRIBUTE_NAME}, {ATTRIBUTE_DESCRIPTION}, {EXPECTED_VALUE} and {ACTUAL_VALUE}
817817
order: 7
818+
discovery:
819+
order: 5
820+
type: object
821+
sectionLabel: Discovery Configuration
822+
description: Configuration for document class discovery functionality
823+
properties:
824+
without_ground_truth:
825+
order: 0
826+
type: object
827+
sectionLabel: Discovery Without Ground Truth
828+
description: Configuration for discovering document classes without reference data
829+
properties:
830+
model_id:
831+
type: string
832+
description: Bedrock model ID for discovery without ground truth
833+
enum: ["us.amazon.nova-lite-v1:0", "us.amazon.nova-pro-v1:0", "us.amazon.nova-premier-v1:0", "us.anthropic.claude-3-haiku-20240307-v1:0", "us.anthropic.claude-3-5-sonnet-20241022-v2:0", "us.anthropic.claude-3-7-sonnet-20250219-v1:0"]
834+
default: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
835+
order: 0
836+
temperature:
837+
type: number
838+
description: Temperature parameter for model creativity (0.0-1.0)
839+
minimum: 0.0
840+
maximum: 1.0
841+
default: 1.0
842+
order: 1
843+
top_p:
844+
type: number
845+
description: Top-p parameter for nucleus sampling (0.0-1.0)
846+
minimum: 0.0
847+
maximum: 1.0
848+
default: 0.1
849+
order: 2
850+
max_tokens:
851+
type: number
852+
description: Maximum number of tokens to generate
853+
minimum: 1000
854+
maximum: 20000
855+
default: 10000
856+
order: 3
857+
system_prompt:
858+
type: string
859+
format: textarea
860+
description: System prompt for the discovery model
861+
default: "You are an expert in processing forms. Extracting data from images and documents. Analyze forms line by line to identify field names, data types, and organizational structure. Focus on creating comprehensive blueprints for document processing without extracting actual values."
862+
order: 4
863+
user_prompt:
864+
type: string
865+
format: textarea
866+
description: User prompt template for discovery without ground truth
867+
default: "This image contains forms data. Analyze the form line by line. Image may contains multiple pages, process all the pages. Form may contain multiple name value pair in one line. Extract all the names in the form including the name value pair which doesn't have value. Organize them into groups, extract field_name, data_type and field description. Field_name should be less than 60 characters, should not have space use '-' instead of space. field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form. Field_name should be unique within the group. Add two fields document_class and document_description. For document_class generate a short name based on the document content like W4, I-9, Paystub. For document_description generate a description about the document in less than 50 words. Group the fields based on the section they are grouped in the form. Group should have attributeType as \"group\". If the group repeats, add an additional field groupType and set the value as \"Table\". Do not extract the values. Return the extracted data in JSON format."
868+
order: 5
869+
with_ground_truth:
870+
order: 1
871+
type: object
872+
sectionLabel: Discovery With Ground Truth
873+
description: Configuration for discovering document classes using reference data
874+
properties:
875+
model_id:
876+
type: string
877+
description: Bedrock model ID for discovery with ground truth
878+
enum: ["us.amazon.nova-lite-v1:0", "us.amazon.nova-pro-v1:0", "us.amazon.nova-premier-v1:0", "us.anthropic.claude-3-haiku-20240307-v1:0", "us.anthropic.claude-3-5-sonnet-20241022-v2:0", "us.anthropic.claude-3-7-sonnet-20250219-v1:0"]
879+
default: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
880+
order: 0
881+
temperature:
882+
type: number
883+
description: Temperature parameter for model creativity (0.0-1.0)
884+
minimum: 0.0
885+
maximum: 1.0
886+
default: 1.0
887+
order: 1
888+
top_p:
889+
type: number
890+
description: Top-p parameter for nucleus sampling (0.0-1.0)
891+
minimum: 0.0
892+
maximum: 1.0
893+
default: 0.1
894+
order: 2
895+
max_tokens:
896+
type: number
897+
description: Maximum number of tokens to generate
898+
minimum: 1000
899+
maximum: 20000
900+
default: 10000
901+
order: 3
902+
system_prompt:
903+
type: string
904+
format: textarea
905+
description: System prompt for the discovery model with ground truth
906+
default: "You are an expert in processing forms. Extracting data from images and documents. Use provided ground truth data as reference to optimize field extraction and ensure consistency with expected document structure and field definitions."
907+
order: 4
908+
user_prompt:
909+
type: string
910+
format: textarea
911+
description: User prompt template for discovery with ground truth (use {ground_truth_json} placeholder)
912+
default: "This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference. <GROUND_TRUTH_REFERENCE>{ground_truth_json}</GROUND_TRUTH_REFERENCE> Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference. Image may contain multiple pages, process all pages. Extract all field names including those without values. Do not change the group name and field name from ground truth in the extracted data json. Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field. Add two fields document_class and document_description. For document_class generate a short name based on the document content like W4, I-9, Paystub. For document_description generate a description about the document in less than 50 words. If the group repeats and follows table format, add a special field group_type with value \"Table\" and description field for the group. Do not extract the values."
913+
order: 5
914+
output_format:
915+
order: 2
916+
type: object
917+
sectionLabel: Output Format Configuration
918+
description: Configuration for discovery output format
919+
properties:
920+
sample_json:
921+
type: string
922+
format: textarea
923+
description: Sample JSON format for discovery output
924+
default: "{\n \"document_class\": \"Form-1040\",\n \"document_description\": \"Brief summary of the document\",\n \"groups\": [\n {\n \"name\": \"PersonalInformation\",\n \"description\": \"Personal information of Tax payer\",\n \"attributeType\": \"group\",\n \"groupType\": \"normal\",\n \"groupAttributes\": [\n {\n \"name\": \"FirstName\",\n \"dataType\": \"string\",\n \"description\": \"First Name of Taxpayer\"\n },\n {\n \"name\": \"Age\",\n \"dataType\": \"number\",\n \"description\": \"Age of Taxpayer\"\n }\n ]\n }\n ]\n}"
925+
order: 0
818926
pricing:
819927
order: 8
820928
type: array
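
The `model_id`, `temperature`, `top_p`, and `max_tokens` properties added above each carry an enum or a minimum/maximum plus a default. A minimal sketch of how a caller might check a discovery section against those constraints follows. The bounds, enum values, and defaults are copied from the diff; the function name `check_discovery_section` and the idea of merging user values over defaults are assumptions for illustration, not code from this repository.

```python
# Constraints copied from the schema in this commit.
# The checker itself is hypothetical and not part of the repository.
ALLOWED_MODEL_IDS = {
    "us.amazon.nova-lite-v1:0",
    "us.amazon.nova-pro-v1:0",
    "us.amazon.nova-premier-v1:0",
    "us.anthropic.claude-3-haiku-20240307-v1:0",
    "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
}
NUMERIC_BOUNDS = {
    "temperature": (0.0, 1.0),
    "top_p": (0.0, 1.0),
    "max_tokens": (1000, 20000),
}
DEFAULTS = {
    "model_id": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    "temperature": 1.0,
    "top_p": 0.1,
    "max_tokens": 10000,
}


def check_discovery_section(section):
    """Return a list of problems; an empty list means the values fit the schema."""
    # Assumption: values missing from the section fall back to the schema defaults.
    merged = {**DEFAULTS, **section}
    problems = []
    if merged["model_id"] not in ALLOWED_MODEL_IDS:
        problems.append(f"model_id {merged['model_id']!r} is not in the enum")
    for name, (low, high) in NUMERIC_BOUNDS.items():
        value = merged[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name} must be a number")
        elif not low <= value <= high:
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    return problems


if __name__ == "__main__":
    print(check_discovery_section({}))
    print(check_discovery_section({"temperature": 1.5}))
```

The same checker would apply to both `without_ground_truth` and `with_ground_truth`, since the two sections declare identical constraints for these four properties.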

patterns/pattern-3/template.yaml

Lines changed: 108 additions & 0 deletions
@@ -724,6 +724,114 @@ Resources:
724724
format: textarea
725725
description: Task prompt for LLM evaluation - supports placeholders {DOCUMENT_CLASS}, {ATTRIBUTE_NAME}, {ATTRIBUTE_DESCRIPTION}, {EXPECTED_VALUE} and {ACTUAL_VALUE}
726726
order: 7
727+
discovery:
728+
order: 5
729+
type: object
730+
sectionLabel: Discovery Configuration
731+
description: Configuration for document class discovery functionality
732+
properties:
733+
without_ground_truth:
734+
order: 0
735+
type: object
736+
sectionLabel: Discovery Without Ground Truth
737+
description: Configuration for discovering document classes without reference data
738+
properties:
739+
model_id:
740+
type: string
741+
description: Bedrock model ID for discovery without ground truth
742+
enum: ["us.amazon.nova-lite-v1:0", "us.amazon.nova-pro-v1:0", "us.amazon.nova-premier-v1:0", "us.anthropic.claude-3-haiku-20240307-v1:0", "us.anthropic.claude-3-5-sonnet-20241022-v2:0", "us.anthropic.claude-3-7-sonnet-20250219-v1:0"]
743+
default: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
744+
order: 0
745+
temperature:
746+
type: number
747+
description: Temperature parameter for model creativity (0.0-1.0)
748+
minimum: 0.0
749+
maximum: 1.0
750+
default: 1.0
751+
order: 1
752+
top_p:
753+
type: number
754+
description: Top-p parameter for nucleus sampling (0.0-1.0)
755+
minimum: 0.0
756+
maximum: 1.0
757+
default: 0.1
758+
order: 2
759+
max_tokens:
760+
type: number
761+
description: Maximum number of tokens to generate
762+
minimum: 1000
763+
maximum: 20000
764+
default: 10000
765+
order: 3
766+
system_prompt:
767+
type: string
768+
format: textarea
769+
description: System prompt for the discovery model
770+
default: "You are an expert in processing forms. Extracting data from images and documents. Analyze forms line by line to identify field names, data types, and organizational structure. Focus on creating comprehensive blueprints for document processing without extracting actual values."
771+
order: 4
772+
user_prompt:
773+
type: string
774+
format: textarea
775+
description: User prompt template for discovery without ground truth
776+
default: "This image contains forms data. Analyze the form line by line. Image may contains multiple pages, process all the pages. Form may contain multiple name value pair in one line. Extract all the names in the form including the name value pair which doesn't have value. Organize them into groups, extract field_name, data_type and field description. Field_name should be less than 60 characters, should not have space use '-' instead of space. field_description is a brief description of the field and the location of the field like box number or line number in the form and section of the form. Field_name should be unique within the group. Add two fields document_class and document_description. For document_class generate a short name based on the document content like W4, I-9, Paystub. For document_description generate a description about the document in less than 50 words. Group the fields based on the section they are grouped in the form. Group should have attributeType as \"group\". If the group repeats, add an additional field groupType and set the value as \"Table\". Do not extract the values. Return the extracted data in JSON format."
777+
order: 5
778+
with_ground_truth:
779+
order: 1
780+
type: object
781+
sectionLabel: Discovery With Ground Truth
782+
description: Configuration for discovering document classes using reference data
783+
properties:
784+
model_id:
785+
type: string
786+
description: Bedrock model ID for discovery with ground truth
787+
enum: ["us.amazon.nova-lite-v1:0", "us.amazon.nova-pro-v1:0", "us.amazon.nova-premier-v1:0", "us.anthropic.claude-3-haiku-20240307-v1:0", "us.anthropic.claude-3-5-sonnet-20241022-v2:0", "us.anthropic.claude-3-7-sonnet-20250219-v1:0"]
788+
default: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
789+
order: 0
790+
temperature:
791+
type: number
792+
description: Temperature parameter for model creativity (0.0-1.0)
793+
minimum: 0.0
794+
maximum: 1.0
795+
default: 1.0
796+
order: 1
797+
top_p:
798+
type: number
799+
description: Top-p parameter for nucleus sampling (0.0-1.0)
800+
minimum: 0.0
801+
maximum: 1.0
802+
default: 0.1
803+
order: 2
804+
max_tokens:
805+
type: number
806+
description: Maximum number of tokens to generate
807+
minimum: 1000
808+
maximum: 20000
809+
default: 10000
810+
order: 3
811+
system_prompt:
812+
type: string
813+
format: textarea
814+
description: System prompt for the discovery model with ground truth
815+
default: "You are an expert in processing forms. Extracting data from images and documents. Use provided ground truth data as reference to optimize field extraction and ensure consistency with expected document structure and field definitions."
816+
order: 4
817+
user_prompt:
818+
type: string
819+
format: textarea
820+
description: User prompt template for discovery with ground truth (use {ground_truth_json} placeholder)
821+
default: "This image contains unstructured data. Analyze the data line by line using the provided ground truth as reference. <GROUND_TRUTH_REFERENCE>{ground_truth_json}</GROUND_TRUTH_REFERENCE> Ground truth reference JSON has the fields we are interested in extracting from the document/image. Use the ground truth to optimize field extraction. Match field names, data types, and groupings from the reference. Image may contain multiple pages, process all pages. Extract all field names including those without values. Do not change the group name and field name from ground truth in the extracted data json. Add field_description field for every field which will contain instruction to LLM to extract the field data from the image/document. Add data_type field for every field. Add two fields document_class and document_description. For document_class generate a short name based on the document content like W4, I-9, Paystub. For document_description generate a description about the document in less than 50 words. If the group repeats and follows table format, add a special field group_type with value \"Table\" and description field for the group. Do not extract the values."
822+
order: 5
823+
output_format:
824+
order: 2
825+
type: object
826+
sectionLabel: Output Format Configuration
827+
description: Configuration for discovery output format
828+
properties:
829+
sample_json:
830+
type: string
831+
format: textarea
832+
description: Sample JSON format for discovery output
833+
default: "{\n \"document_class\": \"Form-1040\",\n \"document_description\": \"Brief summary of the document\",\n \"groups\": [\n {\n \"name\": \"PersonalInformation\",\n \"description\": \"Personal information of Tax payer\",\n \"attributeType\": \"group\",\n \"groupType\": \"normal\",\n \"groupAttributes\": [\n {\n \"name\": \"FirstName\",\n \"dataType\": \"string\",\n \"description\": \"First Name of Taxpayer\"\n },\n {\n \"name\": \"Age\",\n \"dataType\": \"number\",\n \"description\": \"Age of Taxpayer\"\n }\n ]\n }\n ]\n}"
834+
order: 0
727835
pricing:
728836
order: 8
729837
type: array
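
The `with_ground_truth` user prompt carries a `{ground_truth_json}` placeholder, and `output_format.sample_json` shows the expected shape of a discovery result. The sketch below illustrates one way these two pieces could be used: substituting the placeholder and reading group and attribute names from output shaped like the sample. Both helper names (`render_user_prompt`, `list_group_attributes`) are hypothetical, and plain string replacement is an assumption; the commit does not show how the template is actually rendered.

```python
import json

# Placeholder name taken from the user_prompt description in the diff.
PLACEHOLDER = "{ground_truth_json}"


def render_user_prompt(template, ground_truth):
    """Substitute the ground-truth JSON into the prompt template.

    Assumption: plain string replacement, which avoids str.format() errors
    if a customized prompt contains other literal braces.
    """
    if PLACEHOLDER not in template:
        raise ValueError(f"template is missing the {PLACEHOLDER} placeholder")
    return template.replace(PLACEHOLDER, json.dumps(ground_truth, indent=2))


def list_group_attributes(discovery_output):
    """Map each group name to its attribute names, using the sample_json keys."""
    data = json.loads(discovery_output)
    return {
        group["name"]: [attr["name"] for attr in group.get("groupAttributes", [])]
        for group in data.get("groups", [])
    }


if __name__ == "__main__":
    template = (
        "<GROUND_TRUTH_REFERENCE>{ground_truth_json}</GROUND_TRUTH_REFERENCE> "
        "Do not extract the values."
    )
    print(render_user_prompt(template, {"FirstName": "string"}))

    # Abbreviated copy of the sample_json default from the diff.
    sample = json.dumps({
        "document_class": "Form-1040",
        "groups": [{
            "name": "PersonalInformation",
            "attributeType": "group",
            "groupType": "normal",
            "groupAttributes": [
                {"name": "FirstName", "dataType": "string"},
                {"name": "Age", "dataType": "number"},
            ],
        }],
    })
    print(list_group_attributes(sample))
```

The sample JSON uses the key `groupType`, while the `with_ground_truth` prompt asks the model for `group_type`; the sketch reads only keys that appear in the sample, so a consumer may need to handle both spellings.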
