Commit 6cc765b

Author: Bob Strahan
Merge branch 'develop' into feature/modify-class-rerun-extraction
2 parents: 4d3af79 + dffe70a

File tree

37 files changed: +2192 / -300 lines


.gitlab-ci.yml

Lines changed: 20 additions & 0 deletions
@@ -16,6 +16,7 @@ image: public.ecr.aws/docker/library/python:3.13-bookworm
 
 stages:
   - developer_tests
+  - deployment_validation
   - integration_tests
 
 developer_tests:
@@ -93,4 +94,23 @@ integration_tests:
     - poetry install
     - make put
     - make wait
+
+deployment_validation:
+  stage: deployment_validation
+  rules:
+    - when: always
+
+  before_script:
+    - apt-get update -y
+    - apt-get install curl unzip python3-pip -y
+    # Install AWS CLI
+    - curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
+    - unzip awscliv2.zip
+    - ./aws/install
+    # Install PyYAML for template analysis
+    - pip install PyYAML
+
+  script:
+    # Check if service role has sufficient permissions for main stack deployment
+    - python3 scripts/validate_service_role_permissions.py
 
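The referenced `scripts/validate_service_role_permissions.py` is not included in this diff. As a rough illustration of how a PyYAML-based check of the service-role template could work (the required-action set, helper names, and overall logic below are assumptions, not the script's actual implementation):

```python
"""Illustrative only; not the actual scripts/validate_service_role_permissions.py."""
import sys

import yaml


class CfnLoader(yaml.SafeLoader):
    """SafeLoader that tolerates CloudFormation intrinsic tags such as !Sub and !Ref."""


def _intrinsic(loader, tag_suffix, node):
    # Return the raw scalar/sequence/mapping so intrinsic functions don't break parsing.
    if isinstance(node, yaml.ScalarNode):
        return loader.construct_scalar(node)
    if isinstance(node, yaml.SequenceNode):
        return loader.construct_sequence(node)
    return loader.construct_mapping(node)


yaml.add_multi_constructor("!", _intrinsic, Loader=CfnLoader)

# Hypothetical minimum set of action prefixes the service role should allow.
REQUIRED_PREFIXES = {"cloudformation:", "iam:", "lambda:", "s3:", "dynamodb:"}


def allowed_action_prefixes(role_template_path: str) -> set:
    """Collect the IAM action prefixes granted by the service-role template's policies."""
    with open(role_template_path) as f:
        template = yaml.load(f, Loader=CfnLoader)
    prefixes = set()
    for resource in template.get("Resources", {}).values():
        policy = resource.get("Properties", {}).get("PolicyDocument", {})
        for statement in policy.get("Statement", []):
            actions = statement.get("Action", [])
            actions = [actions] if isinstance(actions, str) else actions
            for action in actions:
                prefixes.add(action.split(":")[0] + ":")
    return prefixes


if __name__ == "__main__":
    granted = allowed_action_prefixes(sys.argv[1])
    missing = sorted(REQUIRED_PREFIXES - granted)
    for prefix in missing:
        print(f"Service role is missing '{prefix}*' permissions")
    sys.exit(1 if missing else 0)
```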

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -6,6 +6,16 @@ SPDX-License-Identifier: MIT-0
 ## [Unreleased]
 
 ### Added
+- **Analytics Agent Schema Optimization for Improved Performance**
+  - **Embedded Database Overview**: Complete table listing and guidance embedded directly in system prompt (no tool call needed)
+  - **On-Demand Detailed Schemas**: `get_table_info(['specific_tables'])` loads detailed column information only for tables actually needed by the query
+  - **Significant Performance Gains**: Eliminates redundant tool calls on every query while maintaining token efficiency
+  - **Enhanced SQL Guidance**: Comprehensive Athena/Trino function reference with explicit PostgreSQL operator warnings to prevent common query failures like `~` regex operator mistakes
+  - **Faster Time-to-Query**: Agent has immediate access to table overview and can proceed directly to detailed schema loading for relevant tables
+
+### Fixed
+- Fix missing data in Glue tables when using a document class that contains a dash (-).
+
 
 ### Fixed
 - **Edit Sections Mode Performance and Architecture Optimizations**

docs/deployment.md

Lines changed: 4 additions & 3 deletions
@@ -36,6 +36,7 @@ You need to have the following packages installed on your computer:
 4. python 3.11 or later
 5. A local Docker daemon
 6. Python packages for publish.py: `pip install boto3 rich typer PyYAML botocore setuptools`
+7. **Node.js 18+** and **npm** (required for UI validation in publish script)
 
 For guidance on setting up a development environment, see:
 - [Development Environment Setup Guide on Linux](./setup-development-env-linux.md)
@@ -136,12 +137,12 @@ aws cloudformation update-stack \
 
 
 **Pattern Parameter Options:**
-* `Pattern1` - Packet or Media processing with Bedrock Data Automation (BDA)
+* `Pattern1 - Packet or Media processing with Bedrock Data Automation (BDA)`
   * Can use an existing BDA project or create a new demo project
-* `Pattern2` - Packet processing with Textract and Bedrock
+* `Pattern2 - Packet processing with Textract and Bedrock`
   * Supports both page-level and holistic classification
   * Recommended for first-time users
-* `Pattern3` - Packet processing with Textract, SageMaker(UDOP), and Bedrock
+* `Pattern3 - Packet processing with Textract, SageMaker(UDOP), and Bedrock`
   * Requires a UDOP model in S3 that will be deployed on SageMaker
 
 After deployment, check the Outputs tab in the CloudFormation console to find links to dashboards, buckets, workflows, and other solution resources.
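The corrected form places the whole descriptive string inside backticks, reflecting that the pattern parameter's allowed values appear to be the full strings rather than short `PatternN` labels. A hedged boto3 sketch of passing such a value (the parameter key `IDPPattern` and the stack name are assumptions; check the template for the real names):

```python
# Sketch only: passing the full descriptive Pattern string as a stack parameter.
import boto3

cfn = boto3.client("cloudformation")
cfn.update_stack(
    StackName="IDP",  # placeholder stack name
    UsePreviousTemplate=True,
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
    Parameters=[
        {
            "ParameterKey": "IDPPattern",  # assumed parameter name
            # The allowed value appears to be the whole string, not just "Pattern2".
            "ParameterValue": "Pattern2 - Packet processing with Textract and Bedrock",
        }
    ],
)
```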

docs/pattern-3.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ This pattern implements an intelligent document processing workflow that uses UD
 
 ## Fine tuning a UDOP model for classification
 
-See [Fine-Tuning Models on SageMaker](./fine-tune-sm-udop-classification/README.md)
+See [Fine-Tuning Models on SageMaker](../patterns/pattern-3/fine-tune-sm-udop-classification/README.md)
 
 Once you have trained the model, deploy the GenAIIDP stack for Pattern-3 using the path for your new fine tuned model.
 

iam-roles/cloudformation-management/IDP-Cloudformation-Service-Role.yaml

Lines changed: 2 additions & 2 deletions
@@ -13,7 +13,7 @@ Resources:
   CloudFormationServiceRole:
     Type: AWS::IAM::Role
     Properties:
-      RoleName: IDPAcceleratorCloudFormationServiceRole
+      RoleName: !Sub '${AWS::StackName}-CFServiceRole'
       AssumeRolePolicyDocument:
         Version: '2012-10-17'
         Statement:
@@ -109,7 +109,7 @@ Resources:
   PassRolePolicy:
     Type: AWS::IAM::ManagedPolicy
     Properties:
-      ManagedPolicyName: IDP-PassRolePolicy
+      ManagedPolicyName: !Sub '${AWS::StackName}-PassRolePolicy'
       Description: Policy to allow passing the IDP CloudFormation service role
       PolicyDocument:
         Version: '2012-10-17'
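Deriving the role and policy names from `${AWS::StackName}` lets multiple copies of this role stack coexist in one account. A minimal boto3 sketch of deploying it (the stack name below is a placeholder); explicitly named IAM resources still require the `CAPABILITY_NAMED_IAM` capability:

```python
# Sketch only: deploying the service-role template; the stack name is a placeholder.
import boto3

cfn = boto3.client("cloudformation")
with open("iam-roles/cloudformation-management/IDP-Cloudformation-Service-Role.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="idp-cfn-service-role-dev",  # role/policy names now derive from this
    TemplateBody=template_body,
    # Explicitly named IAM resources (RoleName/ManagedPolicyName) require this capability.
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
```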

lib/idp_common_pkg/idp_common/agents/analytics/agent.py

Lines changed: 187 additions & 24 deletions
@@ -11,11 +11,16 @@
 
 import boto3
 import strands
-from strands.models import BedrockModel
 
 from ..common.config import load_result_format_description
+from ..common.strands_bedrock_model import create_strands_bedrock_model
 from .config import load_python_plot_generation_examples
-from .tools import CodeInterpreterTools, get_database_info, run_athena_query
+from .schema_provider import get_database_overview as _get_database_overview
+from .tools import (
+    CodeInterpreterTools,
+    get_table_info,
+    run_athena_query,
+)
 from .utils import register_code_interpreter_tools
 
 logger = logging.getLogger(__name__)
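The new `schema_provider` module and the `get_table_info` tool imported above are not shown in this diff. The following is only a rough sketch of the shape those imports imply; the table metadata, function bodies, single-module layout, and the `@tool` decorator usage here are assumptions, not the package's actual implementation:

```python
# Rough sketch of the on-demand schema tooling implied by the imports above.
from typing import Dict, List

from strands import tool  # assumed decorator import, matching typical Strands usage

# Hypothetical static metadata; in practice this could be generated from Glue.
_TABLE_OVERVIEW = {
    "metering": "Per-document usage and cost records (tokens, pages, estimated_cost).",
    "evaluation": "Accuracy metrics such as precision and recall (may be empty).",
    "document_sections_w2": "Extracted fields for documents classified as W2.",
}

_TABLE_DETAILS: Dict[str, str] = {
    "metering": 'Columns: "document_id", "context", "service_api", "estimated_cost", "date", ...',
    # one detailed schema entry per table
}


def get_database_overview() -> str:
    """Return the table listing embedded in the system prompt at agent creation."""
    return "\n".join(f"- {name}: {purpose}" for name, purpose in _TABLE_OVERVIEW.items())


@tool
def get_table_info(table_names: List[str]) -> str:
    """Return detailed column information only for the requested tables."""
    parts = []
    for name in table_names:
        parts.append(f"## {name}\n{_TABLE_DETAILS.get(name, 'No detailed schema available.')}")
    return "\n\n".join(parts)
```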
@@ -43,36 +48,192 @@ def create_analytics_agent(
     # Load python code examples
     python_plot_generation_examples = load_python_plot_generation_examples()
 
+    # Load database overview once during agent creation for embedding in system prompt
+    database_overview = _get_database_overview()
+
     # Define the system prompt for the analytics agent
     system_prompt = f"""
 You are an AI agent that converts natural language questions into Athena queries, executes those queries, and writes python code to convert the query results into json representing either a plot, a table, or a string.
 
 # Task
 Your task is to:
 1. Understand the user's question
-2. Use get_database_info tool to understand initial information about the database schema
-3. Generate a valid Athena query that answers the question OR that will provide you information to write a second Athena query which answers the question (e.g. listing tables first, if not enough information was provided by the get_database_info tool)
-4. Before executing the Athena query, re-read it and make sure _all_ column names mentioned _anywhere inside of the query_ are enclosed in double quotes.
-5. Execute your revised query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the athena results directly. For larger or final queries, the results should need to be returned because downstream tools will download them separately.
-6. Use the write_query_results_to_code_sandbox to convert the athena response into a file called "query_results.csv" in the same environment future python scripts will be executed.
-7. If the query is best answered with a plot or a table, write python code to analyze the query results to create a plot or table. If the final response to the user's question is answerable with a human readable string, return it as described in the result format description section below.
-8. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.
+2. **EFFICIENT APPROACH**: Review the database overview below to see available tables and their purposes
+3. Apply the Question-to-Table mapping rules below to select the correct tables for your query
+4. Use get_table_info(['table1', 'table2']) to get detailed schemas ONLY for the tables you need
+5. Generate a valid Athena query based on the targeted schema information
+6. **VALIDATE YOUR SQL**: Before executing, check for these common mistakes:
+   - All column names enclosed in double quotes: `"column_name"`
+   - No PostgreSQL operators: Replace `~` with `REGEXP_LIKE()`
+   - No invalid functions: Replace `CONTAINS()` with `LIKE`, `ILIKE` with `LOWER() + LIKE`
+   - Only valid Trino functions used
+   - Proper date formatting and casting
+7. Execute your validated query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the athena results directly. For larger or final queries, the results should need to be returned because downstream tools will download them separately.
+8. Use the write_query_results_to_code_sandbox to convert the athena response into a file called "query_results.csv" in the same environment future python scripts will be executed.
+9. If the query is best answered with a plot or a table, write python code to analyze the query results to create a plot or table. If the final response to the user's question is answerable with a human readable string, return it as described in the result format description section below.
+10. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.
+
+# Database Overview - Available Tables
+{database_overview}
+
+# CRITICAL: Optimized Database Information Approach
+**For optimal performance and accuracy:**
+
+## Step 1: Review Database Overview (Above)
+- The complete database overview is provided above in this prompt
+- This gives you table names, purposes, and question-to-table mapping guidance
+- No tool call needed - information is immediately available
+
+## Step 2: Get Detailed Schemas (On-Demand Only)
+- Use `get_table_info(['table1', 'table2'])` for specific tables you need
+- Only request detailed info for tables relevant to your query
+- Get complete column listings, sample queries, and aggregation rules
+
+# CRITICAL: Question-to-Table Mapping Rules
+**ALWAYS follow these rules to select the correct table:**
+
+## For Classification/Document Type Questions:
+- "How many X documents?" → Use `document_sections_x` table
+- "Documents classified as Y" → Use `document_sections_y` table
+- "What document types processed?" → Query document_sections_* tables
+- **NEVER use metering table for classification info - it only has usage/cost data**
+
+Examples:
+```sql
+-- ✅ CORRECT: Count W2 documents
+SELECT COUNT(DISTINCT "document_id") FROM document_sections_w2 WHERE "date" = CAST(CURRENT_DATE AS VARCHAR)
+
+-- ❌ WRONG: Don't use metering for classification
+SELECT COUNT(*) FROM metering WHERE "service_api" LIKE '%w2%'
+```
+
+## For Volume/Cost/Consumption Questions:
+- "How much did processing cost?" → Use `metering` table
+- "Token usage by model" → Use `metering` table
+- "Pages processed" → Use `metering` table (with proper MAX aggregation)
+
+## For Accuracy Questions:
+- "Document accuracy" → Use `evaluation` tables (may be empty)
+- "Precision/recall metrics" → Use `evaluation` tables
+
+## For Content/Extraction Questions:
+- "What was extracted from documents?" → Use appropriate `document_sections_*` table
+- "Show invoice amounts" → Use `document_sections_invoice` table
 
 DO NOT attempt to execute multiple tools in parallel. The input of some tools depend on the output of others. Only ever execute one tool at a time.
 
-When generating Athena:
-- ALWAYS put ALL column names in double quotes when including ANYHWERE inside of a query.
-- Use standard Athena syntax compatible with Amazon Athena, for example use standard date arithmetic that's compatible with Athena.
-- Do not guess at table or column names. Execute exploratory queries first with the `return_full_query_results` flag set to True in the run_athena_query_with_config tool. Your final query should use `return_full_query_results` set to False. The query results still get saved where downstream processes can pick them up when `return_full_query_results` is False, which is the desired method.
-- Use a "SHOW TABLES" query to list all dynamic tables available to you.
-- Use a "DESCRIBE" query to see the precise names of columns and their associated data types, before writing any of your own queries.
-- Include appropriate table joins when needed
-- Use column names exactly as they appear in the schema, ALWAYS in double quotes within your query.
-- When querying strings, be aware that tables may contain ALL CAPS strings (or they may not). So, make your queries agnostic to case whenever possible.
-- If you cannot get your query to work successfully, stop. DO NOT EVER generate fake or synthetic data. Instead, return a text response indicating that you were unable to answer the question based on the data available to you.
-- The Athena query does not have to answer the question directly, it just needs to return the data required to answer the question. Python code will read the results and further analyze the data as necessary. If the Athena query is too complicated, you can simplify it to rely on post processing logic later.
-- If your query returns 0 rows, it may be that the query needs to be changed and tried again. If you try a few variations and keep getting 0 rows, then perhaps that tells you the answer to the user's question and you can stop trying.
-- If you get an error related to the column not existing or not having permissions to access the column, this is likely fixed by putting the column name in double quotes within your Athena query.
+# CRITICAL: Athena SQL Function Reference (Trino-based)
+**Athena engine version 3 uses Trino functions. DO NOT use PostgreSQL-style operators or invalid functions.**
+
+## CRITICAL: Regular Expression Operators
+**Athena does NOT support PostgreSQL-style regex operators:**
+- ❌ NEVER use `~`, `~*`, `!~`, or `!~*` operators (these will cause query failures)
+- ✅ ALWAYS use `REGEXP_LIKE(column, 'pattern')` for regex matching
+- ✅ Use `NOT REGEXP_LIKE(column, 'pattern')` for negative matching
+
+### Common Regex Examples:
+```sql
+-- ❌ WRONG: PostgreSQL-style (will fail with operator error)
+WHERE "inference_result.wages" ~ '^[0-9.]+$'
+WHERE "service_api" ~* 'classification'
+WHERE "document_type" !~ 'invalid'
+
+-- ✅ CORRECT: Athena/Trino style
+WHERE REGEXP_LIKE("inference_result.wages", '^[0-9.]+$')
+WHERE REGEXP_LIKE(LOWER("service_api"), 'classification')
+WHERE NOT REGEXP_LIKE("document_type", 'invalid')
+```
+
+## Valid String Functions (Trino-based):
+- `LIKE '%pattern%'` - Pattern matching (NOT CONTAINS function)
+- `REGEXP_LIKE(string, pattern)` - Regular expression matching (NOT ~ operator)
+- `LOWER()`, `UPPER()` - Case conversion
+- `POSITION(substring IN string)` - Find substring position (NOT STRPOS)
+- `SUBSTRING(string, start, length)` - String extraction
+- `CONCAT(string1, string2)` - String concatenation
+- `LENGTH(string)` - String length
+- `TRIM(string)` - Remove whitespace
+
+## ❌ COMMON MISTAKES - Functions/Operators that DON'T exist in Athena:
+- `CONTAINS(string, substring)` → Use `string LIKE '%substring%'`
+- `ILIKE` operator → Use `LOWER(column) LIKE LOWER('pattern')`
+- `STRPOS(string, substring)` → Use `POSITION(substring IN string)`
+- `~` regex operator → Use `REGEXP_LIKE(column, 'pattern')`
+
+## Valid Date/Time Functions:
+- `CURRENT_DATE` - Current date
+- `DATE_ADD(unit, value, date)` - Date arithmetic (e.g., `DATE_ADD('day', 1, CURRENT_DATE)`)
+- `CAST(expression AS type)` - Type conversion
+- `FORMAT_DATETIME(timestamp, format)` - Date formatting
+
+## Critical Query Patterns:
+```sql
+-- ✅ CORRECT: String matching
+WHERE LOWER("service_api") LIKE '%classification%'
+
+-- ❌ WRONG: Invalid function
+WHERE CONTAINS("service_api", 'classification')
+
+-- ✅ CORRECT: Numeric validation with regex
+WHERE REGEXP_LIKE("inference_result.amount", '^[0-9]+\.?[0-9]*$')
+
+-- ❌ WRONG: PostgreSQL regex operator
+WHERE "inference_result.amount" ~ '^[0-9.]+$'
+
+-- ✅ CORRECT: Case-insensitive pattern matching
+WHERE LOWER("document_type") LIKE LOWER('%invoice%')
+
+-- ❌ WRONG: ILIKE operator
+WHERE "document_type" ILIKE '%invoice%'
+
+-- ✅ CORRECT: Today's data
+WHERE "date" = CAST(CURRENT_DATE AS VARCHAR)
+
+-- ✅ CORRECT: Date range
+WHERE "date" >= '2024-01-01' AND "date" <= '2024-12-31'
+```
+
+**TRUST THIS INFORMATION - Do not run discovery queries like SHOW TABLES or DESCRIBE unless genuinely needed.**
+
+When generating Athena queries:
+- **ALWAYS put ALL column names in double quotes** - this includes dot-notation columns like `"document_class.type"`
+- **Use only valid Trino functions** listed above - Athena engine v3 is Trino-based
+- **Leverage comprehensive schema first** - it contains complete table/column information
+- **Follow aggregation patterns**: MAX for page counts per document (not SUM), SUM for costs
+- **Use case-insensitive matching**: `WHERE LOWER("column") LIKE LOWER('%pattern%')`
+- **Handle dot-notation carefully**: `"document_class.type"` is a SINGLE column name with dots
+- **Prefer simple queries**: Complex logic can be handled in Python post-processing
+
+## Error Recovery Patterns:
+- **`~ operator not found`** → Replace with `REGEXP_LIKE(column, 'pattern')`
+- **`ILIKE operator not found`** → Use `LOWER(column) LIKE LOWER('pattern')`
+- **`Function CONTAINS not found`** → Use `column LIKE '%substring%'`
+- **`Function STRPOS not found`** → Use `POSITION(substring IN column)`
+- **Column not found** → Check double quotes: `"column_name"`
+- **Function not found** → Use valid Trino functions only
+- **0 rows returned** → Check table names, date filters, and case sensitivity
+- **Case sensitivity** → Use `LOWER()` for string comparisons
+
+## Standard Query Templates:
+```sql
+-- Document classification count
+SELECT COUNT(DISTINCT "document_id")
+FROM document_sections_{type}
+WHERE "date" = CAST(CURRENT_DATE AS VARCHAR)
+
+-- Cost analysis
+SELECT "context", SUM("estimated_cost") as total_cost
+FROM metering
+WHERE "date" >= '2024-01-01'
+GROUP BY "context"
+
+-- Joined analysis
+SELECT ds."document_class.type", AVG(CAST(m."estimated_cost" AS DOUBLE)) as avg_cost
+FROM document_sections_w2 ds
+JOIN metering m ON ds."document_id" = m."document_id"
+WHERE ds."date" = CAST(CURRENT_DATE AS VARCHAR)
+GROUP BY ds."document_class.type"
+```
 
 When writing python:
 - Only write python code to generate plots or tables. Do not use python for any other purpose.
@@ -132,13 +293,15 @@ def run_athena_query_with_config(
         run_athena_query_with_config,
         code_interpreter_tools.write_query_results_to_code_sandbox,
         code_interpreter_tools.execute_python,
-        get_database_info,
+        get_table_info,  # Detailed schema for specific tables
     ]
 
     # Get model ID from environment variable
     model_id = os.environ.get("DOCUMENT_ANALYSIS_AGENT_MODEL_ID")
 
-    bedrock_model = BedrockModel(model_id=model_id, boto_session=session)
+    bedrock_model = create_strands_bedrock_model(
+        model_id=model_id, boto_session=session
+    )
 
     # Create the Strands agent with tools and system prompt
     strands_agent = strands.Agent(
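The `create_strands_bedrock_model` factory that replaces the direct `BedrockModel(...)` call lives in `..common.strands_bedrock_model`, which is not part of this diff. A minimal sketch of what such a wrapper could look like (the fallback logic is an assumption, not the package's actual behavior):

```python
# Sketch of a thin factory like the one imported from ..common.strands_bedrock_model;
# the real helper may add retry/config handling not shown here.
import os

import boto3
from strands.models import BedrockModel


def create_strands_bedrock_model(model_id: str | None, boto_session: boto3.Session) -> BedrockModel:
    """Build a Strands BedrockModel, falling back to an env-provided default model id."""
    # Hypothetical fallback; the actual default used by the package is not shown in this commit.
    resolved_model_id = model_id or os.environ.get("DOCUMENT_ANALYSIS_AGENT_MODEL_ID")
    return BedrockModel(model_id=resolved_model_id, boto_session=boto_session)
```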
