
Commit 2e45893

Author: Bob Strahan (committed)
Add analytics agent schema provider and enhance UI message display
1 parent 23c10ad commit 2e45893

File tree

5 files changed: +964 additions, −145 deletions


lib/idp_common_pkg/idp_common/agents/analytics/agent.py

Lines changed: 16 additions & 12 deletions
@@ -50,22 +50,26 @@ def create_analytics_agent(
 # Task
 Your task is to:
 1. Understand the user's question
-2. Use get_database_info tool to understand initial information about the database schema
-3. Generate a valid Athena query that answers the question OR that will provide you information to write a second Athena query which answers the question (e.g. listing tables first, if not enough information was provided by the get_database_info tool)
-4. Before executing the Athena query, re-read it and make sure _all_ column names mentioned _anywhere inside of the query_ are enclosed in double quotes.
-5. Execute your revised query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the Athena results directly. For larger or final queries, the results should not need to be returned, because downstream tools will download them separately.
-6. Use the write_query_results_to_code_sandbox tool to convert the Athena response into a file called "query_results.csv" in the same environment where future Python scripts will be executed.
-7. If the query is best answered with a plot or a table, write Python code that analyzes the query results to create a plot or table. If the final response to the user's question is answerable with a human-readable string, return it as described in the result format description section below.
-8. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.
+2. Use the get_database_info tool to get comprehensive database schema information (this now includes detailed table descriptions, column schemas, usage patterns, and sample queries)
+3. Analyze the provided schema information to determine the appropriate tables and columns for your query - the schema info includes detailed guidance on which tables to use for different types of questions
+4. Generate a valid Athena query based on the comprehensive schema information. Only use exploratory queries (SHOW TABLES, DESCRIBE) if the provided schema info is insufficient for your specific question
+5. Before executing the Athena query, re-read it and make sure _all_ column names mentioned _anywhere inside of the query_ are enclosed in double quotes.
+6. Execute your revised query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the Athena results directly. For larger or final queries, the results should not need to be returned, because downstream tools will download them separately.
+7. Use the write_query_results_to_code_sandbox tool to convert the Athena response into a file called "query_results.csv" in the same environment where future Python scripts will be executed.
+8. If the query is best answered with a plot or a table, write Python code that analyzes the query results to create a plot or table. If the final response to the user's question is answerable with a human-readable string, return it as described in the result format description section below.
+9. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.
 
 DO NOT attempt to execute multiple tools in parallel. The input of some tools depends on the output of others. Only ever execute one tool at a time.
 
-When generating Athena:
-- ALWAYS put ALL column names in double quotes when included ANYHWERE inside of a query.
+When generating Athena queries:
+- ALWAYS put ALL column names in double quotes when included ANYWHERE inside of a query.
 - Use standard Athena syntax compatible with Amazon Athena, for example use standard date arithmetic that's compatible with Athena.
-- Do not guess at table or column names. Execute exploratory queries first with the `return_full_query_results` flag set to True in the run_athena_query_with_config tool. Your final query should use `return_full_query_results` set to False. The query results still get saved where downstream processes can pick them up when `return_full_query_results` is False, which is the desired method.
-- Use a "SHOW TABLES" query to list all dynamic tables available to you.
-- Use a "DESCRIBE" query to see the precise names of columns and their associated data types, before writing any of your own queries.
+- Leverage the comprehensive schema information provided by get_database_info first - it includes detailed table descriptions, column schemas, usage patterns, and critical aggregation rules
+- Pay special attention to the metering table aggregation patterns (use MAX for page counts per document, not SUM, since values are replicated)
+- For questions about volume/costs/consumption, use the metering table as described in the schema
+- For questions about accuracy, use the evaluation tables (but note they may be empty if no evaluation jobs were run)
+- For questions about extracted content, use the appropriate document_sections_* tables
+- Only use exploratory queries like "SHOW TABLES" or "DESCRIBE" if the comprehensive schema information doesn't provide enough detail for your specific question
 - Include appropriate table joins when needed
 - Use column names exactly as they appear in the schema, ALWAYS in double quotes within your query.
 - When querying strings, be aware that tables may contain ALL CAPS strings (or they may not). So, make your queries agnostic to case whenever possible.
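The two rules the revised prompt hammers on - double-quoting every column name and using MAX rather than SUM for replicated per-document page counts - are easiest to see in a concrete query. Below is a minimal sketch, not code from this commit, of how a caller might run such a query against the metering table; the database name, S3 output location, and the 'pages' unit value are illustrative assumptions.

```python
# Illustrative sketch, not code from this commit: run a page-count query
# that follows both prompt rules. Every column name is double-quoted, and
# the replicated per-document page counts are reduced with MAX before
# summing across documents.
import boto3

# Hypothetical settings; the deployed stack wires these up differently.
DATABASE = "idp_analytics"
OUTPUT_LOCATION = "s3://example-bucket/athena-results/"

QUERY = """
SELECT SUM("doc_pages") AS "total_pages"
FROM (
    SELECT "document_id", MAX("value") AS "doc_pages"
    FROM metering
    WHERE "unit" = 'pages'  -- unit name assumed for illustration
    GROUP BY "document_id"
) AS per_document
"""

athena = boto3.client("athena")
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
print(execution["QueryExecutionId"])
```

Submitting through start_query_execution mirrors what a run_athena_query-style tool would do under the hood: Athena writes the result CSV to the output location, where downstream steps can fetch it without the full result set being returned inline.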

lib/idp_common_pkg/idp_common/agents/analytics/config.py

Lines changed: 34 additions & 132 deletions
@@ -50,147 +50,49 @@ def get_analytics_config() -> Dict[str, Any]:
 
 def load_db_description() -> str:
     """
-    Load the database description from the assets directory.
-    TODO: this is hard-coded for now because the assets directory was hard to find in the Lambda environment.
+    Load the database description using the comprehensive schema provider.
+
+    This function now generates detailed table descriptions including:
+    - Metering table with proper aggregation patterns
+    - Evaluation tables with comprehensive schemas
+    - Dynamic document sections tables based on configuration
 
     Returns:
-        String containing the database description
+        String containing the comprehensive database description
     """
-
-    return """
-# Athena Table Information
-
-## Overview
-
-### Metering table
-
-Metering Table (metering)
-* Captures detailed usage metrics for document processing operations
-* Useful for monitoring IDP application usage (document processing throughput, costs, token usage, etc.)
-* Populated every time a document is processed (even if no evaluations are run)
-
-The metering table is particularly valuable for:
-- Cost analysis and allocation
-- Usage pattern identification
-- Resource optimization
-- Performance benchmarking across different document types and sizes
-
-### Dynamic Document Section Tables
-
-The solution also creates dynamic tables for document sections. The document sections tables store the actual extracted data from document sections in a structured format suitable for analytics. These tables are automatically discovered by AWS Glue Crawler and are organized by section type (classification).
-
-* Tables are automatically created based on the section classification
-* Each section type gets its own table (e.g., document_sections_invoice, document_sections_receipt)
-* Common columns include: section_id, document_id, section_classification, section_confidence, timestamp
-* Additional columns are dynamically inferred from the JSON extraction results
-* Tables are partitioned by date (YYYY-MM-DD format)
-* When querying columns with a period in their name (e.g. inference_result.currentnetpay), the column name must be enclosed in quotation marks to be compatible with Athena querying, for example `SELECT "inference_result.currentnetpay" FROM document_sections_payslip`.
-
-Many columns in dynamic document section tables are dynamically inferred from the JSON extraction results and vary by section type. Common patterns include:
-- Nested JSON objects are flattened using dot notation (e.g., `customer.name`, `customer.address.street`)
-- Arrays are converted to JSON strings
-- Primitive values (strings, numbers, booleans) are preserved as their native types
-
-Each section type table is also partitioned by date (YYYY-MM-DD format) for efficient querying.
-
-Since the columns and data types in the document section tables are dynamic, you may want to run a query to learn more about them before writing any other queries. Consider e.g. executing `SHOW TABLES` to see what dynamic tables exist, or `DESCRIBE document_sections_payslip` to learn more about that table (if it exists).
-
-### Evaluation tables
-
-The evaluation tables store metrics and results from comparing extracted document data against baseline (ground truth) data. These tables provide insights into the accuracy and performance of the document processing system. They are usually empty unless the user has run separate evaluation jobs. These tables are useful if the user wants to know about the "accuracy" of their solution, for example.
-
-1. Document Evaluations Table (document_evaluations)
-* Only useful if users have run "evaluation" jobs, which are not run by default. Contains information like accuracy compared to ground truth datasets.
-* Contains document-level evaluation metrics
-* Columns include: document_id, input_key, evaluation_date, accuracy, precision, recall, f1_score, false_alarm_rate, false_discovery_rate, execution_time
-* Partitioned by date (YYYY-MM-DD format)
-
-2. Section Evaluations Table (section_evaluations)
-* Only useful if users have run "evaluation" jobs, which are not run by default. Contains information like accuracy compared to ground truth datasets.
-* Contains section-level evaluation metrics
-* Columns include: document_id, section_id, section_type, accuracy, precision, recall, f1_score, false_alarm_rate, false_discovery_rate, evaluation_date
-* Partitioned by date (YYYY-MM-DD format)
-
-3. Attribute Evaluations Table (attribute_evaluations)
-* Only useful if users have run "evaluation" jobs, which are not run by default. Contains information like accuracy compared to ground truth datasets.
-* Contains attribute-level evaluation metrics
-* Columns include: document_id, section_id, section_type, attribute_name, expected, actual, matched, score, reason, evaluation_method, confidence, confidence_threshold, evaluation_date
-* Partitioned by date (YYYY-MM-DD format)
-
-## Additional notes
-
-* The "timestamp" and "date" columns pertain to when the document was uploaded to the system for processing, NOT any dates on the document itself.
-
-## Sample Athena Queries
-
-Here are some example queries to get you started:
-
-**List tables available, including dynamic ones**
-```athena
+    try:
+        # Import here to avoid circular imports
+        from .schema_provider import generate_comprehensive_database_description
+
+        logger.info("Loading comprehensive database description from schema provider")
+        description = generate_comprehensive_database_description()
+        logger.debug(f"Generated database description of length: {len(description)}")
+        return description
+
+    except Exception as e:
+        logger.error(f"Error loading comprehensive database description: {e}")
+        # Fall back to a basic description if the schema provider fails
+        return """
+# Database Schema Information
+
+## Note
+Advanced schema information is temporarily unavailable. Use the following basic queries to explore:
+
+```sql
+-- List all tables
 SHOW TABLES
-```
-
-**List tables available starting with "document_sections" with wildcard notation**
-```athena
-SHOW TABLES LIKE 'document_sections*'
-```
 
-**List tables available starting with "document_sections" with an SQL LIKE pattern**
-```athena
-SHOW TABLES WHERE table_name LIKE 'document_sections%' -- Uses SQL LIKE pattern
-```
-
-(Note that `SHOW TABLES LIKE 'document_sections%'` does NOT work here because in Athena the pattern is treated as a regular expression rather than an SQL LIKE pattern. In regular expressions, '%' doesn't serve as a wildcard - it denotes "zero or more occurrences of the preceding element.")
+-- Describe table structure
+DESCRIBE table_name
 
+-- Explore metering data
+SELECT * FROM metering LIMIT 10
 
-**View the dynamic columns of a table named "document_sections_payslip"**
-```athena
-DESCRIBE document_sections_payslip
-```
-
-**Token usage by model:**
-```athena
-SELECT
-    "service_api",
-    SUM(CASE WHEN "unit" = "inputTokens" THEN value ELSE 0 END) as total_input_tokens,
-    SUM(CASE WHEN "unit" = "outputTokens" THEN value ELSE 0 END) as total_output_tokens,
-    SUM(CASE WHEN "unit" = "totalTokens" THEN value ELSE 0 END) as total_tokens,
-    COUNT(DISTINCT "document_id") as document_count
-FROM
-    metering
-GROUP BY
-    "service_api"
-ORDER BY
-    total_tokens DESC;
-```
-(note all columns are included within double quotes)
-
-**Total net pay added across all paystub type documents**
-```athena
-SELECT SUM(CAST(REPLACE(REPLACE("inference_result.currentnetpay", '$', ''), ',', '') AS DECIMAL(10,2))) as total_net_pay
-FROM document_sections_payslip;
+-- List document sections tables
+SHOW TABLES LIKE 'document_sections*'
 ```
-(note the double quotation marks around the column name)
 
-**All payslip information for an employee named David Calico**
-```athena
-SELECT * FROM document_sections_payslip WHERE LOWER("inference_result.employeename.firstname") = 'david' AND LOWER("inference_result.employeename.lastname") = 'calico'
-```
-(note the use of LOWER because the case of strings in the database is unknown, and note the double quotation marks around the column name)
-
-**Overall accuracy by document type:**
-```athena
-SELECT
-    "section_type",
-    AVG("accuracy") as avg_accuracy,
-    COUNT(*) as document_count
-FROM
-    section_evaluations
-GROUP BY
-    "section_type"
-ORDER BY
-    "avg_accuracy" DESC;
-```
+**Important**: Always enclose column names in double quotes in Athena queries.
 """
 
 
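The schema_provider module imported above is one of the five files changed in this commit, but its diff is not shown in this excerpt. As a rough sketch of the interface the import implies - only the function name is taken from the diff; the table metadata and assembly logic are illustrative assumptions - it could look like:

```python
# schema_provider.py -- hypothetical sketch, not the commit's implementation.
# Only the function name mirrors the import in config.py above; the table
# metadata and assembly logic below are illustrative assumptions.
from typing import Dict

# Abridged static descriptions for the fixed tables.
_STATIC_TABLES: Dict[str, str] = {
    "metering": (
        "Per-document usage metrics (tokens, pages, costs). Page counts "
        "are replicated across rows, so aggregate them with MAX per "
        "document, never SUM."
    ),
    "document_evaluations": (
        "Document-level accuracy metrics; empty unless evaluation jobs "
        "have been run."
    ),
}


def generate_comprehensive_database_description() -> str:
    """Assemble a markdown description of the analytics tables."""
    parts = ["# Athena Table Information"]
    for table, description in _STATIC_TABLES.items():
        parts.append(f"## {table}\n\n{description}")
    # The real provider presumably also derives document_sections_* table
    # descriptions from the deployed configuration; that step is omitted here.
    return "\n\n".join(parts)
```

Structured this way, load_db_description() degrades gracefully: if anything in the provider raises, config.py catches the exception and returns the static fallback text instead.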
