lib/idp_common_pkg/idp_common/agents/analytics/agent.py (+16 −12 lines)
@@ -50,22 +50,26 @@ def create_analytics_agent(
 # Task
 Your task is to:
 1. Understand the user's question
-2. Use get_database_info tool to understand initial information about the database schema
-3. Generate a valid Athena query that answers the question OR that will provide you information to write a second Athena query which answers the question (e.g. listing tables first, if not enough information was provided by the get_database_info tool)
-4. Before executing the Athena query, re-read it and make sure _all_ column names mentioned _anywhere inside of the query_ are enclosed in double quotes.
-5. Execute your revised query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the athena results directly. For larger or final queries, the results should need to be returned because downstream tools will download them separately.
-6. Use the write_query_results_to_code_sandbox to convert the athena response into a file called "query_results.csv" in the same environment future python scripts will be executed.
-7. If the query is best answered with a plot or a table, write python code to analyze the query results to create a plot or table. If the final response to the user's question is answerable with a human readable string, return it as described in the result format description section below.
-8. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.
+2. Use the get_database_info tool to get comprehensive database schema information (this now includes detailed table descriptions, column schemas, usage patterns, and sample queries)
+3. Analyze the provided schema information to determine the appropriate tables and columns for your query - the schema info includes detailed guidance on which tables to use for different types of questions
+4. Generate a valid Athena query based on the comprehensive schema information. Only use exploratory queries (SHOW TABLES, DESCRIBE) if the provided schema info is insufficient for your specific question
+5. Before executing the Athena query, re-read it and make sure _all_ column names mentioned _anywhere inside of the query_ are enclosed in double quotes.
+6. Execute your revised query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the Athena results directly. For larger or final queries, the results do not need to be returned directly, because downstream tools will download them separately.
+7. Use the write_query_results_to_code_sandbox tool to convert the Athena response into a file called "query_results.csv" in the same environment where future Python scripts will be executed.
+8. If the query is best answered with a plot or a table, write Python code that analyzes the query results and creates the plot or table. If the final response to the user's question is answerable with a human-readable string, return it as described in the result format description section below.
+9. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.

 DO NOT attempt to execute multiple tools in parallel. The input of some tools depends on the output of others. Only ever execute one tool at a time.

-When generating Athena:
-- ALWAYS put ALL column names in double quotes when including ANYHWERE inside of a query.
+When generating Athena queries:
+- ALWAYS put ALL column names in double quotes when they appear ANYWHERE inside of a query.
 - Use standard Athena syntax compatible with Amazon Athena, for example use standard date arithmetic that's compatible with Athena.
-- Do not guess at table or column names. Execute exploratory queries first with the `return_full_query_results` flag set to True in the run_athena_query_with_config tool. Your final query should use `return_full_query_results` set to False. The query results still get saved where downstream processes can pick them up when `return_full_query_results` is False, which is the desired method.
-- Use a "SHOW TABLES" query to list all dynamic tables available to you.
-- Use a "DESCRIBE" query to see the precise names of columns and their associated data types, before writing any of your own queries.
+- Leverage the comprehensive schema information provided by get_database_info first - it includes detailed table descriptions, column schemas, usage patterns, and critical aggregation rules
+- Pay special attention to the metering table aggregation patterns (use MAX for page counts per document, not SUM, since values are replicated)
+- For questions about volume/costs/consumption, use the metering table as described in the schema
+- For questions about accuracy, use the evaluation tables (but note they may be empty if no evaluation jobs were run)
+- For questions about extracted content, use the appropriate document_sections_* tables
+- Only use exploratory queries like "SHOW TABLES" or "DESCRIBE" if the comprehensive schema information doesn't provide enough detail for your specific question
 - Include appropriate table joins when needed
 - Use column names exactly as they appear in the schema, ALWAYS in double quotes within your query.
 - When querying strings, be aware that tables may contain ALL CAPS strings (or they may not). So, make your queries agnostic to case whenever possible.
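To make the MAX-not-SUM rule above concrete, here is a minimal sketch using the metering columns referenced elsewhere in this diff ("document_id", "unit", "value"); the 'pages' unit value is an assumption, not confirmed by the schema excerpt:

```sql
-- Sketch: total pages processed across all documents.
-- Page-count values are replicated per document, so take MAX per
-- document first; a plain SUM over all rows would overcount.
SELECT SUM(page_count) AS total_pages
FROM (
    SELECT "document_id", MAX("value") AS page_count
    FROM metering
    WHERE "unit" = 'pages' -- hypothetical unit name; check the schema first
    GROUP BY "document_id"
) AS per_document;
```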
@@ … @@
 * Populated every time a document is processed (even if no evaluations are run)
-
-The metering table is particularly valuable for:
-- Cost analysis and allocation
-- Usage pattern identification
-- Resource optimization
-- Performance benchmarking across different document types and sizes
-
-### Dynamic Document Section Tables
-
-The solution also creates dynamic tables for document sections. The document sections tables store the actual extracted data from document sections in a structured format suitable for analytics. These tables are automatically discovered by AWS Glue Crawler and are organized by section type (classification).
-
-* Tables are automatically created based on the section classification
-* Each section type gets its own table (e.g., document_sections_invoice, document_sections_receipt)
-* Common columns include: section_id, document_id, section_classification, section_confidence, timestamp
-* Additional columns are dynamically inferred from the JSON extraction results
-* Tables are partitioned by date (YYYY-MM-DD format)
-* When querying columns with a period in their name (e.g. inference_result.currentnetpay) the column name must be included in quotation marks to be compatible with Athena querying, for example `SELECT "inference_result.currentnetpay" FROM document_sections_payslip`.
-
-Many columns in dynamic document section tables are dynamically inferred from the JSON extraction results and vary by section type. Common patterns include:
-- Nested JSON objects are flattened using dot notation (e.g., `customer.name`, `customer.address.street`)
-- Arrays are converted to JSON strings
-- Primitive values (strings, numbers, booleans) are preserved as their native types
-
-Each section type table is also partitioned by date (YYYY-MM-DD format) for efficient querying.
-
-Since the columns and data types in the document section tables are dynamic, you may want to run a query to learn more about them before writing any other queries. Consider e.g. executing `SHOW TABLES` to see what dynamic tables exist, or `describe document_sections_payslip` to learn more about that table (if it exists).
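To illustrate the dot-notation flattening described above, a minimal sketch against the document_sections_invoice example table; the customer.* columns come from the flattening examples in the list, not a confirmed schema, and the partition filter is likewise an assumption:

```sql
-- Dotted column names must be double-quoted in Athena.
SELECT
    "customer.name",
    "customer.address.street"
FROM document_sections_invoice
-- WHERE "date" = '2024-01-15'  -- hypothetical partition column and value
LIMIT 10;
```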
-
-### Evaluation tables
-
-The evaluation tables store metrics and results from comparing extracted document data against baseline (ground truth) data. These tables provide insights into the accuracy and performance of the document processing system. These tables are usually empty unless the user has run separate evaluation jobs. These tables are useful if the user wants to know about the "accuracy" of their solution for example.
@@ … @@
     # Fallback to basic description if schema provider fails
+    return """
+# Database Schema Information
+
+## Note
+Advanced schema information is temporarily unavailable. Use the following basic queries to explore:
+
+```sql
+-- List all tables
 SHOW TABLES
-```
-
-**List tables available starting with "document_sections" with wildcard notation**
-```athena
-SHOW TABLES LIKE 'document_sections*'
-```

-**List tables available starting with "document_sections" with an SQL LIKE pattern**
-```athena
-SHOW TABLES WHERE table_name LIKE 'document_sections%' -- Uses SQL LIKE pattern
-```
-
-(note that `SHOW TABLES LIKE 'document_sections%'` does NOT WORK here because in Athena, the pattern is treated as a regular expression rather than the SQL LIKE pattern matching. In regular expressions, '%' doesn't serve as a wildcard - it denotes "zero or more occurrences of the preceding element.")
+-- Describe table structure
+DESCRIBE table_name

+-- Explore metering data
+SELECT * FROM metering LIMIT 10

-**View the dynamic columns of a table named "document_sections_payslip"**
-```athena
-DESCRIBE document_sections_payslip
-```
-
-**Token usage by model:**
-```athena
-SELECT
-    "service_api",
-    SUM(CASE WHEN "unit" = "inputTokens" THEN value ELSE 0 END) as total_input_tokens,
-    SUM(CASE WHEN "unit" = "outputTokens" THEN value ELSE 0 END) as total_output_tokens,
-    SUM(CASE WHEN "unit" = "totalTokens" THEN value ELSE 0 END) as total_tokens,
-    COUNT(DISTINCT "document_id") as document_count
-FROM
-    metering
-GROUP BY
-    "service_api"
-ORDER BY
-    total_tokens DESC;
-```
-(note all columns are included within double quotes)
-
-**Total net pay added across all paystub type documents**
-```athena
-SELECT SUM(CAST(REPLACE(REPLACE("inference_result.currentnetpay", '$', ''), ',', '') AS DECIMAL(10,2))) as total_net_pay
-FROM document_sections_payslip;
+-- List document sections tables
+SHOW TABLES LIKE 'document_sections*'
 ```
-(note the double quotation marks around the column name)

-**All payslip information for an employee named David Calico**
-```athena
-SELECT * FROM document_sections_payslip WHERE LOWER("inference_result.employeename.firstname") = 'david' AND LOWER("inference_result.employeename.lastname") = 'calico'
-```
-(note the use of LOWER because case of strings in the database is unknown, and note the double quotation marks around the column name)
-
-**Overall accuracy by document type:**
-```athena
-SELECT
-    "section_type",
-    AVG("accuracy") as avg_accuracy,
-    COUNT(*) as document_count
-FROM
-    section_evaluations
-GROUP BY
-    "section_type"
-ORDER BY
-    "avg_accuracy" DESC;
-```
+**Important**: Always enclose column names in double quotes in Athena queries.
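A final sketch of that rule, using only metering columns already referenced above. Double quotes mark identifiers in Athena, while string literals take single quotes (note that the removed token-usage example compares "unit" against double-quoted values, which Athena would parse as column references rather than strings):

```sql
-- Identifiers double-quoted, string literals single-quoted.
SELECT COUNT(DISTINCT "document_id") AS document_count
FROM metering
WHERE "unit" = 'totalTokens';
```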