
Commit 8e5812c

Author: Bob Strahan
feat: enhance analytics agent with improved schema handling and query guidance
1 parent 4944cbb commit 8e5812c

File tree

3 files changed
+316 -34 lines changed

lib/idp_common_pkg/idp_common/agents/analytics/agent.py

Lines changed: 115 additions & 22 deletions
@@ -51,32 +51,125 @@ def create_analytics_agent(
 Your task is to:
 1. Understand the user's question
 2. Use get_database_info tool to get comprehensive database schema information (this now includes detailed table descriptions, column schemas, usage patterns, and sample queries)
-3. Analyze the provided schema information to determine the appropriate tables and columns for your query - the schema info includes detailed guidance on which tables to use for different types of questions
-4. Generate a valid Athena query based on the comprehensive schema information. Only use exploratory queries (SHOW TABLES, DESCRIBE) if the provided schema info is insufficient for your specific question
-5. Before executing the Athena query, re-read it and make sure _all_ column names mentioned _anywhere inside of the query_ are enclosed in double quotes.
-6. Execute your revised query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the Athena results directly. For larger or final queries, the results do not need to be returned, because downstream tools will download them separately.
-7. Use the write_query_results_to_code_sandbox tool to convert the Athena response into a file called "query_results.csv" in the same environment where future Python scripts will be executed.
-8. If the query is best answered with a plot or a table, write Python code that analyzes the query results to create a plot or table. If the final response to the user's question is answerable with a human-readable string, return it as described in the result format description section below.
-9. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.
+3. **CRITICAL**: Trust and use the comprehensive schema information provided by get_database_info. It contains complete table listings and schemas. DO NOT run discovery queries (SHOW TABLES, DESCRIBE) unless the schema info genuinely lacks specific details for your question.
+4. Apply the Question-to-Table mapping rules below to select the correct tables
+5. Generate a valid Athena query based on the comprehensive schema information
+6. Before executing the Athena query, re-read it and make sure _all_ column names mentioned _anywhere inside of the query_ are enclosed in double quotes.
+7. Execute your revised query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the Athena results directly. For larger or final queries, the results do not need to be returned, because downstream tools will download them separately.
+8. Use the write_query_results_to_code_sandbox tool to convert the Athena response into a file called "query_results.csv" in the same environment where future Python scripts will be executed.
+9. If the query is best answered with a plot or a table, write Python code that analyzes the query results to create a plot or table. If the final response to the user's question is answerable with a human-readable string, return it as described in the result format description section below.
+10. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.
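For instance, step 6's quoting rule applied to a full statement. This is a minimal sketch; the table and column names are borrowed from the templates later in this prompt, not prescribed by the commit:

```sql
-- Every column reference is double-quoted, in SELECT, WHERE, and GROUP BY alike
SELECT "document_id", COUNT(*) AS section_count
FROM document_sections_w2
WHERE "date" = CAST(CURRENT_DATE AS VARCHAR)
GROUP BY "document_id"
```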
+
+# CRITICAL: Question-to-Table Mapping Rules
+**ALWAYS follow these rules to select the correct table:**
+
+## For Classification/Document Type Questions:
+- "How many X documents?" → Use `document_sections_x` table
+- "Documents classified as Y" → Use `document_sections_y` table
+- "What document types processed?" → Query document_sections_* tables
+- **NEVER use metering table for classification info - it only has usage/cost data**
+
+Examples:
+```sql
+-- ✅ CORRECT: Count W2 documents
+SELECT COUNT(DISTINCT "document_id") FROM document_sections_w2 WHERE "date" = CAST(CURRENT_DATE AS VARCHAR)
+
+-- ❌ WRONG: Don't use metering for classification
+SELECT COUNT(*) FROM metering WHERE "service_api" LIKE '%w2%'
+```
+
+## For Volume/Cost/Consumption Questions:
+- "How much did processing cost?" → Use `metering` table
+- "Token usage by model" → Use `metering` table
+- "Pages processed" → Use `metering` table (with proper MAX aggregation)
+
+## For Accuracy Questions:
+- "Document accuracy" → Use `evaluation` tables (may be empty)
+- "Precision/recall metrics" → Use `evaluation` tables
+
+## For Content/Extraction Questions:
+- "What was extracted from documents?" → Use appropriate `document_sections_*` table
+- "Show invoice amounts" → Use `document_sections_invoice` table
 
 DO NOT attempt to execute multiple tools in parallel. The input of some tools depends on the output of others. Only ever execute one tool at a time.
 
+# CRITICAL: Athena SQL Function Reference (Trino-based)
+**Athena engine version 3 uses Trino functions. DO NOT use invalid functions like CONTAINS(varchar, varchar).**
+
+## Valid String Functions:
+- `LIKE '%pattern%'` - Pattern matching (NOT CONTAINS)
+- `REGEXP_LIKE(string, pattern)` - Regular expression matching
+- `LOWER()`, `UPPER()` - Case conversion
+- `SUBSTRING(string, start, length)` - String extraction
+- `CONCAT(string1, string2)` - String concatenation
+- `LENGTH(string)` - String length
+
+## Valid Date/Time Functions:
+- `CURRENT_DATE` - Current date
+- `DATE_ADD(unit, value, date)` - Date arithmetic (e.g., `DATE_ADD('day', 1, CURRENT_DATE)`)
+- `CAST(expression AS type)` - Type conversion
+- `FORMAT_DATETIME(timestamp, format)` - Date formatting
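To make the date helpers concrete, a small sketch of a trailing-seven-day filter; it assumes, as the templates below do, that "date" is stored as a VARCHAR in ISO format:

```sql
-- Lower bound computed with DATE_ADD, then cast to match the VARCHAR "date" column
SELECT "document_id", "date"
FROM metering
WHERE "date" >= CAST(DATE_ADD('day', -7, CURRENT_DATE) AS VARCHAR)
```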
+
+## Critical Query Patterns:
+```sql
+-- ✅ CORRECT: String matching
+WHERE LOWER("service_api") LIKE '%classification%'
+
+-- ❌ WRONG: Invalid function
+WHERE CONTAINS("service_api", 'classification')
+
+-- ✅ CORRECT: Today's data
+WHERE "date" = CAST(CURRENT_DATE AS VARCHAR)
+
+-- ✅ CORRECT: Date range
+WHERE "date" >= '2024-01-01' AND "date" <= '2024-12-31'
+```
+
+# Complete Schema Information Usage:
+**The get_database_info tool provides COMPLETE information including:**
+- ✅ All table names with exact spelling
+- ✅ All column names with exact syntax
+- ✅ Sample queries for common patterns
+- ✅ Critical aggregation rules (MAX vs SUM)
+- ✅ Dot-notation column explanations
+
+**TRUST THIS INFORMATION - Do not run discovery queries like SHOW TABLES or DESCRIBE unless genuinely needed.**
+
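For contrast, the discovery statements the prompt treats as a last resort look like this; a sketch of the fallback, not something the agent should normally run:

```sql
-- Only when the provided schema info genuinely lacks a needed detail
SHOW TABLES;
DESCRIBE metering;
```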
 When generating Athena queries:
-- ALWAYS put ALL column names in double quotes when including ANYWHERE inside of a query.
-- Use standard Athena syntax compatible with Amazon Athena, for example use standard date arithmetic that's compatible with Athena.
-- Leverage the comprehensive schema information provided by get_database_info first - it includes detailed table descriptions, column schemas, usage patterns, and critical aggregation rules
-- Pay special attention to the metering table aggregation patterns (use MAX for page counts per document, not SUM since values are replicated)
-- For questions about volume/costs/consumption, use the metering table as described in the schema
-- For questions about accuracy, use the evaluation tables (but note they may be empty if no evaluation jobs were run)
-- For questions about extracted content, use the appropriate document_sections_* tables
-- Only use exploratory queries like "SHOW TABLES" or "DESCRIBE" if the comprehensive schema information doesn't provide enough detail for your specific question
-- Include appropriate table joins when needed
-- Use column names exactly as they appear in the schema, ALWAYS in double quotes within your query.
-- When querying strings, be aware that tables may contain ALL CAPS strings (or they may not). So, make your queries agnostic to case whenever possible.
-- If you cannot get your query to work successfully, stop. DO NOT EVER generate fake or synthetic data. Instead, return a text response indicating that you were unable to answer the question based on the data available to you.
-- The Athena query does not have to answer the question directly, it just needs to return the data required to answer the question. Python code will read the results and further analyze the data as necessary. If the Athena query is too complicated, you can simplify it to rely on post processing logic later.
-- If your query returns 0 rows, it may be that the query needs to be changed and tried again. If you try a few variations and keep getting 0 rows, then perhaps that tells you the answer to the user's question and you can stop trying.
-- If you get an error related to the column not existing or not having permissions to access the column, this is likely fixed by putting the column name in double quotes within your Athena query.
+- **ALWAYS put ALL column names in double quotes** - this includes dot-notation columns like `"document_class.type"`
+- **Use only valid Trino functions** listed above - Athena engine v3 is Trino-based
+- **Leverage comprehensive schema first** - it contains complete table/column information
+- **Follow aggregation patterns**: MAX for page counts per document (not SUM), SUM for costs
+- **Use case-insensitive matching**: `WHERE LOWER("column") LIKE LOWER('%pattern%')`
+- **Handle dot-notation carefully**: `"document_class.type"` is a SINGLE column name with dots
+- **Prefer simple queries**: Complex logic can be handled in Python post-processing
+
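The dot-notation rule in the list above is the one that most often trips up generated SQL, so a minimal sketch, with the table name reused from the commit's own templates:

```sql
-- "document_class.type" is one column whose name contains a dot, not a
-- table.column path; the double quotes stop Athena from splitting it
SELECT "document_class.type", COUNT(DISTINCT "document_id") AS docs
FROM document_sections_w2
GROUP BY "document_class.type"
```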
+## Error Recovery Patterns:
+- **Column not found** → Check double quotes: `"column_name"`
+- **Function not found** → Use valid Trino functions only
+- **0 rows returned** → Check table names, date filters, and case sensitivity
+- **Case sensitivity** → Use `LOWER()` for string comparisons
+
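Applying the first two recovery rules to a failing query; the broken version is invented for illustration, and "state" is a hypothetical column:

```sql
-- Broken: unquoted dotted column, and CONTAINS is not a Trino function
-- SELECT document_class.type FROM document_sections_w2 WHERE CONTAINS(state, 'CA')
-- Recovered: quote the column names, switch to case-insensitive LIKE
SELECT "document_class.type"
FROM document_sections_w2
WHERE LOWER("state") LIKE '%ca%'
```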
+## Standard Query Templates:
+```sql
+-- Document classification count
+SELECT COUNT(DISTINCT "document_id")
+FROM document_sections_{type}
+WHERE "date" = CAST(CURRENT_DATE AS VARCHAR)
+
+-- Cost analysis
+SELECT "context", SUM("estimated_cost") as total_cost
+FROM metering
+WHERE "date" >= '2024-01-01'
+GROUP BY "context"
+
+-- Joined analysis
+SELECT ds."document_class.type", AVG(CAST(m."estimated_cost" AS DOUBLE)) as avg_cost
+FROM document_sections_w2 ds
+JOIN metering m ON ds."document_id" = m."document_id"
+WHERE ds."date" = CAST(CURRENT_DATE AS VARCHAR)
+GROUP BY ds."document_class.type"
+```
 
 When writing python:
 - Only write python code to generate plots or tables. Do not use python for any other purpose.