Commit 9091822

Refactor analytics agent to use efficient two-step database info approach

Author: Bob Strahan
1 parent 8e5812c, commit 9091822

7 files changed: +292 -248 lines changed

lib/idp_common_pkg/idp_common/agents/analytics/agent.py

Lines changed: 80 additions & 21 deletions
@@ -15,7 +15,12 @@

 from ..common.config import load_result_format_description
 from .config import load_python_plot_generation_examples
-from .tools import CodeInterpreterTools, get_database_info, run_athena_query
+from .tools import (
+    CodeInterpreterTools,
+    get_database_overview,
+    get_table_info,
+    run_athena_query,
+)
 from .utils import register_code_interpreter_tools

 logger = logging.getLogger(__name__)
@@ -50,16 +55,34 @@ def create_analytics_agent(
 # Task
 Your task is to:
 1. Understand the user's question
-2. Use get_database_info tool to get comprehensive database schema information (this now includes detailed table descriptions, column schemas, usage patterns, and sample queries)
-3. **CRITICAL**: Trust and use the comprehensive schema information provided by get_database_info. It contains complete table listings and schemas. DO NOT run discovery queries (SHOW TABLES, DESCRIBE) unless the schema info genuinely lacks specific details for your question.
-4. Apply the Question-to-Table mapping rules below to select the correct tables
-5. Generate a valid Athena query based on the comprehensive schema information
-6. Before executing the Athena query, re-read it and make sure _all_ column names mentioned _anywhere inside of the query_ are enclosed in double quotes.
-7. Execute your revised query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the athena results directly. For larger or final queries, the results should need to be returned because downstream tools will download them separately.
+2. **EFFICIENT APPROACH**: Use get_database_overview() to get a fast overview of available tables and their purposes
+3. Apply the Question-to-Table mapping rules below to select the correct tables for your query
+4. Use get_table_info(['table1', 'table2']) to get detailed schemas ONLY for the tables you need
+5. Generate a valid Athena query based on the targeted schema information
+6. **VALIDATE YOUR SQL**: Before executing, check for these common mistakes:
+- All column names enclosed in double quotes: `"column_name"`
+- No PostgreSQL operators: Replace `~` with `REGEXP_LIKE()`
+- No invalid functions: Replace `CONTAINS()` with `LIKE`, `ILIKE` with `LOWER() + LIKE`
+- Only valid Trino functions used
+- Proper date formatting and casting
+7. Execute your validated query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the athena results directly. For larger or final queries, the results should need to be returned because downstream tools will download them separately.
 8. Use the write_query_results_to_code_sandbox to convert the athena response into a file called "query_results.csv" in the same environment future python scripts will be executed.
 9. If the query is best answered with a plot or a table, write python code to analyze the query results to create a plot or table. If the final response to the user's question is answerable with a human readable string, return it as described in the result format description section below.
 10. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.

+# CRITICAL: Two-Step Database Information Approach
+**For optimal performance and accuracy:**
+
+## Step 1: Overview (Fast)
+- Always start with `get_database_overview()` to see available tables
+- This gives you table names, purposes, and question-to-table mapping guidance
+- **~500 tokens vs 3000+ tokens** - much faster for simple questions
+
+## Step 2: Detailed Schemas (On-Demand)
+- Use `get_table_info(['table1', 'table2'])` for specific tables you need
+- Only request detailed info for tables relevant to your query
+- Get complete column listings, sample queries, and aggregation rules
+
 # CRITICAL: Question-to-Table Mapping Rules
 **ALWAYS follow these rules to select the correct table:**

@@ -94,15 +117,42 @@ def create_analytics_agent(
 DO NOT attempt to execute multiple tools in parallel. The input of some tools depend on the output of others. Only ever execute one tool at a time.

 # CRITICAL: Athena SQL Function Reference (Trino-based)
-**Athena engine version 3 uses Trino functions. DO NOT use invalid functions like CONTAINS(varchar, varchar).**
+**Athena engine version 3 uses Trino functions. DO NOT use PostgreSQL-style operators or invalid functions.**
+
+## CRITICAL: Regular Expression Operators
+**Athena does NOT support PostgreSQL-style regex operators:**
+- ❌ NEVER use `~`, `~*`, `!~`, or `!~*` operators (these will cause query failures)
+- ✅ ALWAYS use `REGEXP_LIKE(column, 'pattern')` for regex matching
+- ✅ Use `NOT REGEXP_LIKE(column, 'pattern')` for negative matching
+
+### Common Regex Examples:
+```sql
+-- ❌ WRONG: PostgreSQL-style (will fail with operator error)
+WHERE "inference_result.wages" ~ '^[0-9.]+$'
+WHERE "service_api" ~* 'classification'
+WHERE "document_type" !~ 'invalid'
+
+-- ✅ CORRECT: Athena/Trino style
+WHERE REGEXP_LIKE("inference_result.wages", '^[0-9.]+$')
+WHERE REGEXP_LIKE(LOWER("service_api"), 'classification')
+WHERE NOT REGEXP_LIKE("document_type", 'invalid')
+```

-## Valid String Functions:
-- `LIKE '%pattern%'` - Pattern matching (NOT CONTAINS)
-- `REGEXP_LIKE(string, pattern)` - Regular expression matching
+## Valid String Functions (Trino-based):
+- `LIKE '%pattern%'` - Pattern matching (NOT CONTAINS function)
+- `REGEXP_LIKE(string, pattern)` - Regular expression matching (NOT ~ operator)
 - `LOWER()`, `UPPER()` - Case conversion
+- `POSITION(substring IN string)` - Find substring position (NOT STRPOS)
 - `SUBSTRING(string, start, length)` - String extraction
 - `CONCAT(string1, string2)` - String concatenation
 - `LENGTH(string)` - String length
+- `TRIM(string)` - Remove whitespace
+
+## ❌ COMMON MISTAKES - Functions/Operators that DON'T exist in Athena:
+- `CONTAINS(string, substring)` → Use `string LIKE '%substring%'`
+- `ILIKE` operator → Use `LOWER(column) LIKE LOWER('pattern')`
+- `STRPOS(string, substring)` → Use `POSITION(substring IN string)`
+- `~` regex operator → Use `REGEXP_LIKE(column, 'pattern')`

 ## Valid Date/Time Functions:
 - `CURRENT_DATE` - Current date
@@ -118,21 +168,25 @@ def create_analytics_agent(
 -- ❌ WRONG: Invalid function
 WHERE CONTAINS("service_api", 'classification')

+-- ✅ CORRECT: Numeric validation with regex
+WHERE REGEXP_LIKE("inference_result.amount", '^[0-9]+\.?[0-9]*$')
+
+-- ❌ WRONG: PostgreSQL regex operator
+WHERE "inference_result.amount" ~ '^[0-9.]+$'
+
+-- ✅ CORRECT: Case-insensitive pattern matching
+WHERE LOWER("document_type") LIKE LOWER('%invoice%')
+
+-- ❌ WRONG: ILIKE operator
+WHERE "document_type" ILIKE '%invoice%'
+
 -- ✅ CORRECT: Today's data
 WHERE "date" = CAST(CURRENT_DATE AS VARCHAR)

 -- ✅ CORRECT: Date range
 WHERE "date" >= '2024-01-01' AND "date" <= '2024-12-31'
 ```
-
-# Complete Schema Information Usage:
-**The get_database_info tool provides COMPLETE information including:**
-- ✅ All table names with exact spelling
-- ✅ All column names with exact syntax
-- ✅ Sample queries for common patterns
-- ✅ Critical aggregation rules (MAX vs SUM)
-- ✅ Dot-notation column explanations
-
+
 **TRUST THIS INFORMATION - Do not run discovery queries like SHOW TABLES or DESCRIBE unless genuinely needed.**

 When generating Athena queries:
@@ -145,6 +199,10 @@ def create_analytics_agent(
 - **Prefer simple queries**: Complex logic can be handled in Python post-processing

 ## Error Recovery Patterns:
+- **`~ operator not found`** → Replace with `REGEXP_LIKE(column, 'pattern')`
+- **`ILIKE operator not found`** → Use `LOWER(column) LIKE LOWER('pattern')`
+- **`Function CONTAINS not found`** → Use `column LIKE '%substring%'`
+- **`Function STRPOS not found`** → Use `POSITION(substring IN column)`
 - **Column not found** → Check double quotes: `"column_name"`
 - **Function not found** → Use valid Trino functions only
 - **0 rows returned** → Check table names, date filters, and case sensitivity
@@ -229,7 +287,8 @@ def run_athena_query_with_config(
         run_athena_query_with_config,
         code_interpreter_tools.write_query_results_to_code_sandbox,
         code_interpreter_tools.execute_python,
-        get_database_info,
+        get_database_overview,  # Fast, lightweight table overview
+        get_table_info,  # Detailed schema for specific tables
     ]

     # Get model ID from environment variable
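
Review note: the SQL checklist this diff adds to the prompt (step 6) relies on the model re-reading its own query. The same rules could be checked in code before `run_athena_query` is called. The sketch below is only an illustration under that assumption: `find_athena_sql_issues` is a hypothetical helper that is not part of this commit, and it encodes the rules exactly as the prompt states them without verifying them against the Trino function list.

```python
import re

# Hypothetical pre-flight check; not part of the commit. Each entry pairs a
# regex for a construct the prompt forbids with the replacement it recommends.
_FORBIDDEN_PATTERNS = [
    (re.compile(r"!?~\*?"), "PostgreSQL regex operator: use REGEXP_LIKE(column, 'pattern')"),
    (re.compile(r"\bILIKE\b", re.IGNORECASE), "ILIKE: use LOWER(column) LIKE LOWER('pattern')"),
    (re.compile(r"\bCONTAINS\s*\(", re.IGNORECASE), "CONTAINS(): use column LIKE '%substring%'"),
    (re.compile(r"\bSTRPOS\s*\(", re.IGNORECASE), "STRPOS(): use POSITION(substring IN column)"),
]

# Matches a single-quoted SQL string literal, including doubled '' escapes.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


def find_athena_sql_issues(sql: str) -> list[str]:
    """Return one warning per forbidden construct found in the query text."""
    # Blank out string literals so a '~' inside a regex pattern or a value
    # is not mistaken for an operator.
    stripped = _STRING_LITERAL.sub("''", sql)
    return [message for pattern, message in _FORBIDDEN_PATTERNS if pattern.search(stripped)]
```

For example, `find_athena_sql_issues("""SELECT * FROM t WHERE "a" ~ '^[0-9]+$'""")` returns a single warning that points to `REGEXP_LIKE`. This is a text-level heuristic rather than a SQL parser, so it would also flag a `contains(...)` call on an array column.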

lib/idp_common_pkg/idp_common/agents/analytics/config.py

Lines changed: 0 additions & 48 deletions
@@ -48,54 +48,6 @@ def get_analytics_config() -> Dict[str, Any]:
     return config


-def load_db_description() -> str:
-    """
-    Load the database description using the comprehensive schema provider.
-
-    This function now generates detailed table descriptions including:
-    - Metering table with proper aggregation patterns
-    - Evaluation tables with comprehensive schemas
-    - Dynamic document sections tables based on configuration
-
-    Returns:
-        String containing the comprehensive database description
-    """
-    try:
-        # Import here to avoid circular imports
-        from .schema_provider import generate_comprehensive_database_description
-
-        logger.info("Loading comprehensive database description from schema provider")
-        description = generate_comprehensive_database_description()
-        logger.debug(f"Generated database description of length: {len(description)}")
-        return description
-
-    except Exception as e:
-        logger.error(f"Error loading comprehensive database description: {e}")
-        # Fallback to basic description if schema provider fails
-        return """
-# Database Schema Information
-
-## Note
-Advanced schema information is temporarily unavailable. Use the following basic queries to explore:
-
-```sql
--- List all tables
-SHOW TABLES
-
--- Describe table structure
-DESCRIBE table_name
-
--- Explore metering data
-SELECT * FROM metering LIMIT 10
-
--- List document sections tables
-SHOW TABLES LIKE 'document_sections*'
-```
-
-**Important**: Always enclose column names in double quotes in Athena queries.
-"""
-
-
 def load_python_plot_generation_examples() -> str:
     """
     Load sample python plot generation examples.