
Commit 2e45893

Author: Bob Strahan (committed)
Add analytics agent schema provider and enhance UI message display
1 parent 23c10ad commit 2e45893

File tree

5 files changed: +964 additions, −145 deletions


lib/idp_common_pkg/idp_common/agents/analytics/agent.py

Lines changed: 16 additions & 12 deletions
@@ -50,22 +50,26 @@ def create_analytics_agent(
 # Task
 Your task is to:
 1. Understand the user's question
-2. Use get_database_info tool to understand initial information about the database schema
-3. Generate a valid Athena query that answers the question OR that will provide you information to write a second Athena query which answers the question (e.g. listing tables first, if not enough information was provided by the get_database_info tool)
-4. Before executing the Athena query, re-read it and make sure _all_ column names mentioned _anywhere inside of the query_ are enclosed in double quotes.
-5. Execute your revised query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the Athena results directly. For larger or final queries, the results should not need to be returned, because downstream tools will download them separately.
-6. Use the write_query_results_to_code_sandbox tool to convert the Athena response into a file called "query_results.csv" in the same environment where future Python scripts will be executed.
-7. If the query is best answered with a plot or a table, write Python code that analyzes the query results to create a plot or table. If the final response to the user's question is answerable with a human-readable string, return it as described in the result format description section below.
-8. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.
+2. Use the get_database_info tool to get comprehensive database schema information (this now includes detailed table descriptions, column schemas, usage patterns, and sample queries)
+3. Analyze the provided schema information to determine the appropriate tables and columns for your query - the schema info includes detailed guidance on which tables to use for different types of questions
+4. Generate a valid Athena query based on the comprehensive schema information. Only use exploratory queries (SHOW TABLES, DESCRIBE) if the provided schema info is insufficient for your specific question
+5. Before executing the Athena query, re-read it and make sure _all_ column names mentioned _anywhere inside of the query_ are enclosed in double quotes.
+6. Execute your revised query using the run_athena_query tool. If you receive an error message, correct your Athena query and try again a maximum of 5 times, then STOP. Do not ever make up fake data. For exploratory queries you can return the Athena results directly. For larger or final queries, the results should not need to be returned, because downstream tools will download them separately.
+7. Use the write_query_results_to_code_sandbox tool to convert the Athena response into a file called "query_results.csv" in the same environment where future Python scripts will be executed.
+8. If the query is best answered with a plot or a table, write Python code that analyzes the query results to create a plot or table. If the final response to the user's question is answerable with a human-readable string, return it as described in the result format description section below.
+9. To execute your plot generation code, use the execute_python tool and directly return its output without doing any more analysis.
 
 DO NOT attempt to execute multiple tools in parallel. The input of some tools depends on the output of others. Only ever execute one tool at a time.
 
-When generating Athena:
-- ALWAYS put ALL column names in double quotes when included ANYHWERE inside of a query.
+When generating Athena queries:
+- ALWAYS put ALL column names in double quotes when included ANYWHERE inside of a query.
 - Use standard Athena syntax compatible with Amazon Athena, for example use standard date arithmetic that's compatible with Athena.
-- Do not guess at table or column names. Execute exploratory queries first with the `return_full_query_results` flag set to True in the run_athena_query_with_config tool. Your final query should use `return_full_query_results` set to False. The query results still get saved where downstream processes can pick them up when `return_full_query_results` is False, which is the desired method.
-- Use a "SHOW TABLES" query to list all dynamic tables available to you.
-- Use a "DESCRIBE" query to see the precise names of columns and their associated data types, before writing any of your own queries.
+- Leverage the comprehensive schema information provided by get_database_info first - it includes detailed table descriptions, column schemas, usage patterns, and critical aggregation rules
+- Pay special attention to the metering table aggregation patterns (use MAX for page counts per document, not SUM, since values are replicated)
+- For questions about volume/costs/consumption, use the metering table as described in the schema
+- For questions about accuracy, use the evaluation tables (but note they may be empty if no evaluation jobs were run)
+- For questions about extracted content, use the appropriate document_sections_* tables
+- Only use exploratory queries like "SHOW TABLES" or "DESCRIBE" if the comprehensive schema information doesn't provide enough detail for your specific question
 - Include appropriate table joins when needed
 - Use column names exactly as they appear in the schema, ALWAYS in double quotes within your query.
 - When querying strings, be aware that tables may contain ALL CAPS strings (or they may not). So, make your queries agnostic to case whenever possible.
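The two rules the revised prompt hammers on - double-quoting every column name and using MAX rather than SUM for replicated per-document page counts - are easiest to see in a concrete query. Below is a minimal sketch, not code from this commit, of how a caller might run such a query against the metering table; the database name, S3 output location, and the 'pages' unit value are illustrative assumptions.

```python
# Illustrative sketch, not code from this commit: run a page-count query
# that follows both prompt rules. Every column name is double-quoted, and
# the replicated per-document page counts are reduced with MAX before
# summing across documents.
import boto3

# Hypothetical settings; the deployed stack wires these up differently.
DATABASE = "idp_analytics"
OUTPUT_LOCATION = "s3://example-bucket/athena-results/"

QUERY = """
SELECT SUM("doc_pages") AS "total_pages"
FROM (
    SELECT "document_id", MAX("value") AS "doc_pages"
    FROM metering
    WHERE "unit" = 'pages'  -- unit name assumed for illustration
    GROUP BY "document_id"
) AS per_document
"""

athena = boto3.client("athena")
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
print(execution["QueryExecutionId"])
```

Submitting through start_query_execution mirrors what a run_athena_query-style tool would do under the hood: Athena writes the result CSV to the output location, where downstream steps can fetch it without the full result set being returned inline.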

lib/idp_common_pkg/idp_common/agents/analytics/config.py

Lines changed: 34 additions & 132 deletions
@@ -50,147 +50,49 @@ def get_analytics_config() -> Dict[str, Any]:
 
 def load_db_description() -> str:
     """
-    Load the database description from the assets directory.
-    TODO: this is hard-coded for now because the assets directory was hard to find in the Lambda environment.
+    Load the database description using the comprehensive schema provider.
+
+    This function now generates detailed table descriptions including:
+    - Metering table with proper aggregation patterns
+    - Evaluation tables with comprehensive schemas
+    - Dynamic document sections tables based on configuration
 
     Returns:
-        String containing the database description
+        String containing the comprehensive database description
     """
-
-    return """
-# Athena Table Information
-
-## Overview
-
-### Metering table
-
-Metering Table (metering)
-* Captures detailed usage metrics for document processing operations
-* Useful for monitoring IDP application usage (document processing throughput, costs, token usage, etc.)
-* Populated every time a document is processed (even if no evaluations are run)
-
-The metering table is particularly valuable for:
-- Cost analysis and allocation
-- Usage pattern identification
-- Resource optimization
-- Performance benchmarking across different document types and sizes
-
-### Dynamic Document Section Tables
-
-The solution also creates dynamic tables for document sections. The document sections tables store the actual extracted data from document sections in a structured format suitable for analytics. These tables are automatically discovered by AWS Glue Crawler and are organized by section type (classification).
-
-* Tables are automatically created based on the section classification
-* Each section type gets its own table (e.g., document_sections_invoice, document_sections_receipt)
-* Common columns include: section_id, document_id, section_classification, section_confidence, timestamp
-* Additional columns are dynamically inferred from the JSON extraction results
-* Tables are partitioned by date (YYYY-MM-DD format)
-* When querying columns with a period in their name (e.g. inference_result.currentnetpay), the column name must be enclosed in quotation marks to be compatible with Athena querying, for example `SELECT "inference_result.currentnetpay" FROM document_sections_payslip`.
-
-Many columns in dynamic document section tables are dynamically inferred from the JSON extraction results and vary by section type. Common patterns include:
-- Nested JSON objects are flattened using dot notation (e.g., `customer.name`, `customer.address.street`)
-- Arrays are converted to JSON strings
-- Primitive values (strings, numbers, booleans) are preserved as their native types
-
-Each section type table is also partitioned by date (YYYY-MM-DD format) for efficient querying.
-
-Since the columns and data types in the document section tables are dynamic, you may want to run a query to learn more about them before writing any other queries. Consider e.g. executing `SHOW TABLES` to see what dynamic tables exist, or `DESCRIBE document_sections_payslip` to learn more about that table (if it exists).
-
-### Evaluation tables
-
-The evaluation tables store metrics and results from comparing extracted document data against baseline (ground truth) data. These tables provide insights into the accuracy and performance of the document processing system. They are usually empty unless the user has run separate evaluation jobs. These tables are useful if the user wants to know about the "accuracy" of their solution, for example.
-
-1. Document Evaluations Table (document_evaluations)
-* Only useful if users have run "evaluation" jobs, which are not run by default. Contains information like accuracy compared to ground truth datasets.
-* Contains document-level evaluation metrics
-* Columns include: document_id, input_key, evaluation_date, accuracy, precision, recall, f1_score, false_alarm_rate, false_discovery_rate, execution_time
-* Partitioned by date (YYYY-MM-DD format)
-
-2. Section Evaluations Table (section_evaluations)
-* Only useful if users have run "evaluation" jobs, which are not run by default. Contains information like accuracy compared to ground truth datasets.
-* Contains section-level evaluation metrics
-* Columns include: document_id, section_id, section_type, accuracy, precision, recall, f1_score, false_alarm_rate, false_discovery_rate, evaluation_date
-* Partitioned by date (YYYY-MM-DD format)
-
-3. Attribute Evaluations Table (attribute_evaluations)
-* Only useful if users have run "evaluation" jobs, which are not run by default. Contains information like accuracy compared to ground truth datasets.
-* Contains attribute-level evaluation metrics
-* Columns include: document_id, section_id, section_type, attribute_name, expected, actual, matched, score, reason, evaluation_method, confidence, confidence_threshold, evaluation_date
-* Partitioned by date (YYYY-MM-DD format)
-
-## Additional notes
-
-* The "timestamp" and "date" columns pertain to when the document was uploaded to the system for processing, NOT any dates on the document itself.
-
-## Sample Athena Queries
-
-Here are some example queries to get you started:
-
-**List tables available, including dynamic ones**
-```athena
+    try:
+        # Import here to avoid circular imports
+        from .schema_provider import generate_comprehensive_database_description
+
+        logger.info("Loading comprehensive database description from schema provider")
+        description = generate_comprehensive_database_description()
+        logger.debug(f"Generated database description of length: {len(description)}")
+        return description
+
+    except Exception as e:
+        logger.error(f"Error loading comprehensive database description: {e}")
+        # Fall back to a basic description if the schema provider fails
+        return """
+# Database Schema Information
+
+## Note
+Advanced schema information is temporarily unavailable. Use the following basic queries to explore:
+
+```sql
+-- List all tables
 SHOW TABLES
-```
-
-**List tables available starting with "document_sections" with wildcard notation**
-```athena
-SHOW TABLES LIKE 'document_sections*'
-```
 
-**List tables available starting with "document_sections" with an SQL LIKE pattern**
-```athena
-SHOW TABLES WHERE table_name LIKE 'document_sections%' -- Uses SQL LIKE pattern
-```
-
-(Note that `SHOW TABLES LIKE 'document_sections%'` does NOT work here because in Athena the pattern is treated as a regular expression rather than an SQL LIKE pattern. In regular expressions, '%' doesn't serve as a wildcard - it denotes "zero or more occurrences of the preceding element.")
+-- Describe table structure
+DESCRIBE table_name
 
+-- Explore metering data
+SELECT * FROM metering LIMIT 10
 
-**View the dynamic columns of a table named "document_sections_payslip"**
-```athena
-DESCRIBE document_sections_payslip
-```
-
-**Token usage by model:**
-```athena
-SELECT
-    "service_api",
-    SUM(CASE WHEN "unit" = "inputTokens" THEN value ELSE 0 END) as total_input_tokens,
-    SUM(CASE WHEN "unit" = "outputTokens" THEN value ELSE 0 END) as total_output_tokens,
-    SUM(CASE WHEN "unit" = "totalTokens" THEN value ELSE 0 END) as total_tokens,
-    COUNT(DISTINCT "document_id") as document_count
-FROM
-    metering
-GROUP BY
-    "service_api"
-ORDER BY
-    total_tokens DESC;
-```
-(note all columns are included within double quotes)
-
-**Total net pay added across all paystub type documents**
-```athena
-SELECT SUM(CAST(REPLACE(REPLACE("inference_result.currentnetpay", '$', ''), ',', '') AS DECIMAL(10,2))) as total_net_pay
-FROM document_sections_payslip;
+-- List document sections tables
+SHOW TABLES LIKE 'document_sections*'
 ```
-(note the double quotation marks around the column name)
 
-**All payslip information for an employee named David Calico**
-```athena
-SELECT * FROM document_sections_payslip WHERE LOWER("inference_result.employeename.firstname") = 'david' AND LOWER("inference_result.employeename.lastname") = 'calico'
-```
-(note the use of LOWER because the case of strings in the database is unknown, and note the double quotation marks around the column name)
-
-**Overall accuracy by document type:**
-```athena
-SELECT
-    "section_type",
-    AVG("accuracy") as avg_accuracy,
-    COUNT(*) as document_count
-FROM
-    section_evaluations
-GROUP BY
-    "section_type"
-ORDER BY
-    "avg_accuracy" DESC;
-```
+**Important**: Always enclose column names in double quotes in Athena queries.
 """
 
 
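The schema_provider module imported above is one of the five files changed in this commit, but its diff is not shown in this excerpt. As a rough sketch of the interface the import implies - only the function name is taken from the diff; the table metadata and assembly logic are illustrative assumptions - it could look like:

```python
# schema_provider.py -- hypothetical sketch, not the commit's implementation.
# Only the function name mirrors the import in config.py above; the table
# metadata and assembly logic below are illustrative assumptions.
from typing import Dict

# Abridged static descriptions for the fixed tables.
_STATIC_TABLES: Dict[str, str] = {
    "metering": (
        "Per-document usage metrics (tokens, pages, costs). Page counts "
        "are replicated across rows, so aggregate them with MAX per "
        "document, never SUM."
    ),
    "document_evaluations": (
        "Document-level accuracy metrics; empty unless evaluation jobs "
        "have been run."
    ),
}


def generate_comprehensive_database_description() -> str:
    """Assemble a markdown description of the analytics tables."""
    parts = ["# Athena Table Information"]
    for table, description in _STATIC_TABLES.items():
        parts.append(f"## {table}\n\n{description}")
    # The real provider presumably also derives document_sections_* table
    # descriptions from the deployed configuration; that step is omitted here.
    return "\n\n".join(parts)
```

Structured this way, load_db_description() degrades gracefully: if anything in the provider raises, config.py catches the exception and returns the static fallback text instead.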
