docs: Update evaluation framework to use configuration-based control

Bob Strahan · Bob Strahan · commit 6fb00c8381dc · 2025-10-26T16:26:56.000Z
diff --git a/docs/configuration.md b/docs/configuration.md
@@ -155,7 +155,6 @@ Key parameters that can be configured during CloudFormation deployment:
 
 ### Optional Features
 - `EvaluationBaselineBucketName`: Optional existing bucket for ground truth data
-- `EvaluationAutoEnabled`: Enable automatic accuracy evaluation (default: true)
 - `DocumentKnowledgeBase`: Enable document knowledge base functionality
 - `KnowledgeBaseModelId`: Bedrock model for knowledge base queries
 - `PostProcessingLambdaHookFunctionArn`: Optional Lambda ARN for custom post-processing (see [post-processing-lambda-hook.md](post-processing-lambda-hook.md) for detailed implementation guidance)
diff --git a/docs/evaluation.md b/docs/evaluation.md
@@ -16,9 +16,11 @@ The GenAIIDP solution includes a built-in evaluation framework to assess the acc
    - Use an existing bucket or let the solution create one
    - Can use outputs from another GenAIIDP stack to compare different patterns/prompts
 
-2. **Automatic Evaluation**
-   - When enabled, automatically evaluates each processed document
-   - Compares against baseline data if available
+2. **Integrated Evaluation Step**
+   - Evaluation runs as the final step in the Step Functions workflow (after summarization)
+   - Executes **before** the workflow marks documents as COMPLETE, eliminating race conditions
+   - When `evaluation.enabled: true` in configuration, evaluates against baseline data if available
+   - When `evaluation.enabled: false` in configuration, step executes but skips processing
    - Generates detailed markdown reports using AI analysis
 
 3. **Evaluation Reports**
@@ -79,21 +81,38 @@ The confidence integration is fully backward compatible:
 
 ## Configuration
 
-Set the following parameters during stack deployment:
+### Stack Deployment Parameters
+
+Set the following parameter during stack deployment:
 
 ```yaml
 EvaluationBaselineBucketName:
   Description: Existing bucket with baseline data, or leave empty to create new bucket
-  
-EvaluationAutoEnabled:
-  Default: true
-  Description: Automatically evaluate each document (if baseline exists)
-  
-EvaluationModelId:
-  Default: "anthropic.claude-3-sonnet-20240229-v1:0"
-  Description: Model to use for evaluation reports (e.g., "us.anthropic.claude-3-7-sonnet-20250219-v1:0")
 ```
 
+### Runtime Configuration
+
+Control evaluation behavior through the configuration file (no stack redeployment needed):
+
+```yaml
+evaluation:
+  enabled: true  # Set to false to disable evaluation processing
+  llm_method:
+    model: "us.anthropic.claude-3-haiku-20240307-v1:0"  # Model for evaluation reports
+    temperature: "0.0"
+    top_p: "0.1"
+    max_tokens: "4096"
+    # Additional model parameters...
+```
+
+**Benefits of Configuration-Based Control:**
+- Enable/disable evaluation without stack redeployment
+- Runtime control similar to summarization and assessment features
+- Zero LLM costs when disabled (step executes but skips processing)
+- Consistent feature control pattern across the solution
+
+### Attribute-Specific Evaluation Methods
+
 You can also configure evaluation methods for specific document classes and attributes through the solution's configuration. The framework supports three types of attributes with different evaluation approaches:
 
 ### Simple Attributes
diff --git a/docs/idp-cli.md b/docs/idp-cli.md
@@ -709,20 +709,10 @@ idp-cli run-inference \
 
 Download the evaluation results to analyze accuracy:
 
-**⏱️ Important Timing Note:** Evaluation processing runs as a separate step after the main document processing completes. This takes an additional 2-3 minutes per document. If you download results immediately after the batch shows "Complete", the evaluation data may not be ready yet.
-
-**Best practice:**
-1. Wait 5-10 minutes after batch completion before downloading evaluation results
-2. Check that the downloaded files include the `evaluation/` directory
-3. If evaluation data is missing, wait a few more minutes and download again
+**✓ Synchronous Evaluation:** Evaluation runs as the final step in the workflow before completion. When a document shows status "COMPLETE", all processing including evaluation is finished - results are immediately available for download.
 
 ```bash
-# Wait for evaluation to complete (check status)
-idp-cli status \
-    --stack-name eval-testing \
-    --batch-id eval-run-001
-
-# Download evaluation results
+# Download evaluation results (no waiting needed)
 idp-cli download-results \
     --stack-name eval-testing \
     --batch-id eval-run-001 \
diff --git a/docs/idp-configuration-best-practices.md b/docs/idp-configuration-best-practices.md
@@ -1362,16 +1362,10 @@ Set the following parameters during stack deployment:
 ```yaml
 EvaluationBaselineBucketName:
   Description: Existing bucket with baseline data, or leave empty to create new bucket
-  
-EvaluationAutoEnabled:
-  Default: true
-  Description: Automatically evaluate each document (if baseline exists)
-  
-EvaluationModelId:
-  Default: "anthropic.claude-3-sonnet-20240229-v1:0"
-  Description: Model to use for evaluation reports
 ```
 
+**Note:** Evaluation is now controlled via configuration file (`evaluation.enabled: true/false`) rather than stack parameters. See the [evaluation.md](./evaluation.md) documentation for details.
+
 ### Evaluation Methods Configuration
 
 Configure evaluation methods for specific document classes and attributes: