Commit be57867
Rename Grafana config fields for consistency and add backward compatibility (#1498)
## Summary

This PR renames configuration fields in Grafana-related toolsets for improved clarity and consistency, while maintaining backward compatibility with existing configurations.

## Key Changes

- **Field Renames:**
  - `url` → `api_url` across all Grafana toolsets (Dashboards, Loki, Tempo)
  - `headers` → `additional_headers` for clarity about their purpose
- **Backward Compatibility:**
  - Added `_deprecated_mappings` to the `GrafanaConfig` class to automatically map old field names to new ones
  - Existing configurations using old field names continue to work without modification
  - New field names take precedence when both old and new names are provided
- **Documentation Updates:**
  - Updated all documentation files to reflect the new field names
  - Updated example configurations in CLAUDE.md and all toolset-specific docs
  - Updated test fixtures to use the new field names
- **Code Updates:**
  - Modified the `GrafanaConfig` class in `holmes/plugins/toolsets/grafana/common.py` to support field mapping
  - Updated all references from `config.url` to `config.api_url`
  - Updated all references from `config.headers` to `config.additional_headers`
  - Updated test cases to use the new field names

## Implementation Details

Backward compatibility is implemented using Pydantic's field validation mechanism. The `_deprecated_mappings` class variable defines the mapping between old and new field names, allowing the configuration parser to automatically translate old configurations to the new format while maintaining full compatibility.

All test cases have been updated to verify both that the new field names work correctly and that the deprecated field names are properly mapped to their new equivalents.
https://claude.ai/code/session_01EwZdqLZkUq4NKduWDmBFDa

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Documentation**
  * Major expansion of the evaluation/test workflow guide with detailed setup, prompts, CI secret handling, examples, and best practices; Grafana toolset docs updated for new config keys.
* **Refactor**
  * Config keys renamed for clarity: `url` → `api_url` and `headers` → `additional_headers`, with backward-compatibility support.
* **Tests**
  * Fixtures and unit tests updated to use the new config keys to preserve test compatibility.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
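The `_deprecated_mappings` mechanism described above can be sketched as a Pydantic `model_validator` running in `"before"` mode. This is a minimal illustration under assumed field names, not the actual `GrafanaConfig` from `holmes/plugins/toolsets/grafana/common.py`:

```python
from typing import Any, ClassVar, Dict, Optional

from pydantic import BaseModel, model_validator


class GrafanaConfig(BaseModel):
    # New canonical field names (other fields omitted for brevity)
    api_url: str
    api_key: Optional[str] = None
    additional_headers: Optional[Dict[str, str]] = None

    # Old name -> new name, consulted before normal field validation
    _deprecated_mappings: ClassVar[Dict[str, str]] = {
        "url": "api_url",
        "headers": "additional_headers",
    }

    @model_validator(mode="before")
    @classmethod
    def _map_deprecated_fields(cls, values: Any) -> Any:
        if isinstance(values, dict):
            for old, new in cls._deprecated_mappings.items():
                if old in values:
                    value = values.pop(old)
                    # setdefault means the new name wins when both are provided
                    values.setdefault(new, value)
        return values


# An old-style config still loads; values land on the new fields
cfg = GrafanaConfig(url="http://grafana:3000", headers={"X-Org": "1"})
```

The `setdefault` call gives the stated precedence behavior for free: if a user supplies both `url` and `api_url`, the deprecated key is dropped and the new key is left untouched.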
1 parent e9fa6ce commit be57867

File tree

27 files changed

+329
-69
lines changed


CLAUDE.md

Lines changed: 194 additions & 4 deletions
@@ -259,10 +259,200 @@ For the complete eval CLI reference (flags, env vars, model comparison, debuggin
 
 For creating, running, and debugging LLM eval tests, use the `/create-eval` skill. It contains the complete workflow, test_case.yaml field reference, anti-hallucination patterns, infrastructure setup guides, and CLI reference.
 
-**Always run evals before submitting when possible:**
-1. `poetry run pytest -k "test_name" --only-setup --no-cov` — verify setup
-2. `poetry run pytest -k "test_name" --no-cov` — run full test
-3. Verify cleanup: `kubectl get namespace app-NNN` should return NotFound
+**Test Structure:**
+- Use sequential test numbers: check existing tests for next available number
+- Required files: `test_case.yaml`, infrastructure manifests, `toolsets.yaml` (if needed)
+- Use dedicated namespace per test: `app-<testid>` (e.g., `app-177`)
+- All resource names must be unique across tests to prevent conflicts
+
+**Tags:**
+- **CRITICAL**: Only use valid tags from `pyproject.toml` - invalid tags cause test collection failures
+- Check existing tags before adding new ones, ask user permission for new tags
+
+**Cloud Service Evals (No Kubernetes Required)**:
+- Evals can test against cloud services (Elasticsearch, external APIs) directly via environment variables
+- Faster setup (<30 seconds vs minutes for K8s infrastructure)
+- `before_test` creates test data in the cloud service, `after_test` cleans up
+- Use `toolsets.yaml` to configure the toolset with env var references: `api_url: "{{ env.ELASTICSEARCH_URL }}"`
+- **CI/CD secrets**: When adding evals for a new integration, you must add the required environment variables to `.github/workflows/eval-regression.yaml` in the "Run tests" step. Tell the user which secrets they need to add to their GitHub repository settings (e.g., `ELASTICSEARCH_URL`, `ELASTICSEARCH_API_KEY`).
+- **HTTP request passthrough**: The root `conftest.py` has a `responses` fixture with `autouse=True` that mocks ALL HTTP requests by default. When adding a new cloud integration, you MUST add the service's URL pattern to the passthrough list in `conftest.py` (search for `rsps.add_passthru`). Use `re.compile()` for pattern matching (e.g., `rsps.add_passthru(re.compile(r"https://.*\.cloud\.es\.io"))`).
+
+**User Prompts & Expected Outputs:**
+- **Be specific**: Test exact values like `"The dashboard title is 'Home'"` not generic `"Holmes retrieves dashboard"`
+- **Match prompt to test**: User prompt must explicitly request what you're testing
+  - BAD: `"Get the dashboard"`
+  - GOOD: `"Get the dashboard and tell me the title, panels, and time range"`
+- **Anti-cheat prompts**: Don't use technical terms that give away solutions
+  - BAD: `"Find node_exporter metrics"`
+  - GOOD: `"Find CPU pressure monitoring queries"`
+- **Test discovery, not recognition**: Holmes should search/analyze, not guess from context
+- **Ruling out hallucinations is paramount**: When choosing between test approaches, prefer the one that rules out hallucinations:
+  - **Best**: Check specific values that can only be discovered by querying (e.g., unique IDs, injected error codes, exact counts)
+  - **Acceptable**: Use `include_tool_calls: true` to verify the tool was called when output values are too generic to rule out hallucinations
+  - **Bad**: Check generic output patterns that an LLM could plausibly guess (e.g., "cluster status is green/yellow/red", "has N nodes")
+- **expected_output is invisible to LLM**: The `expected_output` field is only used by the evaluator - the LLM never sees it. This means:
+  - You can safely put secrets/verification codes in `expected_output` that the LLM must discover
+  - `before_test` can inject a unique verification code into test data, and `expected_output` can check for it
+  - This is a powerful pattern for cloud service tests: create data with a unique code in `before_test`, ask LLM to find it, verify with `expected_output`
+  ```yaml
+  # Example: before_test creates a page with verification code "HOLMES-EVAL-7x9k2m4p"
+  # The LLM must discover this code by querying the service
+  expected_output:
+    - "Must report the verification code: HOLMES-EVAL-7x9k2m4p"
+  ```
+- **`include_tool_calls: true`**: Use when expected output is too generic to be hallucination-proof. Prefer specific answer checking when possible, but verifying tool calls is better than a test that can't rule out hallucinations.
+  ```yaml
+  # Use when values are generic (cluster health could be guessed)
+  include_tool_calls: true
+  expected_output:
+    - "Must call elasticsearch_cluster_health tool"
+    - "Must report cluster status"
+  ```
+
+**Infrastructure Setup:**
+- **Don't just test pod readiness** - verify actual service functionality
+- Poll real API endpoints and check for expected content (e.g., `"title":"Home"`, `"type":"welcome"`)
+- **CRITICAL**: Use `exit 1` when setup verification fails to fail the test early
+- **Never use `:latest` container tags** - use specific versions like `grafana/grafana:12.3.1`
+
+### Running and Testing Evals
+
+## 🚨 CRITICAL: Always Test Your Changes
+
+**NEVER submit test changes without verification**:
+
+### Required Testing Workflow:
+1. **Setup Phase**: `poetry run pytest -k "test_name" --only-setup --no-cov`
+2. **Full Test**: `poetry run pytest -k "test_name" --no-cov`
+3. **Verify Results**: Ensure 100% pass rate and expected behavior
+
+### When to Test:
+- ✅ After creating new tests
+- ✅ After modifying existing tests
+- ✅ After refactoring shared infrastructure
+- ✅ After performance optimizations
+- ✅ After adding/changing tags
+
+### Red Flags - Never Skip Testing:
+- ❌ "The changes look good" without running
+- ❌ "It's just a small change"
+- ❌ "I'll test it later"
+
+**Testing is Part of Development**: Testing is not optional - it's an integral part of the development process. Untested code is broken code.
+
+**Testing Methodology:**
+- Phase 1: Test setup with `--only-setup` flag first
+- Phase 2: Run full test after confirming setup works
+- Use background execution for long tests: `nohup ... > logfile.log 2>&1 &`
+- Handle port conflicts: clean up previous test port forwards before running
+
+**Common Flags:**
+- `--skip-cleanup`: Keep resources after test (useful for debugging setup)
+- `--skip-setup`: Skip before_test commands (useful for iterative testing)
+
+## Shared Infrastructure Pattern
+
+**When to use shared infrastructure**:
+- Multiple tests use the same service (Grafana, Loki, Prometheus)
+- Service configuration is standardized across tests
+
+**Implementation**:
+```bash
+# Create shared manifest in tests/llm/fixtures/shared/servicename.yaml
+# Use in tests:
+kubectl apply -f ../../shared/servicename.yaml -n app-<testid>
+```
+
+**Benefits**:
+- Single place for version updates
+- Consistent configuration across tests
+- Reduced maintenance overhead
+- Follows established pattern (Loki, Prometheus, Grafana)
+
+## Setup Verification Best Practices
+
+**Prefer kubectl exec over port forwarding for setup verification**:
+```bash
+# GOOD - kubectl exec pattern (no port conflicts)
+kubectl exec -n namespace deployment/service -- wget -q -O- http://localhost:port/health
+
+# AVOID - port forward for setup verification (causes conflicts)
+kubectl port-forward svc/service port:port &
+curl localhost:port/health
+kill $PORTFWD_PID
+```
+
+**Performance optimization guidelines**:
+- Use `sleep 1` instead of `sleep 5` for most retry loops
+- Remove sleeps after straightforward operations (port forward start)
+- Reduce timeout values: 60s for pod readiness, 30s for API verification
+- Question every sleep - many are unnecessary
+
+**Race Condition Handling:**
+Never use bare `kubectl wait` immediately after resource creation. Use retry loops:
+```bash
+# WRONG - fails if pod not scheduled yet
+kubectl apply -f deployment.yaml
+kubectl wait --for=condition=ready pod -l app=myapp --timeout=300s
+
+# CORRECT - retry loop handles race condition
+kubectl apply -f deployment.yaml
+POD_READY=false
+for i in {1..60}; do
+  if kubectl wait --for=condition=ready pod -l app=myapp --timeout=5s 2>/dev/null; then
+    echo "✅ Pod is ready!"
+    POD_READY=true
+    break
+  fi
+  sleep 1
+done
+if [ "$POD_READY" = false ]; then
+  echo "❌ Pod failed to become ready after 60 seconds"
+  kubectl logs -l app=myapp --tail=20  # Diagnostic info
+  exit 1  # CRITICAL: Fail the test early
+fi
+```
+
+### Eval Best Practices
+
+**Realism:**
+- No fake/obvious logs like "Memory usage stabilized at 800MB"
+- No hints in filenames like "disk_consumer.py" - use realistic names like "training_pipeline.py"
+- No error messages that give away it's simulated like "Simulated processing error"
+- Use real-world scenarios: ML pipelines with checkpoint issues, database connection pools
+- Resource naming should be neutral, not hint at the problem (avoid "broken-pod", "crashloop-app")
+
+**Architecture:**
+- Implement full architecture even if complex (e.g., use Loki for log aggregation, not simplified alternatives)
+- Proper separation of concerns (app → file → Promtail → Loki → Holmes)
+- **ALWAYS use Secrets for scripts**, not inline manifests or ConfigMaps
+- Use minimal resource footprints (reduce memory/CPU for test services)
+
+**Anti-Cheat Testing Guidelines:**
+- **Prevent Domain Knowledge Cheats**: Use neutral, application-specific names instead of obvious technical terms
+  - Example: "E-Commerce Platform Monitoring" not "Node Exporter Full"
+  - Example: "Payment Service Dashboard" not "MySQL Error Dashboard"
+  - Add source comments: `# Uses Node Exporter dashboard but renamed to prevent cheats`
+- **Resource Naming Rules**: Avoid hint-giving names
+  - Use realistic business context: "checkout-api", "user-service", "inventory-db"
+  - Avoid obvious problem indicators: "broken-pod" → "payment-service-1"
+  - Test discovery ability, not pattern recognition
+- **Prompt Design**: Don't give away solutions in prompts
+  - BAD: "Find the node_pressure_cpu_waiting_seconds_total query"
+  - GOOD: "Find the Prometheus query that monitors CPU pressure waiting time"
+  - Test Holmes's search/analysis skills, not domain knowledge shortcuts
+
+**Configuration:**
+- Custom runbooks: Add `runbooks` field in test_case.yaml (`runbooks: {}` for empty catalog)
+- Custom toolsets: Create separate `toolsets.yaml` file (never put in test_case.yaml)
+- Toolset config must go under `config` field:
+  ```yaml
+  toolsets:
+    grafana/dashboards:
+      enabled: true
+      config:  # All toolset-specific config under 'config'
+        api_url: http://localhost:10177
+  ```
 
 ## Documentation Lookup
 
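The `{{ env.ELASTICSEARCH_URL }}` references mentioned in the eval guide above imply a substitution step when `toolsets.yaml` is loaded. A hypothetical sketch of that step, assuming the template syntax shown in the docs (the project's actual loader may differ):

```python
import os
import re

# Matches "{{ env.VAR_NAME }}" with optional surrounding whitespace
_ENV_PATTERN = re.compile(r"\{\{\s*env\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def substitute_env(value: str) -> str:
    """Replace {{ env.NAME }} placeholders with environment variable values."""
    def _lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            # Failing loudly beats silently passing the placeholder downstream
            raise KeyError(f"environment variable {name!r} is not set")
        return os.environ[name]

    return _ENV_PATTERN.sub(_lookup, value)


os.environ["ELASTICSEARCH_URL"] = "https://example.cloud.es.io:9243"
print(substitute_env('api_url: "{{ env.ELASTICSEARCH_URL }}"'))
```

Raising on a missing variable mirrors the guide's advice to fail setup early rather than let a test run against an unresolved placeholder.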

docs/data-sources/builtin-toolsets/grafanadashboards.md

Lines changed: 6 additions & 6 deletions
@@ -20,9 +20,9 @@ A [Grafana service account token](https://grafana.com/docs/grafana/latest/admini
     enabled: true
     config:
       api_key: <your grafana service account token>
-      url: <your grafana url> # e.g. https://acme-corp.grafana.net or http://localhost:3000
+      api_url: <your grafana url> # e.g. https://acme-corp.grafana.net or http://localhost:3000
       # Optional: Additional headers for all requests
-      # headers:
+      # additional_headers:
       #   X-Custom-Header: "custom-value"
 ```
 
@@ -43,9 +43,9 @@ A [Grafana service account token](https://grafana.com/docs/grafana/latest/admini
     enabled: true
     config:
       api_key: <your grafana API key>
-      url: <your grafana url> # e.g. https://acme-corp.grafana.net
+      api_url: <your grafana url> # e.g. https://acme-corp.grafana.net
       # Optional: Additional headers for all requests
-      # headers:
+      # additional_headers:
       #   X-Custom-Header: "custom-value"
 ```
 
@@ -69,7 +69,7 @@ toolsets:
   grafana/dashboards:
     enabled: true
     config:
-      url: https://grafana.internal
+      api_url: https://grafana.internal
       api_key: <your api key>
       verify_ssl: false # Disable SSL verification (default: true)
 ```
@@ -83,7 +83,7 @@ toolsets:
   grafana/dashboards:
     enabled: true
     config:
-      url: http://grafana.internal:3000 # Internal URL for API calls
+      api_url: http://grafana.internal:3000 # Internal URL for API calls
       external_url: https://grafana.example.com # URL for links in results
       api_key: <your api key>
 ```
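Per the rename rationale, `additional_headers` entries are sent on every request alongside the API-key auth header. A rough sketch of how such a merge might look (the function name and header layout here are illustrative, not the toolset's actual code):

```python
from typing import Dict, Optional


def build_request_headers(
    api_key: Optional[str],
    additional_headers: Optional[Dict[str, str]],
) -> Dict[str, str]:
    """Combine the standard auth header with user-supplied extras."""
    headers: Dict[str, str] = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    # User-supplied headers are applied last, so extras like
    # X-Scope-OrgID or X-Custom-Header ride along on every call
    if additional_headers:
        headers.update(additional_headers)
    return headers


print(build_request_headers("abc123", {"X-Custom-Header": "custom-value"}))
```

The new name makes the intent clearer than the old `headers`: these are headers added on top of what the client already sends, not a replacement for it.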

docs/data-sources/builtin-toolsets/grafanaloki.md

Lines changed: 5 additions & 5 deletions
@@ -43,7 +43,7 @@ toolsets:
     enabled: true
     config:
       api_key: <your grafana API key>
-      url: https://xxxxxxx.grafana.net # Your Grafana cloud account URL
+      api_url: https://xxxxxxx.grafana.net # Your Grafana cloud account URL
       grafana_datasource_uid: <the UID of the loki data source in Grafana>
 
   kubernetes/logs:
@@ -61,8 +61,8 @@ toolsets:
   grafana/loki:
     enabled: true
     config:
-      url: http://loki.logging
-      headers:
+      api_url: http://loki.logging
+      additional_headers:
         X-Scope-OrgID: "<tenant id>" # Set the X-Scope-OrgID if loki multitenancy is enabled
 
   kubernetes/logs:
@@ -80,7 +80,7 @@ toolsets:
   grafana/loki:
     enabled: true
     config:
-      url: https://loki.internal
+      api_url: https://loki.internal
       verify_ssl: false # Disable SSL verification (default: true)
 ```
 
@@ -93,7 +93,7 @@ toolsets:
   grafana/loki:
     enabled: true
     config:
-      url: http://loki.internal:3100 # Internal URL for API calls
+      api_url: http://loki.internal:3100 # Internal URL for API calls
       external_url: https://loki.example.com # URL for links in results
 ```
 

docs/data-sources/builtin-toolsets/grafanatempo.md

Lines changed: 9 additions & 9 deletions
@@ -70,7 +70,7 @@ In this case, the Tempo datasource UID is `klja8hsa-8a9c-4b35-1230-7baab22b02ee`
     enabled: true
     config:
       api_key: <your grafana service account token>
-      url: <your grafana url> # e.g. https://acme-corp.grafana.net
+      api_url: <your grafana url> # e.g. https://acme-corp.grafana.net
       grafana_datasource_uid: <the UID of the tempo data source in Grafana>
 ```
 
@@ -91,7 +91,7 @@ In this case, the Tempo datasource UID is `klja8hsa-8a9c-4b35-1230-7baab22b02ee`
     enabled: true
     config:
       api_key: <your grafana API key>
-      url: <your grafana url> # e.g. https://acme-corp.grafana.net
+      api_url: <your grafana url> # e.g. https://acme-corp.grafana.net
       grafana_datasource_uid: <the UID of the tempo data source in Grafana>
 ```
 
@@ -110,8 +110,8 @@ The toolset can directly connect to a Tempo instance without proxying through a
   grafana/tempo:
     enabled: true
     config:
-      url: http://tempo.monitoring
-      headers:
+      api_url: http://tempo.monitoring
+      additional_headers:
         X-Scope-OrgID: "<tenant id>" # Set the X-Scope-OrgID if tempo multitenancy is enabled
 ```
 
@@ -125,8 +125,8 @@ The toolset can directly connect to a Tempo instance without proxying through a
   grafana/tempo:
     enabled: true
    config:
-      url: http://tempo.monitoring
-      headers:
+      api_url: http://tempo.monitoring
+      additional_headers:
         X-Scope-OrgID: "<tenant id>" # Set the X-Scope-OrgID if tempo multitenancy is enabled
 ```
 
@@ -141,7 +141,7 @@ toolsets:
   grafana/tempo:
     enabled: true
     config:
-      url: https://tempo.internal
+      api_url: https://tempo.internal
       verify_ssl: false # Disable SSL verification (default: true)
 ```
 
@@ -154,7 +154,7 @@ toolsets:
   grafana/tempo:
     enabled: true
    config:
-      url: http://tempo.internal:3100 # Internal URL for API calls
+      api_url: http://tempo.internal:3100 # Internal URL for API calls
       external_url: https://tempo.example.com # URL for links in results
       grafana_datasource_uid: <tempo datasource uid>
 ```
@@ -168,7 +168,7 @@ toolsets:
   grafana/tempo:
     enabled: true
    config:
-      url: https://grafana.example.com
+      api_url: https://grafana.example.com
       grafana_datasource_uid: <tempo datasource uid>
       labels:
         pod: "k8s.pod.name" # default

docs/development/evaluations/adding-evals.md

Lines changed: 1 addition & 1 deletion
@@ -171,7 +171,7 @@ toolsets:
   grafana/loki:
     enabled: true
     config:
-      url: http://loki.app-143.svc.cluster.local:3100
+      api_url: http://loki.app-143.svc.cluster.local:3100
       api_key: ""
   kafka/admin:
     enabled: true
