Commit be57867
Rename Grafana config fields for consistency and add backward compatibility (#1498)
## Summary

This PR renames configuration fields in Grafana-related toolsets for improved clarity and consistency, while maintaining backward compatibility with existing configurations.

## Key Changes

- **Field Renames:**
  - `url` → `api_url` across all Grafana toolsets (Dashboards, Loki, Tempo)
  - `headers` → `additional_headers` for clarity about their purpose
- **Backward Compatibility:**
  - Added `_deprecated_mappings` to the `GrafanaConfig` class to automatically map old field names to new ones
  - Existing configurations using old field names continue to work without modification
  - New field names take precedence when both old and new names are provided
- **Documentation Updates:**
  - Updated all documentation files to reflect the new field names
  - Updated example configurations in CLAUDE.md and all toolset-specific docs
  - Updated test fixtures to use the new field names
- **Code Updates:**
  - Modified the `GrafanaConfig` class in `holmes/plugins/toolsets/grafana/common.py` to support field mapping
  - Updated all references from `config.url` to `config.api_url`
  - Updated all references from `config.headers` to `config.additional_headers`
  - Updated test cases to use the new field names

## Implementation Details

Backward compatibility is implemented using Pydantic's field validation mechanism. The `_deprecated_mappings` class variable defines the mapping between old and new field names, allowing the configuration parser to automatically translate old configurations to the new format while maintaining full compatibility.

All test cases have been updated to verify both that the new field names work correctly and that the deprecated field names are properly mapped to their new equivalents.
https://claude.ai/code/session_01EwZdqLZkUq4NKduWDmBFDa

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Documentation**
  * Major expansion of the evaluation/test workflow guide with detailed setup, prompts, CI secret handling, examples, and best practices; Grafana toolset docs updated for new config keys.
* **Refactor**
  * Config keys renamed for clarity: `url` → `api_url` and `headers` → `additional_headers`, with backward-compatibility support.
* **Tests**
  * Fixtures and unit tests updated to use the new config keys to preserve test compatibility.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
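The `_deprecated_mappings` mechanism described above can be sketched as a Pydantic `model_validator` running in `"before"` mode. This is a minimal illustration under assumed field names, not the actual `GrafanaConfig` from `holmes/plugins/toolsets/grafana/common.py`:

```python
from typing import Any, ClassVar, Dict, Optional

from pydantic import BaseModel, model_validator


class GrafanaConfig(BaseModel):
    # New canonical field names (other fields omitted for brevity)
    api_url: str
    api_key: Optional[str] = None
    additional_headers: Optional[Dict[str, str]] = None

    # Old name -> new name, consulted before normal field validation
    _deprecated_mappings: ClassVar[Dict[str, str]] = {
        "url": "api_url",
        "headers": "additional_headers",
    }

    @model_validator(mode="before")
    @classmethod
    def _map_deprecated_fields(cls, values: Any) -> Any:
        if isinstance(values, dict):
            for old, new in cls._deprecated_mappings.items():
                if old in values:
                    value = values.pop(old)
                    # setdefault means the new name wins when both are provided
                    values.setdefault(new, value)
        return values


# An old-style config still loads; values land on the new fields
cfg = GrafanaConfig(url="http://grafana:3000", headers={"X-Org": "1"})
```

The `setdefault` call gives the stated precedence behavior for free: if a user supplies both `url` and `api_url`, the deprecated key is dropped and the new key is left untouched.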
1 parent e9fa6ce commit be57867

File tree

27 files changed

+329
-69
lines changed


CLAUDE.md

Lines changed: 194 additions & 4 deletions
@@ -259,10 +259,200 @@ For the complete eval CLI reference (flags, env vars, model comparison, debuggin
 
 For creating, running, and debugging LLM eval tests, use the `/create-eval` skill. It contains the complete workflow, test_case.yaml field reference, anti-hallucination patterns, infrastructure setup guides, and CLI reference.
 
-**Always run evals before submitting when possible:**
-1. `poetry run pytest -k "test_name" --only-setup --no-cov` — verify setup
-2. `poetry run pytest -k "test_name" --no-cov` — run full test
-3. Verify cleanup: `kubectl get namespace app-NNN` should return NotFound
+**Test Structure:**
+- Use sequential test numbers: check existing tests for next available number
+- Required files: `test_case.yaml`, infrastructure manifests, `toolsets.yaml` (if needed)
+- Use dedicated namespace per test: `app-<testid>` (e.g., `app-177`)
+- All resource names must be unique across tests to prevent conflicts
+
+**Tags:**
+- **CRITICAL**: Only use valid tags from `pyproject.toml` - invalid tags cause test collection failures
+- Check existing tags before adding new ones, ask user permission for new tags
+
+**Cloud Service Evals (No Kubernetes Required)**:
+- Evals can test against cloud services (Elasticsearch, external APIs) directly via environment variables
+- Faster setup (<30 seconds vs minutes for K8s infrastructure)
+- `before_test` creates test data in the cloud service, `after_test` cleans up
+- Use `toolsets.yaml` to configure the toolset with env var references: `api_url: "{{ env.ELASTICSEARCH_URL }}"`
+- **CI/CD secrets**: When adding evals for a new integration, you must add the required environment variables to `.github/workflows/eval-regression.yaml` in the "Run tests" step. Tell the user which secrets they need to add to their GitHub repository settings (e.g., `ELASTICSEARCH_URL`, `ELASTICSEARCH_API_KEY`).
+- **HTTP request passthrough**: The root `conftest.py` has a `responses` fixture with `autouse=True` that mocks ALL HTTP requests by default. When adding a new cloud integration, you MUST add the service's URL pattern to the passthrough list in `conftest.py` (search for `rsps.add_passthru`). Use `re.compile()` for pattern matching (e.g., `rsps.add_passthru(re.compile(r"https://.*\.cloud\.es\.io"))`).
+
+**User Prompts & Expected Outputs:**
+- **Be specific**: Test exact values like `"The dashboard title is 'Home'"` not generic `"Holmes retrieves dashboard"`
+- **Match prompt to test**: User prompt must explicitly request what you're testing
+  - BAD: `"Get the dashboard"`
+  - GOOD: `"Get the dashboard and tell me the title, panels, and time range"`
+- **Anti-cheat prompts**: Don't use technical terms that give away solutions
+  - BAD: `"Find node_exporter metrics"`
+  - GOOD: `"Find CPU pressure monitoring queries"`
+- **Test discovery, not recognition**: Holmes should search/analyze, not guess from context
+- **Ruling out hallucinations is paramount**: When choosing between test approaches, prefer the one that rules out hallucinations:
+  - **Best**: Check specific values that can only be discovered by querying (e.g., unique IDs, injected error codes, exact counts)
+  - **Acceptable**: Use `include_tool_calls: true` to verify the tool was called when output values are too generic to rule out hallucinations
+  - **Bad**: Check generic output patterns that an LLM could plausibly guess (e.g., "cluster status is green/yellow/red", "has N nodes")
+- **expected_output is invisible to LLM**: The `expected_output` field is only used by the evaluator - the LLM never sees it. This means:
+  - You can safely put secrets/verification codes in `expected_output` that the LLM must discover
+  - `before_test` can inject a unique verification code into test data, and `expected_output` can check for it
+  - This is a powerful pattern for cloud service tests: create data with a unique code in `before_test`, ask LLM to find it, verify with `expected_output`
+  ```yaml
+  # Example: before_test creates a page with verification code "HOLMES-EVAL-7x9k2m4p"
+  # The LLM must discover this code by querying the service
+  expected_output:
+    - "Must report the verification code: HOLMES-EVAL-7x9k2m4p"
+  ```
+- **`include_tool_calls: true`**: Use when expected output is too generic to be hallucination-proof. Prefer specific answer checking when possible, but verifying tool calls is better than a test that can't rule out hallucinations.
+  ```yaml
+  # Use when values are generic (cluster health could be guessed)
+  include_tool_calls: true
+  expected_output:
+    - "Must call elasticsearch_cluster_health tool"
+    - "Must report cluster status"
+  ```
+
+**Infrastructure Setup:**
+- **Don't just test pod readiness** - verify actual service functionality
+- Poll real API endpoints and check for expected content (e.g., `"title":"Home"`, `"type":"welcome"`)
+- **CRITICAL**: Use `exit 1` when setup verification fails to fail the test early
+- **Never use `:latest` container tags** - use specific versions like `grafana/grafana:12.3.1`
+
+### Running and Testing Evals
+
+## 🚨 CRITICAL: Always Test Your Changes
+
+**NEVER submit test changes without verification**:
+
+### Required Testing Workflow:
+1. **Setup Phase**: `poetry run pytest -k "test_name" --only-setup --no-cov`
+2. **Full Test**: `poetry run pytest -k "test_name" --no-cov`
+3. **Verify Results**: Ensure 100% pass rate and expected behavior
+
+### When to Test:
+- ✅ After creating new tests
+- ✅ After modifying existing tests
+- ✅ After refactoring shared infrastructure
+- ✅ After performance optimizations
+- ✅ After adding/changing tags
+
+### Red Flags - Never Skip Testing:
+- ❌ "The changes look good" without running
+- ❌ "It's just a small change"
+- ❌ "I'll test it later"
+
+**Testing is Part of Development**: Testing is not optional - it's an integral part of the development process. Untested code is broken code.
+
+**Testing Methodology:**
+- Phase 1: Test setup with `--only-setup` flag first
+- Phase 2: Run full test after confirming setup works
+- Use background execution for long tests: `nohup ... > logfile.log 2>&1 &`
+- Handle port conflicts: clean up previous test port forwards before running
+
+**Common Flags:**
+- `--skip-cleanup`: Keep resources after test (useful for debugging setup)
+- `--skip-setup`: Skip before_test commands (useful for iterative testing)
+
+## Shared Infrastructure Pattern
+
+**When to use shared infrastructure**:
+- Multiple tests use the same service (Grafana, Loki, Prometheus)
+- Service configuration is standardized across tests
+
+**Implementation**:
+```bash
+# Create shared manifest in tests/llm/fixtures/shared/servicename.yaml
+# Use in tests:
+kubectl apply -f ../../shared/servicename.yaml -n app-<testid>
+```
+
+**Benefits**:
+- Single place for version updates
+- Consistent configuration across tests
+- Reduced maintenance overhead
+- Follows established pattern (Loki, Prometheus, Grafana)
+
+## Setup Verification Best Practices
+
+**Prefer kubectl exec over port forwarding for setup verification**:
+```bash
+# GOOD - kubectl exec pattern (no port conflicts)
+kubectl exec -n namespace deployment/service -- wget -q -O- http://localhost:port/health
+
+# AVOID - port forward for setup verification (causes conflicts)
+kubectl port-forward svc/service port:port &
+curl localhost:port/health
+kill $PORTFWD_PID
+```
+
+**Performance optimization guidelines**:
+- Use `sleep 1` instead of `sleep 5` for most retry loops
+- Remove sleeps after straightforward operations (port forward start)
+- Reduce timeout values: 60s for pod readiness, 30s for API verification
+- Question every sleep - many are unnecessary
+
+**Race Condition Handling:**
+Never use bare `kubectl wait` immediately after resource creation. Use retry loops:
+```bash
+# WRONG - fails if pod not scheduled yet
+kubectl apply -f deployment.yaml
+kubectl wait --for=condition=ready pod -l app=myapp --timeout=300s
+
+# CORRECT - retry loop handles race condition
+kubectl apply -f deployment.yaml
+POD_READY=false
+for i in {1..60}; do
+  if kubectl wait --for=condition=ready pod -l app=myapp --timeout=5s 2>/dev/null; then
+    echo "✅ Pod is ready!"
+    POD_READY=true
+    break
+  fi
+  sleep 1
+done
+if [ "$POD_READY" = false ]; then
+  echo "❌ Pod failed to become ready after 60 seconds"
+  kubectl logs -l app=myapp --tail=20  # Diagnostic info
+  exit 1  # CRITICAL: Fail the test early
+fi
+```
+
+### Eval Best Practices
+
+**Realism:**
+- No fake/obvious logs like "Memory usage stabilized at 800MB"
+- No hints in filenames like "disk_consumer.py" - use realistic names like "training_pipeline.py"
+- No error messages that give away it's simulated like "Simulated processing error"
+- Use real-world scenarios: ML pipelines with checkpoint issues, database connection pools
+- Resource naming should be neutral, not hint at the problem (avoid "broken-pod", "crashloop-app")
+
+**Architecture:**
+- Implement full architecture even if complex (e.g., use Loki for log aggregation, not simplified alternatives)
+- Proper separation of concerns (app → file → Promtail → Loki → Holmes)
+- **ALWAYS use Secrets for scripts**, not inline manifests or ConfigMaps
+- Use minimal resource footprints (reduce memory/CPU for test services)
+
+**Anti-Cheat Testing Guidelines:**
+- **Prevent Domain Knowledge Cheats**: Use neutral, application-specific names instead of obvious technical terms
+  - Example: "E-Commerce Platform Monitoring" not "Node Exporter Full"
+  - Example: "Payment Service Dashboard" not "MySQL Error Dashboard"
+  - Add source comments: `# Uses Node Exporter dashboard but renamed to prevent cheats`
+- **Resource Naming Rules**: Avoid hint-giving names
+  - Use realistic business context: "checkout-api", "user-service", "inventory-db"
+  - Avoid obvious problem indicators: "broken-pod" → "payment-service-1"
+  - Test discovery ability, not pattern recognition
+- **Prompt Design**: Don't give away solutions in prompts
+  - BAD: "Find the node_pressure_cpu_waiting_seconds_total query"
+  - GOOD: "Find the Prometheus query that monitors CPU pressure waiting time"
+  - Test Holmes's search/analysis skills, not domain knowledge shortcuts
+
+**Configuration:**
+- Custom runbooks: Add `runbooks` field in test_case.yaml (`runbooks: {}` for empty catalog)
+- Custom toolsets: Create separate `toolsets.yaml` file (never put in test_case.yaml)
+- Toolset config must go under `config` field:
+  ```yaml
+  toolsets:
+    grafana/dashboards:
+      enabled: true
+      config:  # All toolset-specific config under 'config'
+        api_url: http://localhost:10177
+  ```
 
 ## Documentation Lookup
 
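The `{{ env.ELASTICSEARCH_URL }}` references mentioned in the eval guide above imply a substitution step when `toolsets.yaml` is loaded. A hypothetical sketch of that step, assuming the template syntax shown in the docs (the project's actual loader may differ):

```python
import os
import re

# Matches "{{ env.VAR_NAME }}" with optional surrounding whitespace
_ENV_PATTERN = re.compile(r"\{\{\s*env\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def substitute_env(value: str) -> str:
    """Replace {{ env.NAME }} placeholders with environment variable values."""
    def _lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            # Failing loudly beats silently passing the placeholder downstream
            raise KeyError(f"environment variable {name!r} is not set")
        return os.environ[name]

    return _ENV_PATTERN.sub(_lookup, value)


os.environ["ELASTICSEARCH_URL"] = "https://example.cloud.es.io:9243"
print(substitute_env('api_url: "{{ env.ELASTICSEARCH_URL }}"'))
```

Raising on a missing variable mirrors the guide's advice to fail setup early rather than let a test run against an unresolved placeholder.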

docs/data-sources/builtin-toolsets/grafanadashboards.md

Lines changed: 6 additions & 6 deletions
@@ -20,9 +20,9 @@ A [Grafana service account token](https://grafana.com/docs/grafana/latest/admini
     enabled: true
     config:
       api_key: <your grafana service account token>
-      url: <your grafana url> # e.g. https://acme-corp.grafana.net or http://localhost:3000
+      api_url: <your grafana url> # e.g. https://acme-corp.grafana.net or http://localhost:3000
       # Optional: Additional headers for all requests
-      # headers:
+      # additional_headers:
       #   X-Custom-Header: "custom-value"
 ```
 
@@ -43,9 +43,9 @@ A [Grafana service account token](https://grafana.com/docs/grafana/latest/admini
     enabled: true
     config:
       api_key: <your grafana API key>
-      url: <your grafana url> # e.g. https://acme-corp.grafana.net
+      api_url: <your grafana url> # e.g. https://acme-corp.grafana.net
       # Optional: Additional headers for all requests
-      # headers:
+      # additional_headers:
       #   X-Custom-Header: "custom-value"
 ```
 
@@ -69,7 +69,7 @@ toolsets:
   grafana/dashboards:
     enabled: true
     config:
-      url: https://grafana.internal
+      api_url: https://grafana.internal
       api_key: <your api key>
       verify_ssl: false # Disable SSL verification (default: true)
 ```
@@ -83,7 +83,7 @@ toolsets:
   grafana/dashboards:
     enabled: true
     config:
-      url: http://grafana.internal:3000 # Internal URL for API calls
+      api_url: http://grafana.internal:3000 # Internal URL for API calls
       external_url: https://grafana.example.com # URL for links in results
       api_key: <your api key>
 ```
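Per the rename rationale, `additional_headers` entries are sent on every request alongside the API-key auth header. A rough sketch of how such a merge might look (the function name and header layout here are illustrative, not the toolset's actual code):

```python
from typing import Dict, Optional


def build_request_headers(
    api_key: Optional[str],
    additional_headers: Optional[Dict[str, str]],
) -> Dict[str, str]:
    """Combine the standard auth header with user-supplied extras."""
    headers: Dict[str, str] = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    # User-supplied headers are applied last, so extras like
    # X-Scope-OrgID or X-Custom-Header ride along on every call
    if additional_headers:
        headers.update(additional_headers)
    return headers


print(build_request_headers("abc123", {"X-Custom-Header": "custom-value"}))
```

The new name makes the intent clearer than the old `headers`: these are headers added on top of what the client already sends, not a replacement for it.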

docs/data-sources/builtin-toolsets/grafanaloki.md

Lines changed: 5 additions & 5 deletions
@@ -43,7 +43,7 @@ toolsets:
     enabled: true
     config:
       api_key: <your grafana API key>
-      url: https://xxxxxxx.grafana.net # Your Grafana cloud account URL
+      api_url: https://xxxxxxx.grafana.net # Your Grafana cloud account URL
       grafana_datasource_uid: <the UID of the loki data source in Grafana>
 
   kubernetes/logs:
@@ -61,8 +61,8 @@ toolsets:
   grafana/loki:
     enabled: true
     config:
-      url: http://loki.logging
-      headers:
+      api_url: http://loki.logging
+      additional_headers:
         X-Scope-OrgID: "<tenant id>" # Set the X-Scope-OrgID if loki multitenancy is enabled
 
   kubernetes/logs:
@@ -80,7 +80,7 @@ toolsets:
   grafana/loki:
     enabled: true
     config:
-      url: https://loki.internal
+      api_url: https://loki.internal
       verify_ssl: false # Disable SSL verification (default: true)
 ```
 
@@ -93,7 +93,7 @@ toolsets:
   grafana/loki:
     enabled: true
     config:
-      url: http://loki.internal:3100 # Internal URL for API calls
+      api_url: http://loki.internal:3100 # Internal URL for API calls
       external_url: https://loki.example.com # URL for links in results
 ```
 

docs/data-sources/builtin-toolsets/grafanatempo.md

Lines changed: 9 additions & 9 deletions
@@ -70,7 +70,7 @@ In this case, the Tempo datasource UID is `klja8hsa-8a9c-4b35-1230-7baab22b02ee`
     enabled: true
     config:
       api_key: <your grafana service account token>
-      url: <your grafana url> # e.g. https://acme-corp.grafana.net
+      api_url: <your grafana url> # e.g. https://acme-corp.grafana.net
       grafana_datasource_uid: <the UID of the tempo data source in Grafana>
 ```
 
@@ -91,7 +91,7 @@ In this case, the Tempo datasource UID is `klja8hsa-8a9c-4b35-1230-7baab22b02ee`
     enabled: true
     config:
       api_key: <your grafana API key>
-      url: <your grafana url> # e.g. https://acme-corp.grafana.net
+      api_url: <your grafana url> # e.g. https://acme-corp.grafana.net
       grafana_datasource_uid: <the UID of the tempo data source in Grafana>
 ```
 
@@ -110,8 +110,8 @@ The toolset can directly connect to a Tempo instance without proxying through a
   grafana/tempo:
     enabled: true
     config:
-      url: http://tempo.monitoring
-      headers:
+      api_url: http://tempo.monitoring
+      additional_headers:
         X-Scope-OrgID: "<tenant id>" # Set the X-Scope-OrgID if tempo multitenancy is enabled
 ```
 
@@ -125,8 +125,8 @@ The toolset can directly connect to a Tempo instance without proxying through a
   grafana/tempo:
     enabled: true
    config:
-      url: http://tempo.monitoring
-      headers:
+      api_url: http://tempo.monitoring
+      additional_headers:
         X-Scope-OrgID: "<tenant id>" # Set the X-Scope-OrgID if tempo multitenancy is enabled
 ```
 
@@ -141,7 +141,7 @@ toolsets:
   grafana/tempo:
     enabled: true
     config:
-      url: https://tempo.internal
+      api_url: https://tempo.internal
       verify_ssl: false # Disable SSL verification (default: true)
 ```
 
@@ -154,7 +154,7 @@ toolsets:
   grafana/tempo:
     enabled: true
    config:
-      url: http://tempo.internal:3100 # Internal URL for API calls
+      api_url: http://tempo.internal:3100 # Internal URL for API calls
       external_url: https://tempo.example.com # URL for links in results
       grafana_datasource_uid: <tempo datasource uid>
 ```
@@ -168,7 +168,7 @@ toolsets:
   grafana/tempo:
     enabled: true
    config:
-      url: https://grafana.example.com
+      api_url: https://grafana.example.com
       grafana_datasource_uid: <tempo datasource uid>
       labels:
         pod: "k8s.pod.name" # default

docs/development/evaluations/adding-evals.md

Lines changed: 1 addition & 1 deletion
@@ -171,7 +171,7 @@ toolsets:
   grafana/loki:
     enabled: true
     config:
-      url: http://loki.app-143.svc.cluster.local:3100
+      api_url: http://loki.app-143.svc.cluster.local:3100
       api_key: ""
   kafka/admin:
     enabled: true
