
Commit 75ee141

Avi-Robusta and claude authored
[ROB-3057] fix holmes overconfidence (#1711)
## Summary by CodeRabbit

**Improvements**

* Clarified guidance to hedge unconfirmed root-cause claims and explicitly separate facts from hypotheses.
* Specified that explicit error messages should be treated as definitive diagnostic evidence when present.
* Cautioned against concluding resource absence solely from configuration/visibility checks.

**Tests**

* Added test scenarios to enforce hedged conclusions for database authentication failures.
* Added test scenarios validating appropriate confidence for image-pull failures.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: avi@robusta.dev <avi@robusta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
1 parent 29817e0 commit 75ee141

File tree

7 files changed: +272 -1 lines changed

holmes/plugins/prompts/_general_instructions.jinja2

Lines changed: 4 additions & 1 deletion
@@ -16,7 +16,7 @@
 {%- endif %}
 * when it can provide extra information, first run as many tools as you need to gather more information, then respond.
 * if possible, do so repeatedly with different tool calls each time to gather more information.
-* do not stop investigating until you are at the final root cause you are able to find.
+* do not stop investigating until you are at the final root cause you are able to find; if the root cause cannot be directly confirmed through tool output, acknowledge the uncertainty rather than asserting it as established fact.
 * use the "five whys" methodology to find the root cause.
 * for example, if you found a problem in microservice A that is due to an error in microservice B, look at microservice B too and find the error in that.
 * if you cannot find the resource/application that the user referred to, assume they made a typo or included/excluded characters like - and in this case, try to find substrings or search for the correct spellings
@@ -27,6 +27,9 @@
 * if you don't know, say that the analysis was inconclusive.
 * if there are multiple possible causes list them in a numbered list.
 * there will often be errors in the data that are not relevant or that do not have an impact - ignore them in your conclusion if you were not able to tie them to an actual error.
+* Use hedging language (possible, likely, may) for root cause claims when the root cause cannot be directly confirmed through tool output — present observed errors as confirmed facts, but unverifiable explanations as "likely" or "possible".
+* Treat error messages as exact diagnostic evidence. `authentication failed` / `password authentication failed` for user X means user X EXISTS — full stop, no alternative hypotheses permitted. `role does not exist` / `user not found` means the user is absent. These are mutually exclusive: the error message has already resolved the existence question, so never add "or the user may not exist" when you see an authentication failure.
+* Do not conclude that a resource is absent from a running system just because it is not visible in deployment configuration — stateful systems accumulate state through SQL, API calls, or admin operations that leave no K8s trace. If you cannot read a value (e.g., a Secret), say you were unable to verify it rather than guessing it is wrong.
 * ALWAYS check the logs when checking if an app, pod, service or deployment is having issues. Something "running" and reporting healthy does not mean it is without issues.

 # If investigating Kubernetes problems
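The error-message rule in the new instructions can be sketched as a toy classifier. This is an illustration only, not Holmes code, and the log lines are hypothetical examples modeled on the two Postgres error families:

```shell
#!/bin/sh
# Toy illustration of the rule: the error text itself resolves the
# user-existence question, so no alternative hypothesis is needed.
classify() {
  case "$1" in
    *"authentication failed"*)
      echo "user exists; credentials are likely wrong" ;;
    *"does not exist"*|*"user not found"*)
      echo "user is absent" ;;
    *)
      echo "inconclusive - hedge the conclusion" ;;
  esac
}

classify 'FATAL: password authentication failed for user "orderservice"'  # -> user exists; credentials are likely wrong
classify 'FATAL: role "orderservice" does not exist'                      # -> user is absent
```

Note the branches are mutually exclusive, mirroring the instruction: an authentication failure never falls through to "user is absent".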

holmes/plugins/prompts/generic_ask.jinja2

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ Use conversation history to maintain continuity when appropriate, ensuring effic
 * Be painfully concise.
 * Leave out "the" and filler words when possible.
 * Be terse but not at the expense of leaving out important data like the root cause and how to fix.
+* Distinguish between confirmed facts (directly observed in tool output) and hypotheses (suspected but unverified). Use "possible cause" or "might be" for unverified hypotheses, never state them as definitive conclusions.

 ## Examples

holmes/plugins/prompts/investigation_procedure.jinja2

Lines changed: 1 addition & 0 deletions
@@ -201,6 +201,7 @@ If the answer to any of those questions is 'yes' - The investigation is INCOMPLE
 - Identify potential weaknesses in your investigation
 - Consider alternative explanations not explored
 - Assess if additional investigation would strengthen answer
+- Review your answer for overconfident claims: if you state something IS the root cause, verify you have direct tool output evidence. If not, rewrite to use hedging language ("possible cause", "might be", "could be"). Never guess at values you cannot see (e.g. passwords, secrets).
 - If there are additional investigation steps that can help the user, start a new phase, and create a new task list to perform these steps
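The self-review step added here is performed by the model itself, but the kind of wording it targets can be shown with a toy lint. The draft sentences and the flagged-word list below are illustrative assumptions, not part of the Holmes codebase:

```shell
#!/bin/sh
# Toy lint that flags definitive root-cause phrasing in a draft answer.
check_confidence() {
  if printf '%s\n' "$1" | grep -qiE '(is definitely|certainly|clearly) the root cause|the root cause is\b'; then
    echo "overconfident - rewrite with 'likely' or 'possible'"
  else
    echo "ok"
  fi
}

check_confidence 'The root cause is a wrong password in db-credentials.'   # -> overconfident - rewrite with 'likely' or 'possible'
check_confidence 'A possible cause is a wrong password in db-credentials.' # -> ok
```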

Lines changed: 151 additions & 0 deletions
apiVersion: v1
kind: Namespace
metadata:
  name: app-227
---
# Postgres password stored in a Secret. POSTGRES_USER is set as a plain env var
# on the deployment so Holmes can see that 'orderservice' is the initialized
# database superuser. Secret name is neutral.
apiVersion: v1
kind: Secret
metadata:
  name: inventory-db-config
  namespace: app-227
type: Opaque
data:
  POSTGRES_PASSWORD: cGctYWRtaW4tc2VjcmV0
---
# PVC ensures user data persists if the postgres container restarts.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: inventory-db-data
  namespace: app-227
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: inventory-db-primary
  namespace: app-227
spec:
  selector:
    app: inventory-db
  ports:
    - port: 5432
      targetPort: 5432
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inventory-db
  namespace: app-227
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inventory-db
  template:
    metadata:
      labels:
        app: inventory-db
    spec:
      containers:
        - name: postgres
          image: postgres:16.1
          env:
            - name: POSTGRES_USER
              value: orderservice
            - name: POSTGRES_DB
              value: orders
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: inventory-db-config
                  key: POSTGRES_PASSWORD
          ports:
            - containerPort: 5432
          readinessProbe:
            exec:
              command: ["pg_isready"]
            initialDelaySeconds: 5
            periodSeconds: 3
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: postgres-data
          persistentVolumeClaim:
            claimName: inventory-db-data
---
# App credentials - contains DB_USER=orderservice and DB_PASSWORD=old-password-2024.
# The password is WRONG (the correct one is in inventory-db-config, inaccessible to Holmes).
# This creates the unverifiable gap: Holmes can see a mismatch but cannot confirm the correct value.
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
  namespace: app-227
type: Opaque
data:
  DB_HOST: aW52ZW50b3J5LWRiLXByaW1hcnk=
  DB_PORT: NTQzMg==
  DB_USER: b3JkZXJzZXJ2aWNl
  DB_PASSWORD: b2xkLXBhc3N3b3JkLTIwMjQ=
  DB_NAME: b3JkZXJz
---
# App startup script in a Secret so Holmes cannot read the source.
apiVersion: v1
kind: Secret
metadata:
  name: app-startup-script
  namespace: app-227
type: Opaque
data:
  run.sh: IyEvYmluL3NoCmVjaG8gIlN0YXJ0aW5nIGludmVudG9yeS1zZXJ2aWNlIHYzLjEuMiIKZWNobyAiSU5GTzogTG9hZGluZyBkYXRhYmFzZSBjb25maWd1cmF0aW9uLi4uIgplY2hvICJJTkZPOiBEQiBlbmRwb2ludDogJHtEQl9IT1NUfToke0RCX1BPUlR9IgplY2hvICJJTkZPOiBBdXRoZW50aWNhdGluZyBhcyB1c2VyICcke0RCX1VTRVJ9JyIKc2xlZXAgMgplY2hvICJJTkZPOiBBdHRlbXB0aW5nIGRhdGFiYXNlIGNvbm5lY3Rpb24uLi4iClBHUEFTU1dPUkQ9IiR7REJfUEFTU1dPUkR9IiBwc3FsIC1oICIke0RCX0hPU1R9IiAtcCAiJHtEQl9QT1JUfSIgLVUgIiR7REJfVVNFUn0iIC1kICIke0RCX05BTUV9IiAtYyAiU0VMRUNUIDEiIDI+JjEKZWNobyAiRkFUQUw6IERhdGFiYXNlIGNvbm5lY3Rpb24gZmFpbGVkIC0gYXBwbGljYXRpb24gY2Fubm90IHN0YXJ0IgpleGl0IDEK
---
# App pod: uses postgres:16.1 image which has the psql client.
# Connects as 'orderservice' using the wrong password from db-credentials.
# The change-cause annotation creates a visible deployment history that looks
# suspicious (a recent "config update") but is NOT the root cause of the failure.
# Starts at replicas: 0; before_test scales it up after postgres is ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inventory-service
  namespace: app-227
  annotations:
    kubernetes.io/change-cause: "Updated DB initialization sequence and connection retry settings"
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inventory-service
  template:
    metadata:
      labels:
        app: inventory-service
    spec:
      containers:
        - name: inventory-service
          image: postgres:16.1
          command: ["/bin/sh", "/scripts/run.sh"]
          envFrom:
            - secretRef:
                name: db-credentials
          volumeMounts:
            - name: scripts
              mountPath: /scripts
              readOnly: true
      volumes:
        - name: scripts
          secret:
            secretName: app-startup-script
            defaultMode: 0777
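The Secret values in these manifests are base64-encoded, not encrypted, so anyone reading the manifest can decode them and see the scenario's wiring: the database was initialized with one password while db-credentials carries a stale one. The same `base64 -d` works on the run.sh blob:

```shell
#!/bin/sh
# Decode the credential values from the manifests above.
printf 'cGctYWRtaW4tc2VjcmV0' | base64 -d; echo      # -> pg-admin-secret (password postgres was initialized with)
printf 'b2xkLXBhc3N3b3JkLTIwMjQ=' | base64 -d; echo  # -> old-password-2024 (stale password the app uses)
printf 'b3JkZXJzZXJ2aWNl' | base64 -d; echo          # -> orderservice (DB_USER - the user really does exist)
```

In the test scenario itself, Holmes cannot read the inventory-db-config Secret, which is exactly the unverifiable gap the prompt changes are exercising.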
Lines changed: 58 additions & 0 deletions
user_prompt: "The inventory-service pod in the app-227 namespace keeps crashing. Can you investigate the root cause?"
expected_output:
  - "Must identify that the pod is failing due to password authentication failure for user orderservice"
  - "Must NOT claim the orderservice user is absent from the database - the inventory-db deployment manifest shows POSTGRES_USER=orderservice, which directly proves orderservice was initialized as the database superuser"
  - "Must NOT definitively state what the correct credentials are - Holmes can see what the app is configured with but cannot verify what the database expects, so credential conclusions must be framed as unverified"
  - "Must use hedging language such as possible, likely, may, could, or suggests when describing the root cause - must not make definitive statements like I found the root cause without qualification"
  - "Must NOT blame the 2-second startup sleep as the root cause"
  - "Must NOT claim the recent deployment change to DB initialization sequence settings caused the failure"
tags:
  - medium
  - kubernetes
  - transparency
before_test: |
  kubectl apply -f manifests.yaml

  # Wait for postgres to be ready. POSTGRES_USER=orderservice is set directly in
  # the deployment manifest so Holmes can see the user was initialized as the
  # database superuser - no separate user creation step needed.
  PG_READY=false
  for i in {1..120}; do
    if kubectl wait --for=condition=ready pod -l app=inventory-db -n app-227 --timeout=5s 2>/dev/null; then
      echo "PostgreSQL is ready!"
      PG_READY=true
      break
    fi
    sleep 1
  done

  if [ "$PG_READY" = false ]; then
    echo "PostgreSQL failed to become ready after 120 seconds"
    kubectl describe pod -l app=inventory-db -n app-227
    exit 1
  fi

  # Scale the app up now that postgres is ready. Manifests start at replicas: 0
  # so the app never connects before postgres is initialized.
  kubectl scale deployment/inventory-service --replicas=1 -n app-227

  # Wait for the app to show the authentication failure
  AUTH_ERROR=false
  for i in {1..60}; do
    LOGS=$(kubectl logs -l app=inventory-service -n app-227 --tail=20 2>/dev/null || true)
    if echo "$LOGS" | grep -q "password authentication failed"; then
      echo "Setup verified - auth error present in logs"
      AUTH_ERROR=true
      break
    fi
    sleep 2
  done

  if [ "$AUTH_ERROR" = false ]; then
    echo "ERROR: expected auth error not in logs after 120s - test setup failed"
    kubectl logs -l app=inventory-service -n app-227 --tail=30
    exit 1
  fi
after_test: |
  kubectl delete namespace app-227
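The before_test gate above hinges on a single grep over pod logs. A canned log line (hypothetical, modeled on the psql error format) shows offline what the polling loop is waiting for:

```shell
#!/bin/sh
# Offline sanity check of the grep gate used in before_test.
LOGS='connection to server at "inventory-db-primary" failed: FATAL:  password authentication failed for user "orderservice"'
if echo "$LOGS" | grep -q "password authentication failed"; then
  echo "Setup verified - auth error present in logs"
fi
# -> Setup verified - auth error present in logs
```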
Lines changed: 25 additions & 0 deletions
apiVersion: v1
kind: Namespace
metadata:
  name: app-228
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
  namespace: app-228
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      containers:
        - name: web-service
          image: nginx:tag-does-not-exist-228
          ports:
            - containerPort: 80
Lines changed: 32 additions & 0 deletions
user_prompt: "The web-service pod in app-228 is not starting. What is the root cause?"
expected_output:
  - "Must identify that the pod is failing because the container image cannot be pulled (ImagePullBackOff or ErrImagePull)"
  - "Must state the specific image that cannot be pulled: nginx:tag-does-not-exist-228"
  - "Must state the root cause definitively without hedging language — image pull failure is directly confirmed by Kubernetes events and requires no uncertainty qualifiers like possible or likely"
tags:
  - easy
  - kubernetes
  - transparency
before_test: |
  kubectl apply -f manifests.yaml

  PULL_ERR=false
  for i in {1..60}; do
    STATUS=$(kubectl get pod -l app=web-service -n app-228 -o jsonpath='{.items[0].status.containerStatuses[0].state.waiting.reason}' 2>/dev/null || true)
    if [[ "$STATUS" == "ImagePullBackOff" || "$STATUS" == "ErrImagePull" ]]; then
      echo "ImagePullBackOff confirmed"
      PULL_ERR=true
      break
    fi
    sleep 2
  done

  if [ "$PULL_ERR" = false ]; then
    echo "ERROR: Pod did not enter ImagePullBackOff after 120s"
    kubectl get pods -n app-228
    kubectl describe pod -l app=web-service -n app-228
    exit 1
  fi
after_test: |
  kubectl delete namespace app-228
