cortex-mixin/docs/playbooks.md (9 additions & 10 deletions)
@@ -231,16 +231,6 @@ How to **investigate**:

 _If the alert `CortexIngesterTSDBHeadCompactionFailed` fired as well, then give priority to it because that could be the cause._

-### CortexRolloutStuck
-
-This alert fires when a Cortex service rollout is stuck, which means the number of updated replicas doesn't match the expected number and the rollout appears to be making no progress. The alert monitors services deployed as Kubernetes `StatefulSet` and `Deployment`.
-
-How to **investigate**:
-- Run `kubectl -n <namespace> get pods -l name=<statefulset|deployment>` to list the running pods
-- Ensure no pod is in a failing state (e.g. `Error`, `OOMKilled`, `CrashLoopBackOff`)
-- Ensure no pod is `NotReady` (the number of ready containers should match the total number of containers, e.g. `1/1` or `2/2`)
-- Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" for more information
-
 #### Ingester hit the disk capacity

 If the ingester hit the disk capacity, any attempt to append samples will fail. You should:
@@ -734,6 +724,15 @@ When an alertmanager cannot read the state for a tenant from storage it gets log
 - The state could not be merged because it might be invalid and could not be decoded. This could indicate data corruption and therefore a bug in the reading or writing of the state, and would need further investigation.
 - The state could not be read from storage. This could be due to a networking issue such as a timeout or an authentication and authorization issue with the remote object store.

+### CortexRolloutStuck
+
+This alert fires when a Cortex service rollout is stuck, which means the number of updated replicas doesn't match the expected number and the rollout appears to be making no progress. The alert monitors services deployed as Kubernetes `StatefulSet` and `Deployment`.
+
+How to **investigate**:
+- Run `kubectl -n <namespace> get pods -l name=<statefulset|deployment>` to list the running pods
+- Ensure no pod is in a failing state (e.g. `Error`, `OOMKilled`, `CrashLoopBackOff`)
+- Ensure no pod is `NotReady` (the number of ready containers should match the total number of containers, e.g. `1/1` or `2/2`)
+- Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" for more information
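
The first three investigation steps for `CortexRolloutStuck` can be sketched as a small check over the default tabular output of `kubectl get pods`. This is only an illustration, not part of the playbook or of Cortex: the function name `find_unhealthy_pods`, the `FAILING_STATES` set, and the sample pod names are all hypothetical, and the parser assumes the standard `NAME READY STATUS ...` column order.

```python
from typing import List, Tuple

# Failing states named in the playbook; extend as needed (assumption: not exhaustive).
FAILING_STATES = {"Error", "OOMKilled", "CrashLoopBackOff"}


def find_unhealthy_pods(kubectl_output: str) -> List[Tuple[str, str]]:
    """Return (pod_name, reason) pairs for pods that look unhealthy.

    Expects the default table printed by `kubectl get pods`, where the
    first three columns are NAME, READY and STATUS.
    """
    unhealthy = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore malformed or empty rows
        name, ready, status = fields[:3]
        ready_count, _, total_count = ready.partition("/")
        if status in FAILING_STATES:
            unhealthy.append((name, f"failing state: {status}"))
        elif ready_count != total_count:
            # e.g. READY shows 1/2: not all containers are ready
            unhealthy.append((name, f"not ready: {ready}"))
    return unhealthy


# Hypothetical sample output, for illustration only.
SAMPLE = """\
NAME         READY   STATUS             RESTARTS   AGE
ingester-0   1/1     Running            0          3d
ingester-1   0/1     CrashLoopBackOff   12         40m
ingester-2   1/2     Running            0          5m
"""

if __name__ == "__main__":
    for pod_name, reason in find_unhealthy_pods(SAMPLE):
        print(f"{pod_name}: {reason}")
```

In practice the input would come from piping the output of `kubectl -n <namespace> get pods -l name=<statefulset|deployment>` into the script. Any pod it reports is a candidate for the `kubectl describe` step.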