This repository was archived by the owner on Apr 28, 2025. It is now read-only.
CHANGELOG.md
9 additions & 0 deletions
@@ -3,12 +3,21 @@
 ## master / unreleased

 * [CHANGE] Add the default preset 'extra_small_user' and reference it in the CLI flags. This will raise the limits of the 'small_user' preset to the defaults for `ingester.max-samples-per-query` and `ingester.max-series-per-query`. #200
+* [CHANGE] Removed the config option `$._config.ingester.statefulset_replicas` which was used only when running Cortex chunks storage with WAL enabled. To configure the number of ingester replicas you should now use the following: #210
+```
+ingester_statefulset+:
+  statefulSet.mixin.spec.withReplicas(6),
+```
 * [ENHANCEMENT] Add the Ruler to the read resources dashboard #205
 * [ENHANCEMENT] Read dashboards now use `cortex_querier_request_duration_seconds` metrics to allow for accurate dashboards when deploying Cortex as a single-binary. #199
 * [ENHANCEMENT] Improved Ruler dashboard. Includes information about notifications, reads/writes, and per user per rule group evaluation. #197, #205
 * [ENHANCEMENT] Add new `CortexCompactorRunFailed` alert when compactor run fails. #206
+* [ENHANCEMENT] Add `flusher-job-blocks.libsonnet` with support for flushing blocks disks. #187
+* [ENHANCEMENT] Add more alerts on failure conditions for ingesters when running the blocks storage. #208
 * [FEATURE] Latency recording rules for the metric `cortex_querier_request_duration_seconds` are now part of a `cortex_querier_api` rule group. #199
 * [FEATURE] Add overrides-exporter as optional deployment to expose configured runtime overrides and presets. #198
+* [FEATURE] Add a dashboard for the alertmanager. #207
+* [BUGFIX] Added `ingester-blocks` to ingester's job label matcher, in order to correctly get metrics when migrating a Cortex cluster from chunks to blocks. #203
cortex-mixin/docs/playbooks.md
54 additions & 6 deletions
@@ -82,7 +82,7 @@ This alert occurs when a ruler is unable to validate whether or not it should cl
 This alert fires when a Cortex ingester is not uploading any block to the long-term storage. An ingester is expected to upload a block to the storage every block range period (defaults to 2h), and if a longer time elapses since the last successful upload, it means something is not working correctly.

-How to investigate:
+How to **investigate**:
 - Ensure the ingester is receiving write-path traffic (samples to ingest)
 - Look for any upload error in the ingester logs (e.g. networking or authentication issues)
@@ -115,33 +115,81 @@ The cause triggering this alert could **lead to**:
 How to **investigate**:
 - Look for details in the ingester logs

+### CortexIngesterTSDBHeadTruncationFailed
+
+This alert fires when a Cortex ingester fails to truncate the TSDB head.
+
+The TSDB head is the in-memory store used to keep series and samples not compacted into a block yet. If head truncation fails for a long time, the ingester disk might get full as it won't continue to the WAL truncation stage, and the subsequent ingester restart may take a long time or even go into an OOMKilled crash loop because of the huge WAL to replay. For this reason, it's important to investigate and address the issue as soon as it happens.
+
+How to **investigate**:
+- Look for details in the ingester logs
+
+### CortexIngesterTSDBCheckpointCreationFailed
+
+This alert fires when a Cortex ingester fails to create a TSDB checkpoint.
+
+How to **investigate**:
+- Look for details in the ingester logs
+- If the checkpoint fails because of a `corruption in segment`, you can restart the ingester because at next startup TSDB will try to "repair" it. After restart, if the issue is repaired and the ingester is running, you should also get paged by `CortexIngesterTSDBWALCorrupted` to signal that the WAL was corrupted and manual investigation is required.
+
+### CortexIngesterTSDBCheckpointDeletionFailed
+
+This alert fires when a Cortex ingester fails to delete a TSDB checkpoint.
+
+Generally, this is not an urgent issue, but manual investigation is required to find the root cause of the issue and fix it.
+
+How to **investigate**:
+- Look for details in the ingester logs
+
+### CortexIngesterTSDBWALTruncationFailed
+
+This alert fires when a Cortex ingester fails to truncate the TSDB WAL.
+
+How to **investigate**:
+- Look for details in the ingester logs
+
+### CortexIngesterTSDBWALCorrupted
+
+This alert fires when a Cortex ingester finds a corrupted TSDB WAL (stored on disk) while replaying it at ingester startup or when creation of a checkpoint comes across a WAL corruption.
+
+If this alert fires during an **ingester startup**, the WAL should have been auto-repaired, but manual investigation is required. The WAL repair mechanism causes data loss because all WAL records after the corrupted segment are discarded, and so their samples are lost while replaying the WAL. If this issue happens only on 1 ingester then Cortex doesn't suffer any data loss because of the replication factor, while if it happens on multiple ingesters then some data loss is possible.
+
+If this alert fires during a **checkpoint creation**, you should have also been paged with `CortexIngesterTSDBCheckpointCreationFailed`, and you can follow the steps under that alert.
+
+### CortexIngesterTSDBWALWritesFailed
+
+This alert fires when a Cortex ingester is failing to log records to the TSDB WAL on disk.
+
+How to **investigate**:
+- Look for details in the ingester logs
+
 ### CortexQuerierHasNotScanTheBucket

 This alert fires when a Cortex querier is not successfully scanning blocks in the storage (bucket). A querier is expected to periodically iterate the bucket to find new and deleted blocks (defaults to every 5m) and if it has not successfully synced the bucket for a long time, it may end up querying only a subset of blocks, thus leading to potentially partial results.

-How to investigate:
+How to **investigate**:
 - Look for any scan error in the querier logs (e.g. networking or rate limiting issues)

 ### CortexQuerierHighRefetchRate

 This alert fires when there's a high number of queries for which series have been refetched from a different store-gateway because of missing blocks. This could happen for a short time whenever a store-gateway ring resharding occurs (e.g. during/after an outage or while rolling out store-gateway) but store-gateways should reconcile in a short time. This alert fires if the issue persists for an unexpectedly long time and thus it should be investigated.

-How to investigate:
+How to **investigate**:
 - Ensure there are no errors related to blocks scan or sync in the queriers and store-gateways
 - Check store-gateway logs to see if all store-gateways have successfully completed a blocks sync

 ### CortexStoreGatewayHasNotSyncTheBucket

 This alert fires when a Cortex store-gateway is not successfully scanning blocks in the storage (bucket). A store-gateway is expected to periodically iterate the bucket to find new and deleted blocks (defaults to every 5m) and if it has not successfully synced the bucket for a long time, it may end up querying only a subset of blocks, thus leading to potentially partial results.

-How to investigate:
+How to **investigate**:
 - Look for any scan error in the store-gateway logs (e.g. networking or rate limiting issues)
 This alert fires when a Cortex compactor is not successfully deleting blocks marked for deletion for a long time.

-How to investigate:
+How to **investigate**:
 - Ensure the compactor is not crashing during compaction (e.g. `OOMKilled`)
 - Look for any error in the compactor logs (e.g. bucket Delete API errors)
@@ -153,7 +201,7 @@ Same as [`CortexCompactorHasNotSuccessfullyCleanedUpBlocks`](#CortexCompactorHas
 This alert fires when a Cortex compactor has not uploaded any compacted blocks to the storage for a long time.

-How to investigate:
+How to **investigate**:
 - If the alert `CortexCompactorHasNotSuccessfullyRun` or `CortexCompactorHasNotSuccessfullyRunSinceStart` has fired as well, then investigate that issue first
 - If the alert `CortexIngesterHasNotShippedBlocks` or `CortexIngesterHasNotShippedBlocksSinceStart` has fired as well, then investigate that issue first
 - Ensure ingesters are successfully shipping blocks to the storage
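
As a rough illustration of the condition behind the block-shipping checks in the playbook above (an ingester is expected to upload a block every block range period, 2h by default), the staleness test can be sketched in Python. The function name, the `grace` margin, and the sample timestamps are assumptions made for this example; they are not taken from the Cortex alert rules.

```python
from datetime import datetime, timedelta, timezone

# Default block range period mentioned in the playbook (2h).
BLOCK_RANGE = timedelta(hours=2)


def has_not_shipped_blocks(last_upload, now, block_range=BLOCK_RANGE,
                           grace=timedelta(hours=2)):
    """Return True when the last successful upload is older than expected.

    `grace` is a hypothetical safety margin added on top of the block range
    period; it is not a value defined by the Cortex alerts.
    """
    return now - last_upload > block_range + grace


# Hypothetical timestamps for illustration only.
now = datetime(2020, 11, 1, 12, 0, tzinfo=timezone.utc)
print(has_not_shipped_blocks(now - timedelta(hours=1), now))  # False
print(has_not_shipped_blocks(now - timedelta(hours=5), now))  # True
```

An ingester that is receiving no write-path traffic would also look stale under this check, which is why the playbook lists verifying write traffic as the first investigation step.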