Commit 0b041c6

Monitor visit marker
Signed-off-by: Friedrich Gonzalez <[email protected]>
1 parent c99aab2 commit 0b041c6

3 files changed: 28 additions, 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 
 ## master
 * [ENHANCEMENT] Add bigger tenants and configure default compactor tenant shards
+* [ENHANCEMENT] Add alert `CortexCompactorWriteVisitMarkerIsFailing` to monitor compactors
 
 ## 1.17.1 / 2024-10-23
 * [CHANGE] Use cortex v1.17.1

cortex-mixin/alerts/compactor.libsonnet

Lines changed: 16 additions & 0 deletions
@@ -102,6 +102,22 @@
             ||| % $._config,
           },
         },
+        {
+          // Alert if compactors are not able to update the visit-marker.
+          alert: 'CortexCompactorBlockVisitMarkerIsFailing',
+          'for': '2h',
+          expr: |||
+            sum(increase(cortex_compactor_block_visit_marker_write_failed{job=~".+/%(compactor)s"}[2h]))>0
+          ||| % $._config.job_names,
+          labels: {
+            severity: 'critical'
+          },
+          annotations: {
+            message: |||
+              Cortex compactors are not able to update the visit marker, double check logs to see what is happening
+            |||
+          }
+        }
       ],
     },
   ],
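
The rule above combines a 2h `increase(...)` window with `'for': '2h'`, and the interaction is easy to misjudge. The following is a minimal Python sketch of that interaction, not how Prometheus actually evaluates rules: the `increase` helper is a simplified stand-in (last sample minus first sample in the window, no extrapolation, no counter-reset handling), timestamps are in minutes, and the sample series and function names are made up for illustration.

```python
def increase(samples, now, window):
    """Simplified stand-in for PromQL increase(): last minus first sample in
    [now - window, now]. No extrapolation and no counter-reset handling."""
    in_window = [value for ts, value in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1] - in_window[0]


def alert_states(samples, eval_times, window=120, for_duration=120):
    """Evaluate `increase(...) > 0` at each time and apply the `for` clause."""
    active_since = None
    states = []
    for now in eval_times:
        if increase(samples, now, window) > 0:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_duration else "pending")
        else:
            # Condition false: the pending timer resets.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical counter series, scraped every 10 minutes for 5 hours.
times = list(range(0, 301, 10))
single_failure = [(t, 0 if t < 30 else 1) for t in times]   # one failure at t=30
continuous_failures = [(t, t // 10) for t in times]         # one failure per scrape

single_states = alert_states(single_failure, times)
continuous_states = alert_states(continuous_failures, times)
```

In this simplified model, a single isolated write failure keeps the alert pending and then resolves without ever firing, while failures that keep recurring move it to firing once the condition has held for the full `for` duration.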

cortex-mixin/docs/playbooks.md

Lines changed: 11 additions & 0 deletions
@@ -379,6 +379,17 @@ How to **investigate**:
 - Ensure ingesters are successfully shipping blocks to the storage
 - Look for any error in the compactor logs
 
+### CortexCompactorWriteVisitMarkerIsFailing
+
+This alert only applies to compactors that use shuffle sharding.
+It fires if the compactors are not able to update the visit marker (across all tenants).
+The marker file is a very small JSON file that should never have any problems getting updated.
+
+How to **investigate**:
+- Check the compactor logs; they should show the exact reason
+- If you see `context canceled` or any other timeout in the logs,
+  consider increasing `-compactor.compaction-visit-marker-timeout` and `-compactor.compaction-visit-marker-file-update-interval`.
+
 ### CortexCompactorHasNotSuccessfullyRunCompaction
 
 This alert fires if the compactor is not able to successfully compact all discovered compactable blocks (across all tenants).
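
For the `CortexCompactorWriteVisitMarkerIsFailing` entry, it can help to reason about how the two flags it names relate before changing them. The Python sketch below rests on an assumption that this commit does not confirm: that a visit marker is treated as stale once its last successful update is at least the timeout old, and that the compactor tries to refresh it once per update interval. The function names and the values used are hypothetical and are not Cortex defaults.

```python
def marker_is_expired(last_visit, now, timeout):
    """Assumed staleness rule: expired once the last visit is `timeout` old."""
    return now - last_visit >= timeout


def refresh_attempts_before_expiry(update_interval, timeout):
    """Count the refresh attempts that fit in before the marker expires.

    Attempt k (1-based) is assumed to happen k * update_interval seconds
    after the last successful write.
    """
    if update_interval <= 0:
        raise ValueError("update_interval must be positive")
    attempts = 0
    while not marker_is_expired(0, (attempts + 1) * update_interval, timeout):
        attempts += 1
    return attempts


# Illustrative values in seconds; not taken from the Cortex configuration.
baseline = refresh_attempts_before_expiry(update_interval=60, timeout=300)
longer_timeout = refresh_attempts_before_expiry(update_interval=60, timeout=600)
interval_equals_timeout = refresh_attempts_before_expiry(update_interval=300, timeout=300)
```

Under these assumptions, raising only the timeout leaves more room for slow or failed writes, whereas raising the update interval up to the timeout leaves no refresh attempt before the marker goes stale, so the two values are worth changing together.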
