Commit e57eeeb

Polish alarm recovery logic. (#13581)
1 parent cfbb00d commit e57eeeb

7 files changed: +53 -33 lines changed

docs/en/setup/backend/backend-alarm.md

Lines changed: 9 additions & 7 deletions
@@ -40,7 +40,8 @@ The metrics names in the expression could be found in the [list of all potential
  - **Silence period**. After the alarm is triggered at Time-N (TN), there will be silence during the **TN -> TN + period**.
  By default, it works in the same manner as **period**. The same Alarm (having the same ID in the same metrics name) may only be triggered once within a period.
  - **Recovery observation period**. Defines the number of consecutive periods that the alarm condition must remain false before the alarm is considered recovered. When the alarm condition becomes false, the system enters an observation period. If the condition remains false for the specified number of periods, a recovery notification is sent. If the condition becomes true again during the observation period, the alarm returns to the FIRING state.
- The default value is 0, which means immediate recovery notification when the condition becomes false.
+ The default value is 0, which means immediate recovery notification when the condition becomes false.
+ **Notice:** because the alarm will not be triggered again during the silence period, recovery won't be triggered during the silence period after an alarm is fired. The alarm stays in the OBSERVING_RECOVERY state, and the recovery is triggered only after the silence period is over and the condition has remained false for the specified observation periods.


  For example, for a metric, there is a shifting window as follows at T7.
@@ -523,15 +524,16 @@ stateDiagram-v2
  [*] --> NORMAL
  NORMAL --> FIRING: Expression true<br/>not in silence period

- FIRING --> SILENCED: Expression true<br/>in silence period
- FIRING --> OBSERVING_RECOVERY: Expression false<br/>in recovery window
- FIRING --> RECOVERED: Expression false<br/>not in recovery window
+ FIRING --> SILENCED_FIRING: Expression true<br/>in silence period
+ FIRING --> OBSERVING_RECOVERY: Expression false<br/>in recovery window or in silence period
+ FIRING --> RECOVERED: Expression false<br/>not in recovery window and not in silence period

  OBSERVING_RECOVERY --> FIRING: Expression true<br/>not in silence period
- OBSERVING_RECOVERY --> RECOVERED: Expression false<br/>not in recovery window
+ OBSERVING_RECOVERY --> SILENCED_FIRING: Expression true<br/>in silence period
+ OBSERVING_RECOVERY --> RECOVERED: Expression false<br/>not in recovery window and not in silence period

- SILENCED --> RECOVERED: Expression false<br/>not in recovery window
- SILENCED --> OBSERVING_RECOVERY: Expression false<br/>in recovery window
+ SILENCED_FIRING --> RECOVERED: Expression false<br/>not in recovery window and not in silence period
+ SILENCED_FIRING --> OBSERVING_RECOVERY: Expression false<br/>in recovery window or in silence period

  RECOVERED --> FIRING: Expression true<br/>not in silence period
  RECOVERED --> NORMAL: Expression false

docs/en/status/query_alarm_runtime_status.md

Lines changed: 6 additions & 2 deletions
@@ -158,8 +158,11 @@ Return the running context of the alarm rule.
  "endTime": "2025-11-10T09:39:00.000",
  "additionalPeriod": 0,
  "size": 10,
+ "silencePeriod": 3,
+ "recoveryObservationPeriod": 2,
  "silenceCountdown": 10,
  "recoveryObservationCountdown": 2,
+ "currentState": "FIRING",
  "entityName": "mock_b_service",
  "windowValues": [
  {
@@ -233,8 +236,9 @@ Return the running context of the alarm rule.
  `size` is the window size. Equal to the `period + additionalPeriod`.
  `silenceCountdown` is the countdown of the silence period. -1 means silence countdown is not running.
  `recoveryObservationCountdown` is the countdown of the recovery observation period.
- `windowValues` is the original metrics data. The `index` is the index of the window, starting from 0.
- `mqeMetricsSnapshot` is the metrics data in the MQE format. When checking conditions, these data will be calculated according to the expression.
+ `windowValues` is the original metrics data when the metrics come in. The `index` is the index of the window, starting from 0.
+ `mqeMetricsSnapshot` is the metrics data in the MQE format which is generated when executing the checking.
+ These data will be calculated according to the expression.

  ## Get Errors When Querying Status from OAP Instances
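
A client reading this status output could combine the newly exposed fields (`silencePeriod`, `recoveryObservationPeriod`, `currentState`) with the two countdowns roughly as follows. This is a minimal sketch only: the class `RunningContextView` and its methods are hypothetical, the values are passed in directly rather than parsed from the JSON response, and the silence check relies on the documented statement that `-1` means the silence countdown is not running.

```java
// Hypothetical read-only view over a few fields of the alarm running context.
// Not part of SkyWalking; field names mirror the JSON keys shown in the docs.
class RunningContextView {
    final int silencePeriod;
    final int silenceCountdown;
    final int recoveryObservationPeriod;
    final int recoveryObservationCountdown;
    final String currentState;

    RunningContextView(int silencePeriod, int silenceCountdown,
                       int recoveryObservationPeriod, int recoveryObservationCountdown,
                       String currentState) {
        this.silencePeriod = silencePeriod;
        this.silenceCountdown = silenceCountdown;
        this.recoveryObservationPeriod = recoveryObservationPeriod;
        this.recoveryObservationCountdown = recoveryObservationCountdown;
        this.currentState = currentState;
    }

    // A negative countdown is treated as "silence period not running" (documented as -1).
    boolean isSilenceRunning() {
        return silenceCountdown >= 0;
    }

    // One-line, human-readable summary of the state and both countdowns.
    String summary() {
        String silence = isSilenceRunning()
            ? "running, countdown " + silenceCountdown
            : "not running";
        return currentState + " (silence " + silence
            + ", recovery observation countdown "
            + recoveryObservationCountdown + "/" + recoveryObservationPeriod + ")";
    }

    public static void main(String[] args) {
        // Values copied from the sample response in the documentation hunk.
        RunningContextView view = new RunningContextView(3, 10, 2, 2, "FIRING");
        System.out.println(view.summary());
    }
}
```

Exposing the configured periods next to the countdowns lets a reader of the status output see both where the rule currently is and what it was configured with, without looking up the rule file.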

oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/AlarmStatusWatcher.java

Lines changed: 3 additions & 0 deletions
@@ -136,8 +136,11 @@ public String getAlarmRuleContext(final String ruleName, final String entityName
  runningContext.setEndTime(window.getEndTime().toString());
  runningContext.setAdditionalPeriod(window.getAdditionalPeriod());
  runningContext.setSize(window.getSize());
+ runningContext.setSilencePeriod(window.getStateMachine().getSilencePeriod());
+ runningContext.setRecoveryObservationPeriod(window.getStateMachine().getRecoveryObservationPeriod());
  runningContext.setSilenceCountdown(window.getStateMachine().getSilenceCountdown());
  runningContext.setRecoveryObservationCountdown(window.getStateMachine().getRecoveryObservationCountdown());
+ runningContext.setCurrentState(window.getStateMachine().getCurrentState().name());
  window.scanWindowValues(values -> {
      for (int i = 0; i < values.size(); i++) {
          AlarmRunningContext.WindowValue windowValue = new AlarmRunningContext.WindowValue();

oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java

Lines changed: 14 additions & 12 deletions
@@ -237,7 +237,7 @@ public List<AlarmMessage> check() {
  public enum State {
      NORMAL,
      FIRING,
-     SILENCED,
+     SILENCED_FIRING,
      OBSERVING_RECOVERY,
      RECOVERED
  }
@@ -477,14 +477,12 @@ private void init() {
      }
  }

+ @Getter
  public class AlarmStateMachine {
-     @Getter
      private int silenceCountdown;
-     @Getter
      private int recoveryObservationCountdown;
      private final int silencePeriod;
      private final int recoveryObservationPeriod;
-     @Getter
      private State currentState;

      public AlarmStateMachine(int silencePeriod, int recoveryObservationPeriod) {
@@ -503,16 +501,20 @@ public void onMatch() {
  silenceCountdown--;
  switch (currentState) {
      case NORMAL:
-     case SILENCED:
+         transitionTo(State.FIRING);
+         break;
+     case SILENCED_FIRING:
      case OBSERVING_RECOVERY:
      case RECOVERED:
          if (silenceCountdown < 0) {
              transitionTo(State.FIRING);
+         } else {
+             transitionTo(State.SILENCED_FIRING);
          }
          break;
      case FIRING:
          if (silenceCountdown >= 0) {
-             transitionTo(State.SILENCED);
+             transitionTo(State.SILENCED_FIRING);
          }
          break;
      default:
@@ -531,15 +533,15 @@ public void onMismatch() {
  silenceCountdown--;
  switch (currentState) {
      case FIRING:
-     case SILENCED:
-         if (this.recoveryObservationCountdown < 0) {
+     case SILENCED_FIRING:
+         if (this.recoveryObservationCountdown < 0 && silenceCountdown < 0) {
              transitionTo(State.RECOVERED);
          } else {
              transitionTo(State.OBSERVING_RECOVERY);
          }
          break;
      case OBSERVING_RECOVERY:
-         if (recoveryObservationCountdown < 0) {
+         if (recoveryObservationCountdown < 0 && silenceCountdown < 0) {
              transitionTo(State.RECOVERED);
          }
          break;
@@ -564,9 +566,9 @@ private void transitionTo(State newState) {
          break;
      case FIRING:
          this.silenceCountdown = this.silencePeriod;
-         this.recoveryObservationCountdown = recoveryObservationPeriod;
+         this.recoveryObservationCountdown = this.recoveryObservationPeriod;
          break;
-     case SILENCED:
+     case SILENCED_FIRING:
          break;
      case OBSERVING_RECOVERY:
          this.recoveryObservationCountdown = this.recoveryObservationPeriod - 1;
@@ -578,7 +580,7 @@ private void transitionTo(State newState) {
      }

      private void resetCountdowns() {
-         recoveryObservationCountdown = this.recoveryObservationPeriod;
+         this.recoveryObservationCountdown = this.recoveryObservationPeriod;
      }

  }
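
The combined effect of the `onMatch()` and `onMismatch()` hunks can be traced with a small stand-alone model. This is a minimal sketch, not the class from `RunningRule.java`: the name `AlarmStateMachineSketch` and the `replay` helper are hypothetical, and two behaviours are assumptions because they fall outside the visible hunks, namely that `recoveryObservationCountdown` is decremented on every mismatch and that `RECOVERED` returns to `NORMAL` on a false expression (taken from the state diagram).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the alarm state machine after this commit.
// Assumptions: recoveryObservationCountdown decrements on every mismatch,
// and RECOVERED falls back to NORMAL on a mismatch (per the state diagram).
class AlarmStateMachineSketch {
    enum State { NORMAL, FIRING, SILENCED_FIRING, OBSERVING_RECOVERY, RECOVERED }

    private final int silencePeriod;
    private final int recoveryObservationPeriod;
    private int silenceCountdown;
    private int recoveryObservationCountdown;
    private State currentState = State.NORMAL;

    AlarmStateMachineSketch(int silencePeriod, int recoveryObservationPeriod) {
        this.silencePeriod = silencePeriod;
        this.recoveryObservationPeriod = recoveryObservationPeriod;
    }

    // The expression evaluated to true for this period.
    State onMatch() {
        silenceCountdown--;
        switch (currentState) {
            case NORMAL:
                transitionTo(State.FIRING);
                break;
            case SILENCED_FIRING:
            case OBSERVING_RECOVERY:
            case RECOVERED:
                transitionTo(silenceCountdown < 0 ? State.FIRING : State.SILENCED_FIRING);
                break;
            case FIRING:
                if (silenceCountdown >= 0) {
                    transitionTo(State.SILENCED_FIRING);
                }
                break;
        }
        return currentState;
    }

    // The expression evaluated to false for this period.
    State onMismatch() {
        silenceCountdown--;
        recoveryObservationCountdown--; // assumed, not visible in the diff
        switch (currentState) {
            case FIRING:
            case SILENCED_FIRING:
                transitionTo(canRecover() ? State.RECOVERED : State.OBSERVING_RECOVERY);
                break;
            case OBSERVING_RECOVERY:
                if (canRecover()) {
                    transitionTo(State.RECOVERED);
                }
                break;
            case RECOVERED:
                transitionTo(State.NORMAL); // assumed from the state diagram
                break;
            case NORMAL:
                break;
        }
        return currentState;
    }

    // Recovery needs both the observation window and the silence period to be over.
    private boolean canRecover() {
        return recoveryObservationCountdown < 0 && silenceCountdown < 0;
    }

    private void transitionTo(State next) {
        if (next == State.FIRING) {
            silenceCountdown = silencePeriod;
            recoveryObservationCountdown = recoveryObservationPeriod;
        } else if (next == State.OBSERVING_RECOVERY) {
            recoveryObservationCountdown = recoveryObservationPeriod - 1;
        }
        currentState = next;
    }

    // Replays expression results (true = match) and records the state after each one.
    static List<State> replay(int silencePeriod, int recoveryObservationPeriod, boolean... results) {
        AlarmStateMachineSketch machine =
            new AlarmStateMachineSketch(silencePeriod, recoveryObservationPeriod);
        List<State> trace = new ArrayList<>();
        for (boolean matched : results) {
            trace.add(matched ? machine.onMatch() : machine.onMismatch());
        }
        return trace;
    }

    public static void main(String[] args) {
        // Silence period 2, recovery observation period 0 (the default).
        System.out.println(replay(2, 0, true, true, false, false, false));
        // prints [FIRING, SILENCED_FIRING, OBSERVING_RECOVERY, RECOVERED, NORMAL]
    }
}
```

In this trace the first false result does not recover the alarm even though the observation period is 0: the silence countdown is still running, so the machine waits in `OBSERVING_RECOVERY` for one more period. With the old condition (observation countdown only) the same input would have recovered one step earlier.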

oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/status/AlarmRunningContext.java

Lines changed: 3 additions & 0 deletions
@@ -30,8 +30,11 @@ public class AlarmRunningContext {
  private String endTime;
  private int additionalPeriod;
  private int size;
+ private int silencePeriod;
+ private int recoveryObservationPeriod;
  private int silenceCountdown;
  private int recoveryObservationCountdown;
+ private String currentState;
  private String entityName;
  private List<WindowValue> windowValues = new ArrayList<>();
  private JsonObject mqeMetricsSnapshot;

oap-server/server-alarm-plugin/src/test/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRuleTest.java

Lines changed: 17 additions & 11 deletions
@@ -740,13 +740,13 @@ public void testAlarmStateMachine_OnlySilencePeriod() throws IllegalExpressionEx
  runningRule.in(getMetaInAlarm(123), getMetrics(TimeBucket.getMinuteTimeBucket(startTime.plusMinutes(1).getMillis()), 72));
  alarmMessages = getAlarmFiringMessageList(runningRule.check());
  Assertions.assertEquals(0, alarmMessages.size(), "Should be silenced");
- Assertions.assertEquals(RunningRule.State.SILENCED, stateMachine.getCurrentState());
+ Assertions.assertEquals(RunningRule.State.SILENCED_FIRING, stateMachine.getCurrentState());

  runningRule.in(getMetaInAlarm(123), getMetrics(TimeBucket.getMinuteTimeBucket(startTime.plusMinutes(2).getMillis()), 72));
  runningRule.moveTo(startTime.plusMinutes(2).toLocalDateTime());
  alarmMessages = getAlarmFiringMessageList(runningRule.check());
  Assertions.assertEquals(0, alarmMessages.size(), "Should be silenced");
- Assertions.assertEquals(RunningRule.State.SILENCED, stateMachine.getCurrentState());
+ Assertions.assertEquals(RunningRule.State.SILENCED_FIRING, stateMachine.getCurrentState());

  runningRule.in(getMetaInAlarm(123), getMetrics(TimeBucket.getMinuteTimeBucket(startTime.plusMinutes(3).getMillis()), 80));
  runningRule.moveTo(startTime.plusMinutes(3).toLocalDateTime());
@@ -758,16 +758,22 @@ public void testAlarmStateMachine_OnlySilencePeriod() throws IllegalExpressionEx
  runningRule.moveTo(startTime.plusMinutes(4).toLocalDateTime());
  alarmMessages = getAlarmFiringMessageList(runningRule.check());
  Assertions.assertEquals(0, alarmMessages.size(), "Should be silenced");
- Assertions.assertEquals(RunningRule.State.SILENCED, stateMachine.getCurrentState());
+ Assertions.assertEquals(RunningRule.State.SILENCED_FIRING, stateMachine.getCurrentState());

  runningRule.in(getMetaInAlarm(123), getMetrics(TimeBucket.getMinuteTimeBucket(startTime.plusMinutes(5).getMillis()), 80));
  runningRule.moveTo(startTime.plusMinutes(5).toLocalDateTime());
  alarmMessages = getAlarmRecoveryMessageList(runningRule.check());
- Assertions.assertEquals(1, alarmMessages.size(), "Should recover immediately");
- Assertions.assertEquals(RunningRule.State.RECOVERED, stateMachine.getCurrentState());
+ Assertions.assertEquals(0, alarmMessages.size(), "Should not recover immediately");
+ Assertions.assertEquals(RunningRule.State.OBSERVING_RECOVERY, stateMachine.getCurrentState());

  runningRule.in(getMetaInAlarm(123), getMetrics(TimeBucket.getMinuteTimeBucket(startTime.plusMinutes(6).getMillis()), 80));
  runningRule.moveTo(startTime.plusMinutes(6).toLocalDateTime());
+ alarmMessages = getAlarmRecoveryMessageList(runningRule.check());
+ Assertions.assertEquals(1, alarmMessages.size(), "Should recover after silence period");
+ Assertions.assertEquals(RunningRule.State.RECOVERED, stateMachine.getCurrentState());
+
+ runningRule.in(getMetaInAlarm(123), getMetrics(TimeBucket.getMinuteTimeBucket(startTime.plusMinutes(7).getMillis()), 80));
+ runningRule.moveTo(startTime.plusMinutes(7).toLocalDateTime());
  alarmMessages = getAlarmFiringMessageList(runningRule.check());
  Assertions.assertEquals(0, alarmMessages.size(), "Should be normal");
  Assertions.assertEquals(RunningRule.State.NORMAL, stateMachine.getCurrentState());
@@ -858,23 +864,23 @@ public void testAlarmStateMachine_SilenceGreaterThanRecovery() throws IllegalExp
      alarmMessages = getAlarmFiringMessageList(runningRule.check());
      if (i < 3) {
          Assertions.assertEquals(0, alarmMessages.size(), "Should be silenced at minute " + i);
-         Assertions.assertEquals(RunningRule.State.SILENCED, stateMachine.getCurrentState());
+         Assertions.assertEquals(RunningRule.State.SILENCED_FIRING, stateMachine.getCurrentState());
      } else {
          Assertions.assertEquals(1, alarmMessages.size(), "Should fire after silence period");
          Assertions.assertEquals(RunningRule.State.FIRING, stateMachine.getCurrentState());
      }
  }
- for (int i = 0; i <= 2; i++) {
+ for (int i = 0; i <= 3; i++) {
      runningRule.moveTo(startTime.plusMinutes(8 + i).toLocalDateTime());
      runningRule.in(getMetaInAlarm(123), getMetrics(
          TimeBucket.getMinuteTimeBucket(startTime.plusMinutes(8 + i).getMillis()), 80));
-     if (i < 2) {
+     if (i < 3) {
          List<AlarmMessage> recoveryMessages = getAlarmRecoveryMessageList(runningRule.check());
          Assertions.assertEquals(0, recoveryMessages.size(), "Should not recover immediately");
          Assertions.assertEquals(RunningRule.State.OBSERVING_RECOVERY, stateMachine.getCurrentState());
      } else {
          List<AlarmMessage> recoveryMessages = getAlarmRecoveryMessageList(runningRule.check());
-         Assertions.assertEquals(1, recoveryMessages.size(), "Should recover after observation period");
+         Assertions.assertEquals(1, recoveryMessages.size(), "Should recover after silence period");
          Assertions.assertEquals(RunningRule.State.RECOVERED, stateMachine.getCurrentState());
      }
  }
@@ -914,12 +920,12 @@ public void testAlarmStateMachine_RecoveryGreaterThanSilence() throws IllegalExp
  runningRule.moveTo(startTime.plusMinutes(1).toLocalDateTime());
  alarmMessages = getAlarmFiringMessageList(runningRule.check());
  Assertions.assertEquals(0, alarmMessages.size(), "Should be silenced");
- Assertions.assertEquals(RunningRule.State.SILENCED, stateMachine.getCurrentState());
+ Assertions.assertEquals(RunningRule.State.SILENCED_FIRING, stateMachine.getCurrentState());

  runningRule.moveTo(startTime.plusMinutes(2).toLocalDateTime());
  alarmMessages = getAlarmFiringMessageList(runningRule.check());
  Assertions.assertEquals(0, alarmMessages.size(), "Should be silenced");
- Assertions.assertEquals(RunningRule.State.SILENCED, stateMachine.getCurrentState());
+ Assertions.assertEquals(RunningRule.State.SILENCED_FIRING, stateMachine.getCurrentState());

  runningRule.moveTo(startTime.plusMinutes(3).toLocalDateTime());
  alarmMessages = getAlarmFiringMessageList(runningRule.check());

test/e2e-v2/cases/alarm/alarm-settings.yml

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ rules:
        - webhook.custom
    comp_rule:
      expression: sum((service_resp_time > 100) && (service_sla > 1)) >= 1
-     period: 10
+     period: 5
      recovery-observation-period: 3
      message: Service {name} response time is more than 100ms and sla is more than 1%.
      tags:
