| 1 | + groups: |
| 2 | + - name: OTPE |
| 3 | + interval: 15s |
| 4 | + rules: |
| 5 | + - alert: Config Digest Duplicate |
| 6 | + expr: otpe_config_digest_duplicates_total != 0 |
| 7 | + labels: |
| 8 | + severity: critical |
| 9 | + team: incident-response |
| 10 | + annotations: |
| 11 | + summary: Something really unexpected happened. Let Lorenz know. |
| 12 | + |
| 13 | + - name: Telemetry Ingestion |
| 14 | + interval: 15s |
| 15 | + rules: |
| 16 | + - record: sum:ocr_telemetry_ingested_total |
| 17 | + expr: sum without (contract, oracle) (ocr_telemetry_ingested_total) |
| 18 | + - record: bool:telemetry_down |
| 19 | + expr: (rate (sum:ocr_telemetry_ingested_total[1m])) == bool 0 |
| 20 | + - alert: Telemetry Down (infra) |
| 21 | + expr: bool:telemetry_down == 1 |
| 22 | + for: 5m |
| 23 | + labels: |
| 24 | + severity: critical |
| 25 | + team: infra |
| 26 | + annotations: |
| 27 | + summary: OTPE is not receiving any telemetry at all. |
| 28 | + |
| 29 | + - alert: Telemetry Down (o11y) |
| 30 | + expr: bool:telemetry_down == 1 |
| 31 | + for: 5m |
| 32 | + labels: |
| 33 | + severity: critical |
| 34 | + team: monitoring |
| 35 | + annotations: |
| 36 | + summary: OTPE is not receiving any telemetry at all. |
| 37 | + |
| 38 | + - name: Contract Configuration |
| 39 | + interval: 15s |
| 40 | + rules: |
| 41 | + - record: bool:contract_oracle_active |
| 42 | + # use max_over_time to be resistant to exporter restarts/glitches |
| 43 | + expr: max_over_time(ocr_contract_oracle_active[1m]) > bool 0 |
| 44 | + - record: bool:contract_active |
| 45 | + expr: sum without(oracle) (bool:contract_oracle_active) > bool 0 |
| 46 | + - record: bool:oracle_active |
| 47 | + expr: sum without(contract) (bool:contract_oracle_active) > bool 0 |
| 48 | + |
| 49 | + - name: Oracle & Feed |
| 50 | + interval: 15s |
| 51 | + rules: |
| 52 | + - record: bool:oracle_feed_telemetry_down |
| 53 | + expr: (rate(ocr_telemetry_ingested_total[2m]) == bool 0) * bool:contract_oracle_active |
| 54 | + - record: bool:oracle_feed_blind |
| 55 | + # TODO: It would be better to make this based on a rate of a total |
| 56 | + expr: (max_over_time(ocr_telemetry_message_report_req_observation_included[2m]) == bool 0) * bool:contract_oracle_active |
| 57 | + |
| 58 | + - name: Oracle |
| 59 | + interval: 15s |
| 60 | + rules: |
| 61 | + - record: bool:oracle_blind |
| 62 | + expr: min without(contract) (bool:oracle_feed_blind) * bool:oracle_active |
| 63 | + - record: bool:oracle_blind_except_telemetry_down |
| 64 | + expr: bool:oracle_blind * ignoring (oracle) group_left() (1 - bool:telemetry_down) |
| 65 | + |
| 66 | + # Oracle Blind EXCEPT Telemetry Down |
| 67 | + - alert: No observations from an OCR oracle |
| 68 | + expr: bool:oracle_blind_except_telemetry_down == 1 |
| 69 | + for: 3m |
| 70 | + labels: |
| 71 | + severity: critical |
| 72 | + team: incident-response |
| 73 | + annotations: |
| 74 | + summary: Oracle {{ $labels.oracle }} has made no observations. Perhaps the oracle is down or having data source issues? Reach out to the node op. |
| 75 | + - record: bool:oracle_telemetry_down |
| 76 | + expr: min without(contract) (bool:oracle_feed_telemetry_down) * bool:oracle_active |
| 77 | + - record: bool:oracle_telemetry_down_except_telemetry_down |
| 78 | + expr: bool:oracle_telemetry_down * ignoring (oracle) group_left() (1 - bool:telemetry_down) |
| 79 | + |
| 80 | + # Oracle Telemetry Down EXCEPT Telemetry Down |
| 81 | + - alert: No telemetry from an OCR oracle |
| 82 | + expr: bool:oracle_telemetry_down_except_telemetry_down == 1 |
| 83 | + for: 20m |
| 84 | + labels: |
| 85 | + severity: critical |
| 86 | + team: incident-response |
| 87 | + annotations: |
| 88 | + summary: Not receiving any telemetry for {{ $labels.oracle }}. Perhaps the oracle is down or having issues with the telemetry transport? Reach out to the node op. |
| 89 | + |
| 90 | + - name: Feed |
| 91 | + interval: 15s |
| 92 | + rules: |
| 93 | + - record: bool:feed_telemetry_down |
| 94 | + expr: min without(oracle) (bool:oracle_feed_telemetry_down) * bool:contract_active |
| 95 | + - record: bool:feed_telemetry_down_except_telemetry_down |
| 96 | + expr: bool:feed_telemetry_down * ignoring (contract) group_left() (1 - bool:telemetry_down) |
| 97 | + |
| 98 | + # Feed Telemetry Down EXCEPT Telemetry Down |
| 99 | + - alert: No telemetry on an OCR feed |
| 100 | + expr: bool:feed_telemetry_down_except_telemetry_down == 1 |
| 101 | + for: 4m |
| 102 | + labels: |
| 103 | + severity: critical |
| 104 | + team: incident-response |
| 105 | + annotations: |
| 106 | + summary: Not receiving any telemetry for {{ $labels.contract }}. Are all nodes down or not sending telemetry? |
| 107 | + - record: bool:feed_stalled |
| 108 | + expr: (rate(ocr_telemetry_feed_agreed_epoch[5m]) == bool 0) * bool:contract_active |
| 109 | + - record: bool:feed_stalled_except_telemetry_down |
| 110 | + expr: bool:feed_stalled * (1 - bool:feed_telemetry_down) |
| 111 | + |
| 112 | + # Alert if no new round seen after 90 seconds, unless feed fails to report or OTPE is not receiving any telemetry at all |
| 113 | + - alert: Rounds have stopped progressing on an OCR feed |
| 114 | + expr: |
| 115 | + ( |
| 116 | + (sum(rate(ocr_telemetry_epoch_round[10m])) by (contract, job, cluster, instance) < 1./90 == bool 0) |
| 117 | + * bool:contract_active |
| 118 | + * (1-bool:feed_telemetry_down) |
| 119 | + ) == 1 |
| 120 | + labels: |
| 121 | + severity: critical |
| 122 | + team: incident-response |
| 123 | + annotations: |
| 124 | + summary: New rounds are not being created on feed {{ $labels.contract }} at the expected rate. Maybe the feed has stalled. Reach out to node operators to corroborate this. If they are not seeing any runs, escalate and consider failing over to FluxMonitor if you cannot resolve this quickly. |
| 125 | + |
| 126 | + # Feed Stalled EXCEPT Feed Telemetry Down |
| 127 | + - alert: Epochs have stopped progressing on an OCR feed |
| 128 | + expr: bool:feed_stalled_except_telemetry_down == 1 |
| 129 | + for: 5m |
| 130 | + labels: |
| 131 | + severity: critical |
| 132 | + team: incident-response |
| 133 | + annotations: |
| 134 | + summary: New epochs are not being created on feed {{ $labels.contract }} at the expected rate. Maybe the feed has stalled. Reach out to node operators to corroborate this. If they are not seeing any runs, escalate and consider failing over to FluxMonitor if you cannot resolve this quickly. |
| 135 | + # This is not particularly actionable, so commenting out for now. We can think about improved versions later. |
| 136 | + # - record: bool:feed_fast_epochs |
| 137 | + # expr: (rate(ocr_telemetry_feed_agreed_epoch[6m]) > bool 3/(ocr_contract_config_r_max * ocr_contract_config_delta_round_seconds)) * bool:contract_active |
| 138 | + # - alert: Feed Fast Epochs |
| 139 | + # expr: bool:feed_fast_epochs == 1 |
| 140 | + # for: 3m |
| 141 | + # labels: |
| 142 | + # severity: critical |
| 143 | + # slack_channel: ocr-telemetry-beta-group |
| 144 | + # annotations: |
| 145 | + # summary: Feed is moving through epochs much faster than expected {{ $labels.contract }}. Perhaps a few nodes are down? |
| 146 | + - record: bool:feed_close_to_reporting_failure |
| 147 | + expr: (max_over_time(ocr_telemetry_feed_message_report_req_size[2m]) < bool 2*ocr_contract_config_f+1 + 2) * bool:contract_active |
| 148 | + - record: bool:feed_close_to_reporting_failure_except_feed_telemetry_down |
| 149 | + expr: bool:feed_close_to_reporting_failure * (1 - bool:feed_telemetry_down) |
| 150 | + |
| 151 | + # Feed Close To Reporting Failure EXCEPT Feed Telemetry Down |
| 152 | + - alert: OCR feed close to reporting failure |
| 153 | + expr: bool:feed_close_to_reporting_failure_except_feed_telemetry_down == 1 |
| 154 | + for: 3m |
| 155 | + labels: |
| 156 | + severity: critical |
| 157 | + team: incident-response |
| 158 | + annotations: |
| 159 | + summary: Feed {{ $labels.contract }} is within two oracles of reporting failure. Reach out to node ops that are having issues ASAP and consider replacing them. |
| 160 | + - record: bool:feed_reporting_failure |
| 161 | + expr: (rate(ocr_telemetry_feed_message_report_req_total[4m]) == bool 0) * bool:contract_active |
| 162 | + - record: bool:feed_reporting_failure_except_feed_telemetry_down |
| 163 | + expr: bool:feed_reporting_failure * (1 - bool:feed_telemetry_down) |
| 164 | + |
| 165 | + # Feed Reporting Failure EXCEPT Feed Telemetry Down |
| 166 | + - alert: OCR feed reporting failure |
| 167 | + expr: bool:feed_reporting_failure_except_feed_telemetry_down == 1 |
| 168 | + for: 4m |
| 169 | + labels: |
| 170 | + severity: critical |
| 171 | + team: incident-response |
| 172 | + annotations: |
| 173 | + summary: Feed {{ $labels.contract }} is experiencing reporting failure! Reach out to node ops to confirm and consider failing over to FluxMonitor. |
| 174 | + |
| 175 | + - name: Oracle & Feed Except Oracle |
| 176 | + interval: 15s |
| 177 | + rules: |
| 178 | + # Currently not useful due to unreliable telemetry ingestion |
| 179 | + # - record: bool:oracle_feed_telemetry_down_except_oracle_telemetry_down_except_feed_telemetry_down |
| 180 | + # expr: (bool:oracle_feed_telemetry_down * ignoring (contract) group_left() (1 - bool:oracle_telemetry_down)) * ignoring (oracle) group_left() (1 - bool:feed_telemetry_down) |
| 181 | + # - alert: Oracle & Feed Telemetry Down EXCEPT Oracle Telemetry Down EXCEPT Feed Telemetry Down |
| 182 | + # expr: bool:oracle_feed_telemetry_down_except_oracle_telemetry_down_except_feed_telemetry_down == 1 |
| 183 | + # for: 30m |
| 184 | + # labels: |
| 185 | + # severity: warning |
| 186 | + # slack_channel: ocr-telemetry-beta-group |
| 187 | + # annotations: |
| 188 | + # summary: Not receiving any telemetry from oracle {{ $labels.oracle }} on feed {{ $labels.contract }}. Reach out to the node op. |
| 189 | + - record: bool:oracle_feed_blind_except_oracle_blind_except_feed_reporting_failure_except_feed_telemetry_down |
| 190 | + expr: (bool:oracle_feed_blind * ignoring (contract) group_left() (1 - bool:oracle_blind)) * ignoring (oracle) group_left() (1 - bool:feed_reporting_failure) * ignoring (oracle) group_left() (1 - bool:feed_telemetry_down) |
| 191 | + |
| 192 | + # Oracle & Feed Blind EXCEPT Oracle Blind EXCEPT Feed Reporting Failure EXCEPT Feed Telemetry Down |
| 193 | + - alert: Oracle not making observations on an OCR feed |
| 194 | + expr: bool:oracle_feed_blind_except_oracle_blind_except_feed_reporting_failure_except_feed_telemetry_down == 1 |
| 195 | + for: 10m |
| 196 | + labels: |
| 197 | + severity: warning |
| 198 | + team: incident-response |
| 199 | + annotations: |
| 200 | + summary: Oracle {{ $labels.oracle }} is able to make observations, yet I'm not receiving any observations from it on feed {{ $labels.contract }}. Perhaps a data source issue? Reach out to the node op. |
| 201 | + |
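
The rules in this diff could be exercised with a promtool unit test. The following is a minimal sketch, not part of the commit: the file name `otpe_rules.yml` is an assumption (the diff does not show the real path), and only the simplest alert, `Config Digest Duplicate`, is covered. The expected labels and summary are copied from the rule as written above.

```yaml
# Hypothetical promtool unit test for the rules in this diff.
# Assumption: the rules are saved next to this file as `otpe_rules.yml`.
rule_files:
  - otpe_rules.yml

# Matches the `interval: 15s` used by every group in the rules file.
evaluation_interval: 15s

tests:
  - interval: 15s
    input_series:
      # A non-zero duplicate counter; the alert has no `for:` clause,
      # so it is expected to fire on the first evaluation that sees it.
      - series: 'otpe_config_digest_duplicates_total'
        values: '1 1 1 1 1'
    alert_rule_test:
      - eval_time: 1m
        alertname: Config Digest Duplicate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: incident-response
            exp_annotations:
              summary: Something really unexpected happened. Let Lorenz know.
```

Such a file would be run with `promtool test rules <test-file>`. Testing the `... EXCEPT Telemetry Down` alerts the same way would need input series for every metric in the chain of recording rules, which is why this sketch stops at the one alert that reads a raw metric directly.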