Commit 9100ad0

infra: add watchdog for batcher (#1311)
1 parent 7aa630c commit 9100ad0

3 files changed: +82 -0 lines changed

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
PROMETHEUS_URL=<ip>:<port>
SYSTEMD_SERVICE=batcher
PROMETHEUS_COUNTER=sent_batches
PROMETHEUS_BOT=batcher
PROMETHEUS_INTERVAL=20m
SLACK_WEBHOOK_URL=<>

infra/watchdog/batcher/README.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# Batcher Watchdog

The Batcher Watchdog checks a Prometheus metric and restarts the Batcher as needed.

The metric is the number of batches sent in the last N minutes, where N is defined in the `PROMETHEUS_INTERVAL` variable. Let's call this metric `sent_batches`.

Since we are sending proofs constantly, the ideal behaviour is the creation of a task every 3 Ethereum blocks (~36 secs). So, if the `sent_batches` metric is 0, there is a problem in the Batcher: for example, a transaction is stuck in Ethereum and the Batcher is locked waiting for it. If this happens, the Watchdog restarts the Batcher.
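
As a rough illustration of the check described above, the sketch below assembles the PromQL expression for `sent_batches` over the configured interval. The expression mirrors the one used in the watchdog script in this commit; the helper names `build_promql` and `query_sent_batches` are hypothetical and not part of the commit, and `query_sent_batches` assumes `curl` and `jq` are installed.

```shell
# Hypothetical helper: build the PromQL expression the watchdog evaluates.
# Arguments: counter name, value of the "bot" label, lookback interval.
build_promql() {
  printf 'floor(increase(%s{bot="%s"}[%s]))' "$1" "$2" "$3"
}

# Hypothetical helper: run the instant query and print the raw value.
# -G together with --data-urlencode lets curl URL-encode the expression
# instead of passing braces and brackets through with -g.
query_sent_batches() {
  curl -sG "http://$PROMETHEUS_URL/api/v1/query" \
    --data-urlencode "query=$(build_promql "$PROMETHEUS_COUNTER" "$PROMETHEUS_BOT" "$PROMETHEUS_INTERVAL")" \
    | jq -r '.data.result[0].value[1]'
}

# Print the expression for the example configuration.
build_promql sent_batches batcher 20m
echo
```

With the example values this prints `floor(increase(sent_batches{bot="batcher"}[20m]))`, which can be pasted into the Prometheus expression browser to inspect the value by hand.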

## Configuration

Create a `.env` file with the following variables:

```
PROMETHEUS_URL=<ip>:<port>
SYSTEMD_SERVICE=batcher
PROMETHEUS_COUNTER=sent_batches
PROMETHEUS_BOT=batcher
PROMETHEUS_INTERVAL=20m
SLACK_WEBHOOK_URL=<>
```

There is a `.env.example` file in this directory.
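
Because the script reads all six variables from the file, a missing entry may only show up later as a malformed query or a failed Slack call. A minimal pre-flight check might look like the sketch below; `missing_vars` is a hypothetical helper, not something the commit provides.

```shell
# Hypothetical pre-flight check: print the name of every required variable
# that is unset or empty, one per line. Prints nothing when all are set.
missing_vars() {
  for name in PROMETHEUS_URL SYSTEMD_SERVICE PROMETHEUS_COUNTER \
              PROMETHEUS_BOT PROMETHEUS_INTERVAL SLACK_WEBHOOK_URL; do
    # Indirect lookup of the variable called $name. The names are fixed
    # above, so eval only ever sees these six identifiers.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# Example use after sourcing the .env file (left commented out here):
# . /path/to/config/.env
# [ -z "$(missing_vars)" ] || { echo "missing: $(missing_vars)" >&2; exit 1; }
```

The check is deliberately plain POSIX shell so it could be dropped into the script or run on its own.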

## Run with Crontab

Open the Crontab configuration with `crontab -e` and add the following line:

```
*/10 * * * * /path/to/watchdog/batcher_watchdog.sh /path/to/config/.env >> /path/to/logs/folder/batcher_watchdog.log 2>&1
```

The cron interval must be half of `PROMETHEUS_INTERVAL` (`PROMETHEUS_INTERVAL/2`). The line above runs every 10 minutes, which matches `PROMETHEUS_INTERVAL=20m`.
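
The halving rule can be worked out mechanically. The sketch below derives the cron schedule from a `PROMETHEUS_INTERVAL` value; `cron_schedule` is a hypothetical helper, and it assumes the interval is a whole number of minutes written with an `m` suffix, as in the example configuration.

```shell
# Hypothetical helper: turn an interval such as "20m" into a cron schedule
# that fires every (interval / 2) minutes.
cron_schedule() {
  interval="$1"
  case "$interval" in
    *m) minutes="${interval%m}" ;;
    *) echo "unsupported interval: $interval" >&2; return 1 ;;
  esac
  # Accept only plain positive integers without leading zeros.
  case "$minutes" in
    ''|0*|*[!0-9]*) echo "unsupported interval: $interval" >&2; return 1 ;;
  esac
  step=$((minutes / 2))   # integer division: odd values round down
  if [ "$step" -lt 1 ] || [ "$step" -gt 59 ]; then
    echo "interval out of range for a minute step: $interval" >&2
    return 1
  fi
  printf '*/%s * * * *\n' "$step"
}

cron_schedule 20m   # prints: */10 * * * *
```

Intervals given in other units (for example `1h`) are rejected rather than guessed at.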

You can check logs in the specified file, for example:

```
Tue Oct 15 08:00:01 UTC 2024: tasks created in the last 20m: "25"
Tue Oct 15 08:20:01 UTC 2024: tasks created in the last 20m: "2"
Tue Oct 15 08:40:01 UTC 2024: tasks created in the last 20m: "0"
Tue Oct 15 08:40:01 UTC 2024: restarting batcher
Tue Oct 15 08:40:01 UTC 2024: batcher restarted
```
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
#!/bin/bash

# Load env file from first parameter
# Env variables:
# - PROMETHEUS_URL
# - SYSTEMD_SERVICE
# - PROMETHEUS_COUNTER
# - PROMETHEUS_BOT
# - PROMETHEUS_INTERVAL
# - SLACK_WEBHOOK_URL
source "$1"

# Function to send slack message
# @param message
function send_slack_message() {
  curl -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"$1\"}" \
    "$SLACK_WEBHOOK_URL"
}

# Get rate from prometheus
# jq prints the value as a JSON string, so the quotes are part of $rate
rate=$(curl -gs "http://${PROMETHEUS_URL}/api/v1/query?query=floor(increase(${PROMETHEUS_COUNTER}{bot=\"${PROMETHEUS_BOT}\"}[${PROMETHEUS_INTERVAL}]))" | jq '.data.result[0].value[1]')

echo "$(date): tasks created in the last $PROMETHEUS_INTERVAL: $rate"

# Check if rate is 0 (compared against the quoted JSON string "0")
if [ "$rate" = '"0"' ]; then
  # Restart systemd service
  echo "$(date): restarting $SYSTEMD_SERVICE"
  sudo systemctl restart "$SYSTEMD_SERVICE"
  message="$(date): $SYSTEMD_SERVICE restarted by watchdog"
  echo "$message"
  send_slack_message "$message"
fi
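
One behaviour worth noting: the script restarts the service only when the value is exactly `"0"`. If Prometheus is unreachable, `curl` produces no output and `rate` ends up empty; if the series does not exist, `jq` prints `null`. In both cases nothing is restarted. The sketch below makes those three outcomes explicit; `classify_rate` is a hypothetical helper, and whether an `unknown` result should raise an alert is a design decision the commit does not address.

```shell
# Hypothetical helper: decide what to do with the raw jq output.
# jq without -r prints a JSON string, so a zero count arrives as "0"
# (quotes included). A missing series makes jq print null, and a failed
# curl call leaves the value empty.
classify_rate() {
  case "$1" in
    '"0"') echo restart ;;
    ''|null) echo unknown ;;
    *) echo ok ;;
  esac
}

classify_rate '"0"'    # prints: restart
classify_rate 'null'   # prints: unknown
classify_rate '"25"'   # prints: ok
```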
