Commit 20ffa22

ci: improve handle-exit.sh timeout behavior (#1207)
For large validator folders, handle-exit.sh takes over 90 seconds to archive the folder, and systemd kills the script before it finishes. Because handle-exit.sh then cannot clean up its resources, the disk fills up. To improve the situation, I've increased the time limit to 30 minutes. I also clear out the validator folder when sig starts, because sig is unable to resume from old state. Previously, sig would crash repeatedly, and each crash triggered another failed attempt to archive the huge validator folder, leaving behind one partial tarball after another until the disk was full. Now the service tries to archive the folder to S3 once, and if that fails, it sets the folder aside and restarts.
1 parent 5c26bc1 commit 20ffa22
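
The "set the folder aside" step in the commit message can be sketched as a small function. This is only an illustration that mirrors the `rm -rf` and `mv` lines added to run.sh; the function name `set_aside_validator` and the temporary directory are hypothetical and not part of the commit.

```shell
#!/usr/bin/env bash
# Sketch only: keep at most one set-aside copy of the validator folder.
# set_aside_validator and the temp paths are hypothetical names.

set_aside_validator() {
    local base_dir="$1"
    if [ -d "$base_dir/validator" ]; then
        # Drop the previous set-aside copy first, so only the latest survives.
        rm -rf "$base_dir/validator-archive"
        mv "$base_dir/validator" "$base_dir/validator-archive"
    fi
}

demo_dir="$(mktemp -d)"

# First run leaves state behind, then the service restarts.
mkdir "$demo_dir/validator"
echo "run 1" > "$demo_dir/validator/state"
set_aside_validator "$demo_dir"

# Second run leaves different state behind, then the service restarts again.
mkdir "$demo_dir/validator"
echo "run 2" > "$demo_dir/validator/state"
set_aside_validator "$demo_dir"

archived_state="$(cat "$demo_dir/validator-archive/state")"
if [ -d "$demo_dir/validator" ]; then validator_present=yes; else validator_present=no; fi
echo "archived: $archived_state, validator folder present: $validator_present"

rm -rf "$demo_dir"
```

Removing the old `validator-archive` before the `mv` matters: if the destination directory still existed, `mv` would move the folder inside it instead of replacing it.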

2 files changed: 17 additions, 0 deletions

ci/run-and-update-service/run.sh

Lines changed: 16 additions & 0 deletions
@@ -14,4 +14,20 @@ fi
 # Delete log files older than 1 month
 find "$BASE_DIR/logs" -name "sig.*.log" -type f -mtime +30 -delete
 
+# Sig is unable to resume from existing validator state, so clear it before starting.
+if [ -d "$BASE_DIR/validator" ]; then
+    # We make an effort to save the old state in case it is needed for
+    # debugging, but only save the latest instance to avoid running out of disk
+    # space. There is already handle-exit.sh which uploads crash artifacts to
+    # S3, so this is only a last resort to deal with cascading failures.
+    rm -rf "$BASE_DIR/validator-archive"
+    mv "$BASE_DIR/validator" "$BASE_DIR/validator-archive"
+fi
+
+# Clear out old archives to avoid running out of disk space. They should have
+# been uploaded to S3 and then deleted right away. If not, we must be flooding
+# the disk with archives on failed uploads, and we're going to run out of disk
+# space.
+rm -f "$BASE_DIR/validator-*.tar.zst"
+
 "$BASE_DIR/zig-out/bin/sig" $@ 2>>"$BASE_DIR/logs/sig.log" >>"$BASE_DIR/logs/sig.log"

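One detail in the run.sh change may be worth checking. In `rm -f "$BASE_DIR/validator-*.tar.zst"` the `*` sits inside double quotes, so bash passes it to `rm` literally instead of expanding it, and `-f` suppresses the resulting "no such file" error. The sketch below contrasts that line with a variant where only the variable is quoted. The function names and the temporary directory are hypothetical, and the unquoted variant is a suggestion, not something the commit contains.

```shell
#!/usr/bin/env bash
# Sketch only: contrasts a quoted glob with an unquoted one.
# clear_archives_quoted mirrors the rm line in the run.sh diff;
# clear_archives_unquoted is a hypothetical variant, not part of the commit.

clear_archives_quoted() {
    local base_dir="$1"
    # The * is inside double quotes, so rm receives it as a literal character.
    rm -f "$base_dir/validator-*.tar.zst"
}

clear_archives_unquoted() {
    local base_dir="$1"
    # Only the variable is quoted; the glob stays outside the quotes and expands.
    rm -f "$base_dir"/validator-*.tar.zst
}

count_archives() {
    find "$1" -name 'validator-*.tar.zst' | wc -l | tr -d ' '
}

demo_dir="$(mktemp -d)"
touch "$demo_dir/validator-1.tar.zst" "$demo_dir/validator-2.tar.zst"

clear_archives_quoted "$demo_dir"
remaining_after_quoted="$(count_archives "$demo_dir")"

clear_archives_unquoted "$demo_dir"
remaining_after_unquoted="$(count_archives "$demo_dir")"

# prints "after quoted: 2, after unquoted: 0"
echo "after quoted: $remaining_after_quoted, after unquoted: $remaining_after_unquoted"

rm -rf "$demo_dir"
```
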
ci/run-and-update-service/sig.service

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ LimitNOFILE=infinity
 ExecStopPost=/home/sig/sig/ci/run-and-update-service/handle-exit.sh
 MemoryAccounting=yes
 MemoryMax=85%
+TimeoutStopSec=30m
 
 [Install]
 WantedBy=multi-user.target
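
The failure mode that `TimeoutStopSec=30m` addresses can be imitated without systemd. The sketch below is an analogy only: it uses the coreutils `timeout` command as a stand-in for systemd's stop timeout, and the script name, file names, and durations are all hypothetical. It shows a slow "archiver" being terminated at a deadline and leaving a partial output file behind, which is the same shape as the partial tarballs described in the commit message.

```shell
#!/usr/bin/env bash
# Sketch only: imitates a deadline cutting off a slow archiver.
# coreutils `timeout` stands in for systemd's stop timeout here;
# the script, file names, and durations are hypothetical.

demo_dir="$(mktemp -d)"

# A pretend archiver that needs about 5 seconds to write 5 chunks.
cat > "$demo_dir/slow-archive.sh" <<'EOF'
#!/usr/bin/env bash
for chunk in 1 2 3 4 5; do
    echo "chunk $chunk" >> "$1"
    sleep 1
done
EOF

# Allow only 2 seconds; timeout sends SIGTERM when the limit is reached
# and then exits with status 124.
exit_status=0
timeout 2 bash "$demo_dir/slow-archive.sh" "$demo_dir/partial.out" || exit_status=$?

chunks_written="$(wc -l < "$demo_dir/partial.out" | tr -d ' ')"
echo "exit status: $exit_status, chunks written: $chunks_written of 5"

rm -rf "$demo_dir"
```

With a limit longer than the work takes, the loop finishes and no partial file is left behind, which is the effect the 30-minute limit is meant to have for handle-exit.sh.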
