Commit 92a12cd

fstests_watchdog: add process-based fallback for test detection
When testing parallel writeback patches for over 5 days with generic/750 I noticed the following:

    ./scripts/workflows/fstests/fstests_watchdog.py hosts baseline
    Hostname                 Test-name    Completion %  runtime(s)  last-runtime(s)  Stall-status  Kernel                Crash-status
    pw2-xfs-reflink-4k       generic/750  0% (soak)     75290       0                OK            6.16.0-gbc97f3a7cc8f  OK
    pw2-xfs-reflink-8k-4ks   generic/750  0% (soak)     75291       0                OK            6.16.0-gbc97f3a7cc8f  OK
    pw2-xfs-reflink-16k-4ks  generic/750  0% (soak)     75290       0                OK            6.16.0-gbc97f3a7cc8f  OK
    pw2-xfs-reflink-32k-4ks  generic/750  0% (soak)     75292       0                OK            6.16.0-gbc97f3a7cc8f  OK
    pw2-xfs-reflink-64k-4ks  None         0%            0           0                OK            6.16.0-gbc97f3a7cc8f  OK

    Journal-method          Soak-duration(s)
    systemd-journal-remote  432000

But when I ssh to pw2-xfs-reflink-64k-4ks I can see generic/750 is running. The issue is the test has been running so long we don't see the kernel line any more about the test running.

When systemd journal and dmesg logs have rotated out test information (which happens on long-running VMs), fall back to checking running processes to detect which test is currently executing.

The fallback:

1. Uses SSH to check for 'check -s' processes on the host
2. Extracts the test name from the command line (last argument)
3. Gets the process runtime using 'ps -o etimes' to calculate duration

This ensures the watchdog can correctly identify running tests even when all log messages have rotated out, preventing false "None" test reports for actively running tests.

With this, I can now see what I expect:

    ./scripts/workflows/fstests/fstests_watchdog.py hosts baseline
    Hostname                 Test-name    Completion %  runtime(s)  last-runtime(s)  Stall-status  Kernel                Crash-status
    pw2-xfs-reflink-4k       generic/750  0% (soak)     76119       0                OK            6.16.0-gbc97f3a7cc8f  OK
    pw2-xfs-reflink-8k-4ks   generic/750  0% (soak)     76119       0                OK            6.16.0-gbc97f3a7cc8f  OK
    pw2-xfs-reflink-16k-4ks  generic/750  0% (soak)     76119       0                OK            6.16.0-gbc97f3a7cc8f  OK
    pw2-xfs-reflink-32k-4ks  generic/750  0% (soak)     76120       0                OK            6.16.0-gbc97f3a7cc8f  OK
    pw2-xfs-reflink-64k-4ks  generic/750  0% (soak)     76128       0                OK            6.16.0-gbc97f3a7cc8f  OK

    Journal-method          Soak-duration(s)
    systemd-journal-remote  432000

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <[email protected]>
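
The test-name extraction described in the commit message (take the last argument of the `check -s` command line and accept it if it contains a `/`) can be sketched as a small pure function. This is a minimal illustration, not the code from the commit: the function name `extract_test_name` and the section name used in the usage note are hypothetical. Unlike the patch, it examines each `ps` output line separately, so several matching processes are not merged into one token list.

```python
def extract_test_name(ps_output):
    """Return the test name from `ps` output lines, or None if none is found.

    Assumes the command-line format quoted in the patch comment:
        bash ./check -s section -R xunit test_name
    """
    for line in ps_output.splitlines():
        parts = line.split()
        # Skip empty lines and lines that do not mention the check script.
        if not parts or "check" not in line:
            continue
        candidate = parts[-1]
        # A test name looks like "<group>/<number>", e.g. generic/750.
        if "/" in candidate:
            return candidate
    return None
```

For example, a `ps aux` line ending in `bash ./check -s xfs_reflink_64k_4ks -R xunit generic/750` yields `generic/750`, while output with no `check` process yields `None`.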
1 parent de18d27 commit 92a12cd

1 file changed (+73, -0)

scripts/workflows/fstests/fstests_watchdog.py

Lines changed: 73 additions & 0 deletions
@@ -53,6 +53,79 @@ def print_fstest_host_status(host, verbose, use_remote, use_ssh, basedir, config
         )
     )
 
+    # If we couldn't get test info from journal/dmesg, try to get it from running processes
+    if last_test is None and not stall_suspect:
+        try:
+            # Check if fstests is actually running by looking at processes
+            import subprocess
+
+            result = subprocess.run(
+                [
+                    "ssh",
+                    "-o",
+                    "ConnectTimeout=5",
+                    "-o",
+                    "StrictHostKeyChecking=no",
+                    host,
+                    "ps aux | grep 'check -s' | grep -v grep",
+                ],
+                capture_output=True,
+                text=True,
+                timeout=10,
+            )
+            if result.returncode == 0 and result.stdout.strip():
+                # Extract test name from process command line
+                # Format: bash ./check -s section -R xunit test_name
+                process_line = result.stdout.strip()
+                parts = process_line.split()
+                if "check" in process_line and len(parts) > 0:
+                    # Find the test name - it's usually the last argument
+                    test_name = parts[-1]
+                    if "/" in test_name:  # Looks like a test name (e.g., generic/750)
+                        last_test = test_name
+                        # We don't have the start time, but we know it's running
+                        last_test_time = "Unknown (logs rotated)"
+                        current_time_str = "N/A"
+                        # Use SSH to get how long the process has been running
+                        pid_result = subprocess.run(
+                            [
+                                "ssh",
+                                "-o",
+                                "ConnectTimeout=5",
+                                "-o",
+                                "StrictHostKeyChecking=no",
+                                host,
+                                "ps aux | grep 'check -s' | grep -v grep | awk '{print $2}'",
+                            ],
+                            capture_output=True,
+                            text=True,
+                            timeout=10,
+                        )
+                        if pid_result.returncode == 0 and pid_result.stdout.strip():
+                            pid = pid_result.stdout.strip().split()[0]
+                            # Get process runtime in seconds
+                            runtime_result = subprocess.run(
+                                [
+                                    "ssh",
+                                    "-o",
+                                    "ConnectTimeout=5",
+                                    "-o",
+                                    "StrictHostKeyChecking=no",
+                                    host,
+                                    f"ps -o etimes= -p {pid}",
+                                ],
+                                capture_output=True,
+                                text=True,
+                                timeout=10,
+                            )
+                            if (
+                                runtime_result.returncode == 0
+                                and runtime_result.stdout.strip()
+                            ):
+                                delta_seconds = int(runtime_result.stdout.strip())
+        except:
+            pass  # If SSH fails, keep the None values
+
     checktime = fstests.get_checktime(host, basedir, kernel, section, last_test)
     percent_done = (delta_seconds * 100 / checktime) if checktime > 0 else 0

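
The patch makes up to three SSH round trips (find the process, find its PID, read its elapsed time). As a hedged alternative, the same information could be gathered in one remote `ps` call that prints elapsed seconds and the full command line for every process, with the filtering done locally. The sketch below is an assumption-laden illustration, not part of the commit: the names `parse_etimes_args` and `detect_running_test` are hypothetical, and it relies on the procps `etimes` and `args` format specifiers. It also catches only the exceptions it expects rather than using a bare `except:`. As in the patch, the reported time is the elapsed time of the `check` process.

```python
import subprocess


def parse_etimes_args(ps_output):
    """Parse `ps -e -o etimes= -o args=` output.

    Returns (test_name, elapsed_seconds) for the first `check -s` process
    whose last argument looks like a test name, or None if there is none.
    """
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        elapsed, args = int(fields[0]), fields[1:]
        # Match "check -s" as two adjacent arguments ("check" may be "./check").
        has_check = any(
            a.endswith("check") and b == "-s" for a, b in zip(args, args[1:])
        )
        if has_check and "/" in args[-1]:
            return args[-1], elapsed
    return None


def detect_running_test(host, timeout=10):
    """Query `host` over SSH once; return (test_name, seconds) or None."""
    cmd = [
        "ssh",
        "-o", "ConnectTimeout=5",
        "-o", "StrictHostKeyChecking=no",
        host,
        "ps -e -o etimes= -o args=",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        return None  # SSH unavailable or too slow: leave values unset
    if result.returncode != 0:
        return None
    return parse_etimes_args(result.stdout)
```

Keeping the parsing in a separate function means it can be checked against captured `ps` output without a live host, and no remote `grep` is needed, so there is no `grep -v grep` step to get wrong.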