Skip to content

test: match OpenClaw argv in crash-loop PID probe#4049

Open
jyaunches wants to merge 2 commits into
NVIDIA:mainfrom
jyaunches:fix/2478-gateway-pid-v2
Open

test: match OpenClaw argv in crash-loop PID probe#4049
jyaunches wants to merge 2 commits into
NVIDIA:mainfrom
jyaunches:fix/2478-gateway-pid-v2

Conversation

@jyaunches
Copy link
Copy Markdown
Contributor

@jyaunches jyaunches commented May 22, 2026

Summary

Why

PR #4045 fixed explicit openclaw gateway / openclaw-gateway matching and stale diagnostics, but the rerun still showed the gateway as a plain openclaw argv process while ps comm can be node on hosted runners. The log proves the gateway is healthy (gateway ready, Phase: Ready, healthy inference), so this remains a PID matcher false negative.

Validation

  • bash -n test/e2e/test-issue-2478-crash-loop-recovery.sh
  • git diff --check origin/main..HEAD
  • synthetic checks for explicit gateway argv, retitled gateway, plain comm fallback, plain argv fallback, and no fallback before gateway ready

Summary by CodeRabbit

  • Tests
    • Improved crash-loop recovery test robustness by enhancing process detection and selection to better handle inconsistent process metadata and identify the appropriate running process once readiness is signaled.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 22, 2026

📝 Walkthrough

Walkthrough

Update to the E2E test's gateway_pid() fallback: expanded inline documentation and replacement of a ps/awk comm-based PID filter with pgrep -fo '[o]penclaw' to select the oldest matching openclaw process after readiness is observed.

Changes

Process ID Detection Improvement

Layer / File(s) Summary
gateway_pid() fallback robustness
test/e2e/test-issue-2478-crash-loop-recovery.sh
Expanded inline documentation clarifying the fallback finds the oldest live openclaw process after gateway.log indicates readiness and that ps comm/argv fields vary across builds. Replaced the previous ps/awk predicate that matched comm == "openclaw" with pgrep -fo '[o]penclaw' to obtain the oldest matching PID.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4045: Both PRs enhance the same gateway_pid() fallback logic to reliably detect the oldest openclaw gateway process when process title representation varies across environments.

Suggested labels

Integration: OpenClaw

Suggested reviewers

  • cv

Poem

🐰 In logs the gateway whispers, then sleeps no more,
I sniff for openclaw footsteps on the floor.
From comm or args I hop and I peep,
Oldest PID found — no mystery to keep.
Hooray, the test wakes up and dances once more! 🥕✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: updating the crash-loop PID probe to match OpenClaw in argv, which is the core modification to the gateway_pid() fallback logic.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26265650672
Target ref: 2a741450b97d7298de6debd4a0bb20b625d72bdd
Workflow ref: main
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
issue-2478-crash-loop-recovery-e2e ❌ failure

Failed jobs: issue-2478-crash-loop-recovery-e2e. Check run artifacts for logs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/test-issue-2478-crash-loop-recovery.sh`:
- Around line 121-122: The fallback pgrep call (pid="$(pgrep -fo '[o]penclaw'
... )") can return the launcher PID; change it to explicitly exclude the
launcher process when selecting a plain openclaw PID — for example, use pgrep
-af '[o]penclaw' (or equivalent) and filter out commands/args that contain
"launcher" before extracting the PID so the selected PID corresponds to a
non-launcher gateway process; keep the readiness grep against /tmp/gateway.log
and assign the filtered PID to the pid variable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a2548f6f-7b52-4779-8db8-679bcf4d5838

📥 Commits

Reviewing files that changed from the base of the PR and between 2a74145 and e2306cd.

📒 Files selected for processing (1)
  • test/e2e/test-issue-2478-crash-loop-recovery.sh

Comment on lines 121 to +122
if [ -z "$pid" ] && grep -Eq "\[gateway\] (ready|http server listening)" /tmp/gateway.log 2>/dev/null; then
pid="$(ps -eo pid=,comm=,args= 2>/dev/null | awk '$2 == "openclaw" { print $1 }' | sort -n | head -n 1)"
pid="$(pgrep -fo '[o]penclaw' 2>/dev/null || true)"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't select the oldest plain openclaw process here.

This fallback can return the launcher instead of the gateway: the file already assumes both can coexist, and the launcher is older. Because the readiness grep is against the whole gateway.log, that branch can also stay enabled after a later gateway crash and still hand back a stale non-gateway PID. That breaks the subsequent kill/respawn checks.

Suggested fix
-if [ -z "$pid" ] && grep -Eq "\[gateway\] (ready|http server listening)" /tmp/gateway.log 2>/dev/null; then
-  pid="$(pgrep -fo '[o]penclaw' 2>/dev/null || true)"
+if [ -z "$pid" ] && grep -Eq "\[gateway\] (ready|http server listening)" /tmp/gateway.log 2>/dev/null; then
+  pid="$(pgrep -fn '[o]penclaw' 2>/dev/null || true)"
 fi

If plain openclaw can match more than just launcher+gateway in this environment, I'd explicitly exclude the launcher instead of relying only on age.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [ -z "$pid" ] && grep -Eq "\[gateway\] (ready|http server listening)" /tmp/gateway.log 2>/dev/null; then
pid="$(ps -eo pid=,comm=,args= 2>/dev/null | awk '$2 == "openclaw" { print $1 }' | sort -n | head -n 1)"
pid="$(pgrep -fo '[o]penclaw' 2>/dev/null || true)"
if [ -z "$pid" ] && grep -Eq "\[gateway\] (ready|http server listening)" /tmp/gateway.log 2>/dev/null; then
pid="$(pgrep -fn '[o]penclaw' 2>/dev/null || true)"
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-issue-2478-crash-loop-recovery.sh` around lines 121 - 122, The
fallback pgrep call (pid="$(pgrep -fo '[o]penclaw' ... )") can return the
launcher PID; change it to explicitly exclude the launcher process when
selecting a plain openclaw PID — for example, use pgrep -af '[o]penclaw' (or
equivalent) and filter out commands/args that contain "launcher" before
extracting the PID so the selected PID corresponds to a non-launcher gateway
process; keep the readiness grep against /tmp/gateway.log and assign the
filtered PID to the pid variable.

@wscurran wscurran added E2E End-to-end testing — Brev infrastructure, test cases, nightly failures, and coverage gaps fix Integration: OpenClaw Support for OpenClaw labels May 22, 2026
@wscurran
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

E2E End-to-end testing — Brev infrastructure, test cases, nightly failures, and coverage gaps fix Integration: OpenClaw Support for OpenClaw

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants