Stamp last_active in streaming agent loop to prevent heartbeat false-positives #1090

Open
chrisyoung2005 wants to merge 1 commit into RightNow-AI:main from chrisyoung2005:fix/streaming-heartbeat-touch

Conversation

@chrisyoung2005
Contributor

Fixes #1089

What

run_agent_loop_streaming was missing the touch_agent() call that the non-streaming run_agent_loop performs before every LLM request. This causes last_active to go stale during slow streaming generations, and the heartbeat monitor flags the agent as unresponsive mid-stream — triggering a crash-recovery cycle.

Why

With a slow local backend (e.g. Ollama qwen3.5:35b generating for minutes, especially under contention from multiple agents sharing one Ollama instance), the agent appears "frozen" to the user. Under the hood, the kernel has already marked it unresponsive, killed the loop, and is restarting it — which re-queues the request, making the problem worse when multiple agents pile on.

The non-streaming path handles this correctly at crates/openfang-runtime/src/agent_loop.rs:446-449:

// Stamp last_active before the (potentially long) LLM call so the
// heartbeat monitor doesn't flag us as unresponsive mid-iteration.
if let Some(k) = &kernel {
    k.touch_agent(&agent_id_str);
}

The streaming path did not have the equivalent, so last_active was only updated between iterations (after streaming finished), not before the long-running call.
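To make the timing concrete, here is a minimal sketch of how a staleness check of this kind can misfire. It does not reproduce the kernel's actual heartbeat code: `Registry`, `is_unresponsive`, the simulated clock in whole seconds, and all the numbers are assumptions chosen for illustration. Only the name `touch_agent` comes from the code above, and its real signature takes no timestamp.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the kernel's per-agent activity registry.
/// Times are seconds on a simulated clock, not real wall-clock time.
struct Registry {
    last_active: HashMap<String, u64>,
}

impl Registry {
    fn new() -> Self {
        Self { last_active: HashMap::new() }
    }

    /// Illustrative analogue of touch_agent: record `now` as the last activity time.
    fn touch_agent(&mut self, agent_id: &str, now: u64) {
        self.last_active.insert(agent_id.to_string(), now);
    }

    /// Assumed heartbeat rule: an agent is unresponsive once its stamp is
    /// older than `timeout_secs`. Unknown agents are never flagged.
    fn is_unresponsive(&self, agent_id: &str, now: u64, timeout_secs: u64) -> bool {
        match self.last_active.get(agent_id) {
            Some(&stamp) => now.saturating_sub(stamp) > timeout_secs,
            None => false,
        }
    }
}

fn main() {
    let timeout_secs = 180; // illustrative value only
    let mut reg = Registry::new();

    // Stamp written between iterations, at t = 0.
    reg.touch_agent("agent-1", 0);

    // The LLM request starts at t = 30 and is still streaming when the
    // monitor checks at t = 190. Without a pre-call stamp the age is 190 s.
    println!("without pre-call touch: {}", reg.is_unresponsive("agent-1", 190, timeout_secs));

    // With a stamp taken right before the request (t = 30) the age is 160 s.
    reg.touch_agent("agent-1", 30);
    println!("with pre-call touch: {}", reg.is_unresponsive("agent-1", 190, timeout_secs));
}
```

In this sketch the generation itself (160 s at the time of the check) fits inside the window, but the time that elapsed before the request started pushes the unstamped case over the limit.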

Fix

Mirror the non-streaming behavior in run_agent_loop_streaming, immediately before stream_with_retry:

+        // Stamp last_active before the (potentially long) LLM call so the
+        // heartbeat monitor doesn't flag us as unresponsive mid-iteration.
+        if let Some(k) = &kernel {
+            k.touch_agent(&agent_id_str);
+        }
+
         // Stream LLM call with retry, error classification, and circuit breaker
         let provider_name = manifest.model.provider.as_str();
         let mut response = stream_with_retry(

This is a minimal 6-line change with no behavior change for agents whose LLM calls already fit inside the heartbeat window. agent_id_str is already in scope at this point in the function.
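The intended invariant (one stamp per iteration, taken before the LLM request, and skipped cleanly when no kernel handle is present) can be checked in isolation. The following is a rough sketch under stated assumptions, not the real loop: the `KernelHandle` trait, `CountingKernel`, and `run_loop_sketch` are hypothetical names, and the streaming call is replaced by a counter.

```rust
use std::cell::Cell;

/// Hypothetical trait standing in for the kernel handle used by the loop.
trait KernelHandle {
    fn touch_agent(&self, agent_id: &str);
}

/// Test double that counts how many times touch_agent is called.
struct CountingKernel {
    touches: Cell<u32>,
}

impl KernelHandle for CountingKernel {
    fn touch_agent(&self, _agent_id: &str) {
        self.touches.set(self.touches.get() + 1);
    }
}

/// Simplified loop shape: stamp first, then make the (stubbed) LLM call.
/// Returns the number of stubbed LLM calls performed.
fn run_loop_sketch(kernel: Option<&dyn KernelHandle>, agent_id: &str, iterations: u32) -> u32 {
    let mut llm_calls = 0;
    for _ in 0..iterations {
        // Same pattern as the patch: only stamp when a kernel handle exists.
        if let Some(k) = kernel {
            k.touch_agent(agent_id);
        }
        // Placeholder for the streaming LLM request.
        llm_calls += 1;
    }
    llm_calls
}

fn main() {
    let kernel = CountingKernel { touches: Cell::new(0) };
    let handle: &dyn KernelHandle = &kernel;

    let calls = run_loop_sketch(Some(handle), "agent-1", 3);
    println!("llm calls: {}, touches: {}", calls, kernel.touches.get());

    // With no kernel handle the loop still runs and nothing is stamped.
    let calls_without_kernel = run_loop_sketch(None, "agent-1", 2);
    println!("llm calls without kernel: {}", calls_without_kernel);
}
```

A counting double like this is one way to assert the stamp-before-call ordering in a unit test without a live backend. Whether the real kernel type can be mocked this way is an assumption.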

Verification

  • cargo fmt -p openfang-runtime -- --check — clean
  • cargo clippy -p openfang-runtime --all-targets -- -D warnings — clean
  • cargo test -p openfang-runtime — 929 passed, 0 failed

Repro path on local Ollama: with heartbeat.default_timeout_secs below actual per-iteration generation time, streaming agents get killed mid-response and re-spawned in a loop. With this patch applied, the same config runs to completion.

🤖 Generated with Claude Code

…positives

Fixes RightNow-AI#1089

run_agent_loop_streaming skipped the touch_agent() call that the
non-streaming run_agent_loop performs before every LLM request. On slow
local inference (e.g. Ollama qwen3.5:35b, multi-minute generations),
last_active went stale and the heartbeat monitor flagged the agent as
unresponsive, triggering crash recovery mid-stream. With multiple agents
sharing one Ollama instance, queued agents appeared frozen while the
active one generated.

Mirror the non-streaming behavior: stamp last_active immediately before
stream_with_retry so the heartbeat window covers the full LLM call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@chrisyoung2005
Contributor Author

Note on failing checks: Format, Clippy, and Security Audit are pre-existing failures on main at the bump v0.6.0 commit — see run 24637821654 on main @ e6bab99, same three checks red.

None of the diffs flagged by cargo fmt --check or cargo clippy are in files this PR touches. Locally against this branch:

  • cargo fmt -p openfang-runtime -- --check — clean
  • cargo clippy -p openfang-runtime --all-targets -- -D warnings — clean
  • cargo test -p openfang-runtime — 929 passed

The Test and Check matrices (ubuntu/macos/windows) all pass here.


Development

Successfully merging this pull request may close these issues.

Streaming agent loop missing touch_agent → heartbeat false-positives on selected local LLMs
