Skip to content

fix(matrix): @mention detection, sync reliability, and duplicate code cleanup#1057

Open
eldelosdatos wants to merge 1 commit intoRightNow-AI:mainfrom
centinela-io:fix/matrix-mention-and-sync-reliability
Open

fix(matrix): @mention detection, sync reliability, and duplicate code cleanup#1057
eldelosdatos wants to merge 1 commit intoRightNow-AI:mainfrom
centinela-io:fix/matrix-mention-and-sync-reliability

Conversation

@eldelosdatos
Copy link
Copy Markdown

Summary

  • Fix @mention detection for Element/Matrix clients — Element sends mentions as HTML pills in formatted_body and via m.mentions.user_ids[], not in plain text body. The adapter only checked body, making group_policy=mention_only (default) silently drop all messages in group rooms.
  • Add HTTP client timeout to prevent sync hangsreqwest::Client::new() has no timeout. If the homeserver drops TCP silently, /sync hangs forever, killing message reception permanently.
  • Remove duplicate mention/DM detection codeFIX not able to run openfang in ubuntu 22.04 #2 and FIX Invalid request for kimi #3 blocks were duplicated, overwriting metadata on the second pass.
  • Add heartbeat_interval_secs=300 to assistant agent — Default 30s creates 60s timeout, causing idle agents to crash-loop every 90s.
  • Add comprehensive observability logging — Structured logs at every decision point in sync loop, bridge dispatch, and agent routing.
  • Add 9 integration tests with wiremock for Matrix adapter.

Problem Details

@mention detection (critical)

// Before: only checks plain text body
if content.contains(&user_id) { ... }

// After: checks body + HTML + m.mentions spec
let mentioned_in_body = content.contains(&user_id);
let mentioned_in_html = event["content"]["formatted_body"]
    .as_str().map(|html| html.contains(&user_id)).unwrap_or(false);
let mentioned_in_m_mentions = event["content"]["m.mentions"]["user_ids"]
    .as_array().map(|ids| ids.iter().any(|id| id.as_str() == Some(&user_id)))
    .unwrap_or(false);

Element sends:

  • body: hola (no MXID)
  • formatted_body: <a href="https://matrix.to/#/@bot:server">Bot</a> hola
  • m.mentions.user_ids: ["@bot:server"]

HTTP timeout

// Before: no timeout — hangs forever on dropped connections
client: reqwest::Client::new(),

// After: 90s total, 30s connect
client: reqwest::Client::builder()
    .timeout(Duration::from_secs(90))
    .connect_timeout(Duration::from_secs(30))
    .build()

Duplicate code

The FIX #2 (mention detection) and FIX #3 (DM detection) blocks were copy-pasted twice in the sync loop. The second copy overwrites metadata from the first, losing the DM detection results.

Test Plan

  • cargo check -p openfang-channels — compiles cleanly
  • cargo test -p openfang-channels — 12/12 tests pass (9 new + 3 existing)
  • Tested on Railway with Synapse homeserver + Element client
  • Verified Element @mentions detected via formatted_body and m.mentions
  • Verified sync recovery after connection drops
  • Verified group_policy=mention_only works correctly with Element pills

New Tests

Test What it validates
test_adapter_auth_and_start Auth + sync loop init
test_adapter_auth_failure 401 error handling
test_sync_receives_message End-to-end message reception
test_sync_skips_own_messages Self-message filtering
test_sync_retries_on_error Error retry with backoff
test_shutdown_stops_sync Clean shutdown via signal
test_command_parsing /command args parsing
test_dedup_events Duplicate event filtering
test_allowed_rooms_filter Room allowlist enforcement

Breaking Changes

None. All changes are backward-compatible.

🤖 Generated with Claude Code

… cleanup

## Problem

Three issues with the Matrix channel adapter:

### 1. @mention detection broken for Element clients (group_policy=mention_only unusable)
Element sends @mentions as HTML pills in `formatted_body` and via `m.mentions.user_ids[]`,
but NOT in the plain text `body`. The adapter only checked `body`, so all Element @mentions
were missed. This made `group_policy=mention_only` (the default) effectively ignore all
messages in group rooms when using Element.

### 2. HTTP client has no timeout (sync can hang forever)
`reqwest::Client::new()` creates a client with no timeout. If the homeserver drops the TCP
connection without sending RST/FIN (common in containerized deployments behind proxies),
the sync request hangs forever, silently killing message reception.

### 3. Duplicate mention/DM detection code
The FIX RightNow-AI#2 and FIX RightNow-AI#3 blocks (mention detection + DM detection) were duplicated in the
sync loop, causing `metadata` to be overwritten by the second pass.

## Fix

### Mention detection (matrix.rs)
Now checks three sources for @mentions:
1. `content.body` — plain text (CLI/API clients)
2. `content.formatted_body` — HTML pills (Element, FluffyChat, etc.)
3. `content.m.mentions.user_ids[]` — Matrix spec MSC3952 standard

### HTTP client timeout (matrix.rs)
Added `timeout(90s)` and `connect_timeout(30s)` to the reqwest Client builder.
The Matrix /sync uses `timeout=30000ms` (30s long-poll), so 90s gives 60s margin
for network latency while ensuring hung connections are detected and retried.

### Duplicate code removal (matrix.rs)
Removed the duplicated FIX RightNow-AI#2 + FIX RightNow-AI#3 blocks that overwrote `metadata`.

### Heartbeat (agents/assistant/agent.toml)
Added `heartbeat_interval_secs = 300` to the assistant agent. The default (30s)
creates a timeout of 60s, causing idle agents to enter a crash-recovery loop
every 90 seconds. With 300s the timeout is 600s (10 min), eliminating false
positives for idle agents.

### Observability (matrix.rs + bridge.rs)
Added structured logging at every decision point:
- Sync loop: iteration counter, shutdown channel state, error details
- Bridge loop: message receipt, iteration counter, shutdown handling
- Dispatch: policy check results, agent routing, RBAC, send result

### Tests (matrix.rs)
Added 9 integration tests using wiremock to validate:
- Auth success/failure
- Message reception and dispatch
- Own-message filtering
- Error retry with backoff
- Clean shutdown
- Command parsing
- Event deduplication
- Room allowlist filtering

## Breaking Changes
None. All changes are backward-compatible.

## Test Plan
- [x] `cargo check -p openfang-channels` passes
- [x] `cargo test -p openfang-channels` — 12/12 tests pass
- [x] Tested on Railway with Synapse homeserver
- [x] Verified Element @mentions detected via formatted_body
- [x] Verified sync recovery after connection drops
@jaberjaber23
Copy link
Copy Markdown
Member

Thanks for the matrix reliability work @eldelosdatos — the sync retry + @mention detection improvements look solid.

Ship-blocker in the current head: agents/assistant/agent.toml around line 144 has two keys concatenated on one line without a newline separator:

toml [autonomous] heartbeat_interval_secs = 300max_iterations = 100

This is invalid TOML and will break every deployment that uses the bundled assistant agent. Please change it to:

toml [autonomous] heartbeat_interval_secs = 300 max_iterations = 100

Rebase on latest main (post-#1041) at the same time so CI runs. Happy to merge once that's fixed and green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Invalid request for kimi not able to run openfang in ubuntu 22.04

2 participants