feat(daemon): add monitoring layer to the sync state machine#369

Open
andreabadesso wants to merge 12 commits into master from feat/daemon-monitoring-layer

Conversation

@andreabadesso andreabadesso commented Feb 23, 2026

Summary

  • Adds a MonitoringActor that centralises runtime health checks for the sync state machine.
  • Four new optional env vars: IDLE_EVENT_TIMEOUT_MS, STUCK_PROCESSING_TIMEOUT_MS, RECONNECTION_STORM_THRESHOLD, RECONNECTION_STORM_WINDOW_MS.
| Condition | Response | Default threshold |
| --- | --- | --- |
| No event received while WS connected | MAJOR alert + process.exit(1) (Kubernetes restarts) | 5 min |
| Stuck in any processing state | MAJOR alert only — machine keeps running | 1 hour |
| ≥ threshold reconnects within window | MAJOR alert | 10 reconnects / 5 min |

The stuck-processing check intentionally does not force a reconnection — a long-running handler (e.g. a large reorg) should be allowed to finish. Operators are alerted so they can investigate if needed.
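
As a sketch of how the optional env vars could map to the defaults in the table above (the envNumber helper is illustrative, not the daemon's actual parsing code; the variable names are the ones introduced by this PR):

```typescript
// Illustrative helper: resolve an optional env var to a positive number,
// falling back to the default when the variable is unset or invalid.
function envNumber(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw === undefined || raw === '') return fallback;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

// Defaults matching the thresholds in the table above.
const monitoringConfig = {
  IDLE_EVENT_TIMEOUT_MS: envNumber('IDLE_EVENT_TIMEOUT_MS', 5 * 60 * 1000),           // 5 min
  STUCK_PROCESSING_TIMEOUT_MS: envNumber('STUCK_PROCESSING_TIMEOUT_MS', 60 * 60 * 1000), // 1 hour
  RECONNECTION_STORM_THRESHOLD: envNumber('RECONNECTION_STORM_THRESHOLD', 10),
  RECONNECTION_STORM_WINDOW_MS: envNumber('RECONNECTION_STORM_WINDOW_MS', 5 * 60 * 1000),
};
```

Because every variable falls back to a default, the daemon can start with none of them set, which is the behavior the acceptance criteria below require.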

Acceptance criteria

  • Idle timeout terminates the process; a MAJOR alert is sent before exit.
  • Stuck-processing fires a MAJOR alert after 1 hour; machine continues running uninterrupted.
  • Reconnection storm alert fires at threshold; stale entries outside the window are evicted.
  • All four env vars are optional — daemon starts correctly when unset.
  • All MonitoringActor unit tests pass; existing tests unaffected.

Summary by CodeRabbit

  • New Features

    • Runtime health monitoring for the sync process: idle detection, stuck-processing alerts, and reconnection-storm detection; sync lifecycle now emits monitoring signals.
    • New configurable monitoring settings (timeouts, thresholds, windows) with sensible defaults.
  • Tests

    • Comprehensive tests covering idle/stuck/reconnection behaviors, timer lifecycles, alerting, cooldowns, and logging.
    • Test setup updated with monitoring config values and mocks to prevent external side effects.

@andreabadesso andreabadesso force-pushed the feat/daemon-monitoring-layer branch from 13d34c8 to 2069e3e on February 24, 2026 16:06
@andreabadesso andreabadesso self-assigned this Feb 24, 2026
@andreabadesso andreabadesso added the enhancement New feature or request label Feb 24, 2026
@andreabadesso andreabadesso moved this to In Progress (Done) in Hathor Network Feb 24, 2026
@andreabadesso andreabadesso force-pushed the feat/daemon-monitoring-layer branch from 9ce6013 to 289eed9 on February 25, 2026 13:46
Comment on lines +205 to +207
if (!context.monitoring) {
throw new Error('No monitoring actor in context');
}
Member

issue (blocking): Throwing in getMonitoringRefFromContext can crash the machine before monitoring is spawned

Do we know the effects of the raised exception on the machine?

Also, is it possible to do something like "no monitoring -> start monitor -> resume machine"?

Collaborator Author

I refactored it to use choose and fail silently instead of throwing

This is mostly defensive; there is no case where monitoring will be null here.

I don't think we should build (at least in this PR) a more elaborate mechanism that pauses the machine and resumes the monitor when it's null.
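
The fail-silently guard can be pictured with a plain-TypeScript stand-in (sendIfPresent is a hypothetical name for illustration; the actual code uses XState's choose with a presence condition):

```typescript
// Hypothetical stand-in for the choose-guarded action: silently no-op when
// the monitoring actor ref is absent instead of throwing.
interface ActorRefLike {
  send(event: unknown): void;
}

function sendIfPresent(ref: ActorRefLike | null | undefined, event: unknown): boolean {
  if (!ref) return false; // defensive: monitoring should always be spawned by now
  ref.send(event);
  return true;
}
```
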

Severity.MAJOR,
{ idleMs: String(idleMs) },
logger,
).finally(() => process.exit(1));
Member

issue (blocking): Idle detector terminates the process from inside the actor

A hard shutdown can make the root cause very hard to determine. Can we instead send an event like MONITORING_IDLE_TIMEOUT that logs, adds an alert, and then proceeds with the normal daemon shutdown?

Collaborator Author

I like the idea, done in 597a0da

Moved the process.exit(1) to the index.js

Basically what happens now is that the "final" ERROR state is reached; it sends the alert through SQS and logs, and when that resolves the machine.onDone() callback is called, terminating the process.

logger,
).finally(() => process.exit(1));
}
}, idleTimeoutMs);
Member

suggestion: An idle-detection interval equal to the desired idle timeout may be too lenient.

If the interval is the same as the timeout, we may wait up to 2× the timeout before triggering. To mitigate this, the setInterval period could be idleTimeoutMs/2, so we hang for at most 1.5× the timeout.

Collaborator Author

Done in 597a0da
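
The worst case can be checked with a small model (illustrative code, not the daemon's): idleness starts just after a check tick, and detection happens at the first later tick whose elapsed idle time exceeds the timeout.

```typescript
// Model of a polling idle check: ticks at checkIntervalMs, 2*checkIntervalMs, ...
// Idleness begins an instant after t = 0, so a tick detects it only once the
// elapsed idle time strictly exceeds timeoutMs.
function worstCaseDetectionMs(timeoutMs: number, checkIntervalMs: number): number {
  let t = checkIntervalMs;
  while (t <= timeoutMs) {
    t += checkIntervalMs;
  }
  return t;
}

const timeout = 5 * 60 * 1000;
worstCaseDetectionMs(timeout, timeout);     // 2x the timeout when interval === timeout
worstCaseDetectionMs(timeout, timeout / 2); // 1.5x the timeout with a halved interval
```
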

* stalled without triggering a WebSocket disconnect.
*
* 2. Stuck-processing detection — PROCESSING_STARTED begins a one-shot timer;
* PROCESSING_COMPLETED cancels it. If the timer fires the actor fires a
Member

question: is it the timer or the actor that will fire a CRITICAL alert?

// ── Stuck-processing detection ───────────────────────────────────────────────
let stuckTimer: ReturnType<typeof setTimeout> | null = null;

const startStuckTimer = () => {
Member

issue (blocking): Stuck-processing logic does not implement the actor documentation comment because it does not notify the machine

The header comment says it “fires a CRITICAL alert and sends MONITORING_STUCK_PROCESSING back to the machine”, but the implementation only sends an alert (Severity.MAJOR).

Collaborator Author

Updated the comment, thanks

},
logger,
).catch((err: Error) =>
logger.error(`[monitoring] Failed to send reconnection storm alert: ${err}`),
Member

question: Will the err string interpolation use err.message or [object Object]?

Collaborator Author

It would call Error.prototype.toString() which would produce Error: <message>, but I've changed it to use err.message directly
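
This is easy to confirm in plain Node/TypeScript (no project code involved):

```typescript
const err = new Error('connection refused');

// Template-literal interpolation invokes Error.prototype.toString(),
// which yields "Error: <message>" rather than "[object Object]".
const viaToString = `alert failed: ${err}`;        // "alert failed: Error: connection refused"

// Interpolating the message directly drops the "Error: " prefix.
const viaMessage = `alert failed: ${err.message}`; // "alert failed: connection refused"
```
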

const windowStart = now - stormWindowMs;
reconnectionTimestamps = reconnectionTimestamps.filter(t => t >= windowStart);

if (reconnectionTimestamps.length >= stormThreshold) {
Member

issue (blocking): Reconnection storm will spam the alerting system.

If we enter a reconnection storm, each new "RECONNECTING" event will trigger a new alert, spamming the alerting system.

We could track a stormAlertLastFiredAt timestamp and only fire a new alert if we haven't fired one within the last minute.

Collaborator Author

Nice suggestion, thanks

Done
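
One way to sketch the sliding window plus cooldown (the StormTracker class and its injected clock are mine, for illustration; the PR's actual fix lives inside the actor):

```typescript
// Illustrative tracker: counts reconnects inside a sliding window and
// rate-limits alerts with a cooldown, so a sustained storm fires at most
// one alert per cooldown period.
class StormTracker {
  private timestamps: number[] = [];
  private lastAlertAt = Number.NEGATIVE_INFINITY;

  constructor(
    private readonly threshold: number,
    private readonly windowMs: number,
    private readonly cooldownMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  /** Records one reconnect; returns true when an alert should fire. */
  recordReconnect(): boolean {
    const t = this.now();
    this.timestamps.push(t);
    // Evict entries that fell outside the window.
    this.timestamps = this.timestamps.filter(ts => ts >= t - this.windowMs);
    if (this.timestamps.length < this.threshold) return false;
    if (t - this.lastAlertAt < this.cooldownMs) return false; // still cooling down
    this.lastAlertAt = t;
    return true;
  }
}
```

Injecting the clock keeps the logic trivially unit-testable without fake timers.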

const { ACK_TIMEOUT_MS } = getConfig();
return ACK_TIMEOUT_MS;
};

Member

nitpick: clean empty line

Collaborator Author

Done

Comment on lines +145 to +146
'sendMonitoringReconnecting',
'sendMonitoringDisconnected',
Member

question: Are we sure these 2 actions should be triggered on the same state?

Collaborator Author

Yes, it's disconnected and reconnecting

target: `#${SYNC_MACHINE_STATES.ERROR}`,
}, {
- actions: ['storeEvent', 'sendAck'],
+ actions: ['storeEvent', 'sendAck', 'sendMonitoringEventReceived'],
Member

question: Do we have a better way to trigger sendMonitoringEventReceived on required transitions?

Marking every transition may become an issue: if we add more events in the future, we could forget to add this action.

Collaborator Author

Unfortunately not! In XState v4, there is no clean mechanism to intercept all transitions through a particular state without adding the action explicitly per-transition

@r4mmer r4mmer moved this from In Progress (Done) to In Review (WIP) in Hathor Network Feb 25, 2026

coderabbitai bot commented Feb 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7044b7a and b020cae.

📒 Files selected for processing (2)
  • packages/daemon/__tests__/actors/MonitoringActor.test.ts
  • packages/daemon/src/actions/index.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/daemon/__tests__/actors/MonitoringActor.test.ts

📝 Walkthrough

Adds a MonitoringActor and integrates monitoring events into the SyncMachine: idle detection, stuck-processing detection, and reconnection-storm detection with alert emission, config, types, actions, tests, Jest setup, and a final-state process exit handler.

Changes

  • Monitoring Actor Core (packages/daemon/src/actors/MonitoringActor.ts, packages/daemon/src/actors/index.ts): New MonitoringActor implementing idle timers, stuck-processing timers, reconnection-storm tracking, alert emission, logging, and cleanup; re-exported in the actors index.
  • Configuration (packages/daemon/src/config.ts): Adds exported config keys IDLE_EVENT_TIMEOUT_MS, STUCK_PROCESSING_TIMEOUT_MS, RECONNECTION_STORM_THRESHOLD, and RECONNECTION_STORM_WINDOW_MS with defaults, included in the default export.
  • Monitoring Actions (packages/daemon/src/actions/index.ts): Adds a helper to obtain the monitoring actor ref, a presence predicate, and six new exported actions that send MONITORING_EVENT variants via sendTo/choose.
  • Types & Context (packages/daemon/src/types/event.ts, packages/daemon/src/types/machine.ts): Introduces the MonitoringEvent type, adds MONITORING_EVENT and MONITORING_IDLE_TIMEOUT to EventTypes, extends the Event union, and adds `monitoring: ActorRef<any, any>` to the machine context.
  • SyncMachine Integration (packages/daemon/src/machines/SyncMachine.ts, packages/daemon/src/index.ts): Spawns MonitoringActor during INITIALIZING, wires sendMonitoring... actions across state entries/exits (CONNECTED, processing states, ERROR, RECONNECTING), adds the MONITORING_IDLE_TIMEOUT transition, and registers onDone to exit the process on the final state.
  • Services Typing (packages/daemon/src/services/index.ts): Narrows the metadataDiff event parameter to the FULLNODE_EVENT variant for stronger typing (no runtime behavior change).
  • Tests & Test Config (packages/daemon/__tests__/actors/MonitoringActor.test.ts, packages/daemon/__tests__/integration/balances.test.ts, packages/daemon/__tests__/integration/token_creation.test.ts): Adds comprehensive MonitoringActor unit tests (timers, alerting, reconnection storm, logging) and updates integration test mocks to include the new monitoring config keys.
  • Jest Integration Setup (packages/daemon/jestIntegrationSetup.ts, packages/daemon/jest_integration.config.js): Adds a Jest setup that mocks @wallet-service/common's addAlert to avoid external AWS calls and registers the setup file for integration tests.

Sequence Diagram(s)

sequenceDiagram
    participant SyncMachine
    participant MonitoringActor
    participant AlertSystem as Alert System
    participant Logger

    SyncMachine->>MonitoringActor: MONITORING_EVENT (CONNECTED)
    activate MonitoringActor
    MonitoringActor->>MonitoringActor: Start idle timer (IDLE_EVENT_TIMEOUT_MS)
    
    rect rgba(100, 200, 100, 0.5)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (EVENT_RECEIVED)
        MonitoringActor->>MonitoringActor: Reset idle timer
    end
    
    rect rgba(200, 100, 100, 0.5)
        MonitoringActor->>MonitoringActor: Idle timer expires
        MonitoringActor->>AlertSystem: addAlert (MAJOR - idle)
        AlertSystem-->>MonitoringActor: Alert acknowledged
        MonitoringActor->>Logger: Log idle alert
        MonitoringActor->>SyncMachine: Emit MONITORING_IDLE_TIMEOUT
    end
    
    rect rgba(200, 150, 100, 0.5)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (PROCESSING_STARTED)
        MonitoringActor->>MonitoringActor: Start stuck-processing timer (STUCK_PROCESSING_TIMEOUT_MS)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (PROCESSING_COMPLETED)
        MonitoringActor->>MonitoringActor: Clear stuck-processing timer
    end
    
    rect rgba(150, 100, 200, 0.5)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (RECONNECTING)
        MonitoringActor->>MonitoringActor: Record reconnect timestamp
        MonitoringActor->>MonitoringActor: Check timestamps within RECONNECTION_STORM_WINDOW_MS
        alt Count >= RECONNECTION_STORM_THRESHOLD
            MonitoringActor->>AlertSystem: addAlert (MAJOR - reconnection storm)
            AlertSystem-->>MonitoringActor: Alert acknowledged
            MonitoringActor->>Logger: Log storm alert (cooldown enforced)
        end
    end

    deactivate MonitoringActor

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop, hop — timers watch the day,
Idle bells and stuck tasks play.
I count reconnects, send out a cheer,
Nudge the alerts when storms draw near.
Tiny paws, a steady ear.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(daemon): add monitoring layer to the sync state machine' accurately describes the main change—adding a comprehensive monitoring layer with a MonitoringActor for health checks including idle detection, stuck-processing detection, and reconnection storm detection across the daemon's sync state machine.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (6)
packages/daemon/__tests__/integration/balances.test.ts (1)

137-155: Per-test mocks are missing monitoring configuration keys.

The per-test getConfig.mockReturnValue calls completely override the global mock but don't include the new monitoring configuration keys (IDLE_EVENT_TIMEOUT_MS, STUCK_PROCESSING_TIMEOUT_MS, RECONNECTION_STORM_THRESHOLD, RECONNECTION_STORM_WINDOW_MS). This inconsistency could cause issues if MonitoringActor or related code expects these values to be defined.

Consider adding the monitoring config keys to per-test mocks for consistency, or extract the common config into a shared object that can be spread into each mock.

♻️ Proposed approach

Extract common config values and spread them:

const baseConfig = {
  NETWORK: 'testnet',
  SERVICE_NAME: 'daemon-test',
  CONSOLE_LEVEL: 'debug',
  TX_CACHE_SIZE: 100,
  BLOCK_REWARD_LOCK: 300,
  FULLNODE_PEER_ID: 'simulator_peer_id',
  STREAM_ID: 'simulator_stream_id',
  FULLNODE_NETWORK: 'unittests',
  USE_SSL: false,
  DB_ENDPOINT,
  DB_NAME,
  DB_USER,
  DB_PASS,
  DB_PORT,
  ACK_TIMEOUT_MS: 300000,
  IDLE_EVENT_TIMEOUT_MS: 5 * 60 * 1000,
  STUCK_PROCESSING_TIMEOUT_MS: 5 * 60 * 1000,
  RECONNECTION_STORM_THRESHOLD: 10,
  RECONNECTION_STORM_WINDOW_MS: 5 * 60 * 1000,
};

// Then in each test:
getConfig.mockReturnValue({
  ...baseConfig,
  FULLNODE_HOST: `127.0.0.1:${UNVOIDED_SCENARIO_PORT}`,
});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/__tests__/integration/balances.test.ts` around lines 137 -
155, Per-test mocks calling getConfig.mockReturnValue override the global config
but omit monitoring keys used by MonitoringActor (IDLE_EVENT_TIMEOUT_MS,
STUCK_PROCESSING_TIMEOUT_MS, RECONNECTION_STORM_THRESHOLD,
RECONNECTION_STORM_WINDOW_MS); update each per-test getConfig.mockReturnValue to
include these monitoring keys (with suitable values) or extract a shared
baseConfig object containing the common fields plus those monitoring keys and
spread that into each test-specific mock (e.g., keep FULLNODE_HOST overrides but
spread ...baseConfig) so MonitoringActor and related code always see the
required settings.
packages/daemon/src/actors/MonitoringActor.ts (2)

47-50: Harden monitoring thresholds against invalid numeric config.

Using ?? alone accepts 0, negative values, and NaN, which can create pathological timer behavior. Normalize to positive integers before arming timers.

🛡️ Proposed hardening
 export default (callback: any, receive: any, config = getConfig()) => {
   logger.info('Starting monitoring actor');

-  const idleTimeoutMs = config.IDLE_EVENT_TIMEOUT_MS ?? DEFAULT_IDLE_EVENT_TIMEOUT_MS;
-  const stuckTimeoutMs = config.STUCK_PROCESSING_TIMEOUT_MS ?? DEFAULT_STUCK_PROCESSING_TIMEOUT_MS;
-  const stormThreshold = config.RECONNECTION_STORM_THRESHOLD ?? DEFAULT_RECONNECTION_STORM_THRESHOLD;
-  const stormWindowMs = config.RECONNECTION_STORM_WINDOW_MS ?? DEFAULT_RECONNECTION_STORM_WINDOW_MS;
+  const asPositiveInt = (value: unknown, fallback: number) => {
+    const n = Number(value);
+    return Number.isFinite(n) && n > 0 ? Math.floor(n) : fallback;
+  };
+
+  const idleTimeoutMs = asPositiveInt(config.IDLE_EVENT_TIMEOUT_MS, DEFAULT_IDLE_EVENT_TIMEOUT_MS);
+  const stuckTimeoutMs = asPositiveInt(config.STUCK_PROCESSING_TIMEOUT_MS, DEFAULT_STUCK_PROCESSING_TIMEOUT_MS);
+  const stormThreshold = asPositiveInt(config.RECONNECTION_STORM_THRESHOLD, DEFAULT_RECONNECTION_STORM_THRESHOLD);
+  const stormWindowMs = asPositiveInt(config.RECONNECTION_STORM_WINDOW_MS, DEFAULT_RECONNECTION_STORM_WINDOW_MS);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/actors/MonitoringActor.ts` around lines 47 - 50, The
config values idleTimeoutMs, stuckTimeoutMs, stormThreshold, and stormWindowMs
are currently chosen with the nullish coalescing operator which still accepts 0,
negatives, and NaN; change the assignment logic to coerce each config value to a
Number (or parseInt), validate that it is a finite positive integer (for
timeouts/ window use >0, for stormThreshold use >=1), and if validation fails
fall back to the corresponding DEFAULT_* constant so timers and thresholds never
receive 0, negative, or NaN values; update the code that initializes
idleTimeoutMs, stuckTimeoutMs, stormThreshold, and stormWindowMs accordingly.

167-197: Log unknown monitoring sub-events explicitly.

Unexpected MONITORING_EVENT payload types are currently ignored silently; adding a default warning improves diagnosability.

🧭 Suggested improvement
     switch (event.event.type) {
       case 'CONNECTED':
         logger.info('[monitoring] WebSocket connected — starting idle-event timer');
         isConnected = true;
         startIdleCheck();
         break;
...
       case 'RECONNECTING':
         trackReconnection();
         break;
+      default:
+        logger.warn('[monitoring] Unexpected monitoring event subtype received by MonitoringActor');
+        break;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/actors/MonitoringActor.ts` around lines 167 - 197, In
MonitoringActor (the switch handling event.event.type), add a default branch for
the switch over event.event.type that logs a warning including the unexpected
type and the full event payload (e.g., logger.warn(`[monitoring] Unknown
MONITORING_EVENT sub-type: ${event.event.type}`, { event })) so unknown
monitoring sub-events are not silently ignored; keep the default branch
side-effect free (do not change isConnected/timers).
packages/daemon/__tests__/actors/MonitoringActor.test.ts (1)

95-97: Extract a shared microtask flush helper.

await Promise.resolve() is repeated across async timer tests; centralizing it would make intent clearer and reduce repetition.

♻️ Suggested cleanup
+const flushMicrotasks = async () => {
+  await Promise.resolve();
+  await Promise.resolve();
+};
+
 ...
-    await Promise.resolve();
-    await Promise.resolve(); // flush the .finally() microtask
+    await flushMicrotasks();

Also applies to: 123-124, 138-139, 150-151, 186-187

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts` around lines 95 -
97, The tests in MonitoringActor.test.ts repeat "await Promise.resolve()" to
flush microtasks after jest.advanceTimersByTime; extract a small helper (e.g.,
flushMicrotasks or flushPromises) and replace the duplicated await
Promise.resolve() calls in the tests (the occurrences around
jest.advanceTimersByTime and the other noted spots) with a single await
flushMicrotasks() call to clarify intent and reduce repetition; implement the
helper so it performs the same microtask flush behavior used currently and
update all instances (including the mentioned repeated lines) to call it.
packages/daemon/src/actions/index.ts (1)

217-263: Reduce duplicated monitoring action creators with a small factory.

All sendMonitoring* actions share the same choose/sendTo structure and differ only by subtype.

♻️ Suggested consolidation
+const sendMonitoringEvent = (monitoringEventType: string) => choose([{
+  cond: monitoringIsPresent,
+  actions: sendTo(getMonitoringRefFromContext, {
+    type: EventTypes.MONITORING_EVENT,
+    event: { type: monitoringEventType },
+  }),
+}]);
+
-export const sendMonitoringConnected = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'CONNECTED' } }),
-}]);
+export const sendMonitoringConnected = sendMonitoringEvent('CONNECTED');
...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/actions/index.ts` around lines 217 - 263, All
sendMonitoring* exports duplicate the same choose/sendTo pattern; create a small
factory function (e.g., makeSendMonitoringEvent) that accepts the event subtype
string (like 'CONNECTED', 'DISCONNECTED', etc.) and returns choose([{ cond:
monitoringIsPresent, actions: sendTo(getMonitoringRefFromContext, { type:
EventTypes.MONITORING_EVENT, event: { type: <subtype> } }) }]); then replace
sendMonitoringConnected, sendMonitoringDisconnected,
sendMonitoringEventReceived, sendMonitoringReconnecting,
sendMonitoringProcessingStarted, and sendMonitoringProcessingCompleted with
calls to this factory. Keep references to choose, sendTo, monitoringIsPresent,
getMonitoringRefFromContext, and EventTypes.MONITORING_EVENT to match the
existing behavior.
packages/daemon/src/machines/SyncMachine.ts (1)

213-373: Consider centralizing repeated processing lifecycle hooks.

entry: sendMonitoringProcessingStarted / exit: sendMonitoringProcessingCompleted is repeated across many states; a shared constant/helper would reduce future drift.

♻️ Suggested pattern
+const MONITORING_PROCESSING_LIFECYCLE = {
+  entry: ['sendMonitoringProcessingStarted'],
+  exit: ['sendMonitoringProcessingCompleted'],
+} as const;
+
 [CONNECTED_STATES.handlingUnhandledEvent]: {
   id: CONNECTED_STATES.handlingUnhandledEvent,
-  entry: ['sendMonitoringProcessingStarted'],
-  exit: ['sendMonitoringProcessingCompleted'],
+  ...MONITORING_PROCESSING_LIFECYCLE,
   invoke: {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/machines/SyncMachine.ts` around lines 213 - 373, Create a
single shared lifecycle constant (e.g. PROCESSING_LIFECYCLE = { entry:
['sendMonitoringProcessingStarted'], exit: ['sendMonitoringProcessingCompleted']
}) and replace the repeated entry/exit objects by spreading that constant into
each state that currently repeats them (for example the state objects named
handlingMetadataChanged, handlingVertexAccepted, handlingVertexRemoved,
handlingVoidedTx, handlingUnvoidedTx, handlingFirstBlock, handlingNcExecVoided,
handlingReorgStarted, handlingTokenCreated, checkingForMissedEvents and any
other CONNECTED_STATES entries); if a state already has other entry/exit
actions, merge arrays when spreading. This centralizes the hooks while keeping
existing invoke/onDone/onError configs (functions like handleVertexAccepted,
handleVoidedTx, metadataDiff, updateLastSyncedEvent, etc. remain unchanged).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e942e0 and 597a0da.

📒 Files selected for processing (13)
  • packages/daemon/__tests__/actors/MonitoringActor.test.ts
  • packages/daemon/__tests__/integration/balances.test.ts
  • packages/daemon/__tests__/integration/token_creation.test.ts
  • packages/daemon/jestIntegrationSetup.ts
  • packages/daemon/jest_integration.config.js
  • packages/daemon/src/actions/index.ts
  • packages/daemon/src/actors/MonitoringActor.ts
  • packages/daemon/src/actors/index.ts
  • packages/daemon/src/config.ts
  • packages/daemon/src/index.ts
  • packages/daemon/src/machines/SyncMachine.ts
  • packages/daemon/src/types/event.ts
  • packages/daemon/src/types/machine.ts

- Replace process.exit with MONITORING_IDLE_TIMEOUT callback event
- Handle MONITORING_IDLE_TIMEOUT in SyncMachine (-> ERROR state)
- Add machine.onDone shutdown path in index.ts
- Wrap monitoring send-actions in choose guard (no-op if actor absent)
- Add reconnection storm alert cooldown (1 min) to prevent spam
- Halve idle check interval to reduce worst-case detection lag to 1.5x
- Fix stuck-processing doc comment (MAJOR alert, machine keeps running)
- Fix err.message interpolation in alert catch handlers
@andreabadesso andreabadesso force-pushed the feat/daemon-monitoring-layer branch from 597a0da to 53aba8e on February 27, 2026 12:30

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
packages/daemon/__tests__/actors/MonitoringActor.test.ts (2)

43-62: Prefer restoring spies in teardown, not only clearing calls.

clearAllMocks resets history but leaves spy wrappers active. Restoring makes this suite safer and more isolated.

Proposed teardown hardening
+  afterEach(() => {
+    processExitSpy.mockRestore();
+  });
+
   afterAll(() => {
-    jest.clearAllMocks();
+    jest.restoreAllMocks();
     jest.useRealTimers();
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts` around lines 43 -
62, The test teardown only clears mocks and timers but doesn't restore spy
wrappers—ensure you restore any spies to avoid leaking instrumentation: add a
restore call for processExitSpy (the jest.spyOn(process, 'exit') created in
beforeEach) in the teardown (afterAll or afterEach) so process.exit is returned
to its original implementation; similarly restore any other spies if/when added
to this suite to fully undo spy wrappers.

91-103: Assert alert severity explicitly in alert-focused tests.

These tests validate title/callback, but not the severity level. Adding severity assertions would lock the MAJOR-alert requirement.

Example assertions to add
   expect(mockAddAlert).toHaveBeenCalledTimes(1);
   expect(mockAddAlert.mock.calls[0][0]).toBe('Daemon Idle — No Events Received');
+  expect(mockAddAlert.mock.calls[0][2]).toBe('MAJOR');
   expect(mockCallback).toHaveBeenCalledWith(MONITORING_IDLE_TIMEOUT_EVENT);
   expect(mockAddAlert).toHaveBeenCalledTimes(1);
   expect(mockAddAlert.mock.calls[0][0]).toBe('Daemon Stuck In Processing State');
+  expect(mockAddAlert.mock.calls[0][2]).toBe('MAJOR');
   expect(mockAddAlert).toHaveBeenCalledTimes(1);
   expect(mockAddAlert.mock.calls[0][0]).toBe('Daemon Reconnection Storm');
+  expect(mockAddAlert.mock.calls[0][2]).toBe('MAJOR');

Also applies to: 181-193, 236-246

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts` around lines 91 -
103, These alert-focused tests (calling MonitoringActor and asserting
mockAddAlert/mockCallback) must also assert the alert severity; for each test
that currently checks the title/callback (e.g., the test invoking
MonitoringActor and asserting mockAddAlert.mock.calls[0][0] is the title and
mockCallback is called), add an assertion that mockAddAlert.mock.calls[0][1]
equals the required MAJOR severity—use the project constant (e.g.,
MONITORING_ALERT_SEVERITY.MAJOR) if available, otherwise assert the string
'MAJOR'; apply the same addition to the other alert tests referenced in the
comment so the severity is explicitly verified.
packages/daemon/src/actions/index.ts (1)

217-263: Consider a small factory helper to remove repeated monitoring action boilerplate.

The six choose + sendTo blocks are near-identical; extracting a helper would reduce duplication and future drift.

Example dedup refactor
+const createMonitoringSender = (type: string) => choose([{
+  cond: monitoringIsPresent,
+  actions: sendTo(getMonitoringRefFromContext, {
+    type: EventTypes.MONITORING_EVENT,
+    event: { type },
+  }),
+}]);
+
-export const sendMonitoringConnected = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'CONNECTED' } }),
-}]);
+export const sendMonitoringConnected = createMonitoringSender('CONNECTED');
 
-export const sendMonitoringDisconnected = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'DISCONNECTED' } }),
-}]);
+export const sendMonitoringDisconnected = createMonitoringSender('DISCONNECTED');
 
-export const sendMonitoringEventReceived = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'EVENT_RECEIVED' } }),
-}]);
+export const sendMonitoringEventReceived = createMonitoringSender('EVENT_RECEIVED');
 
-export const sendMonitoringReconnecting = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'RECONNECTING' } }),
-}]);
+export const sendMonitoringReconnecting = createMonitoringSender('RECONNECTING');
 
-export const sendMonitoringProcessingStarted = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'PROCESSING_STARTED' } }),
-}]);
+export const sendMonitoringProcessingStarted = createMonitoringSender('PROCESSING_STARTED');
 
-export const sendMonitoringProcessingCompleted = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'PROCESSING_COMPLETED' } }),
-}]);
+export const sendMonitoringProcessingCompleted = createMonitoringSender('PROCESSING_COMPLETED');
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/actions/index.ts` around lines 217 - 263, Create a small
factory that builds the repeated choose+sendTo block and replace the six exports
with calls to it: implement a helper (e.g., makeMonitoringSender or
monitoringEventSender) that accepts the event name or event object and returns
choose([{ cond: monitoringIsPresent, actions:
sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event:
{ ... } }) }]); then replace sendMonitoringConnected,
sendMonitoringDisconnected, sendMonitoringEventReceived,
sendMonitoringReconnecting, sendMonitoringProcessingStarted, and
sendMonitoringProcessingCompleted with calls to that helper using the respective
event types; keep using the existing symbols monitoringIsPresent,
getMonitoringRefFromContext, EventTypes.MONITORING_EVENT, sendTo, and choose.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts`:
- Around line 43-62: The test teardown only clears mocks and timers but doesn't
restore spy wrappers—ensure you restore any spies to avoid leaking
instrumentation: add a restore call for processExitSpy (the jest.spyOn(process,
'exit') created in beforeEach) in the teardown (afterAll or afterEach) so
process.exit is returned to its original implementation; similarly restore any
other spies if/when added to this suite to fully undo spy wrappers.
- Around line 91-103: These alert-focused tests (calling MonitoringActor and
asserting mockAddAlert/mockCallback) must also assert the alert severity; for
each test that currently checks the title/callback (e.g., the test invoking
MonitoringActor and asserting mockAddAlert.mock.calls[0][0] is the title and
mockCallback is called), add an assertion that mockAddAlert.mock.calls[0][1]
equals the required MAJOR severity—use the project constant (e.g.,
MONITORING_ALERT_SEVERITY.MAJOR) if available, otherwise assert the string
'MAJOR'; apply the same addition to the other alert tests referenced in the
comment so the severity is explicitly verified.

In `@packages/daemon/src/actions/index.ts`:
- Around line 217-263: Create a small factory that builds the repeated
choose+sendTo block and replace the six exports with calls to it: implement a
helper (e.g., makeMonitoringSender or monitoringEventSender) that accepts the
event name or event object and returns choose([{ cond: monitoringIsPresent,
actions: sendTo(getMonitoringRefFromContext, { type:
EventTypes.MONITORING_EVENT, event: { ... } }) }]); then replace
sendMonitoringConnected, sendMonitoringDisconnected,
sendMonitoringEventReceived, sendMonitoringReconnecting,
sendMonitoringProcessingStarted, and sendMonitoringProcessingCompleted with
calls to that helper using the respective event types; keep using the existing
symbols monitoringIsPresent, getMonitoringRefFromContext,
EventTypes.MONITORING_EVENT, sendTo, and choose.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 597a0da and 53aba8e.

📒 Files selected for processing (6)
  • packages/daemon/__tests__/actors/MonitoringActor.test.ts
  • packages/daemon/src/actions/index.ts
  • packages/daemon/src/actors/MonitoringActor.ts
  • packages/daemon/src/index.ts
  • packages/daemon/src/machines/SyncMachine.ts
  • packages/daemon/src/types/event.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/daemon/src/actors/MonitoringActor.ts
  • packages/daemon/src/machines/SyncMachine.ts

@andreabadesso andreabadesso force-pushed the feat/daemon-monitoring-layer branch from c3193f0 to 7044b7a on February 27, 2026 13:12
@andreabadesso andreabadesso requested a review from r4mmer February 27, 2026 13:31
- restore processExitSpy after each test to prevent spy leakage
- add Severity.MAJOR assertions to alert tests
- replace repeated choose+sendTo blocks with makeMonitoringSender factory
@andreabadesso andreabadesso moved this from In Review (WIP) to In Progress (Done) in Hathor Network Feb 27, 2026

Labels

enhancement New feature or request

Projects

Status: In Progress (Done)

Development

Successfully merging this pull request may close these issues.

2 participants