feat(daemon): add monitoring layer to the sync state machine#369

Open
andreabadesso wants to merge 12 commits into master from feat/daemon-monitoring-layer

Conversation

@andreabadesso andreabadesso commented Feb 23, 2026

Summary

  • Adds a MonitoringActor that centralises runtime health checks for the sync state machine.
  • Four new optional env vars: IDLE_EVENT_TIMEOUT_MS, STUCK_PROCESSING_TIMEOUT_MS, RECONNECTION_STORM_THRESHOLD, RECONNECTION_STORM_WINDOW_MS.
| Condition | Response | Default threshold |
| --- | --- | --- |
| No event received while WS connected | MAJOR alert + process.exit(1) (Kubernetes restarts) | 5 min |
| Stuck in any processing state | MAJOR alert only — machine keeps running | 1 hour |
| ≥ threshold reconnects within window | MAJOR alert | 10 reconnects / 5 min |

The stuck-processing check intentionally does not force a reconnection — a long-running handler (e.g. a large reorg) should be allowed to finish. Operators are alerted so they can investigate if needed.
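
As a sketch of how the optional env vars could map to the defaults in the table above (the envNumber helper is illustrative, not the daemon's actual parsing code; the variable names are the ones introduced by this PR):

```typescript
// Illustrative helper: resolve an optional env var to a positive number,
// falling back to the default when the variable is unset or invalid.
function envNumber(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw === undefined || raw === '') return fallback;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

// Defaults matching the thresholds in the table above.
const monitoringConfig = {
  IDLE_EVENT_TIMEOUT_MS: envNumber('IDLE_EVENT_TIMEOUT_MS', 5 * 60 * 1000),           // 5 min
  STUCK_PROCESSING_TIMEOUT_MS: envNumber('STUCK_PROCESSING_TIMEOUT_MS', 60 * 60 * 1000), // 1 hour
  RECONNECTION_STORM_THRESHOLD: envNumber('RECONNECTION_STORM_THRESHOLD', 10),
  RECONNECTION_STORM_WINDOW_MS: envNumber('RECONNECTION_STORM_WINDOW_MS', 5 * 60 * 1000),
};
```

Because every variable falls back to a default, the daemon can start with none of them set, which is the behavior the acceptance criteria below require.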

Acceptance criteria

  • Idle timeout terminates the process; a MAJOR alert is sent before exit.
  • Stuck-processing fires a MAJOR alert after 1 hour; machine continues running uninterrupted.
  • Reconnection storm alert fires at threshold; stale entries outside the window are evicted.
  • All four env vars are optional — daemon starts correctly when unset.
  • All MonitoringActor unit tests pass; existing tests unaffected.

Summary by CodeRabbit

  • New Features

    • Runtime health monitoring for the sync process: idle detection, stuck-processing alerts, and reconnection-storm detection; sync lifecycle now emits monitoring signals.
    • New configurable monitoring settings (timeouts, thresholds, windows) with sensible defaults.
  • Tests

    • Comprehensive tests covering idle/stuck/reconnection behaviors, timer lifecycles, alerting, cooldowns, and logging.
    • Test setup updated with monitoring config values and mocks to prevent external side effects.

@andreabadesso andreabadesso force-pushed the feat/daemon-monitoring-layer branch from 13d34c8 to 2069e3e on February 24, 2026 16:06
@andreabadesso andreabadesso self-assigned this Feb 24, 2026
@andreabadesso andreabadesso added the enhancement New feature or request label Feb 24, 2026
@andreabadesso andreabadesso moved this to In Progress (Done) in Hathor Network Feb 24, 2026
@andreabadesso andreabadesso force-pushed the feat/daemon-monitoring-layer branch from 9ce6013 to 289eed9 on February 25, 2026 13:46
Comment on lines +205 to +207
if (!context.monitoring) {
throw new Error('No monitoring actor in context');
}
Member

issue (blocking): Throwing in getMonitoringRefFromContext can crash the machine before monitoring is spawned

Do we know the effects of the raised exception on the machine?

Also, is it possible to do something like "no monitoring -> start monitor -> resume machine"?

Collaborator Author

I refactored it to use choose and fail silently instead of throwing

This is mostly defensive; there is no case where monitoring will be null here.

I don't think we should build (at least in this PR) a more elaborate mechanism that pauses the machine and resumes the monitor when it's null.
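
The fail-silently guard can be pictured with a plain-TypeScript stand-in (sendIfPresent is a hypothetical name for illustration; the actual code uses XState's choose with a presence condition):

```typescript
// Hypothetical stand-in for the choose-guarded action: silently no-op when
// the monitoring actor ref is absent instead of throwing.
interface ActorRefLike {
  send(event: unknown): void;
}

function sendIfPresent(ref: ActorRefLike | null | undefined, event: unknown): boolean {
  if (!ref) return false; // defensive: monitoring should always be spawned by now
  ref.send(event);
  return true;
}
```
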

Severity.MAJOR,
{ idleMs: String(idleMs) },
logger,
).finally(() => process.exit(1));
Member

issue (blocking): Idle detector terminates the process from inside the actor

A hard shutdown can make the root cause very hard to determine. Can we instead send an event like MONITORING_IDLE_TIMEOUT that logs, adds an alert, and then proceeds with the normal daemon shutdown?

Collaborator Author

I like the idea, done in 597a0da

Moved the process.exit(1) to the index.js

Basically what happens now is that the "final" ERROR state is reached; it sends the alert through SQS and logs, and when that resolves the machine.onDone() callback is called, terminating the process.

logger,
).finally(() => process.exit(1));
}
}, idleTimeoutMs);
Member

suggestion: An idle-detection interval equal to the desired idle timeout may be too lenient.

If the interval is the same as the timeout, we may wait up to 2× the timeout before triggering. To mitigate this, the setInterval period could be idleTimeoutMs/2, so we hang for at most 1.5× the timeout.

Collaborator Author

Done in 597a0da
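
The worst case can be checked with a small model (illustrative code, not the daemon's): idleness starts just after a check tick, and detection happens at the first later tick whose elapsed idle time exceeds the timeout.

```typescript
// Model of a polling idle check: ticks at checkIntervalMs, 2*checkIntervalMs, ...
// Idleness begins an instant after t = 0, so a tick detects it only once the
// elapsed idle time strictly exceeds timeoutMs.
function worstCaseDetectionMs(timeoutMs: number, checkIntervalMs: number): number {
  let t = checkIntervalMs;
  while (t <= timeoutMs) {
    t += checkIntervalMs;
  }
  return t;
}

const timeout = 5 * 60 * 1000;
worstCaseDetectionMs(timeout, timeout);     // 2x the timeout when interval === timeout
worstCaseDetectionMs(timeout, timeout / 2); // 1.5x the timeout with a halved interval
```
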

* stalled without triggering a WebSocket disconnect.
*
* 2. Stuck-processing detection — PROCESSING_STARTED begins a one-shot timer;
* PROCESSING_COMPLETED cancels it. If the timer fires the actor fires a
Member

question: is it the timer or the actor that will fire a CRITICAL alert?

// ── Stuck-processing detection ───────────────────────────────────────────────
let stuckTimer: ReturnType<typeof setTimeout> | null = null;

const startStuckTimer = () => {
Member

issue (blocking): Stuck-processing logic does not implement the actor documentation comment because it does not notify the machine

The header comment says it “fires a CRITICAL alert and sends MONITORING_STUCK_PROCESSING back to the machine”, but the implementation only sends an alert (Severity.MAJOR).

Collaborator Author

Updated the comment, thanks

},
logger,
).catch((err: Error) =>
logger.error(`[monitoring] Failed to send reconnection storm alert: ${err}`),
Member

question: Will the err string interpolation use err.message or [object Object]?

Collaborator Author

It would call Error.prototype.toString() which would produce Error: <message>, but I've changed it to use err.message directly
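
This is easy to confirm in plain Node/TypeScript (no project code involved):

```typescript
const err = new Error('connection refused');

// Template-literal interpolation invokes Error.prototype.toString(),
// which yields "Error: <message>" rather than "[object Object]".
const viaToString = `alert failed: ${err}`;        // "alert failed: Error: connection refused"

// Interpolating the message directly drops the "Error: " prefix.
const viaMessage = `alert failed: ${err.message}`; // "alert failed: connection refused"
```
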

const windowStart = now - stormWindowMs;
reconnectionTimestamps = reconnectionTimestamps.filter(t => t >= windowStart);

if (reconnectionTimestamps.length >= stormThreshold) {
Member

issue (blocking): Reconnection storm will spam the alerting system.

If we enter a reconnection storm, each new "RECONNECTING" event will trigger a new alert, spamming the alerting system.

We could track a stormAlertLastFiredAt timestamp and only fire a new alert if we haven't fired one within the last minute.

Collaborator Author

Nice suggestion, thanks

Done
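
One way to sketch the sliding window plus cooldown (the StormTracker class and its injected clock are mine, for illustration; the PR's actual fix lives inside the actor):

```typescript
// Illustrative tracker: counts reconnects inside a sliding window and
// rate-limits alerts with a cooldown, so a sustained storm fires at most
// one alert per cooldown period.
class StormTracker {
  private timestamps: number[] = [];
  private lastAlertAt = Number.NEGATIVE_INFINITY;

  constructor(
    private readonly threshold: number,
    private readonly windowMs: number,
    private readonly cooldownMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  /** Records one reconnect; returns true when an alert should fire. */
  recordReconnect(): boolean {
    const t = this.now();
    this.timestamps.push(t);
    // Evict entries that fell outside the window.
    this.timestamps = this.timestamps.filter(ts => ts >= t - this.windowMs);
    if (this.timestamps.length < this.threshold) return false;
    if (t - this.lastAlertAt < this.cooldownMs) return false; // still cooling down
    this.lastAlertAt = t;
    return true;
  }
}
```

Injecting the clock keeps the logic trivially unit-testable without fake timers.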

const { ACK_TIMEOUT_MS } = getConfig();
return ACK_TIMEOUT_MS;
};

Member

nitpick: clean empty line

Collaborator Author

Done

Comment on lines +145 to +146
'sendMonitoringReconnecting',
'sendMonitoringDisconnected',
Member

question: Are we sure these 2 actions should be triggered on the same state?

Collaborator Author

Yes, it's disconnected and reconnecting

target: `#${SYNC_MACHINE_STATES.ERROR}`,
}, {
- actions: ['storeEvent', 'sendAck'],
+ actions: ['storeEvent', 'sendAck', 'sendMonitoringEventReceived'],
Member

question: Do we have a better way to trigger sendMonitoringEventReceived on required transitions?

Marking every transition may become an issue: if we add more events in the future, we could forget to add this action.

Collaborator Author

Unfortunately not! In XState v4, there is no clean mechanism to intercept all transitions through a particular state without adding the action explicitly per-transition

@r4mmer r4mmer moved this from In Progress (Done) to In Review (WIP) in Hathor Network Feb 25, 2026

coderabbitai bot commented Feb 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7044b7a and b020cae.

📒 Files selected for processing (2)
  • packages/daemon/__tests__/actors/MonitoringActor.test.ts
  • packages/daemon/src/actions/index.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/daemon/__tests__/actors/MonitoringActor.test.ts

📝 Walkthrough

Adds a MonitoringActor and integrates monitoring events into the SyncMachine: idle detection, stuck-processing detection, and reconnection-storm detection with alert emission, config, types, actions, tests, Jest setup, and a final-state process exit handler.

Changes

  • Monitoring Actor Core (packages/daemon/src/actors/MonitoringActor.ts, packages/daemon/src/actors/index.ts): New MonitoringActor implementing idle timers, stuck-processing timers, reconnection-storm tracking, alert emission, logging, and cleanup; re-exported in the actors index.
  • Configuration (packages/daemon/src/config.ts): Adds exported config keys IDLE_EVENT_TIMEOUT_MS, STUCK_PROCESSING_TIMEOUT_MS, RECONNECTION_STORM_THRESHOLD, and RECONNECTION_STORM_WINDOW_MS with defaults, included in the default export.
  • Monitoring Actions (packages/daemon/src/actions/index.ts): Adds a helper to obtain the monitoring actor ref, a presence predicate, and six new exported actions that send MONITORING_EVENT variants via sendTo/choose.
  • Types & Context (packages/daemon/src/types/event.ts, packages/daemon/src/types/machine.ts): Introduces the MonitoringEvent type, adds MONITORING_EVENT and MONITORING_IDLE_TIMEOUT to EventTypes, extends the Event union, and adds `monitoring: ActorRef<any, any>` to the machine context.
  • SyncMachine Integration (packages/daemon/src/machines/SyncMachine.ts, packages/daemon/src/index.ts): Spawns MonitoringActor during INITIALIZING, wires sendMonitoring... actions across state entries/exits (CONNECTED, processing states, ERROR, RECONNECTING), adds the MONITORING_IDLE_TIMEOUT transition, and registers onDone to exit the process on the final state.
  • Services Typing (packages/daemon/src/services/index.ts): Narrows the metadataDiff event parameter to the FULLNODE_EVENT variant for stronger typing (no runtime behavior change).
  • Tests & Test Config (packages/daemon/__tests__/actors/MonitoringActor.test.ts, packages/daemon/__tests__/integration/balances.test.ts, packages/daemon/__tests__/integration/token_creation.test.ts): Adds comprehensive MonitoringActor unit tests (timers, alerting, reconnection storm, logging) and updates integration test mocks to include the new monitoring config keys.
  • Jest Integration Setup (packages/daemon/jestIntegrationSetup.ts, packages/daemon/jest_integration.config.js): Adds a Jest setup that mocks @wallet-service/common's addAlert to avoid external AWS calls and registers the setup file for integration tests.

Sequence Diagram(s)

sequenceDiagram
    participant SyncMachine
    participant MonitoringActor
    participant AlertSystem as Alert System
    participant Logger

    SyncMachine->>MonitoringActor: MONITORING_EVENT (CONNECTED)
    activate MonitoringActor
    MonitoringActor->>MonitoringActor: Start idle timer (IDLE_EVENT_TIMEOUT_MS)
    
    rect rgba(100, 200, 100, 0.5)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (EVENT_RECEIVED)
        MonitoringActor->>MonitoringActor: Reset idle timer
    end
    
    rect rgba(200, 100, 100, 0.5)
        MonitoringActor->>MonitoringActor: Idle timer expires
        MonitoringActor->>AlertSystem: addAlert (MAJOR - idle)
        AlertSystem-->>MonitoringActor: Alert acknowledged
        MonitoringActor->>Logger: Log idle alert
        MonitoringActor->>SyncMachine: Emit MONITORING_IDLE_TIMEOUT
    end
    
    rect rgba(200, 150, 100, 0.5)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (PROCESSING_STARTED)
        MonitoringActor->>MonitoringActor: Start stuck-processing timer (STUCK_PROCESSING_TIMEOUT_MS)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (PROCESSING_COMPLETED)
        MonitoringActor->>MonitoringActor: Clear stuck-processing timer
    end
    
    rect rgba(150, 100, 200, 0.5)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (RECONNECTING)
        MonitoringActor->>MonitoringActor: Record reconnect timestamp
        MonitoringActor->>MonitoringActor: Check timestamps within RECONNECTION_STORM_WINDOW_MS
        alt Count >= RECONNECTION_STORM_THRESHOLD
            MonitoringActor->>AlertSystem: addAlert (MAJOR - reconnection storm)
            AlertSystem-->>MonitoringActor: Alert acknowledged
            MonitoringActor->>Logger: Log storm alert (cooldown enforced)
        end
    end

    deactivate MonitoringActor

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop, hop — timers watch the day,
Idle bells and stuck tasks play.
I count reconnects, send out a cheer,
Nudge the alerts when storms draw near.
Tiny paws, a steady ear.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(daemon): add monitoring layer to the sync state machine' accurately describes the main change—adding a comprehensive monitoring layer with a MonitoringActor for health checks including idle detection, stuck-processing detection, and reconnection storm detection across the daemon's sync state machine.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (6)
packages/daemon/__tests__/integration/balances.test.ts (1)

137-155: Per-test mocks are missing monitoring configuration keys.

The per-test getConfig.mockReturnValue calls completely override the global mock but don't include the new monitoring configuration keys (IDLE_EVENT_TIMEOUT_MS, STUCK_PROCESSING_TIMEOUT_MS, RECONNECTION_STORM_THRESHOLD, RECONNECTION_STORM_WINDOW_MS). This inconsistency could cause issues if MonitoringActor or related code expects these values to be defined.

Consider adding the monitoring config keys to per-test mocks for consistency, or extract the common config into a shared object that can be spread into each mock.

♻️ Proposed approach

Extract common config values and spread them:

const baseConfig = {
  NETWORK: 'testnet',
  SERVICE_NAME: 'daemon-test',
  CONSOLE_LEVEL: 'debug',
  TX_CACHE_SIZE: 100,
  BLOCK_REWARD_LOCK: 300,
  FULLNODE_PEER_ID: 'simulator_peer_id',
  STREAM_ID: 'simulator_stream_id',
  FULLNODE_NETWORK: 'unittests',
  USE_SSL: false,
  DB_ENDPOINT,
  DB_NAME,
  DB_USER,
  DB_PASS,
  DB_PORT,
  ACK_TIMEOUT_MS: 300000,
  IDLE_EVENT_TIMEOUT_MS: 5 * 60 * 1000,
  STUCK_PROCESSING_TIMEOUT_MS: 5 * 60 * 1000,
  RECONNECTION_STORM_THRESHOLD: 10,
  RECONNECTION_STORM_WINDOW_MS: 5 * 60 * 1000,
};

// Then in each test:
getConfig.mockReturnValue({
  ...baseConfig,
  FULLNODE_HOST: `127.0.0.1:${UNVOIDED_SCENARIO_PORT}`,
});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/__tests__/integration/balances.test.ts` around lines 137 -
155, Per-test mocks calling getConfig.mockReturnValue override the global config
but omit monitoring keys used by MonitoringActor (IDLE_EVENT_TIMEOUT_MS,
STUCK_PROCESSING_TIMEOUT_MS, RECONNECTION_STORM_THRESHOLD,
RECONNECTION_STORM_WINDOW_MS); update each per-test getConfig.mockReturnValue to
include these monitoring keys (with suitable values) or extract a shared
baseConfig object containing the common fields plus those monitoring keys and
spread that into each test-specific mock (e.g., keep FULLNODE_HOST overrides but
spread ...baseConfig) so MonitoringActor and related code always see the
required settings.
packages/daemon/src/actors/MonitoringActor.ts (2)

47-50: Harden monitoring thresholds against invalid numeric config.

Using ?? alone accepts 0, negative values, and NaN, which can create pathological timer behavior. Normalize to positive integers before arming timers.

🛡️ Proposed hardening
 export default (callback: any, receive: any, config = getConfig()) => {
   logger.info('Starting monitoring actor');

-  const idleTimeoutMs = config.IDLE_EVENT_TIMEOUT_MS ?? DEFAULT_IDLE_EVENT_TIMEOUT_MS;
-  const stuckTimeoutMs = config.STUCK_PROCESSING_TIMEOUT_MS ?? DEFAULT_STUCK_PROCESSING_TIMEOUT_MS;
-  const stormThreshold = config.RECONNECTION_STORM_THRESHOLD ?? DEFAULT_RECONNECTION_STORM_THRESHOLD;
-  const stormWindowMs = config.RECONNECTION_STORM_WINDOW_MS ?? DEFAULT_RECONNECTION_STORM_WINDOW_MS;
+  const asPositiveInt = (value: unknown, fallback: number) => {
+    const n = Number(value);
+    return Number.isFinite(n) && n > 0 ? Math.floor(n) : fallback;
+  };
+
+  const idleTimeoutMs = asPositiveInt(config.IDLE_EVENT_TIMEOUT_MS, DEFAULT_IDLE_EVENT_TIMEOUT_MS);
+  const stuckTimeoutMs = asPositiveInt(config.STUCK_PROCESSING_TIMEOUT_MS, DEFAULT_STUCK_PROCESSING_TIMEOUT_MS);
+  const stormThreshold = asPositiveInt(config.RECONNECTION_STORM_THRESHOLD, DEFAULT_RECONNECTION_STORM_THRESHOLD);
+  const stormWindowMs = asPositiveInt(config.RECONNECTION_STORM_WINDOW_MS, DEFAULT_RECONNECTION_STORM_WINDOW_MS);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/actors/MonitoringActor.ts` around lines 47 - 50, The
config values idleTimeoutMs, stuckTimeoutMs, stormThreshold, and stormWindowMs
are currently chosen with the nullish coalescing operator which still accepts 0,
negatives, and NaN; change the assignment logic to coerce each config value to a
Number (or parseInt), validate that it is a finite positive integer (for
timeouts/ window use >0, for stormThreshold use >=1), and if validation fails
fall back to the corresponding DEFAULT_* constant so timers and thresholds never
receive 0, negative, or NaN values; update the code that initializes
idleTimeoutMs, stuckTimeoutMs, stormThreshold, and stormWindowMs accordingly.

167-197: Log unknown monitoring sub-events explicitly.

Unexpected MONITORING_EVENT payload types are currently ignored silently; adding a default warning improves diagnosability.

🧭 Suggested improvement
     switch (event.event.type) {
       case 'CONNECTED':
         logger.info('[monitoring] WebSocket connected — starting idle-event timer');
         isConnected = true;
         startIdleCheck();
         break;
...
       case 'RECONNECTING':
         trackReconnection();
         break;
+      default:
+        logger.warn('[monitoring] Unexpected monitoring event subtype received by MonitoringActor');
+        break;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/actors/MonitoringActor.ts` around lines 167 - 197, In
MonitoringActor (the switch handling event.event.type), add a default branch for
the switch over event.event.type that logs a warning including the unexpected
type and the full event payload (e.g., logger.warn(`[monitoring] Unknown
MONITORING_EVENT sub-type: ${event.event.type}`, { event })) so unknown
monitoring sub-events are not silently ignored; keep the default branch
side-effect free (do not change isConnected/timers).
packages/daemon/__tests__/actors/MonitoringActor.test.ts (1)

95-97: Extract a shared microtask flush helper.

await Promise.resolve() is repeated across async timer tests; centralizing it would make intent clearer and reduce repetition.

♻️ Suggested cleanup
+const flushMicrotasks = async () => {
+  await Promise.resolve();
+  await Promise.resolve();
+};
+
 ...
-    await Promise.resolve();
-    await Promise.resolve(); // flush the .finally() microtask
+    await flushMicrotasks();

Also applies to: 123-124, 138-139, 150-151, 186-187

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts` around lines 95 -
97, The tests in MonitoringActor.test.ts repeat "await Promise.resolve()" to
flush microtasks after jest.advanceTimersByTime; extract a small helper (e.g.,
flushMicrotasks or flushPromises) and replace the duplicated await
Promise.resolve() calls in the tests (the occurrences around
jest.advanceTimersByTime and the other noted spots) with a single await
flushMicrotasks() call to clarify intent and reduce repetition; implement the
helper so it performs the same microtask flush behavior used currently and
update all instances (including the mentioned repeated lines) to call it.
packages/daemon/src/actions/index.ts (1)

217-263: Reduce duplicated monitoring action creators with a small factory.

All sendMonitoring* actions share the same choose/sendTo structure and differ only by subtype.

♻️ Suggested consolidation
+const sendMonitoringEvent = (monitoringEventType: string) => choose([{
+  cond: monitoringIsPresent,
+  actions: sendTo(getMonitoringRefFromContext, {
+    type: EventTypes.MONITORING_EVENT,
+    event: { type: monitoringEventType },
+  }),
+}]);
+
-export const sendMonitoringConnected = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'CONNECTED' } }),
-}]);
+export const sendMonitoringConnected = sendMonitoringEvent('CONNECTED');
...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/actions/index.ts` around lines 217 - 263, All
sendMonitoring* exports duplicate the same choose/sendTo pattern; create a small
factory function (e.g., makeSendMonitoringEvent) that accepts the event subtype
string (like 'CONNECTED', 'DISCONNECTED', etc.) and returns choose([{ cond:
monitoringIsPresent, actions: sendTo(getMonitoringRefFromContext, { type:
EventTypes.MONITORING_EVENT, event: { type: <subtype> } }) }]); then replace
sendMonitoringConnected, sendMonitoringDisconnected,
sendMonitoringEventReceived, sendMonitoringReconnecting,
sendMonitoringProcessingStarted, and sendMonitoringProcessingCompleted with
calls to this factory. Keep references to choose, sendTo, monitoringIsPresent,
getMonitoringRefFromContext, and EventTypes.MONITORING_EVENT to match the
existing behavior.
packages/daemon/src/machines/SyncMachine.ts (1)

213-373: Consider centralizing repeated processing lifecycle hooks.

entry: sendMonitoringProcessingStarted / exit: sendMonitoringProcessingCompleted is repeated across many states; a shared constant/helper would reduce future drift.

♻️ Suggested pattern
+const MONITORING_PROCESSING_LIFECYCLE = {
+  entry: ['sendMonitoringProcessingStarted'],
+  exit: ['sendMonitoringProcessingCompleted'],
+} as const;
+
 [CONNECTED_STATES.handlingUnhandledEvent]: {
   id: CONNECTED_STATES.handlingUnhandledEvent,
-  entry: ['sendMonitoringProcessingStarted'],
-  exit: ['sendMonitoringProcessingCompleted'],
+  ...MONITORING_PROCESSING_LIFECYCLE,
   invoke: {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/machines/SyncMachine.ts` around lines 213 - 373, Create a
single shared lifecycle constant (e.g. PROCESSING_LIFECYCLE = { entry:
['sendMonitoringProcessingStarted'], exit: ['sendMonitoringProcessingCompleted']
}) and replace the repeated entry/exit objects by spreading that constant into
each state that currently repeats them (for example the state objects named
handlingMetadataChanged, handlingVertexAccepted, handlingVertexRemoved,
handlingVoidedTx, handlingUnvoidedTx, handlingFirstBlock, handlingNcExecVoided,
handlingReorgStarted, handlingTokenCreated, checkingForMissedEvents and any
other CONNECTED_STATES entries); if a state already has other entry/exit
actions, merge arrays when spreading. This centralizes the hooks while keeping
existing invoke/onDone/onError configs (functions like handleVertexAccepted,
handleVoidedTx, metadataDiff, updateLastSyncedEvent, etc. remain unchanged).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e942e0 and 597a0da.

📒 Files selected for processing (13)
  • packages/daemon/__tests__/actors/MonitoringActor.test.ts
  • packages/daemon/__tests__/integration/balances.test.ts
  • packages/daemon/__tests__/integration/token_creation.test.ts
  • packages/daemon/jestIntegrationSetup.ts
  • packages/daemon/jest_integration.config.js
  • packages/daemon/src/actions/index.ts
  • packages/daemon/src/actors/MonitoringActor.ts
  • packages/daemon/src/actors/index.ts
  • packages/daemon/src/config.ts
  • packages/daemon/src/index.ts
  • packages/daemon/src/machines/SyncMachine.ts
  • packages/daemon/src/types/event.ts
  • packages/daemon/src/types/machine.ts

- Replace process.exit with MONITORING_IDLE_TIMEOUT callback event
- Handle MONITORING_IDLE_TIMEOUT in SyncMachine (-> ERROR state)
- Add machine.onDone shutdown path in index.ts
- Wrap monitoring send-actions in choose guard (no-op if actor absent)
- Add reconnection storm alert cooldown (1 min) to prevent spam
- Halve idle check interval to reduce worst-case detection lag to 1.5x
- Fix stuck-processing doc comment (MAJOR alert, machine keeps running)
- Fix err.message interpolation in alert catch handlers
@andreabadesso andreabadesso force-pushed the feat/daemon-monitoring-layer branch from 597a0da to 53aba8e on February 27, 2026 12:30

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
packages/daemon/__tests__/actors/MonitoringActor.test.ts (2)

43-62: Prefer restoring spies in teardown, not only clearing calls.

clearAllMocks resets history but leaves spy wrappers active. Restoring makes this suite safer and more isolated.

Proposed teardown hardening
+  afterEach(() => {
+    processExitSpy.mockRestore();
+  });
+
   afterAll(() => {
-    jest.clearAllMocks();
+    jest.restoreAllMocks();
     jest.useRealTimers();
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts` around lines 43 -
62, The test teardown only clears mocks and timers but doesn't restore spy
wrappers—ensure you restore any spies to avoid leaking instrumentation: add a
restore call for processExitSpy (the jest.spyOn(process, 'exit') created in
beforeEach) in the teardown (afterAll or afterEach) so process.exit is returned
to its original implementation; similarly restore any other spies if/when added
to this suite to fully undo spy wrappers.

91-103: Assert alert severity explicitly in alert-focused tests.

These tests validate title/callback, but not the severity level. Adding severity assertions would lock the MAJOR-alert requirement.

Example assertions to add
   expect(mockAddAlert).toHaveBeenCalledTimes(1);
   expect(mockAddAlert.mock.calls[0][0]).toBe('Daemon Idle — No Events Received');
+  expect(mockAddAlert.mock.calls[0][2]).toBe('MAJOR');
   expect(mockCallback).toHaveBeenCalledWith(MONITORING_IDLE_TIMEOUT_EVENT);
   expect(mockAddAlert).toHaveBeenCalledTimes(1);
   expect(mockAddAlert.mock.calls[0][0]).toBe('Daemon Stuck In Processing State');
+  expect(mockAddAlert.mock.calls[0][2]).toBe('MAJOR');
   expect(mockAddAlert).toHaveBeenCalledTimes(1);
   expect(mockAddAlert.mock.calls[0][0]).toBe('Daemon Reconnection Storm');
+  expect(mockAddAlert.mock.calls[0][2]).toBe('MAJOR');

Also applies to: 181-193, 236-246

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts` around lines 91 -
103, These alert-focused tests (calling MonitoringActor and asserting
mockAddAlert/mockCallback) must also assert the alert severity; for each test
that currently checks the title/callback (e.g., the test invoking
MonitoringActor and asserting mockAddAlert.mock.calls[0][0] is the title and
mockCallback is called), add an assertion that mockAddAlert.mock.calls[0][1]
equals the required MAJOR severity—use the project constant (e.g.,
MONITORING_ALERT_SEVERITY.MAJOR) if available, otherwise assert the string
'MAJOR'; apply the same addition to the other alert tests referenced in the
comment so the severity is explicitly verified.
packages/daemon/src/actions/index.ts (1)

217-263: Consider a small factory helper to remove repeated monitoring action boilerplate.

The six choose + sendTo blocks are near-identical; extracting a helper would reduce duplication and future drift.

Example dedup refactor
+const createMonitoringSender = (type: string) => choose([{
+  cond: monitoringIsPresent,
+  actions: sendTo(getMonitoringRefFromContext, {
+    type: EventTypes.MONITORING_EVENT,
+    event: { type },
+  }),
+}]);
+
-export const sendMonitoringConnected = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'CONNECTED' } }),
-}]);
+export const sendMonitoringConnected = createMonitoringSender('CONNECTED');
 
-export const sendMonitoringDisconnected = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'DISCONNECTED' } }),
-}]);
+export const sendMonitoringDisconnected = createMonitoringSender('DISCONNECTED');
 
-export const sendMonitoringEventReceived = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'EVENT_RECEIVED' } }),
-}]);
+export const sendMonitoringEventReceived = createMonitoringSender('EVENT_RECEIVED');
 
-export const sendMonitoringReconnecting = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'RECONNECTING' } }),
-}]);
+export const sendMonitoringReconnecting = createMonitoringSender('RECONNECTING');
 
-export const sendMonitoringProcessingStarted = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'PROCESSING_STARTED' } }),
-}]);
+export const sendMonitoringProcessingStarted = createMonitoringSender('PROCESSING_STARTED');
 
-export const sendMonitoringProcessingCompleted = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'PROCESSING_COMPLETED' } }),
-}]);
+export const sendMonitoringProcessingCompleted = createMonitoringSender('PROCESSING_COMPLETED');
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/daemon/src/actions/index.ts` around lines 217 - 263, Create a small
factory that builds the repeated choose+sendTo block and replace the six exports
with calls to it: implement a helper (e.g., makeMonitoringSender or
monitoringEventSender) that accepts the event name or event object and returns
choose([{ cond: monitoringIsPresent, actions:
sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event:
{ ... } }) }]); then replace sendMonitoringConnected,
sendMonitoringDisconnected, sendMonitoringEventReceived,
sendMonitoringReconnecting, sendMonitoringProcessingStarted, and
sendMonitoringProcessingCompleted with calls to that helper using the respective
event types; keep using the existing symbols monitoringIsPresent,
getMonitoringRefFromContext, EventTypes.MONITORING_EVENT, sendTo, and choose.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts`:
- Around line 43-62: The test teardown only clears mocks and timers but doesn't
restore spy wrappers—ensure you restore any spies to avoid leaking
instrumentation: add a restore call for processExitSpy (the jest.spyOn(process,
'exit') created in beforeEach) in the teardown (afterAll or afterEach) so
process.exit is returned to its original implementation; similarly restore any
other spies if/when added to this suite to fully undo spy wrappers.
- Around line 91-103: These alert-focused tests (calling MonitoringActor and
asserting mockAddAlert/mockCallback) must also assert the alert severity; for
each test that currently checks the title/callback (e.g., the test invoking
MonitoringActor and asserting mockAddAlert.mock.calls[0][0] is the title and
mockCallback is called), add an assertion that mockAddAlert.mock.calls[0][1]
equals the required MAJOR severity—use the project constant (e.g.,
MONITORING_ALERT_SEVERITY.MAJOR) if available, otherwise assert the string
'MAJOR'; apply the same addition to the other alert tests referenced in the
comment so the severity is explicitly verified.

In `@packages/daemon/src/actions/index.ts`:
- Around line 217-263: Create a small factory that builds the repeated
choose+sendTo block and replace the six exports with calls to it: implement a
helper (e.g., makeMonitoringSender or monitoringEventSender) that accepts the
event name or event object and returns choose([{ cond: monitoringIsPresent,
actions: sendTo(getMonitoringRefFromContext, { type:
EventTypes.MONITORING_EVENT, event: { ... } }) }]); then replace
sendMonitoringConnected, sendMonitoringDisconnected,
sendMonitoringEventReceived, sendMonitoringReconnecting,
sendMonitoringProcessingStarted, and sendMonitoringProcessingCompleted with
calls to that helper using the respective event types; keep using the existing
symbols monitoringIsPresent, getMonitoringRefFromContext,
EventTypes.MONITORING_EVENT, sendTo, and choose.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 597a0da and 53aba8e.

📒 Files selected for processing (6)
  • packages/daemon/__tests__/actors/MonitoringActor.test.ts
  • packages/daemon/src/actions/index.ts
  • packages/daemon/src/actors/MonitoringActor.ts
  • packages/daemon/src/index.ts
  • packages/daemon/src/machines/SyncMachine.ts
  • packages/daemon/src/types/event.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/daemon/src/actors/MonitoringActor.ts
  • packages/daemon/src/machines/SyncMachine.ts

@andreabadesso andreabadesso force-pushed the feat/daemon-monitoring-layer branch from c3193f0 to 7044b7a on February 27, 2026 13:12
@andreabadesso andreabadesso requested a review from r4mmer February 27, 2026 13:31
- restore processExitSpy after each test to prevent spy leakage
- add Severity.MAJOR assertions to alert tests
- replace repeated choose+sendTo blocks with makeMonitoringSender factory
@andreabadesso andreabadesso moved this from In Review (WIP) to In Progress (Done) in Hathor Network Feb 27, 2026

Labels

enhancement New feature or request

Projects

Status: In Progress (Done)

Development

Successfully merging this pull request may close these issues.

2 participants