feat(daemon): add monitoring layer to the sync state machine #369
andreabadesso wants to merge 12 commits into master
Conversation
13d34c8 to 2069e3e
…estart, and reconnection storm alerts
9ce6013 to 289eed9
```typescript
if (!context.monitoring) {
  throw new Error('No monitoring actor in context');
}
```
issue (blocking): Throwing in getMonitoringRefFromContext can crash the machine before monitoring is spawned
Do we know the effects of the raised exception on the machine?
Also, is it possible to do something like "no monitoring -> start monitor -> resume machine"?
I refactored it to use choose and fail silently instead of throwing
This is more defensive than anything as there is no case where monitoring will be null here
I don't think we should do (at least in this PR) a more elaborate mechanism of pausing the machine and resuming the monitor if it's null
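For illustration, the fail-silent guard pattern described above can be sketched in plain TypeScript (this is a simplified stand-in, not the actual XState `choose` wiring; the `Context` shape and `sendMonitoringEvent` helper are assumptions):

```typescript
// Hypothetical sketch of the fail-silent guard: instead of throwing when
// context.monitoring is missing, the send becomes a no-op.
type MonitoringRef = { send: (event: unknown) => void };
type Context = { monitoring?: MonitoringRef };

function sendMonitoringEvent(context: Context, event: unknown): boolean {
  if (!context.monitoring) {
    return false; // fail silently: no actor spawned yet, nothing to notify
  }
  context.monitoring.send(event);
  return true;
}

// With no monitoring actor in context, the call is a safe no-op:
console.log(sendMonitoringEvent({}, { type: 'EVENT_RECEIVED' })); // false
```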
```typescript
    Severity.MAJOR,
    { idleMs: String(idleMs) },
    logger,
  ).finally(() => process.exit(1));
```
issue (blocking): Idle detector terminates the process from inside the actor
A hard shutdown makes the cause of the issue very hard to determine. Can we send an event like MONITORING_IDLE_TIMEOUT that logs, adds an alert, and then proceeds to the normal daemon shutdown?
I like the idea, done in 597a0da
Moved the process.exit(1) to the index.js
Basically what happens now is that the final ERROR state is reached, which sends the alert through SQS and logs it; when that resolves, the machine.onDone() callback is called, terminating the process
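A rough sketch of that shutdown path, with a minimal stand-in for the interpreted machine (the `FakeService` class and listener names are illustrative assumptions, not the daemon's actual API):

```typescript
// Minimal stand-in: reaching the final ERROR state first resolves its work
// (alert + logs), then invokes onDone listeners, where index.js would call
// process.exit(1).
class FakeService {
  private doneListeners: Array<() => void> = [];

  onDone(listener: () => void): this {
    this.doneListeners.push(listener);
    return this;
  }

  async reachFinalErrorState(sendAlert: () => Promise<void>): Promise<void> {
    await sendAlert(); // alert is sent and logged before listeners run
    this.doneListeners.forEach(listener => listener());
  }
}

let exitCode: number | null = null;
const service = new FakeService().onDone(() => {
  exitCode = 1; // in index.js this would be process.exit(1)
});

service
  .reachFinalErrorState(async () => { /* alert sent */ })
  .then(() => console.log(exitCode)); // 1
```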
```typescript
      logger,
    ).finally(() => process.exit(1));
  }
}, idleTimeoutMs);
```
suggestion: An idle-detection interval equal to the desired idle time may be too lenient.
If the interval is the same as the timeout, we may wait up to twice the timeout before triggering.
To mitigate this we could set the setInterval period to idleTimeoutMs/2, so we hang for at most "timeout * 1.5".
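The worst-case arithmetic behind this suggestion can be sketched with an illustrative timeout value (the 60-second figure is an assumption, not the real config):

```typescript
// Worst case: an event arrives just after a check, so detection waits for
// the idle timeout to elapse plus up to one full check interval.
const idleTimeoutMs = 60_000; // illustrative value, not the real config

const worstCaseLagMs = (checkIntervalMs: number) =>
  idleTimeoutMs + checkIntervalMs;

console.log(worstCaseLagMs(idleTimeoutMs) / idleTimeoutMs);     // 2
console.log(worstCaseLagMs(idleTimeoutMs / 2) / idleTimeoutMs); // 1.5
```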
```typescript
 * stalled without triggering a WebSocket disconnect.
 *
 * 2. Stuck-processing detection — PROCESSING_STARTED begins a one-shot timer;
 *    PROCESSING_COMPLETED cancels it. If the timer fires the actor fires a
```
question: is it the timer or the actor that will fire a CRITICAL alert?
```typescript
// ── Stuck-processing detection ───────────────────────────────────────────────
let stuckTimer: ReturnType<typeof setTimeout> | null = null;

const startStuckTimer = () => {
```
issue (blocking): Stuck-processing logic does not implement the actor documentation comment because it does not notify the machine
The header comment says it “fires a CRITICAL alert and sends MONITORING_STUCK_PROCESSING back to the machine”, but the implementation only sends an alert (Severity.MAJOR).
Updated the comment, thanks
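The one-shot timer pattern under discussion can be sketched like this (variable names follow the quoted snippet; the timeout value and the alert flag are illustrative):

```typescript
// PROCESSING_STARTED arms the one-shot timer; PROCESSING_COMPLETED cancels
// it. Only if it actually fires would the actor send the MAJOR alert.
let stuckTimer: ReturnType<typeof setTimeout> | null = null;
let stuckAlertFired = false;

const startStuckTimer = (stuckTimeoutMs: number) => {
  if (stuckTimer) clearTimeout(stuckTimer); // re-arm on a new PROCESSING_STARTED
  stuckTimer = setTimeout(() => {
    stuckTimer = null;
    stuckAlertFired = true; // here the real actor would call addAlert(...)
  }, stuckTimeoutMs);
};

const clearStuckTimer = () => {
  if (stuckTimer) {
    clearTimeout(stuckTimer);
    stuckTimer = null;
  }
};

// Normal lifecycle: processing completes before the timeout, so no alert.
startStuckTimer(50);
clearStuckTimer();
console.log(stuckTimer === null, stuckAlertFired); // true false
```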
```typescript
  },
  logger,
).catch((err: Error) =>
  logger.error(`[monitoring] Failed to send reconnection storm alert: ${err}`),
```
question: Will the err string interpolation use err.message or [object Object]?
It would call Error.prototype.toString(), which produces Error: &lt;message&gt;, but I've changed it to use err.message directly
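For reference, the interpolation behavior in question (standard Error semantics, not project code):

```typescript
// Template literals invoke Error.prototype.toString(), which yields
// "Error: <message>" rather than "[object Object]".
const err = new Error('connection refused');

console.log(`${err}`);    // Error: connection refused
console.log(err.message); // connection refused
```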
```typescript
const windowStart = now - stormWindowMs;
reconnectionTimestamps = reconnectionTimestamps.filter(t => t >= windowStart);

if (reconnectionTimestamps.length >= stormThreshold) {
```
issue (blocking): A reconnection storm will spam the alerting system.
If we enter a reconnection storm, each new RECONNECTING event will trigger a new alert, which will spam the alerting system.
We could keep a stormAlertLastFiredAt timestamp and only fire a new alert if we have not fired one for a minute.
Nice suggestion, thanks
Done
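A minimal sketch of the cooldown that was suggested (the one-minute window and the stormAlertLastFiredAt name follow the review comment; they are illustrative, not the merged code):

```typescript
// Suppress repeat storm alerts: only fire again once the cooldown window
// has elapsed since the last fired alert.
const STORM_ALERT_COOLDOWN_MS = 60_000; // one minute, per the suggestion
let stormAlertLastFiredAt = 0;

function shouldFireStormAlert(nowMs: number): boolean {
  if (nowMs - stormAlertLastFiredAt < STORM_ALERT_COOLDOWN_MS) {
    return false; // still cooling down: skip the alert
  }
  stormAlertLastFiredAt = nowMs;
  return true;
}

console.log(shouldFireStormAlert(100_000)); // true  (first alert)
console.log(shouldFireStormAlert(130_000)); // false (within cooldown)
console.log(shouldFireStormAlert(170_000)); // true  (cooldown elapsed)
```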
packages/daemon/src/delays/index.ts
Outdated
```typescript
  const { ACK_TIMEOUT_MS } = getConfig();
  return ACK_TIMEOUT_MS;
};
```

```typescript
  'sendMonitoringReconnecting',
  'sendMonitoringDisconnected',
```
question: Are we sure these 2 actions should be triggered on the same state?
Yes, it's disconnected and reconnecting
```diff
   target: `#${SYNC_MACHINE_STATES.ERROR}`,
 }, {
-  actions: ['storeEvent', 'sendAck'],
+  actions: ['storeEvent', 'sendAck', 'sendMonitoringEventReceived'],
```
question: Do we have a better way to trigger sendMonitoringEventReceived on required transitions?
Adding it on every transition may become an issue if we add more events in the future and forget to include this action.
Unfortunately not! In XState v4, there is no clean mechanism to intercept all transitions through a particular state without adding the action explicitly per-transition
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
Adds a MonitoringActor and integrates monitoring events into the SyncMachine: idle detection, stuck-processing detection, and reconnection-storm detection with alert emission, config, types, actions, tests, Jest setup, and a final-state process exit handler.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant SyncMachine
    participant MonitoringActor
    participant AlertSystem as Alert System
    participant Logger
    SyncMachine->>MonitoringActor: MONITORING_EVENT (CONNECTED)
    activate MonitoringActor
    MonitoringActor->>MonitoringActor: Start idle timer (IDLE_EVENT_TIMEOUT_MS)
    rect rgba(100, 200, 100, 0.5)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (EVENT_RECEIVED)
        MonitoringActor->>MonitoringActor: Reset idle timer
    end
    rect rgba(200, 100, 100, 0.5)
        MonitoringActor->>MonitoringActor: Idle timer expires
        MonitoringActor->>AlertSystem: addAlert (MAJOR - idle)
        AlertSystem-->>MonitoringActor: Alert acknowledged
        MonitoringActor->>Logger: Log idle alert
        MonitoringActor->>SyncMachine: Emit MONITORING_IDLE_TIMEOUT
    end
    rect rgba(200, 150, 100, 0.5)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (PROCESSING_STARTED)
        MonitoringActor->>MonitoringActor: Start stuck-processing timer (STUCK_PROCESSING_TIMEOUT_MS)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (PROCESSING_COMPLETED)
        MonitoringActor->>MonitoringActor: Clear stuck-processing timer
    end
    rect rgba(150, 100, 200, 0.5)
        SyncMachine->>MonitoringActor: MONITORING_EVENT (RECONNECTING)
        MonitoringActor->>MonitoringActor: Record reconnect timestamp
        MonitoringActor->>MonitoringActor: Check timestamps within RECONNECTION_STORM_WINDOW_MS
        alt Count >= RECONNECTION_STORM_THRESHOLD
            MonitoringActor->>AlertSystem: addAlert (MAJOR - reconnection storm)
            AlertSystem-->>MonitoringActor: Alert acknowledged
            MonitoringActor->>Logger: Log storm alert (cooldown enforced)
        end
    end
    deactivate MonitoringActor
```

Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
🧹 Nitpick comments (6)
packages/daemon/__tests__/integration/balances.test.ts (1)
137-155: Per-test mocks are missing monitoring configuration keys. The per-test `getConfig.mockReturnValue` calls completely override the global mock but don't include the new monitoring configuration keys (`IDLE_EVENT_TIMEOUT_MS`, `STUCK_PROCESSING_TIMEOUT_MS`, `RECONNECTION_STORM_THRESHOLD`, `RECONNECTION_STORM_WINDOW_MS`). This inconsistency could cause issues if `MonitoringActor` or related code expects these values to be defined.

Consider adding the monitoring config keys to per-test mocks for consistency, or extract the common config into a shared object that can be spread into each mock.
♻️ Proposed approach
Extract common config values and spread them:
```typescript
const baseConfig = {
  NETWORK: 'testnet',
  SERVICE_NAME: 'daemon-test',
  CONSOLE_LEVEL: 'debug',
  TX_CACHE_SIZE: 100,
  BLOCK_REWARD_LOCK: 300,
  FULLNODE_PEER_ID: 'simulator_peer_id',
  STREAM_ID: 'simulator_stream_id',
  FULLNODE_NETWORK: 'unittests',
  USE_SSL: false,
  DB_ENDPOINT, DB_NAME, DB_USER, DB_PASS, DB_PORT,
  ACK_TIMEOUT_MS: 300000,
  IDLE_EVENT_TIMEOUT_MS: 5 * 60 * 1000,
  STUCK_PROCESSING_TIMEOUT_MS: 5 * 60 * 1000,
  RECONNECTION_STORM_THRESHOLD: 10,
  RECONNECTION_STORM_WINDOW_MS: 5 * 60 * 1000,
};

// Then in each test:
getConfig.mockReturnValue({
  ...baseConfig,
  FULLNODE_HOST: `127.0.0.1:${UNVOIDED_SCENARIO_PORT}`,
});
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/daemon/__tests__/integration/balances.test.ts` around lines 137 - 155, Per-test mocks calling getConfig.mockReturnValue override the global config but omit monitoring keys used by MonitoringActor (IDLE_EVENT_TIMEOUT_MS, STUCK_PROCESSING_TIMEOUT_MS, RECONNECTION_STORM_THRESHOLD, RECONNECTION_STORM_WINDOW_MS); update each per-test getConfig.mockReturnValue to include these monitoring keys (with suitable values) or extract a shared baseConfig object containing the common fields plus those monitoring keys and spread that into each test-specific mock (e.g., keep FULLNODE_HOST overrides but spread ...baseConfig) so MonitoringActor and related code always see the required settings.

packages/daemon/src/actors/MonitoringActor.ts (2)
47-50: Harden monitoring thresholds against invalid numeric config. Using `??` alone accepts 0, negative values, and NaN, which can create pathological timer behavior. Normalize to positive integers before arming timers.

🛡️ Proposed hardening
```diff
 export default (callback: any, receive: any, config = getConfig()) => {
   logger.info('Starting monitoring actor');
-  const idleTimeoutMs = config.IDLE_EVENT_TIMEOUT_MS ?? DEFAULT_IDLE_EVENT_TIMEOUT_MS;
-  const stuckTimeoutMs = config.STUCK_PROCESSING_TIMEOUT_MS ?? DEFAULT_STUCK_PROCESSING_TIMEOUT_MS;
-  const stormThreshold = config.RECONNECTION_STORM_THRESHOLD ?? DEFAULT_RECONNECTION_STORM_THRESHOLD;
-  const stormWindowMs = config.RECONNECTION_STORM_WINDOW_MS ?? DEFAULT_RECONNECTION_STORM_WINDOW_MS;
+  const asPositiveInt = (value: unknown, fallback: number) => {
+    const n = Number(value);
+    return Number.isFinite(n) && n > 0 ? Math.floor(n) : fallback;
+  };
+
+  const idleTimeoutMs = asPositiveInt(config.IDLE_EVENT_TIMEOUT_MS, DEFAULT_IDLE_EVENT_TIMEOUT_MS);
+  const stuckTimeoutMs = asPositiveInt(config.STUCK_PROCESSING_TIMEOUT_MS, DEFAULT_STUCK_PROCESSING_TIMEOUT_MS);
+  const stormThreshold = asPositiveInt(config.RECONNECTION_STORM_THRESHOLD, DEFAULT_RECONNECTION_STORM_THRESHOLD);
+  const stormWindowMs = asPositiveInt(config.RECONNECTION_STORM_WINDOW_MS, DEFAULT_RECONNECTION_STORM_WINDOW_MS);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/daemon/src/actors/MonitoringActor.ts` around lines 47 - 50, The config values idleTimeoutMs, stuckTimeoutMs, stormThreshold, and stormWindowMs are currently chosen with the nullish coalescing operator which still accepts 0, negatives, and NaN; change the assignment logic to coerce each config value to a Number (or parseInt), validate that it is a finite positive integer (for timeouts/ window use >0, for stormThreshold use >=1), and if validation fails fall back to the corresponding DEFAULT_* constant so timers and thresholds never receive 0, negative, or NaN values; update the code that initializes idleTimeoutMs, stuckTimeoutMs, stormThreshold, and stormWindowMs accordingly.
167-197: Log unknown monitoring sub-events explicitly. Unexpected `MONITORING_EVENT` payload types are currently ignored silently; adding a default warning improves diagnosability.

🧭 Suggested improvement
```diff
 switch (event.event.type) {
   case 'CONNECTED':
     logger.info('[monitoring] WebSocket connected — starting idle-event timer');
     isConnected = true;
     startIdleCheck();
     break;
   ...
   case 'RECONNECTING':
     trackReconnection();
     break;
+  default:
+    logger.warn('[monitoring] Unexpected monitoring event subtype received by MonitoringActor');
+    break;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/daemon/src/actors/MonitoringActor.ts` around lines 167 - 197, In MonitoringActor (the switch handling event.event.type), add a default branch for the switch over event.event.type that logs a warning including the unexpected type and the full event payload (e.g., logger.warn(`[monitoring] Unknown MONITORING_EVENT sub-type: ${event.event.type}`, { event })) so unknown monitoring sub-events are not silently ignored; keep the default branch side-effect free (do not change isConnected/timers).

packages/daemon/__tests__/actors/MonitoringActor.test.ts (1)
95-97: Extract a shared microtask flush helper.
`await Promise.resolve()` is repeated across async timer tests; centralizing it would make intent clearer and reduce repetition.

♻️ Suggested cleanup
```diff
+const flushMicrotasks = async () => {
+  await Promise.resolve();
+  await Promise.resolve();
+};
+
 ...
-  await Promise.resolve();
-  await Promise.resolve(); // flush the .finally() microtask
+  await flushMicrotasks();
```

Also applies to: 123-124, 138-139, 150-151, 186-187
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts` around lines 95 - 97, The tests in MonitoringActor.test.ts repeat "await Promise.resolve()" to flush microtasks after jest.advanceTimersByTime; extract a small helper (e.g., flushMicrotasks or flushPromises) and replace the duplicated await Promise.resolve() calls in the tests (the occurrences around jest.advanceTimersByTime and the other noted spots) with a single await flushMicrotasks() call to clarify intent and reduce repetition; implement the helper so it performs the same microtask flush behavior used currently and update all instances (including the mentioned repeated lines) to call it.

packages/daemon/src/actions/index.ts (1)
217-263: Reduce duplicated monitoring action creators with a small factory.All
sendMonitoring*actions share the samechoose/sendTostructure and differ only by subtype.♻️ Suggested consolidation
```diff
+const sendMonitoringEvent = (monitoringEventType: string) => choose([{
+  cond: monitoringIsPresent,
+  actions: sendTo(getMonitoringRefFromContext, {
+    type: EventTypes.MONITORING_EVENT,
+    event: { type: monitoringEventType },
+  }),
+}]);
+
-export const sendMonitoringConnected = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'CONNECTED' } }),
-}]);
+export const sendMonitoringConnected = sendMonitoringEvent('CONNECTED');
 ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/daemon/src/actions/index.ts` around lines 217 - 263, All sendMonitoring* exports duplicate the same choose/sendTo pattern; create a small factory function (e.g., makeSendMonitoringEvent) that accepts the event subtype string (like 'CONNECTED', 'DISCONNECTED', etc.) and returns choose([{ cond: monitoringIsPresent, actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: <subtype> } }) }]); then replace sendMonitoringConnected, sendMonitoringDisconnected, sendMonitoringEventReceived, sendMonitoringReconnecting, sendMonitoringProcessingStarted, and sendMonitoringProcessingCompleted with calls to this factory. Keep references to choose, sendTo, monitoringIsPresent, getMonitoringRefFromContext, and EventTypes.MONITORING_EVENT to match the existing behavior.

packages/daemon/src/machines/SyncMachine.ts (1)
213-373: Consider centralizing repeated processing lifecycle hooks.
`entry: sendMonitoringProcessingStarted` / `exit: sendMonitoringProcessingCompleted` is repeated across many states; a shared constant/helper would reduce future drift.

♻️ Suggested pattern
```diff
+const MONITORING_PROCESSING_LIFECYCLE = {
+  entry: ['sendMonitoringProcessingStarted'],
+  exit: ['sendMonitoringProcessingCompleted'],
+} as const;
+
 [CONNECTED_STATES.handlingUnhandledEvent]: {
   id: CONNECTED_STATES.handlingUnhandledEvent,
-  entry: ['sendMonitoringProcessingStarted'],
-  exit: ['sendMonitoringProcessingCompleted'],
+  ...MONITORING_PROCESSING_LIFECYCLE,
   invoke: {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/daemon/src/machines/SyncMachine.ts` around lines 213 - 373, Create a single shared lifecycle constant (e.g. PROCESSING_LIFECYCLE = { entry: ['sendMonitoringProcessingStarted'], exit: ['sendMonitoringProcessingCompleted'] }) and replace the repeated entry/exit objects by spreading that constant into each state that currently repeats them (for example the state objects named handlingMetadataChanged, handlingVertexAccepted, handlingVertexRemoved, handlingVoidedTx, handlingUnvoidedTx, handlingFirstBlock, handlingNcExecVoided, handlingReorgStarted, handlingTokenCreated, checkingForMissedEvents and any other CONNECTED_STATES entries); if a state already has other entry/exit actions, merge arrays when spreading. This centralizes the hooks while keeping existing invoke/onDone/onError configs (functions like handleVertexAccepted, handleVoidedTx, metadataDiff, updateLastSyncedEvent, etc. remain unchanged).
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- packages/daemon/__tests__/actors/MonitoringActor.test.ts
- packages/daemon/__tests__/integration/balances.test.ts
- packages/daemon/__tests__/integration/token_creation.test.ts
- packages/daemon/jestIntegrationSetup.ts
- packages/daemon/jest_integration.config.js
- packages/daemon/src/actions/index.ts
- packages/daemon/src/actors/MonitoringActor.ts
- packages/daemon/src/actors/index.ts
- packages/daemon/src/config.ts
- packages/daemon/src/index.ts
- packages/daemon/src/machines/SyncMachine.ts
- packages/daemon/src/types/event.ts
- packages/daemon/src/types/machine.ts
- Replace process.exit with MONITORING_IDLE_TIMEOUT callback event
- Handle MONITORING_IDLE_TIMEOUT in SyncMachine (-> ERROR state)
- Add machine.onDone shutdown path in index.ts
- Wrap monitoring send-actions in choose guard (no-op if actor absent)
- Add reconnection storm alert cooldown (1 min) to prevent spam
- Halve idle check interval to reduce worst-case detection lag to 1.5x
- Fix stuck-processing doc comment (MAJOR alert, machine keeps running)
- Fix err.message interpolation in alert catch handlers
597a0da to 53aba8e
🧹 Nitpick comments (3)
packages/daemon/__tests__/actors/MonitoringActor.test.ts (2)
43-62: Prefer restoring spies in teardown, not only clearing calls.
`clearAllMocks` resets history but leaves spy wrappers active. Restoring makes this suite safer and more isolated.

Proposed teardown hardening
```diff
+  afterEach(() => {
+    processExitSpy.mockRestore();
+  });
+
   afterAll(() => {
-    jest.clearAllMocks();
+    jest.restoreAllMocks();
     jest.useRealTimers();
   });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts` around lines 43 - 62, The test teardown only clears mocks and timers but doesn't restore spy wrappers—ensure you restore any spies to avoid leaking instrumentation: add a restore call for processExitSpy (the jest.spyOn(process, 'exit') created in beforeEach) in the teardown (afterAll or afterEach) so process.exit is returned to its original implementation; similarly restore any other spies if/when added to this suite to fully undo spy wrappers.
91-103: Assert alert severity explicitly in alert-focused tests. These tests validate title/callback, but not the severity level. Adding severity assertions would lock in the MAJOR-alert requirement.
Example assertions to add
```diff
 expect(mockAddAlert).toHaveBeenCalledTimes(1);
 expect(mockAddAlert.mock.calls[0][0]).toBe('Daemon Idle — No Events Received');
+expect(mockAddAlert.mock.calls[0][2]).toBe('MAJOR');
 expect(mockCallback).toHaveBeenCalledWith(MONITORING_IDLE_TIMEOUT_EVENT);
```

```diff
 expect(mockAddAlert).toHaveBeenCalledTimes(1);
 expect(mockAddAlert.mock.calls[0][0]).toBe('Daemon Stuck In Processing State');
+expect(mockAddAlert.mock.calls[0][2]).toBe('MAJOR');
```

```diff
 expect(mockAddAlert).toHaveBeenCalledTimes(1);
 expect(mockAddAlert.mock.calls[0][0]).toBe('Daemon Reconnection Storm');
+expect(mockAddAlert.mock.calls[0][2]).toBe('MAJOR');
```

Also applies to: 181-193, 236-246
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/daemon/__tests__/actors/MonitoringActor.test.ts` around lines 91 - 103, These alert-focused tests (calling MonitoringActor and asserting mockAddAlert/mockCallback) must also assert the alert severity; for each test that currently checks the title/callback (e.g., the test invoking MonitoringActor and asserting mockAddAlert.mock.calls[0][0] is the title and mockCallback is called), add an assertion that mockAddAlert.mock.calls[0][1] equals the required MAJOR severity—use the project constant (e.g., MONITORING_ALERT_SEVERITY.MAJOR) if available, otherwise assert the string 'MAJOR'; apply the same addition to the other alert tests referenced in the comment so the severity is explicitly verified.

packages/daemon/src/actions/index.ts (1)
217-263: Consider a small factory helper to remove repeated monitoring action boilerplate. The six `choose + sendTo` blocks are near-identical; extracting a helper would reduce duplication and future drift.

Example dedup refactor
```diff
+const createMonitoringSender = (type: string) => choose([{
+  cond: monitoringIsPresent,
+  actions: sendTo(getMonitoringRefFromContext, {
+    type: EventTypes.MONITORING_EVENT,
+    event: { type },
+  }),
+}]);
+
-export const sendMonitoringConnected = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'CONNECTED' } }),
-}]);
+export const sendMonitoringConnected = createMonitoringSender('CONNECTED');
-export const sendMonitoringDisconnected = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'DISCONNECTED' } }),
-}]);
+export const sendMonitoringDisconnected = createMonitoringSender('DISCONNECTED');
-export const sendMonitoringEventReceived = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'EVENT_RECEIVED' } }),
-}]);
+export const sendMonitoringEventReceived = createMonitoringSender('EVENT_RECEIVED');
-export const sendMonitoringReconnecting = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'RECONNECTING' } }),
-}]);
+export const sendMonitoringReconnecting = createMonitoringSender('RECONNECTING');
-export const sendMonitoringProcessingStarted = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'PROCESSING_STARTED' } }),
-}]);
+export const sendMonitoringProcessingStarted = createMonitoringSender('PROCESSING_STARTED');
-export const sendMonitoringProcessingCompleted = choose([{
-  cond: monitoringIsPresent,
-  actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { type: 'PROCESSING_COMPLETED' } }),
-}]);
+export const sendMonitoringProcessingCompleted = createMonitoringSender('PROCESSING_COMPLETED');
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/daemon/src/actions/index.ts` around lines 217 - 263, Create a small factory that builds the repeated choose+sendTo block and replace the six exports with calls to it: implement a helper (e.g., makeMonitoringSender or monitoringEventSender) that accepts the event name or event object and returns choose([{ cond: monitoringIsPresent, actions: sendTo(getMonitoringRefFromContext, { type: EventTypes.MONITORING_EVENT, event: { ... } }) }]); then replace sendMonitoringConnected, sendMonitoringDisconnected, sendMonitoringEventReceived, sendMonitoringReconnecting, sendMonitoringProcessingStarted, and sendMonitoringProcessingCompleted with calls to that helper using the respective event types; keep using the existing symbols monitoringIsPresent, getMonitoringRefFromContext, EventTypes.MONITORING_EVENT, sendTo, and choose.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- packages/daemon/__tests__/actors/MonitoringActor.test.ts
- packages/daemon/src/actions/index.ts
- packages/daemon/src/actors/MonitoringActor.ts
- packages/daemon/src/index.ts
- packages/daemon/src/machines/SyncMachine.ts
- packages/daemon/src/types/event.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/daemon/src/actors/MonitoringActor.ts
- packages/daemon/src/machines/SyncMachine.ts
c3193f0 to 7044b7a
- restore processExitSpy after each test to prevent spy leakage
- add Severity.MAJOR assertions to alert tests
- replace repeated choose+sendTo blocks with makeMonitoringSender factory
Summary
- A MonitoringActor that centralises runtime health checks for the sync state machine.
- New configuration keys: IDLE_EVENT_TIMEOUT_MS, STUCK_PROCESSING_TIMEOUT_MS, RECONNECTION_STORM_THRESHOLD, RECONNECTION_STORM_WINDOW_MS.
- Idle timeout terminates the daemon via process.exit(1) (Kubernetes restarts)

The stuck-processing check intentionally does not force a reconnection — a long-running handler (e.g. a large reorg) should be allowed to finish. Operators are alerted so they can investigate if needed.
Acceptance criteria
- MonitoringActor unit tests pass; existing tests unaffected.

Summary by CodeRabbit
New Features
Tests