@Apollon77

ISSUE:
The getInteractionClient() method had a check-then-create pattern with two
await points between the initial check and client creation. This created a
race condition in JavaScript's event loop:

1. Call A checks #clients map -> not found
2. Call B checks #clients map -> not found
3. Call A awaits nodeStore?.construction (yields to event loop)
4. Call B awaits nodeStore?.construction (yields to event loop)
5. Call A resumes, creates InteractionClient, stores in map
6. Call B resumes, creates DIFFERENT InteractionClient, overwrites A's client

Result: Multiple InteractionClient instances created for the same peer address,
causing resource leaks and potential protocol issues.
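The steps above can be reproduced as a standalone sketch. Class and function names here are illustrative placeholders, not the real matter.js API:

```typescript
// Minimal reproduction: both callers miss the cache, both yield at an
// await, and both construct a client for the same peer.
class Peer {}
class InteractionClient {
    constructor(readonly peer: Peer) {}
}

const clients = new Map<Peer, InteractionClient>();
let constructed = 0;

async function getInteractionClientRacy(peer: Peer): Promise<InteractionClient> {
    const existing = clients.get(peer);
    if (existing !== undefined) return existing; // both callers pass this check
    await Promise.resolve(); // stands in for `await nodeStore?.construction`
    const client = new InteractionClient(peer);
    constructed++;
    clients.set(peer, client); // the second caller overwrites the first
    return client;
}

async function main() {
    const peer = new Peer();
    const [a, b] = await Promise.all([
        getInteractionClientRacy(peer),
        getInteractionClientRacy(peer),
    ]);
    console.log(constructed, a === b); // prints "2 false": two clients for one peer
}
main();
```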

SOLUTION:
Implement the double-checked locking pattern: after the async operations
complete, check the map again before creating a new client. If another
concurrent call already created the client during our await, use that
instance instead of creating a duplicate.

This is the standard pattern for async singleton creation in JavaScript's
single-threaded event loop model.
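A minimal sketch of the pattern, with illustrative names standing in for the real matter.js code:

```typescript
// Double-checked lookup: re-check the cache after the await so a
// concurrent caller's client is reused instead of overwritten.
class Peer {}
class InteractionClient {
    constructor(readonly peer: Peer) {}
}

const clients = new Map<Peer, InteractionClient>();

async function getInteractionClient(peer: Peer): Promise<InteractionClient> {
    const existing = clients.get(peer);
    if (existing !== undefined) return existing;

    await Promise.resolve(); // stands in for `await nodeStore?.construction`

    // Second check: another caller may have created the client while we
    // were suspended at the await above.
    const raced = clients.get(peer);
    if (raced !== undefined) return raced;

    const client = new InteractionClient(peer);
    clients.set(peer, client);
    return client;
}

async function main() {
    const peer = new Peer();
    const [a, b] = await Promise.all([getInteractionClient(peer), getInteractionClient(peer)]);
    console.log(a === b); // prints "true": both callers share one instance
}
main();
```

No mutex is needed: because JavaScript runs callbacks to completion, the check and the `Map.set` between awaits are already atomic with respect to other callers.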

File: packages/protocol/src/interaction/InteractionClient.ts:140-168

ISSUE:
The persistFabrics() method had an async race condition where fabric state
could change between persisting the fabric list and the fabric index:

1. Method captures fabric list: Array.from(this.#fabrics.values())
2. Starts persisting to storage (async operation)
3. During storage.set() promise, event loop yields control
4. Another operation adds/removes a fabric, increments #nextFabricIndex
5. First storage.set() completes, chains to second operation
6. Persists nextFabricIndex with the NEW value, while the fabric list has the OLD value

Result: Storage contains inconsistent state - e.g., 3 fabrics stored but
nextFabricIndex = 5. On restart, fabric index allocation could conflict or
skip indices incorrectly.

EXAMPLE SCENARIO:
- State: fabrics=[1,2,3], nextIndex=4
- persistFabrics() captures fabrics=[1,2,3], starts async persist
- During await: addFabric(4) called, nextIndex becomes 5
- Persist completes: fabrics=[1,2,3] but nextIndex=5 (inconsistent!)

SOLUTION:
Capture both fabricConfigs and nextFabricIndex values synchronously before
any async operations. This creates an atomic snapshot of related state that
remains consistent even if concurrent operations modify the live state
during the async persistence operations.

This ensures that restored state will always be self-consistent, even if
it's slightly outdated due to operations that happened after the snapshot.
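The snapshot pattern can be sketched as follows. The storage interface and class shape are illustrative, not the real FabricManager API:

```typescript
// Snapshot-before-await: both values are read synchronously, so they are
// always mutually consistent even if the live state changes mid-persist.
interface FabricConfig {
    index: number;
}

class FabricManagerSketch {
    #fabrics = new Map<number, FabricConfig>();
    #nextFabricIndex = 1;
    readonly stored = new Map<string, unknown>(); // stands in for real storage

    addFabric(): void {
        const index = this.#nextFabricIndex++;
        this.#fabrics.set(index, { index });
    }

    async persistFabrics(): Promise<void> {
        // Atomic snapshot: no await occurs between these two reads.
        const fabricConfigs = Array.from(this.#fabrics.values());
        const nextFabricIndex = this.#nextFabricIndex;
        await this.#set("fabrics", fabricConfigs);
        await this.#set("nextFabricIndex", nextFabricIndex);
    }

    async #set(key: string, value: unknown): Promise<void> {
        await new Promise(resolve => setTimeout(resolve, 0)); // simulated async storage
        this.stored.set(key, value);
    }
}

async function main() {
    const manager = new FabricManagerSketch();
    for (let i = 0; i < 3; i++) manager.addFabric(); // fabrics 1..3, nextIndex 4
    const persisting = manager.persistFabrics(); // snapshot taken synchronously
    manager.addFabric(); // concurrent mutation during the persist
    await persisting;
    const fabrics = manager.stored.get("fabrics") as FabricConfig[];
    console.log(fabrics.length, manager.stored.get("nextFabricIndex")); // prints "3 4"
}
main();
```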

File: packages/protocol/src/fabric/FabricManager.ts:141-159

ISSUE:
The destroy() method had multiple await points before setting the #isClosing
flag, creating windows where event processing could corrupt session state:

1. destroy() called, immediately awaits clearSubscriptions() (line 341)
2. During this await, event loop processes incoming messages
3. New subscriptions could be added to the session
4. Messages could trigger other operations on the session
5. Only AFTER clearSubscriptions() completes does #isClosing get set (line 351)

Result: New subscriptions/operations accepted during teardown are never
properly cleaned up, causing memory leaks and potential protocol violations.

EXAMPLE SCENARIO:
- destroy() called, starts clearing subscriptions
- During await: incoming message creates new subscription
- clearSubscriptions() completes (doesn't clear the new one)
- #isClosing set to true
- destroyed.emit() fires
- New subscription never cleaned up -> memory leak

CODE CHECKING isClosing:
- SessionManager.ts:327 - skips operations if session.isClosing
- InteractionServer.ts:178 - blocks new activity if isClosing
Setting the flag early ensures these checks protect against new operations.

SOLUTION:
Set #isClosing = true immediately at the start of destroy() (before any
awaits) when closeAfterExchangeFinished is false. This ensures any code
checking isClosing will reject new operations before async teardown begins.

Note: When closeAfterExchangeFinished=true, we use #closingAfterExchangeFinished
flag instead, which serves a similar but delayed purpose.
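The early-flag ordering can be sketched as follows. Names are illustrative, and the closeAfterExchangeFinished branch is omitted:

```typescript
// Set the closing flag synchronously, before the first await, so any
// code that checks it rejects new work during teardown.
class SessionSketch {
    #isClosing = false;
    #subscriptions = new Set<number>();

    get isClosing(): boolean {
        return this.#isClosing;
    }

    addSubscription(id: number): boolean {
        if (this.#isClosing) return false; // mirrors the isClosing guards
        this.#subscriptions.add(id);
        return true;
    }

    async destroy(): Promise<void> {
        this.#isClosing = true; // before any await: no window for new work
        await this.#clearSubscriptions();
    }

    async #clearSubscriptions(): Promise<void> {
        await Promise.resolve(); // stands in for async per-subscription teardown
        this.#subscriptions.clear();
    }
}

async function main() {
    const session = new SessionSketch();
    session.addSubscription(1);
    const destroying = session.destroy();
    // Arrives "during" the destroy await; rejected because isClosing is set.
    console.log(session.addSubscription(2)); // prints "false"
    await destroying;
}
main();
```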

File: packages/protocol/src/session/NodeSession.ts:340-372

ISSUE:
The BTP codec had mismatched array indices between encoding and decoding,
causing version numbers to be swapped during BLE handshake negotiation.

ENCODING (lines 93-96):
  writer.writeUInt8((versions[1] << 4) | versions[0]);
  -> versions[1] goes to HIGH nibble (bits 7-4)
  -> versions[0] goes to LOW nibble (bits 3-0)

DECODING (original lines 188-201):
  ver[0] = (version & 0xf0) >> 4;  -> HIGH nibble to ver[0]
  ver[1] = version & 0x0f;         -> LOW nibble to ver[1]

ROUND-TRIP CORRUPTION:
  Input:  versions = [4, 3, 2, 1, 0, 0, 0, 0]
  Encode: byte 0 = (3 << 4) | 4 = 0x34
  Decode: ver[0] = 3, ver[1] = 4  -> SWAPPED!
  Result: versions = [3, 4, 2, 1, 0, 0, 0, 0]  -> WRONG

This breaks BLE version negotiation as the device and controller would
disagree on supported protocol versions after handshake.

SOLUTION:
Match the decoding indices to the encoding pattern:
  ver[1] = HIGH nibble (matches versions[1] from encoding)
  ver[0] = LOW nibble (matches versions[0] from encoding)

Now round-trip preserves the version array correctly.

VERIFICATION:
  Input:  versions = [4, 3, 2, 1, 0, 0, 0, 0]
  Encode: byte 0 = (3 << 4) | 4 = 0x34
  Decode: ver[1] = 3, ver[0] = 4
  Result: versions = [4, 3, 2, 1, 0, 0, 0, 0]  -> CORRECT
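The packing can be sketched as a standalone round trip. Writer/reader details are elided, and packing all four byte pairs generalizes from the first encoded byte shown above:

```typescript
// Two versions per byte: the odd index goes in the high nibble and the
// even index in the low nibble; the decoder mirrors the same layout.
function encodeVersions(versions: number[]): number[] {
    const bytes: number[] = [];
    for (let i = 0; i < versions.length; i += 2) {
        bytes.push(((versions[i + 1] & 0x0f) << 4) | (versions[i] & 0x0f));
    }
    return bytes;
}

function decodeVersions(bytes: number[]): number[] {
    const versions: number[] = [];
    for (const byte of bytes) {
        versions.push(byte & 0x0f);        // low nibble -> even index
        versions.push((byte & 0xf0) >> 4); // high nibble -> odd index
    }
    return versions;
}

const input = [4, 3, 2, 1, 0, 0, 0, 0];
const encoded = encodeVersions(input);
console.log(encoded[0].toString(16)); // prints "34": (3 << 4) | 4
console.log(decodeVersions(encoded)); // round trip preserves the array
```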

File: packages/protocol/src/codec/BtpCodec.ts:187-201

Added three documentation files summarizing code analysis results:

1. ANALYSIS-FIXES-APPLIED.md
   - Documents 4 critical fixes applied to the codebase
   - Provides detailed explanation of each issue and solution
   - Includes testing recommendations and impact assessment

2. ANALYSIS-RESOURCE-LEAKS.md
   - Catalogs 11 resource leak and memory management issues
   - Prioritized as HIGH (4), MEDIUM (5), Investigation (2)
   - Includes specific file locations and recommendations

3. ANALYSIS-INVESTIGATED-NON-ISSUES.md
   - Documents items that were investigated but are not bugs
   - Includes 2 confirmed correct behaviors (MDNS discriminator, window covering null)
   - Notes 1 item needing Matter spec clarification (message counter rollover)
   - Lists 5 false positives from initial analysis (safe due to JS single-threaded model)

These documents provide a comprehensive record of the code analysis process
and serve as a reference for future development and maintenance.