Commit 09d55de

protocol and client specs
1 parent 35d4065 commit 09d55de

9 files changed

+270
-0
lines changed

spec/TOPICS.md

Lines changed: 6 additions & 0 deletions
@@ -12,4 +12,10 @@
1212

1313
- **Certificate chain trust model**: ChainCertificates (Shared.hs) defines 0–4 cert chain semantics, used by both Client.hs (validateCertificateChain) and Server.hs (validateClientCertificate, SNI credential switching). The 4-length case skipping index 2 (operator cert) and the FQHN-disabled x509validate are decisions that span the entire transport security model.
1414

15+
- **SMP proxy protocol flow**: The PRXY/PFWD/RFWD proxy protocol involves Client.hs (proxySMPCommand with 10 error scenarios, forwardSMPTransmission with sessionSecret encryption), Protocol.hs (command types, version-dependent encoding), Transport.hs (proxiedSMPRelayVersion cap, proxyServer flag disabling block encryption). The double encryption (client-relay via PFWD + proxy-relay via RFWD), combined timeout (tcpConnect + tcpTimeout), nonce/reverseNonce pairing, and version downgrade logic are not visible from any single module.
16+
17+
- **Service certificate subscription model**: Service subscriptions (SUBS/NSUBS) and per-queue subscriptions (SUB/NSUB) coexist with complex state transitions. Client/Agent.hs manages dual active/pending subscription maps with session-aware cleanup. Protocol.hs defines useServiceAuth (only NEW/SUB/NSUB). Client.hs implements authTransmission with dual signing (entity key over cert hash + transmission, service key over transmission only). Transport.hs handles the service certificate handshake extension (v16+). The full subscription lifecycle — from DBService credentials through handshake to service subscription to disconnect/reconnect — spans all four modules.
18+
19+
- **Two agent layers**: Client/Agent.hs ("small agent") is used only in servers — SMP proxy and notification server — to manage client connections to other SMP servers. Agent.hs + Agent/Client.hs ("big agent") is used in client applications. Both manage SMP client connections with subscription tracking and reconnection, but the big agent adds the full messaging agent layer (connections, double ratchet, file transfer). When documenting Agent/Client.hs, Client/Agent.hs should be reviewed for shared patterns and differences.
20+
1521
- **Handshake protocol family**: SMP (Transport.hs), NTF (Notifications/Transport.hs), and XFTP (FileTransfer/Transport.hs) all have handshake protocols with the same structure (version negotiation + session binding + key exchange) but different feature sets. NTF is a strict subset. XFTP doesn't use the TLS handshake at all (HTTP2 layer). The shared types (THandle, THandleParams, THandleAuth) mean changes to the handshake infrastructure affect all three protocols.

spec/modules/README.md

Lines changed: 2 additions & 0 deletions
@@ -74,6 +74,8 @@ Do NOT document:
7474
- **Function-by-function prose that restates the implementation** — "this function takes X and returns Y by doing Z" adds nothing
7575
- **Line numbers** — they're brittle and break on every edit
7676
- **Comments that fit in one line in source** — put those in the source file instead as `-- spec:` comments
77+
- **Verbatim quotes of source comments** — reference them instead: "See comment on `functionName`." Then add only what the comment doesn't cover (cross-module implications, what breaks if violated). If the source comment says everything, the function doesn't need a doc entry.
78+
- **Tables that reproduce code structure** — if the information is self-evident from reading the code's pattern matching or type definitions, it doesn't belong in the doc (e.g., per-command credential requirements, version-conditional encoding branches)
7779

7880
## Format
7981

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
1+
# Simplex.Messaging.Client
2+
3+
> Generic protocol client: connection management, command sending/receiving, batching, proxy protocol, reconnection.
4+
5+
**Source**: [`Client.hs`](../../../../src/Simplex/Messaging/Client.hs)
6+
7+
**Protocol spec**: [`protocol/simplex-messaging.md`](../../../../protocol/simplex-messaging.md) — SimpleX Messaging Protocol.
8+
9+
## Overview
10+
11+
This module implements the client side of the `Protocol` typeclass — connecting to servers, sending commands, receiving responses, and managing connection lifecycle. It is generic over `Protocol v err msg`, instantiated for SMP as `SMPClient` (= `ProtocolClient SMPVersion ErrorType BrokerMsg`). The SMP proxy protocol (PRXY/PFWD/RFWD) is also implemented here.
12+
13+
## Four concurrent threads — teardown semantics
14+
15+
`getProtocolClient` launches four threads via `raceAny_`:
16+
- `send`: reads from `sndQ` (TBQueue) and writes to TLS
17+
- `receive`: reads from TLS and writes to `rcvQ` (TBQueue), updates `lastReceived`
18+
- `process`: reads from `rcvQ` and dispatches to response vars or `msgQ`
19+
- `monitor`: periodic ping loop (only when `smpPingInterval > 0`)
20+
21+
When ANY thread exits (normally or exceptionally), `raceAny_` cancels all others. `E.finally` ensures the `disconnected` callback always fires. Implication: a single stuck thread (e.g., TLS read blocked on a half-open connection) keeps the entire client alive until `monitor` drops it. There is no per-thread health check — liveness depends entirely on the monitor's timeout logic.
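
The teardown behaviour can be illustrated with a minimal sketch that uses only `base`. `raceAnySketch` and `runClientSketch` are hypothetical stand-ins, not the module's `raceAny_` or `getProtocolClient`; the sketch only assumes the semantics stated above (first exit cancels the rest, and the callback always fires).

```haskell
import Control.Concurrent (forkFinally, killThread)
import Control.Concurrent.MVar (newEmptyMVar, takeMVar, tryPutMVar)
import Control.Exception (finally)
import Control.Monad (void)

-- Hypothetical stand-in for raceAny_: run all actions, return as soon as
-- the first one finishes (normally or with an exception), kill the others.
raceAnySketch :: [IO ()] -> IO ()
raceAnySketch actions = do
  done <- newEmptyMVar
  -- tryPutMVar never blocks, so later finishers do not hang on a full MVar.
  tids <- mapM (\a -> forkFinally a (\_ -> void (tryPutMVar done ()))) actions
  takeMVar done `finally` mapM_ killThread tids

-- Mirrors the E.finally wrapper: the callback fires however the race ends.
runClientSketch :: IO () -> [IO ()] -> IO ()
runClientSketch onDisconnected threads =
  raceAnySketch threads `finally` onDisconnected
```

Note what the sketch does not do: nothing inspects a thread that is merely blocked, which is why a stuck thread is only noticed through the monitor's timeout logic.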
22+
23+
## Request lifecycle and leak risk
24+
25+
`mkRequest` inserts a `Request` into the `sentCommands` TMap BEFORE the transmission is written to TLS. If the TLS write fails silently or the connection drops before the response, the entry remains in `sentCommands` until the monitor's timeout counter exceeds `maxCnt` and drops the entire client. There is no per-request cleanup on send failure: an entry is removed only by `processMsg` when its response arrives. A `getResponse` timeout sets `pending = False` but does not remove the entry.
26+
27+
## getResponse — pending flag race contract
28+
29+
This is the core concurrency contract between timeout and response processing:
30+
31+
1. `getResponse` waits with `timeout` for `takeTMVar responseVar`
32+
2. Regardless of result, atomically sets `pending = False` and tries `tryTakeTMVar` again (see comment on `getResponse`)
33+
3. In `processMsg`, when a response arrives for a request where `pending` is already `False` (timeout won), `wasPending` is `False` and the response is forwarded to `msgQ` as `STResponse` rather than discarded
34+
35+
The double-check pattern (`swapTVar pending False` + `tryTakeTMVar`) handles the race window where a response arrives between timeout firing and `pending` being set to `False`. Without this, responses arriving in that gap would be silently lost.
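
A minimal sketch of this contract, assuming a simplified request record with only the two fields that matter here. `RequestSketch`, `getResponseSketch`, `deliverSketch` and `Delivery` are hypothetical names; the real `Request` and `processMsg` carry more context.

```haskell
import Control.Applicative ((<|>))
import Control.Concurrent.STM
import System.Timeout (timeout)

-- Hypothetical, simplified request record.
data RequestSketch r = RequestSketch
  { pending :: TVar Bool
  , responseVar :: TMVar r
  }

newRequestSketch :: IO (RequestSketch r)
newRequestSketch = RequestSketch <$> newTVarIO True <*> newEmptyTMVarIO

-- Caller side: wait with a timeout, then atomically clear `pending` and
-- re-check the response var to catch a response that raced with the timeout.
getResponseSketch :: Int -> RequestSketch r -> IO (Maybe r)
getResponseSketch micros req = do
  r <- timeout micros (atomically (takeTMVar (responseVar req)))
  atomically $ do
    _ <- swapTVar (pending req) False
    late <- tryTakeTMVar (responseVar req)
    pure (r <|> late)

-- Outcome on the processing side.
data Delivery r = Delivered | Expired r
  deriving (Eq, Show)

-- Processing side: hand the response to the waiting caller if the request is
-- still pending; otherwise report it as expired so it can be forwarded.
deliverSketch :: RequestSketch r -> r -> STM (Delivery r)
deliverSketch req resp = do
  wasPending <- readTVar (pending req)
  if wasPending
    then putTMVar (responseVar req) resp >> pure Delivered
    else pure (Expired resp)
```

Because both sides touch `pending` inside STM transactions, a response is either picked up by the second `tryTakeTMVar` or reported as `Expired`; in this sketch it cannot fall into neither case.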
36+
37+
`timeoutErrorCount` is reset to 0 in `getResponse` when a response arrives and in `receive` on every TLS read. The monitor uses this count to decide when to drop the connection.
38+
39+
## processMsg — server events vs expired responses
40+
41+
When `corrId` is empty, the message is an `STEvent` (server-initiated). When non-empty and the request was already expired (`wasPending` is `False`), the response becomes `STResponse` — not discarded, but forwarded to `msgQ` with the original command context. Entity ID mismatch is `STUnexpectedError`.
42+
43+
## nonBlockingWriteTBQueue — fork on full
44+
45+
If `tryWriteTBQueue` returns `False`, a new thread is forked for the blocking write. There is no backpressure mechanism: under sustained overload, thread count grows without bound. This is a deliberate tradeoff: the caller never blocks (preventing deadlock between send and process threads), at the cost of potentially unbounded thread creation.
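
The fork-on-full pattern can be sketched as follows. `tryWriteSketch` is a hypothetical try-write built from `isFullTBQueue`; it stands in for the `tryWriteTBQueue` helper the module uses, whose definition is not shown here.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (unless, void)

-- Hypothetical try-write: returns False instead of blocking on a full queue.
tryWriteSketch :: TBQueue a -> a -> STM Bool
tryWriteSketch q x = do
  full <- isFullTBQueue q
  if full then pure False else writeTBQueue q x >> pure True

-- The caller never blocks; a full queue costs one extra thread per write.
nonBlockingWriteSketch :: TBQueue a -> a -> IO ()
nonBlockingWriteSketch q x = do
  written <- atomically (tryWriteSketch q x)
  unless written $ void $ forkIO $ atomically (writeTBQueue q x)
```

Each forked writer stays parked in `writeTBQueue` until the reader makes room, which is where the unbounded thread growth under sustained overload comes from.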
46+
47+
## Batch commands do not expire
48+
49+
See comment on `sendBatch`. Batched commands are written with `Nothing` as the request parameter — the send thread skips the `pending` flag check. Individual commands use `Just r` and the send thread checks `pending` after dequeue. The coupling: if the server stops responding, batched commands can block the send queue indefinitely since they have no timeout-based expiry.
50+
51+
## monitor — quasi-periodic adaptive ping
52+
53+
The ping loop sleeps for `smpPingInterval`, then checks elapsed time since `lastReceived`. If significant time remains in the interval (> 1 second), it re-sleeps for just the remaining time rather than sending a ping. This means ping frequency adapts to actual receive activity — frequent receives suppress pings.
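
The wake-up decision reduces to a small pure function. This is a sketch under the assumptions stated in the paragraph above; `PingStep`, `pingStep` and the argument names are hypothetical, and all values are in microseconds.

```haskell
-- What the monitor does after waking up.
data PingStep
  = SleepFor Int -- microseconds left in the current interval
  | SendPing
  deriving (Eq, Show)

-- interval: the configured ping interval.
-- sinceLastReceived: time elapsed since the last TLS read.
-- The one-second threshold is taken from the text above.
pingStep :: Int -> Int -> PingStep
pingStep interval sinceLastReceived
  | remaining > 1000000 = SleepFor remaining
  | otherwise = SendPing
  where
    remaining = interval - sinceLastReceived
```

For example, with a 30-second interval and a receive 5 seconds ago, `pingStep 30000000 5000000` yields `SleepFor 25000000`, so no ping is sent.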
54+
55+
Pings are only sent when `sendPings` is `True`, set by `enablePings` (called from `subscribeSMPQueue`, `subscribeSMPQueues`, `subscribeSMPQueueNotifications`, `subscribeSMPQueuesNtfs`, `subscribeService`). The client drops the connection when `maxCnt` commands have timed out in sequence AND at least `recoverWindow` (15 minutes) has passed since the last received response.
56+
57+
## clientCorrId — dual-purpose random values
58+
59+
`clientCorrId` is a `TVar ChaChaDRG` generating random `CbNonce` values that serve as both correlation IDs and nonces for proxy encryption. When a nonce is explicitly passed (e.g., by `createSMPQueue`), it is used instead of generating a random one.
60+
61+
## Proxy command re-parameterization
62+
63+
`proxySMPCommand` constructs modified `thParams` per-request — setting `sessionId`, `peerServerPubKey`, and `thVersion` to the proxy-relay connection's parameters rather than the client-proxy connection's. A single `SMPClient` connection to the proxy carries commands with different auth parameters per destination relay. The encoding, signing, and encryption all use these per-request params, not the connection's original params.
64+
65+
## proxySMPCommand — error classification
66+
67+
See comment above `proxySMPCommand` for the 10 error scenarios (0-9) mapping each combination of success/error at client-proxy and proxy-relay boundaries. Errors from the destination relay wrapped in `PRES` are thrown as `ExceptT` errors (transparent proxy). Errors from the proxy itself are returned as `Left ProxyClientError`.
68+
69+
## forwardSMPTransmission — proxy-side forwarding
70+
71+
Used by the proxy server to forward `RFWD` to the destination relay. Uses `cbEncryptNoPad`/`cbDecryptNoPad` (no padding) with the session secret from the proxy-relay connection. Response nonce is `reverseNonce` of the request nonce.
72+
73+
## authTransmission — dual auth with service signature
74+
75+
When `useServiceAuth` is `True` and a service certificate is present, the entity key signs over `serviceCertHash <> transmission` (not just the transmission) — see comment on `authTransmission`. The service key only signs the transmission itself. For X25519 keys, `cbAuthenticate` produces a `TAAuthenticator`; for Ed25519/Ed448, `C.sign'` produces a `TASignature`.
76+
77+
The service signature is only added when the entity authenticator is non-empty. If authenticator generation fails silently (returns empty bytes), service signing is silently skipped. This mirrors the [state-dependent parser contract](./Protocol.md#service-signature--state-dependent-parser-contract) in Protocol.hs.
78+
79+
## action — weak thread reference
80+
81+
`action` stores a `Weak ThreadId` (via `mkWeakThreadId`) to the main client thread. `closeProtocolClient` dereferences and kills it. The weak reference allows the thread to be garbage collected if all other references are dropped.
82+
83+
## writeSMPMessage — server-side event injection
84+
85+
`writeSMPMessage` writes directly to `msgQ` as `STEvent`, bypassing the entire command/response pipeline. This is used by the server to inject MSG events into the subscription response path.
Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
1+
# Simplex.Messaging.Client.Agent
2+
3+
> SMP client connections with subscription management, reconnection, and service certificate support.
4+
5+
**Source**: [`Client/Agent.hs`](../../../../../src/Simplex/Messaging/Client/Agent.hs)
6+
7+
## Overview
8+
9+
This is the "small agent", used only in servers (SMP proxy, notification server) to manage client connections to other SMP servers. The "big agent" in `Simplex.Messaging.Agent` + `Simplex.Messaging.Agent.Client` serves client applications and adds the full messaging agent layer. See the [Two agent layers](../../../../TOPICS.md) topic.
10+
11+
`SMPClientAgent` manages `SMPClient` connections via `smpClients :: TMap SMPServer SMPClientVar` (one per SMP server), tracks active and pending subscriptions, and handles automatic reconnection. It is parameterized by `Party` (`p`) and uses the `ServiceParty` constraint to support both `RecipientService` and `NotifierService` modes.
12+
13+
## Dual subscription model
14+
15+
Four TMap fields track subscriptions in two dimensions:
16+
17+
| | Active | Pending |
18+
|---|---|---|
19+
| **Service** | `activeServiceSubs` (TMap SMPServer (TVar (Maybe (ServiceSub, SessionId)))) | `pendingServiceSubs` (TMap SMPServer (TVar (Maybe ServiceSub))) |
20+
| **Queue** | `activeQueueSubs` (TMap SMPServer (TMap QueueId (SessionId, C.APrivateAuthKey))) | `pendingQueueSubs` (TMap SMPServer (TMap QueueId C.APrivateAuthKey)) |
21+
22+
See comments on `activeServiceSubs` and `pendingServiceSubs` for the coexistence rules. Key constraint: only one service subscription per server. Active subs store the `SessionId` that established them.
23+
24+
## SessionVar compare-and-swap — core concurrency safety
25+
26+
`removeSessVar` (in Session.hs) uses `sessionVarId` (monotonically increasing counter from `sessSeq`) to prevent stale removal. When a disconnected client's cleanup runs after a new client already replaced the map entry, the ID mismatch causes removal to silently no-op. See comment on `removeSessVar`. This is used throughout: `removeClientAndSubs` for client map, `cleanup` for worker map.
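
The id check can be sketched on a plain `Data.Map`. The real `removeSessVar` works on a `TMap` inside STM; `SessVarSketch` and `removeSessVarSketch` are hypothetical simplifications that keep only the comparison.

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical, simplified session var: only the id matters here.
data SessVarSketch a = SessVarSketch
  { sessionVarId :: Int
  , payload :: a
  }
  deriving (Eq, Show)

-- Remove the entry only if it is still the one the caller holds.
-- A newer entry under the same key has a different id and is left alone.
removeSessVarSketch
  :: Ord k
  => SessVarSketch a
  -> k
  -> M.Map k (SessVarSketch a)
  -> M.Map k (SessVarSketch a)
removeSessVarSketch v = M.update keepIfDifferent
  where
    keepIfDifferent v'
      | sessionVarId v' == sessionVarId v = Nothing
      | otherwise = Just v'
```

A stale cleanup holding the old var therefore leaves the map unchanged once a replacement with a higher id has been inserted.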
27+
28+
## removeClientAndSubs — outside-STM lookup optimization
29+
30+
See comment on `removeClientAndSubs`. Subscription TVar references are obtained outside STM (via `TM.lookupIO`), then modified inside `atomically`. This is safe because the invariant is that subscription TVar entries for a server are never deleted from the outer TMap, only their contents change. Moving lookups inside the STM transaction would cause excessive re-evaluation under contention.
31+
32+
## Disconnect preserves others' subscriptions
33+
34+
`updateServiceSub` only moves active→pending when `sessId` matches the disconnected client (see its comment). If a new client already established different subscriptions on the same server, those are preserved. Queue subs use `M.partition` to split by SessionId — only matching subs move to pending, non-matching remain active.
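
For queue subs, the split can be sketched as a pure function over one server's active map. `splitOnDisconnect` and the `String` aliases are hypothetical; the real code operates on `TMap`s with `SessionId`, `QueueId` and `C.APrivateAuthKey`.

```haskell
import qualified Data.Map.Strict as M

type SessionIdSketch = String
type QueueIdSketch = String

-- Split the active queue subs of one server on disconnect: subs created by
-- the disconnected session move to pending (dropping the session id), and
-- subs created by any other session stay active.
splitOnDisconnect
  :: SessionIdSketch
  -> M.Map QueueIdSketch (SessionIdSketch, key)
  -> (M.Map QueueIdSketch key, M.Map QueueIdSketch (SessionIdSketch, key))
splitOnDisconnect sessId active = (M.map snd gone, kept)
  where
    (gone, kept) = M.partition ((== sessId) . fst) active
```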
35+
36+
## Pending never reset to Nothing on disconnect
37+
38+
See comment on `updateServiceSub`. After clearing an active service sub, the code sets pending to the cleared value but does NOT reset pending to `Nothing`. This avoids the race where a concurrent new client session has already set a different pending subscription. Implication: pending subs can only grow (be set) during disconnect, never shrink (be cleared).
39+
40+
## persistErrorInterval — delayed error cleanup
41+
42+
When `connectClient` calls `newSMPClient` and it fails, the error is stored with an expiry timestamp. `waitForSMPClient` checks expiry before retrying. When `persistErrorInterval` is 0, the error is stored without timestamp and the SessionVar is immediately removed from the map.
43+
44+
## Session validation after subscription RPC
45+
46+
Both `smpSubscribeQueues` and `smpSubscribeService` validate `activeClientSession` AFTER the subscription RPC completes, before committing results to state. If the session changed during the RPC (client reconnected), results are discarded and reconnection is triggered. This is optimistic execution with post-hoc validation — the RPC may succeed but its results are thrown away if the session is stale.
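
The optimistic pattern can be sketched as follows. `subscribeThenValidate` is a hypothetical helper, and the session is modelled as a `TVar (Maybe s)`; the real functions also trigger reconnection when validation fails, which the sketch reduces to returning `False`.

```haskell
import Control.Concurrent.STM

-- Run the RPC first, then commit its result only if the session that issued
-- it is still the active one. Returns False when the result was discarded.
subscribeThenValidate
  :: Eq s
  => TVar (Maybe s) -- currently active session for the server
  -> s              -- session the RPC was issued on
  -> IO r           -- the subscription RPC
  -> (r -> STM ())  -- commit results to agent state
  -> IO Bool
subscribeThenValidate activeSession sessId rpc commit = do
  r <- rpc
  atomically $ do
    current <- readTVar activeSession
    if current == Just sessId
      then commit r >> pure True
      else pure False
```

The check and the commit share one transaction, so in this sketch a session change cannot slip in between validation and the state update.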
47+
48+
## groupSub — subscription response classification
49+
50+
Each queue response is classified by a `foldr` over the (subs, responses) zip:
51+
52+
- **Success with matching serviceId**: counted as service-subscribed (`sQs` list)
53+
- **Success without matching serviceId**: counted as queue-only (`qOks` list with SessionId and key)
54+
- **Not in pending map**: silently skipped (handles concurrent activation by another path)
55+
- **Temporary error** (network, timeout): sets the `tempErrs` flag but does NOT remove from pending — queue stays pending for retry on reconnect
56+
- **Permanent error**: removed from pending and added to `finalErrs`; terminal, with no automatic retry
57+
58+
Even if multiple temporary errors occur in a batch, only one `reconnectClient` call is made (via the boolean accumulator flag).
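
The classification above can be sketched as a pure fold. The result type, the accumulator record and `classify` are hypothetical simplifications: queue ids are `String`s, service ids are `Int`s, and the sketch only records outcomes rather than updating the pending map.

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical response shape for one queue.
data SubResult
  = SubOk (Maybe Int) -- service id reported for the queue, if any
  | TempErr
  | PermErr String

data Acc = Acc
  { tempErrs :: Bool                -- any temporary error seen in the batch
  , finalErrs :: [(String, String)] -- permanent errors, per queue
  , queueOks :: [String]            -- subscribed as individual queues
  , serviceQs :: [String]           -- subscribed under the service
  }
  deriving (Eq, Show)

classify :: Maybe Int -> M.Map String k -> [(String, SubResult)] -> Acc
classify serviceId pendingSubs = foldr step (Acc False [] [] [])
  where
    step (qId, r) acc
      -- Not pending any more: activated elsewhere, skip silently.
      | M.notMember qId pendingSubs = acc
      | otherwise = case r of
          SubOk (Just s)
            | Just s == serviceId -> acc {serviceQs = qId : serviceQs acc}
          SubOk _ -> acc {queueOks = qId : queueOks acc}
          -- A flag, not a list: the queue stays pending for the retry.
          TempErr -> acc {tempErrs = True}
          PermErr e -> acc {finalErrs = (qId, e) : finalErrs acc}
```

Since `tempErrs` is a single boolean, any number of temporary errors in a batch collapses into one reconnect decision.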
59+
60+
## updateActiveServiceSub — accumulative merge
61+
62+
When serviceId and sessionId match the existing active subscription, queue count is added (`n + n'`) and IdsHash is XOR-merged (`idsHash <> idsHash'`). This accumulates across multiple subscription batches for the same service. When they don't match, the subscription is replaced entirely (silently drops old data).
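
A minimal sketch of the merge rule, assuming an XOR-based `Semigroup` for the hash as described above. `IdsHashSketch` (a `Word64` wrapper), `ServiceSubSketch` and `mergeActive` are hypothetical; the real `IdsHash` and `ServiceSub` types are not reproduced here.

```haskell
import Data.Bits (xor)
import Data.Word (Word64)

-- Hypothetical hash type whose (<>) is XOR.
newtype IdsHashSketch = IdsHashSketch Word64
  deriving (Eq, Show)

instance Semigroup IdsHashSketch where
  IdsHashSketch a <> IdsHashSketch b = IdsHashSketch (a `xor` b)

data ServiceSubSketch = ServiceSubSketch
  { serviceId :: Int
  , queueCount :: Int
  , idsHash :: IdsHashSketch
  }
  deriving (Eq, Show)

-- Accumulate when service id and session id both match; otherwise replace.
mergeActive
  :: (ServiceSubSketch, String)
  -> Maybe (ServiceSubSketch, String)
  -> (ServiceSubSketch, String)
mergeActive new@(sub', sessId') current = case current of
  Just (sub, sessId)
    | serviceId sub == serviceId sub' && sessId == sessId' ->
        ( sub
            { queueCount = queueCount sub + queueCount sub'
            , idsHash = idsHash sub <> idsHash sub'
            }
        , sessId
        )
  _ -> new
```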
63+
64+
## CAServiceUnavailable — cascade to queue resubscription
65+
66+
When `smpSubscribeService` detects service ID or role mismatch with the connection, it fires `CAServiceUnavailable`. See comment on `CAServiceUnavailable` for the full implication: the app must resubscribe all queues individually, creating new associations. This can happen if the SMP server reassigns service IDs (e.g., after downgrade and upgrade).
67+
68+
## getPending — polymorphic over STM/IO
69+
70+
`getPending` uses rank-2 polymorphism to work in both STM (for the "should we spawn a worker?" check, providing a consistent snapshot) and IO (for the actual reconnection data read, providing fresh data). Between these two calls, new pending subs could be added — the worker loop handles this by re-checking on each iteration.
71+
72+
## Reconnect worker lifecycle
73+
74+
### Spawn decision
75+
`reconnectClient` checks `active` outside STM, then atomically checks for pending subs and gets or creates a worker SessionVar. If no pending subs exist, no worker is spawned, which prevents a race between cleanup and another call that adds pending queues.
76+
77+
### Worker cleanup blocks on TMVar fill
78+
See comment on `cleanup`. The STM `retry` loop waits until the async handle is inserted into the TMVar before removing the worker from the map. Without this, cleanup could race ahead of the `putTMVar` in `newSubWorker`, leaving a terminated worker in the map.
79+
80+
### Double timeout on reconnection
81+
`runSubWorker` wraps the entire reconnection in `System.Timeout.timeout` using `tcpConnectTimeout` in addition to the network-layer timeout. Two layers — network for the connection attempt, outer for the entire operation including subscription.
82+
83+
### Reconnect filters already-active queues
84+
During reconnection, `reconnectSMPClient` reads current active queue subs (outside STM, same "vars never removed" invariant) and filters them out before resubscribing. Subscription is chunked by `agentSubsBatchSize` — partial success is possible across chunks.
85+
86+
## Agent shutdown ordering
87+
88+
`closeSMPClientAgent` executes in order: set `active = False`, close all client connections, then swap workers map to empty and fork cancellation threads. The cancel threads use `uninterruptibleCancel` but are fire-and-forget — `closeSMPClientAgent` may return before all workers are actually cancelled.
89+
90+
## addSubs_ — left-biased union
91+
92+
`addSubs_` uses `TM.union` which delegates to `M.union` (left-biased). If a queue subscription already exists, the new auth key from the incoming map wins. Service subs use `writeTVar` (overwrite) since only one service sub exists per server.
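
The left bias is easy to check on plain maps. `addSubsSketch` is a hypothetical pure counterpart; it assumes the incoming map is passed as the left argument, which is what the behaviour described above implies.

```haskell
import qualified Data.Map.Strict as M

-- M.union keeps the left map's value when a key is present in both.
-- Assumption: the incoming subs are the left argument.
addSubsSketch :: Ord k => M.Map k v -> M.Map k v -> M.Map k v
addSubsSketch incoming existing = M.union incoming existing
```

With `incoming = M.fromList [("q1", "new")]` and `existing = M.fromList [("q1", "old"), ("q2", "k2")]`, the result maps `"q1"` to `"new"` and keeps `"q2"` unchanged.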
