smp-server: analyze slow queries

shumvgolove · shumvgolove · commit 3ddf214f3caf · 2026-03-25T13:43:28.000Z
diff --git a/plans/slow-queries-analysis.md b/plans/slow-queries-analysis.md
@@ -0,0 +1,237 @@
+# Slow query analysis: rcv-services related queries
+
+Data from three production SMP servers (A, B, C) over a multi-day observation window.
+
+Below: only queries related to services, subscriptions, and service-driven queue lookups.
+
+---
+
+## 1. getEntityCounts — service-related subqueries
+
+**Query** (`QueueStore/Postgres.hs:160-167`):
+
+```sql
+SELECT
+  (SELECT COUNT(1) FROM msg_queues WHERE deleted_at IS NULL) AS queue_count,
+  (SELECT COUNT(1) FROM msg_queues WHERE deleted_at IS NULL AND notifier_id IS NOT NULL) AS notifier_count,
+  (SELECT COUNT(1) FROM services WHERE service_role = ?) AS rcv_service_count,        -- ①
+  (SELECT COUNT(1) FROM services WHERE service_role = ?) AS ntf_service_count,        -- ②
+  (SELECT COUNT(1) FROM msg_queues WHERE rcv_service_id IS NOT NULL AND deleted_at IS NULL) AS rcv_service_queues_count, -- ③
+  (SELECT COUNT(1) FROM msg_queues WHERE ntf_service_id IS NOT NULL AND deleted_at IS NULL) AS ntf_service_queues_count -- ④
+```
+
+Subqueries ①-④ are service-specific. The first two (queue_count, notifier_count) are general.
+
+All three servers show consistent results: **~315ms avg, up to ~2s max per call**,
+with total accumulated time of ~800s over the observation window (~2500 calls each).
+
+**Why it's slow**: Subqueries ③ and ④ do full scans of `msg_queues` with conditions on
+`rcv_service_id IS NOT NULL` and `ntf_service_id IS NOT NULL` — no covering index exists
+for these filters. Each adds ~50-75ms on top of the already expensive base query.
+
+Subqueries ① and ② hit the small `services` table (few rows) — negligible cost.
+
+**How to fix**: Subqueries ③ and ④ can be replaced with aggregates from `services.queue_count`,
+which is already maintained by triggers:
+
+```sql
+-- Replace ③ and ④ with:
+COALESCE((SELECT SUM(queue_count) FROM services WHERE service_role = 'M'), 0) AS rcv_service_queues_count,
+COALESCE((SELECT SUM(queue_count) FROM services WHERE service_role = 'N'), 0) AS ntf_service_queues_count
+```
+
+This eliminates 2 of the 4 `msg_queues` table scans per call. The `services` table has <10 rows.
+
+**Expected savings**: ~100-150ms per call. At the observed ~2500 calls per server, that is roughly 250-375s of the ~800s accumulated over the window.
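+
+Before switching, it may be worth confirming that the trigger-maintained counts agree with what the scans return today. A minimal sketch, assuming the `'M'` and `'N'` role values used in the replacement above, and assuming the triggers also adjust `queue_count` when a queue is soft-deleted (if they do not, the two values in each pair will differ):
+
+```sql
+-- One-off consistency check: scanned counts vs. trigger-maintained sums.
+-- Each *_scanned value should equal its *_summed counterpart.
+SELECT
+  (SELECT COUNT(1) FROM msg_queues
+    WHERE rcv_service_id IS NOT NULL AND deleted_at IS NULL) AS rcv_scanned,
+  COALESCE((SELECT SUM(queue_count) FROM services WHERE service_role = 'M'), 0) AS rcv_summed,
+  (SELECT COUNT(1) FROM msg_queues
+    WHERE ntf_service_id IS NOT NULL AND deleted_at IS NULL) AS ntf_scanned,
+  COALESCE((SELECT SUM(queue_count) FROM services WHERE service_role = 'N'), 0) AS ntf_summed;
+```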
+
+---
+
+## 2. UPDATE msg_queues SET rcv_service_id — service association
+
+**Query** (`QueueStore/Postgres.hs:490`):
+
+```sql
+UPDATE msg_queues SET rcv_service_id = $1 WHERE recipient_id = $2 AND deleted_at IS NULL
+```
+
+Called by `setQueueService` during `sharedSubscribeQueue` when a queue is associated with a receiving service.
+
+**Only appears on Server C.** Servers A and B don't show this query in the slow query log.
+
+Per-call: **0.19ms avg, ~5ms max**. High volume of service associations observed.
+Each UPDATE fires the `on_queue_update` trigger, which calls `update_aggregates` (see #3).
+The chain is:
+
+1. `UPDATE msg_queues SET rcv_service_id` → 0.19ms
+2. `on_queue_update` trigger → `update_aggregates(OLD.rcv_service_id, ...)` → 0.07ms
+3. `UPDATE services SET queue_count = ...` → 0.05ms
+
+Total: ~0.31ms per association.
+
+**Why it might be too frequent**: The observed rate (~tens per minute) could indicate:
+- Normal service subscription flow after restarts
+- Re-associations when services reconnect
+- Possible redundant updates where `rcv_service_id` already equals the target
+
+**How to fix**: The application-level guard at `QueueStore/Postgres.hs:487` should skip unchanged associations:
+
+```haskell
+| rcvServiceId q == serviceId -> pure ()
+| otherwise -> ...
+```
+
+If this guard fires correctly, the observed volume reflects genuinely new associations. If it does not (e.g., the
+`QueueRec` is read before the service ID is set), add a DB-level guard so that no-op updates match no rows and
+therefore never fire the row-level trigger:
+
+```sql
+UPDATE msg_queues SET rcv_service_id = $1
+WHERE recipient_id = $2 AND deleted_at IS NULL AND rcv_service_id IS DISTINCT FROM $1
+```
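+
+To see how much of each update is trigger time, `EXPLAIN (ANALYZE)` on the statement reports time and call counts per trigger. A sketch with hypothetical placeholder values (substitute a real service ID and recipient ID; the `bytea` literals are an assumption about the column types), wrapped in a transaction that is rolled back so nothing is changed:
+
+```sql
+BEGIN;
+-- EXPLAIN ANALYZE executes the UPDATE, so the trigger really fires here.
+EXPLAIN (ANALYZE, BUFFERS)
+UPDATE msg_queues SET rcv_service_id = '\x01'::bytea   -- placeholder service ID
+WHERE recipient_id = '\x02'::bytea                      -- placeholder recipient ID
+  AND deleted_at IS NULL
+  AND rcv_service_id IS DISTINCT FROM '\x01'::bytea;
+ROLLBACK;
+```
+
+When `rcv_service_id` already equals the target, the guarded statement should update zero rows and the output should contain no `Trigger on_queue_update` line.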
+
+---
+
+## 3. update_aggregates trigger chain (Server C only)
+
+**Queries** (from triggers in `m20250915_queue_ids_hash`):
+
+```sql
+-- Called by on_queue_update trigger
+SELECT update_aggregates(OLD.rcv_service_id, 'M', OLD.recipient_id, -1)
+SELECT update_aggregates(NEW.rcv_service_id, 'M', NEW.recipient_id, +1)
+
+-- Inside update_aggregates:
+UPDATE services
+  SET queue_count = queue_count + p_change,
+      queue_ids_hash = xor_combine(queue_ids_hash, public.digest(p_queue_id, 'md5'))
+  WHERE service_id = p_service_id AND service_role = p_role
+```
+
+Per-call: **0.07ms avg** for `update_aggregates`, **0.05ms avg** for `UPDATE services`.
+Call volume matches #2 (one trigger per service association update).
+
+**Why it happens**: Every `rcv_service_id` UPDATE (see #2) fires the trigger, which does 2 calls
+to `update_aggregates` (decrement old, increment new), each doing an UPDATE on `services`.
+
+**How to fix**: Same as #2 — if the `IS DISTINCT FROM` guard is added, unchanged associations
+don't trigger updates at all.
+
+---
+
+## 4. Batch notifier_id IN (...) lookups — service notification subscription
+
+**Query** (`QueueStore/Postgres.hs`, `getQueues_ SNotifier`):
+
+```sql
+SELECT ntf_service_id, notifier_id
+FROM msg_queues
+WHERE notifier_id IN ($1, ..., $N) AND deleted_at IS NULL
+```
+
+Called by `getQueuesNtfService` when a notification service subscribes to queues.
+
+All three servers: batch sizes 36–150 params, **0.5-0.7ms avg per call**,
+tens of thousands of calls with ~25s total accumulated time each.
+
+The existing index `idx_msg_queues_notifier_id` (UNIQUE on `notifier_id`) doesn't include
+`ntf_service_id` or `deleted_at`. Each matching row requires a heap access to:
+1. Check `deleted_at IS NULL`
+2. Read `ntf_service_id`
+
+**How to fix**: Replace the index with a partial covering index:
+
+```sql
+DROP INDEX idx_msg_queues_notifier_id;
+CREATE UNIQUE INDEX idx_msg_queues_notifier_id
+  ON msg_queues (notifier_id)
+  INCLUDE (ntf_service_id)
+  WHERE deleted_at IS NULL;
+```
+
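+This enables index-only scans, which skip the heap for pages marked all-visible in the visibility map (kept current by vacuum). Note that the partial index enforces `notifier_id` uniqueness only among non-deleted rows. Estimated ~30-40% reduction in per-call time.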
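+
+A plain `CREATE INDEX` blocks writes to `msg_queues` while the index is built, which may matter on a large production table. A non-blocking variant is sketched below, assuming the existing index is a plain index rather than one backing a constraint, and that the migration can run outside a transaction block (the `CONCURRENTLY` forms and `VACUUM` cannot run inside one). The temporary name `idx_msg_queues_notifier_id_new` is hypothetical:
+
+```sql
+-- Build the replacement first so lookups never run without an index.
+CREATE UNIQUE INDEX CONCURRENTLY idx_msg_queues_notifier_id_new
+  ON msg_queues (notifier_id)
+  INCLUDE (ntf_service_id)
+  WHERE deleted_at IS NULL;
+
+-- Swap: drop the old index, then give the new one the original name.
+DROP INDEX CONCURRENTLY idx_msg_queues_notifier_id;
+ALTER INDEX idx_msg_queues_notifier_id_new RENAME TO idx_msg_queues_notifier_id;
+
+-- Refresh the visibility map so index-only scans can skip the heap.
+VACUUM (ANALYZE) msg_queues;
+```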
+
+---
+
+## 5. Batch recipient_id IN (...) lookups — service queue subscription
+
+**Query** (`QueueStore/Postgres.hs`, `getQueues_ SRecipient`):
+
+```sql
+SELECT recipient_id, recipient_keys, rcv_dh_secret, sender_id, sender_key, queue_mode,
+  notifier_id, notifier_key, rcv_ntf_dh_secret, ntf_service_id,
+  status, updated_at, link_id, rcv_service_id
+FROM msg_queues
+WHERE recipient_id IN ($1, ..., $135) AND deleted_at IS NULL
+```
+
+All three servers show consistent results: **~2.2ms avg, ~40ms max per call**,
+~20K calls each with ~135 IDs per batch.
+
+These are service subscription batches from `subscribeServiceMessages` / `subscribeServiceNotifications`.
+
+**Current performance**: ~2ms for ~135 random PK lookups (about 16µs per ID) is reasonable. The primary key index
+is used. No optimization needed for per-call latency.
+
+**Observation**: Nearly all requested queues are returned, meaning services are subscribing
+to known queues, not probing.
+
+---
+
+## 6. foldRcvServiceMessages — service subscription delivery
+
+**Query** (`MsgStore/Postgres.hs:127-141`):
+
+```sql
+SELECT q.recipient_id, q.recipient_keys, q.rcv_dh_secret,
+  q.sender_id, q.sender_key, q.queue_mode,
+  q.notifier_id, q.notifier_key, q.rcv_ntf_dh_secret, q.ntf_service_id,
+  q.status, q.updated_at, q.link_id, q.rcv_service_id,
+  m.msg_id, m.msg_ts, m.msg_quota, m.msg_ntf_flag, m.msg_body
+FROM msg_queues q
+LEFT JOIN LATERAL (
+    SELECT msg_id, msg_ts, msg_quota, msg_ntf_flag, msg_body
+    FROM messages
+    WHERE recipient_id = q.recipient_id
+    ORDER BY message_id ASC
+    LIMIT 1
+) m ON true
+WHERE q.rcv_service_id = ? AND q.deleted_at IS NULL;
+```
+
+Called on `subscribeServiceMessages` to deliver pending messages for all queues of a service.
+
+**Not visible in slow query logs** — runs once per service subscription (startup), not repeatedly.
+
+**Potential concern**: The `LEFT JOIN LATERAL` subquery runs once for every queue of the service. For services with many
+queues, the query scans all matching rows in `idx_msg_queues_rcv_service_id(rcv_service_id, deleted_at)`,
+plus one `idx_messages_recipient_id_message_id` probe per queue.
+
+**No change needed** — startup-only operation, properly indexed.
+
+---
+
+## 7. getServiceQueueCountHash — already optimized
+
+**Query** (`QueueStore/Postgres.hs`, `getServiceQueueCountHash`):
+
+```sql
+SELECT queue_count, queue_ids_hash FROM services WHERE service_id = ? AND service_role = ?
+```
+
+Reads trigger-maintained aggregates from the small `services` table instead of scanning `msg_queues`.
+
+**Not in slow query logs** — fast single-row lookup. Already the right approach.
+
+---
+
+## Summary of fixes
+
+| # | Problem | Impact | Fix |
+|---|---------|--------|-----|
+| 1 | getEntityCounts: ③ and ④ scan msg_queues for service queue counts | ~100-150ms saved per call | Use `SUM(queue_count) FROM services` |
+| 2 | SET rcv_service_id may fire trigger on no-op updates (Server C) | Eliminates redundant trigger chain | Add `IS DISTINCT FROM` guard |
+| 3 | update_aggregates triggers may fire needlessly on no-op updates | Eliminated by #2 | (same fix) |
+| 4 | Batch notifier_id lookups: no covering index | ~30-40% faster per call | Add `INCLUDE (ntf_service_id)` partial index |
+| 5 | Batch recipient_id lookups: PK lookups at ~2ms/call | Acceptable | No change needed |
+| 6 | foldRcvServiceMessages: startup-only | N/A | No change needed |
+| 7 | getServiceQueueCountHash: single-row read from services | Already optimized | No change needed |