|
| 1 | +# Slow query analysis: rcv-services related queries |
| 2 | + |
| 3 | +Data from three production SMP servers (A, B, C) over a multi-day observation window. |
| 4 | + |
| 5 | +Below: only queries related to services, subscriptions, and service-driven queue lookups. |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## 1. getEntityCounts — service-related subqueries |
| 10 | + |
| 11 | +**Query** (`QueueStore/Postgres.hs:160-167`): |
| 12 | + |
| 13 | +```sql |
| 14 | +SELECT |
| 15 | + (SELECT COUNT(1) FROM msg_queues WHERE deleted_at IS NULL) AS queue_count, |
| 16 | + (SELECT COUNT(1) FROM msg_queues WHERE deleted_at IS NULL AND notifier_id IS NOT NULL) AS notifier_count, |
| 17 | + (SELECT COUNT(1) FROM services WHERE service_role = ?) AS rcv_service_count, -- ① |
| 18 | + (SELECT COUNT(1) FROM services WHERE service_role = ?) AS ntf_service_count, -- ② |
| 19 | + (SELECT COUNT(1) FROM msg_queues WHERE rcv_service_id IS NOT NULL AND deleted_at IS NULL) AS rcv_service_queues_count, -- ③ |
| 20 | + (SELECT COUNT(1) FROM msg_queues WHERE ntf_service_id IS NOT NULL AND deleted_at IS NULL) AS ntf_service_queues_count -- ④ |
| 21 | +``` |
| 22 | + |
| 23 | +Subqueries ①-④ are service-specific. The first two (queue_count, notifier_count) are general. |
| 24 | + |
| 25 | +All three servers show consistent results: **~315ms avg, up to ~2s max per call**, |
| 26 | +with total accumulated time of ~800s over the observation window (~2500 calls each). |
| 27 | + |
| 28 | +**Why it's slow**: Subqueries ③ and ④ do full scans of `msg_queues` with conditions on |
| 29 | +`rcv_service_id IS NOT NULL` and `ntf_service_id IS NOT NULL` — no covering index exists |
| 30 | +for these filters. Each adds ~50-75ms on top of the already expensive base query. |
| 31 | + |
| 32 | +Subqueries ① and ② hit the small `services` table (few rows) — negligible cost. |
| 33 | + |
| 34 | +**How to fix**: Subqueries ③ and ④ can be replaced with aggregates from `services.queue_count`, |
| 35 | +which is already maintained by triggers: |
| 36 | + |
| 37 | +```sql |
| 38 | +-- Replace ③ and ④ with: |
| 39 | +COALESCE((SELECT SUM(queue_count) FROM services WHERE service_role = 'M'), 0) AS rcv_service_queues_count, |
| 40 | +COALESCE((SELECT SUM(queue_count) FROM services WHERE service_role = 'N'), 0) AS ntf_service_queues_count |
| 41 | +``` |
| 42 | + |
| 43 | +This eliminates 2 of the 4 `msg_queues` table scans per call. The `services` table has <10 rows. |
| 44 | + |
| 45 | +**Expected savings**: ~100-150ms per call, significant total reduction proportional to call frequency. |
| 46 | + |
| 47 | +--- |
| 48 | + |
| 49 | +## 2. UPDATE msg_queues SET rcv_service_id — service association |
| 50 | + |
| 51 | +**Query** (`QueueStore/Postgres.hs:490`): |
| 52 | + |
| 53 | +```sql |
| 54 | +UPDATE msg_queues SET rcv_service_id = $1 WHERE recipient_id = $2 AND deleted_at IS NULL |
| 55 | +``` |
| 56 | + |
| 57 | +Called by `setQueueService` during `sharedSubscribeQueue` when a queue is associated with a receiving service. |
| 58 | + |
| 59 | +**Only appears on Server C.** Servers A and B don't show this query in the slow query log. |
| 60 | + |
| 61 | +Per-call: **0.19ms avg, ~5ms max**. High volume of service associations observed. |
| 62 | +Each UPDATE fires the `on_queue_update` trigger, which calls `update_aggregates` (see #3). |
| 63 | +The chain is: |
| 64 | + |
| 65 | +1. `UPDATE msg_queues SET rcv_service_id` → 0.19ms |
| 66 | +2. `on_queue_update` trigger → `update_aggregates(OLD.rcv_service_id, ...)` → 0.07ms |
| 67 | +3. `UPDATE services SET queue_count = ...` → 0.05ms |
| 68 | + |
| 69 | +Total: ~0.31ms per association. |
| 70 | + |
| 71 | +**Why it might be too frequent**: The observed rate (~tens per minute) could indicate: |
| 72 | +- Normal service subscription flow after restarts |
| 73 | +- Re-associations when services reconnect |
| 74 | +- Possible redundant updates where `rcv_service_id` already equals the target |
| 75 | + |
| 76 | +**How to fix**: The application-level guard at `Postgres.hs:487` should skip unchanged associations: |
| 77 | + |
| 78 | +```haskell |
| 79 | +| rcvServiceId q == serviceId -> pure () |
| 80 | +| otherwise -> ... |
| 81 | +``` |
| 82 | + |
| 83 | +If this guard fires correctly, the volume means genuinely new associations. If not (e.g., the |
| 84 | +`QueueRec` is read before service ID is set), add a DB-level guard to skip no-op updates and |
| 85 | +avoid firing the trigger: |
| 86 | + |
| 87 | +```sql |
| 88 | +UPDATE msg_queues SET rcv_service_id = $1 |
| 89 | +WHERE recipient_id = $2 AND deleted_at IS NULL AND rcv_service_id IS DISTINCT FROM $1 |
| 90 | +``` |
| 91 | + |
| 92 | +--- |
| 93 | + |
| 94 | +## 3. update_aggregates trigger chain (Server C only) |
| 95 | + |
| 96 | +**Queries** (from triggers in `m20250915_queue_ids_hash`): |
| 97 | + |
| 98 | +```sql |
| 99 | +-- Called by on_queue_update trigger |
| 100 | +SELECT update_aggregates(OLD.rcv_service_id, 'M', OLD.recipient_id, -1) |
| 101 | +SELECT update_aggregates(NEW.rcv_service_id, 'M', NEW.recipient_id, +1) |
| 102 | + |
| 103 | +-- Inside update_aggregates: |
| 104 | +UPDATE services |
| 105 | + SET queue_count = queue_count + p_change, |
| 106 | + queue_ids_hash = xor_combine(queue_ids_hash, public.digest(p_queue_id, 'md5')) |
| 107 | + WHERE service_id = p_service_id AND service_role = p_role |
| 108 | +``` |
| 109 | + |
| 110 | +Per-call: **0.07ms avg** for `update_aggregates`, **0.05ms avg** for `UPDATE services`. |
| 111 | +Call volume matches #2 (one trigger per service association update). |
| 112 | + |
| 113 | +**Why it happens**: Every `rcv_service_id` UPDATE (see #2) fires the trigger, which does 2 calls |
| 114 | +to `update_aggregates` (decrement old, increment new), each doing an UPDATE on `services`. |
| 115 | + |
| 116 | +**How to fix**: Same as #2 — if the `IS DISTINCT FROM` guard is added, unchanged associations |
| 117 | +don't trigger updates at all. |
| 118 | + |
| 119 | +--- |
| 120 | + |
| 121 | +## 4. Batch notifier_id IN (...) lookups — service notification subscription |
| 122 | + |
| 123 | +**Query** (`QueueStore/Postgres.hs`, `getQueues_ SNotifier`): |
| 124 | + |
| 125 | +```sql |
| 126 | +SELECT ntf_service_id, notifier_id |
| 127 | +FROM msg_queues |
| 128 | +WHERE notifier_id IN ($1, ..., $N) AND deleted_at IS NULL |
| 129 | +``` |
| 130 | + |
| 131 | +Called by `getQueuesNtfService` when a notification service subscribes to queues. |
| 132 | + |
| 133 | +All three servers: batch sizes 36–150 params, **0.5-0.7ms avg per call**, |
| 134 | +tens of thousands of calls with ~25s total accumulated time each. |
| 135 | + |
| 136 | +The existing index `idx_msg_queues_notifier_id` (UNIQUE on `notifier_id`) doesn't include |
| 137 | +`ntf_service_id` or `deleted_at`. Each matching row requires a heap access to: |
| 138 | +1. Check `deleted_at IS NULL` |
| 139 | +2. Read `ntf_service_id` |
| 140 | + |
| 141 | +**How to fix**: Replace the index with a partial covering index: |
| 142 | + |
| 143 | +```sql |
| 144 | +DROP INDEX idx_msg_queues_notifier_id; |
| 145 | +CREATE UNIQUE INDEX idx_msg_queues_notifier_id |
| 146 | + ON msg_queues (notifier_id) |
| 147 | + INCLUDE (ntf_service_id) |
| 148 | + WHERE deleted_at IS NULL; |
| 149 | +``` |
| 150 | + |
| 151 | +This enables index-only scans, avoiding heap access. Estimated ~30-40% reduction in per-call time. |
| 152 | + |
| 153 | +--- |
| 154 | + |
| 155 | +## 5. Batch recipient_id IN (...) lookups — service queue subscription |
| 156 | + |
| 157 | +**Query** (`QueueStore/Postgres.hs`, `getQueues_ SRecipient`): |
| 158 | + |
| 159 | +```sql |
| 160 | +SELECT recipient_id, recipient_keys, rcv_dh_secret, sender_id, sender_key, queue_mode, |
| 161 | + notifier_id, notifier_key, rcv_ntf_dh_secret, ntf_service_id, |
| 162 | + status, updated_at, link_id, rcv_service_id |
| 163 | +FROM msg_queues |
| 164 | +WHERE recipient_id IN ($1, ..., $135) AND deleted_at IS NULL |
| 165 | +``` |
| 166 | + |
| 167 | +All three servers show consistent results: **~2.2ms avg, ~40ms max per call**, |
| 168 | +~20K calls each with ~135 IDs per batch. |
| 169 | + |
| 170 | +These are service subscription batches from `subscribeServiceMessages` / `subscribeServiceNotifications`. |
| 171 | + |
| 172 | +**Current performance**: ~2ms for ~135 random PK lookups is reasonable. The primary key index |
| 173 | +is used. No optimization needed for per-call latency. |
| 174 | + |
| 175 | +**Observation**: Nearly all requested queues are returned, meaning services are subscribing |
| 176 | +to known queues, not probing. |
| 177 | + |
| 178 | +--- |
| 179 | + |
| 180 | +## 6. foldRcvServiceMessages — service subscription delivery |
| 181 | + |
| 182 | +**Query** (`MsgStore/Postgres.hs:127-141`): |
| 183 | + |
| 184 | +```sql |
| 185 | +SELECT q.recipient_id, q.recipient_keys, q.rcv_dh_secret, |
| 186 | + q.sender_id, q.sender_key, q.queue_mode, |
| 187 | + q.notifier_id, q.notifier_key, q.rcv_ntf_dh_secret, q.ntf_service_id, |
| 188 | + q.status, q.updated_at, q.link_id, q.rcv_service_id, |
| 189 | + m.msg_id, m.msg_ts, m.msg_quota, m.msg_ntf_flag, m.msg_body |
| 190 | +FROM msg_queues q |
| 191 | +LEFT JOIN LATERAL ( |
| 192 | + SELECT msg_id, msg_ts, msg_quota, msg_ntf_flag, msg_body |
| 193 | + FROM messages |
| 194 | + WHERE recipient_id = q.recipient_id |
| 195 | + ORDER BY message_id ASC |
| 196 | + LIMIT 1 |
| 197 | +) m ON true |
| 198 | +WHERE q.rcv_service_id = ? AND q.deleted_at IS NULL; |
| 199 | +``` |
| 200 | + |
| 201 | +Called on `subscribeServiceMessages` to deliver pending messages for all queues of a service. |
| 202 | + |
| 203 | +**Not visible in slow query logs** — runs once per service subscription (startup), not repeatedly. |
| 204 | + |
| 205 | +**Potential concern**: `LEFT JOIN LATERAL` for every queue of the service. For services with many |
| 206 | +queues, this scans all matching rows in `idx_msg_queues_rcv_service_id(rcv_service_id, deleted_at)`, |
| 207 | +plus one `idx_messages_recipient_id_message_id` probe per queue. |
| 208 | + |
| 209 | +**No change needed** — startup-only operation, properly indexed. |
| 210 | + |
| 211 | +--- |
| 212 | + |
| 213 | +## 7. getServiceQueueCountHash — already optimized |
| 214 | + |
| 215 | +**Query** (`QueueStore/Postgres.hs`, `getServiceQueueCountHash`): |
| 216 | + |
| 217 | +```sql |
| 218 | +SELECT queue_count, queue_ids_hash FROM services WHERE service_id = ? AND service_role = ? |
| 219 | +``` |
| 220 | + |
| 221 | +Reads trigger-maintained aggregates from the small `services` table instead of scanning `msg_queues`. |
| 222 | + |
| 223 | +**Not in slow query logs** — fast single-row lookup. Already the right approach. |
| 224 | + |
| 225 | +--- |
| 226 | + |
| 227 | +## Summary of fixes |
| 228 | + |
| 229 | +| # | Problem | Impact | Fix | |
| 230 | +|---|---------|--------|-----| |
| 231 | +| 1 | getEntityCounts: ③ and ④ scan msg_queues for service queue counts | ~100-150ms saved per call | Use `SUM(queue_count) FROM services` | |
| 232 | +| 2 | SET rcv_service_id fires trigger on no-op updates (Server C) | Eliminates redundant trigger chain | Add `IS DISTINCT FROM` guard | |
| 233 | +| 3 | update_aggregates triggers fire needlessly | Eliminated by #2 | (same fix) | |
| 234 | +| 4 | Batch notifier_id lookups: no covering index | ~30-40% faster per call | Add `INCLUDE (ntf_service_id)` partial index | |
| 235 | +| 5 | Batch recipient_id lookups: PK lookups at ~2ms/call | Acceptable | No change needed | |
| 236 | +| 6 | foldRcvServiceMessages: startup-only | N/A | No change needed | |
| 237 | +| 7 | getServiceQueueCountHash | Already optimized | Already reads from services table | |
0 commit comments