Skip to content

Commit 3ddf214

Browse files
committed
smp-server: analyze slow queries
1 parent 2012236 commit 3ddf214

File tree

1 file changed

+237
-0
lines changed

1 file changed

+237
-0
lines changed

plans/slow-queries-analysis.md

Lines changed: 237 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,237 @@
1+
# Slow query analysis: rcv-services related queries
2+
3+
Data from three production SMP servers (A, B, C) over a multi-day observation window.
4+
5+
Below: only queries related to services, subscriptions, and service-driven queue lookups.
6+
7+
---
8+
9+
## 1. getEntityCounts — service-related subqueries
10+
11+
**Query** (`QueueStore/Postgres.hs:160-167`):
12+
13+
```sql
14+
SELECT
15+
(SELECT COUNT(1) FROM msg_queues WHERE deleted_at IS NULL) AS queue_count,
16+
(SELECT COUNT(1) FROM msg_queues WHERE deleted_at IS NULL AND notifier_id IS NOT NULL) AS notifier_count,
17+
(SELECT COUNT(1) FROM services WHERE service_role = ?) AS rcv_service_count, --
18+
(SELECT COUNT(1) FROM services WHERE service_role = ?) AS ntf_service_count, --
19+
(SELECT COUNT(1) FROM msg_queues WHERE rcv_service_id IS NOT NULL AND deleted_at IS NULL) AS rcv_service_queues_count, --
20+
(SELECT COUNT(1) FROM msg_queues WHERE ntf_service_id IS NOT NULL AND deleted_at IS NULL) AS ntf_service_queues_count --
21+
```
22+
23+
Subqueries ①-④ are service-specific. The first two (queue_count, notifier_count) are general.
24+
25+
All three servers show consistent results: **~315ms avg, up to ~2s max per call**,
26+
with total accumulated time of ~800s over the observation window (~2500 calls each).
27+
28+
**Why it's slow**: Subqueries ③ and ④ do full scans of `msg_queues` with conditions on
29+
`rcv_service_id IS NOT NULL` and `ntf_service_id IS NOT NULL` — no covering index exists
30+
for these filters. Each adds ~50-75ms on top of the already expensive base query.
31+
32+
Subqueries ① and ② hit the small `services` table (few rows) — negligible cost.
33+
34+
**How to fix**: Subqueries ③ and ④ can be replaced with aggregates from `services.queue_count`,
35+
which is already maintained by triggers:
36+
37+
```sql
38+
-- Replace ③ and ④ with:
39+
COALESCE((SELECT SUM(queue_count) FROM services WHERE service_role = 'M'), 0) AS rcv_service_queues_count,
40+
COALESCE((SELECT SUM(queue_count) FROM services WHERE service_role = 'N'), 0) AS ntf_service_queues_count
41+
```
42+
43+
This eliminates 2 of the 4 `msg_queues` table scans per call. The `services` table has <10 rows.
44+
45+
**Expected savings**: ~100-150ms per call, significant total reduction proportional to call frequency.
46+
47+
---
48+
49+
## 2. UPDATE msg_queues SET rcv_service_id — service association
50+
51+
**Query** (`QueueStore/Postgres.hs:490`):
52+
53+
```sql
54+
UPDATE msg_queues SET rcv_service_id = $1 WHERE recipient_id = $2 AND deleted_at IS NULL
55+
```
56+
57+
Called by `setQueueService` during `sharedSubscribeQueue` when a queue is associated with a receiving service.
58+
59+
**Only appears on Server C.** Servers A and B don't show this query in the slow query log.
60+
61+
Per-call: **0.19ms avg, ~5ms max**. High volume of service associations observed.
62+
Each UPDATE fires the `on_queue_update` trigger, which calls `update_aggregates` (see #3).
63+
The chain is:
64+
65+
1. `UPDATE msg_queues SET rcv_service_id` → 0.19ms
66+
2. `on_queue_update` trigger → `update_aggregates(OLD.rcv_service_id, ...)` → 0.07ms
67+
3. `UPDATE services SET queue_count = ...` → 0.05ms
68+
69+
Total: ~0.31ms per association.
70+
71+
**Why it might be too frequent**: The observed rate (~tens per minute) could indicate:
72+
- Normal service subscription flow after restarts
73+
- Re-associations when services reconnect
74+
- Possible redundant updates where `rcv_service_id` already equals the target
75+
76+
**How to fix**: The application-level guard at `Postgres.hs:487` should skip unchanged associations:
77+
78+
```haskell
79+
| rcvServiceId q == serviceId -> pure ()
80+
| otherwise -> ...
81+
```
82+
83+
If this guard fires correctly, the volume means genuinely new associations. If not (e.g., the
84+
`QueueRec` is read before service ID is set), add a DB-level guard to skip no-op updates and
85+
avoid firing the trigger:
86+
87+
```sql
88+
UPDATE msg_queues SET rcv_service_id = $1
89+
WHERE recipient_id = $2 AND deleted_at IS NULL AND rcv_service_id IS DISTINCT FROM $1
90+
```
91+
92+
---
93+
94+
## 3. update_aggregates trigger chain (Server C only)
95+
96+
**Queries** (from triggers in `m20250915_queue_ids_hash`):
97+
98+
```sql
99+
-- Called by on_queue_update trigger
100+
SELECT update_aggregates(OLD.rcv_service_id, 'M', OLD.recipient_id, -1)
101+
SELECT update_aggregates(NEW.rcv_service_id, 'M', NEW.recipient_id, +1)
102+
103+
-- Inside update_aggregates:
104+
UPDATE services
105+
SET queue_count = queue_count + p_change,
106+
queue_ids_hash = xor_combine(queue_ids_hash, public.digest(p_queue_id, 'md5'))
107+
WHERE service_id = p_service_id AND service_role = p_role
108+
```
109+
110+
Per-call: **0.07ms avg** for `update_aggregates`, **0.05ms avg** for `UPDATE services`.
111+
Call volume matches #2 (one trigger per service association update).
112+
113+
**Why it happens**: Every `rcv_service_id` UPDATE (see #2) fires the trigger, which does 2 calls
114+
to `update_aggregates` (decrement old, increment new), each doing an UPDATE on `services`.
115+
116+
**How to fix**: Same as #2 — if the `IS DISTINCT FROM` guard is added, unchanged associations
117+
don't trigger updates at all.
118+
119+
---
120+
121+
## 4. Batch notifier_id IN (...) lookups — service notification subscription
122+
123+
**Query** (`QueueStore/Postgres.hs`, `getQueues_ SNotifier`):
124+
125+
```sql
126+
SELECT ntf_service_id, notifier_id
127+
FROM msg_queues
128+
WHERE notifier_id IN ($1, ..., $N) AND deleted_at IS NULL
129+
```
130+
131+
Called by `getQueuesNtfService` when a notification service subscribes to queues.
132+
133+
All three servers: batch sizes 36–150 params, **0.5-0.7ms avg per call**,
134+
tens of thousands of calls with ~25s total accumulated time each.
135+
136+
The existing index `idx_msg_queues_notifier_id` (UNIQUE on `notifier_id`) doesn't include
137+
`ntf_service_id` or `deleted_at`. Each matching row requires a heap access to:
138+
1. Check `deleted_at IS NULL`
139+
2. Read `ntf_service_id`
140+
141+
**How to fix**: Replace the index with a partial covering index:
142+
143+
```sql
144+
DROP INDEX idx_msg_queues_notifier_id;
145+
CREATE UNIQUE INDEX idx_msg_queues_notifier_id
146+
ON msg_queues (notifier_id)
147+
INCLUDE (ntf_service_id)
148+
WHERE deleted_at IS NULL;
149+
```
150+
151+
This enables index-only scans, avoiding heap access. Estimated ~30-40% reduction in per-call time.
152+
153+
---
154+
155+
## 5. Batch recipient_id IN (...) lookups — service queue subscription
156+
157+
**Query** (`QueueStore/Postgres.hs`, `getQueues_ SRecipient`):
158+
159+
```sql
160+
SELECT recipient_id, recipient_keys, rcv_dh_secret, sender_id, sender_key, queue_mode,
161+
notifier_id, notifier_key, rcv_ntf_dh_secret, ntf_service_id,
162+
status, updated_at, link_id, rcv_service_id
163+
FROM msg_queues
164+
WHERE recipient_id IN ($1, ..., $135) AND deleted_at IS NULL
165+
```
166+
167+
All three servers show consistent results: **~2.2ms avg, ~40ms max per call**,
168+
~20K calls each with ~135 IDs per batch.
169+
170+
These are service subscription batches from `subscribeServiceMessages` / `subscribeServiceNotifications`.
171+
172+
**Current performance**: ~2ms for ~135 random PK lookups is reasonable. The primary key index
173+
is used. No optimization needed for per-call latency.
174+
175+
**Observation**: Nearly all requested queues are returned, meaning services are subscribing
176+
to known queues, not probing.
177+
178+
---
179+
180+
## 6. foldRcvServiceMessages — service subscription delivery
181+
182+
**Query** (`MsgStore/Postgres.hs:127-141`):
183+
184+
```sql
185+
SELECT q.recipient_id, q.recipient_keys, q.rcv_dh_secret,
186+
q.sender_id, q.sender_key, q.queue_mode,
187+
q.notifier_id, q.notifier_key, q.rcv_ntf_dh_secret, q.ntf_service_id,
188+
q.status, q.updated_at, q.link_id, q.rcv_service_id,
189+
m.msg_id, m.msg_ts, m.msg_quota, m.msg_ntf_flag, m.msg_body
190+
FROM msg_queues q
191+
LEFT JOIN LATERAL (
192+
SELECT msg_id, msg_ts, msg_quota, msg_ntf_flag, msg_body
193+
FROM messages
194+
WHERE recipient_id = q.recipient_id
195+
ORDER BY message_id ASC
196+
LIMIT 1
197+
) m ON true
198+
WHERE q.rcv_service_id = ? AND q.deleted_at IS NULL;
199+
```
200+
201+
Called on `subscribeServiceMessages` to deliver pending messages for all queues of a service.
202+
203+
**Not visible in slow query logs** — runs once per service subscription (startup), not repeatedly.
204+
205+
**Potential concern**: `LEFT JOIN LATERAL` for every queue of the service. For services with many
206+
queues, this scans all matching rows in `idx_msg_queues_rcv_service_id(rcv_service_id, deleted_at)`,
207+
plus one `idx_messages_recipient_id_message_id` probe per queue.
208+
209+
**No change needed** — startup-only operation, properly indexed.
210+
211+
---
212+
213+
## 7. getServiceQueueCountHash — already optimized
214+
215+
**Query** (`QueueStore/Postgres.hs`, `getServiceQueueCountHash`):
216+
217+
```sql
218+
SELECT queue_count, queue_ids_hash FROM services WHERE service_id = ? AND service_role = ?
219+
```
220+
221+
Reads trigger-maintained aggregates from the small `services` table instead of scanning `msg_queues`.
222+
223+
**Not in slow query logs** — fast single-row lookup. Already the right approach.
224+
225+
---
226+
227+
## Summary of fixes
228+
229+
| # | Problem | Impact | Fix |
230+
|---|---------|--------|-----|
231+
| 1 | getEntityCounts: ③ and ④ scan msg_queues for service queue counts | ~100-150ms saved per call | Use `SUM(queue_count) FROM services` |
232+
| 2 | SET rcv_service_id fires trigger on no-op updates (Server C) | Eliminates redundant trigger chain | Add `IS DISTINCT FROM` guard |
233+
| 3 | update_aggregates triggers fire needlessly | Eliminated by #2 | (same fix) |
234+
| 4 | Batch notifier_id lookups: no covering index | ~30-40% faster per call | Add `INCLUDE (ntf_service_id)` partial index |
235+
| 5 | Batch recipient_id lookups: PK lookups at ~2ms/call | Acceptable | No change needed |
236+
| 6 | foldRcvServiceMessages: startup-only | N/A | No change needed |
237+
| 7 | getServiceQueueCountHash | Already optimized | Already reads from services table |

0 commit comments

Comments
 (0)