Commit 5cef698
Docs: Put details on optimizing queries in more places (#14153)
Co-authored-by: Vincent (Wen Yu) Ge <[email protected]>

When writing custom queries, the burden of performance falls on you. PostHog handles performance for queries we own (for example, product analytics insights and experiments), but because performance depends on how queries are structured and written, we can't optimize them for you. Large datasets in particular require careful attention to performance.

Here is some advice for making sure your queries are quick and don't read too much data (which can increase costs):

### 1. Use shorter time ranges

You should almost always include a time range in your queries, and the shorter the better. There are a variety of SQL features to help you do this, including `now()`, `INTERVAL`, and `dateDiff`. See more about these in our [SQL docs](/docs/product-analytics/sql#date-and-time).

<MultiLanguage>

```sql
SELECT count() FROM events WHERE timestamp >= now() - INTERVAL 7 DAY
```

```bash
curl \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $POSTHOG_PERSONAL_API_KEY" \
  <ph_app_host>/api/projects/:project_id/query/ \
  -d '{
        "query": {
          "kind": "HogQLQuery",
          "query": "SELECT count() FROM events WHERE timestamp >= now() - INTERVAL 7 DAY"
        },
        "name": "event count in last 7 days"
      }'
```

```python
import requests
import json

url = "<ph_app_host>/api/projects/{project_id}/query/"
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {POSTHOG_PERSONAL_API_KEY}'
}
payload = {
    "query": {
        "kind": "HogQLQuery",
        "query": """
            SELECT count()
            FROM events
            WHERE timestamp >= now() - INTERVAL 7 DAY
        """
    },
    "name": "event count in last 7 days"
}
response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())
```

```node
import fetch from "node-fetch";

async function createQuery() {
  const url = "<ph_app_host>/api/projects/:project_id/query/";
  const headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer {POSTHOG_PERSONAL_API_KEY}"
  };

  const payload = {
    "query": {
      "kind": "HogQLQuery",
      "query": `
        SELECT count()
        FROM events
        WHERE timestamp >= now() - INTERVAL 7 DAY
      `
    },
    "name": "event count in last 7 days"
  };

  const response = await fetch(url, {
    method: "POST",
    headers: headers,
    body: JSON.stringify(payload),
  });

  const data = await response.json();
  console.log(data);
}

createQuery();
```

</MultiLanguage>

### 2. Materialize a view for the data you need

The data warehouse enables you to [save and materialize views](/docs/data-warehouse/views/materialize) of your data. This means the view is precomputed, which can significantly improve query performance.

To do this, write your query in the [SQL editor](https://us.posthog.com/sql), click **Materialize**, then **Save and materialize**, and give it a name without spaces (I chose `mat_event_count`). You can also schedule the view to update at a specific interval.

<ProductScreenshot
    imageLight="https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2025_06_19_at_16_12_28_2x_db1e37a5cf.png"
    imageDark="https://res.cloudinary.com/dmukukwp6/image/upload/Clean_Shot_2025_06_19_at_16_12_44_2x_90bd8f28ca.png"
    alt="Materialize view"
    classes="rounded"
/>

Once done, you can query the view like any other table.

<MultiLanguage>

```sql
SELECT * FROM mat_event_count
```

```bash
curl \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $POSTHOG_PERSONAL_API_KEY" \
  <ph_app_host>/api/projects/:project_id/query/ \
  -d '{
        "query": {
          "kind": "HogQLQuery",
          "query": "SELECT * FROM mat_event_count"
        },
        "name": "get materialized event count"
      }'
```

```python
import requests
import json

url = "<ph_app_host>/api/projects/{project_id}/query/"
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {POSTHOG_PERSONAL_API_KEY}'
}
payload = {
    "query": {
        "kind": "HogQLQuery",
        "query": """
            SELECT *
            FROM mat_event_count
        """
    },
    "name": "get materialized event count"
}
response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())
```

```node
import fetch from "node-fetch";

async function createQuery() {
  const url = "<ph_app_host>/api/projects/:project_id/query/";
  const headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer {POSTHOG_PERSONAL_API_KEY}"
  };

  const payload = {
    "query": {
      "kind": "HogQLQuery",
      "query": `
        SELECT *
        FROM mat_event_count
      `
    },
    "name": "get materialized event count"
  };

  const response = await fetch(url, {
    method: "POST",
    headers: headers,
    body: JSON.stringify(payload),
  });

  const data = await response.json();
  console.log(data);
}

createQuery();
```

</MultiLanguage>

### 3. Don't scan the same table multiple times

Reading a large table like `events` or `persons` more than once in the same query multiplies the work PostHog has to do (more I/O, more CPU, more memory). For example, this query is inefficient:

```sql
WITH us_events AS (
    SELECT *
    FROM events
    WHERE properties.$geoip_country_code = 'US'
),
ca_events AS (
    SELECT *
    FROM events
    WHERE properties.$geoip_country_code = 'CA'
)
SELECT *
FROM us_events
UNION ALL
SELECT *
FROM ca_events
```

Instead, pull the rows you need **once** and save them as a [materialized view](/docs/data-warehouse/views/materialize). You can then query that materialized view in all the other steps.

Start by saving this materialized view, e.g. as `base_events`:

```sql
SELECT event, properties.$geoip_country_code AS country
FROM events
WHERE properties.$geoip_country_code IN ('US', 'CA')
```

You can then query `base_events` in your main query, which avoids scanning the raw `events` table multiple times:

```sql
WITH us_events AS (
    SELECT event
    FROM base_events
    WHERE country = 'US'
),
ca_events AS (
    SELECT event
    FROM base_events
    WHERE country = 'CA'
)
SELECT *
FROM us_events
UNION ALL
SELECT *
FROM ca_events
```

### 4. Name your queries for easier debugging

Always provide a meaningful `name` parameter for your queries. This helps you:

- Identify slow or problematic queries in the [`query_log` table](/docs/data/query-log), as in the sketch below
- Analyze query performance patterns over time
- Debug issues more efficiently
- Track resource usage by query type

Good query names are descriptive and include the purpose:

- `daily_active_users_last_7_days`
- `funnel_signup_to_activation`
- `revenue_by_country_monthly`

Bad names are generic and vague:

- `query1`
- `test`
- `data`
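
Once queries are named, you can rank recent ones by duration. Here's a minimal sketch; the `name`, `query_start_time`, `query_duration_ms`, and `read_rows` columns are assumptions based on typical query-log schemas, so check the [`query_log` docs](/docs/data/query-log) for the exact fields:

```sql
-- Slowest named queries over the last day (column names assumed)
SELECT name, query_duration_ms, read_rows
FROM query_log
WHERE query_start_time >= now() - INTERVAL 1 DAY
ORDER BY query_duration_ms DESC
LIMIT 10
```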

### 5. Use timestamp-based pagination instead of OFFSET

When querying large datasets like `events` or `query_log` over multiple batches, avoid using `OFFSET` for pagination. Instead, use timestamp-based pagination, which is much more efficient and scales better.

**❌ Inefficient approach using OFFSET:**

```sql
-- First batch
SELECT timestamp, event, distinct_id
FROM events
WHERE timestamp >= '2024-01-01'
ORDER BY timestamp
LIMIT 1000;

-- Second batch
SELECT timestamp, event, distinct_id
FROM events
WHERE timestamp >= '2024-01-01'
ORDER BY timestamp
LIMIT 1000 OFFSET 1000; -- This gets slower with each page
```

**✅ Efficient approach using timestamp pagination:**

```sql
-- First batch
SELECT timestamp, event, distinct_id
FROM events
WHERE timestamp >= '2024-01-01'
ORDER BY timestamp
LIMIT 1000;

-- Second batch (use the timestamp of the last event from the previous batch)
SELECT timestamp, event, distinct_id
FROM events
WHERE timestamp > '2024-01-01 12:34:56.789' -- timestamp from the last row of the previous batch
ORDER BY timestamp
LIMIT 1000;
```

This approach is more efficient because:

- **Constant performance**: Each query executes in similar time regardless of how many rows you've already retrieved
- **Index-friendly**: Uses the timestamp index effectively for filtering
- **Scalable**: Performance doesn't degrade as you paginate through millions of rows

**For geeks:** OFFSET-based pagination gets progressively slower because the database must scan and skip all the offset rows for each query. With timestamp-based pagination, the database uses the timestamp index to jump directly to the right starting point, maintaining consistent performance across all pages.

### 6. Other SQL optimizations

Tips 1-5 make the most difference, but other generic SQL optimizations work too. See our [SQL docs](/docs/product-analytics/sql) for commands, useful functions, and more to help you with this.
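
For instance, two generic wins are selecting only the columns you need and filtering as early as possible. A minimal sketch (`$pageview` is just one example event to filter on):

```sql
-- Read two columns for one event type instead of SELECT * over everything
SELECT event, timestamp
FROM events
WHERE timestamp >= now() - INTERVAL 1 DAY
  AND event = '$pageview'
LIMIT 100
```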
