Commit 8f3d5b2

Creating redshift-useful-sql page from archived help.segment.com page
1 parent 414b320 commit 8f3d5b2

1 file changed

+344
-0
lines changed

@@ -0,0 +1,344 @@
1+
---
2+
title: Useful SQL Queries for Redshift
3+
---
4+
Below you'll find a library of some of the most useful SQL queries customers use in their Redshift warehouses. You can run these right in your Redshift instance with little to no modification.
5+
6+
> note " "
7+
> If you're looking for SQL queries for warehouses other than Redshift, check out some of our [Analyzing with SQL guides](/docs/connections/storage/warehouses/index/#analyzing-with-sql).
8+
9+
## Tracking events
10+
11+
The `track` method lets you record any actions your users perform. A track call carries three pieces of information: the `userId`, the event, and any optional properties.
12+
13+
Here's a basic track call:
14+
15+
```javascript
16+
analytics.track('Completed Order', {
17+
item: 'pants',
18+
color: 'blue',
19+
size: '32x32',
20+
payment: 'credit card'
21+
});
22+
```
23+
24+
And another completed order track call might look like this:
25+
26+
```javascript
27+
analytics.track('Completed Order', {
28+
item: 'shirt',
29+
color: 'green',
30+
size: 'Large',
31+
payment: 'paypal'
32+
});
33+
```
34+
35+
Each track call is stored as a distinct row in a single Redshift table called `tracks`. To get a table of your completed orders, you can run the following query:
36+
37+
```sql
38+
select *
39+
from initech.tracks
40+
where event = 'completed_order'
41+
```
42+
43+
This SQL query returns a table that looks like this:
44+
45+
| event | event_id | user_id | sent_at | item | color | size | payment |
46+
| --------------- | -------- | ---------- | ------------------- | ----- | ------ | ------ | ----------- |
47+
| completed_order | jrse3pyf | BQ9R7u4NA9 | 2021-12-09 03:50:00 | pants | blue | 32x33 | credit card |
48+
| completed_order | cxntjkc7 | TQ9D7x4NA4 | 2021-12-09 02:50:01 | pants | green | 31x33 | credit card |
49+
| completed_order | xjuvaely | APR97u8NB9 | 2021-12-09 01:50:39 | shirt | red | Medium | credit card |
50+
| completed_order | rft31ial | ly3jaeillp | 2021-12-09 08:50:13 | shirt | yellow | Large | credit card |
51+
| completed_order | k8bhgc6h | X9G5Qg0tha | 2021-12-09 07:20:19 | shirt | yellow | Small | paypal |
52+
53+
But why are there columns in the table that weren't a part of our **track** call, like `event_id`?
54+
This is because the track method automatically includes additional properties of the event, like `event_id`, `sent_at`, and `user_id` (for client-side libraries)!
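
Because `select *` reads every column, a narrower query is often cheaper on a columnar warehouse like Redshift. The following is a minimal sketch, assuming the column names shown in the sample table above and an arbitrary seven-day window. Adjust both to match your own schema.

```sql
-- Sketch: completed orders from the last 7 days, selected columns only.
-- Column names are taken from the sample table above; the 7-day window
-- is an illustrative assumption.
select event_id,
       user_id,
       sent_at,
       item,
       payment
from initech.tracks
where event = 'completed_order'
  and sent_at >= dateadd(day, -7, getdate())
order by sent_at desc
```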
55+
56+
## Grouping events by day
57+
If you want to know how many orders were completed over a span of time, you can use the `date()` and `count()` functions with the `sent_at` timestamp:
58+
59+
```sql
60+
select date(sent_at) as date, count(event)
61+
from initech.tracks
62+
where event = 'completed_order'
63+
group by date
64+
```
65+
That query will return a table like this:
66+
67+
| date | count |
68+
| ---------- | ----- |
69+
| 2021-12-09 | 5 |
70+
| 2021-12-08 | 3 |
71+
| 2021-12-07 | 2 |
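
Redshift doesn't guarantee row order unless the query includes an `order by` clause. As a sketch, here is the same daily count with an explicit sort, newest day first. The `order_date` and `orders` aliases are illustrative names, not part of the Segment schema.

```sql
-- Sketch: daily completed-order counts, sorted newest first.
-- "group by 1" and "order by 1" refer to the first selected column.
select date(sent_at) as order_date,
       count(*) as orders
from initech.tracks
where event = 'completed_order'
group by 1
order by 1 desc
```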
72+
73+
If you want to see how many pants and shirts were sold on each of those dates, you can use `case` expressions:
74+
75+
```sql
76+
select date(sent_at) as date,
77+
sum(case when item = 'shirt' then 1 else 0 end) as shirts,
78+
sum(case when item = 'pants' then 1 else 0 end) as pants
79+
from initech.tracks
80+
where event = 'completed_order'
81+
group by date
82+
```
83+
84+
That query returns a table like this:
85+
86+
| date | shirts | pants |
87+
| ---------- | ------ | ----- |
88+
| 2021-12-09 | 3 | 2 |
89+
| 2021-12-08 | 1 | 2 |
90+
| 2021-12-07 | 2 | 0 |
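
The `case` approach needs one column per item, so each new item means editing the query. One alternative sketch groups by item instead and returns one row per date and item. The alias names are illustrative, and the `item` column is assumed to exist as in the sample table above.

```sql
-- Sketch: one row per (date, item) instead of one column per item.
select date(sent_at) as order_date,
       item,
       count(*) as orders
from initech.tracks
where event = 'completed_order'
group by 1, 2
order by 1 desc, 2
```

This shape returns only the date and item combinations that actually occur, so a day with no pants orders has no pants row rather than a zero.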
91+
92+
93+
## Defining sessions
94+
Segment’s API does not impose any restrictions on your data with regard to user sessions.
95+
96+
Sessions aren’t fundamental facts about the user experience. They’re stories we build around the data to understand how customers actually use the product in their day-to-day lives. And since Segment’s API is about collecting raw, factual data, we don’t have an API for collecting sessions. We leave session interpretation to our partners, which let you design how you measure sessions based on how customers use your product.
97+
98+
### How to define user sessions using SQL
99+
Each of our SQL partners allows you to define sessions based on your specific business needs. With Looker, for example, you can take advantage of their persistent derived tables and LookML modeling language to layer sessionization on top of your Segment SQL data. We recommend checking out their approach here!
100+
101+
For defining sessions with raw SQL, the best query and explanation we’ve come across is from our friends at Mode Analytics.
102+
103+
Here’s the query to make it happen, but we definitely recommend checking out their blog post as well! They walk you through the reasoning behind the query, what each portion accomplishes, how you can tweak it to suit your needs, and what kind of further analysis you can do on top of it.
104+
105+
```sql
106+
-- Finding the start of every session
107+
SELECT *
108+
FROM (
109+
SELECT *,
110+
LAG(sent_at,1) OVER (PARTITION BY user_id ORDER BY sent_at) AS last_event
111+
FROM "your_source".tracks
112+
) last
113+
WHERE EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
114+
OR last_event IS NULL;
115+
116+
-- Mapping every event to its session
117+
SELECT *,
118+
SUM(is_new_session) OVER (ORDER BY user_id, sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
119+
SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS user_session_id
120+
FROM (
121+
SELECT *,
122+
CASE WHEN EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
123+
OR last_event IS NULL
124+
THEN 1 ELSE 0 END AS is_new_session
125+
FROM (
126+
SELECT *,
127+
LAG(sent_at,1) OVER (PARTITION BY user_id ORDER BY sent_at) AS last_event
128+
FROM "your_source".tracks
129+
) last
130+
) final
131+
```
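
As one example of further analysis, the same building blocks can be aggregated into a session count per user. This is a sketch rather than part of the Mode Analytics query: it reuses the 10-minute inactivity threshold from above, and the `lagged` and `flagged` subquery aliases are hypothetical names chosen for this example.

```sql
-- Sketch: number of sessions per user, using a 10-minute inactivity gap.
SELECT user_id,
       SUM(is_new_session) AS session_count
FROM (
  -- Flag an event as a session start when it is the user's first event
  -- or arrives 10 or more minutes after their previous event.
  SELECT user_id,
         CASE WHEN EXTRACT(EPOCH FROM sent_at) - EXTRACT(EPOCH FROM last_event) >= (60 * 10)
                OR last_event IS NULL
              THEN 1 ELSE 0 END AS is_new_session
  FROM (
    -- Attach each event's previous event time for the same user.
    SELECT user_id,
           sent_at,
           LAG(sent_at, 1) OVER (PARTITION BY user_id ORDER BY sent_at) AS last_event
    FROM "your_source".tracks
  ) lagged
) flagged
GROUP BY user_id
ORDER BY session_count DESC;
```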
132+
133+
## Identifying users
134+
135+
### Historical traits
136+
137+
The `identify` method ties user attributes to a `userId`.
138+
139+
```javascript
140+
analytics.identify('bob123', {
141+
142+
plan: 'Free'
143+
});
144+
```
145+
As these user traits change over time, you can keep calling the identify method to update them. Here we’ll update Bob’s account plan to “Premium”.
146+
147+
```javascript
148+
analytics.identify('bob123', {
149+
150+
plan: 'Premium'
151+
});
152+
```
153+
154+
Each identify call is stored in a single Redshift table called `identifies`. To see how a user's plan changes over time, you can run the following query:
155+
156+
```sql
157+
select user_id, email, plan, sent_at
158+
from initech.identifies
159+
where email = '[email protected]'
160+
```
161+
162+
This SQL query returns a table of Bob's account information, with each entry representing the state of his account at different time periods:
163+
164+
| user_id | email | plan | sent_at |
165+
| ------- | -------------- | ------- | ------------------- |
166+
| bob123 | [email protected] | Premium | 2013-12-20 19:44:03 |
167+
| bob123 | [email protected] | Basic | 2013-12-18 17:48:10 |
168+
169+
If you want to see what your users looked like at a previous point in time, that data is right there in your `identifies` table! (To get this table for your users, replace ‘initech’ with your source slug).
170+
171+
But what if you only want to see the most recent state of the user? Luckily, we can convert the `identifies` table into a distinct users table by taking the most recent identify call for each account.
172+
173+
## Converting the identifies table into a users table
174+
175+
The following query will return your `identifies` table:
176+
177+
```sql
178+
select *
179+
from initech.identifies
180+
```
181+
That query returns a table like this:
182+
183+
| user_id | email | plan | sent_at |
184+
| ------- | -------------- | ------- | ------------------- |
185+
| bob123 | [email protected] | Premium | 2013-12-20 19:44:03 |
186+
| bob123 | [email protected] | Basic | 2013-12-18 17:48:10 |
187+
| jeff123 | [email protected] | Premium | 2013-12-20 19:44:03 |
188+
| jeff123 | [email protected] | Basic | 2013-12-18 17:48:10 |
189+
190+
If all you want is a table of distinct users with their current traits and no duplicates, you can get it with the following query:
191+
192+
```sql
193+
with identifies as (
194+
select user_id,
195+
email,
196+
plan,
197+
sent_at,
198+
row_number() over (partition by user_id order by sent_at desc) as rn
199+
from initech.identifies
200+
),
201+
users as (
202+
select user_id,
203+
email,
204+
plan
205+
from identifies
206+
where rn = 1
207+
)
208+
209+
select *
210+
from users
211+
```
212+
213+
## Counts of user traits
214+
Let's say you have an `identifies` table that looks like this:
215+
216+
| user_id | email | plan | sent_at |
217+
| ------- | -------------- | ------- | ------------------- |
218+
| bob123 | [email protected] | Premium | 2013-12-20 19:44:03 |
219+
| bob123 | [email protected] | Basic | 2013-12-18 17:48:10 |
220+
| jeff123 | [email protected] | Premium | 2013-12-20 19:44:03 |
221+
| jeff123 | [email protected] | Basic | 2013-12-18 17:48:10 |
222+
223+
If we want to query the traits of these users, we first need to [convert the identifies table into a users table](#converting-the-identifies-table-into-a-users-table). From there, we can run a query like this to get a count of users with each type of plan:
224+
225+
```sql
226+
with identifies as (
227+
select user_id,
228+
email,
229+
plan,
230+
sent_at,
231+
row_number() over (partition by user_id order by sent_at desc) as rn
232+
from initech.identifies
233+
),
234+
users as (
235+
select plan
236+
from identifies
237+
where rn = 1
238+
)
239+
240+
select sum(case when plan = 'Premium' then 1 else 0 end) as premium,
241+
sum(case when plan = 'Free' then 1 else 0 end) as free
242+
from users
243+
```
244+
245+
And there you go: a count of users with each type of plan!
246+
247+
| premium | free |
248+
| ------- | ---- |
249+
| 2 | 0 |
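
The `case` version requires listing every plan name in the query. As an alternative sketch, grouping by `plan` returns one row per plan without naming them up front. The `user_count` alias is an illustrative name.

```sql
-- Sketch: count of users per current plan, one row per plan.
with identifies as (
  select user_id,
         plan,
         row_number() over (partition by user_id order by sent_at desc) as rn
  from initech.identifies
),
users as (
  -- Keep only the most recent identify call for each user.
  select plan
  from identifies
  where rn = 1
)

select plan,
       count(*) as user_count
from users
group by plan
order by user_count desc
```

Unlike the `case` version, this query only returns plans that at least one user currently has, so a plan with zero users doesn't appear in the result.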
250+
251+
## Groups to accounts
252+
253+
### Historical traits
254+
255+
The `group` method ties a user to a group, be it a company, organization, account, source, team, or whatever other name you use for the same concept. It also lets you record custom traits about the group, like industry or number of employees.
256+
257+
Here’s what a basic `group` call looks like:
258+
259+
```javascript
260+
analytics.group('0e8c78ea9d97a7b8185e8632', {
261+
name: 'Initech',
262+
industry: 'Technology',
263+
employees: 329,
264+
plan: 'Premium'
265+
});
266+
```
267+
As these group traits change over time, you can keep calling the group method to update them.
268+
269+
```javascript
270+
analytics.group('0e8c78ea9d97a7b8185e8632', {
271+
name: 'Initech',
272+
industry: 'Technology',
273+
employees: 600,
274+
plan: 'Enterprise'
275+
});
276+
```
277+
278+
Each group call is stored as a distinct row in a single Redshift table called `groups`. To see how a group changes over time, you can run the following query:
279+
280+
```sql
281+
select name, industry, employees, plan, sent_at
282+
from initech.groups
283+
where name = 'Initech'
284+
```
285+
286+
The previous query will return a table of Initech's group information, with each entry representing the state of the account at different times.
287+
288+
| name | industry | employees | plan | sent_at |
289+
| ------- | ---------- | --------- | ------- | ------------------- |
290+
| Initech | Technology | 600 | Premium | 2021-12-20 19:44:03 |
291+
| Initech | Technology | 349 | Free | 2013-12-18 17:18:15 |
292+
293+
If you want to see a group’s traits at a previous point in time, that data is right there in your groups table! (To get this table for your groups, replace ‘initech’ with your source slug).
294+
295+
But what if you only want to see the most recent state of the group? You can convert the groups table into a table of distinct groups by taking the most recent `group` call for each account.
296+
297+
### Converting the groups table into an organizations table
298+
299+
The following query will return your groups table:
300+
301+
```sql
302+
select *
303+
from initech.groups
304+
```
305+
306+
The previous query returns the following table:
307+
308+
| name | industry | employees | plan | sent_at |
309+
| --------- | ------------- | --------- | ------- | ------------------- |
310+
| Initech | Technology | 600 | Premium | 2021-12-20 19:44:03 |
311+
| Initech | Technology | 349 | Free | 2013-12-18 17:18:15 |
312+
| Acme Corp | Entertainment | 15 | Premium | 2021-12-20 19:44:03 |
313+
| Acme Corp | Entertainment | 10 | Free | 2013-12-18 17:18:15 |
314+
315+
However, if all you want is a table of distinct groups and current traits, you can do so with the following query:
316+
317+
```sql
318+
with groups as (
319+
select name,
320+
industry,
321+
employees,
322+
plan,
323+
sent_at,
324+
row_number() over (partition by name order by sent_at desc) as rn
325+
from initech.groups
326+
),
327+
organizations as (
328+
select name,
329+
industry,
330+
employees,
331+
plan
332+
from groups
333+
where rn = 1
334+
)
335+
336+
select *
337+
from organizations
338+
```
339+
This query will return a table with your distinct groups, without duplicates!
340+
341+
| name | industry | employees | plan |
342+
| --------- | ------------- | --------- | ------- |
343+
| Initech | Technology | 600 | Premium |
344+
| Acme Corp | Entertainment | 15 | Premium |
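
Group names aren't guaranteed to be unique or stable: renaming a group would split it into two rows when partitioning by `name`. A sketch of the same query keyed on the group's ID instead is shown below. It assumes your `groups` table stores the ID passed as the first argument of the `group` call in a column named `group_id`. Check the column name in your own schema before relying on it. The `ranked_groups` name is hypothetical.

```sql
-- Sketch: latest traits per group, keyed on an assumed group_id column.
with ranked_groups as (
  select group_id,
         name,
         industry,
         employees,
         plan,
         row_number() over (partition by group_id order by sent_at desc) as rn
  from initech.groups
)

select group_id,
       name,
       industry,
       employees,
       plan
from ranked_groups
where rn = 1
```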
