Commit 669c874

Merge pull request #4 from gavinjwl/dev/1.0.0
v1.0.0
2 parents 35cca8e + d61ce73 commit 669c874

44 files changed: +1927 -697 lines changed

.gitignore

Lines changed: 28 additions & 0 deletions
@@ -136,3 +136,31 @@ dmypy.json
 
 # AWS CDK
 cdk.out
+
+# General
+.DS_Store
+.AppleDouble
+.LSOverride
+
+# Icon must end with two \r
+Icon
+
+
+# Thumbnails
+._*
+
+# Files that might appear in the root of a volume
+.DocumentRevisions-V100
+.fseventsd
+.Spotlight-V100
+.TemporaryItems
+.Trashes
+.VolumeIcon.icns
+.com.apple.timemachine.donotpresent
+
+# Directories potentially created on remote AFP share
+.AppleDB
+.AppleDesktop
+Network Trash Folder
+Temporary Items
+.apdisk

README.en.md

Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
# Clickstream on AWS

[繁體中文版說明 (Traditional Chinese README)](./README.zh-tw.md)

## Getting started

We recommend deploying from a [Cloud9 environment](https://aws.amazon.com/cloud9/). Otherwise, make sure the following are installed locally before you start:

- [AWS CDK](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install) for constructing the AWS environment
- [Poetry](https://python-poetry.org/docs/#installation) for Python dependency management
- [Docker](https://docs.docker.com/engine/install/) runtime

## Deploying your AWS environment

### Clone the repo

```bash
git clone https://github.com/gavinjwl/clickstream-on-aws

cd clickstream-on-aws
```

### Activate Python virtual environment

```bash
poetry install

source .venv/bin/activate
```

### Deploy CDK stacks

Deploy all stacks:

```bash
cdk deploy --all \
    --parameters CoreStack:WriteKey='<define-your-write-key>' \
    --parameters CoreStack:RedshiftServerlessSubnetIds='<assign-subnets-to-redshift>' \
    --parameters CoreStack:RedshiftServerlessSecurityGroupIds='<assign-security-groups-for-redshift>'
```

Or deploy only CoreStack:

```bash
cdk deploy CoreStack \
    --parameters CoreStack:WriteKey='<define-your-write-key>' \
    --parameters CoreStack:RedshiftServerlessSubnetIds='<assign-subnets-to-redshift>' \
    --parameters CoreStack:RedshiftServerlessSecurityGroupIds='<assign-security-groups-for-redshift>'
```

Or deploy only CoreStack with Dashboard:

```bash
cdk deploy CoreStack Dashboard \
    --parameters CoreStack:WriteKey='<define-your-write-key>' \
    --parameters CoreStack:RedshiftServerlessSubnetIds='<assign-subnets-to-redshift>' \
    --parameters CoreStack:RedshiftServerlessSecurityGroupIds='<assign-security-groups-for-redshift>'
```

Or deploy only CoreStack with the Scheduled Refresh feature:

```bash
cdk deploy CoreStack ScheduledRefreshStack \
    --parameters CoreStack:WriteKey='<define-your-write-key>' \
    --parameters CoreStack:RedshiftServerlessSubnetIds='<assign-subnets-to-redshift>' \
    --parameters CoreStack:RedshiftServerlessSecurityGroupIds='<assign-security-groups-for-redshift>'
```
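
The deploy commands above differ only in the list of stacks they name. As a rough sketch (not part of this repo), the argument list could be assembled in Python as shown here. `build_cdk_deploy_args` is a hypothetical helper; the stack and parameter names are copied from the commands above, and the placeholder values still need to be replaced with your own.

```python
import shlex
from typing import Dict, List, Sequence


def build_cdk_deploy_args(stacks: Sequence[str], core_params: Dict[str, str]) -> List[str]:
    """Return the argv list for `cdk deploy`; an empty `stacks` means `--all`."""
    args = ["cdk", "deploy"]
    args += list(stacks) if stacks else ["--all"]
    # Every parameter in the README belongs to CoreStack, so prefix it here.
    for name, value in core_params.items():
        args += ["--parameters", f"CoreStack:{name}={value}"]
    return args


# Placeholder values, exactly as in the README commands.
params = {
    "WriteKey": "<define-your-write-key>",
    "RedshiftServerlessSubnetIds": "<assign-subnets-to-redshift>",
    "RedshiftServerlessSecurityGroupIds": "<assign-security-groups-for-redshift>",
}

# Print a shell-quoted command line for CoreStack plus ScheduledRefreshStack.
print(shlex.join(build_cdk_deploy_args(["CoreStack", "ScheduledRefreshStack"], params)))
```

Returning a list rather than a single string keeps each value as one argument, so it could be handed to `subprocess.run` without extra quoting.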

### Change Redshift Serverless Namespace password

After the CDK deployment completes, change the Redshift Serverless namespace password so that you can connect.

![redshift-change-namespace-password](images/redshift-change-namespace-password.png)

### Connect to Redshift Serverless Namespace

Connect to the Redshift Serverless namespace with a username and password. The following shows how to connect with Query Editor V2.

![redshift-connect-with-password](images/redshift-connect-with-password.png)

### Enable Redshift Streaming Ingestion

Create an external schema for the Kinesis stream

```sql
-- Create external schema for kinesis
CREATE EXTERNAL SCHEMA IF NOT EXISTS kinesis FROM KINESIS IAM_ROLE default;
```

Create clickstream schema

```sql
-- Create schema for clickstream
CREATE SCHEMA IF NOT EXISTS clickstream;
```

Create user and grant permissions

```sql
-- Create clickstream user and grant required permissions
-- Please do not change `IAMR:ClickstreamRedshiftRole`
CREATE USER "IAMR:ClickstreamRedshiftRole" PASSWORD DISABLE;

GRANT ALL ON SCHEMA kinesis TO "IAMR:ClickstreamRedshiftRole";

GRANT ALL ON SCHEMA clickstream TO "IAMR:ClickstreamRedshiftRole";
GRANT ALL ON ALL TABLES IN SCHEMA clickstream TO "IAMR:ClickstreamRedshiftRole";
```

Create a materialized view to consume the stream data

```sql
SET enable_case_sensitive_identifier TO true;
CREATE MATERIALIZED VIEW clickstream.mv_kinesisSource
AS
SELECT
    ApproximateArrivalTimestamp AS approximateArrivalTimestamp,
    PartitionKey AS partitionKey,
    ShardId AS shardId,
    SequenceNumber AS sequenceNumber,
    -- JSON_PARSE(from_varbyte(Data, 'utf-8')) as data,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'messageId')::VARCHAR AS messageId,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'timestamp')::VARCHAR AS event_timestamp,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'type')::VARCHAR AS type,
    -- Common
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'userId')::VARCHAR AS userId,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'anonymousId')::VARCHAR AS anonymousId,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'context')::SUPER AS context,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'integrations')::SUPER AS integrations,

    -- Identify
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'traits')::SUPER AS traits,

    -- Track
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'event')::VARCHAR AS event,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'properties')::SUPER AS properties,

    -- Alias
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'previousId')::VARCHAR AS previousId,

    -- Group
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'groupId')::VARCHAR AS groupId,

    -- Page
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'category')::VARCHAR AS category,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'name')::VARCHAR AS name
FROM kinesis."ClickstreamKinesisStream"
WHERE is_utf8(Data) AND is_valid_json(from_varbyte(Data, 'utf-8'));
```
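
To make the view's per-record logic easier to follow, here is a minimal Python sketch of roughly what happens to one Kinesis record payload: decode it as UTF-8, discard it if it is not a JSON object, then pull out the top-level keys. `extract_row` is a hypothetical name, the stream metadata columns (shard, sequence number, and so on) are left out, and using `None` for a missing key is an assumption of this sketch rather than a statement about how Redshift represents it.

```python
import json
from typing import Any, Dict, Optional

# Top-level keys the view casts to VARCHAR and to SUPER, respectively.
VARCHAR_FIELDS = (
    "messageId", "timestamp", "type", "userId", "anonymousId",
    "event", "previousId", "groupId", "category", "name",
)
SUPER_FIELDS = ("context", "integrations", "traits", "properties")


def extract_row(data: bytes) -> Optional[Dict[str, Any]]:
    """Approximate the view for one record payload; None means filtered out."""
    try:
        payload = json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # roughly the is_utf8 / is_valid_json filter in the WHERE clause
    if not isinstance(payload, dict):
        return None  # this sketch only accepts JSON objects
    row = {name: payload.get(name) for name in VARCHAR_FIELDS + SUPER_FIELDS}
    # The view exposes the 'timestamp' key under the column name event_timestamp.
    row["event_timestamp"] = row.pop("timestamp")
    return row


sample = b'{"messageId": "m1", "type": "track", "event": "Clicked", "properties": {"x": 1}}'
print(extract_row(sample))
```

Records that fail the decode or parse step are dropped rather than raising, which mirrors how the `WHERE` clause keeps malformed payloads out of the view.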

Change materialized view owner to `IAMR:ClickstreamRedshiftRole` so that ScheduledRefreshStack can work.

```sql
SET enable_case_sensitive_identifier TO true;
ALTER TABLE clickstream.mv_kinesisSource OWNER TO "IAMR:ClickstreamRedshiftRole";
```

## Simulate clickstream

- The easiest way to simulate traffic is to run the following command. See [simulator.py](simulator.py) for details.

```bash
# Activate your Python venv if it is not already active
source .venv/bin/activate

# Run the simulator
python3 simulator.py --host <API Gateway URL> --writeKey <Your Write Key>
```
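
For a sense of what a single simulated event might look like on the wire, the sketch below builds (but does not send) one `track` request. The field names match the keys that the materialized view extracts. The `/v1/track` path and HTTP Basic authentication with the write key as the username follow the Segment HTTP API convention and are assumptions here; this is not taken from `simulator.py`, and `build_track_request` is a hypothetical helper.

```python
import base64
import json
import uuid
from datetime import datetime, timezone
from urllib.request import Request


def build_track_request(host: str, write_key: str, user_id: str,
                        event: str, properties: dict) -> Request:
    """Build a POST request carrying one Segment-style `track` event."""
    body = {
        "messageId": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "type": "track",
        "userId": user_id,
        "event": event,
        "properties": properties,
    }
    # Assumed auth scheme: Basic auth, write key as username, empty password.
    token = base64.b64encode(f"{write_key}:".encode("utf-8")).decode("ascii")
    return Request(
        url=f"{host.rstrip('/')}/v1/track",  # assumed endpoint path
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


request = build_track_request(
    "https://example.invalid", "<Your Write Key>", "user-1", "Button Clicked", {"page": "/home"}
)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send it; check the endpoint path and auth scheme against your deployed API Gateway before doing so.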

- If you want to simulate more users, you can use [Locust](https://docs.locust.io/en/stable/).

**Note**
You need to change `HOST = '<API Gateway URL>'` and `WRITE_KEY = '<Your Write Key>'` in [main.py](./benchmark/main.py) first.

```bash
# Activate your Python venv if it is not already active
source .venv/bin/activate

# Start locust
locust -f benchmark/main.py \
    --web-port 8089

# Open the Locust web UI (port 8089) in your browser, then enter <API Gateway URL> and the number of users you want.
```

## Explore clickstream data

Open [Redshift Query Editor V2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html)

```sql
SET enable_case_sensitive_identifier TO true;

SELECT *
FROM clickstream.mv_kinesisSource
LIMIT 10
;
```

## Install Tracking Code

**Note**
In any SDK, set `HOST` to your API Gateway URL and `WRITE_KEY` to the value you defined during CDK deployment.

### Client Side based

- [Google Tag Manager](https://segment.com/catalog/integrations/google-tag-manager/)
- [Pure JavaScript](https://segment.com/docs/connections/sources/catalog/libraries/website/javascript/)
- [Android](https://segment.com/docs/connections/sources/catalog/libraries/mobile/android/)
- [iOS](https://segment.com/docs/connections/sources/catalog/libraries/mobile/ios/)

[Full List](https://segment.com/docs/connections/sources/catalog/#website)

### Server Side based

- [Java](https://segment.com/docs/connections/sources/catalog/libraries/server/java/)
- [.NET](https://segment.com/docs/connections/sources/catalog/libraries/server/net/)
- [PHP](https://segment.com/docs/connections/sources/catalog/libraries/server/php/)
- [Python](https://segment.com/docs/connections/sources/catalog/libraries/server/python/)

[Full List](https://segment.com/docs/connections/sources/catalog/#server)
