src/connections/storage/catalog/redshift/index.md
redirect_from:
  - '/connections/warehouses/catalog/redshift/'
---
This guide explains how to provision a Redshift cluster and allow the Segment warehouse connector to write to it.
## Getting Started
Complete the following steps to provision your Redshift cluster, and connect Segment to it:
1. [Choose the best instance for your needs](#choose-the-best-instance-for-your-needs)
2. [Provision a new Redshift Cluster](#provision-a-new-redshift-cluster)
3. [Create a database user](#create-a-database-user)
4. [Connect Redshift to Segment](#connect-redshift-to-segment)
## Choose the best instance for your needs
While the number of events (database records) is important, the storage capacity usage of your cluster depends primarily on the number of unique tables and columns created in the cluster. Keep in mind that each unique `.track()` event creates a new table, and each property sent creates a new column in that table. To avoid storing unnecessary data, start with a detailed [tracking plan](/docs/protocols/tracking-plan/create/) before you install Segment libraries to ensure that only the necessary events are passed to Segment.
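To see how this plays out in your own cluster, you can count the tables and columns a source has created so far. This is a hedged sketch: `information_schema.columns` is available in Redshift, and the schema name is a placeholder for the schema Segment created for your source.

```sql
-- Count the tables and columns Segment has created for one source.
-- Each unique .track() event adds a table; each new property adds a column.
SELECT COUNT(DISTINCT table_name) AS table_count,
       COUNT(*)                   AS column_count
FROM information_schema.columns
WHERE table_schema = '<your source schema>';
```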
Redshift offers three cluster types:
- **Dense Compute**: These clusters are designed to maximize query speed and performance at the expense of storage capacity. This is done by using fast CPUs, large amounts of RAM, and solid-state storage. While there are no hard and fast rules for sizing a cluster, customers with fewer than 20 million monthly events should start with a single DC1 node cluster and add nodes as needed. A single node cluster includes 200GB, with a max size of 2.56TB.
- **Dense Storage**: These clusters are designed to maximize storage capacity for customers who have hundreds of millions of events and prefer to save money on Redshift hosting costs. This is done by using slower CPUs, less RAM, and disk-based storage. A single DS2 node cluster includes 2TB of space, with a max size of 16TB.
- **RA3**: These clusters provide managed storage and help optimize your data warehouse by separating the cost of compute from the cost of storage.
## Provision a new Redshift Cluster
Follow the steps below to create a new Redshift cluster. If you have a cluster already provisioned, skip this step.
1. From the Redshift dashboard, click **Create Cluster**.
2. Name your new cluster, and select the type and size of the nodes within the cluster. 
3. Configure the database connection details and admin user for the cluster. 
Now that you've provisioned your Redshift cluster, configure it to allow Segment access.
## Create a Database User
The username and password you created for your cluster are your admin credentials, which you should keep for your own use. For Segment, and any other third parties, it's best to create distinct users. This allows you to isolate queries from one another using [WLM](http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html) and makes audits easier.
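As a minimal illustration of that isolation, Redshift routes queries to WLM queues by query group. The sketch below assumes you've already defined a queue matching the query group `segment` in your cluster's WLM configuration:

```sql
-- Assumes a WLM queue whose query group is 'segment' exists in your
-- parameter group. Queries in this session are routed to that queue,
-- keeping them separate from other workloads on the cluster.
SET query_group TO 'segment';
-- ...run or audit Segment-related queries here...
RESET query_group;
```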
To create a [new user](http://docs.aws.amazon.com/redshift/latest/dg/r_Users.html), log into the Redshift cluster directly (using the credentials you defined in Step 3 above) and run the following SQL commands:
```sql
-- create a user named "segment" that Segment will use when connecting to your Redshift cluster
CREATE USER segment PASSWORD '<enter password here>';
-- allow the "segment" user to create new schemas on the specified database
GRANT CREATE ON DATABASE "<enter database name here>" TO "segment";
```
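To confirm the user and grant took effect, a quick sanity check (run as the admin user) might look like the following; `pg_user` and `has_database_privilege` are standard system objects available in Redshift:

```sql
-- Confirm the "segment" user exists...
SELECT usename FROM pg_user WHERE usename = 'segment';

-- ...and that it can create schemas on your database (returns true/false).
SELECT has_database_privilege('segment', '<enter database name here>', 'CREATE');
```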
When you configure your warehouse in Segment, use the username/password you've created here instead of your admin account.
## Connect Redshift to Segment
After creating a Redshift warehouse, the next step is to connect Segment:
1. In the Segment App, navigate to the Connections tab and click **Add Destination**.
2. Search for and select `Redshift`.
3. Add the necessary connection details, including your Redshift credentials.
4. Select which sources and collections/properties will sync to this warehouse.
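Once the first sync completes, you can verify from the warehouse side that Segment is writing data. A hedged check, with the schema name as a placeholder:

```sql
-- List the tables Segment has created for a source after its first sync.
SELECT table_name
FROM information_schema.tables
WHERE table_schema = '<your source schema>'
ORDER BY table_name;
```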
## Security
VPCs keep servers inaccessible to traffic from the internet. With a VPC, you can grant specific servers access to your own. In this case, you will be whitelisting the [Segment IPs](/docs/connections/storage/warehouses/faq#which-ips-should-i-whitelist) to write to your data warehouse.
## Best practices
### Networking
Redshift clusters are created in a VPC subnet. To configure network access:
1. Navigate to the Redshift cluster you created previously. Click **Edit**.
2. Expand the *Network and security* section. Click *Open tab* to access the Network and security settings. 
3. Click the VPC security group to access its settings. The security group opens in a new tab. 
4. Click the security group in the list to open its settings.
5. On the Inbound tab, add or edit a rule to enable Segment to write to your Redshift port from `52.25.130.38/32`. 
6. On the Outbound tab, ensure Redshift can make outbound requests to the Segment S3 bucket. The default behavior is to allow all outbound traffic, but security groups can limit outbound behavior. 
7. Navigate back to the cluster's settings, and click **Edit publicly accessible** to allow access to the cluster from outside of the VPC. 
### Electing to encrypt data
You can encrypt data in the Redshift console. Encryption does not affect Segment's ability to read or write.
### Distribution Key
The `id` column is the common distribution key used across all tables.
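If you want to verify how your tables are distributed, one way (sketched here; the schema name is a placeholder) is the `svv_table_info` system view, whose `diststyle` column reports each table's distribution style and key:

```sql
-- Inspect distribution style/key for every table in a Segment source schema.
SELECT "table", diststyle
FROM svv_table_info
WHERE schema = '<your source schema>'
ORDER BY "table";
```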
### Reserved Words
Redshift limits the use of [reserved words](http://docs.aws.amazon.com/redshift/latest/dg/r_pg_keywords.html) in schema, table, and column names. Additionally, avoid naming traits or properties that conflict with top-level Segment fields (for example, `userId`, `receivedAt`, or `messageId`). Traits and properties that conflict with Redshift or Segment fields are prefixed with `_` when Segment creates columns for them in your schema.
Redshift limits the use of integers at the start of a schema or table name. Segment prepends an underscore `_` to any schema, table or column name that starts with an integer. A source named `3doctors` is loaded into a Redshift schema named `_3doctors`.
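As a hypothetical illustration (these schema, table, and column names are examples, not values from your warehouse), querying the `3doctors` source would look like this:

```sql
-- The source "3doctors" lands in the "_3doctors" schema because it starts
-- with an integer; conflicting property names are likewise "_"-prefixed.
SELECT received_at, context_page_path
FROM _3doctors.pages
LIMIT 10;
```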
### CPU
In a typical workload, Redshift uses around 20-40% of CPU. Segment takes advantage of the COPY command to make full use of your cluster and load your data as efficiently as possible.
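For context, a COPY-based bulk load looks roughly like the sketch below. The bucket path and IAM role are illustrative placeholders, not the actual values Segment uses internally:

```sql
-- COPY loads data in parallel across the cluster's slices, which is far
-- faster than row-by-row INSERTs. Placeholders are for illustration only.
COPY "<your source schema>".pages
FROM 's3://<staging-bucket>/pages/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-load-role>'
GZIP;
```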
## Troubleshooting
### How do I improve Query Speed?
The speed of your queries depends on the capabilities of the hardware you have chosen as well as the size of the dataset. The amount of data stored in the cluster also impacts query speed. For Redshift clusters, if you're above 75% capacity, you will likely experience degraded query speed. [Here's a guide on how to improve your query speeds.](/docs/connections/storage/warehouses/redshift-tuning/)
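One way to check where you stand against that 75% threshold is the `stv_partitions` system table, which reports per-slice disk capacity and usage in 1 MB blocks. A sketch:

```sql
-- Approximate percentage of cluster disk in use; sustained values above
-- ~75% are a signal to resize the cluster or prune data.
SELECT SUM(used)::float / SUM(capacity) * 100 AS pct_disk_used
FROM stv_partitions;
```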
## FAQ
### How do I sync data in and out between Redshift and Segment?
It's often the case that customers want to combine 1st-party transactional and operational data with Segment data to generate a full view of the customer. The challenge is that those data sets are often stored in separate data warehouses.
If you're interested in importing data into a Redshift cluster, it's important that you follow these [guidelines](/docs/connections/storage/warehouses/faq/).