Commit fba1080

Author: markzegarelli
Merge pull request #1512 from segmentio/DOC-151_Redshift-update
DOC 151 Redshift update
2 parents 29bd300 + f510138 commit fba1080

File tree: 10 files changed (+42 -98 lines changed)
(8 binary image files changed; previews not shown)

src/connections/storage/catalog/redshift/index.md

Lines changed: 42 additions & 98 deletions
@@ -5,54 +5,45 @@ redirect_from:
 - '/connections/warehouses/catalog/redshift/'
 ---
 
-This guide will explain how to provision a Redshift cluster and allow the Segment warehouse connector to write to it.
-
-This document was last updated on 23rd April, 2018. If you notice any gaps, out-dated information or simply want to leave some feedback to help us improve our documentation, [let us know](https://segment.com/help/contact)!
+This guide explains how to provision a Redshift cluster and allow the Segment warehouse connector to write to it.
 
 ## Getting Started
 
-There are four steps to get started using Redshift with Segment:
-
-1. Pick the best instance for your needs
-2. Provision a new Redshift Cluster
-3. Create a database user
-4. Connect Redshift to Segment
+Complete the following steps to provision your Redshift cluster and connect Segment to it:
 
-### Pick the best instance for your needs
+1. [Choose the best instance for your needs](#choose-the-best-instance-for-your-needs)
+2. [Provision a new Redshift Cluster](#provision-a-new-redshift-cluster)
+3. [Create a database user](#create-a-database-user)
+4. [Connect Redshift to Segment](#connect-redshift-to-segment)
 
-While the number of events (database records) are important, the storage capacity utilization of your cluster depends primarily on the number of unique tables and columns created in the cluster. Keep in mind that each unique `.track()` event creates a new table, and each property sent creates a new column in that table. For reason, we highly recommend starting with a detailed [tracking plan](/docs/protocols/tracking-plan/create/) before implementing Segment libraries to ensure that only necessary events are being passed to Segment in a consistent way.
+## Choose the best instance for your needs
 
-There are two kinds of Redshift clusters: **Dense Compute** and **Dense Storage**
+While the number of events (database records) is important, the storage capacity usage of your cluster depends primarily on the number of unique tables and columns created in the cluster. Keep in mind that each unique `.track()` event creates a new table, and each property sent creates a new column in that table. To avoid storing unnecessary data, start with a detailed [tracking plan](/docs/protocols/tracking-plan/create/) before you install Segment libraries to ensure that only the necessary events are passed to Segment.
 
-**Dense Compute** clusters are designed to maximize query speed and performance at the expense of storage capacity. This is done by using fast CPUs, large amounts of RAM and solid-state storage. While there are no hard and fast rules for sizing a cluster, we recommend that customers with fewer than 20 million monthly events start with a single DC1 node cluster and add nodes as needed. A single node cluster includes 200GB, with a max size of 2.56TB.
+Redshift offers three cluster types:
 
-**Dense Storage** clusters are designed to maximize the amount of storage capacity for customers who have 100s of millions of events and prefer to save money on Redshift hosting costs. This is done by using slower CPUs, less RAM, and disk-based storage. A single DS2 node cluster includes 2TB of space, with a max size of 16TB.
+- **Dense Compute**: These clusters are designed to maximize query speed and performance at the expense of storage capacity. This is done by using fast CPUs, large amounts of RAM, and solid-state storage. While there are no hard and fast rules for sizing a cluster, customers with fewer than 20 million monthly events should start with a single DC1 node cluster and add nodes as needed. A single node cluster includes 200GB, with a max size of 2.56TB.
+- **Dense Storage**: These clusters are designed to maximize the amount of storage capacity for customers who have hundreds of millions of events and prefer to save money on Redshift hosting costs. This is done by using slower CPUs, less RAM, and disk-based storage. A single DS2 node cluster includes 2TB of space, with a max size of 16TB.
+- **RA3**: These clusters provide managed storage that helps optimize your data warehouse by splitting the cost of compute and storage.
 
-### Provision a new Redshift Cluster
 
-You can skip this step if you already have a Redshift cluster:
-1. Open the Redshift Console
-![](images/Screen+Shot+2015-09-17+at+10.25.47+AM.png)
+## Provision a new Redshift Cluster
 
-2. Click on "Launch Cluster"
-![](images/Screen+Shot+2015-09-17+at+10.26.03+AM.png)
+Follow the steps below to create a new Redshift cluster. If you already have a provisioned cluster, skip this step.
 
-3. Fill out the cluster details (make sure to select a secure password!)
-![Image](images/cVcF5ZtC51a+.png)
+1. From the Redshift dashboard, click **Create Cluster**.
 
-4. Choose your cluster size:
-![](images/1442616281635_undefined.png)
-
-5. set up your cluster Security Group or VPC and proceed to review (see below for instructions on settings up a VPC group)
+2. Name your new cluster, and select the type and size of the nodes within the cluster. ![create the cluster](images/redshift01.png)
 
+3. Configure the database connection details and admin user for the cluster. ![db user](images/redshift02.png)
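
If you'd rather script provisioning than click through the console, the steps above have an AWS SDK equivalent. This is a minimal sketch using Python and boto3, not part of Segment's documented setup: the cluster identifier, database name, and admin credentials are placeholder values, and it assumes your AWS credentials and region are already configured.

```python
import boto3

# Minimal sketch: provision a cluster with boto3 (all identifiers are placeholders).
redshift = boto3.client("redshift", region_name="us-west-2")

redshift.create_cluster(
    ClusterIdentifier="segment-warehouse",  # hypothetical cluster name
    NodeType="dc2.large",                   # pick a Dense Compute, Dense Storage, or RA3 node type
    ClusterType="single-node",              # start small; add nodes as event volume grows
    DBName="segment",                       # hypothetical database name
    MasterUsername="admin",                 # admin credentials, kept for your own use
    MasterUserPassword="<secure password>",
)

# Block until the cluster is reachable before creating users or connecting Segment.
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="segment-warehouse")
```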
 
 Now that you've provisioned your Redshift cluster, you'll need to configure your Redshift cluster to allow Segment to access it.
 
-### Create a Database User
+## Create a Database User
 
-The username and password you've already created for your cluster is your admin password, which you should keep for your own usage. For Segment, and any other 3rd-parties, it is best to create distinct users. This will allow you to isolate queries from one another using [WLM](http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html) and perform audits easier.
+The username and password you created with your cluster are your admin credentials, which you should keep for your own use. For Segment, and any other 3rd-parties, it is best to create distinct users. This allows you to isolate queries from one another using [WLM](http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html) and perform audits more easily.
 
-To create a [new user](http://docs.aws.amazon.com/redshift/latest/dg/r_Users.html), you'll need to log into the Redshift database directly and run the following SQL commands:
+To create a [new user](http://docs.aws.amazon.com/redshift/latest/dg/r_Users.html), log into the Redshift cluster directly (using the credentials you defined in Step 3 above) and run the following SQL commands:
 
 ```sql
 -- create a user named "segment" that Segment will use when connecting to your Redshift cluster.
@@ -62,89 +53,43 @@ CREATE USER segment PASSWORD '<enter password here>';
 GRANT CREATE ON DATABASE "<enter database name here>" TO "segment";
 ```
 
-When setting up your warehouse in Segment, use the username/password you've created here instead of your admin account.
+When you configure your warehouse in Segment, use the username/password you've created here instead of your admin account.
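
If you manage database users in code rather than from a SQL console, the same two statements can be run through a Postgres driver, since Redshift speaks the Postgres wire protocol. A sketch using Python and psycopg2, with a hypothetical host and placeholder credentials:

```python
import psycopg2

# Connect as the admin user defined when the cluster was created (placeholders throughout).
conn = psycopg2.connect(
    host="segment-warehouse.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,  # Redshift's default port
    dbname="segment",
    user="admin",
    password="<admin password>",
)
conn.autocommit = True

with conn.cursor() as cur:
    # Create the distinct "segment" user and allow it to create schemas in the database.
    cur.execute("CREATE USER segment PASSWORD %s;", ("<enter password here>",))
    cur.execute('GRANT CREATE ON DATABASE "segment" TO "segment";')

conn.close()
```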
 
-### Connect Redshift to Segment
+## Connect Redshift to Segment
 
 After creating a Redshift warehouse, the next step is to connect Segment:
 
-1. In the Segment App, select 'Add Destination'
-2. Search for and select 'Redshift'
-3. Select which sources and collections/properties will sync to this Warehouse
-3. Enter your Redshift credentials
+1. In the Segment App, navigate to the Connections tab and click **Add Destination**.
+2. Search for and select `Redshift`.
+3. Enter the necessary connection details, including your Redshift credentials.
+4. Select which sources and collections/properties will sync to this Warehouse.
 
 ## Security
 
 VPCs keep servers inaccessible to traffic from the internet. With VPC, you're able to designate specific web servers access to your servers. In this case, you will be whitelisting the [Segment IPs](/docs/connections/storage/warehouses/faq#which-ips-should-i-whitelist) to write to your data warehouse.
 
-## Best Practice
+## Best practices
 
 ### Networking
 
-Redshift clusters can either be in a **EC2 Classic subnet** or **VPC subnet**.
-
-If your cluster has a field called `Cluster Security Groups`, proceed to [EC2 Classic](//docs/connections/storage/catalog/redshift/#ec2-classic)
-![](images/redshift_permissioning1.png)
-
-Or if your cluster has a field called `VPC Security Groups`, proceed to [EC2 VPC](/docs/connections/storage/catalog/redshift/#ec2-vpc)
-![](images/redshift_permissioning2.png)
-
-#### EC2-Classic
-
-1. Navigate to your Redshift Cluster settings: `Redshift Dashboard > Clusters > Select Your Cluster`
-
-2. Click on the Cluster Security Groups
-
-![](images/redshift_permissioning4.png)
-
-3. Open the Cluster Security Group
-
-![](images/redshift_permissioning5.png)
-
-4. Click on "Add Connection Type"
-
-![](images/redshift_permissioning6.png)
+Redshift clusters are created in a VPC subnet. To configure network access:
 
-5. Choose Connection Type CIDR/IP and authorize Segment to write into your Redshift Port using `52.25.130.38/32`
+1. Navigate to the Redshift cluster you created previously, and click **Edit**.
 
-![](images/redshift_permissioning7.png)
+2. Expand the *Network and security* section. Click *Open tab* to access the Network and security settings. ![security](images/redshift03.png)
 
-#### EC2-VPC
+3. Click the VPC security group. The security group opens in a new tab. ![group](images/redshift04.png)
 
-1. Navigate to your `Redshift Dashboard > Clusters > Select Your Cluster`
+4. Click the Security group in the list to access its settings.
 
-2. Click on the VPC Security Groups
+5. On the Inbound tab, add or edit a rule to enable Segment to write to your Redshift port from `52.25.130.38/32`. ![inbound](images/redshift05.png)
 
-![](images/redshift_permissioning8.png)
+6. On the Outbound tab, ensure Redshift can make outbound requests to the Segment S3 bucket. The default behavior is to allow all outbound traffic, but security groups can limit outbound behavior. ![outbound](images/redshift06.png)
 
-3. Select the "Inbound" tab and then "Edit"
+7. Navigate back to the cluster's settings, and click **Edit publicly accessible** to allow access to the cluster from outside of the VPC. ![public](images/redshift07.png)
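
Steps 5 and 7 also have scripted equivalents. A sketch with Python and boto3, assuming a hypothetical security group ID and cluster identifier (neither comes from Segment's docs):

```python
import boto3

# Step 5: authorize the Segment IP to reach your Redshift port (placeholder group ID).
ec2 = boto3.client("ec2", region_name="us-west-2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # the VPC security group attached to your cluster
    IpProtocol="tcp",
    FromPort=5439,  # your Redshift port
    ToPort=5439,
    CidrIp="52.25.130.38/32",  # the Segment IP from step 5
)

# Step 7: make the cluster reachable from outside the VPC.
redshift = boto3.client("redshift", region_name="us-west-2")
redshift.modify_cluster(
    ClusterIdentifier="segment-warehouse",  # hypothetical cluster name
    PubliclyAccessible=True,
)
```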
 
-![](images/redshift_permissioning9.png)
+### Electing to encrypt data
 
-4. Allow Segment to write into your Redshift Port using `52.25.130.38/32`
-
-![](images/redshift_permissioning10.png)
-
-You can find more information on that [here](http://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-vpc.html)
-
-5. Navigate back to your Redshift Cluster Settings: `Redshift Dashboard > Clusters > Select Your Cluster`
-
-6. Select the "Cluster" button and then "Modify"
-![](images/redshift_cluster_modify.png)
-
-7. Make sure the "Publicly Accessible" option is set to "Yes"
-![](images/rs-mgmt-clusters-modify.png)
-
-8. Check your "Outbound" tab to make sure your Redshift instance is set up to make outbound requests to the Segment S3 bucket. The default behavior is to allow all outbound traffic, but security groups can be put in place to limit outbound behavior.
-
-![](images/redshift_outbound_permissions.png)
-
-9. If your outbound traffic is not configured to allow all traffic, you can switch to default settings or specifically whitelist the Segment S3 buckets
-
-![](images/redshift_custom_outbound_group.png)
-
-### Electing to encrypt your data 
-
-You can elect to encrypt your data in your Redshift console and it will not affect Segment's ability to read or write.
+You can encrypt data in the Redshift console. Encryption does not affect Segment's ability to read or write.
 
 
 ### Distribution Key
@@ -153,26 +98,25 @@ The `id` column is the common distribution key used across all tables. When you
 
 ### Reserved Words
 
-Redshift limits the use of [reserved words](http://docs.aws.amazon.com/redshift/latest/dg/r_pg_keywords.html) in schema, table, and column names. Additionally, you should avoid naming traits or properties that conflict with top level Segment fields (e.g. userId, receivedAt, messageId, etc.). These traits and properties that conflict with Redshift or Segment fields will be `_`-prefixed when we create columns for them in your schema, but keeping track of which is which (Segment-reserved vs. custom property columns) can be tricky!
-
-Redshift limits the use of integers at the start of a schema or table name. We will automatically prepend a `_` to any schema, table or column name that starts with an integer. So a source named '3doctors' will be loaded into a Redshift schema named `_3doctors`.
+Redshift limits the use of [reserved words](http://docs.aws.amazon.com/redshift/latest/dg/r_pg_keywords.html) in schema, table, and column names. Additionally, avoid naming traits or properties that conflict with top level Segment fields (for example, `userId`, `receivedAt`, or `messageId`). Traits and properties that conflict with Redshift or Segment fields are `_`-prefixed when Segment creates columns for them in your schema.
 
+Redshift limits the use of integers at the start of a schema or table name. Segment prepends an underscore `_` to any schema, table, or column name that starts with an integer. A source named `3doctors` is loaded into a Redshift schema named `_3doctors`.
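
As an illustration, the leading-integer rule behaves like this sketch. The function is hypothetical, not Segment's actual implementation, and it covers only the integer case, not reserved-word conflicts:

```python
def redshift_safe_name(name: str) -> str:
    """Sketch of the renaming rule: names starting with an integer get a leading underscore."""
    return f"_{name}" if name[:1].isdigit() else name

assert redshift_safe_name("3doctors") == "_3doctors"  # matches the example above
assert redshift_safe_name("pages") == "pages"         # other names pass through unchanged
```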
 
 ### CPU
 
-In an usual workload we have seen Redshift using around 20-40% of CPU, we take advantage of the COPY command to ensure to make full use of your cluster to load your data as fast as we can.
+In a usual workload, Redshift uses around 20-40% of CPU. Segment takes advantage of the COPY command to make full use of your cluster and load your data as efficiently as possible.
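
Segment runs these loads for you, but for intuition, a COPY of this general shape is what bulk-loads data into the cluster in parallel across its nodes. Illustrative only: the schema, table, bucket, and IAM role below are hypothetical, and this is not a claim about the exact statement Segment issues.

```python
import psycopg2

# Illustrative only: Segment manages its own loads. COPY pulls files from S3 in
# parallel across the cluster's nodes, which is what drives the CPU usage above.
conn = psycopg2.connect(
    host="segment-warehouse.abc123.us-west-2.redshift.amazonaws.com",
    port=5439, dbname="segment", user="admin", password="<admin password>",
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("""
        COPY my_schema.tracks
        FROM 's3://hypothetical-bucket/path/to/batch.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
        GZIP;
    """)
conn.close()
```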
 
 ## Troubleshooting
 
 ### How do I improve Query Speed?
 
-The speed of your queries depends on the capabilities of the hardware you have chosen as well as the size of the dataset. The amount of data utilization in the cluster will also impact query speed. For Redshift clusters if you're above 75% utilization, you will likely experience degradation in query speed. [Here's a guide on how to improve your query speeds.](/docs/connections/storage/warehouses/redshift-tuning/)
+The speed of your queries depends on the capabilities of the hardware you have chosen as well as the size of the dataset. The amount of data used in the cluster will also impact query speed. For Redshift clusters, if you're above 75% capacity, you will likely experience degradation in query speed. [Here's a guide on how to improve your query speeds.](/docs/connections/storage/warehouses/redshift-tuning/)
 
 ## FAQ
 
 ### How do I sync data in and out between Redshift and Segment?
 
-It's often the case that our customers want to combine 1st party transactional and operational data their Segment data to generate a 360 degree view of the customer. The challenge is that those data sets are often stored in separate data warehouses.
+It's often the case that customers want to combine 1st-party transactional and operational data with Segment data to generate a full view of the customer. The challenge is that those data sets are often stored in separate data warehouses.
 
 If you're interested in importing data into a Redshift cluster, it's important that you follow these [guidelines](/docs/connections/storage/warehouses/faq/).