For typical analyses, Redshift usually gives you a much better price-to-performance ratio.
Redshift lacks some [features](http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-features.html), [datatypes](http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-datatypes.html), and [functions](http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-functions.html) supported by Postgres and also implements [some features](http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-sql-implementated-differently.html) differently. If you need any of the features or functions missing in Redshift and BigQuery, choose Postgres. If not (or you're not sure), Segment recommends choosing Redshift.
If you'd like more information, Amazon wrote [about this in their documentation](http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-sql.html).
## Comparing Redshift and BigQuery
Both Redshift and BigQuery are attractive cloud-hosted, affordable, and performant analytical databases. The differences between the two lie mainly in their architecture and pricing.
## Architecture
When you provision a Redshift cluster, you're renting a server from Amazon Web Services. Your cluster is made up of [nodes](http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html), each with dedicated memory, CPU, and disk storage. These nodes handle data storage and query execution, and, if your cluster contains multiple nodes, a leader node handles coordination across the cluster.
Redshift performance and storage capacity are a function of cluster size and cluster type. As your storage or performance requirements change, you can scale your cluster up or down as needed.
With BigQuery, you're not constrained by the storage capacity or compute resources of a given cluster. Instead, you can load large amounts of data into BigQuery without running out of memory, and execute complex queries without maxing out CPU.
This is possible because BigQuery takes advantage of distributed storage and networking to separate data storage from compute power. Data is distributed across many servers in the Google cloud using the [Colossus distributed file system](https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood). When you execute a query, the [Dremel query engine](https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood) splits the query into smaller sub-tasks, distributes the sub-tasks to computers across Google data centers, and then reassembles the results.
## Pricing
The difference in architecture translates into differences in pricing.
[Redshift pricing](https://aws.amazon.com/redshift/pricing/) is based on an hourly rate determined by the number and types of nodes in your cluster. AWS offers dense storage nodes, optimized for storage, and dense compute nodes, optimized for query performance.
BigQuery has two [pricing options](https://cloud.google.com/bigquery/pricing): variable and fixed pricing. With the variable, pay-as-you-go plan, you pay for the data you load into BigQuery, and then pay for the amount of data you query. BigQuery allows you to set up [Cost Controls and Alerts](https://cloud.google.com/bigquery/cost-controls) to help control and monitor costs.
Fixed-price plans are aimed at high-volume customers and allow you to rent a fixed amount of compute power.
## Resource Management
Redshift does require you to create a cluster, choose sort and distribution keys, and resize your cluster as storage and performance needs change over time.
BigQuery is "fully-managed", which means that you'll never have to resize clusters or adjust distribution or sort keys. BigQuery handles all of that.
## Speeding up your query performance
Waiting minutes and minutes, maybe even an hour, for your queries to complete is an unfortunate reality for growing companies. Whether your data has grown faster than your cluster or you're running too many jobs in parallel, there are lots of reasons your queries might be slowing down.
To help you improve your query performance, this guide takes you through common issues and how to mitigate them.
### 1\. Not enough space in your cluster
As your data volume grows and your team writes more queries, you might be running out of space in your cluster.
To check if you're getting close to your max, run this query. It will tell you the percentage of storage used in your cluster. Segment recommends never exceeding 75-80% of your storage capacity. If you're nearing capacity, consider adding some more nodes.
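The query itself isn't reproduced here, but a minimal sketch that computes the same percentage, using the `stv_partitions` system table (some of its rows are visible only to superusers), looks like this:

```sql
-- Percentage of total disk capacity currently in use across the cluster.
-- stv_partitions reports per-partition `used` and `capacity` in 1 MB blocks.
SELECT SUM(used)::decimal / SUM(capacity) * 100 AS pct_storage_used
FROM stv_partitions;
```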

[Learn how to resize your cluster.](http://docs.aws.amazon.com/redshift/latest/mgmt/rs-resize-tutorial.html)
### 2\. Inefficient queries
Another thing you'll want to check is if your queries are efficient. For example, if you're scanning an entire dataset with a query, you're probably not making the best use of your compute resources.
Some tips for writing performant queries:
* Consider using `INNER joins`, as they are more efficient than `LEFT joins` (see the sketch after this list).
* Stay away from `UNION` whenever possible.
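To make the join tip concrete, here's a hypothetical pair of queries; the `orders` and `users` tables are made-up names for the example:

```sql
-- INNER JOIN: only rows with a matching user survive, so Redshift
-- can discard non-matching rows early.
SELECT o.order_id, u.email
FROM orders o
INNER JOIN users u ON u.user_id = o.user_id;

-- LEFT JOIN: every row of `orders` is kept whether or not it matches,
-- which is extra work if you only care about matched rows.
SELECT o.order_id, u.email
FROM orders o
LEFT JOIN users u ON u.user_id = o.user_id;
```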
To learn more about writing beautiful SQL, check out these resources:
* [Chartio on Improving Query Performance](https://support.chartio.com/knowledgebase/improving-query-performance)
### 3\. Running multiple ETL processes and queries
Some databases like Redshift have limited computing resources. Running multiple queries or ETL processes that insert data into your warehouse at the same time will compete for compute power.

You also might want to take advantage of Redshift's [Workload Management](http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html), which helps ensure that fast-running queries don't get stuck behind long ones.
### 4\. Default WLM Queue Configuration
As mentioned before, Redshift schedules and prioritizes queries using [Workload Management](http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html). Each queue can be configured to distribute resources in a way that's optimized for your use-case.
The default configuration is a single queue with only 5 queries running concurrently, but Segment has found that the default only works well for low-volume warehouses. More often than not, adjusting this configuration can significantly improve your sync times.
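To check whether queries are actually piling up in a queue, one option is the `stv_wlm_query_state` system table; this is a sketch, not part of the original guide:

```sql
-- Count queued vs. executing queries per WLM service class.
-- Service classes 1-4 are internal system queues, 5 is the superuser
-- queue, and user-defined WLM queues start at 6.
SELECT service_class, state, COUNT(*) AS query_count
FROM stv_wlm_query_state
GROUP BY service_class, state
ORDER BY service_class, state;
```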
Before its SQL statements, Segment runs `set query_group to "segment";` to group all of its queries together. This allows you to create a queue just for Segment that is isolated from your own queries. The maximum concurrency that Redshift supports is 50 across _all_ query groups, and resources like memory are distributed evenly across all those queries.
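You can route your own workloads the same way; in this sketch the group name `etl` is hypothetical, so substitute whatever group your WLM configuration defines:

```sql
-- Send this session's queries to the WLM queue matching the 'etl'
-- query group, then clear the setting when done.
SET query_group TO 'etl';

-- ... your ETL statements here ...

RESET query_group;
```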
Segment's initial recommendation is for 2 WLM queues:
1. a queue for the `segment` query group with a concurrency of `10`

Generally, Segment is responsible for most writes in the databases it connects to, so a higher concurrency allows Segment to write as quickly as possible. If you're also using the same database for your own ETL process, you may want to use the same concurrency for both groups. You may even require additional queues if you have other applications writing to the database.
Each cluster may have different needs, so feel free to stray from this recommendation if another configuration works better for your use-case. AWS provides some [guidelines](http://docs.aws.amazon.com/redshift/latest/dg/tutorial-configuring-workload-management.html), and you can always [contact us](https://segment.com/help/contact/), as Segment is happy to share what it has learned while working with Redshift.
## Pro-tips for Segment Warehouses
In addition to following performance best practices, here are some more optimizations to consider if you're using Segment Warehouses.
### Factors that affect load times
When Segment is actively loading data into your data warehouse, it competes for cluster space and storage with any other jobs you might be running. Here are the parameters that influence your load time for Segment Warehouses.
* **Volume of data.** Segment's pipeline needs to load and deduplicate data for each sync, so having more volume means these operations will take longer.
* **Number of sources.** When Segment starts a sync of your data into your warehouse, it kicks off a new job for every source you have in Segment. The more sources you have, the longer your load time could take. This is where the WLM queue and the concurrency setting can make a big difference.
* **Number and size of columns.** Column sizes and the number of columns also affect load time. If you have long property values or lots of properties per event, the load may take longer as well.
### Performance optimizations
To make sure you have enough headroom for quick queries while using Segment Warehouses, here are some tips!
* **Size up your cluster.** If you find your queries are getting slow at key times during the day, add more nodes to give enough room for Segment to load data and for your team to run their queries.
* **Disable unused sources.** If you're not actively analyzing data from a source, consider disabling the source for your Warehouse (available for business tier). If you don't use a source anymore (perhaps you were just playing around with it for testing), you might even want to remove it completely. This will kick off fewer jobs in Segment's ETL process.
* **Schedule syncs during off times.** If you're concerned about query times and you don't mind data that's a little stale, you can schedule your syncs to run when most of your team isn't actively using the database. (Available for business tier customers.)
Hopefully these steps will help speed up your workflow! If you need any other help, feel free to [contact us](https://segment.com/help/contact/).