
Commit 4045f91

Merge branch 'master' into onesignal-new
2 parents 387d3be + 7bba81f commit 4045f91

File tree: 2 files changed (+29, -30 lines)

src/connections/storage/warehouses/choose-warehouse.md

Lines changed: 9 additions & 9 deletions
@@ -7,36 +7,36 @@ redirect_from: '/connections/warehouses/choose-warehouse/'
In most cases, you will get a much better price-to-performance ratio with Redshift for typical analyses.

- Redshift lacks some [features](http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-features.html), [datatypes](http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-datatypes.html), and [functions](http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-functions.html) supported by Postgres and also implements [some features](http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-sql-implementated-differently.html) differently. If you absolutely need any of the features or functions missing in Redshift and BigQuery, choose Postgres. If not (or you're not sure), we recommend Redshift.
+ Redshift lacks some [features](http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-features.html), [datatypes](http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-datatypes.html), and [functions](http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-functions.html) supported by Postgres and also implements [some features](http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-sql-implementated-differently.html) differently. If you need any of the features or functions missing in Redshift and BigQuery, choose Postgres. If not (or you're not sure), Segment recommends choosing Redshift.

If you'd like more information, Amazon wrote [about this in their documentation](http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-sql.html).

## Comparing Redshift and BigQuery

- Both Redshift and BigQuery are attractive cloud-hosted, relatively cheap, and performant analytical databases. The differences between the two are largely around their architecture and pricing.
+ Both Redshift and BigQuery are attractive cloud-hosted, affordable, and performant analytical databases. The differences between the two center on their architecture and pricing.

## Architecture

- When you provision a Redshift cluster, you're renting a server from Amazon Web Services. Your cluster is comprised of [nodes](http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html), each with dedicated memory, CPU, and disk storage. These nodes handle data storage, query execution, and - if your cluster contains multiple nodes - a leader node will handle coordination across the cluster.
+ When you provision a Redshift cluster, you're renting a server from Amazon Web Services. Your cluster consists of [nodes](http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html), each with dedicated memory, CPU, and disk storage. These nodes handle data storage and query execution, and - if your cluster contains multiple nodes - a leader node handles coordination across the cluster.

Redshift performance and storage capacity is a function of cluster size and cluster type. As your storage or performance requirements change, you can scale up or down your cluster as needed.

With BigQuery, you're not constrained by the storage capacity or compute resources of a given cluster. Instead, you can load large amounts of data into BigQuery without running out of memory, and execute complex queries without maxing out CPU.

- This is possible because BigQuery takes advantage of distributed storage and networking to separate data storage from compute power. Data is distributed across many servers in the Google cloud using their [Colossus distributed file system](https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood). When you execute a query, the [Dremel query engine](https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood) splits the query into smaller sub-tasks, distributes the sub-tasks to many computers across Google data centers, and then re-assembles them into your results.
+ This is possible because BigQuery takes advantage of distributed storage and networking to separate data storage from compute power. Google's [Colossus distributed file system](https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood) distributes data across many servers in the Google cloud. When you execute a query, the [Dremel query engine](https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood) splits the query into smaller sub-tasks, distributes the sub-tasks to computers across Google data centers, and then re-assembles them into your results.

## Pricing

The difference in architecture translates into differences in pricing.

- [Redshift prices](https://aws.amazon.com/redshift/pricing/) based on an hourly rate determined by the number and types of nodes in your cluster. They offer dense storage - optimized for storage - and dense compute nodes - optimized for query performance.
+ [Redshift prices](https://aws.amazon.com/redshift/pricing/) are based on an hourly rate determined by the number and types of nodes in your cluster. AWS offers dense storage nodes (optimized for storage) and dense compute nodes (optimized for query performance).

- BigQuery has two [pricing options](https://cloud.google.com/bigquery/pricing): variable and fixed pricing. With the variable, pay-as-you-go plan, you pay for the data you load into BigQuery, and then pay only for the amount of data you query. BigQuery allows you to set up [Cost Controls and Alerts](https://cloud.google.com/bigquery/cost-controls) to help control and monitor costs.
+ BigQuery has two [pricing options](https://cloud.google.com/bigquery/pricing): variable and fixed pricing. With the variable, pay-as-you-go plan, you pay for the data you load into BigQuery, and then pay for the amount of data you query. BigQuery allows you to set up [Cost Controls and Alerts](https://cloud.google.com/bigquery/cost-controls) to help control and monitor costs.

- Fixed-price plans are geared toward high-volume customers and allow you to rent a fixed amount of compute power.
+ Fixed-price plans are designed for high-volume customers and allow you to rent a fixed amount of compute power.

## Resource Management

- Redshift does require you to create a cluster, schedule vacuums, choose sort and distribution keys, and resize your cluster as storage and performance needs change over time.
+ Redshift does require you to create a cluster, choose sort and distribution keys, and resize your cluster as storage and performance needs change over time.
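
To make those knobs concrete, here's a minimal sketch of the DDL involved, using a hypothetical `events` table (the table and columns are illustrative, not from the docs):

```sql
-- Hypothetical table for illustration: DISTKEY controls how rows are
-- distributed across nodes; SORTKEY controls their on-disk sort order.
CREATE TABLE events (
    user_id     VARCHAR(64),
    event_name  VARCHAR(255),
    received_at TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (received_at);
```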

- BigQuery is "fully-managed", which means that you'll never have to resize, vacuum, or adjust distribution or sort keys. All of that is handled by BigQuery.
+ BigQuery is "fully-managed", which means that you'll never have to resize or adjust distribution or sort keys. BigQuery handles all of that.

src/connections/storage/warehouses/redshift-tuning.md

Lines changed: 20 additions & 21 deletions
@@ -3,7 +3,7 @@ title: How do I speed up my Redshift queries?
redirect_from: '/connections/warehouses/redshift-tuning/'
---

- Waiting minutes and minutes, maybe even an hour, for your queries to compute is an unfortunate reality for many growing companies. Whether your data has grown faster than your cluster, or you're running too many jobs in parallel, there are lots of reasons your queries might be slowing down.
+ Waiting minutes and minutes, maybe even an hour, for your queries to compute is an unfortunate reality for growing companies. Whether your data has grown faster than your cluster, or you're running too many jobs in parallel, there are lots of reasons your queries might be slowing down.

To help you improve your query performance, this guide takes you through common issues and how to mitigate them.

@@ -13,19 +13,19 @@ To help you improve your query performance, this guide takes you through common
As your data volume grows and your team writes more queries, you might be running out of space in your cluster.

- To check if you're getting close to your max, run this query. It will tell you the percentage of storage used in your cluster. We recommend never exceeding 75-80% of your storage capacity. If you're nearing capacity, consider adding some more nodes.
+ To check if you're getting close to your max, run this query. It will tell you the percentage of storage used in your cluster. Segment recommends that you don't exceed 75-80% of your storage capacity. If you approach that limit, consider adding more nodes to your cluster.

![](images/asset_HvZs8FpE.png)
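
As a rough sketch of such a storage check (assuming access to Redshift's `stv_partitions` system table, which is typically superuser-visible), the query looks something like:

```sql
-- Approximate percentage of disk space used across the cluster.
-- stv_partitions reports per-partition capacity and usage in 1 MB blocks.
SELECT SUM(used)::FLOAT / SUM(capacity) * 100 AS pct_storage_used
FROM stv_partitions;
```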

- [Learn how to resize your cluster here.](http://docs.aws.amazon.com/redshift/latest/mgmt/rs-resize-tutorial.html)
+ [Learn how to resize your cluster.](http://docs.aws.amazon.com/redshift/latest/mgmt/rs-resize-tutorial.html)

### 2\. Inefficient queries

Another thing you'll want to check is if your queries are efficient. For example, if you're scanning an entire dataset with a query, you're probably not making the best use of your compute resources.
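
As a hedged illustration (reusing the hypothetical `events` table with `received_at` as its sort key), compare a full scan with a query that limits both columns and rows:

```sql
-- Full scan: reads every column of every row.
SELECT * FROM events;

-- Cheaper: select only the columns you need and filter on the sort key,
-- so Redshift can skip blocks outside the time range.
SELECT user_id, event_name
FROM events
WHERE received_at > dateadd(day, -7, getdate());
```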

- A few tips for writing performant queries:
+ Some tips for writing performant queries:

- * Consider using `INNER joins` as they are are more efficient that `LEFT joins`.
+ * Consider using `INNER joins` as they are more efficient than `LEFT joins`.

* Stay away from `UNION` whenever possible.
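
To illustrate the join tip above with hypothetical `users` and `orders` tables: an `INNER JOIN` lets Redshift discard non-matching rows early, while a `LEFT JOIN` must carry every row from the left table through the query.

```sql
-- Prefer INNER JOIN when you only care about users who have orders.
SELECT u.user_id, COUNT(o.order_id) AS order_count
FROM users u
INNER JOIN orders o ON o.user_id = u.user_id
GROUP BY u.user_id;
```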

@@ -43,7 +43,7 @@ To learn more about writing beautiful SQL, check out these resources:
* [Chartio on Improving Query Performance](https://support.chartio.com/knowledgebase/improving-query-performance)

- ### 3\. Multiple ETL processes and queries running
+ ### 3\. Running multiple ETL processes and queries

Some databases like Redshift have limited computing resources. Running multiple queries or ETL processes that insert data into your warehouse at the same time will compete for compute power.

@@ -53,17 +53,17 @@ If you're a Segment Business Tier customer, you can schedule your sync times und
![](images/asset_fRccrNNd.png)

- In addition, you might want to take advantage of Redshift's [Workload Management](http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html) that helps ensure fast-running queries won't get stuck behind long ones.
+ You also might want to take advantage of Redshift's [Workload Management](http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html) that helps ensure fast-running queries won't get stuck behind long ones.

### 4\. Default WLM Queue Configuration

As mentioned before, Redshift schedules and prioritizes queries using [Workload Management](http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html). Each queue is configured to distribute resources in ways that can optimize for your use-case.

- The default configuration is a single queue with only 5 queries running concurrently, but we've discovered that the default only works well for very low-volume warehouses. More often than not, adjusting this configuration can significantly improve your sync times.
+ The default configuration is a single queue with only 5 queries running concurrently, but Segment discovered that the default only works well for low-volume warehouses. More often than not, adjusting this configuration can improve your sync times.

- Before our SQL statements, we use `set query_group to "segment";` to group all of our queries together. This allows you to easily create a queue just for Segment that can be isolated from your own queries. The maximum concurrency that Redshift supports is 50 across _all_ query groups, and resources like memory are distributed evenly across all those queries.
+ Before running its SQL statements, Segment uses `set query_group to "segment";` to group all of its queries together. This allows you to create a queue that isolates Segment's queries from your own. The maximum concurrency that Redshift supports is 50 across _all_ query groups, and resources like memory are distributed evenly across all those queries.
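
In practice that looks something like the following sketch (the `set query_group` statement is the one named above; the table is hypothetical):

```sql
-- Tag this session's queries so WLM routes them to the "segment" queue.
set query_group to "segment";

-- Statements run now are scheduled in that queue.
SELECT COUNT(*) FROM production.tracks;

-- Return to the default queue.
reset query_group;
```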

- Our initial recommendation is for 2 WLM queues:
+ Segment's initial recommendation is for 2 WLM queues:

1. a queue for the `segment` query group with a concurrency of `10`

@@ -72,29 +72,28 @@ Our initial recommendation is for 2 WLM queues:
![](images/asset_sHNEIURK.png)

- Generally, we are responsible for most writes in the databases we connect to, so having a higher concurrency allows us to write as quickly as possible. However, if you are also using the same database for your own ETL process, you may want to use the same concurrency for both groups. In addition, you may even require additional queues if you have other applications writing to the database.
+ Generally, Segment is responsible for most writes in the databases Segment connects to, so having a higher concurrency allows Segment to write as fast as possible. If you're also using the same database for your own ETL process, you may want to use the same concurrency for both groups. You may even require additional queues if you have other applications writing to the database.

- Each cluster may have different needs, so feel free to stray from this recommendation if another configuration works better for your use-case. AWS provides some [guidelines](http://docs.aws.amazon.com/redshift/latest/dg/tutorial-configuring-workload-management.html), and of course you can always [contact us](https://segment.com/help/contact/) as we're more than happy to share what we have learned while working with Redshift.
+ Each cluster may have different needs, so feel free to stray from this recommendation if another configuration works better for your use-case. AWS provides some [guidelines](http://docs.aws.amazon.com/redshift/latest/dg/tutorial-configuring-workload-management.html), and you can always [contact us](https://segment.com/help/contact/) as Segment is more than happy to share what it has learned while working with Redshift.

## Pro-tips for Segment Warehouses

- In addition to following performance best practices, here are a few more optimizations to consider if you're using Segment Warehouses.
+ In addition to following performance best practices, here are some more optimizations to consider if you're using Segment Warehouses.

### Factors that affect load times

- When Segment is actively loading data into your data warehouse, we're competing for cluster space and storage with any other jobs you might be running. Here are the parameters that influence your load time for Segment Warehouses.
+ When Segment is actively loading data into your data warehouse, Segment is competing for cluster space and storage with any other jobs you might be running. Here are the parameters that influence your load time for Segment Warehouses.

- * **Volume of data.** Our pipeline needs to load and deduplicate data for each sync, so simply having more volume means these operations will take longer.
- * **Number of sources.** When we start a sync of your data into your warehouse, we kick off a new job for every source you have in Segment. So the more sources you have, the longer your load time could take. This is where the WLM queue and the concurrency setting can make a big difference.
- * **Number and size of columns.** Column sizes and the number of columns also affect load time. If you have very long property values or lots of properties per event, the load may take longer as well.
+ * **Volume of data.** Segment's pipeline needs to load and deduplicate data for each sync, so having more volume means these operations will take longer.
+ * **Number of sources.** When Segment starts a sync of your data into your warehouse, Segment kicks off a new job for every source you have in Segment. The more sources you have, the longer your load time could take. This is where the WLM queue and the concurrency setting can make a big difference.
+ * **Number and size of columns.** Column sizes and the number of columns also affect load time. If you have long property values or lots of properties per event, the load may take longer as well.

### Performance optimizations

To make sure you have enough headroom for quick queries while using Segment Warehouses, here are some tips!

- * **Size up your cluster.** If you find your queries are getting slow at key times during the day, add more nodes to give enough room for us to load data and for your team to run their queries.
- * **Disable unused sources.** If you're not actively analyzing data from a source, consider disabling the source for your Warehouse (available for business tier). If you don't use a source anymore—perhaps you were just playing around with it for testing, you might even want to remove it completely. This will kick off fewer jobs in our ETL process.
+ * **Size up your cluster.** If you find your queries are getting slow at key times during the day, add more nodes to give enough room for Segment to load data and for your team to run their queries.
+ * **Disable unused sources.** If you're not actively analyzing data from a source, consider disabling the source for your Warehouse (available for business tier). If you don't use a source anymore (perhaps you were just playing around with it for testing), you might even want to remove it. This will kick off fewer jobs in Segment's ETL process.
* **Schedule syncs during off times.** If you're concerned about query times and you don't mind data that's a little stale, you can schedule your syncs to run when most of your team isn't actively using the database. (Available for business tier customers.)
- * **Schedule regular vacuums.** Make sure to schedule regular vacuums for your cluster, so old deleted data isn't taking up space.

- We hope these steps will speed up your workflow! If you need any other help, feel free to [contact us](https://segment.com/help/contact/).
+ Hopefully these steps help speed up your workflow! If you need any other help, feel free to [contact us](https://segment.com/help/contact/).
