Commit c8e8a24

20240719 freshness pass
1 parent aaa2bd8 commit c8e8a24

1 file changed

+25
-22
lines changed

articles/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute.md

Lines changed: 25 additions & 22 deletions
@@ -16,17 +16,17 @@ ms.custom:
1616

1717
This article contains recommendations for designing hash-distributed and round-robin distributed tables in dedicated SQL pools.
1818

19-
This article assumes you are familiar with data distribution and data movement concepts in dedicated SQL pool. For more information, see [Azure Synapse Analytics architecture](massively-parallel-processing-mpp-architecture.md).
19+
This article assumes you are familiar with data distribution and data movement concepts in dedicated SQL pool. For more information, see [Azure Synapse Analytics architecture](massively-parallel-processing-mpp-architecture.md).
2020

2121
## What is a distributed table?
2222

23-
A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm.
23+
A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm.
2424

25-
**Hash-distribution** improves query performance on large fact tables, and is the focus of this article. **Round-robin distribution** is useful for improving loading speed. These design choices have a significant impact on improving query and loading performance.
25+
**Hash-distribution** improves query performance on large fact tables, and is the focus of this article. **Round-robin distribution** is useful for improving loading speed. These design choices have a significant effect on improving query and loading performance.
2626

2727
Another table storage option is to replicate a small table across all the Compute nodes. For more information, see [Design guidance for replicated tables](design-guidance-for-replicated-tables.md). To quickly choose among the three options, see Distributed tables in the [tables overview](sql-data-warehouse-tables-overview.md).
2828

29-
As part of table design, understand as much as possible about your data and how the data is queried.  For example, consider these questions:
29+
As part of table design, understand as much as possible about your data and how the data is queried. For example, consider these questions:
3030

3131
- How large is the table?
3232
- How often is the table refreshed?
@@ -51,7 +51,7 @@ Consider using a hash-distributed table when:
5151

5252
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.
5353

54-
As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra step can slow down your queries. For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.
54+
As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra step can slow down your queries. For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.
5555

5656
Consider using the round-robin distribution for your table in the following scenarios:
5757

@@ -85,7 +85,7 @@ WITH
8585
);
8686
```
8787

88-
Hash distribution can be applied on multiple columns for a more even distribution of the base table. Multi-column distribution will allow you to choose up to eight columns for distribution. This not only reduces the data skew over time but also improves query performance. For example:
88+
Hash distribution can be applied on multiple columns for a more even distribution of the base table. Multi-column distribution allows you to choose up to eight columns for distribution. This not only reduces the data skew over time but also improves query performance. For example:
8989

9090
```sql
9191
CREATE TABLE [dbo].[FactInternetSales]
@@ -109,9 +109,9 @@ WITH
109109
> `ALTER DATABASE SCOPED CONFIGURATION SET DW_COMPATIBILITY_LEVEL = 50;`
110110
> For more information on setting the database compatibility level, see [ALTER DATABASE SCOPED CONFIGURATION](/sql/t-sql/statements/alter-database-scoped-configuration-transact-sql). For more information on multi-column distributions, see [CREATE MATERIALIZED VIEW](/sql/t-sql/statements/create-materialized-view-as-select-transact-sql), [CREATE TABLE](/sql/t-sql/statements/create-table-azure-sql-data-warehouse), or [CREATE TABLE AS SELECT](/sql/t-sql/statements/create-materialized-view-as-select-transact-sql).
111111
112-
Data stored in the distribution column(s) can be updated. Updates to data in distribution column(s) could result in data shuffle operation.
112+
Data stored in the distribution columns can be updated. Updates to data in distribution columns could result in data shuffle operation.
113113

114-
Choosing distribution column(s) is an important design decision since the values in the hash column(s) determine how the rows are distributed. The best choice depends on several factors, and usually involves tradeoffs. Once a distribution column or column set is chosen, you cannot change it. If you didn't choose the best column(s) the first time, you can use [CREATE TABLE AS SELECT (CTAS)](/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse?toc=/azure/synapse-analytics/sql-data-warehouse/toc.json&bc=/azure/synapse-analytics/sql-data-warehouse/breadcrumb/toc.json&view=azure-sqldw-latest&preserve-view=true) to re-create the table with the desired distribution hash key.
114+
Choosing distribution columns is an important design decision since the values in the hash columns determine how the rows are distributed. The best choice depends on several factors, and usually involves tradeoffs. Once a distribution column or column set is chosen, you cannot change it. If you didn't choose the best columns the first time, you can use [CREATE TABLE AS SELECT (CTAS)](/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse?toc=/azure/synapse-analytics/sql-data-warehouse/toc.json&bc=/azure/synapse-analytics/sql-data-warehouse/breadcrumb/toc.json&view=azure-sqldw-latest&preserve-view=true) to re-create the table with the desired distribution hash key.
115115

116116
### Choose a distribution column with data that distributes evenly
117117

@@ -122,25 +122,27 @@ For best performance, all of the distributions should have approximately the sam
122122

123123
To balance the parallel processing, select a distribution column or set of columns that:
124124

125-
- **Has many unique values.** The distribution column(s) can have duplicate values. All rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have > 1 unique values while others can end with zero values.
126-
- **Does not have NULLs, or has only a few NULLs.** For an extreme example, if all values in the distribution column(s) are NULL, all the rows are assigned to the same distribution. As a result, query processing is skewed to one distribution, and does not benefit from parallel processing.
127-
- **Is not a date column**. All data for the same date lands in the same distribution, or will cluster records by date. If several users are all filtering on the same date (such as today's date), then only 1 of the 60 distributions do all the processing work.
125+
- **Has many unique values.** One or more distribution columns can have duplicate values. All rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have > 1 unique values while others can end with zero values.
126+
- **Does not have NULLs, or has only a few NULLs.** For an extreme example, if all values in the distribution columns are NULL, all the rows are assigned to the same distribution. As a result, query processing is skewed to one distribution, and does not benefit from parallel processing.
127+
- **Is not a date column**. All data for the same date lands in the same distribution, or will cluster records by date. If several users are all filtering on the same date (such as today's date), then only 1 of the 60 distributions does all the processing work.
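
The three criteria above can be checked against real data before committing to a distribution column. The following is a minimal sketch, not part of the original article: it assumes `CustomerKey` is the candidate column on `dbo.FactInternetSales` (a hypothetical choice for illustration) and reports how many distinct values and NULLs the column holds, then lists the most frequent values that could concentrate rows in one distribution.

```sql
-- Assumption: CustomerKey is a hypothetical candidate distribution column.
-- Profile cardinality and NULL count for the candidate column.
SELECT
    COUNT_BIG(*)                         AS total_rows,
    COUNT_BIG(DISTINCT CustomerKey)      AS distinct_values,
    COUNT_BIG(*) - COUNT_BIG(CustomerKey) AS null_rows   -- COUNT_BIG(col) skips NULLs
FROM dbo.FactInternetSales;

-- List the most frequent values; a few dominant values suggest skew,
-- because all rows with the same value land in the same distribution.
SELECT TOP 10
    CustomerKey,
    COUNT_BIG(*) AS row_count
FROM dbo.FactInternetSales
GROUP BY CustomerKey
ORDER BY row_count DESC;
```

A column with far fewer than 60 distinct values, a large `null_rows` count, or a handful of values that dominate `row_count` is likely a poor fit under the guidance above.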
128128

129129
### Choose a distribution column that minimizes data movement
130130

131131
To get the correct query result queries might move data from one Compute node to another. Data movement commonly happens when queries have joins and aggregations on distributed tables. Choosing a distribution column or column set that helps minimize data movement is one of the most important strategies for optimizing performance of your dedicated SQL pool.
132132

133133
To minimize data movement, select a distribution column or set of columns that:
134134

135-
- Is used in `JOIN`, `GROUP BY`, `DISTINCT`, `OVER`, and `HAVING` clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column or column set that is frequently in the `GROUP BY` clause.
135+
- Is used in `JOIN`, `GROUP BY`, `DISTINCT`, `OVER`, and `HAVING` clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column or column set that is frequently in the `GROUP BY` clause.
136136
- Is *not* used in `WHERE` clauses. When a query's `WHERE` clause and the table's distribution columns are on the same column, the query could encounter high data skew, leading to processing load falling on only few distributions. This impacts query performance, ideally many distributions share the processing load.
137-
- Is *not* a date column. `WHERE` clauses often filter by date. When this happens, all the processing could run on only a few distributions affecting query performance. Ideally, many distributions share the processing load.
137+
- Is *not* a date column. `WHERE` clauses often filter by date. When this happens, all the processing could run on only a few distributions affecting query performance. Ideally, many distributions share the processing load.
138138

139139
Once you design a hash-distributed table, the next step is to load data into the table. For loading guidance, see [Loading overview](design-elt-data-loading.md).
140140

141141
## How to tell if your distribution is a good choice
142142

143-
After data is loaded into a hash-distributed table, check to see how evenly the rows are distributed across the 60 distributions. The rows per distribution can vary up to 10% without a noticeable impact on performance. Consider the following topics to evaluate your distribution column(s).
143+
After data is loaded into a hash-distributed table, check to see how evenly the rows are distributed across the 60 distributions. The rows per distribution can vary up to 10% without a noticeable impact on performance.
144+
145+
Consider the following ways to evaluate your distribution columns.
144146

145147
### Determine if the table has data skew
146148

@@ -153,7 +155,7 @@ DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');
153155

154156
To identify which tables have more than 10% data skew:
155157

156-
1. Create the view `dbo.vTableSizes` that is shown in the [Tables overview](sql-data-warehouse-tables-overview.md#table-size-queries) article.
158+
1. Create the view `dbo.vTableSizes` that is shown in the [Tables overview](sql-data-warehouse-tables-overview.md#table-size-queries) article.
157159
1. Run the following query:
158160

159161
```sql
@@ -172,7 +174,7 @@ order by two_part_name, row_count;
172174

173175
### Check query plans for data movement
174176

175-
A good distribution column set enables joins and aggregations to have minimal data movement. This affects the way joins should be written. To get minimal data movement for a join on two hash-distributed tables, one of the join columns needs to be in distribution column or column(s). When two hash-distributed tables join on a distribution column of the same data type, the join does not require data movement. Joins can use additional columns without incurring data movement.
177+
A good distribution column set enables joins and aggregations to have minimal data movement. This affects the way joins should be written. To get minimal data movement for a join on two hash-distributed tables, one of the join columns needs to be in distribution column or columns. When two hash-distributed tables join on a distribution column of the same data type, the join does not require data movement. Joins can use additional columns without incurring data movement.
176178

177179
To avoid data movement during a join:
178180

@@ -181,22 +183,23 @@ To avoid data movement during a join:
181183
- The columns must be joined with an equals operator.
182184
- The join type cannot be a `CROSS JOIN`.
183185

184-
To see if queries are experiencing data movement, you can look at the query plan.
186+
To see if queries are experiencing data movement, you can look at the query plan.
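
One way to inspect the plan is to prefix the query with `EXPLAIN`, which returns the plan as XML without running the query. The sketch below is illustrative only: `dbo.DimCustomer` and the `SalesAmount` column are assumed names that do not appear in this article, and the join assumes both tables carry a `CustomerKey` column.

```sql
-- Assumptions: dbo.DimCustomer, SalesAmount, and CustomerKey are hypothetical names.
-- EXPLAIN returns the query plan as XML instead of executing the statement.
EXPLAIN
SELECT
    c.CustomerKey,
    SUM(s.SalesAmount) AS total_sales
FROM dbo.FactInternetSales AS s
JOIN dbo.DimCustomer AS c
    ON s.CustomerKey = c.CustomerKey   -- equality join on the candidate distribution column
GROUP BY c.CustomerKey;
```

In the returned XML, steps with an operation type such as `SHUFFLE_MOVE` or `BROADCAST_MOVE` generally indicate that rows are being moved between distributions. If the join columns match the distribution columns on both tables, those steps would be expected to disappear from the plan.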
185187

186188
## Resolve a distribution column problem
187189

188-
It is not necessary to resolve all cases of data skew. Distributing data is a matter of finding the right balance between minimizing data skew and data movement. It is not always possible to minimize both data skew and data movement. Sometimes the benefit of having the minimal data movement might outweigh the impact of having data skew.
190+
It is not necessary to resolve all cases of data skew. Distributing data is a matter of finding the right balance between minimizing data skew and data movement. It is not always possible to minimize both data skew and data movement. Sometimes the benefit of having the minimal data movement might outweigh the effect of having data skew.
189191

190-
To decide if you should resolve data skew in a table, you should understand as much as possible about the data volumes and queries in your workload. You can use the steps in the [Query monitoring](sql-data-warehouse-manage-monitor.md) article to monitor the impact of skew on query performance. Specifically, look for how long it takes large queries to complete on individual distributions.
192+
To decide if you should resolve data skew in a table, you should understand as much as possible about the data volumes and queries in your workload. You can use the steps in the [Query monitoring](sql-data-warehouse-manage-monitor.md) article to monitor the effect of skew on query performance. Specifically, look for how long it takes large queries to complete on individual distributions.
191193

192-
Since you cannot change the distribution column(s) on an existing table, the typical way to resolve data skew is to re-create the table with a different distribution column(s).
194+
Since you cannot change the distribution columns on an existing table, the typical way to resolve data skew is to re-create the table with different distribution columns.
193195

194196
<a id="re-create-the-table-with-a-new-distribution-column"></a>
197+
195198
### Re-create the table with a new distribution column set
196199

197-
This example uses [CREATE TABLE AS SELECT](/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse?toc=/azure/synapse-analytics/sql-data-warehouse/toc.json&bc=/azure/synapse-analytics/sql-data-warehouse/breadcrumb/toc.json&view=azure-sqldw-latest&preserve-view=true) to re-create a table with a different hash distribution column or column(s).
200+
This example uses [CREATE TABLE AS SELECT](/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse?toc=/azure/synapse-analytics/sql-data-warehouse/toc.json&bc=/azure/synapse-analytics/sql-data-warehouse/breadcrumb/toc.json&view=azure-sqldw-latest&preserve-view=true) to re-create a table with a different hash distribution column or columns.
198201

199-
First use `CREATE TABLE AS SELECT` (CTAS) the new table with the new key. Then re-create the statistics and finally, swap the tables by re-naming them.
202+
First use `CREATE TABLE AS SELECT` (CTAS) to create the new table with the new key. Then re-create the statistics, and finally swap the tables by renaming them.
200203

201204
```sql
202205
CREATE TABLE [dbo].[FactInternetSales_CustomerKey]
