articles/synapse-analytics/sql/develop-tables-statistics.md
Lines changed: 31 additions & 14 deletions
@@ -24,11 +24,13 @@ The more the SQL pool resource knows about your data, the faster it can execute
24
24
25
25
The SQL pool query optimizer is a cost-based optimizer. It compares the cost of various query plans, and then chooses the plan with the lowest cost. In most cases, it chooses the plan that will execute the fastest.
26
26
27
-
For example, if the optimizer estimates that the date your query is filtering on will return one row it will choose one plan. If it estimates that the selected date will return 1 million rows, it will return a different plan.
27
+
For example, if the optimizer estimates that the date your query is filtering on will return one row, it will choose one plan. If it estimates that the selected date will return 1 million rows, it will return a different plan.
28
28
29
29
### Automatic creation of statistics
30
30
31
-
SQL pool will analyze incoming user queries for missing statistics when the database AUTO_CREATE_STATISTICS option is set to `ON`. If statistics are missing, the query optimizer creates statistics on individual columns in the query predicate or join condition. This function is used to improve cardinality estimates for the query plan.
31
+
SQL pool will analyze incoming user queries for missing statistics when the database AUTO_CREATE_STATISTICS option is set to `ON`. If statistics are missing, the query optimizer creates statistics on individual columns in the query predicate or join condition.
32
+
33
+
This function is used to improve cardinality estimates for the query plan.
32
34
33
35
> [!IMPORTANT]
34
36
> Automatic creation of statistics is currently turned on by default.
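
To illustrate the setting discussed in this hunk, here is a minimal sketch that reads the database-level flag and then turns it on. `MySqlPool` is a placeholder database name that doesn't appear in the article, and the connection you run `ALTER DATABASE` from may depend on your environment.

```sql
-- Check whether automatic creation of statistics is enabled.
-- is_auto_create_stats_on is 1 when the option is ON.
SELECT name, is_auto_create_stats_on
FROM sys.databases;

-- Turn the option on. MySqlPool is a hypothetical database name.
ALTER DATABASE MySqlPool
SET AUTO_CREATE_STATISTICS ON;
```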
@@ -95,7 +97,9 @@ One of the first questions to ask when you're troubleshooting a query is, **"Are
95
97
96
98
This question isn't one that can be answered by the age of the data. An up-to-date statistics object might be old if there's been no material change to the underlying data. When the number of rows has changed substantially, or a material change in the distribution of values for a column occurs, *then* it's time to update statistics.
97
99
98
-
There isn't a dynamic management view available to determine if data within the table has changed since the last time statistics were updated. Knowing the age of your statistics can provide you with part of the picture. You can use the following query to determine the last time your statistics were updated on each table.
100
+
There isn't a dynamic management view available to determine if data within the table has changed since the last time statistics were updated. Knowing the age of your statistics can provide you with part of the picture.
101
+
102
+
You can use the following query to determine the last time your statistics were updated on each table.
99
103
100
104
> [!NOTE]
101
105
> If there is a material change in the distribution of values for a column, you should update statistics regardless of the last time they were updated.
@@ -131,9 +135,11 @@ WHERE
131
135
132
136
Statistics on a gender column in a customer table might never need to be updated. Assuming the distribution is constant between customers, adding new rows to the table variation isn't going to change the data distribution.
133
137
134
-
But, if your data warehouse contains only one gender and a new requirement results in multiple genders, then you need to update statistics on the gender column. For further information, review the [Statistics](/sql/relational-databases/statistics/statistics) article.
138
+
But, if your data warehouse contains only one gender and a new requirement results in multiple genders, then you need to update statistics on the gender column.
135
139
136
-
### Implementing statistics management
140
+
For further information, review the [Statistics](/sql/relational-databases/statistics/statistics) article.
141
+
142
+
### Implement statistics management
137
143
138
144
It's often a good idea to extend your data-loading process to ensure that statistics are updated at the end of the load. The data load is when tables most frequently change their size, distribution of values, or both. As such, the load process is a logical place to implement some management processes.
139
145
@@ -269,6 +275,7 @@ CREATE STATISTICS stats_col3 on dbo.table3 (col3);
269
275
#### Use a stored procedure to create statistics on all columns in a database
270
276
271
277
SQL pool doesn't have a system stored procedure equivalent to sp_create_stats in SQL Server. This stored procedure creates a single column statistics object on every column of the database that doesn't already have statistics.
278
+
272
279
The following example will help you get started with your database design. Feel free to adapt it to your needs:
273
280
274
281
```sql
@@ -412,7 +419,9 @@ For example:
412
419
UPDATE STATISTICS dbo.table1;
413
420
```
414
421
415
-
The UPDATE STATISTICS statement is easy to use. Just remember that it updates *all* statistics on the table, prompting more work than is necessary. If performance isn't an issue, this method is the easiest and most complete way to guarantee that statistics are up to date.
422
+
The UPDATE STATISTICS statement is easy to use. Just remember that it updates *all* statistics on the table, prompting more work than is necessary.
423
+
424
+
If performance isn't an issue, this method is the easiest and most complete way to guarantee that statistics are up to date.
416
425
417
426
> [!NOTE]
418
427
> When updating all statistics on a table, SQL pool does a scan to sample the table for each statistics object. If the table is large and has many columns and many statistics, it might be more efficient to update individual statistics based on need.
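
As a sketch of the trade-off described in the note, the two statements below contrast a table-wide update with an update of one statistics object. `stats_col1` is an assumed statistics name on `dbo.table1`; substitute a statistics object that exists on your table.

```sql
-- Refresh every statistics object on the table.
-- Per the note above, this samples the table once per statistics object.
UPDATE STATISTICS dbo.table1;

-- Refresh a single statistics object instead.
-- stats_col1 is a hypothetical statistics name on dbo.table1.
UPDATE STATISTICS dbo.table1 (stats_col1);
```

On a large table with many statistics objects, the second form limits the work to the object whose underlying data actually changed.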
@@ -495,7 +504,9 @@ DBCC SHOW_STATISTICS() shows the data held within a statistics object. This data
495
504
- Density vector
496
505
- Histogram
497
506
498
-
The header is the metadata about the statistics. The histogram displays the distribution of values in the first key column of the statistics object. The density vector measures cross-column correlation. SQL pool computes cardinality estimates with any of the data in the statistics object.
507
+
The header is the metadata about the statistics. The histogram displays the distribution of values in the first key column of the statistics object.
508
+
509
+
The density vector measures cross-column correlation. SQL pool computes cardinality estimates with any of the data in the statistics object.
499
510
500
511
#### Show header, density, and histogram
501
512
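
A minimal sketch of how the three parts might be inspected, assuming a statistics object named `stats_col1` exists on `dbo.table1` (both names are illustrative here):

```sql
-- Show the header, density vector, and histogram together.
DBCC SHOW_STATISTICS (dbo.table1, stats_col1);

-- Show only selected parts of the statistics object.
DBCC SHOW_STATISTICS (dbo.table1, stats_col1)
    WITH HISTOGRAM, DENSITY_VECTOR;
```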
@@ -549,7 +560,11 @@ Statistics are created per particular column for particular dataset (storage pat
549
560
550
561
### Why use statistics
551
562
552
-
The more SQL on-demand (preview) knows about your data, the faster it can execute queries against it. Collecting statistics on your data is one of the most important things you can do to optimize your queries. The SQL on-demand query optimizer is a cost-based optimizer. It compares the cost of various query plans, and then chooses the plan with the lowest cost. In most cases, it chooses the plan that will execute the fastest. For example, if the optimizer estimates that the date your query is filtering on will return one row it will choose one plan. If it estimates that the selected date will return 1 million rows, it will return a different plan.
563
+
The more SQL on-demand (preview) knows about your data, the faster it can execute queries against it. Collecting statistics on your data is one of the most important things you can do to optimize your queries.
564
+
565
+
The SQL on-demand query optimizer is a cost-based optimizer. It compares the cost of various query plans, and then chooses the plan with the lowest cost. In most cases, it chooses the plan that will execute the fastest.
566
+
567
+
For example, if the optimizer estimates that the date your query is filtering on will return one row it will choose one plan. If it estimates that the selected date will return 1 million rows, it will return a different plan.
553
568
554
569
### Automatic creation of statistics
555
570
@@ -564,9 +579,11 @@ Automatic creation of statistics is done synchronously so you may incur slightly
564
579
565
580
### Manual creation of statistics
566
581
567
-
SQL on-demand lets you create statistics manually. For CSV files, you have to create statistics manually because automatic creation of statistics isn't turned on for CSV files. See the examples below for instructions on how to manually create statistics.
582
+
SQL on-demand lets you create statistics manually. For CSV files, you have to create statistics manually because automatic creation of statistics isn't turned on for CSV files.
583
+
584
+
See the following examples for instructions on how to manually create statistics.
568
585
569
-
### Updating statistics
586
+
### Update statistics
570
587
571
588
Changes to data in files, deleting, and adding files result in data distribution changes and makes statistics out of date. In that case, statistics needs to be updated.
572
589
@@ -586,9 +603,9 @@ When the number of rows has changed substantially, or there's a material change
586
603
> [!NOTE]
587
604
> If there is a material change in the distribution of values for a column, you should update statistics regardless of the last time they were updated.
588
605
589
-
### Implementing statistics management
606
+
### Implement statistics management
590
607
591
-
You may want to extend your data pipeline to ensure that statistics are updated when data is significantly altered through addition, deletion, or change of files.
608
+
You may want to extend your data pipeline to ensure that statistics are updated when data is significantly changed through addition, deletion, or change of files.
592
609
593
610
The following guiding principles are provided for updating your statistics:
594
611
@@ -756,12 +773,12 @@ external_table
756
773
Specifies external table that statistics should be created.
757
774
758
775
FULLSCAN
759
-
Compute statistics by scanning all rows. FULLSCAN and SAMPLE 100 PERCENT have the same results. FULLSCAN cannot be used with the SAMPLE option.
776
+
Compute statistics by scanning all rows. FULLSCAN and SAMPLE 100 PERCENT have the same results. FULLSCAN can't be used with the SAMPLE option.
760
777
761
778
SAMPLE number PERCENT
762
779
Specifies the approximate percentage or number of rows in the table or indexed view for the query optimizer to use when it creates statistics. Number can be from 0 through 100.
763
780
764
-
SAMPLE cannot be used with the FULLSCAN option.
781
+
SAMPLE can't be used with the FULLSCAN option.
765
782
766
783
> [!NOTE]
767
784
> CSV sampling does not work at this time, only FULLSCAN is supported for CSV.
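
The two options above could be used roughly as follows. This is a sketch under assumptions: the external table `census_ext`, its columns, and the statistics names are hypothetical, and the `NORECOMPUTE` option comes from the general T-SQL `CREATE STATISTICS` syntax rather than from this excerpt.

```sql
-- Scan every row. Per the note above, this is the only mode
-- currently supported for CSV-backed data.
CREATE STATISTICS stats_state_full
ON census_ext (state_name)
WITH FULLSCAN, NORECOMPUTE;

-- Sample roughly 20 percent of the rows instead.
-- FULLSCAN and SAMPLE can't be combined in one statement.
CREATE STATISTICS stats_county_sample
ON census_ext (county_name)
WITH SAMPLE 20 PERCENT, NORECOMPUTE;
```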