Commit b5850f1

Merge pull request #115326 from Kat-Campise/sql_articles_5
sql articles 5
2 parents a308d4c + 9a3cc70 commit b5850f1

2 files changed: +96 -79 lines changed

articles/synapse-analytics/sql/develop-tables-statistics.md

Lines changed: 31 additions & 14 deletions
@@ -24,11 +24,13 @@ The more the SQL pool resource knows about your data, the faster it can execute

 The SQL pool query optimizer is a cost-based optimizer. It compares the cost of various query plans, and then chooses the plan with the lowest cost. In most cases, it chooses the plan that will execute the fastest.

-For example, if the optimizer estimates that the date your query is filtering on will return one row it will choose one plan. If it estimates that the selected date will return 1 million rows, it will return a different plan.
+For example, if the optimizer estimates that the date your query is filtering on will return one row, it will choose one plan. If it estimates that the selected date will return 1 million rows, it will return a different plan.

 ### Automatic creation of statistics

-SQL pool will analyze incoming user queries for missing statistics when the database AUTO_CREATE_STATISTICS option is set to `ON`. If statistics are missing, the query optimizer creates statistics on individual columns in the query predicate or join condition. This function is used to improve cardinality estimates for the query plan.
+SQL pool will analyze incoming user queries for missing statistics when the database AUTO_CREATE_STATISTICS option is set to `ON`. If statistics are missing, the query optimizer creates statistics on individual columns in the query predicate or join condition.
+
+This function is used to improve cardinality estimates for the query plan.

 > [!IMPORTANT]
 > Automatic creation of statistics is currently turned on by default.
@@ -95,7 +97,9 @@ One of the first questions to ask when you're troubleshooting a query is, **"Are

 This question isn't one that can be answered by the age of the data. An up-to-date statistics object might be old if there's been no material change to the underlying data. When the number of rows has changed substantially, or a material change in the distribution of values for a column occurs, *then* it's time to update statistics.

-There isn't a dynamic management view available to determine if data within the table has changed since the last time statistics were updated. Knowing the age of your statistics can provide you with part of the picture. You can use the following query to determine the last time your statistics were updated on each table.
+There isn't a dynamic management view available to determine if data within the table has changed since the last time statistics were updated. Knowing the age of your statistics can provide you with part of the picture.
+
+You can use the following query to determine the last time your statistics were updated on each table.

 > [!NOTE]
 > If there is a material change in the distribution of values for a column, you should update statistics regardless of the last time they were updated.
@@ -131,9 +135,11 @@ WHERE

 Statistics on a gender column in a customer table might never need to be updated. Assuming the distribution is constant between customers, adding new rows to the table variation isn't going to change the data distribution.

-But, if your data warehouse contains only one gender and a new requirement results in multiple genders, then you need to update statistics on the gender column. For further information, review the [Statistics](/sql/relational-databases/statistics/statistics) article.
+But, if your data warehouse contains only one gender and a new requirement results in multiple genders, then you need to update statistics on the gender column.

-### Implementing statistics management
+For further information, review the [Statistics](/sql/relational-databases/statistics/statistics) article.
+
+### Implement statistics management

 It's often a good idea to extend your data-loading process to ensure that statistics are updated at the end of the load. The data load is when tables most frequently change their size, distribution of values, or both. As such, the load process is a logical place to implement some management processes.

@@ -269,6 +275,7 @@ CREATE STATISTICS stats_col3 on dbo.table3 (col3);
 #### Use a stored procedure to create statistics on all columns in a database

 SQL pool doesn't have a system stored procedure equivalent to sp_create_stats in SQL Server. This stored procedure creates a single column statistics object on every column of the database that doesn't already have statistics.
+
 The following example will help you get started with your database design. Feel free to adapt it to your needs:

 ```sql
@@ -412,7 +419,9 @@ For example:
 UPDATE STATISTICS dbo.table1;
 ```

-The UPDATE STATISTICS statement is easy to use. Just remember that it updates *all* statistics on the table, prompting more work than is necessary. If performance isn't an issue, this method is the easiest and most complete way to guarantee that statistics are up to date.
+The UPDATE STATISTICS statement is easy to use. Just remember that it updates *all* statistics on the table, prompting more work than is necessary.
+
+If performance isn't an issue, this method is the easiest and most complete way to guarantee that statistics are up to date.

 > [!NOTE]
 > When updating all statistics on a table, SQL pool does a scan to sample the table for each statistics object. If the table is large and has many columns and many statistics, it might be more efficient to update individual statistics based on need.
@@ -495,7 +504,9 @@ DBCC SHOW_STATISTICS() shows the data held within a statistics object. This data
 - Density vector
 - Histogram

-The header is the metadata about the statistics. The histogram displays the distribution of values in the first key column of the statistics object. The density vector measures cross-column correlation. SQL pool computes cardinality estimates with any of the data in the statistics object.
+The header is the metadata about the statistics. The histogram displays the distribution of values in the first key column of the statistics object.
+
+The density vector measures cross-column correlation. SQL pool computes cardinality estimates with any of the data in the statistics object.

 #### Show header, density, and histogram

@@ -549,7 +560,11 @@ Statistics are created per particular column for particular dataset (storage pat

 ### Why use statistics

-The more SQL on-demand (preview) knows about your data, the faster it can execute queries against it. Collecting statistics on your data is one of the most important things you can do to optimize your queries. The SQL on-demand query optimizer is a cost-based optimizer. It compares the cost of various query plans, and then chooses the plan with the lowest cost. In most cases, it chooses the plan that will execute the fastest. For example, if the optimizer estimates that the date your query is filtering on will return one row it will choose one plan. If it estimates that the selected date will return 1 million rows, it will return a different plan.
+The more SQL on-demand (preview) knows about your data, the faster it can execute queries against it. Collecting statistics on your data is one of the most important things you can do to optimize your queries.
+
+The SQL on-demand query optimizer is a cost-based optimizer. It compares the cost of various query plans, and then chooses the plan with the lowest cost. In most cases, it chooses the plan that will execute the fastest.
+
+For example, if the optimizer estimates that the date your query is filtering on will return one row it will choose one plan. If it estimates that the selected date will return 1 million rows, it will return a different plan.

 ### Automatic creation of statistics

@@ -564,9 +579,11 @@ Automatic creation of statistics is done synchronously so you may incur slightly

 ### Manual creation of statistics

-SQL on-demand lets you create statistics manually. For CSV files, you have to create statistics manually because automatic creation of statistics isn't turned on for CSV files. See the examples below for instructions on how to manually create statistics.
+SQL on-demand lets you create statistics manually. For CSV files, you have to create statistics manually because automatic creation of statistics isn't turned on for CSV files.
+
+See the following examples for instructions on how to manually create statistics.

-### Updating statistics
+### Update statistics

 Changes to data in files, deleting, and adding files result in data distribution changes and makes statistics out of date. In that case, statistics needs to be updated.

@@ -586,9 +603,9 @@ When the number of rows has changed substantially, or there's a material change
 > [!NOTE]
 > If there is a material change in the distribution of values for a column, you should update statistics regardless of the last time they were updated.

-### Implementing statistics management
+### Implement statistics management

-You may want to extend your data pipeline to ensure that statistics are updated when data is significantly altered through addition, deletion, or change of files.
+You may want to extend your data pipeline to ensure that statistics are updated when data is significantly changed through addition, deletion, or change of files.

 The following guiding principles are provided for updating your statistics:

@@ -756,12 +773,12 @@ external_table
 Specifies external table that statistics should be created.

 FULLSCAN
-Compute statistics by scanning all rows. FULLSCAN and SAMPLE 100 PERCENT have the same results. FULLSCAN cannot be used with the SAMPLE option.
+Compute statistics by scanning all rows. FULLSCAN and SAMPLE 100 PERCENT have the same results. FULLSCAN can't be used with the SAMPLE option.

 SAMPLE number PERCENT
 Specifies the approximate percentage or number of rows in the table or indexed view for the query optimizer to use when it creates statistics. Number can be from 0 through 100.

-SAMPLE cannot be used with the FULLSCAN option.
+SAMPLE can't be used with the FULLSCAN option.

 > [!NOTE]
 > CSV sampling does not work at this time, only FULLSCAN is supported for CSV.
