Commit 311b278

Merge pull request #300357 from jovanpop-msft/patch-498514
Best practices - optimize delta lake
2 parents 6e5e7d8 + 7c7e31a commit 311b278

1 file changed: +20 -4 lines changed


articles/synapse-analytics/sql/best-practices-serverless-sql-pool.md

Lines changed: 20 additions & 4 deletions
@@ -45,6 +45,10 @@ To minimize latency, colocate your Azure Storage account or Azure Cosmos DB anal
4545

4646
For optimal performance, if you access other storage accounts with serverless SQL pool, make sure they're in the same region. If they aren't in the same region, there will be increased latency for the data's network transfer between the remote region and the endpoint's region.
4747

48+
### Colocate your Azure Cosmos DB analytical storage and serverless SQL pool
49+
50+
Make sure your Azure Cosmos DB analytical storage is placed in the same region as your Azure Synapse workspace. Cross-region queries might cause high latency. Use the region property in the connection string to explicitly specify the region where the analytical store is placed (see [Query Azure Cosmos DB by using serverless SQL pool](query-cosmos-db-analytical-store.md#overview)): `account=<database account name>;database=<database name>;region=<region name>`
51+
4852
### Azure Storage throttling
4953

5054
Multiple applications and services might access your storage account. Storage throttling occurs when the combined IOPS or throughput generated by applications, services, and serverless SQL pool workloads exceeds the limits of the storage account. As a result, you'll experience a significant negative effect on query performance.
@@ -64,10 +68,6 @@ If possible, you can prepare files for better performance:
6468
- It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION.
6569
- Partition your data by storing partitions to different folders or file names. See [Use filename and filepath functions to target specific partitions](#use-filename-and-filepath-functions-to-target-specific-partitions).
6670

67-
### Colocate your Azure Cosmos DB analytical storage and serverless SQL pool
68-
69-
Make sure your Azure Cosmos DB analytical storage is placed in the same region as your Azure Synapse workspace. Cross-region queries might cause high latency. Use the region property in the connection string to explicitly specify the region where the analytical store is placed (see [Query Azure Cosmos DB by using serverless SQL pool](query-cosmos-db-analytical-store.md#overview)): `account=<database account name>;database=<database name>;region=<region name>`
70-
7171
## CSV optimizations
7272

7373
Here are best practices for using CSV files in serverless SQL pool.
@@ -80,6 +80,22 @@ You can use a performance-optimized parser when you query CSV files. For details
8080

8181
Serverless SQL pool relies on statistics to generate optimal query execution plans. Statistics are automatically created for columns by using sampling, and in most cases the sampling percentage is less than 100%. This flow is the same for every file format. Keep in mind that sampling isn't supported when reading CSV with parser version 1.0, so automatic statistics creation doesn't happen there with a sampling percentage below 100%. For small tables with an estimated low cardinality (number of rows), automatic statistics creation is triggered with a sampling percentage of 100%. That means a full scan is triggered and statistics are created automatically, even for CSV with parser version 1.0. If statistics aren't created automatically, create them manually for the columns that you use in queries, particularly those used in DISTINCT, JOIN, WHERE, ORDER BY, and GROUP BY. Check [statistics in serverless SQL pool](develop-tables-statistics.md#statistics-in-serverless-sql-pool) for details.
8282

83+
## Delta Lake optimizations
84+
85+
Here are best practices for using Delta Lake files in serverless SQL pool.
86+
87+
### Optimize checkpoints
88+
89+
Query performance for the Delta Lake format depends on the number of JSON files in the `_delta_log` directory. To ensure optimal performance, avoid accumulating too many JSON files. Ideally, the log should contain only the latest Parquet checkpoint file with no additional JSON files. However, this setup may not be optimal for write-heavy workloads.
90+
91+
A balanced approach is to maintain around 10 JSON files between checkpoints, which typically offers good performance for both readers and writers. Be cautious of configurations that delay checkpoint creation, as they can lead to excessive JSON file accumulation and degrade query performance.
92+
93+
Set the following table property to ensure a checkpoint is created after every 10 JSON log files:
94+
95+
```sql
96+
ALTER TABLE tableName SET TBLPROPERTIES ('delta.checkpointInterval' = '10')
97+
```
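To check whether a table is close to the "around 10 JSON files" target, you can count the JSON commit files that are newer than the latest checkpoint. The following Python sketch does that for a list of file names taken from a `_delta_log` directory. It assumes the classic Delta log naming scheme (a 20-digit, zero-padded version followed by `.json`, `.checkpoint.parquet`, or a multi-part `.checkpoint.<part>.<parts>.parquet` suffix) and doesn't handle other checkpoint naming variants. The function name and the sample listing are hypothetical and only illustrate the idea.

```python
import re

# Assumed naming: 20-digit, zero-padded table version plus a suffix.
_COMMIT = re.compile(r"^(\d{20})\.json$")
_CHECKPOINT = re.compile(r"^(\d{20})\.checkpoint(\.\d+\.\d+)?\.parquet$")


def json_commits_since_checkpoint(log_file_names):
    """Count JSON commit files newer than the latest Parquet checkpoint.

    log_file_names: base names of the files in a table's _delta_log directory.
    If no checkpoint is found, every JSON commit is counted.
    """
    commits = []
    latest_checkpoint = -1
    for name in log_file_names:
        match = _COMMIT.match(name)
        if match:
            commits.append(int(match.group(1)))
            continue
        match = _CHECKPOINT.match(name)
        if match:
            latest_checkpoint = max(latest_checkpoint, int(match.group(1)))
    # Only commits after the checkpoint version are counted.
    return sum(1 for version in commits if version > latest_checkpoint)


# Hypothetical listing of a _delta_log directory.
example = [
    "00000000000000000009.json",
    "00000000000000000010.json",
    "00000000000000000010.checkpoint.parquet",
    "00000000000000000011.json",
    "00000000000000000012.json",
    "_last_checkpoint",
]
print(json_commits_since_checkpoint(example))  # prints 2
```

A result that stays well above 10 suggests that checkpoints are being created less often than recommended here, so the `delta.checkpointInterval` setting is worth reviewing.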
98+
8399
## Data types
84100

85101
Here are best practices for using data types in serverless SQL pool.
