Merge pull request #300357 from jovanpop-msft/patch-498514

prmerger-automator[bot] · web-flow · commit 311b27826316 · 2025-05-26T13:17:30.000Z
Best practices - optimize delta lake
diff --git a/articles/synapse-analytics/sql/best-practices-serverless-sql-pool.md b/articles/synapse-analytics/sql/best-practices-serverless-sql-pool.md
@@ -45,6 +45,10 @@ To minimize latency, colocate your Azure Storage account or Azure Cosmos DB anal
 
 For optimal performance, if you access other storage accounts with serverless SQL pool, make sure they're in the same region. If they aren't in the same region, there will be increased latency for the data's network transfer between the remote region and the endpoint's region.
 
+### Colocate your Azure Cosmos DB analytical storage and serverless SQL pool
+
+Make sure your Azure Cosmos DB analytical storage is placed in the same region as your Azure Synapse workspace. Cross-region queries might cause high latency. Use the region property in the connection string to explicitly specify the region where the analytical store is placed (see [Query Azure Cosmos DB by using serverless SQL pool](query-cosmos-db-analytical-store.md#overview)): `account=<database account name>;database=<database name>;region=<region name>`
+
 ### Azure Storage throttling
 
 Multiple applications and services might access your storage account. Storage throttling occurs when the combined IOPS or throughput generated by applications, services, and serverless SQL pool workloads exceeds the limits of the storage account. As a result, you'll experience a significant negative effect on query performance.
@@ -64,10 +68,6 @@ If possible, you can prepare files for better performance:
 - It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION.
 - Partition your data by storing partitions to different folders or file names. See [Use filename and filepath functions to target specific partitions](#use-filename-and-filepath-functions-to-target-specific-partitions).
 
-### Colocate your Azure Cosmos DB analytical storage and serverless SQL pool
-
-Make sure your Azure Cosmos DB analytical storage is placed in the same region as an Azure Synapse workspace. Cross-region queries might cause huge latencies. Use the region property in the connection string to explicitly specify the region where the analytical store is placed (see [Query Azure Cosmos DB by using serverless SQL pool](query-cosmos-db-analytical-store.md#overview)): `account=<database account name>;database=<database name>;region=<region name>'`
-
 ## CSV optimizations
 
 Here are best practices for using CSV files in serverless SQL pool.
@@ -80,6 +80,22 @@ You can use a performance-optimized parser when you query CSV files. For details
 
 Serverless SQL pool relies on statistics to generate optimal query execution plans. Statistics are automatically created for columns using sampling and in most cases sampling percentage will be less than 100%. This flow is the same for every file format. Have in mind that when reading CSV with parser version 1.0 sampling isn't supported and automatic creation of statistics won't happen with sampling percentage less than 100%. For small tables with estimated low cardinality (number of rows) automatic statistics creation will be triggered with sampling percentage of 100%. That means that fullscan is triggered and automatic statistics are created even for CSV with parser version 1.0. In case statistics aren't automatically created, create statistics manually for columns that you use in queries, particularly those used in DISTINCT, JOIN, WHERE, ORDER BY, and GROUP BY. Check [statistics in serverless SQL pool](develop-tables-statistics.md#statistics-in-serverless-sql-pool) for details.
 
+## Delta Lake optimizations
+
+Here are best practices for using Delta Lake files in serverless SQL pool.
+
+### Optimize checkpoints
+
+Query performance for the Delta Lake format depends on the number of JSON files in the `_delta_log` directory, because a reader must process every JSON commit file written after the latest checkpoint. To ensure optimal performance, avoid accumulating too many JSON files. Ideally, the log contains only the latest Parquet checkpoint file with no additional JSON files. However, this setup might not be optimal for write-heavy workloads, because creating a checkpoint after every commit adds overhead for writers.
+
+A balanced approach is to maintain around 10 JSON files between checkpoints, which typically offers good performance for both readers and writers. Be cautious of configurations that delay checkpoint creation, as they can lead to excessive JSON file accumulation and degrade query performance.
+
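+Set the following table property in the engine that writes the Delta Lake table (for example, Apache Spark) to ensure a checkpoint is created after every 10 commits, that is, after every 10 JSON log files: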
+
+```sql
+ALTER TABLE tableName SET TBLPROPERTIES ('delta.checkpointInterval' = '10')
+```
+
 ## Data types
 
 Here are best practices for using data types in serverless SQL pool.