Commit c25339a

committed
additional best practices
1 parent 8b3b7c0 commit c25339a

2 files changed

+68
-6
lines changed

articles/synapse-analytics/sql/best-practices-sql-on-demand.md

Lines changed: 63 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -2,14 +2,14 @@
22
title: Best practices for SQL on-demand (preview) in Azure Synapse Analytics
33
description: Recommendations and best practices you should know as you work with SQL on-demand (preview).
44
services: synapse-analytics
5-
author: mlee3gsd
5+
author: filippopovic
66
manager: craigg
77
ms.service: synapse-analytics
88
ms.topic: conceptual
99
ms.subservice:
10-
ms.date: 04/15/2020
11-
ms.author: martinle
12-
ms.reviewer: igorstan
10+
ms.date: 05/01/2020
11+
ms.author: fipopovi
12+
ms.reviewer: jrasnick
1313
---
1414

1515
# Best practices for SQL on-demand (preview) in Azure Synapse Analytics
@@ -45,17 +45,75 @@ If possible, you can prepare files for better performance:
4545
- It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION.
4646
- Partition your data by storing partitions to different folders or file names - check [use filename and filepath functions to target specific partitions](#use-fileinfo-and-filepath-functions-to-target-specific-partitions).
4747

48+
## Use appropriate data types
49+
50+
The data types you use in your query affect performance. You can get better performance if you:
51+
52+
- Use the smallest data size that will accommodate the largest possible value.
53+
- If the maximum character value length is 30 characters, use a character data type of length 30.
54+
- If all character column values are of fixed size, use char or nchar. Otherwise, use varchar or nvarchar.
55+
- If the maximum integer column value is 500, use smallint, because it is the smallest data type that can accommodate this value. For the ranges of the integer data types, see [int, bigint, smallint, and tinyint](https://docs.microsoft.com/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql?view=sql-server-ver15).
56+
- If possible, use varchar and char instead of nvarchar and nchar.
57+
- Use integer-based data types if possible. Sort, join, and group by operations are performed faster on integer data than on character data.
58+
- If you are using schema inference, [check the inferred data types](#check-inferred-data-types).
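
One way to apply the first guideline is to measure the values that are actually stored before you choose a type. The query below is a minimal sketch, not part of the original article. It assumes the public taxi Parquet dataset and the `vendor_id` and `passenger_count` columns used later in this article, and the output column aliases are made up for illustration. Note that it scans every matching file, so run it once during exploration rather than routinely.

```sql
-- Sketch: inspect actual value ranges to pick the smallest suitable types.
-- LEN ignores trailing spaces, so treat the result as a lower bound for
-- fixed-width or padded character data.
SELECT
    MAX(LEN(vendor_id)) AS max_vendor_id_length,   -- guides the varchar(n) length
    MIN(passenger_count) AS min_passenger_count,   -- guides the integer type choice
    MAX(passenger_count) AS max_passenger_count
FROM
    OPENROWSET(
        BULK 'https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*',
        FORMAT='PARQUET'
    ) AS nyc;
```

If new data can contain longer or larger values than the current maximum, leave some headroom when you choose the length or integer type.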
59+
60+
## Check inferred data types
61+
62+
[Schema inference](query-parquet-files.md#automatic-schema-inference) helps you quickly write queries and explore data without knowing the file schema. This convenience comes at the expense of inferred data types that can be larger than the data actually requires. That happens when the source files don't contain enough information to determine the appropriate data type. For example, Parquet files do not contain metadata about the maximum character column length, so SQL on-demand infers such columns as varchar(8000).
63+
64+
You can check the resulting data types of your query by using [sp_describe_first_result_set](https://docs.microsoft.com/sql/relational-databases/system-stored-procedures/sp-describe-first-result-set-transact-sql?view=sql-server-ver15).
65+
66+
The following example shows how you can optimize inferred data types. The procedure is first used to show the inferred data types:
67+
```sql
68+
EXEC sp_describe_first_result_set N'
69+
SELECT
70+
vendor_id, pickup_datetime, passenger_count
71+
FROM
72+
OPENROWSET(
73+
BULK ''https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*'',
74+
FORMAT=''PARQUET''
75+
) AS nyc';
76+
```
77+
78+
Here is the result set.
79+
80+
|is_hidden|column_ordinal|name|system_type_name|max_length|
81+
|----------------|---------------------|----------|--------------------|-------------------|
82+
|0|1|vendor_id|varchar(8000)|8000|
83+
|0|2|pickup_datetime|datetime2(7)|8|
84+
|0|3|passenger_count|int|4|
85+
86+
Once you know the inferred data types for the query, you can specify appropriate data types:
87+
88+
```sql
89+
SELECT
90+
vendor_id, pickup_datetime, passenger_count
91+
FROM
92+
OPENROWSET(
93+
BULK 'https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*',
94+
FORMAT='PARQUET'
95+
)
96+
WITH (
97+
vendor_id varchar(4), -- we used length of 4 instead of inferred 8000
98+
pickup_datetime datetime2,
99+
passenger_count int
100+
) AS nyc;
101+
```
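
To confirm that the explicit schema took effect, you could run the same procedure against the query that includes the WITH clause. This is a sketch that only combines the two statements above. No result set is shown here because it has not been verified against the service.

```sql
-- Sketch: describe the result set of the query with the explicit schema.
-- Single quotes inside the N'...' string are doubled, as in the first example.
EXEC sp_describe_first_result_set N'
    SELECT
        vendor_id, pickup_datetime, passenger_count
    FROM
        OPENROWSET(
            BULK ''https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*'',
            FORMAT=''PARQUET''
        )
        WITH (
            vendor_id varchar(4),
            pickup_datetime datetime2,
            passenger_count int
        ) AS nyc';
```

Compare the `system_type_name` and `max_length` columns with the earlier output to check that `vendor_id` is no longer reported with the inferred length.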
102+
48103
## Use fileinfo and filepath functions to target specific partitions
49104

50105
Data is often organized in partitions. You can instruct SQL on-demand to query particular folders and files. Doing so reduces the number of files and the amount of data the query needs to read and process, which also improves performance.
51106

52107
For more information, check [filename](develop-storage-files-overview.md#filename-function) and [filepath](develop-storage-files-overview.md#filepath-function) functions and examples on how to [query specific files](query-specific-files.md).
53108

109+
> [!TIP]
110+
> Always cast the results of the filepath and filename functions to appropriate data types. If you use character data types, make sure an appropriate length is used.
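
The tip above can be sketched as follows. This example is illustrative only: the storage URL is a placeholder, and the `year=*/month=*` folder layout, the folder values in the filter, and the chosen target types are assumptions rather than details from this article. `filepath(n)` returns the part of the path that matches the n-th wildcard in the BULK path.

```sql
-- Sketch: cast partition values taken from the path to compact data types.
-- <storage-account> and <container> are placeholders; the folder layout
-- year=<yyyy>/month=<m> is assumed for illustration.
SELECT
    CAST(nyc.filepath(1) AS smallint) AS [year],    -- first wildcard: year folder value
    CAST(nyc.filepath(2) AS tinyint) AS [month],    -- second wildcard: month folder value
    nyc.passenger_count
FROM
    OPENROWSET(
        BULK 'https://<storage-account>.blob.core.windows.net/<container>/taxi/year=*/month=*/*.parquet',
        FORMAT='PARQUET'
    ) AS nyc
WHERE
    nyc.filepath(1) = '2017'                 -- target a single year folder
    AND nyc.filepath(2) IN ('1', '2', '3');  -- target three month folders
```

The cast is applied in the SELECT list so that downstream operations work with small integer columns, while the WHERE clause compares the raw path values to select folders.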
111+
54112
If your stored data isn't partitioned, consider partitioning it so you can use these functions to optimize queries targeting those files. When [querying partitioned Spark tables](develop-storage-files-spark-tables.md) from SQL on-demand, the query will automatically target only the files needed.
55113

56114
## Use CETAS to enhance query performance and joins
57115

58-
[CETAS](develop-tables-cetas.md) is one of the most important features available in SQL on-demand. CETAS is a parallel operation that creates external table metadata and exports the SELECT query results to a set of files in your storage account.
116+
[CETAS](develop-tables-cetas.md) is one of the most important features available in SQL on-demand. CETAS is a parallel operation that creates external table metadata and exports the SELECT query results to a set of files in your storage account.
59117

60118
You can use CETAS to store frequently used parts of queries, like joined reference tables, to a new set of files. Next, you can join to this single external table instead of repeating common joins in multiple queries.
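
A minimal CETAS sketch is shown below. The table name `dbo.VendorTripCounts`, the data source `MyDataSource`, the file format `MyParquetFormat`, and the output folder are hypothetical; the external data source and the external file format are assumed to exist already. The query reuses the taxi dataset from the earlier examples and stores an aggregate rather than a join, purely to keep the sketch short.

```sql
-- Sketch: materialize a frequently used query result as an external table.
-- dbo.VendorTripCounts, MyDataSource, and MyParquetFormat are placeholder names.
CREATE EXTERNAL TABLE dbo.VendorTripCounts
WITH (
    LOCATION = 'vendor_trip_counts/',   -- output folder within the data source
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = MyParquetFormat
)
AS
SELECT
    vendor_id,
    COUNT_BIG(*) AS trip_count
FROM
    OPENROWSET(
        BULK 'https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*',
        FORMAT='PARQUET'
    )
    WITH (
        vendor_id varchar(4)            -- explicit type instead of inferred varchar(8000)
    ) AS nyc
GROUP BY
    vendor_id;
```

Later queries can then read from or join to `dbo.VendorTripCounts` instead of repeating the scan and aggregation over the source files.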
61119

articles/synapse-analytics/sql/develop-storage-files-overview.md

Lines changed: 5 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -122,10 +122,12 @@ BULK N'path_to_file(s)', FORMAT='PARQUET');
122122

123123
### Filename function
124124

125-
This function returns the file name that the row originates from.
125+
This function returns the file name that the row originates from.
126126

127127
To query specific files, read the Filename section in the [Query specific files](query-specific-files.md#filename) article.
128128

129+
The return data type is nvarchar(1024). For optimal performance, always cast the result of the filename function to an appropriate data type. If you use a character data type, make sure an appropriate length is used.
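
As a sketch of this recommendation, the query below counts rows per source file and casts the file name to a narrower type. The `varchar(100)` length is an assumption made for illustration; choose a length that fits the longest file name in your storage. The dataset URL is the taxi sample used in the best practices article changed in this commit.

```sql
-- Sketch: cast the filename() result instead of using the default nvarchar(1024).
-- varchar(100) is an assumed length, not a documented requirement.
SELECT
    CAST(nyc.filename() AS varchar(100)) AS source_file,
    COUNT_BIG(*) AS row_count
FROM
    OPENROWSET(
        BULK 'https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*',
        FORMAT='PARQUET'
    ) AS nyc
GROUP BY
    CAST(nyc.filename() AS varchar(100));   -- group on the same cast expression
```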
130+
129131
### Filepath function
130132

131133
This function returns a full path or a part of path:
@@ -135,6 +137,8 @@ This function returns a full path or a part of path:
135137

136138
For additional information, read the Filepath section of the [Query specific files](query-specific-files.md#filepath) article.
137139

140+
The return data type is nvarchar(1024). For optimal performance, always cast the result of the filepath function to an appropriate data type. If you use a character data type, make sure an appropriate length is used.
141+
138142
### Work with complex types and nested or repeated data structures
139143

140144
To enable a smooth experience when working with data stored in nested or repeated data types, such as in [Parquet](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types) files, SQL on-demand has added the extensions below.
