Commit 989126f

Merge pull request #113508 from filippopovic/sqlod-bestpractices-castvirtcolumns
Sqlod bestpractices
2 parents 97e66f7 + 9299c88 commit 989126f

2 files changed

+74
-6
lines changed

articles/synapse-analytics/sql/best-practices-sql-on-demand.md

Lines changed: 67 additions & 5 deletions
@@ -2,14 +2,14 @@
22
title: Best practices for SQL on-demand (preview) in Azure Synapse Analytics
33
description: Recommendations and best practices you should know as you work with SQL on-demand (preview).
44
services: synapse-analytics
5-
author: mlee3gsd
5+
author: filippopovic
66
manager: craigg
77
ms.service: synapse-analytics
88
ms.topic: conceptual
99
ms.subservice:
10-
ms.date: 04/15/2020
11-
ms.author: martinle
12-
ms.reviewer: igorstan
10+
ms.date: 05/01/2020
11+
ms.author: fipopovi
12+
ms.reviewer: jrasnick
1313
---
1414

1515
# Best practices for SQL on-demand (preview) in Azure Synapse Analytics
@@ -45,17 +45,79 @@ If possible, you can prepare files for better performance:
4545
- It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION.
4646
- Partition your data by storing partitions to different folders or file names - check [use filename and filepath functions to target specific partitions](#use-fileinfo-and-filepath-functions-to-target-specific-partitions).
4747

48+
## Push wildcards to lower levels in path
49+
50+
You can use wildcards in your path to [query multiple files and folders](develop-storage-files-overview.md#query-multiple-files-or-folders). SQL on-demand uses the storage API to list the files in your storage account, starting from the first `*`, and then eliminates the files that do not match the specified path. Reducing this initial list of files can improve performance when many files match the specified path up to the first wildcard.
51+
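As a rough sketch of the difference, assume the taxi data is laid out in `year=<yyyy>/month=<mm>` folders. This layout is an assumption made for illustration, so adjust the path to your own storage. Placing the first wildcard below the year folder means that only the files under `year=2017` are listed:

```sql
-- Assumed (hypothetical) layout: .../parquet/taxi/year=<yyyy>/month=<mm>/<file>.parquet
-- A path such as .../parquet/taxi/*/*/* would list every file under taxi/
-- before any of them could be eliminated.
-- Here the first wildcard appears only after year=2017, so the listing
-- starts in that folder.
SELECT
    COUNT_BIG(*) AS trip_count
FROM
    OPENROWSET(
        BULK 'https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/year=2017/month=*/*.parquet',
        FORMAT='PARQUET'
    ) AS nyc;
```

The benefit grows with the number of files that sit above the first wildcard. With only a handful of files, the two forms are likely to perform about the same.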
52+
## Use appropriate data types
53+
54+
The data types used in your query affect performance. You can get better performance if you:
55+
56+
- Use the smallest data size that will accommodate the largest possible value.
  - If the maximum character value length is 30 characters, use a character data type of length 30.
  - If all character column values are of a fixed size, use char or nchar. Otherwise, use varchar or nvarchar.
  - If the maximum integer column value is 500, use smallint, as it is the smallest data type that can accommodate this value. You can find integer data type ranges [here](https://docs.microsoft.com/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql?view=sql-server-ver15).
- If possible, use varchar and char instead of nvarchar and nchar.
- Use integer-based data types if possible. Sort, join, and group by operations are performed faster on integers than on character data.
- If you are using schema inference, [check inferred data types](#check-inferred-data-types).
63+
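The recommendations above can be sketched with an explicit `WITH` clause on a CSV file. The file path, column names, and value ranges below are hypothetical and chosen only to illustrate the type choices:

```sql
-- Hypothetical CSV file and columns, used only to illustrate type selection.
SELECT
    country_code, country_name, [year], population
FROM
    OPENROWSET(
        BULK 'https://sqlondemandstorage.blob.core.windows.net/csv/population/population.csv',
        FORMAT = 'CSV',
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n'
    )
WITH (
    country_code char(2),       -- assumed fixed-size code, so char instead of varchar
    country_name varchar(100),  -- assumed maximum length of 100; varchar instead of nvarchar
    [year] smallint,            -- assumed to fit in the smallint range
    population bigint           -- assumed to exceed the int range for some rows
) AS r;
```

Whether varchar is safe instead of nvarchar depends on the characters that appear in your data, so check a sample before narrowing the type.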
64+
## Check inferred data types
65+
66+
[Schema inference](query-parquet-files.md#automatic-schema-inference) helps you quickly write queries and explore data without knowing the file schema. This convenience comes at a cost: inferred data types can be larger than the data actually requires. This happens when the source files do not contain enough information to determine the appropriate data type. For example, Parquet files do not contain metadata about the maximum character column length, so SQL on-demand infers such columns as varchar(8000).
67+
68+
You can check the resulting data types of your query by using [sp_describe_first_result_set](https://docs.microsoft.com/sql/relational-databases/system-stored-procedures/sp-describe-first-result-set-transact-sql?view=sql-server-ver15).
69+
70+
The following example shows how you can optimize inferred data types. The procedure is used to show the inferred data types.
71+
```sql
EXEC sp_describe_first_result_set N'
    SELECT
        vendor_id, pickup_datetime, passenger_count
    FROM
        OPENROWSET(
            BULK ''https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*'',
            FORMAT=''PARQUET''
        ) AS nyc';
```
81+
82+
Here is the result set.
83+
84+
|is_hidden|column_ordinal|name|system_type_name|max_length|
|----------------|---------------------|----------|--------------------|-------------------|
|0|1|vendor_id|varchar(8000)|8000|
|0|2|pickup_datetime|datetime2(7)|8|
|0|3|passenger_count|int|4|
89+
90+
Once you know the inferred data types for the query, you can specify appropriate data types:
91+
92+
```sql
SELECT
    vendor_id, pickup_datetime, passenger_count
FROM
    OPENROWSET(
        BULK 'https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*',
        FORMAT='PARQUET'
    )
    WITH (
        vendor_id varchar(4), -- we used length of 4 instead of inferred 8000
        pickup_datetime datetime2,
        passenger_count int
    ) AS nyc;
```
106+
48107
## Use fileinfo and filepath functions to target specific partitions
49108

50109
Data is often organized in partitions. You can instruct SQL on-demand to query particular folders and files. Doing so reduces the number of files and the amount of data the query needs to read and process, which in turn improves performance.
51110

52111
For more information, check [filename](develop-storage-files-overview.md#filename-function) and [filepath](develop-storage-files-overview.md#filepath-function) functions and examples on how to [query specific files](query-specific-files.md).
53112

113+
> [!TIP]
114+
> Always cast the result of the filepath and filename functions to appropriate data types. If you use character data types, make sure an appropriate length is used.
115+
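A minimal sketch of the tip above, assuming a hypothetical `year=<yyyy>/month=<mm>` folder layout so that `filepath(1)` and `filepath(2)` return the year and month portions of the path:

```sql
-- Assumed (hypothetical) layout: .../parquet/taxi/year=<yyyy>/month=<mm>/<file>.parquet
-- filepath(n) returns the part of the path matched by the n-th wildcard.
SELECT
    CAST(nyc.filepath(1) AS smallint) AS [year],   -- cast from nvarchar(1024) to a small integer type
    CAST(nyc.filepath(2) AS tinyint) AS [month],
    COUNT_BIG(*) AS trip_count
FROM
    OPENROWSET(
        BULK 'https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/year=*/month=*/*.parquet',
        FORMAT='PARQUET'
    ) AS nyc
WHERE
    nyc.filepath(1) = '2017'   -- targets only the year=2017 partition
GROUP BY
    CAST(nyc.filepath(1) AS smallint),
    CAST(nyc.filepath(2) AS tinyint);
```

The integer casts only work if every matched folder name is numeric. If the wildcard can match non-numeric text, cast to a character type of suitable length instead.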
54116
If your stored data isn't partitioned, consider partitioning it so you can use these functions to optimize queries targeting those files. When [querying partitioned Spark tables](develop-storage-files-spark-tables.md) from SQL on-demand, the query will automatically target only the files needed.
55117

56118
## Use CETAS to enhance query performance and joins
57119

58-
[CETAS](develop-tables-cetas.md) is one of the most important features available in SQL on-demand. CETAS is a parallel operation that creates external table metadata and exports the SELECT query results to a set of files in your storage account.
120+
[CETAS](develop-tables-cetas.md) is one of the most important features available in SQL on-demand. CETAS is a parallel operation that creates external table metadata and exports the SELECT query results to a set of files in your storage account.
59121

60122
You can use CETAS to store frequently used parts of queries, like joined reference tables, to a new set of files. Next, you can join to this single external table instead of repeating common joins in multiple queries.
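
As an illustration, a CETAS statement might look like the sketch below. The table name, `LOCATION`, `MyDataSource`, and `ParquetFormat` are hypothetical. The external data source and file format are assumed to exist already, and the aggregation stands in for whatever frequently used part of a query you want to materialize:

```sql
-- Hypothetical names: MyDataSource and ParquetFormat must already exist
-- (created with CREATE EXTERNAL DATA SOURCE and CREATE EXTERNAL FILE FORMAT).
CREATE EXTERNAL TABLE trips_per_vendor
WITH (
    LOCATION = 'aggregated/trips_per_vendor/',  -- hypothetical output folder
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT
    vendor_id,
    COUNT_BIG(*) AS trip_count
FROM
    OPENROWSET(
        BULK 'https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/*/*/*',
        FORMAT='PARQUET'
    )
    WITH (
        vendor_id varchar(4)
    ) AS nyc
GROUP BY
    vendor_id;
```

Later queries can then read from `trips_per_vendor` instead of repeating the aggregation over the source files.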
61123

articles/synapse-analytics/sql/develop-storage-files-overview.md

Lines changed: 7 additions & 1 deletion
@@ -120,12 +120,16 @@ OPENROWSET(
120120
BULK N'path_to_file(s)', FORMAT='PARQUET');
121121
```
122122

123+
Make sure [appropriate inferred data types](best-practices-sql-on-demand.md#check-inferred-data-types) are used for optimal performance.
124+
123125
### Filename function
124126

125-
This function returns the file name that the row originates from.
127+
This function returns the file name that the row originates from.
126128

127129
To query specific files, read the Filename section in the [Query specific files](query-specific-files.md#filename) article.
128130

131+
The return data type is nvarchar(1024). For optimal performance, always cast the result of the filename function to an appropriate data type. If you use a character data type, make sure an appropriate length is used.
132+
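For example, the cast could look like the sketch below. The folder path and the length of 100 are assumptions, so pick a length that fits your actual file names:

```sql
-- Hypothetical path; varchar(100) is an assumed upper bound on file name length.
SELECT
    CAST(nyc.filename() AS varchar(100)) AS [file_name],
    COUNT_BIG(*) AS row_count
FROM
    OPENROWSET(
        BULK 'https://sqlondemandstorage.blob.core.windows.net/parquet/taxi/year=2017/month=9/*.parquet',
        FORMAT='PARQUET'
    ) AS nyc
GROUP BY
    CAST(nyc.filename() AS varchar(100));
```

Casting to varchar assumes the file names contain only characters that the varchar type can represent. Keep nvarchar with a shorter length if that does not hold.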
129133
### Filepath function
130134

131135
This function returns a full path or a part of path:
@@ -135,6 +139,8 @@ This function returns a full path or a part of path:
135139

136140
For additional information, read the Filepath section of the [Query specific files](query-specific-files.md#filepath) article.
137141

142+
The return data type is nvarchar(1024). For optimal performance, always cast the result of the filepath function to an appropriate data type. If you use a character data type, make sure an appropriate length is used.
143+
138144
### Work with complex types and nested or repeated data structures
139145

140146
To enable a smooth experience when working with data stored in nested or repeated data types, such as in [Parquet](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types) files, SQL on-demand has added the extensions below.
