articles/synapse-analytics/sql/query-delta-lake-format.md
40 additions & 37 deletions
@@ -5,7 +5,7 @@ services: synapse analytics
ms.service: azure-synapse-analytics
ms.topic: how-to
ms.subservice: sql
-ms.date: 12/17/2024
+ms.date: 02/10/2025
author: jovanpop-msft
ms.author: jovanpop
ms.reviewer: whhender, wiassaf
@@ -18,7 +18,7 @@ Delta Lake is an open-source storage layer that brings ACID (atomicity, consiste
You can learn more from the [how to query delta lake tables video](https://www.youtube.com/watch?v=LSIVX0XxVfc).

> [!IMPORTANT]
-> The serverless SQL pools can query [Delta Lake version 1.0](https://github.com/delta-io/delta/releases/tag/v1.0.1). The changes that are introduced since the [Delta Lake 1.2](https://github.com/delta-io/delta/releases/tag/v1.2.0) version like renaming columns are not supported in serverless. If you are using the higher versions of Delta with delete vectors, v2 checkpoints, and others, you should consider using other query engine like [Microsoft Fabric SQL endpoint for Lakehouses](/fabric/data-engineering/lakehouse-sql-analytics-endpoint).
+> Serverless SQL pools can query [Delta Lake version 1.0](https://github.com/delta-io/delta/releases/tag/v1.0.1). Changes introduced since [Delta Lake 1.2](https://github.com/delta-io/delta/releases/tag/v1.2.0), such as renaming columns, aren't supported in serverless. If you use later Delta versions with features such as deletion vectors or v2 checkpoints, consider another query engine, such as the [Microsoft Fabric SQL endpoint for Lakehouses](/fabric/data-engineering/lakehouse-sql-analytics-endpoint).

The serverless SQL pool in Synapse workspace enables you to read the data stored in Delta Lake format, and serve it to reporting tools.
A serverless SQL pool can read Delta Lake files that are created using Apache Spark, Azure Databricks, or any other producer of the Delta Lake format.
@@ -28,18 +28,49 @@ Apache Spark pools in Azure Synapse enable data engineers to modify Delta Lake f
> [!IMPORTANT]
> Querying Delta Lake format using the serverless SQL pool is **Generally available** functionality. However, querying Spark Delta tables is still in public preview and not production ready. There are known issues that might happen if you query Delta tables created using the Spark pools. See the known issues in [Serverless SQL pool self-help](resources-self-help-sql-on-demand.md#delta-lake).

-## Quickstart example
+## Prerequisites

-The [OPENROWSET](develop-openrowset.md) function enables you to read the content of Delta Lake files by providing the URL to your root folder.
+> [!IMPORTANT]
+> Data sources can be created only in custom databases (not in the master database or the databases replicated from Apache Spark pools).
+
+To use the samples in this article, you'll need to complete the following steps:
+1. **Create a database** with a data source that references the [NYC Yellow Taxi](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/) storage account.
+1. Initialize the objects by executing the [setup script](https://github.com/Azure-Samples/Synapse/blob/master/SQL/Samples/LdwSample/SampleDB.sql) on the database you created in step 1. This setup script creates the data sources, database scoped credentials, and external file formats that are used in these samples.
+
+If you created your database, and switched the context to your database (using the `USE database_name` statement or the database dropdown in your query editor), you can create
+your external data source containing the root URI to your data set and use it to query Delta Lake files. For example:
+
+```sql
+CREATE EXTERNAL DATA SOURCE DeltaLakeStorage
+WITH ( LOCATION = 'https://<yourstorageaccount>.blob.core.windows.net/delta-lake/' );
+GO
+
+SELECT TOP 10 *
+FROM OPENROWSET(
+    BULK 'covid',
+    DATA_SOURCE = 'DeltaLakeStorage',
+    FORMAT = 'delta'
+) as rows;
+```
+
+If a data source is protected with a SAS key or a custom identity, you can configure a [data source with database scoped credential](develop-storage-files-storage-access-control.md?tabs=shared-access-signature#database-scoped-credential).
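As a rough sketch of that setup (not part of this change), the following T-SQL stores a SAS token as a database scoped credential and attaches it to an external data source; the storage account, folder, credential name, and SAS value are all placeholders:

```sql
-- A master key must exist before a database scoped credential can be created (skip if you already have one).
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
GO

-- Store the SAS token (placeholder value) as a database scoped credential.
CREATE DATABASE SCOPED CREDENTIAL DeltaLakeSasCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = 'sv=2022-11-02&ss=b&srt=co&sp=rl&sig=<signature>';
GO

-- Point the external data source at the Delta Lake root folder and attach the credential.
CREATE EXTERNAL DATA SOURCE DeltaLakeStorageSas
WITH (
    LOCATION = 'https://<yourstorageaccount>.blob.core.windows.net/delta-lake/',
    CREDENTIAL = DeltaLakeSasCredential
);
GO
```

Queries would then reference `DATA_SOURCE = 'DeltaLakeStorageSas'` in the same way the quickstart above references `DeltaLakeStorage`.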
+
+You can create an external data source with the location that points to the root folder of the storage. Once you've created the external data source, use the data source and the relative path to the file in the `OPENROWSET` function. This way you don't need to use the full absolute URI to your files. You can also then define custom credentials to access the storage location.
+
+## Read Delta Lake folder
+
+> [!IMPORTANT]
+> Use the setup script in the [prerequisites](#prerequisites) to set up the sample data sources and tables.

-### Read Delta Lake folder
+The [OPENROWSET](develop-openrowset.md) function enables you to read the content of Delta Lake files by providing the URL to your root folder.

The easiest way to see the content of your `DELTA` file is to provide the file URL to the [OPENROWSET](develop-openrowset.md) function and specify `DELTA` format. If the file is publicly available or if your Microsoft Entra identity can access this file, you should be able to see the content of the file using a query like the one shown in the following example:
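The query that this paragraph refers to falls outside this hunk; as a minimal illustration (the storage account and folder below are placeholders, not the article's sample), a direct-URL read looks roughly like this:

```sql
-- Read a Delta Lake folder directly by its URL (placeholder account and folder).
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<yourstorageaccount>.blob.core.windows.net/delta-lake/covid/',
    FORMAT = 'delta'
) AS rows;
```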
@@ -50,7 +81,7 @@ The URI in the `OPENROWSET` function must reference the root Delta Lake folder t
> [!div class="mx-imgBorder"]
>

-If you don't have this subfolder, you aren't using Delta Lake format. You can convert your plain Parquet files in the folder to Delta Lake format using the following Apache Spark Python script:
+If you don't have this subfolder, you aren't using the Delta Lake format. You can convert plain Parquet files in the folder to Delta Lake format using an Apache Spark Python script like the following example:

```python
%%pyspark
@@ -67,40 +98,12 @@ To improve the performance of your queries, consider specifying explicit types i
Make sure you can access your file. If your file is protected with SAS key or custom Azure identity, you'll need to set up a [server level credential for sql login](develop-storage-files-storage-access-control.md?tabs=shared-access-signature#server-level-credential).
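For context (again outside this diff), a server-level credential for a SAS-protected path might look like the following sketch; the URL and token are placeholders, and the credential name has to match the storage URL prefix used in queries:

```sql
-- Server-level credential for SQL logins; the name must match the storage URL prefix.
-- The URL and SAS token below are placeholders.
CREATE CREDENTIAL [https://<yourstorageaccount>.blob.core.windows.net/delta-lake]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = 'sv=2022-11-02&ss=b&srt=co&sp=rl&sig=<signature>';
GO
```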

> [!IMPORTANT]
-> Ensure you are using a UTF-8 database collation (for example `Latin1_General_100_BIN2_UTF8`) because string values in Delta Lake files are encoded using UTF-8 encoding.
+> Ensure you're using a UTF-8 database collation (for example `Latin1_General_100_BIN2_UTF8`) because string values in Delta Lake files are encoded using UTF-8 encoding.
> A mismatch between the text encoding in the Delta Lake file and the collation may cause unexpected conversion errors.
> You can easily change the default collation of the current database using the following T-SQL statement:
> `ALTER DATABASE CURRENT COLLATE Latin1_General_100_BIN2_UTF8;`
> For more information on collations, see [Collation types supported for Synapse SQL](reference-collation-types.md).
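To make that note actionable, here's a small check-and-switch sketch (not part of the diff); `DATABASEPROPERTYEX` is standard T-SQL:

```sql
-- Check the collation of the current database.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS current_collation;

-- Switch to a UTF-8 collation so string values read from Delta Lake files convert correctly.
ALTER DATABASE CURRENT COLLATE Latin1_General_100_BIN2_UTF8;
```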

-### Data source usage
-
-The previous examples used the full path to the file. As an alternative, you can create an external data source with the location that points to the root folder of the storage. Once you've created the external data source, use the data source and the relative path to the file in the `OPENROWSET` function. This way you don't need to use the full absolute URI to your files. You can also then define custom credentials to access the storage location.
-
-> [!IMPORTANT]
-> Data sources can be created only in custom databases (not in the master database or the databases replicated from Apache Spark pools).
-
-To use the samples below, you'll need to complete the following step:
-1. **Create a database** with a datasource that references [NYC Yellow Taxi](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/) storage account.
-1. Initialize the objects by executing [setup script](https://github.com/Azure-Samples/Synapse/blob/master/SQL/Samples/LdwSample/SampleDB.sql) on the database you created in step 1. This setup script will create the data sources, database scoped credentials, and external file formats that are used in these samples.
-
-If you created your database, and switched the context to your database (using `USE database_name` statement or dropdown for selecting database in some query editor), you can create
-your external data source containing the root URI to your data set and use it to query Delta Lake files:
-
-```sql
-CREATE EXTERNAL DATA SOURCE DeltaLakeStorage
-WITH ( LOCATION = 'https://sqlondemandstorage.blob.core.windows.net/delta-lake/' );
-GO
-
-SELECT TOP 10 *
-FROM OPENROWSET(
-    BULK 'covid',
-    DATA_SOURCE = 'DeltaLakeStorage',
-    FORMAT = 'delta'
-) as rows;
-```
-
-If a data source is protected with SAS key or custom identity, you can configure [data source with database scoped credential](develop-storage-files-storage-access-control.md?tabs=shared-access-signature#database-scoped-credential).

### Explicitly specify schema

@@ -122,7 +125,7 @@ FROM OPENROWSET(
By explicitly specifying the result set schema, you can minimize type sizes and use more precise types, such as VARCHAR(6) for string columns instead of a pessimistic VARCHAR(1000). Minimizing type sizes can significantly improve the performance of your queries.
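As an illustration of that point (the folder, column names, and sizes below are hypothetical, not taken from the setup script), an explicit schema with right-sized, UTF-8-collated string columns looks roughly like this:

```sql
-- Explicit WITH schema: right-sized types and a UTF-8 collation for string columns.
-- Folder, column names, and sizes are illustrative placeholders.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'yellow',
    DATA_SOURCE = 'DeltaLakeStorage',
    FORMAT = 'delta'
)
WITH (
    vendorID VARCHAR(6) COLLATE Latin1_General_100_BIN2_UTF8,
    passengerCount INT,
    fareAmount FLOAT
) AS rows;
```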

> [!IMPORTANT]
-> Make sure that you are explicitly specifying a UTF-8 collation (for example `Latin1_General_100_BIN2_UTF8`) for all string columns in `WITH` clause or set a UTF-8 collation at the database level.
+> Make sure that you're explicitly specifying a UTF-8 collation (for example `Latin1_General_100_BIN2_UTF8`) for all string columns in the `WITH` clause, or set a UTF-8 collation at the database level.
> Mismatch between text encoding in the file and string column collation might cause unexpected conversion errors.
> You can easily change the default collation of the current database using the following T-SQL statement:
> `alter database current collate Latin1_General_100_BIN2_UTF8`