
Commit 49e77c1

Merge pull request #294395 from whhender/set-up-script-updates
Updating for script problems
2 parents e6c2672 + f1ce1a9 commit 49e77c1

1 file changed: +40 −37 lines

articles/synapse-analytics/sql/query-delta-lake-format.md

Lines changed: 40 additions & 37 deletions
@@ -5,7 +5,7 @@ services: synapse analytics
 ms.service: azure-synapse-analytics
 ms.topic: how-to
 ms.subservice: sql
-ms.date: 12/17/2024
+ms.date: 02/10/2025
 author: jovanpop-msft
 ms.author: jovanpop
 ms.reviewer: whhender, wiassaf
@@ -18,7 +18,7 @@ Delta Lake is an open-source storage layer that brings ACID (atomicity, consiste
 You can learn more from the [how to query delta lake tables video](https://www.youtube.com/watch?v=LSIVX0XxVfc).

 > [!IMPORTANT]
-> The serverless SQL pools can query [Delta Lake version 1.0](https://github.com/delta-io/delta/releases/tag/v1.0.1). The changes that are introduced since the [Delta Lake 1.2](https://github.com/delta-io/delta/releases/tag/v1.2.0) version like renaming columns are not supported in serverless. If you are using the higher versions of Delta with delete vectors, v2 checkpoints, and others, you should consider using other query engine like [Microsoft Fabric SQL endpoint for Lakehouses](/fabric/data-engineering/lakehouse-sql-analytics-endpoint).
+> The serverless SQL pools can query [Delta Lake version 1.0](https://github.com/delta-io/delta/releases/tag/v1.0.1). Changes that have been introduced since the [Delta Lake 1.2](https://github.com/delta-io/delta/releases/tag/v1.2.0) version (like renaming columns) aren't supported in serverless. If you're using higher versions of Delta with delete vectors, v2 checkpoints, and others, consider using another query engine like [Microsoft Fabric SQL endpoint for Lakehouses](/fabric/data-engineering/lakehouse-sql-analytics-endpoint).

 The serverless SQL pool in Synapse workspace enables you to read the data stored in Delta Lake format, and serve it to reporting tools.
 A serverless SQL pool can read Delta Lake files that are created using Apache Spark, Azure Databricks, or any other producer of the Delta Lake format.
@@ -28,18 +28,49 @@ Apache Spark pools in Azure Synapse enable data engineers to modify Delta Lake f
 > [!IMPORTANT]
 > Querying Delta Lake format using the serverless SQL pool is **Generally available** functionality. However, querying Spark Delta tables is still in public preview and not production ready. There are known issues that might happen if you query Delta tables created using the Spark pools. See the known issues in [Serverless SQL pool self-help](resources-self-help-sql-on-demand.md#delta-lake).

-## Quickstart example
+## Prerequisites

-The [OPENROWSET](develop-openrowset.md) function enables you to read the content of Delta Lake files by providing the URL to your root folder.
+> [!IMPORTANT]
+> Data sources can be created only in custom databases (not in the master database or in databases replicated from Apache Spark pools).
+
+To use the samples in this article, you'll need to complete the following steps:
+1. **Create a database** with a data source that references the [NYC Yellow Taxi](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/) storage account.
+1. Initialize the objects by executing the [setup script](https://github.com/Azure-Samples/Synapse/blob/master/SQL/Samples/LdwSample/SampleDB.sql) on the database you created in step 1. This setup script creates the data sources, database scoped credentials, and external file formats that are used in these samples.
+
+If you've created your database and switched the context to it (using the `USE database_name` statement or the database dropdown in your query editor), you can create
+an external data source containing the root URI of your data set and use it to query Delta Lake files. For example:
+
+```sql
+CREATE EXTERNAL DATA SOURCE DeltaLakeStorage
+WITH ( LOCATION = 'https://<yourstorageaccount>.blob.core.windows.net/delta-lake/' );
+GO
+
+SELECT TOP 10 *
+FROM OPENROWSET(
+    BULK 'covid',
+    DATA_SOURCE = 'DeltaLakeStorage',
+    FORMAT = 'delta'
+) as rows;
+```
+
+If a data source is protected with a SAS key or a custom identity, you can configure a [data source with database scoped credential](develop-storage-files-storage-access-control.md?tabs=shared-access-signature#database-scoped-credential).
+
+You can create an external data source with a location that points to the root folder of the storage. Once you've created the external data source, use the data source and the relative path to the file in the `OPENROWSET` function. This way you don't need to use the full absolute URI to your files, and you can also define custom credentials to access the storage location.
+
+## Read Delta Lake folder
+
+> [!IMPORTANT]
+> Use the setup script in the [prerequisites](#prerequisites) to set up the sample data sources and tables.

-### Read Delta Lake folder
+The [OPENROWSET](develop-openrowset.md) function enables you to read the content of Delta Lake files by providing the URL to your root folder.

 The easiest way to see the content of your `DELTA` file is to provide the file URL to the [OPENROWSET](develop-openrowset.md) function and specify the `DELTA` format. If the file is publicly available, or if your Microsoft Entra identity can access this file, you should be able to see the content of the file using a query like the one shown in the following example:

 ```sql
 SELECT TOP 10 *
 FROM OPENROWSET(
-    BULK 'https://sqlondemandstorage.blob.core.windows.net/delta-lake/covid/',
+    BULK '/covid/',
+    DATA_SOURCE = 'DeltaLakeStorage',
     FORMAT = 'delta') as rows;
 ```

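The hunk above links to database scoped credentials for protected storage without showing the statements. A minimal sketch of that setup, not part of this commit; the credential name, data source name, and SAS token below are placeholders:

```sql
-- A database master key must exist before a credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
GO

-- Hypothetical credential that stores a SAS token for the storage account.
CREATE DATABASE SCOPED CREDENTIAL SasCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token, without the leading ?>';
GO

-- Reference the credential from the external data source so that
-- OPENROWSET queries using this data source can authenticate.
CREATE EXTERNAL DATA SOURCE DeltaLakeStorageProtected
WITH (
    LOCATION = 'https://<yourstorageaccount>.blob.core.windows.net/delta-lake/',
    CREDENTIAL = SasCredential
);
```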
@@ -50,7 +81,7 @@ The URI in the `OPENROWSET` function must reference the root Delta Lake folder t
 > [!div class="mx-imgBorder"]
 > ![ECDC COVID-19 Delta Lake folder](./media/shared/covid-delta-lake-studio.png)

-If you don't have this subfolder, you aren't using Delta Lake format. You can convert your plain Parquet files in the folder to Delta Lake format using the following Apache Spark Python script:
+If you don't have this subfolder, you aren't using the Delta Lake format. You can convert your plain Parquet files in the folder to Delta Lake format with a script like the following Apache Spark Python example:

 ```python
 %%pyspark
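The hunk cuts the conversion script off after its first lines. A minimal sketch of such a conversion, assuming the Delta Lake library is available on the Spark pool; the storage path is a placeholder:

```python
%%pyspark
from delta.tables import DeltaTable

# Placeholder path: the folder that contains your plain Parquet files.
parquet_path = "abfss://<container>@<account>.dfs.core.windows.net/parquet-folder"

# Convert the folder in place; this creates the _delta_log subfolder
# that marks the folder as a Delta Lake table.
DeltaTable.convertToDelta(spark, f"parquet.`{parquet_path}`")
```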
@@ -67,40 +98,12 @@ To improve the performance of your queries, consider specifying explicit types i
 Make sure you can access your file. If your file is protected with SAS key or custom Azure identity, you'll need to set up a [server level credential for sql login](develop-storage-files-storage-access-control.md?tabs=shared-access-signature#server-level-credential).

 > [!IMPORTANT]
-> Ensure you are using a UTF-8 database collation (for example `Latin1_General_100_BIN2_UTF8`) because string values in Delta Lake files are encoded using UTF-8 encoding.
+> Ensure you're using a UTF-8 database collation (for example `Latin1_General_100_BIN2_UTF8`) because string values in Delta Lake files are encoded using UTF-8 encoding.
 > A mismatch between the text encoding in the Delta Lake file and the collation may cause unexpected conversion errors.
 > You can easily change the default collation of the current database using the following T-SQL statement:
 > `ALTER DATABASE CURRENT COLLATE Latin1_General_100_BIN2_UTF8;`
 > For more information on collations, see [Collation types supported for Synapse SQL](reference-collation-types.md).

-### Data source usage
-
-The previous examples used the full path to the file. As an alternative, you can create an external data source with the location that points to the root folder of the storage. Once you've created the external data source, use the data source and the relative path to the file in the `OPENROWSET` function. This way you don't need to use the full absolute URI to your files. You can also then define custom credentials to access the storage location.
-
-> [!IMPORTANT]
-> Data sources can be created only in custom databases (not in the master database or the databases replicated from Apache Spark pools).
-
-To use the samples below, you'll need to complete the following step:
-1. **Create a database** with a datasource that references [NYC Yellow Taxi](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/) storage account.
-1. Initialize the objects by executing [setup script](https://github.com/Azure-Samples/Synapse/blob/master/SQL/Samples/LdwSample/SampleDB.sql) on the database you created in step 1. This setup script will create the data sources, database scoped credentials, and external file formats that are used in these samples.
-
-If you created your database, and switched the context to your database (using `USE database_name` statement or dropdown for selecting database in some query editor), you can create
-your external data source containing the root URI to your data set and use it to query Delta Lake files:
-
-```sql
-CREATE EXTERNAL DATA SOURCE DeltaLakeStorage
-WITH ( LOCATION = 'https://sqlondemandstorage.blob.core.windows.net/delta-lake/' );
-GO
-
-SELECT TOP 10 *
-FROM OPENROWSET(
-    BULK 'covid',
-    DATA_SOURCE = 'DeltaLakeStorage',
-    FORMAT = 'delta'
-) as rows;
-```
-
-If a data source is protected with SAS key or custom identity, you can configure [data source with database scoped credential](develop-storage-files-storage-access-control.md?tabs=shared-access-signature#database-scoped-credential).

 ### Explicitly specify schema

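The context line at the top of this hunk mentions server-level credentials for SQL logins. In serverless SQL pools, a server-level credential is named after the storage URL it covers; a sketch with placeholder account, container, and SAS token, not taken from this commit:

```sql
-- Server-level credential: the name must match the storage URL that
-- queries will access, so any SQL login can use it transparently.
CREATE CREDENTIAL [https://<yourstorageaccount>.blob.core.windows.net/delta-lake]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token, without the leading ?>';
```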
@@ -122,7 +125,7 @@ FROM OPENROWSET(
 With the explicit specification of the result set schema, you can minimize the type sizes and use the more precise type VARCHAR(6) for string columns instead of the pessimistic VARCHAR(1000). Minimization of types might significantly improve performance of your queries.

 > [!IMPORTANT]
-> Make sure that you are explicitly specifying a UTF-8 collation (for example `Latin1_General_100_BIN2_UTF8`) for all string columns in `WITH` clause or set a UTF-8 collation at the database level.
+> Make sure that you're explicitly specifying a UTF-8 collation (for example `Latin1_General_100_BIN2_UTF8`) for all string columns in the `WITH` clause, or set a UTF-8 collation at the database level.
 > Mismatch between text encoding in the file and string column collation might cause unexpected conversion errors.
 > You can easily change default collation of the current database using the following T-SQL statement:
 > `alter database current collate Latin1_General_100_BIN2_UTF8`
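The `OPENROWSET` query this hunk belongs to is only partially visible in the hunk header. A minimal sketch of an explicit schema with a `WITH` clause; the column names and types are illustrative, borrowed from the COVID sample used earlier in the diff:

```sql
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'covid',
    DATA_SOURCE = 'DeltaLakeStorage',
    FORMAT = 'delta'
)
-- Explicit schema: precise type sizes and a UTF-8 collation on strings.
WITH (
    date_rep DATE,
    cases INT,
    geo_id VARCHAR(6) COLLATE Latin1_General_100_BIN2_UTF8
) AS rows;
```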
