Commit 9503354

committed
Freshness and formatting
1 parent 5a9477d commit 9503354

1 file changed

+21
-24
lines changed

articles/synapse-analytics/sql/develop-tables-external-tables.md

Lines changed: 21 additions & 24 deletions
@@ -4,17 +4,15 @@ description: Reading or writing data files with external tables in Synapse SQL
44
author: jovanpop-msft
55
ms.author: jovanpop
66
ms.service: azure-synapse-analytics
7-
ms.topic: overview
7+
ms.topic: concept-article
88
ms.subservice: sql
9-
ms.date: 02/17/2023
9+
ms.date: 01/08/2025
1010
ms.reviewer: wiassaf
1111
---
1212

1313
# Use external tables with Synapse SQL
1414

15-
An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. You can use external tables to read data from files or write data to files in Azure Storage.
16-
17-
With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or serverless SQL pool.
15+
An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. You can use external tables to read data from files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or serverless SQL pool.
1816

1917
Depending on the type of the external data source, you can use two types of external tables:
2018
- **Hadoop external tables** that you can use to read and export data in various data formats such as CSV, Parquet, and ORC. Hadoop external tables are available in dedicated SQL pools, but they aren't available in serverless SQL pools.
@@ -28,15 +26,15 @@ The key differences between Hadoop and native external tables:
2826
| Serverless SQL pool | Not available | Available |
2927
| Supported formats | Delimited/CSV, Parquet, ORC, Hive RC, and RC | Serverless SQL pool: Delimited/CSV, Parquet, and [Delta Lake](query-delta-lake-format.md)<br/>Dedicated SQL pool: Parquet (preview) |
3028
| [Folder partition elimination](#folder-partition-elimination) | No | Partition elimination is available only in the partitioned tables created on Parquet or CSV formats that are synchronized from Apache Spark pools. You might create external tables on Parquet partitioned folders, but the partitioning columns are inaccessible and ignored, while the partition elimination won't be applied. Don't create [external tables on Delta Lake folders](create-use-external-tables.md#delta-tables-on-partitioned-folders) because they aren't supported. Use [Delta partitioned views](create-use-views.md#delta-lake-partitioned-views) if you need to query partitioned Delta Lake data. |
31-
| [File elimination](#file-elimination) (predicate pushdown) | No | Yes in serverless SQL pool. For the string pushdown, you need to use `Latin1_General_100_BIN2_UTF8` collation on the `VARCHAR` columns to enable pushdown. For more information on collations, refer to [Collation types supported for Synapse SQL](reference-collation-types.md).|
32-
| Custom format for location | No | Yes, using wildcards like `/year=*/month=*/day=*` for Parquet or CSV formats. Custom folder paths are not available in Delta Lake. In the serverless SQL pool, you can also use recursive wildcards `/logs/**` to reference Parquet or CSV files in any sub-folder beneath the referenced folder. |
29+
| [File elimination](#file-elimination) (predicate pushdown) | No | Yes in serverless SQL pool. For the string pushdown, you need to use `Latin1_General_100_BIN2_UTF8` collation on the `VARCHAR` columns to enable pushdown. For more information on collations, see [Collation types supported for Synapse SQL](reference-collation-types.md).|
30+
| Custom format for location | No | Yes, using wildcards like `/year=*/month=*/day=*` for Parquet or CSV formats. Custom folder paths aren't available in Delta Lake. In the serverless SQL pool, you can also use recursive wildcards `/logs/**` to reference Parquet or CSV files in any subfolder beneath the referenced folder. |
3331
| Recursive folder scan | Yes | Yes. In serverless SQL pools, you must specify `/**` at the end of the location path. In dedicated SQL pools, the folders are always scanned recursively. |
3432
| Storage authentication | Storage Access Key(SAK), Microsoft Entra passthrough, Managed identity, custom application Microsoft Entra identity | [Shared Access Signature(SAS)](develop-storage-files-storage-access-control.md?tabs=shared-access-signature), [Microsoft Entra passthrough](develop-storage-files-storage-access-control.md?tabs=user-identity), [Managed identity](develop-storage-files-storage-access-control.md?tabs=managed-identity), [Custom application Microsoft Entra identity](develop-storage-files-storage-access-control.md?tabs=service-principal). |
3533
| Column mapping | Ordinal - the columns in the external table definition are mapped to the columns in the underlying Parquet files by position. | Serverless pool: by name. The columns in the external table definition are mapped to the columns in the underlying Parquet files by column name matching. <br/> Dedicated pool: ordinal matching. The columns in the external table definition are mapped to the columns in the underlying Parquet files by position.|
36-
| CETAS (exporting/transformation) | Yes | CETAS with the native tables as a target works only in the serverless SQL pool. You cannot use the dedicated SQL pools to export data using native tables. |
34+
| CETAS (exporting/transformation) | Yes | CETAS with the native tables as a target works only in the serverless SQL pool. You can't use the dedicated SQL pools to export data using native tables. |
3735

3836
> [!NOTE]
39-
> The native external tables are the recommended solution in the pools where they are generally available. If you need to access external data, always use the native tables in serverless pools. In dedicated pools, you should switch to the native tables for reading Parquet files once they are in GA. Use the Hadoop tables only if you need to access some types that are not supported in native external tables (for example - ORC, RC), or if the native version is not available.
37+
> The native external tables are the recommended solution in the pools where they're generally available. If you need to access external data, always use the native tables in serverless pools. In dedicated pools, you should switch to the native tables for reading Parquet files once they are in GA. Use the Hadoop tables only if you need to access some types that aren't supported in native external tables (for example - ORC, RC), or if the native version isn't available.
4038
4139
## External tables in dedicated SQL pool and serverless SQL pool
4240

@@ -47,28 +45,28 @@ You can use external tables to:
4745
- Import data from Azure Blob Storage and Azure Data Lake Storage and store it in a dedicated SQL pool (only Hadoop tables in dedicated pool).
4846

4947
> [!NOTE]
50-
> When used in conjunction with the [CREATE TABLE AS SELECT](../sql-data-warehouse/sql-data-warehouse-develop-ctas.md?context=/azure/synapse-analytics/context/context) statement, selecting from an external table imports data into a table within the **dedicated** SQL pool.
51-
>
52-
> If performance of Hadoop external tables in the dedicated pools do not satisfy your performance goals, consider loading external data into the Datawarehouse tables using the [COPY statement](/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest&preserve-view=true).
53-
>
48+
> When used with the [CREATE TABLE AS SELECT](../sql-data-warehouse/sql-data-warehouse-develop-ctas.md?context=/azure/synapse-analytics/context/context) statement, selecting from an external table imports data into a table within the **dedicated** SQL pool.
49+
>
50+
> If performance of Hadoop external tables in the dedicated pools does not satisfy your performance goals, consider loading external data into the Data warehouse tables using the [COPY statement](/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest&preserve-view=true).
51+
>
5452
> For a loading tutorial, see [Use PolyBase to load data from Azure Blob Storage](../sql-data-warehouse/load-data-from-azure-blob-storage-using-copy.md?bc=%2fazure%2fsynapse-analytics%2fbreadcrumb%2ftoc.json&toc=%2fazure%2fsynapse-analytics%2ftoc.json).
5553
5654
You can create external tables in Synapse SQL pools via the following steps:
5755

5856
1. [CREATE EXTERNAL DATA SOURCE](/sql/t-sql/statements/create-external-data-source-transact-sql?view=azure-sqldw-latest&preserve-view=true) to reference an external Azure storage and specify the credential that should be used to access the storage.
59-
2. [CREATE EXTERNAL FILE FORMAT](/sql/t-sql/statements/create-external-file-format-transact-sql?view=azure-sqldw-latest&preserve-view=true) to describe format of CSV or Parquet files.
60-
3. [CREATE EXTERNAL TABLE](/sql/t-sql/statements/create-external-table-transact-sql?view=azure-sqldw-latest&preserve-view=true) on top of the files placed on the data source with the same file format.
61-
57+
1. [CREATE EXTERNAL FILE FORMAT](/sql/t-sql/statements/create-external-file-format-transact-sql?view=azure-sqldw-latest&preserve-view=true) to describe format of CSV or Parquet files.
58+
1. [CREATE EXTERNAL TABLE](/sql/t-sql/statements/create-external-table-transact-sql?view=azure-sqldw-latest&preserve-view=true) on top of the files placed on the data source with the same file format.
59+
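As a rough illustration of the three steps, the following sketch creates a native external table over Parquet files in a serverless SQL pool. The names `DemoStorage`, `DemoParquetFormat`, and `dbo.DemoPopulation`, the storage URL, the folder path, and the column list are hypothetical placeholders, not values from this article. The data source has no credential, which assumes the files are publicly readable.

```sql
-- Step 1: reference the storage location (hypothetical URL; no CREDENTIAL
-- clause, which assumes publicly readable files).
CREATE EXTERNAL DATA SOURCE DemoStorage
WITH (
    LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>'
);

-- Step 2: describe the file format (Parquet in this sketch).
CREATE EXTERNAL FILE FORMAT DemoParquetFormat
WITH (
    FORMAT_TYPE = PARQUET
);

-- Step 3: define the table on top of the files. LOCATION is relative
-- to the data source; the columns are placeholders.
CREATE EXTERNAL TABLE dbo.DemoPopulation
(
    country_code VARCHAR(5),
    [year] SMALLINT,
    population BIGINT
)
WITH (
    LOCATION = 'population/*.parquet',
    DATA_SOURCE = DemoStorage,
    FILE_FORMAT = DemoParquetFormat
);
```

Omitting `TYPE = HADOOP` from the data source is what makes this a native rather than a Hadoop external table.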
6260
### Folder partition elimination
6361

64-
The native external tables in Synapse pools are able to ignore the files placed in the folders that are not relevant for the queries. If your files are stored in a folder hierarchy (for example - `/year=2020/month=03/day=16`) and the values for `year`, `month`, and `day` are exposed as the columns, the queries that contain filters like `year=2020` will read the files only from the subfolders placed within the `year=2020` folder. The files and folders placed in other folders (`year=2021` or `year=2022`) will be ignored in this query. This elimination is known as **partition elimination**.
62+
The native external tables in Synapse pools are able to ignore the files placed in the folders that aren't relevant for the queries. If your files are stored in a folder hierarchy (for example - `/year=2020/month=03/day=16`) and the values for `year`, `month`, and `day` are exposed as the columns, the queries that contain filters like `year=2020` will read the files only from the subfolders placed within the `year=2020` folder. The files and folders placed in other folders (`year=2021` or `year=2022`) will be ignored in this query. This elimination is known as **partition elimination**.
6563

66-
The folder partition elimination is available in the native external tables that are synchronized from the Synapse Spark pools. If you have partitioned data set and you would like to leverage the partition elimination with the external tables that you create, use [the partitioned views](create-use-views.md#partitioned-views) instead of the external tables.
64+
The folder partition elimination is available in the native external tables that are synchronized from the Synapse Spark pools. If you have partitioned data set and you would like to use the partition elimination with the external tables that you create, use [the partitioned views](create-use-views.md#partitioned-views) instead of the external tables.
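A partitioned view of the kind mentioned above might look like the following sketch for a serverless SQL pool. It assumes a hypothetical data source named `DemoStorage` and a hypothetical folder layout `taxi/year=*/month=*/`; the view name and column aliases are also illustrative. The `filepath(n)` function returns the part of the path matched by the n-th wildcard, which exposes the folder values as columns.

```sql
-- Expose the folder values matched by the first two wildcards as columns.
CREATE VIEW dbo.TaxiTrips AS
SELECT
    trips.*,
    trips.filepath(1) AS [year],
    trips.filepath(2) AS [month]
FROM OPENROWSET(
        BULK 'taxi/year=*/month=*/*.parquet',
        DATA_SOURCE = 'DemoStorage',
        FORMAT = 'PARQUET'
    ) AS trips;
GO

-- Filters on the partition columns let the query read only the
-- matching folders. filepath() returns strings, so compare to strings.
SELECT COUNT(*) AS trip_count
FROM dbo.TaxiTrips
WHERE [year] = '2020'
  AND [month] = '03';
```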
6765

6866
### File elimination
6967

70-
Some data formats such as Parquet and Delta contain file statistics for each column (for example, min/max values for each column). The queries that filter data will not read the files where the required column values do not exist. The query will first explore min/max values for the columns used in the query predicate to find the files that do not contain the required data. These files will be ignored and eliminated from the query plan.
71-
This technique is also known as filter predicate pushdown and it can improve the performance of your queries. Filter pushdown is available in the serverless SQL pools on Parquet and Delta formats. To leverage filter pushdown for the string types, use the VARCHAR type with the `Latin1_General_100_BIN2_UTF8` collation. For more information on collations, refer to [Collation types supported for Synapse SQL](reference-collation-types.md).
68+
Some data formats such as Parquet and Delta contain file statistics for each column (for example, min/max values for each column). The queries that filter data won't read the files where the required column values don't exist. The query will first explore min/max values for the columns used in the query predicate to find the files that don't contain the required data. These files are ignored and eliminated from the query plan.
69+
This technique is also known as filter predicate pushdown and it can improve the performance of your queries. Filter pushdown is available in the serverless SQL pools on Parquet and Delta formats. To apply filter pushdown for the string types, use the VARCHAR type with the `Latin1_General_100_BIN2_UTF8` collation. For more information on collations, see [Collation types supported for Synapse SQL](reference-collation-types.md).
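To make the collation requirement concrete, here is a minimal sketch of a native external table whose string column can take part in filter pushdown. The names `DemoStorage`, `DemoParquetFormat`, and `dbo.DemoEvents`, the path, and the columns are hypothetical, and the data source and file format are assumed to exist already.

```sql
CREATE EXTERNAL TABLE dbo.DemoEvents
(
    event_id BIGINT,
    -- The BIN2 UTF-8 collation on the VARCHAR column enables string pushdown.
    event_type VARCHAR(50) COLLATE Latin1_General_100_BIN2_UTF8,
    event_date DATE
)
WITH (
    LOCATION = 'events/*.parquet',
    DATA_SOURCE = DemoStorage,
    FILE_FORMAT = DemoParquetFormat
);
GO

-- Files whose min/max statistics rule out these values can be skipped.
SELECT COUNT(*) AS matching_events
FROM dbo.DemoEvents
WHERE event_type = 'purchase'
  AND event_date >= '2020-03-01';
```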
7270

7371
### Security
7472

@@ -77,7 +75,6 @@ External tables access underlying Azure storage using the database scoped creden
7775
- Data source without credential enables external tables to access publicly available files on Azure storage.
7876
- Data source can have a credential that enables external tables to access only the files on Azure storage using a SAS token or the workspace Managed Identity. For examples, see [the Develop storage files storage access control](develop-storage-files-storage-access-control.md#examples) article.
7977

80-
8178
### Example for CREATE EXTERNAL DATA SOURCE
8279

8380
#### [Hadoop](#tab/hadoop)
@@ -121,7 +118,7 @@ CREATE EXTERNAL DATA SOURCE SqlOnDemandDemo WITH (
121118
);
122119
```
123120
> [!NOTE]
124-
> The SQL users needs to have proper permissions on database scoped credentials to access the data source in Azure Synapse Analytics Serverless SQL Pool. [Access external storage using serverless SQL pool in Azure Synapse Analytics](./develop-storage-files-overview.md?tabs=impersonation#permissions).
121+
> The SQL users need to have proper permissions on database scoped credentials to access the data source in Azure Synapse Analytics Serverless SQL Pool. [Access external storage using serverless SQL pool in Azure Synapse Analytics](./develop-storage-files-overview.md?tabs=impersonation#permissions).
125122
The following example creates an external data source for Azure Data Lake Gen2 pointing to the publicly available New York data set:
126123

127124
```sql
@@ -172,7 +169,7 @@ SELECT TOP 1 * FROM census_external_table
172169

173170
## Create and query external tables from a file in Azure Data Lake
174171

175-
Using Data Lake exploration capabilities of Synapse Studio you can now create and query an external table using Synapse SQL pool with a simple right-click on the file. The one-click gesture to create external tables from the ADLS Gen2 storage account is only supported for Parquet files.
172+
Using Data Lake exploration capabilities of Synapse Studio you can now create and query an external table using Synapse SQL pool with a right-click on the file. The one-click gesture to create external tables from the ADLS Gen2 storage account is only supported for Parquet files.
176173

177174
### Prerequisites
178175

@@ -201,6 +198,6 @@ The external table is now created, for future exploration of the content of this
201198
> [!div class="mx-imgBorder"]
202199
>![externaltable5](./media/develop-tables-external-tables/external-table-5.png)
203200
204-
## Next steps
201+
## Related content
205202

206203
See the [CETAS](develop-tables-cetas.md) article for how to save query results to an external table in Azure Storage. Or you can start querying [Apache Spark for Azure Synapse external tables](develop-storage-files-spark-tables.md).
