
Commit 1c15942

Merge pull request #291858 from whhender/december-synapse-freshness
December synapse freshness part 1
2 parents f7c81de + 4b1e695 commit 1c15942

File tree: 5 files changed (+61 / -42 lines)


articles/synapse-analytics/get-started-analyze-sql-pool.md

Lines changed: 31 additions & 6 deletions
@@ -4,29 +4,38 @@ description: In this tutorial, use the NYC Taxi sample data to explore SQL pool'
 author: whhender
 ms.author: whhender
 ms.reviewer: whhender, wiassaf
-ms.date: 10/16/2023
+ms.date: 12/11/2024
 ms.service: azure-synapse-analytics
 ms.subservice: sql
 ms.topic: tutorial
 ms.custom: engagement-fy23
 ---
 
-# Analyze data with dedicated SQL pools
+# Tutorial: Analyze data with dedicated SQL pools
 
 In this tutorial, use the NYC Taxi data to explore a dedicated SQL pool's capabilities.
 
+> [!div class="checklist"]
+> * [Deploy a dedicated SQL pool]
+> * [Load data into the pool]
+> * [Explore the data you've loaded]
+
+## Prerequisites
+
+* This tutorial assumes you've completed the steps in the rest of the quickstarts. Specifically it uses the 'contosodatalake' resource created in [the Create a Synapse Workspace quickstart.](get-started-create-workspace.md#place-sample-data-into-the-primary-storage-account)
+
 ## Create a dedicated SQL pool
 
 1. In Synapse Studio, on the left-side pane, select **Manage** > **SQL pools** under **Analytics pools**.
 1. Select **New**.
 1. For **Dedicated SQL pool name** select `SQLPOOL1`.
 1. For **Performance level** choose **DW100C**.
-1. Select **Review + create** > **Create**. Your dedicated SQL pool will be ready in a few minutes.
+1. Select **Review + create** > **Create**. Your dedicated SQL pool will be ready in a few minutes.
 
 Your dedicated SQL pool is associated with a SQL database that's also called `SQLPOOL1`.
 
 1. Navigate to **Data** > **Workspace**.
-1. You should see a database named **SQLPOOL1**. If you do not see it, select **Refresh**.
+1. You should see a database named **SQLPOOL1**. If you don't see it, select **Refresh**.
 
 A dedicated SQL pool consumes billable resources as long as it's active. You can pause the pool later to reduce costs.
 
@@ -83,13 +92,20 @@ A dedicated SQL pool consumes billable resources as long as it's active. You can
     ,IDENTITY_INSERT = 'OFF'
    )
    ```
+
+   >[!TIP]
+   >If you get an error that reads `Login failed for user '<token-identified principal>'`, you need to set your Entra Id admin.
+   > 1. In the Azure Portal, search for your synapse workspace.
+   > 1. Under **Settings** select **Microsoft Entra ID**.
+   > 1. Select **Set admin** and set a Microsoft Entra ID admin.
+
 1. Select the **Run** button to execute the script.
 1. This script finishes in less than 60 seconds. It loads 2 million rows of NYC Taxi data into a table called `dbo.NYCTaxiTripSmall`.
 
 ## Explore the NYC Taxi data in the dedicated SQL pool
 
 1. In Synapse Studio, go to the **Data** hub.
-1. Go to **SQLPOOL1** > **Tables**.
+1. Go to **SQLPOOL1** > **Tables**. (If you don't see it in the menu, refresh the page.)
 1. Right-click the **dbo.NYCTaxiTripSmall** table and select **New SQL Script** > **Select TOP 100 Rows**.
 1. Wait while a new SQL script is created and runs.
 1. At the top of the SQL script **Connect to** is automatically set to the SQL pool called **SQLPOOL1**.
 
@@ -110,7 +126,16 @@ A dedicated SQL pool consumes billable resources as long as it's active. You can
 
    This query creates a table `dbo.PassengerCountStats` with aggregate data from the `trip_distance` field, then queries the new table. The data shows how the total trip distances and average trip distance relate to the number of passengers.
 1. In the SQL script result window, change the **View** to **Chart** to see a visualization of the results as a line chart. Change **Category column** to `PassengerCount`.
-
+
+## Clean up
+
+Pause your dedicated SQL Pool to reduce costs.
+
+1. Navigate to **Manage** in your synapse workspace.
+1. Select **SQL pools**.
+1. Hover over SQLPOOL1 and select the **Pause** button.
+1. Confirm to pause.
+
 ## Next step
 
 > [!div class="nextstepaction"]
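For context, the aggregate step this tutorial's changed text describes (building `dbo.PassengerCountStats` from `trip_distance`) might look roughly like the sketch below. The column names `PassengerCount` and `TripDistanceMiles` are assumptions for illustration; they are not taken from this diff.

```sql
-- Hypothetical sketch: aggregate NYC Taxi trip distances by passenger count
-- in a dedicated SQL pool. Column names are assumed, not from the commit.
CREATE TABLE dbo.PassengerCountStats
WITH (DISTRIBUTION = ROUND_ROBIN)  -- CTAS requires a distribution choice
AS
SELECT
    PassengerCount,
    SUM(TripDistanceMiles) AS SumTripDistance,
    AVG(TripDistanceMiles) AS AvgTripDistance
FROM dbo.NYCTaxiTripSmall
GROUP BY PassengerCount;

SELECT * FROM dbo.PassengerCountStats
ORDER BY PassengerCount;
```

In dedicated SQL pools, `CREATE TABLE AS SELECT` must specify a distribution; `ROUND_ROBIN` is a common default for small demo tables.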

articles/synapse-analytics/spark/apache-spark-pool-configurations.md

Lines changed: 4 additions & 4 deletions
@@ -1,14 +1,14 @@
 ---
 title: Apache Spark pool concepts
 description: Introduction to Apache Spark pool sizes and configurations in Azure Synapse Analytics.
-ms.topic: conceptual
+ms.topic: concept-article
 ms.service: azure-synapse-analytics
 ms.subservice: spark
 ms.custom: references_regions
 author: guyhay
 ms.author: guyhay
 ms.reviewer: whhender
-ms.date: 09/07/2022
+ms.date: 12/06/2024
 ---
 
 # Apache Spark pool configurations in Azure Synapse Analytics
 
@@ -53,7 +53,7 @@ Autoscale for Apache Spark pools allows automatic scale up and down of compute r
 Apache Spark pools now support elastic pool storage. Elastic pool storage allows the Spark engine to monitor worker node temporary storage and attach extra disks if needed. Apache Spark pools utilize temporary disk storage while the pool is instantiated. Spark jobs write shuffle map outputs, shuffle data and spilled data to local VM disks. Examples of operations that could utilize local disk are sort, cache, and persist. When temporary VM disk space runs out, Spark jobs could fail due to “Out of Disk Space” error (java.io.IOException: No space left on device). With “Out of Disk Space” errors, much of the burden to prevent jobs from failing shifts to the customer to reconfigure the Spark jobs (for example, tweak the number of partitions) or clusters (for example, add more nodes to the cluster). These errors might not be consistent, and the user might end up experimenting heavily by running production jobs. This process can be expensive for the user in multiple dimensions:
 
 * Wasted time. Customers are required to experiment heavily with job configurations via trial and error and are expected to understand Spark’s internal metrics to make the correct decision.
-* Wasted resources. Since production jobs can process varying amount of data, Spark jobs can fail non-deterministically if resources aren't over-provisioned. For instance, consider the problem of data skew, which could result in a few nodes requiring more disk space than others. Currently in Synapse, each node in a cluster gets the same size of disk space and increasing disk space across all nodes isn't an ideal solution and leads to tremendous waste.
+* Wasted resources. Since production jobs can process varying amount of data, Spark jobs can fail nondeterministically if resources aren't over-provisioned. For instance, consider the problem of data skew, which could result in a few nodes requiring more disk space than others. Currently in Synapse, each node in a cluster gets the same size of disk space and increasing disk space across all nodes isn't an ideal solution and leads to tremendous waste.
 * Slowdown in job execution. In the hypothetical scenario where we solve the problem by autoscaling nodes (assuming costs aren't an issue to the end customer), adding a compute node is still expensive (takes a few minutes) as opposed to adding storage (takes a few seconds).
 
 No action is required by you, plus you should see fewer job failures as a result.
 
@@ -65,7 +65,7 @@ No action is required by you, plus you should see fewer job failures as a result
 
 The automatic pause feature releases resources after a set idle period, reducing the overall cost of an Apache Spark pool. The number of minutes of idle time can be set once this feature is enabled. The automatic pause feature is independent of the autoscale feature. Resources can be paused whether the autoscale is enabled or disabled. This setting can be altered after pool creation although active sessions will need to be restarted.
 
-## Next steps
+## Related content
 
 * [Azure Synapse Analytics](../index.yml)
 * [Apache Spark Documentation](https://spark.apache.org/docs/3.2.1/)

articles/synapse-analytics/sql/create-use-external-tables.md

Lines changed: 14 additions & 15 deletions
@@ -3,9 +3,9 @@ title: Create and use external tables in Synapse SQL pool
 description: In this section, you'll learn how to create and use external tables in Synapse SQL pool.
 author: vvasic-msft
 ms.service: azure-synapse-analytics
-ms.topic: overview
+ms.topic: how-to
 ms.subservice: sql
-ms.date: 02/02/2022
+ms.date: 12/11/2024
 ms.author: vvasic
 ms.reviewer: whhender, wiassaf
 ---
 
@@ -78,14 +78,12 @@ The queries in this article will be executed on your sample database and use the
 
 ## External table on a file
 
-You can create external tables that access data on an Azure storage account that allows access to users with some Microsoft Entra identity or SAS key. You can create external tables the same way you create regular SQL Server external tables. 
+You can create external tables that access data on an Azure storage account that allows access to users with some Microsoft Entra identity or SAS key. You can create external tables the same way you create regular SQL Server external tables.
 
-The following query creates an external table that reads *population.csv* file from SynapseSQL demo Azure storage account that is referenced using `sqlondemanddemo` data source and protected with database scoped credential called `sqlondemand`.
-
-Data source and database scoped credential are created in [setup script](https://github.com/Azure-Samples/Synapse/blob/master/SQL/Samples/LdwSample/SampleDB.sql).
+The following query creates an external table that reads *population.csv* file from SynapseSQL demo Azure storage account that is referenced using `sqlondemanddemo` data source and protected with database scoped credential called `sqlondemand`.
 
 > [!NOTE]
-> Change the first line in the query, i.e., [mydbname], so you're using the database you created. 
+> Change the first line in the query, i.e., [mydbname], so you're using the database you created.
 
 ```sql
 USE [mydbname];
 
@@ -128,15 +126,15 @@ CREATE EXTERNAL TABLE Taxi (
 );
 ```
 
-You can specify the pattern that the files must satisfy in order to be referenced by the external table. The pattern is required only for Parquet and CSV tables. If you are using Delta Lake format, you need to specify just a root folder, and the external table will automatically find the pattern.
+You can specify the pattern that the files must satisfy in order to be referenced by the external table. The pattern is required only for Parquet and CSV tables. If you're using Delta Lake format, you need to specify just a root folder, and the external table will automatically find the pattern.
 
 > [!NOTE]
 > The table is created on partitioned folder structure, but you cannot leverage some partition elimination. If you want to get better performance by skipping the files that do not satisfy some criterion (like specific year or month in this case), use [views on external data](create-use-views.md#partitioned-views).
 
 ## External table on appendable files
 
-The files that are referenced by an external table should not be changed while the query is running. In the long-running query, SQL pool may retry reads, read parts of the files, or even read the file multiple times. Changes of the file content would cause wrong results. Therefore, the SQL pool fails the query if detects that the modification time of any file is changed during the query execution.
-In some scenarios you might want to create a table on the files that are constantly appended. To avoid the query failures due to constantly appended files, you can specify that the external table should ignore potentially inconsistent reads using the `TABLE_OPTIONS` setting.
+The files that are referenced by an external table shouldn't be changed while the query is running. In the long-running query, SQL pool could retry reads, read parts of the files, or even read the file multiple times. Changes of the file content would cause wrong results. Therefore, the SQL pool fails the query if detects that the modification time of any file is changed during the query execution.
+In some scenarios, you might want to create a table on the files that are constantly appended. To avoid the query failures due to constantly appended files, you can specify that the external table should ignore potentially inconsistent reads using the `TABLE_OPTIONS` setting.
 
 
 ```sql
 
@@ -155,7 +153,7 @@ WITH (
 );
 ```
 
-The `ALLOW_INCONSISTENT_READS` read option will disable file modification time check during the query lifecycle and read whatever is available in the files that are referenced by the external table. In appendable files, the existing content is not updated, and only new rows are added. Therefore, the probability of wrong results is minimized compared to the updateable files. This option might enable you to read the frequently appended files without handling the errors.
+The `ALLOW_INCONSISTENT_READS` read option will disable file modification time check during the query lifecycle and read whatever is available in the files that are referenced by the external table. In appendable files, the existing content isn't updated, and only new rows are added. Therefore, the probability of wrong results is minimized compared to the updateable files. This option might enable you to read the frequently appended files without handling the errors.
 
 This option is available only in the external tables created on CSV file format.
 
@@ -183,11 +181,11 @@ CREATE EXTERNAL TABLE Covid (
 );
 ```
 
-External tables cannot be created on a partitioned folder. Review the other known issues on [Synapse serverless SQL pool self-help page](resources-self-help-sql-on-demand.md#delta-lake).
+External tables can't be created on a partitioned folder. Review the other known issues on [Synapse serverless SQL pool self-help page](resources-self-help-sql-on-demand.md#delta-lake).
 
 ### Delta tables on partitioned folders
 
-External tables in serverless SQL pools do not support partitioning on Delta Lake format. Use [Delta partitioned views](create-use-views.md#delta-lake-partitioned-views) instead of tables if you have partitioned Delta Lake data sets.
+External tables in serverless SQL pools don't support partitioning on Delta Lake format. Use [Delta partitioned views](create-use-views.md#delta-lake-partitioned-views) instead of tables if you have partitioned Delta Lake data sets.
 
 > [!IMPORTANT]
 > Do not create external tables on partitioned Delta Lake folders even if you see that they might work in some cases. Using unsupported features like external tables on partitioned delta folders might cause issues or instability of the serverless pool. Azure support will not be able to resolve any issue if it is using tables on partitioned folders. You would be asked to transition to [Delta partitioned views](create-use-views.md#delta-lake-partitioned-views) and rewrite your code to use only the supported feature before proceeding with issue resolution.
 
@@ -216,6 +214,7 @@ ORDER BY
 
 Performance of this query might vary depending on region. Your workspace might not be placed in the same region as the Azure storage accounts used in these samples. For production workloads, place your Synapse workspace and Azure storage in the same region.
 
-## Next steps
+## Next step
 
-For information on how to store results of a query to storage, refer to [Store query results to the storage](../sql/create-external-table-as-select.md) article.
+> [!div class="nextstepaction"]
+> [Store query results to the storage](../sql/create-external-table-as-select.md)
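A minimal sketch of the `TABLE_OPTIONS` setting this diff discusses, assuming the `sqlondemanddemo` data source from the doc and a hypothetical CSV layout and file format name (serverless SQL pool syntax):

```sql
-- Hedged sketch: external table that tolerates files being appended mid-query.
-- The location, column list, and file format name are assumptions for
-- illustration; only the TABLE_OPTIONS value mirrors the documented option.
CREATE EXTERNAL TABLE CsvAppendOnly (
    EventDate date,
    EventCount int
)
WITH (
    LOCATION = 'csv/appendable/*.csv',
    DATA_SOURCE = sqlondemanddemo,
    FILE_FORMAT = QuotedCsvWithHeaderFormat,
    TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
);
```

As the changed text notes, this option applies only to external tables created on the CSV file format.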

articles/synapse-analytics/sql/create-use-views.md

Lines changed: 6 additions & 12 deletions
@@ -3,9 +3,9 @@ title: Create and use views in serverless SQL pool
 description: In this section, you'll learn how to create and use views to wrap serverless SQL pool queries. Views will allow you to reuse those queries. Views are also needed if you want to use tools, such as Power BI, in conjunction with serverless SQL pool.
 author: azaricstefan
 ms.service: azure-synapse-analytics
-ms.topic: overview
+ms.topic: how-to
 ms.subservice: sql
-ms.date: 05/20/2020
+ms.date: 12/06/2024
 ms.author: stefanazaric
 ms.reviewer: whhender, wiassaf
 ---
 
@@ -53,7 +53,7 @@ The view uses an `EXTERNAL DATA SOURCE` with a root URL of your storage, as a `D
 
 ### Delta Lake views
 
-If you are creating the views on top of Delta Lake folder, you need to specify the location to the root folder after the `BULK` option instead of specifying the file path.
+If you're creating the views on top of Delta Lake folder, you need to specify the location to the root folder after the `BULK` option instead of specifying the file path.
 
 > [!div class="mx-imgBorder"]
 >![ECDC COVID-19 Delta Lake folder](./media/shared/covid-delta-lake-studio.png)
 
@@ -100,7 +100,7 @@ When using JOINs in SQL queries, declare the filter predicate as NVARCHAR to red
 
 ### Delta Lake partitioned views
 
-If you are creating the partitioned views on top of Delta Lake storage, you can specify just a root Delta Lake folder and don't need to explicitly expose the partitioning columns using the `FILEPATH` function:
+If you're creating the partitioned views on top of Delta Lake storage, you can specify just a root Delta Lake folder and don't need to explicitly expose the partitioning columns using the `FILEPATH` function:
 
 ```sql
 CREATE OR ALTER VIEW YellowTaxiView
 
@@ -124,7 +124,7 @@ For more information, review [Synapse serverless SQL pool self-help page](resour
 
 ## JSON views
 
-The views are the good choice if you need to do some additional processing on top of the result set that is fetched from the files. One example might be parsing JSON files where we need to apply the JSON functions to extract the values from the JSON documents:
+The views are the good choice if you need to do some extra processing on top of the result set that is fetched from the files. One example might be parsing JSON files where we need to apply the JSON functions to extract the values from the JSON documents:
 
 ```sql
 CREATE OR ALTER VIEW CovidCases
 
@@ -191,12 +191,6 @@ ORDER BY
 
 When you query the view, you may encounter errors or unexpected results. This probably means that the view references columns or objects that were modified or no longer exist. You need to manually adjust the view definition to align with the underlying schema changes.
 
-## Next steps
+## Related content
 
 For information on how to query different file types, refer to the [Query single CSV file](query-single-csv-file.md), [Query Parquet files](query-parquet-files.md), and [Query JSON files](query-json-files.md) articles.
-
-- [What's new in Azure Synapse Analytics?](../whats-new.md).
-- [Best practices for serverless SQL pool in Azure Synapse Analytics](best-practices-serverless-sql-pool.md)
-- [Troubleshoot serverless SQL pool in Azure Synapse Analytics](resources-self-help-sql-on-demand.md)
-- [Troubleshoot a slow query on a dedicated SQL Pool](/troubleshoot/azure/synapse-analytics/dedicated-sql/troubleshoot-dsql-perf-slow-query)
-- [Synapse Studio troubleshooting](../troubleshoot/troubleshoot-synapse-studio.md)
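The Delta Lake partitioned-view pattern this file's hunks refer to (a view over a root Delta Lake folder, with no explicit `FILEPATH` partition columns) can be sketched as follows; the storage URL is a placeholder assumption, not taken from this diff:

```sql
-- Hedged sketch of a Delta Lake partitioned view in a serverless SQL pool.
-- The storage account URL and container path below are placeholders.
CREATE OR ALTER VIEW YellowTaxiView
AS SELECT *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/data/yellow/',
    FORMAT = 'DELTA'  -- point BULK at the root Delta folder, not a file path
) AS nyc;
```

Because the Delta log already describes the partition layout, serverless SQL pool can perform partition elimination when the view is filtered on partitioning columns.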
