articles/synapse-analytics/spark/apache-spark-pool-configurations.md
Lines changed: 4 additions & 4 deletions
@@ -1,14 +1,14 @@
 ---
 title: Apache Spark pool concepts
 description: Introduction to Apache Spark pool sizes and configurations in Azure Synapse Analytics.
-ms.topic: conceptual
+ms.topic: concept-article
 ms.service: azure-synapse-analytics
 ms.subservice: spark
 ms.custom: references_regions
 author: guyhay
 ms.author: guyhay
 ms.reviewer: whhender
-ms.date: 09/07/2022
+ms.date: 12/06/2024
 ---
 
 # Apache Spark pool configurations in Azure Synapse Analytics
@@ -53,7 +53,7 @@ Autoscale for Apache Spark pools allows automatic scale up and down of compute r
 Apache Spark pools now support elastic pool storage. Elastic pool storage allows the Spark engine to monitor worker node temporary storage and attach extra disks if needed. Apache Spark pools utilize temporary disk storage while the pool is instantiated. Spark jobs write shuffle map outputs, shuffle data and spilled data to local VM disks. Examples of operations that could utilize local disk are sort, cache, and persist. When temporary VM disk space runs out, Spark jobs could fail due to “Out of Disk Space” error (java.io.IOException: No space left on device). With “Out of Disk Space” errors, much of the burden to prevent jobs from failing shifts to the customer to reconfigure the Spark jobs (for example, tweak the number of partitions) or clusters (for example, add more nodes to the cluster). These errors might not be consistent, and the user might end up experimenting heavily by running production jobs. This process can be expensive for the user in multiple dimensions:
 
 * Wasted time. Customers are required to experiment heavily with job configurations via trial and error and are expected to understand Spark’s internal metrics to make the correct decision.
-* Wasted resources. Since production jobs can process varying amount of data, Spark jobs can fail non-deterministically if resources aren't over-provisioned. For instance, consider the problem of data skew, which could result in a few nodes requiring more disk space than others. Currently in Synapse, each node in a cluster gets the same size of disk space and increasing disk space across all nodes isn't an ideal solution and leads to tremendous waste.
+* Wasted resources. Since production jobs can process varying amount of data, Spark jobs can fail nondeterministically if resources aren't over-provisioned. For instance, consider the problem of data skew, which could result in a few nodes requiring more disk space than others. Currently in Synapse, each node in a cluster gets the same size of disk space and increasing disk space across all nodes isn't an ideal solution and leads to tremendous waste.
 * Slowdown in job execution. In the hypothetical scenario where we solve the problem by autoscaling nodes (assuming costs aren't an issue to the end customer), adding a compute node is still expensive (takes a few minutes) as opposed to adding storage (takes a few seconds).
 
 No action is required by you, plus you should see fewer job failures as a result.
@@ -65,7 +65,7 @@ No action is required by you, plus you should see fewer job failures as a result
 
 The automatic pause feature releases resources after a set idle period, reducing the overall cost of an Apache Spark pool. The number of minutes of idle time can be set once this feature is enabled. The automatic pause feature is independent of the autoscale feature. Resources can be paused whether the autoscale is enabled or disabled. This setting can be altered after pool creation although active sessions will need to be restarted.