Skip to content

Commit afc8cd6

Browse files
authored
Merge pull request #198967 from TimShererWithAquent/t581147p
Edits to improve SEO and usability
2 parents eaa1155 + 81f649a commit afc8cd6

File tree

2 files changed

+24
-27
lines changed

2 files changed

+24
-27
lines changed
Lines changed: 22 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -1,71 +1,70 @@
11
---
2-
title: What is Apache Spark
2+
title: Apache Spark in Azure Synapse Analytics overview
33
description: This article provides an introduction to Apache Spark in Azure Synapse Analytics and the different scenarios in which you can use Spark.
44
services: synapse-analytics
55
author: juluczni
66
ms.author: juluczni
77
ms.service: synapse-analytics
88
ms.topic: overview
99
ms.subservice: spark
10-
ms.date: 02/15/2022
10+
ms.date: 05/23/2022
1111
ms.reviewer: euang
12+
ms.custom: kr2b-contr-experiment
1213
---
1314

1415
# Apache Spark in Azure Synapse Analytics
1516

16-
Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. Apache Spark in Azure Synapse Analytics is one of Microsoft's implementations of Apache Spark in the cloud. Azure Synapse makes it easy to create and configure a serverless Apache Spark pool in Azure. Spark pools in Azure Synapse are compatible with Azure Storage and Azure Data Lake Generation 2 Storage. So you can use Spark pools to process your data stored in Azure.
17+
Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big data analytic applications. Apache Spark in Azure Synapse Analytics is one of Microsoft's implementations of Apache Spark in the cloud. Azure Synapse makes it easy to create and configure a serverless Apache Spark pool in Azure. Spark pools in Azure Synapse are compatible with Azure Storage and Azure Data Lake Generation 2 Storage. So you can use Spark pools to process your data stored in Azure.
1718

18-
![Spark: a unified framework](./media/apache-spark-overview/spark-overview.png)
19+
![Diagram shows Spark SQL, Spark MLib, and GraphX linked to the Spark core engine, above a YARN layer over storage services.](./media/apache-spark-overview/spark-overview.png)
1920

2021
## What is Apache Spark
2122

2223
Apache Spark provides primitives for in-memory cluster computing. A Spark job can load and cache data into memory and query it repeatedly. In-memory computing is much faster than disk-based applications. Spark also integrates with multiple programming languages to let you manipulate distributed data sets like local collections. There's no need to structure everything as map and reduce operations.
2324

24-
![Traditional MapReduce vs. Spark](./media/apache-spark-overview/map-reduce-vs-spark.png)
25+
![Diagram shows Traditional MapReduce, with disk-based apps and Spark, with cache-based operations.](./media/apache-spark-overview/map-reduce-vs-spark.png)
2526

2627
Spark pools in Azure Synapse offer a fully managed Spark service. The benefits of creating a Spark pool in Azure Synapse Analytics are listed here.
2728

2829
| Feature | Description |
2930
| --- | --- |
30-
| Speed and efficiency |Spark instances start in approximately 2 minutes for fewer than 60 nodes and approximately 5 minutes for more than 60 nodes. The instance shuts down, by default, 5 minutes after the last job executed unless it is kept alive by a notebook connection. |
31+
| Speed and efficiency |Spark instances start in approximately 2 minutes for fewer than 60 nodes and approximately 5 minutes for more than 60 nodes. The instance shuts down, by default, 5 minutes after the last job runs unless it's kept alive by a notebook connection. |
3132
| Ease of creation |You can create a new Spark pool in Azure Synapse in minutes using the Azure portal, Azure PowerShell, or the Synapse Analytics .NET SDK. See [Get started with Spark pools in Azure Synapse Analytics](../quickstart-create-apache-spark-pool-studio.md). |
32-
| Ease of use |Synapse Analytics includes a custom notebook derived from [Nteract](https://nteract.io/). You can use these notebooks for interactive data processing and visualization.|
33+
| Ease of use |Synapse Analytics includes a custom notebook derived from [nteract](https://nteract.io/). You can use these notebooks for interactive data processing and visualization.|
3334
| REST APIs |Spark in Azure Synapse Analytics includes [Apache Livy](https://github.com/cloudera/hue/tree/master/apps/spark/java#welcome-to-livy-the-rest-spark-server), a REST API-based Spark job server to remotely submit and monitor jobs. |
34-
| Support for Azure Data Lake Storage Generation 2| Spark pools in Azure Synapse can use Azure Data Lake Storage Generation 2 as well as BLOB storage. For more information on Data Lake Storage, see [Overview of Azure Data Lake Storage](../../data-lake-store/data-lake-store-overview.md). |
35+
| Support for Azure Data Lake Storage Generation 2| Spark pools in Azure Synapse can use Azure Data Lake Storage Generation 2 and BLOB storage. For more information on Data Lake Storage, see [Overview of Azure Data Lake Storage](../../data-lake-store/data-lake-store-overview.md). |
3536
| Integration with third-party IDEs | Azure Synapse provides an IDE plugin for [JetBrains' IntelliJ IDEA](https://www.jetbrains.com/idea/) that is useful to create and submit applications to a Spark pool. |
36-
| Pre-loaded Anaconda libraries |Spark pools in Azure Synapse come with Anaconda libraries pre-installed. [Anaconda](https://docs.continuum.io/anaconda/) provides close to 200 libraries for machine learning, data analysis, visualization, etc. |
37+
| Preloaded Anaconda libraries |Spark pools in Azure Synapse come with Anaconda libraries preinstalled. [Anaconda](https://docs.continuum.io/anaconda/) provides close to 200 libraries for machine learning, data analysis, visualization, and other technologies. |
3738
| Scalability | Apache Spark in Azure Synapse pools can have Auto-Scale enabled, so that pools scale by adding or removing nodes as needed. Also, Spark pools can be shut down with no loss of data since all the data is stored in Azure Storage or Data Lake Storage. |
3839

39-
Spark pools in Azure Synapse include the following components that are available on the pools by default.
40+
Spark pools in Azure Synapse include the following components that are available on the pools by default:
4041

4142
- [Spark Core](https://spark.apache.org/docs/2.4.5/). Includes Spark Core, Spark SQL, GraphX, and MLlib.
4243
- [Anaconda](https://docs.continuum.io/anaconda/)
4344
- [Apache Livy](https://github.com/cloudera/hue/tree/master/apps/spark/java#welcome-to-livy-the-rest-spark-server)
44-
- [Nteract notebook](https://nteract.io/)
45+
- [nteract notebook](https://nteract.io/)
4546

4647
## Spark pool architecture
4748

48-
It is easy to understand the components of Spark by understanding how Spark runs on Azure Synapse Analytics.
49+
Spark applications run as independent sets of processes on a pool, coordinated by the `SparkContext` object in your main program, called the *driver program*.
4950

50-
Spark applications run as independent sets of processes on a pool, coordinated by the SparkContext object in your main program (called the driver program).
51+
The `SparkContext` can connect to the cluster manager, which allocates resources across applications. The cluster manager is [Apache Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html). Once connected, Spark acquires executors on nodes in the pool, which are processes that run computations and store data for your application. Next, it sends your application code, defined by JAR or Python files passed to `SparkContext`, to the executors. Finally, `SparkContext` sends tasks to the executors to run.
5152

52-
The SparkContext can connect to the cluster manager, which allocates resources across applications. The cluster manager is [Apache Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html). Once connected, Spark acquires executors on nodes in the pool, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
53+
The `SparkContext` runs the user's main function and executes the various parallel operations on the nodes. Then, the `SparkContext` collects the results of the operations. The nodes read and write data from and to the file system. The nodes also cache transformed data in-memory as Resilient Distributed Datasets (RDDs).
5354

54-
The SparkContext runs the user's main function and executes the various parallel operations on the nodes. Then, the SparkContext collects the results of the operations. The nodes read and write data from and to the file system. The nodes also cache transformed data in-memory as Resilient Distributed Datasets (RDDs).
55-
56-
The SparkContext connects to the Spark pool and is responsible for converting an application to a directed acyclic graph (DAG). The graph consists of individual tasks that get executed within an executor process on the nodes. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.
55+
The `SparkContext` connects to the Spark pool and is responsible for converting an application to a directed acyclic graph (DAG). The graph consists of individual tasks that run within an executor process on the nodes. Each application gets its own executor processes, which stay up during the whole application and run tasks in multiple threads.
5756

5857
## Apache Spark in Azure Synapse Analytics use cases
5958

6059
Spark pools in Azure Synapse Analytics enable the following key scenarios:
6160

62-
### Data Engineering/Data Preparation
61+
- Data Engineering/Data Preparation
6362

64-
Apache Spark includes many language features to support preparation and processing of large volumes of data so that it can be made more valuable and then consumed by other services within Azure Synapse Analytics. This is enabled through multiple languages (C#, Scala, PySpark, Spark SQL) and supplied libraries for processing and connectivity.
63+
Apache Spark includes language features to support preparation and processing of large volumes of data so that it can be made more valuable and then consumed by other services within Azure Synapse Analytics. This approach is enabled through multiple languages, including C#, Scala, PySpark, and Spark SQL, and supplied libraries for processing and connectivity.
6564

66-
### Machine Learning
65+
- Machine Learning
6766

68-
Apache Spark comes with [MLlib](https://spark.apache.org/mllib/), a machine learning library built on top of Spark that you can use from a Spark pool in Azure Synapse Analytics. Spark pools in Azure Synapse Analytics also include Anaconda, a Python distribution with a variety of packages for data science including machine learning. When combined with built-in support for notebooks, you have an environment for creating machine learning applications.
67+
Apache Spark comes with [MLlib](https://spark.apache.org/mllib/), a machine learning library built on top of Spark that you can use from a Spark pool in Azure Synapse Analytics. Spark pools in Azure Synapse Analytics also include Anaconda, a Python distribution with various packages for data science including machine learning. When combined with built-in support for notebooks, you have an environment for creating machine learning applications.
6968

7069
## Where do I start
7170

@@ -77,10 +76,10 @@ Use the following articles to learn more about Apache Spark in Azure Synapse Ana
7776
- [Apache Spark official documentation](https://spark.apache.org/docs/2.4.5/)
7877

7978
> [!NOTE]
80-
> Some of the official Apache Spark documentation relies on using the spark console, this is not available on Azure Synapse Spark, use the notebook or IntelliJ experiences instead
79+
> Some of the official Apache Spark documentation relies on using the Spark console, which is not available on Azure Synapse Spark. Use the notebook or IntelliJ experiences instead.
8180
8281
## Next steps
8382

84-
In this overview, you get a basic understanding of Apache Spark in Azure Synapse Analytics. Advance to the next article to learn how to create a Spark pool in Azure Synapse Analytics:
83+
This overview provided a basic understanding of Apache Spark in Azure Synapse Analytics. Advance to the next article to learn how to create a Spark pool in Azure Synapse Analytics:
8584

8685
- [Create a Spark pool in Azure Synapse](../quickstart-create-apache-spark-pool-portal.md)

articles/synapse-analytics/toc.yml

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -660,10 +660,8 @@ items:
660660
href: data-explorer/functions-library/toc.yml
661661
- name: Apache Spark
662662
items:
663-
- name: Overview
664-
items:
665-
- name: Apache Spark overview
666-
href: ./spark/apache-spark-overview.md
663+
- name: Apache Spark overview
664+
href: ./spark/apache-spark-overview.md
667665
- name: Quickstarts
668666
items:
669667
- name: Create a notebook

0 commit comments

Comments
 (0)