---
title: 'Quickstart: Get started analyzing with Spark'
description: In this tutorial, you'll learn to analyze some sample data with Apache Spark in Azure Synapse Analytics.
author: whhender
ms.author: whhender
ms.reviewer: whhender
ms.service: azure-synapse-analytics
ms.subservice: spark
ms.topic: quickstart
ms.date: 11/15/2024
---

# Quickstart: Analyze with Apache Spark

In this tutorial, you'll learn the basic steps to load and analyze data with Apache Spark for Azure Synapse.

## Prerequisites

Make sure you have [placed the sample data in the primary storage account](get-started-create-workspace.md#place-sample-data-into-the-primary-storage-account).

## Create a serverless Apache Spark pool

1. In Synapse Studio, on the left-side pane, select **Manage** > **Apache Spark pools**.
1. Select **New**.
1. For **Apache Spark pool name**, enter **Spark1**.
1. For **Node size**, enter **Small**.
1. For **Number of nodes**, set the minimum to 3 and the maximum to 3.
1. Select **Review + create** > **Create**. Your Apache Spark pool will be ready in a few seconds.

## Understand serverless Apache Spark pools

A serverless Spark pool is a way of indicating how a user wants to work with Spark. When you start using a pool, a Spark session is created if needed. The pool controls how many Spark resources that session uses and how long the session lasts before it automatically pauses. You pay for the Spark resources used during the session, not for the pool itself. In this way, a Spark pool lets you use Apache Spark without managing clusters, much as a serverless SQL pool does.

Data is available via the dataframe named **df**. Load it into a Spark database named **nyctaxi**:

```python
spark.sql("CREATE DATABASE IF NOT EXISTS nyctaxi")
df.write.mode("overwrite").saveAsTable("nyctaxi.trip")
```

## Analyze the NYC Taxi data using Spark and notebooks

1. Create a new code cell and enter the following code.

1. In the cell results, select **Chart** to see the data visualized.

## Next step

> [!div class="nextstepaction"]
> [Analyze data with dedicated SQL pool](get-started-analyze-sql-pool.md)