
Commit 06b2060

Merge pull request #90998 from dagiro/freshness15
freshness15
2 parents 9c0fdf5 + 35021a9

File tree

2 files changed: +22 -22 lines changed


articles/hdinsight/hdinsight-apache-kafka-spark-structured-streaming.md

Lines changed: 22 additions & 22 deletions
@@ -2,13 +2,13 @@
 title: 'Tutorial Apache Spark Structured Streaming with Apache Kafka - Azure HDInsight'
 description: Learn how to use Apache Spark streaming to get data into or out of Apache Kafka. In this tutorial, you stream data using a Jupyter notebook from Spark on HDInsight.
 author: hrasheed-msft
+ms.author: hrasheed
 ms.reviewer: jasonh
-
 ms.service: hdinsight
 ms.custom: hdinsightactive,seodec18
 ms.topic: tutorial
-ms.date: 05/22/2019
-ms.author: hrasheed
+ms.date: 10/08/2019
+
 #Customer intent: As a developer, I want to learn how to use Spark Structured Streaming with Kafka on HDInsight.
 ---

@@ -38,8 +38,8 @@ When you are done with the steps in this document, remember to delete the cluste
 
 > [!IMPORTANT]
 > The steps in this document require an Azure resource group that contains both a Spark on HDInsight and a Kafka on HDInsight cluster. These clusters are both located within an Azure Virtual Network, which allows the Spark cluster to directly communicate with the Kafka cluster.
->
-> For your convenience, this document links to a template that can create all the required Azure resources.
+>
+> For your convenience, this document links to a template that can create all the required Azure resources.
 >
 > For more information on using HDInsight in a virtual network, see the [Plan a virtual network for HDInsight](hdinsight-plan-virtual-network-deployment.md) document.
 
@@ -91,7 +91,7 @@ In both snippets, data is read from Kafka and written to file. The differences b
 | `write` | `writeStream` |
 | `save` | `start` |
 
-The streaming operation also uses `awaitTermination(30000)`, which stops the stream after 30,000 ms.
+The streaming operation also uses `awaitTermination(30000)`, which stops the stream after 30,000 ms.
 
 To use Structured Streaming with Kafka, your project must have a dependency on the `org.apache.spark : spark-sql-kafka-0-10_2.11` package. The version of this package should match the version of Spark on HDInsight. For Spark 2.2.0 (available in HDInsight 3.6), you can find the dependency information for different project types at [https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-sql-kafka-0-10_2.11%7C2.2.0%7Cjar](https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-sql-kafka-0-10_2.11%7C2.2.0%7Cjar).
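As a reference for that dependency, an sbt declaration along the following lines should resolve the package named above. This is a sketch, not part of this commit; the version and Scala suffix are assumptions matching Spark 2.2.0 on HDInsight 3.6, so align them with your cluster's Spark version.

```scala
// build.sbt (sketch): Kafka source for Spark Structured Streaming.
// Versions below are assumptions for Spark 2.2.0 / Scala 2.11; match your HDInsight cluster.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark SQL is provided by the cluster at runtime.
  "org.apache.spark" %% "spark-sql"            % "2.2.0" % "provided",
  // Resolves to the spark-sql-kafka-0-10_2.11:2.2.0 artifact named above.
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"
)
```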

@@ -109,7 +109,7 @@ For the Jupyter Notebook used with this tutorial, the following cell loads this
 
 ## Create the clusters
 
-Apache Kafka on HDInsight does not provide access to the Kafka brokers over the public internet. Anything that uses Kafka must be in the same Azure virtual network. In this tutorial, both the Kafka and Spark clusters are located in the same Azure virtual network.
+Apache Kafka on HDInsight does not provide access to the Kafka brokers over the public internet. Anything that uses Kafka must be in the same Azure virtual network. In this tutorial, both the Kafka and Spark clusters are located in the same Azure virtual network.
 
 The following diagram shows how communication flows between Spark and Kafka:
 
@@ -148,12 +148,12 @@ To create an Azure Virtual Network, and then create the Kafka and Spark clusters
 | Cluster Login Password | The admin user password for the clusters. |
 | SSH User Name | The SSH user to create for the clusters. |
 | SSH Password | The password for the SSH user. |
-
+
 ![Screenshot of the customized template](./media/hdinsight-apache-kafka-spark-structured-streaming/spark-kafka-template.png)
 
-3. Read the **Terms and Conditions**, and then select **I agree to the terms and conditions stated above**
+3. Read the **Terms and Conditions**, and then select **I agree to the terms and conditions stated above**.
 
-4. Finally, check **Pin to dashboard** and then select **Purchase**.
+4. Select **Purchase**.
 
 > [!NOTE]
 > It can take up to 20 minutes to create the clusters.
@@ -181,11 +181,11 @@ This example demonstrates how to use Spark Structured Streaming with Kafka on HD
 
 3. Select **New > Spark** to create a notebook.
 
-4. Load packages used by the Notebook by entering the following information in a Notebook cell. Run the command by using **CTRL + ENTER**.
+4. Spark streaming uses microbatching, which means data comes in batches and executors run on the batches of data. If the executor idle timeout is less than the time it takes to process a batch, executors are constantly added and removed. If the executor idle timeout is greater than the batch duration, the executor never gets removed. Hence **we recommend that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.**
 
-Spark streaming has microbatching, which means data comes as batches and executers run on the batches of data. If the executor has idle timeout less than the time it takes to process the batch then the executors would be constantly added and removed. If the executors idle timeout is greater than the batch duration, the executor never gets removed. Hence **we recommend that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.**
+   Load packages used by the Notebook by entering the following information in a Notebook cell. Run the command by using **CTRL + ENTER**.
 
-```
+```configuration
 %%configure -f
 {
     "conf": {
@@ -213,10 +213,10 @@ Spark streaming has microbatching, which means data comes as batches and execute
 // Load the data from the New York City Taxi data REST API for 2016 Green Taxi Trip Data
 val url="https://data.cityofnewyork.us/resource/pqfs-mqru.json"
 val result = scala.io.Source.fromURL(url).mkString
-
+
 // Create a dataframe from the JSON data
 val taxiDF = spark.read.json(Seq(result).toDS)
-
+
 // Display the dataframe containing trip data
 taxiDF.show()
 ```
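Returning to the `%%configure` cell that the earlier hunk cuts off: as a rough sketch only, a cell that loads the Kafka package and disables dynamic allocation as recommended above could look like the following. The package coordinate, the fixed executor count, and the other values are assumptions for illustration, not content taken from this commit.

```configuration
%%configure -f
{
    "conf": {
        "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0",
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": "2"
    }
}
```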
@@ -227,7 +227,7 @@ Spark streaming has microbatching, which means data comes as batches and execute
 // The Kafka broker hosts and topic used to write to Kafka
 val kafkaBrokers="YOUR_KAFKA_BROKER_HOSTS"
 val kafkaTopic="tripdata"
-
+
 println("Finished setting Kafka broker and topic configuration.")
 ```
 
@@ -247,7 +247,7 @@ Spark streaming has microbatching, which means data comes as batches and execute
 import org.apache.spark.sql._
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.functions._
-
+
 // Define a schema for the data
 val schema = (new StructType).add("dropoff_latitude", StringType).add("dropoff_longitude", StringType).add("extra", StringType).add("fare_amount", StringType).add("improvement_surcharge", StringType).add("lpep_dropoff_datetime", StringType).add("lpep_pickup_datetime", StringType).add("mta_tax", StringType).add("passenger_count", StringType).add("payment_type", StringType).add("pickup_latitude", StringType).add("pickup_longitude", StringType).add("ratecodeid", StringType).add("store_and_fwd_flag", StringType).add("tip_amount", StringType).add("tolls_amount", StringType).add("total_amount", StringType).add("trip_distance", StringType).add("trip_type", StringType).add("vendorid", StringType)
 // Reproduced here for readability
@@ -272,7 +272,7 @@ Spark streaming has microbatching, which means data comes as batches and execute
 // .add("trip_distance", StringType)
 // .add("trip_type", StringType)
 // .add("vendorid", StringType)
-
+
 println("Schema declared")
 ```
 
@@ -281,10 +281,10 @@ Spark streaming has microbatching, which means data comes as batches and execute
 ```scala
 // Read a batch from Kafka
 val kafkaDF = spark.read.format("kafka").option("kafka.bootstrap.servers", kafkaBrokers).option("subscribe", kafkaTopic).option("startingOffsets", "earliest").load()
-
+
 // Select data and write to file
 val query = kafkaDF.select(from_json(col("value").cast("string"), schema) as "trip").write.format("parquet").option("path","/example/batchtripdata").option("checkpointLocation", "/batchcheckpoint").save()
-
+
 println("Wrote data to file")
 ```
 
@@ -300,7 +300,7 @@ Spark streaming has microbatching, which means data comes as batches and execute
 ```scala
 // Stream from Kafka
 val kafkaStreamDF = spark.readStream.format("kafka").option("kafka.bootstrap.servers", kafkaBrokers).option("subscribe", kafkaTopic).option("startingOffsets", "earliest").load()
-
+
 // Select data from the stream and write to file
 kafkaStreamDF.select(from_json(col("value").cast("string"), schema) as "trip").writeStream.format("parquet").option("path","/example/streamingtripdata").option("checkpointLocation", "/streamcheckpoint").start.awaitTermination(30000)
 println("Wrote data to file")
@@ -325,7 +325,7 @@ To remove the resource group using the Azure portal:
 
 > [!WARNING]
 > HDInsight cluster billing starts once a cluster is created and stops when the cluster is deleted. Billing is pro-rated per minute, so you should always delete your cluster when it is no longer in use.
->
+>
 > Deleting a Kafka on HDInsight cluster deletes any data stored in Kafka.
 
 ## Next steps
(Second changed file: binary file, 128 KB; diff preview not shown.)

Comments (0)