
Commit 5fee166

Merge pull request #96716 from dagiro/freshness62
freshness62
2 parents ee4b0c0 + ce2e617

File tree

1 file changed: +8 -7 lines changed


articles/hdinsight/spark/apache-spark-streaming-overview.md

Lines changed: 8 additions & 7 deletions
@@ -1,14 +1,15 @@
 ---
 title: Spark Streaming in Azure HDInsight
 description: How to use Apache Spark Streaming applications on HDInsight Spark clusters.
-ms.service: hdinsight
 author: hrasheed-msft
 ms.author: hrasheed
 ms.reviewer: jasonh
-ms.custom: hdinsightactive
+ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 03/11/2019
+ms.custom: hdinsightactive
+ms.date: 11/20/2019
 ---
+
 # Overview of Apache Spark Streaming
 
 [Apache Spark](https://spark.apache.org/) Streaming provides data stream processing on HDInsight Spark clusters, with a guarantee that any input event is processed exactly once, even if a node failure occurs. A Spark Stream is a long-running job that receives input data from a wide variety of sources, including Azure Event Hubs, an Azure IoT Hub, [Apache Kafka](https://kafka.apache.org/), [Apache Flume](https://flume.apache.org/), Twitter, [ZeroMQ](http://zeromq.org/), raw TCP sockets, or from monitoring [Apache Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) filesystems. Unlike a solely event-driven process, a Spark Stream batches input data into time windows, such as a 2-second slice, and then transforms each batch of data using map, reduce, join, and extract operations. The Spark Stream then writes the transformed data out to filesystems, databases, dashboards, and the console.
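
To make the batching model described in that paragraph concrete, here is a minimal sketch (not part of the diff above) of a Spark Streaming job with a 2-second batch interval that reads a raw TCP socket and applies map and reduce transformations before printing each batch to the console. The host name, port, and application name are placeholder assumptions.

// A minimal Spark Streaming job, shown only to illustrate the batching model.
// "localhost", port 9999, and the app name are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SocketWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))      // 2-second batch interval

    val lines = ssc.socketTextStream("localhost", 9999)   // raw TCP socket source
    val counts = lines.flatMap(_.split(" "))              // map ...
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                 // ... and reduce each batch
    counts.print()                                        // write each batch to the console

    ssc.start()
    ssc.awaitTermination()
  }
}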
@@ -21,9 +22,9 @@ Spark Streaming applications must wait a fraction of a second to collect each *m
 
 Spark Streaming represents a continuous stream of incoming data using a *discretized stream* called a DStream. A DStream can be created from input sources such as Event Hubs or Kafka, or by applying transformations on another DStream.
 
-A DStream provides a layer of abstraction on top of the raw event data.
+A DStream provides a layer of abstraction on top of the raw event data.
 
-Start with a single event, say a temperature reading from a connected thermostat. When this event arrives at your Spark Streaming application, the event is stored in a reliable way, where it is replicated on multiple nodes. This fault-tolerance ensures that the failure of any single node will not result in the loss of your event. The Spark core uses a data structure that distributes data across multiple nodes in the cluster, where each node generally maintains its own data in-memory for best performance. This data structure is called a *resilient distributed dataset* (RDD).
+Start with a single event, say a temperature reading from a connected thermostat. When this event arrives at your Spark Streaming application, the event is stored in a reliable way, where it's replicated on multiple nodes. This fault-tolerance ensures that the failure of any single node won't result in the loss of your event. The Spark core uses a data structure that distributes data across multiple nodes in the cluster, where each node generally maintains its own data in-memory for best performance. This data structure is called a *resilient distributed dataset* (RDD).
 
 Each RDD represents events collected over a user-defined timeframe called the *batch interval*. As each batch interval elapses, a new RDD is produced that contains all the data from that interval. The continuous set of RDDs are collected into a DStream. For example, if the batch interval is one second long, your DStream emits a batch every second containing one RDD that contains all the data ingested during that second. When processing the DStream, the temperature event appears in one of these batches. A Spark Streaming application processes the batches that contain the events and ultimately acts on the data stored in each RDD.
 
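As a hedged illustration of the batch-interval behavior described above (one RDD emitted per interval), the following sketch logs the size of each batch. The DStream name temperatureStream and its element type are assumptions; the (RDD, Time) overload of foreachRDD is used so the batch time is visible.

// Illustrative only: one RDD is produced per batch interval, and foreachRDD
// exposes that RDD together with the batch time. "temperatureStream" is an
// assumed DStream[Double] of thermostat readings.
import org.apache.spark.streaming.dstream.DStream

def logBatchSizes(temperatureStream: DStream[Double]): Unit = {
  temperatureStream.foreachRDD { (rdd, time) =>
    // rdd holds exactly the readings ingested during this batch interval
    println(s"Batch at $time contains ${rdd.count()} temperature readings")
  }
}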
@@ -133,7 +134,7 @@ stream.foreachRDD { rdd =>
 val _sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
 _sqlContext.createDataFrame(rdd).toDF("value", "time")
   .registerTempTable("demo_numbers")
-}
+}
 
 // Start the stream processing
 ssc.start()
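
The fragment above shows only the body of a foreachRDD block. The following spark-shell style sketch is a hedged reconstruction of how it typically fits into a complete job; the socket source, the field parsing, and the Reading case class are assumptions added for illustration, not part of the article.

// Sketch of the surrounding job (assumptions noted above). Each batch is
// registered as the temp table "demo_numbers" so it can be queried with Spark SQL.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Reading(value: Int, time: Long)

val conf = new SparkConf().setAppName("DStreamTempTable")
val ssc = new StreamingContext(conf, Seconds(1))

// Assumed source: lines of "value,time" arriving on a TCP socket
val stream = ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(fields => Reading(fields(0).toInt, fields(1).toLong))

stream.foreachRDD { rdd =>
  // Re-acquire the SQLContext for each batch, then expose the batch as a temp table
  val _sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
  _sqlContext.createDataFrame(rdd).toDF("value", "time")
    .registerTempTable("demo_numbers")
}

// Start the stream processing
ssc.start()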
@@ -208,7 +209,7 @@ stream.window(org.apache.spark.streaming.Minutes(1)).foreachRDD { rdd =>
 val _sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
 _sqlContext.createDataFrame(rdd).toDF("value", "time")
   .registerTempTable("demo_numbers")
-}
+}
 
 // Start the stream processing
 ssc.start()
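
The fragment above applies the same temp-table pattern to a one-minute window. As a hedged sketch of what window() does, collecting the RDDs from the last minute into one larger RDD, the snippet below counts events over that window; the stream DStream and the 30-second slide interval are assumptions.

// Illustrative only: window() regroups the last minute of batches into a single
// RDD. "stream" is an assumed existing DStream; the slide interval must be a
// multiple of the batch interval.
import org.apache.spark.streaming.{Minutes, Seconds}

val windowed = stream.window(Minutes(1), Seconds(30))   // 1-minute window, sliding every 30 seconds
windowed.foreachRDD { rdd =>
  println(s"Events received in the last minute: ${rdd.count()}")
}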
