articles/hdinsight/spark/apache-spark-streaming-overview.md
8 additions & 7 deletions
@@ -1,14 +1,15 @@
 ---
 title: Spark Streaming in Azure HDInsight
 description: How to use Apache Spark Streaming applications on HDInsight Spark clusters.
-ms.service: hdinsight
 author: hrasheed-msft
 ms.author: hrasheed
 ms.reviewer: jasonh
-ms.custom: hdinsightactive
+ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 03/11/2019
+ms.custom: hdinsightactive
+ms.date: 11/20/2019
 ---
+
 # Overview of Apache Spark Streaming
 
 [Apache Spark](https://spark.apache.org/) Streaming provides data stream processing on HDInsight Spark clusters, with a guarantee that any input event is processed exactly once, even if a node failure occurs. A Spark Stream is a long-running job that receives input data from a wide variety of sources, including Azure Event Hubs, an Azure IoT Hub, [Apache Kafka](https://kafka.apache.org/), [Apache Flume](https://flume.apache.org/), Twitter, [ZeroMQ](http://zeromq.org/), raw TCP sockets, or from monitoring [Apache Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) filesystems. Unlike a solely event-driven process, a Spark Stream batches input data into time windows, such as a 2-second slice, and then transforms each batch of data using map, reduce, join, and extract operations. The Spark Stream then writes the transformed data out to filesystems, databases, dashboards, and the console.
@@ -21,9 +22,9 @@ Spark Streaming applications must wait a fraction of a second to collect each *m
 
 Spark Streaming represents a continuous stream of incoming data using a *discretized stream* called a DStream. A DStream can be created from input sources such as Event Hubs or Kafka, or by applying transformations on another DStream.
 
-A DStream provides a layer of abstraction on top of the raw event data.
+A DStream provides a layer of abstraction on top of the raw event data.
 
-Start with a single event, say a temperature reading from a connected thermostat. When this event arrives at your Spark Streaming application, the event is stored in a reliable way, where it is replicated on multiple nodes. This fault-tolerance ensures that the failure of any single node will not result in the loss of your event. The Spark core uses a data structure that distributes data across multiple nodes in the cluster, where each node generally maintains its own data in-memory for best performance. This data structure is called a *resilient distributed dataset* (RDD).
+Start with a single event, say a temperature reading from a connected thermostat. When this event arrives at your Spark Streaming application, the event is stored in a reliable way, where it's replicated on multiple nodes. This fault-tolerance ensures that the failure of any single node won't result in the loss of your event. The Spark core uses a data structure that distributes data across multiple nodes in the cluster, where each node generally maintains its own data in-memory for best performance. This data structure is called a *resilient distributed dataset* (RDD).
 
 Each RDD represents events collected over a user-defined timeframe called the *batch interval*. As each batch interval elapses, a new RDD is produced that contains all the data from that interval. The continuous set of RDDs are collected into a DStream. For example, if the batch interval is one second long, your DStream emits a batch every second containing one RDD that contains all the data ingested during that second. When processing the DStream, the temperature event appears in one of these batches. A Spark Streaming application processes the batches that contain the events and ultimately acts on the data stored in each RDD.
 
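Reviewer note: the batch-interval behavior described in the hunk above (one batch per interval, each holding every event ingested during that interval) can be illustrated with a small model. This is a minimal sketch in plain Scala collections, not the Spark Streaming API: `TemperatureEvent`, `toBatches`, and `averagePerBatch` are hypothetical names, and each grouped `Seq` only stands in for the RDD that a one-second batch interval would produce.

```scala
// Plain Scala model of micro-batching; this is NOT the Spark Streaming API.
// TemperatureEvent, toBatches, and averagePerBatch are hypothetical names.
final case class TemperatureEvent(timestampMs: Long, celsius: Double)

object MicroBatchSketch {

  /** Groups events into batches of `batchIntervalMs` milliseconds, keyed by
    * batch index. Each batch stands in for the single RDD of one interval.
    * Unlike a real DStream, intervals with no events yield no batch here. */
  def toBatches(
      events: Seq[TemperatureEvent],
      batchIntervalMs: Long
  ): Seq[(Long, Seq[TemperatureEvent])] = {
    require(batchIntervalMs > 0, "batch interval must be positive")
    events
      .groupBy(e => e.timestampMs / batchIntervalMs) // batch index per event
      .toSeq
      .sortBy(_._1) // process batches in time order
  }

  /** Acts on the data in each batch: here, the mean temperature. */
  def averagePerBatch(
      batches: Seq[(Long, Seq[TemperatureEvent])]
  ): Seq[(Long, Double)] =
    batches.map { case (index, batch) =>
      index -> batch.map(_.celsius).sum / batch.size
    }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      TemperatureEvent(100L, 20.0),
      TemperatureEvent(900L, 22.0),
      TemperatureEvent(1500L, 30.0)
    )
    // One-second batch interval, as in the example in the text.
    val batches = toBatches(events, batchIntervalMs = 1000L)
    averagePerBatch(batches).foreach { case (index, avg) =>
      println(s"batch $index: average $avg")
    }
  }
}
```

With the sample events, the first two readings fall in batch 0 and the third in batch 1, so a thermostat reading always appears in exactly one batch. In a real application the grouping is done by the streaming context's batch interval rather than by user code.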
@@ -133,7 +134,7 @@ stream.foreachRDD { rdd =>
 val _sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)