## Comparison between Apache Storm and Spark Streaming, Spark Structured Streaming
Spark Structured Streaming is replacing Spark Streaming (DStreams).
Storm provides a model that processes each individual event: all incoming records are processed as soon as they arrive. Spark Streaming applications instead wait a fraction of a second to collect each micro-batch of events before sending that batch on for processing, so Spark Streaming latency is typically a few seconds at most. The benefits of the micro-batch approach are more efficient data processing and simpler aggregate calculations.
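As a rough sketch in plain Python (neither framework's API; `process_per_event` and `process_micro_batch` are hypothetical names for illustration), the two models differ in when the handler runs:

```python
import time

def process_per_event(events, handle):
    """Storm-style: handle each record as soon as it arrives."""
    for event in events:
        handle(event)  # latency is roughly the per-event handling time

def process_micro_batch(events, handle_batch, batch_interval=1.0):
    """Spark Streaming-style: collect events for a fixed interval,
    then hand the whole batch on for processing."""
    batch, deadline = [], time.monotonic() + batch_interval
    for event in events:
        batch.append(event)
        if time.monotonic() >= deadline:
            handle_batch(batch)  # amortized, batch-at-a-time work
            batch, deadline = [], time.monotonic() + batch_interval
    if batch:
        handle_batch(batch)      # flush the final partial batch
```

The micro-batch version adds up to one `batch_interval` of latency per record, which is the trade-off the paragraph above describes.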

> [!div class="mx-imgBorder"]
> 
## Storm architecture and components
Storm topologies are composed of multiple components that are arranged in a directed acyclic graph (DAG).

|Component|Description|
|---|---|
|Spout|Brings data into a topology. They emit one or more streams into the topology.|
|Bolt|Consumes streams emitted from spouts or other bolts. Bolts might optionally emit streams into the topology. Bolts are also responsible for writing data to external services or storage, such as HDFS, Kafka, or HBase.|

> [!div class="mx-imgBorder"]
> 
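As a framework-free sketch of how these components compose (real Storm topologies are written against the Storm Java API; the spout and bolt names below are purely illustrative), a spout can be modeled as a generator that emits a stream, and a bolt as a transformation over its input stream:

```python
def sentence_spout():
    """Spout: brings data into the topology by emitting a stream of tuples."""
    for sentence in ["storm processes events", "spouts feed bolts"]:
        yield sentence

def split_bolt(stream):
    """Bolt: consumes a stream and emits a new stream (one word per tuple)."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Terminal bolt: aggregates results; a real bolt would then write them
    to external storage such as HDFS, Kafka, or HBase."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the components into a small directed graph: spout -> bolt -> bolt.
word_counts = count_bolt(split_bolt(sentence_spout()))
```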
Storm consists of the following three daemons, which keep the Storm cluster functioning.

|Component|Description|
|---|---|
|Nimbus|Similar to Hadoop JobTracker. Nimbus distributes code around the cluster, assigns tasks to machines, and monitors for failures.|
|Zookeeper|Used for cluster coordination.|
|Supervisor|Listens for work assigned to its machine and starts and stops worker processes based on directives from Nimbus. Each worker process executes a subset of a topology. The user's application logic (spouts and bolts) runs here.|

> [!div class="mx-imgBorder"]
> 
## Spark Streaming architecture and components
The following steps summarize how components work together in Spark Streaming (DStreams) and Spark Structured Streaming:
* When Spark Streaming is launched, the driver launches tasks on the executors.
* The executor receives a stream from a streaming data source.
* When the executor receives data streams, it splits the stream into blocks and keeps them in memory.
* Blocks of data are replicated to other executors.
* The processed data is then stored in the target data store.
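The block-building and replication steps above can be sketched in plain Python (the real work happens inside Spark's receivers and block manager; `block_size`, the function names, and the two-executor setup are illustrative assumptions — Spark actually cuts blocks by a time interval, not a record count):

```python
def split_into_blocks(stream, block_size=3):
    """The executor splits the incoming stream into in-memory blocks."""
    block = []
    for record in stream:
        block.append(record)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block

def replicate(blocks, executors):
    """Each block is kept on one executor and copied to another
    for fault tolerance."""
    for i, block in enumerate(blocks):
        primary = executors[i % len(executors)]
        replica = executors[(i + 1) % len(executors)]
        primary.append(block)
        replica.append(block)

executors = [[], []]  # stand-ins for the in-memory stores of two executors
replicate(split_into_blocks(range(7)), executors)
```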

> [!div class="mx-imgBorder"]
> 
## Spark Streaming (DStream) workflow
As each batch interval elapses, a new RDD is produced that contains all the data from that interval. The continuous sets of RDDs are collected into a DStream. For example, if the batch interval is one second long, your DStream emits a batch every second containing one RDD that contains all the data ingested during that second. When processing the DStream, the temperature event appears in one of these batches. A Spark Streaming application processes the batches that contain the events and ultimately acts on the data stored in each RDD.
For details on the different transformations available with Spark Streaming, see [Transformations on DStreams](https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams).
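The batching described above can be sketched in plain Python (an illustration of the batching model only, not the DStream API; the timestamps and temperature readings are made up):

```python
def to_batches(events, batch_interval=1.0):
    """Group (timestamp, value) events into per-interval batches.
    Each batch plays the role of the single RDD emitted per interval;
    the ordered sequence of batches plays the role of the DStream."""
    batches = {}
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches.setdefault(batch_index, []).append(value)
    return [batches[i] for i in sorted(batches)]

# With a one-second batch interval, events arriving at 0.2s and 0.9s
# land in the first batch, and the 1.5s event lands in the second.
events = [(0.2, "temp=21"), (0.9, "temp=22"), (1.5, "temp=23")]
dstream = to_batches(events)  # [['temp=21', 'temp=22'], ['temp=23']]
```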
The query output yields a *results table*, which contains the results of your query.
The timing of when data is processed from the input table is controlled by the trigger interval. By default, the trigger interval is zero, so Structured Streaming tries to process the data as soon as it arrives: as soon as the previous query run finishes, Structured Streaming starts another run against any newly received data. You can also configure the trigger to fire at a fixed interval, so that the streaming data is processed in time-based batches.
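Conceptually, each trigger appends newly received rows to the unbounded input table and reruns the query to produce a fresh results table. A minimal pure-Python sketch of that model (not the Structured Streaming API; all names are illustrative):

```python
input_table = []  # the unbounded input table: each arriving event appends a row

def run_query(table):
    """The 'query': count rows per key, yielding the results table."""
    results = {}
    for key in table:
        results[key] = results.get(key, 0) + 1
    return results

def trigger(new_rows):
    """One trigger: ingest newly received data, then rerun the query.
    With the default trigger interval of zero, this fires as soon as the
    previous run finishes and new data is available."""
    input_table.extend(new_rows)
    return run_query(input_table)

trigger(["a", "b"])       # results table: {'a': 1, 'b': 1}
results = trigger(["a"])  # results table: {'a': 2, 'b': 1}
```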

> [!div class="mx-imgBorder"]
> 

> [!div class="mx-imgBorder"]
> 
## General migration flow
The recommended migration flow from Storm to Spark assumes the following initial architecture:
* Kafka and Storm are deployed on the same virtual network.
* The data processed by Storm is written to a data sink, such as Azure Storage or Azure Data Lake Storage Gen2.

> [!div class="mx-imgBorder"]
> 
To migrate your application from Storm to one of the Spark streaming APIs, do the following:
1. **Deploy a new cluster.** Deploy a new HDInsight 4.0 Spark cluster in the same virtual network, deploy your Spark Streaming or Spark Structured Streaming application on it, and test it thoroughly.

    > [!div class="mx-imgBorder"]
    > 
1. **Stop consuming on the old Storm cluster.** On the existing Storm cluster, stop consuming data from the streaming data source and wait for the data to finish writing to the target sink.

    > [!div class="mx-imgBorder"]
    > 
1. **Start consuming on the new Spark cluster.** Start streaming data on the newly deployed HDInsight 4.0 Spark cluster. Processing takes over by consuming from the latest Kafka offset.

    > [!div class="mx-imgBorder"]
    > 
1. **Remove the old cluster as needed.** Once the switch is complete and working properly, remove the old HDInsight 3.6 Storm cluster.

    > [!div class="mx-imgBorder"]
    > 
## Next steps
For more information about Storm, Spark Streaming, and Spark Structured Streaming, see the following documents: