This document describes how to migrate Apache Storm workloads on HDInsight 3.6 to HDInsight 4.0. HDInsight 4.0 doesn't support the Apache Storm cluster type, so you will need to migrate to another streaming data platform. Two suitable options are Apache Spark Streaming and Spark Structured Streaming. This document describes the differences between these platforms and recommends a workflow for migrating Apache Storm workloads.
## Storm migration paths in HDInsight
If you want to migrate from Apache Storm on HDInsight 3.6, you have multiple options:
* Spark Streaming on HDInsight 4.0
* Spark Structured Streaming on HDInsight 4.0
* Azure Stream Analytics

This document provides a guide for migrating from Apache Storm to Spark Streaming and Spark Structured Streaming.
### Spark Streaming vs Spark Structured Streaming
Spark Structured Streaming is replacing Spark Streaming (DStreams). Structured Streaming will continue to receive enhancements and maintenance, while DStreams will be in maintenance mode only. Structured Streaming does not have as many features as DStreams for the sources and sinks that it supports out of the box, so evaluate your requirements to choose the appropriate Spark stream processing option.
## Streaming (single-event) processing vs micro-batch processing
38
43
@@ -51,7 +56,7 @@ Storm topologies are composed of multiple components that are arranged in a dire
51
56
52
57

53
58
54
-
Storm consists of the following three daemons which keep the Storm cluster functioning.
59
+
Storm consists of the following three daemons, which keep the Storm cluster functioning.

|Daemon |Description |
|---|---|
|Nimbus |Similar to the Hadoop JobTracker, Nimbus distributes code around the cluster, assigns tasks to worker nodes, and monitors for failures. |
|Supervisor |Each worker node runs a Supervisor daemon, which listens for work assigned to that node and starts and stops worker processes as needed. |
|ZooKeeper |Storm relies on Apache ZooKeeper to coordinate communication between Nimbus and the Supervisors. |
## Spark Streaming – DStream
As each batch interval elapses, a new RDD is produced that contains all the data from that interval. The continuous sets of RDDs are collected into a DStream. For example, if the batch interval is one second long, your DStream emits a batch every second containing one RDD that contains all the data ingested during that second. When processing the DStream, the temperature event appears in one of these batches. A Spark Streaming application processes the batches that contain the events and ultimately acts on the data stored in each RDD.
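
The following PySpark sketch illustrates this batching model. It's a minimal illustration rather than code from this article: the socket source on `localhost:9999`, the application name, and the one-second batch interval are assumed values.

```python
# Minimal sketch of the DStream model: each one-second batch interval yields one RDD
# containing all of the events received during that second.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamBatchSketch")
ssc = StreamingContext(sc, batchDuration=1)  # emit one RDD per one-second batch

# Hypothetical source: lines of text arriving on a TCP socket.
lines = ssc.socketTextStream("localhost", 9999)

# Each micro-batch (RDD) is processed as it is emitted; here we just count its events.
lines.count().pprint()

ssc.start()
ssc.awaitTermination()
```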

For details on the different transformations available with Spark Streaming, see [Transformations on DStreams](https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams).

If the built-in operations don't meet the data transformation requirements, you can use UDFs (user-defined functions).
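
As a hedged illustration of these transformations and of a user-defined function, the following sketch applies `map`, `filter`, `reduceByKey`, and `transform` to a hypothetical DStream of `sensorId,temperature` text lines. The source, field layout, and temperature threshold are assumptions, not part of this article.

```python
# Sketch of common DStream transformations plus a user-defined parsing function.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamTransformSketch")
ssc = StreamingContext(sc, 1)

# Hypothetical source of "sensorId,temperature" text lines.
lines = ssc.socketTextStream("localhost", 9999)

def parse_reading(line):
    """User-defined function: turns 'sensor-1,101.4' into ('sensor-1', 101.4)."""
    sensor_id, temp = line.split(",")
    return (sensor_id, float(temp))

readings = lines.map(parse_reading)                # map(func) applying the UDF
hot = readings.filter(lambda kv: kv[1] > 100.0)    # filter(func)
max_per_sensor = hot.reduceByKey(max)              # reduceByKey(func)

# transform(func) exposes each batch's underlying RDD for arbitrary RDD operations.
sorted_hot = hot.transform(lambda rdd: rdd.sortBy(lambda kv: -kv[1]))

max_per_sensor.pprint()
sorted_hot.pprint()

ssc.start()
ssc.awaitTermination()
```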
## Spark Structured Streaming

Spark Structured Streaming represents a stream of data as a table that is unbounded in depth. The table continues to grow as new data arrives. This input table is continuously processed by a long-running query, and the results are sent to an output table.

In Structured Streaming, data arrives at the system and is immediately ingested into an input table. You write queries (using the DataFrame and Dataset APIs) that perform operations against this input table.

The query output yields a *results table*, which contains the results of your query. You can draw data from the results table for an external datastore, such as a relational database.

The timing of when data is processed from the input table is controlled by the trigger interval. By default, the trigger interval is zero, so Structured Streaming tries to process the data as soon as it arrives. In practice, this means that as soon as Structured Streaming is done processing the run of the previous query, it starts another processing run against any newly received data. You can configure the trigger to run at an interval, so that the streaming data is processed in time-based batches.
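
The following PySpark sketch ties these concepts together: the built-in `rate` source stands in for the unbounded input table, a DataFrame aggregation defines the results table, and the trigger switches processing from the default (as soon as data arrives) to ten-second batches. The source, sink, and interval are illustrative assumptions rather than recommendations.

```python
# Sketch of the Structured Streaming model: input table -> query -> results table -> sink.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StructuredStreamingSketch").getOrCreate()

# Input table: an unbounded streaming DataFrame that grows as new rows arrive.
input_table = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# A query over the input table, written with the DataFrame API; it defines the results table.
results_table = (input_table
                 .groupBy((F.col("value") % 10).alias("bucket"))
                 .count())

# Send the results table to a sink. The trigger processes the stream in ten-second batches
# instead of the default behavior of processing data as soon as it arrives.
query = (results_table.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())

query.awaitTermination()
```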

## General migration flow
The recommended migration flow from Storm to Spark assumes the following initial architecture:
* Kafka is used as the streaming data source
* Kafka and Storm are deployed on the same virtual network
* The data processed by Storm is written to a data sink, such as Azure Storage or Azure Data Lake Storage Gen2.

To migrate your application from Storm to one of the Spark streaming APIs, do the following:
134
112
135
-
1. **Deploy a new cluster.** Deploy a new HDInsight 4.0 Spark cluster in the same virtual network, deploy your Spark Streaming or Spark Structured Streaming application on it, and test it thoroughly. A minimal sketch of such an application follows this list.

1. **Stop consuming on the old Storm cluster.** In the existing Storm cluster, stop consuming data from the streaming data source and wait for the data to finish writing to the target sink.

1. **Start consuming on the new Spark cluster.** Start streaming data on the newly deployed HDInsight 4.0 Spark cluster. The application takes over processing by consuming from the latest Kafka offset.

1. **Remove the old cluster.** Once the switchover is complete and working properly, remove the old HDInsight 3.6 Storm cluster as needed.
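
As a rough sketch of what the replacement application in steps 1 and 3 might look like, the following Spark Structured Streaming job reads the existing Kafka topic starting from the latest offset and writes the results to a data sink with a checkpoint location. The broker addresses, topic name, and storage paths are placeholders, and the sketch assumes the Spark Structured Streaming Kafka connector is available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StormToSparkMigrationSketch").getOrCreate()

# Read from the existing Kafka topic. Broker list and topic name are placeholders.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "<broker1>:9092,<broker2>:9092")
          .option("subscribe", "<topic>")
          .option("startingOffsets", "latest")  # take over from the latest offset (step 3)
          .load())

# Kafka delivers binary key/value columns; cast the value to a string for processing.
events = stream.selectExpr("CAST(value AS STRING) AS event")

# Write the processed events to the existing data sink, for example an ADLS Gen2 path.
query = (events.writeStream
         .format("parquet")
         .option("path", "abfss://<container>@<account>.dfs.core.windows.net/events/")
         .option("checkpointLocation", "abfss://<container>@<account>.dfs.core.windows.net/checkpoints/")
         .start())

query.awaitTermination()
```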