Commit 7a7f1f4

Commit message: updates
1 parent 3582270 commit 7a7f1f4

1 file changed: +25 -53 lines changed

articles/hdinsight/storm/migrate-storm-to-spark.md

Lines changed: 25 additions & 53 deletions
@@ -8,14 +8,19 @@ ms.service: hdinsight
ms.topic: conceptual
ms.date: 12/05/2019
---
-
# Migrate Azure HDInsight 3.6 Apache Storm to HDInsight 4.0 Apache Spark

This document describes how to migrate Apache Storm workloads on HDInsight 3.6 to HDInsight 4.0. HDInsight 4.0 doesn't support the Apache Storm cluster type, so you will need to migrate to another streaming data platform. Two suitable options are Apache Spark Streaming and Spark Structured Streaming. This document describes the differences between these platforms and also recommends a workflow for migrating Apache Storm workloads.

## Storm migration paths in HDInsight

- HDInsight 4.0 supports Spark Streaming and Spark Structured Streaming as streaming processing platform. Other options include Azure Stream Analytics and other OSS with user management. This document provides a guide for migrating to Spark Streaming and Spark Structured Streaming.
+ If you want to migrate from Apache Storm on HDInsight 3.6, you have multiple options:
+
+ * Spark Streaming on HDInsight 4.0
+ * Spark Structured Streaming on HDInsight 4.0
+ * Azure Stream Analytics
+
+ This document provides a guide for migrating from Apache Storm to Spark Streaming and Spark Structured Streaming.

![HDInsight Storm migration path](./media/migrate-storm-to-spark/storm-migration-path.png)

@@ -32,7 +37,7 @@ Apache Storm can provide different levels of guaranteed message processing. For

### Spark Streaming vs Spark Structured Streaming

Spark Structured Streaming is replacing Spark Streaming (DStreams). Structured Streaming will continue to receive enhancements and maintenance, while DStreams will be in maintenance mode only. Structured Streaming does not have as many features as DStreams for the sources and sinks that it supports out of the box, so evaluate your requirements to choose the appropriate Spark stream processing option.

## Streaming (Single event) processing vs Micro-Batch processing

@@ -51,7 +56,7 @@ Storm topologies are composed of multiple components that are arranged in a dire

![interaction of storm components](./media/migrate-storm-to-spark/apache-storm-components.png)

- Storm consists of the following three daemons which keep the Storm cluster functioning.
+ Storm consists of the following three daemons, which keep the Storm cluster functioning.

|Daemon |Description |
|---|---|
@@ -71,86 +76,53 @@ Storm consists of the following three daemons which keep the Storm cluster funct

![spark streaming path to output](./media/migrate-storm-to-spark/spark-streaming-to-output.png)

- ## Spark Streaming – Dstream
+ ## Spark Streaming – DStream

As each batch interval elapses, a new RDD is produced that contains all the data from that interval. The continuous sets of RDDs are collected into a DStream. For example, if the batch interval is one second long, your DStream emits a batch every second containing one RDD that contains all the data ingested during that second. When processing the DStream, the temperature event appears in one of these batches. A Spark Streaming application processes the batches that contain the events and ultimately acts on the data stored in each RDD.

![spark streaming processing batches](./media/migrate-storm-to-spark/spark-streaming-batches.png)

- ## Data transformations on Spark Streaming
-
- ![spark streaming data transformations](./media/migrate-storm-to-spark/spark-streaming-transformations.png)
-
- The following functions are available for processing Dstream. See [Overview of Apache Spark Streaming](../spark/apache-spark-streaming-overview.md) for details.
-
- **Transformations on Dstreams**
- * map(func)
- * flatMap(func)
- * filter(func)
- * repartition(numPartitions)
- * union(otherStream)
- * count()
- * reduce(func)
- * countByValue()
- * reduceByKey(func, [numTasks])
- * join(otherStream, [numTasks])
- * cogroup(otherStream, [numTasks])
- * transform(func)
- * updateStateByKey(func)
- * etc
+ For details on the different transformations available with Spark Streaming, see [Transformations on DStreams](https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams).

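To make the DStream model above concrete, here is a minimal PySpark Streaming sketch, assuming a socket source on localhost:9999 that emits lines of the form `deviceId,temperature`. The one-second batch interval means each batch contains one RDD with that second's data; `map` and `filter` are two of the transformations linked above.

```python
# Minimal DStream sketch. Assumptions: a socket source on localhost:9999
# and lines formatted as "deviceId,temperature". Each one-second batch
# interval produces one RDD that flows through the transformations below.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamSketch")
ssc = StreamingContext(sc, 1)  # batch interval of one second

lines = ssc.socketTextStream("localhost", 9999)

# Split each line into fields, then keep only the hot readings.
readings = lines.map(lambda line: line.split(","))
hot = readings.filter(lambda fields: float(fields[1]) > 90.0)
hot.pprint()  # print the first results of each batch to the driver log

ssc.start()
ssc.awaitTermination()
```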
- ![transformations on dstreams in HDInsight](./media/migrate-storm-to-spark/transformations-on-dstreams.png)
-
- **Window Functions**
- * window(windowLength, slideInterval)
- * countByWindow(windowLength, slideInterval)
- * reduceByWindow(func, windowLength, slideInterval)
- * reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
- * countByValueAndWindow(windowLength, slideInterval, [numTasks])
+ ## Spark Structured Streaming

- If the built-in operations don't meet the data transformation requirements, you can use UDF (User-Defined Functions).
+ Spark Structured Streaming represents a stream of data as a table that is unbounded in depth. The table continues to grow as new data arrives. This input table is continuously processed by a long-running query, and the results are sent to an output table.

- ## Spark Structured Streaming
+ In Structured Streaming, data arrives at the system and is immediately ingested into an input table. You write queries (using the DataFrame and Dataset APIs) that perform operations against this input table.

- Spark Structured Streaming represents a stream of data as a table that is unbounded in depth, that is, the table continues to grow as new data arrives. This input table is continuously processed by a long-running query, and the results sent to an output table.
+ The query output yields a *results table*, which contains the results of your query. You can draw data from the results table to send to an external datastore, such as a relational database.

- In Structured Streaming, data arrives at the system and is immediately ingested into an input table. You write queries (using the DataFrame and Dataset APIs) that perform operations against this input table. The query output yields another table, the results table. The results table contains the results of your query, from which you draw data for an external datastore, such a relational database. The timing of when data is processed from the input table is controlled by the trigger interval. By default, the trigger interval is zero, so Structured Streaming tries to process the data as soon as it arrives. In practice, this means that as soon as Structured Streaming is done processing the run of the previous query, it starts another processing run against any newly received data. You can configure the trigger to run at an interval, so that the streaming data is processed in time-based batches.
+ The timing of when data is processed from the input table is controlled by the trigger interval. By default, the trigger interval is zero, so Structured Streaming tries to process the data as soon as it arrives. In practice, this means that as soon as Structured Streaming is done processing the run of the previous query, it starts another processing run against any newly received data. You can configure the trigger to run at an interval, so that the streaming data is processed in time-based batches.

![processing of data in structured streaming](./media/migrate-storm-to-spark/structured-streaming-data-processing.png)

![programming model for structured streaming](./media/migrate-storm-to-spark/structured-streaming-model.png)

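The following is a minimal Structured Streaming sketch of the model described above, assuming a socket source on localhost:9999. The query over the unbounded input table is a running word count, and the trigger is set to a ten-second processing-time interval instead of the default of zero, so the results table is re-emitted in time-based batches.

```python
# Minimal Structured Streaming sketch. Assumptions: a socket source on
# localhost:9999 and a console sink; the word-count query is illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredStreamingSketch").getOrCreate()

# The input table: every new line from the source is appended as a new row.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The query, expressed with the DataFrame API: a running word count.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# The results table, written to the console on a 10-second trigger.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())

query.awaitTermination()
```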
## General migration flow

- Presumed current environment:
+ The recommended migration flow from Storm to Spark assumes the following initial architecture:

- * Kafka is used as the streaming data source,
- * Kafka and Storm are deployed on the same virtual network,
- * The data processed by Storm is written to data sink, such as Azure storage, ADLS, and so on.
+ * Kafka is used as the streaming data source
+ * Kafka and Storm are deployed on the same virtual network
+ * The data processed by Storm is written to a data sink, such as Azure Storage or Azure Data Lake Storage Gen2.

![diagram of presumed current environment](./media/migrate-storm-to-spark/presumed-current-environment.png)

- 1. Deploy new HDInsight 4.0 Spark cluster, deploy code, and test it.
+ To migrate your application from Storm to one of the Spark streaming APIs, do the following:

- Deploy a new HDInsight 4.0 Spark cluster in the same VNet and deploy your Spark Streaming or Spark Structured Streaming application on it and test it thoroughly.
+ 1. **Deploy a new cluster.** Deploy a new HDInsight 4.0 Spark cluster in the same virtual network, deploy your Spark Streaming or Spark Structured Streaming application on it, and test it thoroughly.

![new spark deployment in HDInsight](./media/migrate-storm-to-spark/new-spark-deployment.png)

- 1. Stop consuming on the current Storm cluster.
-
- In the existing Storm, stop consuming data from the streaming data source and wait it for the data to finish writing to the target sink.
+ 1. **Stop consuming on the old Storm cluster.** In the existing Storm cluster, stop consuming data from the streaming data source and wait for the data to finish writing to the target sink.

![stop consuming on current cluster](./media/migrate-storm-to-spark/stop-consuming-current-cluster.png)

- 1. Start consuming on the new Spark cluster.
-
- Start streaming data from a newly deployed HDInsight 4.0 Spark cluster. At this time, the process is taken over by consuming from the latest Kafka offset.
+ 1. **Start consuming on the new Spark cluster.** Start streaming data from the newly deployed HDInsight 4.0 Spark cluster. The new application takes over processing by consuming from the latest Kafka offset (a configuration sketch follows these steps).

![start consuming on new cluster](./media/migrate-storm-to-spark/start-consuming-new-cluster.png)

- 1. Remove the old cluster as needed.
-
- Once the switch is complete and working properly, remove the old HDInsight 3.6 Storm cluster as needed.
+ 1. **Remove the old cluster as needed.** Once the switch is complete and working properly, remove the old HDInsight 3.6 Storm cluster as needed.

![remove old HDInsight clusters as needed](./media/migrate-storm-to-spark/remove-old-clusters1.png)

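As a companion to step 3 above, here is a hedged sketch of what taking over from the latest Kafka offset might look like in Structured Streaming. The broker addresses, topic name, sink path, and checkpoint location are placeholder assumptions, and the Kafka source requires the spark-sql-kafka connector package to be available on the cluster.

```python
# Sketch of the cutover in step 3: start consuming from the latest Kafka
# offset and write to the existing data sink. Broker addresses, topic,
# paths, and the parquet sink are placeholder assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StormToSparkCutover").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafkabroker1:9092,kafkabroker2:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "latest")  # take over from the latest offset
          .load())

# Checkpointing lets the query track its own offsets after the first batch.
query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("parquet")
         .option("path", "abfs://container@account.dfs.core.windows.net/events")
         .option("checkpointLocation", "abfs://container@account.dfs.core.windows.net/checkpoints")
         .start())

query.awaitTermination()
```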