
Commit 498dbab ("updates")
1 parent 7a7f1f4

File tree: 2 files changed (+36, -25 lines)

articles/hdinsight/storm/migrate-storm-to-spark.md

Lines changed: 36 additions & 25 deletions
@@ -6,7 +6,7 @@ ms.author: hrasheed
 ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 12/05/2019
+ms.date: 01/16/2019
 ---
 # Migrate Azure HDInsight 3.6 Apache Storm to HDInsight 4.0 Apache Spark

@@ -22,7 +22,8 @@ If you want to migrate from Apache Storm on HDInsight 3.6 you have multiple opti
 
 This document provides a guide for migrating from Apache Storm to Spark Streaming and Spark Structured Streaming.
 
-![HDInsight Storm migration path](./media/migrate-storm-to-spark/storm-migration-path.png)
+> [!div class="mx-imgBorder"]
+> ![HDInsight Storm migration path](./media/migrate-storm-to-spark/storm-migration-path.png)
 
 ## Comparison between Apache Storm and Spark Streaming, Spark Structured Streaming

@@ -43,7 +44,8 @@ Spark Structured Streaming is replacing Spark Streaming (DStreams). Structured S
 
 Storm provides a model that processes each single event. This means that all incoming records will be processed as soon as they arrive. Spark Streaming applications must wait a fraction of a second to collect each micro-batch of events before sending that batch on for processing. In contrast, an event-driven application processes each event immediately. Spark Streaming latency is typically under a few seconds. The benefits of the micro-batch approach are more efficient data processing and simpler aggregate calculations.
 
-![streaming and micro-batch processing](./media/migrate-storm-to-spark/streaming-and-micro-batch-processing.png)
+> [!div class="mx-imgBorder"]
+> ![streaming and micro-batch processing](./media/migrate-storm-to-spark/streaming-and-micro-batch-processing.png)
 
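The per-event versus micro-batch contrast described above can be sketched in plain Python (a simplified simulation, not Spark or Storm code; the `interval` parameter and the event tuples are illustrative):

```python
from collections import defaultdict

def micro_batch(events, interval=1.0):
    """Group (timestamp, value) events into fixed-length micro-batches,
    the way Spark Streaming collects events before processing a batch."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)  # bucket by arrival time
    return [batches[k] for k in sorted(batches)]

# Storm-style handling would invoke a callback once per record on arrival.
# Micro-batching instead waits out each interval, which makes aggregate
# calculations such as a per-batch sum straightforward:
events = [(0.1, 5), (0.4, 7), (1.2, 3), (1.9, 8), (2.5, 1)]
print([sum(b) for b in micro_batch(events)])  # [12, 11, 1]
```

The trade-off named in the text is visible here: each value waits up to one `interval` before it is processed, but every aggregate is a simple fold over a finished batch.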
 ## Storm architecture and components
 
@@ -54,7 +56,8 @@ Storm topologies are composed of multiple components that are arranged in a dire
 |Spout|Brings data into a topology. Spouts emit one or more streams into the topology.|
 |Bolt|Consumes streams emitted from spouts or other bolts. Bolts might optionally emit streams into the topology. Bolts are also responsible for writing data to external services or storage, such as HDFS, Kafka, or HBase.|
 
-![interaction of storm components](./media/migrate-storm-to-spark/apache-storm-components.png)
+> [!div class="mx-imgBorder"]
+> ![interaction of storm components](./media/migrate-storm-to-spark/apache-storm-components.png)
 
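The spout-and-bolt dataflow above can be modeled as a chain of Python generators (a toy model of a topology, not the Storm API; the names `word_spout`, `split_bolt`, and `count_bolt` are hypothetical):

```python
def word_spout():
    """Spout: brings data into the topology as a stream of tuples."""
    for line in ["to be or not to be", "that is the question"]:
        yield line

def split_bolt(stream):
    """Bolt: consumes the spout's stream and emits one word per tuple."""
    for line in stream:
        yield from line.split()

def count_bolt(stream):
    """Terminal bolt: aggregates the stream; in a real topology this is
    where results would be written to HDFS, Kafka, or HBase."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_bolt(split_bolt(word_spout()))
print(counts["to"])  # 2
```

Chaining the generators mirrors the directed acyclic graph of the topology: each bolt consumes the stream emitted by the component upstream of it.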
 Storm consists of the following three daemons, which keep the Storm cluster functioning.

@@ -64,23 +67,28 @@ Storm consists of the following three daemons, which keep the Storm cluster func
 |Zookeeper|Used for cluster coordination.|
 |Supervisor|Listens for work assigned to its machine and starts and stops worker processes based on directives from Nimbus. Each worker process executes a subset of a topology. The user's application logic (spouts and bolts) runs here.|
 
-![nimbus, zookeeper, and supervisor daemons](./media/migrate-storm-to-spark/nimbus-zookeeper-supervisor.png)
+> [!div class="mx-imgBorder"]
+> ![nimbus, zookeeper, and supervisor daemons](./media/migrate-storm-to-spark/nimbus-zookeeper-supervisor.png)
 
-## Spark Streaming / Spark Structured Streaming
+## Spark Streaming architecture and components
 
-* When Spark Streaming is launched, the driver launches the task in Executor.
-* Executor receives a stream from a streaming data source.
-* When the Executor receives data streams, it splits the stream into blocks and keeps them in memory.
-* Blocks of data are replicated to other Executors.
+The following steps summarize how components work together in Spark Streaming (DStreams) and Spark Structured Streaming:
+
+* When Spark Streaming is launched, the driver launches the task in the executor.
+* The executor receives a stream from a streaming data source.
+* When the executor receives data streams, it splits the stream into blocks and keeps them in memory.
+* Blocks of data are replicated to other executors.
 * The processed data is then stored in the target data store.
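The receive, split, and replicate steps above can be sketched as follows (purely illustrative; the block size and the executor names are made-up parameters, not Spark internals):

```python
def split_into_blocks(stream, block_size=3):
    """The receiving executor splits the incoming stream into blocks
    that are kept in memory until the batch is processed."""
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]

def replicate(blocks, executors):
    """Each block is assigned a primary executor and replicated to a
    second executor for fault tolerance before processing begins."""
    placement = {}
    for i, _block in enumerate(blocks):
        primary = executors[i % len(executors)]
        replica = executors[(i + 1) % len(executors)]
        placement[i] = (primary, replica)
    return placement

blocks = split_into_blocks(list(range(7)))
print(blocks)  # [[0, 1, 2], [3, 4, 5], [6]]
print(replicate(blocks, ["exec-1", "exec-2"]))
```

The point of the sketch is the ordering: blocks exist and are replicated before any processing runs, which is why a lost executor does not lose unprocessed data.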

-![spark streaming path to output](./media/migrate-storm-to-spark/spark-streaming-to-output.png)
+> [!div class="mx-imgBorder"]
+> ![spark streaming path to output](./media/migrate-storm-to-spark/spark-streaming-to-output.png)
 
-## Spark Streaming DStream
+## Spark Streaming (DStream) workflow
 
 As each batch interval elapses, a new RDD is produced that contains all the data from that interval. The continuous sets of RDDs are collected into a DStream. For example, if the batch interval is one second long, your DStream emits a batch every second containing one RDD that contains all the data ingested during that second. When processing the DStream, the temperature event appears in one of these batches. A Spark Streaming application processes the batches that contain the events and ultimately acts on the data stored in each RDD.
 
-![spark streaming processing batches](./media/migrate-storm-to-spark/spark-streaming-batches.png)
+> [!div class="mx-imgBorder"]
+> ![spark streaming processing batches](./media/migrate-storm-to-spark/spark-streaming-batches.png)
 
 For details on the different transformations available with Spark Streaming, see [Transformations on DStreams](https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams).
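The DStream-as-a-series-of-RDDs model above can be mimicked with ordinary Python lists standing in for RDDs (a simulation only; a real application would use a `StreamingContext` and DStream transformations, and the sensor names and threshold here are invented):

```python
# A DStream is a continuous series of RDDs, one per batch interval.
# Each inner list stands in for the RDD produced during one second.
dstream = [
    [("sensor-1", 98), ("sensor-2", 71)],   # batch for second 0
    [("sensor-1", 73)],                     # batch for second 1
    [("sensor-2", 105), ("sensor-1", 70)],  # batch for second 2
]

def process_batch(rdd, threshold=100):
    """Acts on the data stored in one RDD: keep temperature events
    that exceed the threshold."""
    return [(sensor, t) for sensor, t in rdd if t > threshold]

# The application processes each batch as its interval elapses:
alerts = [process_batch(rdd) for rdd in dstream]
print(alerts)  # [[], [], [('sensor-2', 105)]]
```

This shows why an event "appears in one of these batches": an event is only visible to the application once the RDD for its interval has been produced.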

@@ -94,9 +102,11 @@ The query output yields a *results table*, which contains the results of your qu
 
 The timing of when data is processed from the input table is controlled by the trigger interval. By default, the trigger interval is zero, so Structured Streaming tries to process the data as soon as it arrives. In practice, this means that as soon as Structured Streaming is done processing the run of the previous query, it starts another processing run against any newly received data. You can configure the trigger to run at an interval, so that the streaming data is processed in time-based batches.
 
-![processing of data in structured streaming](./media/migrate-storm-to-spark/structured-streaming-data-processing.png)
+> [!div class="mx-imgBorder"]
+> ![processing of data in structured streaming](./media/migrate-storm-to-spark/structured-streaming-data-processing.png)
 
-![programming model for structured streaming](./media/migrate-storm-to-spark/structured-streaming-model.png)
+> [!div class="mx-imgBorder"]
+> ![programming model for structured streaming](./media/migrate-storm-to-spark/structured-streaming-model.png)
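The input-table and results-table model described above can be illustrated with a small simulation (plain Python, not the Structured Streaming API; in PySpark the trigger interval would be set with something like `writeStream.trigger(processingTime=...)`):

```python
class StructuredStream:
    """Toy model: rows append to an unbounded input table, and each
    trigger incrementally folds only the new rows into the results table."""
    def __init__(self):
        self.input_table = []  # unbounded, append-only input table
        self.results = {}      # results table: running count per key
        self._processed = 0    # high-water mark of processed rows

    def append(self, rows):
        self.input_table.extend(rows)

    def trigger(self):
        """One processing run against newly received data."""
        for key in self.input_table[self._processed:]:
            self.results[key] = self.results.get(key, 0) + 1
        self._processed = len(self.input_table)

s = StructuredStream()
s.append(["a", "b", "a"])
s.trigger()          # with a zero trigger interval, this runs immediately
s.append(["b"])
s.trigger()          # only the new row "b" is processed this run
print(s.results)  # {'a': 2, 'b': 2}
```

The high-water mark is the key idea: each run touches only rows received since the previous run, so the results table stays consistent without reprocessing the whole input.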
 
 ## General migration flow

@@ -106,33 +116,34 @@ The recommended migration flow from Storm to Spark assumes the following initial
 * Kafka and Storm are deployed on the same virtual network
 * The data processed by Storm is written to a data sink, such as Azure Storage or Azure Data Lake Storage Gen2.
 
-![diagram of presumed current environment](./media/migrate-storm-to-spark/presumed-current-environment.png)
+> [!div class="mx-imgBorder"]
+> ![diagram of presumed current environment](./media/migrate-storm-to-spark/presumed-current-environment.png)
 
 To migrate your application from Storm to one of the Spark streaming APIs, do the following:
 
 1. **Deploy a new cluster.** Deploy a new HDInsight 4.0 Spark cluster in the same virtual network, then deploy your Spark Streaming or Spark Structured Streaming application on it and test it thoroughly.
 
-    ![new spark deployment in HDInsight](./media/migrate-storm-to-spark/new-spark-deployment.png)
+    > [!div class="mx-imgBorder"]
+    > ![new spark deployment in HDInsight](./media/migrate-storm-to-spark/new-spark-deployment.png)
 
 1. **Stop consuming on the old Storm cluster.** In the existing Storm cluster, stop consuming data from the streaming data source and wait for the data to finish writing to the target sink.
 
-    ![stop consuming on current cluster](./media/migrate-storm-to-spark/stop-consuming-current-cluster.png)
+    > [!div class="mx-imgBorder"]
+    > ![stop consuming on current cluster](./media/migrate-storm-to-spark/stop-consuming-current-cluster.png)
 
 1. **Start consuming on the new Spark cluster.** Start streaming data from the newly deployed HDInsight 4.0 Spark cluster. Processing is taken over by the new cluster, which consumes from the latest Kafka offset.
 
-    ![start consuming on new cluster](./media/migrate-storm-to-spark/start-consuming-new-cluster.png)
+    > [!div class="mx-imgBorder"]
+    > ![start consuming on new cluster](./media/migrate-storm-to-spark/start-consuming-new-cluster.png)
 
 1. **Remove the old cluster as needed.** Once the switch is complete and working properly, remove the old HDInsight 3.6 Storm cluster.
 
-    ![remove old HDInsight clusters as needed](./media/migrate-storm-to-spark/remove-old-clusters1.png)
+    > [!div class="mx-imgBorder"]
+    > ![remove old HDInsight clusters as needed](./media/migrate-storm-to-spark/remove-old-clusters1.png)
 
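The cutover in steps 2 and 3 hinges on the new consumer starting from the latest Kafka offset. A toy sketch of that handover (plain Python, not the Kafka client API; in a real Structured Streaming job the equivalent knob would be a source option such as `startingOffsets`):

```python
log = ["e0", "e1", "e2", "e3"]  # one Kafka partition, indexed by offset

def consume(log, start, end):
    """Consume records in [start, end) and return the next offset."""
    for offset in range(start, end):
        pass  # process log[offset] and write it to the sink
    return end

# Step 2: the old Storm cluster drains everything to the sink, then stops.
old_committed = consume(log, 0, len(log))

# New events keep arriving while the Spark cluster is brought up.
log += ["e4", "e5"]

# Step 3: the new Spark cluster takes over from the offset where the old
# cluster stopped, so nothing is reprocessed and nothing is skipped.
new_start = old_committed
print(log[new_start:])  # ['e4', 'e5']
```

Because the old cluster fully drains before stopping, the handover point is unambiguous: every record below `old_committed` is already in the sink, and everything at or above it belongs to the new cluster.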
 ## Next steps
 
 For more information about Storm, Spark Streaming, and Spark Structured Streaming, see the following documents:
 
-* [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html)
 * [Overview of Apache Spark Streaming](../spark/apache-spark-streaming-overview.md)
-* [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
-* [Overview of Apache Spark Structured Streaming](../spark/apache-spark-structured-streaming-overview.md)
-* [What is Apache Storm on Azure HDInsight?](./apache-storm-overview.md)
-* [Azure HDInsight release notes](../hdinsight-version-release.md)
+* [Overview of Apache Spark Structured Streaming](../spark/apache-spark-structured-streaming-overview.md)
