
Commit 498dbab ("updates")
1 parent 7a7f1f4

File tree: 2 files changed (+36, -25 lines)

articles/hdinsight/storm/migrate-storm-to-spark.md

Lines changed: 36 additions & 25 deletions
@@ -6,7 +6,7 @@ ms.author: hrasheed
 ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 12/05/2019
+ms.date: 01/16/2019
 ---
 # Migrate Azure HDInsight 3.6 Apache Storm to HDInsight 4.0 Apache Spark

@@ -22,7 +22,8 @@ If you want to migrate from Apache Storm on HDInsight 3.6 you have multiple opti
 
 This document provides a guide for migrating from Apache Storm to Spark Streaming and Spark Structured Streaming.
 
-![HDInsight Storm migration path](./media/migrate-storm-to-spark/storm-migration-path.png)
+> [!div class="mx-imgBorder"]
+> ![HDInsight Storm migration path](./media/migrate-storm-to-spark/storm-migration-path.png)
 
 ## Comparison between Apache Storm and Spark Streaming, Spark Structured Streaming

@@ -43,7 +44,8 @@ Spark Structured Streaming is replacing Spark Streaming (DStreams). Structured S
 
 Storm provides a model that processes each single event. This means that all incoming records will be processed as soon as they arrive. Spark Streaming applications must wait a fraction of a second to collect each micro-batch of events before sending that batch on for processing. In contrast, an event-driven application processes each event immediately. Spark Streaming latency is typically under a few seconds. The benefits of the micro-batch approach are more efficient data processing and simpler aggregate calculations.
 
-![streaming and micro-batch processing](./media/migrate-storm-to-spark/streaming-and-micro-batch-processing.png)
+> [!div class="mx-imgBorder"]
+> ![streaming and micro-batch processing](./media/migrate-storm-to-spark/streaming-and-micro-batch-processing.png)
 
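The per-event versus micro-batch contrast described above can be sketched in plain Python (a simplified simulation, not Spark or Storm code; the `interval` parameter and the event tuples are illustrative):

```python
from collections import defaultdict

def micro_batch(events, interval=1.0):
    """Group (timestamp, value) events into fixed-length micro-batches,
    the way Spark Streaming collects events before processing a batch."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)  # bucket by arrival time
    return [batches[k] for k in sorted(batches)]

# Storm-style handling would invoke a callback once per record on arrival.
# Micro-batching instead waits out each interval, which makes aggregate
# calculations such as a per-batch sum straightforward:
events = [(0.1, 5), (0.4, 7), (1.2, 3), (1.9, 8), (2.5, 1)]
print([sum(b) for b in micro_batch(events)])  # [12, 11, 1]
```

The trade-off named in the text is visible here: each value waits up to one `interval` before it is processed, but every aggregate is a simple fold over a finished batch.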
 ## Storm architecture and components
 
@@ -54,7 +56,8 @@ Storm topologies are composed of multiple components that are arranged in a dire
 |Spout|Brings data into a topology. Spouts emit one or more streams into the topology.|
 |Bolt|Consumes streams emitted from spouts or other bolts. Bolts might optionally emit streams into the topology. Bolts are also responsible for writing data to external services or storage, such as HDFS, Kafka, or HBase.|
 
-![interaction of storm components](./media/migrate-storm-to-spark/apache-storm-components.png)
+> [!div class="mx-imgBorder"]
+> ![interaction of storm components](./media/migrate-storm-to-spark/apache-storm-components.png)
 
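The spout-and-bolt dataflow above can be modeled as a chain of Python generators (a toy model of a topology, not the Storm API; the names `word_spout`, `split_bolt`, and `count_bolt` are hypothetical):

```python
def word_spout():
    """Spout: brings data into the topology as a stream of tuples."""
    for line in ["to be or not to be", "that is the question"]:
        yield line

def split_bolt(stream):
    """Bolt: consumes the spout's stream and emits one word per tuple."""
    for line in stream:
        yield from line.split()

def count_bolt(stream):
    """Terminal bolt: aggregates the stream; in a real topology this is
    where results would be written to HDFS, Kafka, or HBase."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_bolt(split_bolt(word_spout()))
print(counts["to"])  # 2
```

Chaining the generators mirrors the directed acyclic graph of the topology: each bolt consumes the stream emitted by the component upstream of it.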
 Storm consists of the following three daemons, which keep the Storm cluster functioning.

@@ -64,23 +67,28 @@ Storm consists of the following three daemons, which keep the Storm cluster func
 |Zookeeper|Used for cluster coordination.|
 |Supervisor|Listens for work assigned to its machine and starts and stops worker processes based on directives from Nimbus. Each worker process executes a subset of a topology. The user's application logic (spouts and bolts) runs here.|
 
-![nimbus, zookeeper, and supervisor daemons](./media/migrate-storm-to-spark/nimbus-zookeeper-supervisor.png)
+> [!div class="mx-imgBorder"]
+> ![nimbus, zookeeper, and supervisor daemons](./media/migrate-storm-to-spark/nimbus-zookeeper-supervisor.png)
 
-## Spark Streaming / Spark Structured Streaming
+## Spark Streaming architecture and components
 
-* When Spark Streaming is launched, the driver launches the task in Executor.
-* Executor receives a stream from a streaming data source.
-* When the Executor receives data streams, it splits the stream into blocks and keeps them in memory.
-* Blocks of data are replicated to other Executors.
+The following steps summarize how components work together in Spark Streaming (DStreams) and Spark Structured Streaming:
+
+* When Spark Streaming is launched, the driver launches the task in the executor.
+* The executor receives a stream from a streaming data source.
+* When the executor receives data streams, it splits the stream into blocks and keeps them in memory.
+* Blocks of data are replicated to other executors.
 * The processed data is then stored in the target data store.
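The receive, split, and replicate steps above can be sketched as follows (purely illustrative; the block size and the executor names are made-up parameters, not Spark internals):

```python
def split_into_blocks(stream, block_size=3):
    """The receiving executor splits the incoming stream into blocks
    that are kept in memory until the batch is processed."""
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]

def replicate(blocks, executors):
    """Each block is assigned a primary executor and replicated to a
    second executor for fault tolerance before processing begins."""
    placement = {}
    for i, _block in enumerate(blocks):
        primary = executors[i % len(executors)]
        replica = executors[(i + 1) % len(executors)]
        placement[i] = (primary, replica)
    return placement

blocks = split_into_blocks(list(range(7)))
print(blocks)  # [[0, 1, 2], [3, 4, 5], [6]]
print(replicate(blocks, ["exec-1", "exec-2"]))
```

The point of the sketch is the ordering: blocks exist and are replicated before any processing runs, which is why a lost executor does not lose unprocessed data.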

-![spark streaming path to output](./media/migrate-storm-to-spark/spark-streaming-to-output.png)
+> [!div class="mx-imgBorder"]
+> ![spark streaming path to output](./media/migrate-storm-to-spark/spark-streaming-to-output.png)
 
-## Spark Streaming DStream
+## Spark Streaming (DStream) workflow
 
 As each batch interval elapses, a new RDD is produced that contains all the data from that interval. The continuous sets of RDDs are collected into a DStream. For example, if the batch interval is one second long, your DStream emits a batch every second containing one RDD that contains all the data ingested during that second. When processing the DStream, the temperature event appears in one of these batches. A Spark Streaming application processes the batches that contain the events and ultimately acts on the data stored in each RDD.
 
-![spark streaming processing batches](./media/migrate-storm-to-spark/spark-streaming-batches.png)
+> [!div class="mx-imgBorder"]
+> ![spark streaming processing batches](./media/migrate-storm-to-spark/spark-streaming-batches.png)
 
 For details on the different transformations available with Spark Streaming, see [Transformations on DStreams](https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams).
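The DStream-as-a-series-of-RDDs model above can be mimicked with ordinary Python lists standing in for RDDs (a simulation only; a real application would use a `StreamingContext` and DStream transformations, and the sensor names and threshold here are invented):

```python
# A DStream is a continuous series of RDDs, one per batch interval.
# Each inner list stands in for the RDD produced during one second.
dstream = [
    [("sensor-1", 98), ("sensor-2", 71)],   # batch for second 0
    [("sensor-1", 73)],                     # batch for second 1
    [("sensor-2", 105), ("sensor-1", 70)],  # batch for second 2
]

def process_batch(rdd, threshold=100):
    """Acts on the data stored in one RDD: keep temperature events
    that exceed the threshold."""
    return [(sensor, t) for sensor, t in rdd if t > threshold]

# The application processes each batch as its interval elapses:
alerts = [process_batch(rdd) for rdd in dstream]
print(alerts)  # [[], [], [('sensor-2', 105)]]
```

This shows why an event "appears in one of these batches": an event is only visible to the application once the RDD for its interval has been produced.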

@@ -94,9 +102,11 @@ The query output yields a *results table*, which contains the results of your qu
 
 The timing of when data is processed from the input table is controlled by the trigger interval. By default, the trigger interval is zero, so Structured Streaming tries to process the data as soon as it arrives. In practice, this means that as soon as Structured Streaming is done processing the run of the previous query, it starts another processing run against any newly received data. You can configure the trigger to run at an interval, so that the streaming data is processed in time-based batches.
 
-![processing of data in structured streaming](./media/migrate-storm-to-spark/structured-streaming-data-processing.png)
+> [!div class="mx-imgBorder"]
+> ![processing of data in structured streaming](./media/migrate-storm-to-spark/structured-streaming-data-processing.png)
 
-![programming model for structured streaming](./media/migrate-storm-to-spark/structured-streaming-model.png)
+> [!div class="mx-imgBorder"]
+> ![programming model for structured streaming](./media/migrate-storm-to-spark/structured-streaming-model.png)
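The input-table and results-table model described above can be illustrated with a small simulation (plain Python, not the Structured Streaming API; in PySpark the trigger interval would be set with something like `writeStream.trigger(processingTime=...)`):

```python
class StructuredStream:
    """Toy model: rows append to an unbounded input table, and each
    trigger incrementally folds only the new rows into the results table."""
    def __init__(self):
        self.input_table = []  # unbounded, append-only input table
        self.results = {}      # results table: running count per key
        self._processed = 0    # high-water mark of processed rows

    def append(self, rows):
        self.input_table.extend(rows)

    def trigger(self):
        """One processing run against newly received data."""
        for key in self.input_table[self._processed:]:
            self.results[key] = self.results.get(key, 0) + 1
        self._processed = len(self.input_table)

s = StructuredStream()
s.append(["a", "b", "a"])
s.trigger()          # with a zero trigger interval, this runs immediately
s.append(["b"])
s.trigger()          # only the new row "b" is processed this run
print(s.results)  # {'a': 2, 'b': 2}
```

The high-water mark is the key idea: each run touches only rows received since the previous run, so the results table stays consistent without reprocessing the whole input.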
 
 ## General migration flow

@@ -106,33 +116,34 @@ The recommended migration flow from Storm to Spark assumes the following initial
 * Kafka and Storm are deployed on the same virtual network
 * The data processed by Storm is written to a data sink, such as Azure Storage or Azure Data Lake Storage Gen2.
 
-![diagram of presumed current environment](./media/migrate-storm-to-spark/presumed-current-environment.png)
+> [!div class="mx-imgBorder"]
+> ![diagram of presumed current environment](./media/migrate-storm-to-spark/presumed-current-environment.png)
 
 To migrate your application from Storm to one of the Spark streaming APIs, do the following:
 
 1. **Deploy a new cluster.** Deploy a new HDInsight 4.0 Spark cluster in the same virtual network, then deploy your Spark Streaming or Spark Structured Streaming application on it and test it thoroughly.
 
-    ![new spark deployment in HDInsight](./media/migrate-storm-to-spark/new-spark-deployment.png)
+    > [!div class="mx-imgBorder"]
+    > ![new spark deployment in HDInsight](./media/migrate-storm-to-spark/new-spark-deployment.png)
 
 1. **Stop consuming on the old Storm cluster.** In the existing Storm cluster, stop consuming data from the streaming data source and wait for the data to finish writing to the target sink.
 
-    ![stop consuming on current cluster](./media/migrate-storm-to-spark/stop-consuming-current-cluster.png)
+    > [!div class="mx-imgBorder"]
+    > ![stop consuming on current cluster](./media/migrate-storm-to-spark/stop-consuming-current-cluster.png)
 
 1. **Start consuming on the new Spark cluster.** Start streaming data from the newly deployed HDInsight 4.0 Spark cluster. Processing is taken over by the new cluster, which consumes from the latest Kafka offset.
 
-    ![start consuming on new cluster](./media/migrate-storm-to-spark/start-consuming-new-cluster.png)
+    > [!div class="mx-imgBorder"]
+    > ![start consuming on new cluster](./media/migrate-storm-to-spark/start-consuming-new-cluster.png)
 
 1. **Remove the old cluster as needed.** Once the switch is complete and working properly, remove the old HDInsight 3.6 Storm cluster.
 
-    ![remove old HDInsight clusters as needed](./media/migrate-storm-to-spark/remove-old-clusters1.png)
+    > [!div class="mx-imgBorder"]
+    > ![remove old HDInsight clusters as needed](./media/migrate-storm-to-spark/remove-old-clusters1.png)
 
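The cutover in steps 2 and 3 hinges on the new consumer starting from the latest Kafka offset. A toy sketch of that handover (plain Python, not the Kafka client API; in a real Structured Streaming job the equivalent knob would be a source option such as `startingOffsets`):

```python
log = ["e0", "e1", "e2", "e3"]  # one Kafka partition, indexed by offset

def consume(log, start, end):
    """Consume records in [start, end) and return the next offset."""
    for offset in range(start, end):
        pass  # process log[offset] and write it to the sink
    return end

# Step 2: the old Storm cluster drains everything to the sink, then stops.
old_committed = consume(log, 0, len(log))

# New events keep arriving while the Spark cluster is brought up.
log += ["e4", "e5"]

# Step 3: the new Spark cluster takes over from the offset where the old
# cluster stopped, so nothing is reprocessed and nothing is skipped.
new_start = old_committed
print(log[new_start:])  # ['e4', 'e5']
```

Because the old cluster fully drains before stopping, the handover point is unambiguous: every record below `old_committed` is already in the sink, and everything at or above it belongs to the new cluster.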
 ## Next steps
 
 For more information about Storm, Spark Streaming, and Spark Structured Streaming, see the following documents:
 
-* [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html)
 * [Overview of Apache Spark Streaming](../spark/apache-spark-streaming-overview.md)
-* [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
-* [Overview of Apache Spark Structured Streaming](../spark/apache-spark-structured-streaming-overview.md)
-* [What is Apache Storm on Azure HDInsight?](./apache-storm-overview.md)
-* [Azure HDInsight release notes](../hdinsight-version-release.md)
+* [Overview of Apache Spark Structured Streaming](../spark/apache-spark-structured-streaming-overview.md)
