Commit 3582270 ("initial draft", 1 parent: d90236e)

1 file changed: +12 additions, −53 deletions

articles/hdinsight/storm/migrate-storm-to-spark.md

@@ -32,27 +32,15 @@ Apache Storm can provide different levels of guaranteed message processing. For
### Spark Streaming vs Spark Structured Streaming

-Spark Structured Streaming is replacing Spark Streaming (DStreams). Going forward, Structured Streaming will receive enhancements and maintenance, while DStreams will be in maintenance mode only. Structured Streaming is currently not as feature-complete as DStreams for the sources and sinks that it supports out of the box, so evaluate your requirements to choose the appropriate Spark stream processing option.
+Spark Structured Streaming is replacing Spark Streaming (DStreams). Structured Streaming will continue to receive enhancements and maintenance, while DStreams will be in maintenance mode only. Structured Streaming doesn't yet support as many sources and sinks out of the box as DStreams, so evaluate your requirements to choose the appropriate Spark stream-processing option.
## Streaming (single-event) processing vs micro-batch processing

Storm provides a model that processes each single event: all incoming records are processed as soon as they arrive. In contrast, Spark Streaming applications must wait a fraction of a second to collect each micro-batch of events before sending the batch on for processing, so latency is typically a few seconds at most. The benefits of the micro-batch approach are more efficient data processing and simpler aggregate calculations.

![streaming and micro-batch processing](./media/migrate-storm-to-spark/streaming-and-micro-batch-processing.png)
-## Storm architecture
-
-Storm consists of the following three daemons.
-
-|Daemon |Description |
-|---|---|
-|Nimbus|Similar to Hadoop JobTracker, it's responsible for distributing code around the cluster and assigning tasks to machines and monitoring for failures.|
-|Zookeeper|Used for cluster coordination.|
-|Supervisor|Listens for work assigned to its machine and starts and stops worker processes based on directives from Nimbus. Each worker process executes a subset of a topology. User’s application logic (Spouts and Bolt) run here.|
-
-![nimbus, zookeeper, and supervisor daemons](./media/migrate-storm-to-spark/nimbus-zookeeper-supervisor.png)
-
-## Storm concept
+## Storm architecture and components

Storm topologies are composed of multiple components that are arranged in a directed acyclic graph (DAG). Data flows between the components in the graph. Each component consumes one or more data streams, and can optionally emit one or more streams.
@@ -63,6 +51,16 @@ Storm topologies are composed of multiple components that are arranged in a dire
![interaction of storm components](./media/migrate-storm-to-spark/apache-storm-components.png)
+Storm consists of the following three daemons, which keep the Storm cluster functioning:
+
+|Daemon |Description |
+|---|---|
+|Nimbus|Similar to the Hadoop JobTracker, it's responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.|
+|Zookeeper|Used for cluster coordination.|
+|Supervisor|Listens for work assigned to its machine, and starts and stops worker processes based on directives from Nimbus. Each worker process executes a subset of a topology. The user's application logic (spouts and bolts) runs here.|
+
+![nimbus, zookeeper, and supervisor daemons](./media/migrate-storm-to-spark/nimbus-zookeeper-supervisor.png)
+
## Spark Streaming / Spark Structured Streaming

* When Spark Streaming is launched, the driver launches tasks in the executors.
@@ -122,45 +120,6 @@ In Structured Streaming, data arrives at the system and is immediately ingested
![programming model for structured streaming](./media/migrate-storm-to-spark/structured-streaming-model.png)
-## Spark structured streaming
-
-You can write the basic operations of Spark Structured Streaming code as follows. See [Overview of Apache Spark Structured Streaming](../spark/apache-spark-structured-streaming-overview.md) for more details.
-
-```spark
-case class DeviceData(device: String, deviceType: String, signal: Double, time: DateTime)
-val df: DataFrame = ... // streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: string }
-val ds: Dataset[DeviceData] = df.as[DeviceData] // streaming Dataset with IOT device data
-// Select the devices which have signal more than 10
-df.select("device").where("signal > 10") // using untyped APIs
-ds.filter(_.signal > 10).map(_.device)   // using typed APIs
-// Running count of the number of updates for each device type
-df.groupBy("deviceType").count() // using untyped API
-// Running average signal for each device type
-import org.apache.spark.sql.expressions.scalalang.typed
-ds.groupByKey(_.deviceType).agg(typed.avg(_.signal)) // using typed API
-```
-
-SQL commands:
-
-```spark
-df.createOrReplaceTempView("updates")
-spark.sql("select count(*) from updates") // returns another streaming DF
-```
-
-Window operation:
-
-```spark
-val windowedCounts = words.groupBy(
-  window($"timestamp", "10 minutes", "5 minutes"),
-  $"word"
-).count()
-```
-
-![diagram of structured streaming results](./media/migrate-storm-to-spark/structured-streaming-results.png)
-
-If the built-in operations don't meet the data transformation requirements, you can use UDFs (user-defined functions).
## General migration flow
Presumed current environment:
