articles/hdinsight/storm/migrate-storm-to-spark.md
### Spark streaming vs Spark structured streaming
Spark Structured Streaming is replacing Spark Streaming (DStreams). Structured Streaming will continue to receive enhancements and maintenance, while DStreams will be in maintenance mode only. Structured Streaming does not have as many features as DStreams for the sources and sinks that it supports out of the box, so evaluate your requirements to choose the appropriate Spark stream processing option.
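To make the choice concrete, the sketch below expresses the same line count in both APIs. This is a minimal illustration, not code from this article: the socket source, host, and port are assumed values, and the two snippets would normally live in separate applications.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.SparkSession

// DStreams (Spark Streaming): RDD-based API, maintenance mode only.
val ssc = new StreamingContext(new SparkConf().setAppName("DStreamsLineCount"), Seconds(1))
ssc.socketTextStream("localhost", 9999).count().print()
ssc.start()

// Structured Streaming: DataFrame-based API, where new features land.
val spark = SparkSession.builder.appName("StructuredLineCount").getOrCreate()
spark.readStream.format("socket")
  .option("host", "localhost").option("port", "9999")
  .load()
  .groupBy().count()
  .writeStream.outputMode("complete").format("console").start()
```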
## Streaming (Single event) processing vs Micro-Batch processing
Storm provides a model that processes each event individually: all incoming records are processed as soon as they arrive. In contrast, Spark Streaming applications must wait a fraction of a second to collect each micro-batch of events before sending that batch on for processing, so Spark Streaming latency is typically under a few seconds. The benefits of the micro-batch approach are more efficient data processing and simpler aggregate calculations.
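As a concrete illustration of the micro-batch model, the batch interval below (an assumed one-second value over a hypothetical socket source) controls how long Spark Streaming buffers events before processing them together, and batch-oriented aggregates such as windowed counts stay simple:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each micro-batch collects one second of events before it is processed;
// a smaller interval lowers latency, a larger one favors throughput.
val ssc = new StreamingContext(new SparkConf().setAppName("MicroBatch"), Seconds(1))
val events = ssc.socketTextStream("localhost", 9999)

// A running count over a sliding 10-second window, recomputed every batch.
events.window(Seconds(10), Seconds(1)).count().print()

ssc.start()
ssc.awaitTermination()
```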
## Storm architecture and components
Storm topologies are composed of multiple components that are arranged in a directed acyclic graph (DAG). Data flows between the components in the graph. Each component consumes one or more data streams, and can optionally emit one or more streams.
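As a rough sketch of how such a DAG is wired together, Storm's `TopologyBuilder` connects components by name; the `SentenceSpout`, `SplitBolt`, and `CountBolt` classes below are hypothetical placeholders for your own spouts and bolts:

```scala
import org.apache.storm.topology.TopologyBuilder
import org.apache.storm.tuple.Fields

// Each setBolt/grouping call adds an edge to the topology's DAG.
val builder = new TopologyBuilder()
builder.setSpout("sentences", new SentenceSpout(), 2)                     // emits a stream of sentences
builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("sentences") // consumes sentences, emits words
builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("split", new Fields("word")) // running word counts
val topology = builder.createTopology()
```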
Storm consists of the following three daemons, which keep the Storm cluster functioning.
|Daemon |Description |
|---|---|
|Nimbus|Similar to the Hadoop JobTracker, Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.|
|Zookeeper|Used for cluster coordination.|
|Supervisor|Listens for work assigned to its machine, and starts and stops worker processes based on directives from Nimbus. Each worker process executes a subset of a topology. The user's application logic (spouts and bolts) runs here.|
## Spark Streaming / Spark Structured Streaming
* When a Spark Streaming application is launched, the driver launches tasks in the executors.
## Spark structured streaming
You can write the basic operations of Spark Structured Streaming as follows. For more details, see [Overview of Apache Spark Structured Streaming](../spark/apache-spark-structured-streaming-overview.md).
```spark
case class DeviceData(device: String, deviceType: String, signal: Double, time: DateTime)

val df: DataFrame = ... // streaming DataFrame with IoT device data with schema { device: string, deviceType: string, signal: double, time: string }
val ds: Dataset[DeviceData] = df.as[DeviceData] // streaming Dataset with IoT device data

// Select the devices which have a signal greater than 10
df.select("device").where("signal > 10")   // using untyped APIs
ds.filter(_.signal > 10).map(_.device)     // using typed APIs

// Running count of the number of updates for each device type
df.groupBy("deviceType").count()           // using untyped API
```
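The snippet above only defines transformations. As a minimal sketch that reuses the `df` from that example (the console sink is chosen purely for illustration), a Structured Streaming query begins running only once you attach a sink and call `start()`:

```scala
// Start the running count as a continuous query that prints updated results.
val query = df.groupBy("deviceType").count()
  .writeStream
  .outputMode("complete")   // emit the full updated counts on each trigger
  .format("console")        // illustrative sink; a real job might write to Kafka or files
  .start()

query.awaitTermination()    // block until the query stops
```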