You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
description: How to set up Apache Spark Streaming to process an event once and only once.
4
-
ms.service: hdinsight
5
4
author: hrasheed-msft
6
5
ms.author: hrasheed
7
6
ms.reviewer: jasonh
7
+
ms.service: hdinsight
8
8
ms.custom: hdinsightactive
9
9
ms.topic: conceptual
10
-
ms.date: 11/06/2018
10
+
ms.date: 11/15/2018
11
11
---
12
+
12
13
# Create Apache Spark Streaming jobs with exactly-once event processing
13
14
14
-
Stream processing applications take different approaches to how they handle re-processing messages after some failure in the system:
15
+
Stream processing applications take different approaches to how they handle reprocessing messages after some failure in the system:
15
16
16
17
* At least once: Each message is guaranteed to be processed, but it may get processed more than once.
17
-
* At most once: Each message may or may not be processed. If a message is processed, it is only processed once.
18
+
* At most once: Each message may or may not be processed. If a message is processed, it's only processed once.
18
19
* Exactly once: Each message is guaranteed to be processed once and only once.
19
20
20
21
This article shows you how to configure Spark Streaming to achieve exactly-once processing.
@@ -45,13 +46,13 @@ In Spark Streaming, sources like Event Hubs and Kafka have *reliable receivers*,
45
46
46
47
Spark Streaming supports the use of a Write-Ahead Log, where each received event is first written to Spark's checkpoint directory in fault-tolerant storage and then stored in a Resilient Distributed Dataset (RDD). In Azure, the fault-tolerant storage is HDFS backed by either Azure Storage or Azure Data Lake Storage. In your Spark Streaming application, the Write-Ahead Log is enabled for all receivers by setting the `spark.streaming.receiver.writeAheadLog.enable` configuration setting to `true`. The Write-Ahead Log provides fault tolerance for failures of both the driver and the executors.
47
48
48
-
For workers running tasks against the event data, each RDD is by definition both replicated and distributed across multiple workers. If a task fails because the worker running it crashed, the task will be restarted on another worker that has a replica of the event data, so the event is not lost.
49
+
For workers running tasks against the event data, each RDD is by definition both replicated and distributed across multiple workers. If a task fails because the worker running it crashed, the task will be restarted on another worker that has a replica of the event data, so the event isn't lost.
49
50
50
51
### Use checkpoints for drivers
51
52
52
53
The job drivers need to be restartable. If the driver running your Spark Streaming application crashes, it takes down with it all running receivers, tasks, and any RDDs storing event data. In this case, you need to be able to save the progress of the job so you can resume it later. This is accomplished by checkpointing the Directed Acyclic Graph (DAG) of the DStream periodically to fault-tolerant storage. The DAG metadata includes the configuration used to create the streaming application, the operations that define the application, and any batches that are queued but not yet completed. This metadata enables a failed driver to be restarted from the checkpoint information. When the driver restarts, it will launch new receivers that themselves recover the event data back into RDDs from the Write-Ahead Log.
53
54
54
-
Checkpoints are enabled in Spark Streaming in two steps.
55
+
Checkpoints are enabled in Spark Streaming in two steps.
55
56
56
57
1. In the StreamingContext object, configure the storage path for the checkpoints:
57
58
@@ -73,13 +74,13 @@ Checkpoints are enabled in Spark Streaming in two steps.
73
74
74
75
###Use idempotent sinks
75
76
76
-
The destination sink to which your job writes results must be able to handle the situation where it isgiventhe same result more than once. The sink must be able to detect such duplicate results and ignore them. An*idempotent* sink can be called multiple times with the same data with no change of state.
77
+
The destination sink to which your job writes results must be able to handle the situation where it'sgiventhe same result more than once. The sink must be able to detect such duplicate results and ignore them. An*idempotent* sink can be called multiple times with the same data with no change of state.
77
78
78
-
You can create idempotent sinks by implementing logic that first checks for the existence of the incoming result in the datastore. If the result already exists, the write should appear to succeed from the perspective of your Spark job, but in reality your data store ignored the duplicate data. If the result does not exist, then the sink should insert thisnew result into its storage.
79
+
You can create idempotent sinks by implementing logic that first checks for the existence of the incoming result in the datastore. If the result already exists, the write should appear to succeed from the perspective of your Spark job, but in reality your data store ignored the duplicate data. If the result doesn'texist, then the sink should insert thisnew result into its storage.
79
80
80
81
For example, you could use a stored procedure withAzureSQLDatabase that inserts events into a table. This stored procedure first looks up the event by key fields, and only when no matching event found is the record inserted into the table.
81
82
82
-
Another example is to use a partitioned file system, like AzureStorage blobs or AzureDataLakeStorage. Inthiscase your sink logic does not need to check for the existence of a file. If the file representing the event exists, it is simply overwritten with the same data. Otherwise, a new file is created at the computed path.
83
+
Another example is to use a partitioned file system, like AzureStorage blobs or AzureDataLakeStorage. Inthiscase, your sink logic doesn'tneed to check for the existence of a file. If the file representing the event exists, it's simply overwritten with the same data. Otherwise, a new file is created at the computed path.
0 commit comments