articles/cognitive-services/Anomaly-Detector/tutorials/anomaly-detection-streaming-databricks.md
# Tutorial: Anomaly detection on streaming data using Azure Databricks
[Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/) is a fast, easy, and collaborative Apache Spark–based analytics service. The Anomaly Detector API, part of Azure Cognitive Services, provides a way of monitoring your time series data. Use this tutorial to run anomaly detection on a stream of data in near real time using Azure Databricks. You'll ingest Twitter data using Azure Event Hubs, import it into Azure Databricks using the Spark Event Hubs connector, and then use the API to detect anomalies in the streamed data.
The following illustration shows the application flow:
Select **Create**.
4. The workspace creation takes a few minutes.
## Create a Spark cluster in Databricks
* For this article, create a cluster with the **5.2** runtime. Do NOT select the **5.3** runtime.
* Make sure the **Terminate after \_\_ minutes of inactivity** checkbox is selected. Provide a duration (in minutes) after which the cluster is terminated if it isn't being used.
Select **Create cluster**.
4. The cluster creation takes several minutes. Once the cluster is running, you can attach notebooks to the cluster and run Spark jobs.
## Create a Twitter application
2. In the New Library page, for **Source**, select **Maven**. For **Coordinates**, enter the coordinate for the package you want to add. Here are the Maven coordinates for the libraries used in this tutorial:
* Twitter API - `org.twitter4j:twitter4j-core:4.0.7`
Select **Create**.
5. After the resource is created, from the **Overview** tab, copy and save the **Endpoint** URL, as shown in the screenshot. Then select **Show access keys**.
7. Save the endpoint URL and access key values that you retrieved in this step. You'll need them later in this tutorial.
## Create notebooks in Databricks
In this section, you create two notebooks in the Databricks workspace with the following names:
* **SendTweetsToEventHub** - A producer notebook you use to get tweets from Twitter and stream them to Event Hubs.
* **AnalyzeTweetsFromEventHub** - A consumer notebook you use to read the tweets from Event Hubs and run anomaly detection.
1. In the Azure Databricks workspace, select **Workspace** from the left pane. From the **Workspace** drop-down, select **Create**, and then select **Notebook**.

## Send tweets to Event Hubs
In the **SendTweetsToEventHub** notebook, paste the following code, and replace the placeholders with values for the Event Hubs namespace and the Twitter application that you created earlier. This notebook extracts the creation time and the number of "Like"s from tweets with the keyword "Azure", and streams those as events into Event Hubs in real time.
```scala
//
// ...
eventHubClient.get().close()
pool.shutdown()
```
To run the notebook, press **SHIFT + ENTER**. You see output like the following snippet. Each event in the output is a combination of a timestamp and the number of "Like"s that is ingested into Event Hubs.
```
Sent event: {"timestamp":"2019-04-24T09:39:40.000Z","favorite":0}
```
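Each `Sent event` line carries a small JSON payload. As a quick illustration (a sketch outside the tutorial's notebooks, not part of its code), here is how such an event could be parsed; the field names `timestamp` and `favorite` come from the sample output above:

```python
import json
from datetime import datetime, timezone

# Sample event payload, as shown in the output above.
raw = '{"timestamp":"2019-04-24T09:39:40.000Z","favorite":0}'

event = json.loads(raw)

# Parse the ISO-8601 timestamp; the trailing "Z" denotes UTC.
created_at = datetime.strptime(
    event["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ"
).replace(tzinfo=timezone.utc)
likes = event["favorite"]

print(created_at.isoformat(), likes)
```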
## Read tweets from Event Hubs
In the **AnalyzeTweetsFromEventHub** notebook, paste the following code, and replace the placeholder with values for your Anomaly Detector resource that you created earlier. This notebook reads the tweets that you earlier streamed into Event Hubs using the **SendTweetsToEventHub** notebook.
Then load the data from Event Hubs for anomaly detection. Replace the placeholder with values for the Azure Event Hubs namespace that you created earlier.
```scala
//
// ...
display(msgStream)
```
The output now resembles the following image. Note that the dates in your table might differ from those in this tutorial because the data is streamed in real time.

You have now streamed data from Azure Event Hubs into Azure Databricks at near real time using the Event Hubs connector for Apache Spark. For more information on how to use the Event Hubs connector for Spark, see the [connector documentation](https://github.com/Azure/azure-event-hubs-spark/tree/master/docs).
Then write the aggregated output to Delta. Because anomaly detection requires a longer history window, we're using Delta to keep the historical data for the point you want to detect.
Replace the "[Placeholder: table name]" with a qualified Delta table name to be created (for example, "tweets"). Replace "[Placeholder: folder name for checkpoints]" with a string value that's unique each time you run this code (for example, "etl-from-eventhub-20190605").
To learn more about Delta Lake on Azure Databricks, see the [Delta Lake Guide](https://docs.azuredatabricks.net/delta/index.html).
```scala
groupStream.writeStream
// ...
```
Replace the "[Placeholder: table name]" with the same Delta table name you've selected above.
```scala
//
// Show Aggregate Result
//
// ...
```
Now the aggregated time series data is continuously ingested into Delta. You can then schedule an hourly job to detect an anomaly in the latest point.
Replace the "[Placeholder: table name]" with the same Delta table name you've selected above.
```scala
val adResult = spark.sql("SELECT '" + endTime.toString + "' as datetime, anomalydetect(groupTime, average) as anomaly FROM series")
adResult.show()
```
Result as below:

```
+--------------------+-------+
|2019-04-16T00:00:00Z| false|
+--------------------+-------+
```
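The `anomalydetect` UDF in the query above calls the Anomaly Detector service to judge the latest point against its history window. Purely to illustrate the idea of latest-point detection, here is a toy z-score stand-in (this is not the service's algorithm, and the threshold of 3 standard deviations is an arbitrary assumption):

```python
import statistics

def is_latest_point_anomalous(history, latest, threshold=3.0):
    """Toy stand-in for latest-point anomaly detection: flag `latest`
    if it lies more than `threshold` standard deviations from the
    mean of the history window."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

# Hypothetical hourly averages (like the "average" column), then a spike.
history = [12.0, 11.5, 12.2, 11.8, 12.1, 11.9, 12.0, 12.3]
print(is_latest_point_anomalous(history, 12.1))  # a normal point
print(is_latest_point_anomalous(history, 60.0))  # a clear spike
```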
That's it! Using Azure Databricks, you have successfully streamed data into Azure Event Hubs, consumed the stream data using the Event Hubs connector, and then run anomaly detection on streaming data in near real time.
Although the granularity in this tutorial is hourly, you can always change the granularity to meet your needs.
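With the Anomaly Detector REST API, the granularity travels in the request body alongside the time series points. A minimal sketch of building such a request payload, assuming the v1 request shape with `series` and `granularity` fields (the sample values are made up):

```python
import json

def build_detect_request(points, granularity="hourly"):
    """Build a detection request body. `points` is a list of
    (ISO-8601 timestamp, value) pairs; change `granularity`
    (for example to "daily") to match your data's cadence."""
    return {
        "series": [{"timestamp": ts, "value": v} for ts, v in points],
        "granularity": granularity,
    }

# Hypothetical hourly aggregates, like the groupTime/average output above.
points = [
    ("2019-04-16T00:00:00Z", 12.0),
    ("2019-04-16T01:00:00Z", 11.8),
    ("2019-04-16T02:00:00Z", 12.3),
]
body = build_detect_request(points, granularity="hourly")
print(json.dumps(body, indent=2))
```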
## Clean up resources
After you have finished running the tutorial, you can terminate the cluster. To do so, in the Azure Databricks workspace, select **Clusters** from the left pane. For the cluster you want to terminate, move the cursor over the ellipsis in the **Actions** column, select the **Terminate** icon, and then select **Confirm**.
