
Commit 43feab2

Merge pull request #205730 from ShawnJackson/stream-analytics-job
edit pass: Three articles about Stream Analytics job metrics
2 parents f95e169 + 58b80a7 commit 43feab2

File tree: 3 files changed (+131, -132 lines)

Lines changed: 43 additions & 41 deletions

---
title: Analyze Stream Analytics job performance by using metrics and dimensions
description: This article describes how to use Azure Stream Analytics metrics and dimensions to analyze a job's performance.
author: xujiang1
ms.author: xujiang1
ms.service: stream-analytics
ms.topic: troubleshooting
ms.custom:
ms.date: 07/07/2022
---

# Analyze Stream Analytics job performance by using metrics and dimensions

To understand an Azure Stream Analytics job's health, it's important to know how to use the job's metrics and dimensions. You can use the Azure portal, the Visual Studio Code Stream Analytics extension, or an SDK to get the metrics and dimensions that you're interested in.
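
The rest of this article uses the portal, but if you want to pull the same numbers programmatically, here's a minimal sketch that uses the `azure-monitor-query` and `azure-identity` Python packages to read the **Watermark Delay** metric split by partition. The resource ID is a placeholder, and the metric and dimension names (`OutputWatermarkDelaySeconds`, `PartitionId`) are assumptions to confirm against your job's metric definitions.

```python
# A minimal sketch of the SDK route. The resource ID is a placeholder, and the
# metric/dimension names (OutputWatermarkDelaySeconds, PartitionId) are assumptions
# to verify against the metric definitions that your job exposes in Azure Monitor.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

JOB_RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.StreamAnalytics/streamingjobs/<job-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())

# Query the last hour of watermark delay, split by the Partition ID dimension.
response = client.query_resource(
    JOB_RESOURCE_ID,
    metric_names=["OutputWatermarkDelaySeconds"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.MAXIMUM],
    filter="PartitionId eq '*'",  # '*' splits the result into one series per partition
)

for metric in response.metrics:
    for series in metric.timeseries:
        # metadata_values holds the dimension values (here, the partition ID) for this series.
        partition = next(iter(series.metadata_values.values()), "n/a")
        worst = max(((point.maximum or 0) for point in series.data), default=0.0)
        print(f"partition {partition}: max watermark delay {worst:.0f}s")
```

The same pattern works for the other metrics and dimensions discussed in this article; only the metric name and the filter string change.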

This article demonstrates how to use Stream Analytics job metrics and dimensions to analyze a job's performance through the Azure portal.

Watermark delay and backlogged input events are the main metrics for determining the performance of your Stream Analytics job. If your job's watermark delay is continuously increasing and input events are backlogged, your job can't keep up with the rate of input events and produce outputs in a timely manner.

Let's look at several examples that use the **Watermark Delay** metric data as a starting point for analyzing a job's performance.

## No input for a certain partition increases job watermark delay

If your embarrassingly parallel job's watermark delay is steadily increasing, go to **Metrics**. Then use these steps to find out if the root cause is a lack of data in some partitions of your input source:

1. Check which partition has the increasing watermark delay. Select the **Watermark Delay** metric and split it by the **Partition ID** dimension. In the following example, partition 465 has a high watermark delay.

   :::image type="content" source="./media/stream-analytics-job-analysis-with-metric-dimensions/01-watermark-delay-splitting-with-partition-id.png" alt-text="Screenshot of a chart that shows Watermark Delay split by Partition ID for the case of no input in a partition." lightbox="./media/stream-analytics-job-analysis-with-metric-dimensions/01-watermark-delay-splitting-with-partition-id.png":::

2. Check if any input data is missing for this partition. Select the **Input Events** metric and filter it to this specific partition ID.

   :::image type="content" source="./media/stream-analytics-job-analysis-with-metric-dimensions/02-input-events-splitting-with-partition-id.png" alt-text="Screenshot of a chart that shows Input Events filtered to one partition ID for the case of no input in a partition." lightbox="./media/stream-analytics-job-analysis-with-metric-dimensions/02-input-events-splitting-with-partition-id.png":::

### What further action can you take?

The watermark delay for this partition is increasing because no input events are flowing into it. If your job's tolerance window for late arrivals is several hours and no input data is flowing into a partition, the watermark delay for that partition is expected to keep increasing until it reaches the late arrival window.

For example, if your late arrival window is 6 hours and input data isn't flowing into input partition 1, the watermark delay for output partition 1 will increase until it reaches 6 hours. Check whether your input source is producing data as expected.
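
One way to confirm whether the source is still producing events for that partition, assuming your input is an event hub, is to sample the partition's last enqueued sequence number twice and compare. The following is a sketch using the `azure-eventhub` package; the connection string, event hub name, and partition ID are placeholders.

```python
# Sketch: confirm whether an event hub partition is still receiving events by
# sampling its last enqueued sequence number twice. The connection string, event
# hub name, and partition ID are placeholders for your own values.
import time

from azure.eventhub import EventHubConsumerClient

client = EventHubConsumerClient.from_connection_string(
    "<event-hub-namespace-connection-string>",
    consumer_group="$Default",
    eventhub_name="<event-hub-name>",
)

with client:
    before = client.get_partition_properties("1").last_enqueued_sequence_number
    time.sleep(60)  # wait a minute, then sample again
    after = client.get_partition_properties("1").last_enqueued_sequence_number

if after > before:
    print(f"Partition 1 received {after - before} new events in the last minute.")
else:
    print("Partition 1 received no new events in the last minute.")
```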

## Input data skew causes a high watermark delay

As mentioned in the preceding case, when your embarrassingly parallel job has a high watermark delay, the first thing to do is to split the **Watermark Delay** metric by the **Partition ID** dimension. You can then identify whether all the partitions have a high watermark delay, or just a few of them.

In the following example, partitions 0 and 1 have a higher watermark delay (about 20 to 30 seconds) than the other eight partitions do. The other partitions' watermark delays are steady at about 8 to 10 seconds.

:::image type="content" source="./media/stream-analytics-job-analysis-with-metric-dimensions/03-watermark-delay-splitting-with-partition-id.png" alt-text="Screenshot of a chart that shows Watermark Delay split by Partition ID for the case of data skew." lightbox="./media/stream-analytics-job-analysis-with-metric-dimensions/03-watermark-delay-splitting-with-partition-id.png":::

Let's check what the input data looks like for all these partitions with the **Input Events** metric split by **Partition ID**:

:::image type="content" source="./media/stream-analytics-job-analysis-with-metric-dimensions/04-input-events-splitting-with-partition-id.png" alt-text="Screenshot of a chart that shows Input Events split by Partition ID for the case of data skew." lightbox="./media/stream-analytics-job-analysis-with-metric-dimensions/04-input-events-splitting-with-partition-id.png":::
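
If you'd rather quantify the skew than read it off the chart, the following sketch totals **Input Events** per partition over the last hour and flags partitions that received much more than the median. It makes the same assumptions as the earlier SDK example (placeholder resource ID; `InputEvents` and `PartitionId` as the metric and dimension names).

```python
# Sketch: total InputEvents per partition over the last hour, flagging partitions
# that received far more data than the median (possible data skew). Resource ID is
# a placeholder; the InputEvents/PartitionId names are assumptions to verify.
from datetime import timedelta
from statistics import median

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

JOB_RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.StreamAnalytics/streamingjobs/<job-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())
response = client.query_resource(
    JOB_RESOURCE_ID,
    metric_names=["InputEvents"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.TOTAL],
    filter="PartitionId eq '*'",
)

totals = {}
for series in response.metrics[0].timeseries:
    partition = next(iter(series.metadata_values.values()), "n/a")
    totals[partition] = sum(point.total or 0 for point in series.data)

typical = median(totals.values()) or 1
for partition, count in sorted(totals.items(), key=lambda kv: -kv[1]):
    flag = "  <-- possible skew" if count > 2 * typical else ""
    print(f"partition {partition}: {count:.0f} events{flag}")
```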

### What further action can you take?

As the example shows, the partitions that have a high watermark delay (0 and 1) are receiving significantly more input data than the other partitions are. We call this *data skew*. The streaming nodes that process the skewed partitions need to consume more CPU and memory resources than the others do, as shown in the following screenshot.

:::image type="content" source="./media/stream-analytics-job-analysis-with-metric-dimensions/05-resource-utilization-of-the-partitions-with-data-skew.png" alt-text="Screenshot of a chart that shows the resource utilization of partitions with data skew." lightbox="./media/stream-analytics-job-analysis-with-metric-dimensions/05-resource-utilization-of-the-partitions-with-data-skew.png":::

Streaming nodes that process partitions with higher data skew exhibit higher CPU and/or streaming unit (SU) utilization, which affects the job's performance and increases watermark delay. To mitigate the skew, repartition your input data more evenly.

## Overloaded CPU or memory increases watermark delay

When an embarrassingly parallel job has an increasing watermark delay, the increase might happen not just on one or several partitions but on all of them. How do you confirm that your job is in this situation?

1. Split the **Watermark Delay** metric by **Partition ID**. For example:

   :::image type="content" source="./media/stream-analytics-job-analysis-with-metric-dimensions/06-watermark-delay-splitting-with-partition-id-all-increasing.png" alt-text="Screenshot of a chart that shows Watermark Delay split by Partition ID for the case of overloaded CPU and memory." lightbox="./media/stream-analytics-job-analysis-with-metric-dimensions/06-watermark-delay-splitting-with-partition-id-all-increasing.png":::

2. Split the **Input Events** metric by **Partition ID** to confirm whether there's data skew in the input data for each partition.

3. Check the CPU and SU utilization to see if the utilization in all streaming nodes is too high.

   :::image type="content" source="./media/stream-analytics-job-analysis-with-metric-dimensions/07-cpu-and-memory-utilization-splitting-with-node-name.png" alt-text="Screenshot of a chart that shows CPU and memory utilization split by node name for the case of overloaded CPU and memory." lightbox="./media/stream-analytics-job-analysis-with-metric-dimensions/07-cpu-and-memory-utilization-splitting-with-node-name.png":::

4. If the CPU and SU utilization is very high (more than 80 percent) in all streaming nodes, you can conclude that each streaming node is processing a large amount of data.

   You can further check how many partitions are allocated to one streaming node by checking the **Input Events** metric. Filter by streaming node ID with the **Node Name** dimension, and split by **Partition ID**. (A programmatic version of this check appears in the sketch after these steps.)

   :::image type="content" source="./media/stream-analytics-job-analysis-with-metric-dimensions/08-partition-count-on-one-streaming-node.png" alt-text="Screenshot of a chart that shows the partition count on one streaming node for the case of overloaded CPU and memory." lightbox="./media/stream-analytics-job-analysis-with-metric-dimensions/08-partition-count-on-one-streaming-node.png":::

5. The preceding screenshot shows that four partitions are allocated to one streaming node, and that node is using about 90 to 100 percent of its resources. Use a similar approach to check the remaining streaming nodes and confirm that they're also processing data from four partitions.
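
The following sketch is a programmatic version of the check in step 4: for a single streaming node, it lists the partitions that node is reading and how many events each contributed over the last hour. The node name is a placeholder that you can take from the **Node Name** dimension values shown in step 3, and the metric and dimension names are assumptions to verify against your job.

```python
# Sketch for step 4: for one streaming node, list which partitions it's reading and
# how many events each contributed over the last hour. The node name is a placeholder;
# InputEvents, NodeName, and PartitionId are assumed metric/dimension names.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

JOB_RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.StreamAnalytics/streamingjobs/<job-name>"
)
NODE_NAME = "<streaming-node-name>"  # a value from the Node Name dimension

client = MetricsQueryClient(DefaultAzureCredential())
response = client.query_resource(
    JOB_RESOURCE_ID,
    metric_names=["InputEvents"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.TOTAL],
    # Filter to one node, and split the remaining series by partition ID.
    filter=f"NodeName eq '{NODE_NAME}' and PartitionId eq '*'",
)

partitions = {}
for series in response.metrics[0].timeseries:
    # metadata_values holds the dimension values (node name and partition ID) for this series.
    labels = tuple(series.metadata_values.values())
    partitions[labels] = sum(point.total or 0 for point in series.data)

print(f"{NODE_NAME} is reading from {len(partitions)} partition(s):")
for labels, count in partitions.items():
    print(f"  {labels}: {count:.0f} events")
```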

### What further action can you take?

Consider reducing the partition count for each streaming node so that each node has less input data to process. You can double the SUs so that each streaming node handles data from two partitions, or quadruple the SUs so that each streaming node handles data from one partition. For information about the relationship between SU assignment and streaming node count, see [Understand and adjust streaming units](./stream-analytics-streaming-unit-consumption.md).
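
As a rough illustration of that relationship, the following sketch estimates partitions per streaming node at a few SU settings. It assumes the commonly documented sizing of one streaming node per 6 SUs for a fully parallel job and a hypothetical 16-partition input; confirm the exact numbers for your job in the linked article.

```python
# Illustrative only: estimate how many partitions each streaming node would handle
# at different SU settings. Assumes one streaming node per 6 SUs for a fully
# parallel (embarrassingly parallel) job; verify this sizing in the SU article.
import math

PARTITION_COUNT = 16  # hypothetical input partition count


def partitions_per_node(streaming_units: int, partition_count: int = PARTITION_COUNT) -> int:
    nodes = max(1, streaming_units // 6)
    return math.ceil(partition_count / nodes)


for sus in (24, 48, 96):
    print(f"{sus} SUs -> ~{sus // 6} nodes -> "
          f"about {partitions_per_node(sus)} partition(s) per node")
```

With these assumptions, doubling from 24 to 48 SUs drops each node from four partitions to two, and quadrupling to 96 SUs drops it to one, which matches the scenario described above.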

What should you do if the watermark delay is still increasing when one streaming node is handling data from one partition? Repartition your input with more partitions to reduce the amount of data in each partition. For details, see [Use repartitioning to optimize Azure Stream Analytics jobs](./repartition.md).

## Next steps

* [Monitor a Stream Analytics job with the Azure portal](./stream-analytics-monitoring.md)
* [Azure Stream Analytics job metrics](./stream-analytics-job-metrics.md)
* [Dimensions for Azure Stream Analytics metrics](./stream-analytics-job-metrics-dimensions.md)
* [Understand and adjust streaming units](./stream-analytics-streaming-unit-consumption.md)
