ms.author: jeanb
ms.reviewer: mamccrea
ms.service: stream-analytics
ms.topic: conceptual
ms.date: 05/04/2020
---
# Leverage query parallelization in Azure Stream Analytics
This article shows you how to take advantage of parallelization in Azure Stream Analytics. You learn how to scale Stream Analytics jobs by configuring input partitions and tuning the analytics query definition.
As a prerequisite, you may want to be familiar with the notion of Streaming Unit described in [Understand and adjust Streaming Units](stream-analytics-streaming-unit-consumption.md).
## What are the parts of a Stream Analytics job?
A Stream Analytics job definition includes at least one streaming input, a query, and output. Inputs are where the job reads the data stream from. The query is used to transform the data input stream, and the output is where the job sends the job results to.
## Partitions in sources and sinks
Partitioning lets you divide data into subsets based on a [partition key](https://docs.microsoft.com/azure/event-hubs/event-hubs-scalability#partitions). If your input (for example Event Hubs) is partitioned by a key, it is highly recommended to specify this partition key when adding input to your Stream Analytics job. Scaling a Stream Analytics job takes advantage of partitions in the input and output. A Stream Analytics job can consume and write different partitions in parallel, which increases throughput.
### Inputs
All Azure Stream Analytics inputs can take advantage of partitioning:
- EventHub (need to set the partition key explicitly with PARTITION BY keyword if using compatibility level 1.1 or below)
- IoT Hub (need to set the partition key explicitly with PARTITION BY keyword if using compatibility level 1.1 or below)
- Blob storage
### Outputs
For more information about partitions, see the following articles:
## Embarrassingly parallel jobs
An *embarrassingly parallel* job is the most scalable scenario in Azure Stream Analytics. It connects one partition of the input to one instance of the query to one partition of the output. This parallelism has the following requirements:
1. If your query logic depends on the same key being processed by the same query instance, you must make sure that the events go to the same partition of your input. For Event Hubs or IoT Hub, this means that the event data must have the **PartitionKey** value set. Alternatively, you can use partitioned senders. For blob storage, this means that the events are sent to the same partition folder. An example would be a query instance that aggregates data per userID, where the input event hub is partitioned using userID as the partition key. However, if your query logic does not require the same key to be processed by the same query instance, you can ignore this requirement. An example of this logic would be a simple select-project-filter query.
2. The next step is to make sure your query is partitioned. For jobs with compatibility level 1.2 or higher (recommended), a custom column can be specified as the partition key in the input settings, and the job will be parallelized automatically. Jobs with compatibility level 1.0 or 1.1 require you to use **PARTITION BY PartitionId** in all the steps of your query. Multiple steps are allowed, but they all must be partitioned by the same key.
3. Most of the outputs supported in Stream Analytics can take advantage of partitioning. If you use an output type that doesn't support partitioning, your job won't be *embarrassingly parallel*. For Event Hubs outputs, ensure **Partition key column** is set to the same partition key used in the query. Refer to the [output section](#outputs) for more details.
4. The number of input partitions must equal the number of output partitions. Blob storage output can support partitions and inherits the partitioning scheme of the upstream query. When a partition key for Blob storage is specified, data is partitioned per input partition thus the result is still fully parallel. Here are examples of partition values that allow a fully parallel job:
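To illustrate requirement 2, here is a minimal sketch of a multi-step query under compatibility level 1.0 or 1.1 in which every step is partitioned by the same key. The step name and columns are illustrative, not from the original article:

```SQL
-- Sketch: both steps use PARTITION BY PartitionId, so the job stays fully parallel
WITH Step1 AS (
    SELECT COUNT(*) AS Count, TollBoothId, PartitionId
    FROM Input1 PARTITION BY PartitionId
    GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
)
SELECT SUM(Count) AS Count, TollBoothId
FROM Step1 PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
```

Because the second step reuses **PartitionId** as its partition key, no data needs to move between query instances.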
The following sections discuss some example scenarios that are embarrassingly parallel.
Query:
```SQL
--Using compatibility level 1.2 or above
SELECT TollBoothId
FROM Input1
WHERE TollBoothId > 100

--Using compatibility level 1.0 or 1.1
SELECT TollBoothId
FROM Input1 PARTITION BY PartitionId
WHERE TollBoothId > 100
```
This query is a simple filter. Therefore, we don't need to worry about partitioning the input.
Query:
```SQL
--Using compatibility level 1.2 or above
SELECT COUNT(*) AS Count, TollBoothId
FROM Input1
GROUP BY TumblingWindow(minute, 3), TollBoothId

--Using compatibility level 1.0 or 1.1
SELECT COUNT(*) AS Count, TollBoothId
FROM Input1 PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
```
In the previous section, we showed some embarrassingly parallel scenarios. In this section, we discuss scenarios that don't meet all the requirements to be embarrassingly parallel.
* Input: Event hub with 8 partitions
* Output: Event hub with 32 partitions
If the input partition count doesn't match the output partition count, the topology isn't embarrassingly parallel, irrespective of the query. However, we can still get some level of parallelization.
### Query using non-partitioned output
* Input: Event hub with 8 partitions
Query:
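A sketch of such a multi-step query with different partition keys per step (step and column names are illustrative, not from the original article):

```SQL
-- Sketch: the first step partitions by PartitionId; the second partitions by
-- TollBoothId, so events must be repartitioned (shuffled) between the steps
WITH Step1 AS (
    SELECT COUNT(*) AS Count, TollBoothId
    FROM Input1 PARTITION BY PartitionId
    GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
)
SELECT SUM(Count) AS Count, TollBoothId
FROM Step1 PARTITION BY TollBoothId
GROUP BY TumblingWindow(minute, 3), TollBoothId
```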
As you can see, the second step uses **TollBoothId** as the partitioning key. This step is not the same as the first step, and it therefore requires us to do a shuffle.
### Compatibility level 1.2 - Multi-step query with different PARTITION BY values
* Input: Event hub with 8 partitions
* Output: Event hub with 8 partitions ("Partition key column" must be set to use "TollBoothId")