articles/stream-analytics/stream-analytics-parallelization.md (28 additions, 18 deletions)
@@ -6,24 +6,22 @@ ms.author: jeanb
ms.reviewer: mamccrea
ms.service: stream-analytics
ms.topic: conceptual
-ms.date: 05/07/2018
+ms.date: 05/04/2020
---

# Leverage query parallelization in Azure Stream Analytics
This article shows you how to take advantage of parallelization in Azure Stream Analytics. You learn how to scale Stream Analytics jobs by configuring input partitions and tuning the analytics query definition.

As a prerequisite, you may want to be familiar with the notion of Streaming Units described in [Understand and adjust Streaming Units](stream-analytics-streaming-unit-consumption.md).

## What are the parts of a Stream Analytics job?
-A Stream Analytics job definition includes inputs, a query, and output. Inputs are where the job reads the data stream from. The query is used to transform the data input stream, and the output is where the job sends the job results to.
+A Stream Analytics job definition includes at least one streaming input, a query, and an output. Inputs are where the job reads the data stream from. The query is used to transform the data input stream, and the output is where the job sends the job results to.

-A job requires at least one input source for data streaming. The data stream input source can be stored in an Azure event hub or in Azure blob storage. For more information, see [Introduction to Azure Stream Analytics](stream-analytics-introduction.md) and [Get started using Azure Stream Analytics](stream-analytics-real-time-fraud-detection.md).
-
-## Partitions in sources and sinks
-Scaling a Stream Analytics job takes advantage of partitions in the input or output. Partitioning lets you divide data into subsets based on a partition key. A process that consumes the data (such as a Streaming Analytics job) can consume and write different partitions in parallel, which increases throughput.
+## Partitions in inputs and outputs
+Partitioning lets you divide data into subsets based on a [partition key](https://docs.microsoft.com/azure/event-hubs/event-hubs-scalability#partitions). If your input (for example, Event Hubs) is partitioned by a key, it is highly recommended that you specify this partition key when adding the input to your Stream Analytics job. Scaling a Stream Analytics job takes advantage of partitions in the input and output. A Stream Analytics job can consume and write different partitions in parallel, which increases throughput.

### Inputs
All Azure Stream Analytics inputs can take advantage of partitioning (a sketch follows this list):
-- EventHub (need to set the partition key explicitly with PARTITION BY keyword)
-- IoT Hub (need to set the partition key explicitly with PARTITION BY keyword)
+- Event Hubs (the partition key must be set explicitly with the PARTITION BY keyword if using compatibility level 1.1 or below)
+- IoT Hub (the partition key must be set explicitly with the PARTITION BY keyword if using compatibility level 1.1 or below)
- Blob storage
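As a hedged illustration (not part of this diff; the input and output names are hypothetical), reading a partitioned Event Hubs input under compatibility level 1.1 or below looks like this:

```SQL
-- Minimal sketch, assuming a partitioned Event Hubs input named EventHubInput
-- and an output alias named Output; under compatibility level 1.1 or below,
-- the partition key must be stated explicitly with PARTITION BY.
SELECT *
INTO Output
FROM EventHubInput
PARTITION BY PartitionId
```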
### Outputs
@@ -48,13 +46,13 @@ For more information about partitions, see the following articles:
## Embarrassingly parallel jobs
-An *embarrassingly parallel* job is the most scalable scenario we have in Azure Stream Analytics. It connects one partition of the input to one instance of the query to one partition of the output. This parallelism has the following requirements:
+An *embarrassingly parallel* job is the most scalable scenario in Azure Stream Analytics. It connects one partition of the input to one instance of the query to one partition of the output. This parallelism has the following requirements:

-1. If your query logic depends on the same key being processed by the same query instance, you must make sure that the events go to the same partition of your input. For Event Hubs or IoT Hub, this means that the event data must have the **PartitionKey** value set. Alternatively, you can use partitioned senders. For blob storage, this means that the events are sent to the same partition folder. If your query logic does not require the same key to be processed by the same query instance, you can ignore this requirement. An example of this logic would be a simple select-project-filter query.
+1. If your query logic depends on the same key being processed by the same query instance, you must make sure that the events go to the same partition of your input. For Event Hubs or IoT Hub, this means that the event data must have the **PartitionKey** value set. Alternatively, you can use partitioned senders. For blob storage, this means that the events are sent to the same partition folder. An example would be a query instance that aggregates data per userID where the input event hub is partitioned using userID as the partition key. However, if your query logic does not require the same key to be processed by the same query instance, you can ignore this requirement. An example of this logic would be a simple select-project-filter query.

-2. Once the data is laid out on the input side, you must make sure that your query is partitioned. This requires you to use **PARTITION BY** in all the steps. Multiple steps are allowed, but they all must be partitioned by the same key. Under compatibility level 1.0 and 1.1, the partitioning key must be set to **PartitionId** in order for the job to be fully parallel. For jobs with compatibility level 1.2 and higher, a custom column can be specified as Partition Key in the input settings and the job will be parallelized automatically even without a PARTITION BY clause. For event hub output the property "Partition key column" must be set to use "PartitionId".
+2. The next step is to make sure that your query is partitioned. For jobs with compatibility level 1.2 or higher (recommended), a custom column can be specified as the partition key in the input settings, and the job will be parallelized automatically. Jobs with compatibility level 1.0 or 1.1 require you to use **PARTITION BY PartitionId** in all the steps of your query (a sketch follows this list). Multiple steps are allowed, but they all must be partitioned by the same key.

-3. Most of our output can take advantage of partitioning, however if you use an output type that doesn't support partitioning your job won't be fully parallel. For Event Hub outputs, ensure **Partition key column** is set same as the query partition key. Refer to the [output section](#outputs) for more details.
+3. Most of the outputs supported in Stream Analytics can take advantage of partitioning. If you use an output type that doesn't support partitioning, your job won't be *embarrassingly parallel*. For Event Hubs outputs, ensure that **Partition key column** is set to the same partition key used in the query. Refer to the [output section](#outputs) for more details.

4. The number of input partitions must equal the number of output partitions. Blob storage output can support partitions and inherits the partitioning scheme of the upstream query. When a partition key for Blob storage is specified, data is partitioned per input partition, so the result is still fully parallel. Here are examples of partition values that allow a fully parallel job:
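To make requirement 2 concrete, here is a minimal sketch (not part of this diff; Input1 and Output are illustrative names) of a multi-step query that stays fully parallel under compatibility level 1.0 or 1.1 because every step is partitioned by the same key:

```SQL
-- Sketch: both steps are partitioned by PartitionId, so each input
-- partition flows through its own query instance end to end.
WITH Step1 AS (
    SELECT COUNT(*) AS Count, TollBoothId, PartitionId
    FROM Input1 PARTITION BY PartitionId
    GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
)
SELECT TollBoothId, SUM(Count) AS Count, PartitionId
INTO Output
FROM Step1 PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
```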
@@ -74,8 +72,14 @@ The following sections discuss some example scenarios that are embarrassingly parallel.
Query:

```SQL
+--Using compatibility level 1.2 or above
SELECT TollBoothId
-FROM Input1 Partition By PartitionId
+FROM Input1
+WHERE TollBoothId > 100
+
+--Using compatibility level 1.0 or 1.1
+SELECT TollBoothId
+FROM Input1 PARTITION BY PartitionId
WHERE TollBoothId > 100
```
@@ -89,6 +93,12 @@ This query is a simple filter. Therefore, we don't need to worry about partitioning.
Query:

```SQL
+--Using compatibility level 1.2 or above
+SELECT COUNT(*) AS Count, TollBoothId
+FROM Input1
+GROUP BY TumblingWindow(minute, 3), TollBoothId
+
+--Using compatibility level 1.0 or 1.1
SELECT COUNT(*) AS Count, TollBoothId
FROM Input1 Partition By PartitionId
GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
@@ -104,7 +114,7 @@ In the previous section, we showed some embarrassingly parallel scenarios.
* Input: Event hub with 8 partitions
* Output: Event hub with 32 partitions

-In this case, it doesn't matter what the query is. If the input partition count doesn't match the output partition count, the topology isn't embarrassingly parallel.
+If the input partition count doesn't match the output partition count, the topology isn't embarrassingly parallel irrespective of the query. However, we can still get some level of parallelization.
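One documented way to line the partition counts back up is to repartition the stream explicitly. A hedged sketch (not part of this diff; it assumes compatibility level 1.2 or above, and Input8, Output32, and DeviceId are illustrative names):

```SQL
-- Sketch: repartition the 8-partition input into 32 partitions keyed on
-- DeviceId so writes can fan out across all 32 output partitions.
SELECT *
INTO Output32
FROM Input8
PARTITION BY DeviceId
INTO 32
```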
### Query using non-partitioned output
* Input: Event hub with 8 partitions
@@ -115,6 +125,7 @@ Power BI output doesn't currently support partitioning. Therefore, this scenario is not embarrassingly parallel.
### Multi-step query with different PARTITION BY values
* Input: Event hub with 8 partitions
* Output: Event hub with 8 partitions
+* Compatibility level: 1.0 or 1.1

Query:
@@ -132,11 +143,10 @@ Query:
As you can see, the second step uses **TollBoothId** as the partitioning key. This step is not the same as the first step, and it therefore requires us to do a shuffle.
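For reference, a hedged sketch of the shape of query this section describes (the actual query lines are collapsed in this diff view; names are illustrative):

```SQL
-- Sketch: Step1 is partitioned by PartitionId, but the final step is
-- partitioned by TollBoothId, so data must be shuffled between the steps.
WITH Step1 AS (
    SELECT COUNT(*) AS Count, TollBoothId
    FROM Input1 PARTITION BY PartitionId
    GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
)
SELECT SUM(Count) AS Count, TollBoothId
INTO Output
FROM Step1 PARTITION BY TollBoothId
GROUP BY TumblingWindow(minute, 3), TollBoothId
```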
-The preceding examples show some Stream Analytics jobs that conform to (or don't) an embarrassingly parallel topology. If they do conform, they have the potential for maximum scale. For jobs that don't fit one of these profiles, scaling guidance will be available in future updates. For now, use the general guidance in the following sections.
-
-### Compatibility level 1.2 - Multi-step query with different PARTITION BY values
+### Multi-step query with different PARTITION BY values
* Input: Event hub with 8 partitions
* Output: Event hub with 8 partitions ("Partition key column" must be set to use "TollBoothId")
+* Compatibility level: 1.2 or above

Query:
@@ -152,7 +162,7 @@ Query:
GROUP BY TumblingWindow(minute, 3), TollBoothId
```

-Compatibility level 1.2 enables parallel query execution by default. For example, query from the previous section will be partitioned as long as "TollBoothId" column is set as input Partition Key. PARTITION BY PartitionId clause is not required.
+Compatibility level 1.2 or above enables parallel query execution by default. For example, the query from the previous section will be partitioned as long as the "TollBoothId" column is set as the input partition key. The PARTITION BY PartitionId clause is not required.

## Calculate the maximum streaming units of a job
The total number of streaming units that can be used by a Stream Analytics job depends on the number of steps in the query defined for the job and the number of partitions for each step.
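A worked illustration (not part of this diff, consistent with the SU model in [Understand and adjust Streaming Units](stream-analytics-streaming-unit-consumption.md)): a non-partitioned query step can use at most 6 SUs, so a fully parallel one-step job over 16 partitions can scale up to 16 × 6 = 96 SUs, one 6-SU allocation per partition.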
articles/stream-analytics/stream-analytics-sql-output-perf.md (1 addition, 1 deletion)
@@ -18,7 +18,7 @@ Here are some configurations within each service that can help improve overall throughput:
## Azure Stream Analytics

--**Inherit Partitioning** – This SQL output configuration option enables inheriting the partitioning scheme of your previous query step or input. With this enabled, writing to a disk-based table and having a [fully parallel](stream-analytics-parallelization.md#embarrassingly-parallel-jobs) topology for your job, expect to see better throughputs. This partitioning already automatically happens for many other [outputs](stream-analytics-parallelization.md#partitions-in-sources-and-sinks). Table locking (TABLOCK) is also disabled for bulk inserts made with this option.
+-**Inherit Partitioning** – This SQL output configuration option enables inheriting the partitioning scheme of your previous query step or input. When this option is enabled and your job writes to a disk-based table with a [fully parallel](stream-analytics-parallelization.md#embarrassingly-parallel-jobs) topology, expect to see better throughput. This partitioning already happens automatically for many other [outputs](stream-analytics-parallelization.md#partitions-in-inputs-and-outputs). Table locking (TABLOCK) is also disabled for bulk inserts made with this option.
> [!NOTE]
> When there are more than 8 input partitions, inheriting the input partitioning scheme might not be an appropriate choice. This upper limit was observed on a table with a single identity column and a clustered index. In this case, consider using [INTO](https://docs.microsoft.com/stream-analytics-query/into-azure-stream-analytics#into-shard-count) 8 in your query, to explicitly specify the number of output writers. Based on your schema and choice of indexes, your observations may vary.
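A minimal sketch of the suggestion in the note above (not part of this diff; it assumes compatibility level 1.2 or above, and SqlOutput and InputStream are illustrative names):

```SQL
-- Sketch: cap the number of parallel SQL output writers at 8, even though
-- the input has more than 8 partitions.
SELECT *
INTO SqlOutput
FROM InputStream
PARTITION BY PartitionId
INTO 8
```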