Commit 3a4a3e7

Update stream-analytics-parallelization.md

1 parent 8b549ea commit 3a4a3e7

1 file changed: articles/stream-analytics/stream-analytics-parallelization.md

Lines changed: 23 additions & 15 deletions
@@ -6,24 +6,22 @@ ms.author: jeanb
 ms.reviewer: mamccrea
 ms.service: stream-analytics
 ms.topic: conceptual
-ms.date: 05/07/2018
+ms.date: 05/04/2020
 ---

 # Leverage query parallelization in Azure Stream Analytics

 This article shows you how to take advantage of parallelization in Azure Stream Analytics. You learn how to scale Stream Analytics jobs by configuring input partitions and tuning the analytics query definition.

 As a prerequisite, you may want to be familiar with the notion of Streaming Unit described in [Understand and adjust Streaming Units](stream-analytics-streaming-unit-consumption.md).

 ## What are the parts of a Stream Analytics job?
-A Stream Analytics job definition includes inputs, a query, and output. Inputs are where the job reads the data stream from. The query is used to transform the data input stream, and the output is where the job sends the job results to.
-
-A job requires at least one input source for data streaming. The data stream input source can be stored in an Azure event hub or in Azure blob storage. For more information, see [Introduction to Azure Stream Analytics](stream-analytics-introduction.md) and [Get started using Azure Stream Analytics](stream-analytics-real-time-fraud-detection.md).
+A Stream Analytics job definition includes at least one streaming input, a query, and output. Inputs are where the job reads the data stream from, the query transforms the input stream, and the output is where the job sends the results.

 ## Partitions in sources and sinks
-Scaling a Stream Analytics job takes advantage of partitions in the input or output. Partitioning lets you divide data into subsets based on a partition key. A process that consumes the data (such as a Streaming Analytics job) can consume and write different partitions in parallel, which increases throughput.
+Partitioning lets you divide data into subsets based on a [partition key](https://docs.microsoft.com/azure/event-hubs/event-hubs-scalability#partitions). If your input (for example, Event Hubs) is partitioned by a key, we highly recommend specifying this partition key when adding the input to your Stream Analytics job. Scaling a Stream Analytics job takes advantage of partitions in the input and output: a job can consume and write different partitions in parallel, which increases throughput.
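As an illustrative sketch of why partitioned consumption scales (a hypothetical query, assuming an input named Input1 whose partitions are exposed through PartitionId):

```SQL
--Hypothetical illustration: each input partition is handled by its own
--query instance, so this count runs in parallel across all partitions.
SELECT COUNT(*) AS Count, PartitionId
FROM Input1 PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 1), PartitionId
```

Because no data needs to move between partitions, adding partitions (and streaming units) increases throughput roughly linearly for a query shaped like this.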

 ### Inputs
 All Azure Stream Analytics inputs can take advantage of partitioning:
-- EventHub (need to set the partition key explicitly with PARTITION BY keyword)
-- IoT Hub (need to set the partition key explicitly with PARTITION BY keyword)
+- Event Hubs (the partition key must be set explicitly with the PARTITION BY keyword when using compatibility level 1.1 or below)
+- IoT Hub (the partition key must be set explicitly with the PARTITION BY keyword when using compatibility level 1.1 or below)
 - Blob storage

 ### Outputs
@@ -48,13 +46,13 @@ For more information about partitions, see the following articles:

 ## Embarrassingly parallel jobs
-An *embarrassingly parallel* job is the most scalable scenario we have in Azure Stream Analytics. It connects one partition of the input to one instance of the query to one partition of the output. This parallelism has the following requirements:
+An *embarrassingly parallel* job is the most scalable scenario in Azure Stream Analytics. It connects one partition of the input to one instance of the query to one partition of the output. This parallelism has the following requirements:

-1. If your query logic depends on the same key being processed by the same query instance, you must make sure that the events go to the same partition of your input. For Event Hubs or IoT Hub, this means that the event data must have the **PartitionKey** value set. Alternatively, you can use partitioned senders. For blob storage, this means that the events are sent to the same partition folder. If your query logic does not require the same key to be processed by the same query instance, you can ignore this requirement. An example of this logic would be a simple select-project-filter query.
+1. If your query logic depends on the same key being processed by the same query instance, you must make sure that the events go to the same partition of your input. For Event Hubs or IoT Hub, this means that the event data must have the **PartitionKey** value set. Alternatively, you can use partitioned senders. For blob storage, this means that the events are sent to the same partition folder. For example, a query instance that aggregates data per **userID** requires the input event hub to be partitioned with **userID** as the partition key. However, if your query logic does not require the same key to be processed by the same query instance, you can ignore this requirement. An example of such logic would be a simple select-project-filter query.

-2. Once the data is laid out on the input side, you must make sure that your query is partitioned. This requires you to use **PARTITION BY** in all the steps. Multiple steps are allowed, but they all must be partitioned by the same key. Under compatibility level 1.0 and 1.1, the partitioning key must be set to **PartitionId** in order for the job to be fully parallel. For jobs with compatibility level 1.2 and higher, custom column can be specified as Partition Key in the input settings and the job will be paralellized automatically even without PARTITION BY clause. For event hub output the property "Partition key column" must be set to use "PartitionId".
+2. The next step is to make sure that your query is partitioned. For jobs with compatibility level 1.2 or higher (recommended), a custom column can be specified as the partition key in the input settings, and the job is parallelized automatically. Jobs with compatibility level 1.0 or 1.1 require you to use **PARTITION BY PartitionId** in all the steps of your query. Multiple steps are allowed, but they all must be partitioned by the same key.

-3. Most of our output can take advantage of partitioning, however if you use an output type that doesn't support partitioning your job won't be fully parallel. For Event Hub outputs, ensure **Partition key column** is set same as the query partition key. Refer to the [output section](#outputs) for more details.
+3. Most of the outputs supported in Stream Analytics can take advantage of partitioning. If you use an output type that doesn't support partitioning, your job won't be *embarrassingly parallel*. For Event Hubs outputs, ensure **Partition key column** is set to the same partition key used in the query. Refer to the [output section](#outputs) for more details.

 4. The number of input partitions must equal the number of output partitions. Blob storage output can support partitions and inherits the partitioning scheme of the upstream query. When a partition key for Blob storage is specified, data is partitioned per input partition, so the result is still fully parallel. Here are examples of partition values that allow a fully parallel job:
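A minimal sketch of requirement 2 under compatibility level 1.0 or 1.1, with two query steps both keyed by **PartitionId** (Input1 and TollBoothId are illustrative names following the article's examples):

```SQL
--Hypothetical sketch: both steps are partitioned by the same key
--(PartitionId), so every partition flows through the job independently.
WITH Step1 AS (
    SELECT COUNT(*) AS Count, TollBoothId
    FROM Input1 PARTITION BY PartitionId
    GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
)
SELECT SUM(Count) AS Count, TollBoothId
FROM Step1 PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
```

Because both steps use the same partition key, no data needs to be shuffled between partitions and the job can remain embarrassingly parallel.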
6058

@@ -74,8 +72,14 @@ The following sections discuss some example scenarios that are embarrassingly parallel.
 Query:

 ```SQL
+--Using compatibility level 1.2 or above
 SELECT TollBoothId
-FROM Input1 Partition By PartitionId
+FROM Input1
+WHERE TollBoothId > 100
+
+--Using compatibility level 1.0 or 1.1
+SELECT TollBoothId
+FROM Input1 PARTITION BY PartitionId
 WHERE TollBoothId > 100
 ```

@@ -89,6 +93,12 @@ This query is a simple filter. Therefore, we don't need to worry about partitioning the input.
 Query:

 ```SQL
+--Using compatibility level 1.2 or above
+SELECT COUNT(*) AS Count, TollBoothId
+FROM Input1
+GROUP BY TumblingWindow(minute, 3), TollBoothId
+
+--Using compatibility level 1.0 or 1.1
 SELECT COUNT(*) AS Count, TollBoothId
 FROM Input1 Partition By PartitionId
 GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
@@ -104,7 +114,7 @@ In the previous section, we showed some embarrassingly parallel scenarios. In this section, we discuss scenarios that don't conform to that topology.
 * Input: Event hub with 8 partitions
 * Output: Event hub with 32 partitions

-In this case, it doesn't matter what the query is. If the input partition count doesn't match the output partition count, the topology isn't embarrassingly parallel.
+If the input partition count doesn't match the output partition count, the topology isn't embarrassingly parallel, irrespective of the query. However, we can still get some level of parallelization.

 ### Query using non-partitioned output
 * Input: Event hub with 8 partitions
@@ -132,8 +142,6 @@ Query:

 As you can see, the second step uses **TollBoothId** as the partitioning key. This step is not the same as the first step, and it therefore requires a shuffle.

-The preceding examples show some Stream Analytics jobs that conform to (or don't) an embarrassingly parallel topology. If they do conform, they have the potential for maximum scale. For jobs that don't fit one of these profiles, scaling guidance will be available in future updates. For now, use the general guidance in the following sections.
-
 ### Compatibility level 1.2 - Multi-step query with different PARTITION BY values
 * Input: Event hub with 8 partitions
 * Output: Event hub with 8 partitions ("Partition key column" must be set to use "TollBoothId")
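The shuffle scenario referenced above corresponds to a multi-step query whose steps use different partition keys. A sketch consistent with the surrounding text (Input1 and the column names are illustrative, following the article's tollbooth examples):

```SQL
--Hypothetical sketch: Step1 is partitioned by PartitionId, but the
--second step re-partitions by TollBoothId, which forces a shuffle of
--data between partitions and breaks the embarrassingly parallel topology.
WITH Step1 AS (
    SELECT COUNT(*) AS Count, TollBoothId
    FROM Input1 PARTITION BY PartitionId
    GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
)
SELECT SUM(Count) AS Count, TollBoothId
FROM Step1 PARTITION BY TollBoothId
GROUP BY TumblingWindow(minute, 3), TollBoothId
```

Under compatibility level 1.2, the same effect can occur implicitly: if the input partition key set in the input settings differs from the key the query groups on, the job can no longer map each input partition straight to an output partition.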
