
Commit 993bccf

Merge pull request #113827 from sidramadoss/patch-69
Update stream-analytics-parallelization.md
2 parents 4928d2a + 8ac4985

File tree: 2 files changed (+29, −19 lines)


articles/stream-analytics/stream-analytics-parallelization.md

Lines changed: 28 additions & 18 deletions
@@ -6,24 +6,22 @@ ms.author: jeanb
 ms.reviewer: mamccrea
 ms.service: stream-analytics
 ms.topic: conceptual
-ms.date: 05/07/2018
+ms.date: 05/04/2020
 ---
 # Leverage query parallelization in Azure Stream Analytics
 This article shows you how to take advantage of parallelization in Azure Stream Analytics. You learn how to scale Stream Analytics jobs by configuring input partitions and tuning the analytics query definition.
 As a prerequisite, you may want to be familiar with the notion of Streaming Unit described in [Understand and adjust Streaming Units](stream-analytics-streaming-unit-consumption.md).
 
 ## What are the parts of a Stream Analytics job?
-A Stream Analytics job definition includes inputs, a query, and output. Inputs are where the job reads the data stream from. The query is used to transform the data input stream, and the output is where the job sends the job results to.
+A Stream Analytics job definition includes at least one streaming input, a query, and an output. Inputs are where the job reads the data stream from, the query transforms the input stream, and the output is where the job sends the results.
 
-A job requires at least one input source for data streaming. The data stream input source can be stored in an Azure event hub or in Azure blob storage. For more information, see [Introduction to Azure Stream Analytics](stream-analytics-introduction.md) and [Get started using Azure Stream Analytics](stream-analytics-real-time-fraud-detection.md).
-
-## Partitions in sources and sinks
-Scaling a Stream Analytics job takes advantage of partitions in the input or output. Partitioning lets you divide data into subsets based on a partition key. A process that consumes the data (such as a Streaming Analytics job) can consume and write different partitions in parallel, which increases throughput.
+## Partitions in inputs and outputs
+Partitioning lets you divide data into subsets based on a [partition key](https://docs.microsoft.com/azure/event-hubs/event-hubs-scalability#partitions). If your input (for example, Event Hubs) is partitioned by a key, it is highly recommended to specify this partition key when adding the input to your Stream Analytics job. Scaling a Stream Analytics job takes advantage of partitions in the input and output: the job can consume and write different partitions in parallel, which increases throughput.
 
 ### Inputs
 All Azure Stream Analytics inputs can take advantage of partitioning:
-- EventHub (need to set the partition key explicitly with PARTITION BY keyword)
-- IoT Hub (need to set the partition key explicitly with PARTITION BY keyword)
+- Event Hub (set the partition key explicitly with the PARTITION BY keyword if using compatibility level 1.1 or below)
+- IoT Hub (set the partition key explicitly with the PARTITION BY keyword if using compatibility level 1.1 or below)
 - Blob storage
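To illustrate the PARTITION BY note in the list above: a minimal sketch of an explicitly partitioned input read, assuming compatibility level 1.0 or 1.1 and an illustrative input alias `Input1` (fuller examples appear later in this article):

```SQL
-- Minimal sketch (illustrative names): declare the input partition key
-- explicitly so each input partition can be processed independently
-- under compatibility level 1.0 or 1.1.
SELECT *
FROM Input1 PARTITION BY PartitionId
```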
 
 ### Outputs
@@ -48,13 +46,13 @@ For more information about partitions, see the following articles:
 
 
 ## Embarrassingly parallel jobs
-An *embarrassingly parallel* job is the most scalable scenario we have in Azure Stream Analytics. It connects one partition of the input to one instance of the query to one partition of the output. This parallelism has the following requirements:
+An *embarrassingly parallel* job is the most scalable scenario in Azure Stream Analytics. It connects one partition of the input to one instance of the query to one partition of the output. This parallelism has the following requirements:
 
-1. If your query logic depends on the same key being processed by the same query instance, you must make sure that the events go to the same partition of your input. For Event Hubs or IoT Hub, this means that the event data must have the **PartitionKey** value set. Alternatively, you can use partitioned senders. For blob storage, this means that the events are sent to the same partition folder. If your query logic does not require the same key to be processed by the same query instance, you can ignore this requirement. An example of this logic would be a simple select-project-filter query.
+1. If your query logic depends on the same key being processed by the same query instance, you must make sure that the events go to the same partition of your input. For Event Hubs or IoT Hub, this means that the event data must have the **PartitionKey** value set. Alternatively, you can use partitioned senders. For blob storage, this means that the events are sent to the same partition folder. An example would be a query that aggregates data per userID, where the input event hub is partitioned using userID as the partition key. However, if your query logic does not require the same key to be processed by the same query instance, you can ignore this requirement. An example of such logic would be a simple select-project-filter query.
 
-2. Once the data is laid out on the input side, you must make sure that your query is partitioned. This requires you to use **PARTITION BY** in all the steps. Multiple steps are allowed, but they all must be partitioned by the same key. Under compatibility level 1.0 and 1.1, the partitioning key must be set to **PartitionId** in order for the job to be fully parallel. For jobs with compatibility level 1.2 and higher, custom column can be specified as Partition Key in the input settings and the job will be paralellized automatically even without PARTITION BY clause. For event hub output the property "Partition key column" must be set to use "PartitionId".
+2. The next step is to make sure your query is partitioned. For jobs with compatibility level 1.2 or higher (recommended), a custom column can be specified as the partition key in the input settings, and the job will be parallelized automatically. Jobs with compatibility level 1.0 or 1.1 require you to use **PARTITION BY PartitionId** in all the steps of the query. Multiple steps are allowed, but they all must be partitioned by the same key (a sketch of such a multi-step query follows this list).
 
-3. Most of our output can take advantage of partitioning, however if you use an output type that doesn't support partitioning your job won't be fully parallel. For Event Hub outputs, ensure **Partition key column** is set same as the query partition key. Refer to the [output section](#outputs) for more details.
+3. Most of the outputs supported in Stream Analytics can take advantage of partitioning. If you use an output type that doesn't support partitioning, your job won't be *embarrassingly parallel*. For Event Hub outputs, ensure **Partition key column** is set to the same partition key used in the query. Refer to the [output section](#outputs) for more details.
 
 4. The number of input partitions must equal the number of output partitions. Blob storage output can support partitions and inherits the partitioning scheme of the upstream query. When a partition key for Blob storage is specified, data is partitioned per input partition, thus the result is still fully parallel. Here are examples of partition values that allow a fully parallel job:
 
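To make requirement 2 concrete, here is a minimal sketch of a multi-step query in which every step is partitioned by the same key, assuming compatibility level 1.0 or 1.1 and the illustrative `Input1`/`TollBoothId` names used elsewhere in this article (under level 1.2 and above, the PARTITION BY clauses can be omitted):

```SQL
-- Both steps are partitioned by the same key (PartitionId),
-- so the topology stays embarrassingly parallel end to end.
WITH Step1 AS (
    SELECT COUNT(*) AS Count, TollBoothId, PartitionId
    FROM Input1 PARTITION BY PartitionId
    GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
)
SELECT Count, TollBoothId
FROM Step1 PARTITION BY PartitionId
```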
@@ -74,8 +72,14 @@ The following sections discuss some example scenarios that are embarrassingly pa
 Query:
 
 ```SQL
+--Using compatibility level 1.2 or above
 SELECT TollBoothId
-FROM Input1 Partition By PartitionId
+FROM Input1
+WHERE TollBoothId > 100
+
+--Using compatibility level 1.0 or 1.1
+SELECT TollBoothId
+FROM Input1 PARTITION BY PartitionId
 WHERE TollBoothId > 100
 ```
 
@@ -89,6 +93,12 @@ This query is a simple filter. Therefore, we don't need to worry about partition
 Query:
 
 ```SQL
+--Using compatibility level 1.2 or above
+SELECT COUNT(*) AS Count, TollBoothId
+FROM Input1
+GROUP BY TumblingWindow(minute, 3), TollBoothId
+
+--Using compatibility level 1.0 or 1.1
 SELECT COUNT(*) AS Count, TollBoothId
 FROM Input1 Partition By PartitionId
 GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
@@ -104,7 +114,7 @@ In the previous section, we showed some embarrassingly parallel scenarios. In th
 * Input: Event hub with 8 partitions
 * Output: Event hub with 32 partitions
 
-In this case, it doesn't matter what the query is. If the input partition count doesn't match the output partition count, the topology isn't embarrassingly parallel.
+If the input partition count doesn't match the output partition count, the topology isn't embarrassingly parallel irrespective of the query. However, we can still get some level of parallelization.
 
 ### Query using non-partitioned output
 * Input: Event hub with 8 partitions
@@ -115,6 +125,7 @@ Power BI output doesn't currently support partitioning. Therefore, this scenario
 ### Multi-step query with different PARTITION BY values
 * Input: Event hub with 8 partitions
 * Output: Event hub with 8 partitions
+* Compatibility level: 1.0 or 1.1
 
 Query:
 
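The query body itself is unchanged context that this diff elides. A sketch of its shape, assuming the article's illustrative `Input1` and `TollBoothId` names: the first step partitions by **PartitionId**, while the second step re-partitions by **TollBoothId**:

```SQL
-- Step 1 is partitioned by PartitionId; step 2 re-partitions by
-- TollBoothId, forcing a shuffle between the two steps.
WITH Step1 AS (
    SELECT COUNT(*) AS Count, TollBoothId
    FROM Input1 PARTITION BY PartitionId
    GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId
)
SELECT SUM(Count) AS Count, TollBoothId
FROM Step1 PARTITION BY TollBoothId
GROUP BY TumblingWindow(minute, 3), TollBoothId
```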
@@ -132,11 +143,10 @@ Query:
 
 As you can see, the second step uses **TollBoothId** as the partitioning key. This step is not the same as the first step, and it therefore requires us to do a shuffle.
 
-The preceding examples show some Stream Analytics jobs that conform to (or don't) an embarrassingly parallel topology. If they do conform, they have the potential for maximum scale. For jobs that don't fit one of these profiles, scaling guidance will be available in future updates. For now, use the general guidance in the following sections.
-
-### Compatibility level 1.2 - Multi-step query with different PARTITION BY values
+### Multi-step query with different PARTITION BY values
 * Input: Event hub with 8 partitions
 * Output: Event hub with 8 partitions ("Partition key column" must be set to use "TollBoothId")
+* Compatibility level: 1.2 or above
 
 Query:
 
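Here again the query body is unchanged context elided from the diff. A sketch of the level-1.2 form, same illustrative names, with no PARTITION BY clauses (its final lines match the context shown in the next hunk):

```SQL
-- Under compatibility level 1.2 or above, no PARTITION BY clause is
-- needed; the partition key set on the input drives parallelization.
WITH Step1 AS (
    SELECT COUNT(*) AS Count, TollBoothId
    FROM Input1
    GROUP BY TumblingWindow(minute, 3), TollBoothId
)
SELECT SUM(Count) AS Count, TollBoothId
FROM Step1
GROUP BY TumblingWindow(minute, 3), TollBoothId
```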
@@ -152,7 +162,7 @@ Query:
 GROUP BY TumblingWindow(minute, 3), TollBoothId
 ```
 
-Compatibility level 1.2 enables parallel query execution by default. For example, query from the previous section will be partitioned as long as "TollBoothId" column is set as input Partition Key. PARTITION BY PartitionId clause is not required.
+Compatibility level 1.2 or above enables parallel query execution by default. For example, the query from the previous section is partitioned as long as the "TollBoothId" column is set as the input partition key. The PARTITION BY PartitionId clause is not required.
 
 ## Calculate the maximum streaming units of a job
 The total number of streaming units that can be used by a Stream Analytics job depends on the number of steps in the query defined for the job and the number of partitions for each step.
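As a rough worked example of this calculation: a query with a single, fully partitioned step reading an event hub with 16 partitions could, assuming the general limit of six streaming units per partition for a fully parallel step, scale up to 16 × 6 = 96 streaming units, whereas the same query left non-partitioned would be capped at 6 SUs.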

articles/stream-analytics/stream-analytics-sql-output-perf.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ Here are some configurations within each service that can help improve overall t
 
 ## Azure Stream Analytics
 
-- **Inherit Partitioning** – This SQL output configuration option enables inheriting the partitioning scheme of your previous query step or input. With this enabled, writing to a disk-based table and having a [fully parallel](stream-analytics-parallelization.md#embarrassingly-parallel-jobs) topology for your job, expect to see better throughputs. This partitioning already automatically happens for many other [outputs](stream-analytics-parallelization.md#partitions-in-sources-and-sinks). Table locking (TABLOCK) is also disabled for bulk inserts made with this option.
+- **Inherit Partitioning** – This SQL output configuration option enables inheriting the partitioning scheme of your previous query step or input. With this enabled, and with a [fully parallel](stream-analytics-parallelization.md#embarrassingly-parallel-jobs) topology writing to a disk-based table, expect to see better throughput. This partitioning already happens automatically for many other [outputs](stream-analytics-parallelization.md#partitions-in-inputs-and-outputs). Table locking (TABLOCK) is also disabled for bulk inserts made with this option.
 
 > [!NOTE]
 > When there are more than 8 input partitions, inheriting the input partitioning scheme might not be an appropriate choice. This upper limit was observed on a table with a single identity column and a clustered index. In this case, consider using [INTO](https://docs.microsoft.com/stream-analytics-query/into-azure-stream-analytics#into-shard-count) 8 in your query, to explicitly specify the number of output writers. Based on your schema and choice of indexes, your observations may vary.
