<!-- articles/stream-analytics/stream-analytics-documentdb-output.md -->
---
title: Azure Stream Analytics output to Azure Cosmos DB
description: This article describes how to use Azure Stream Analytics to save output to Azure Cosmos DB for JSON output, for data archiving and low-latency queries on unstructured JSON data.
services: stream-analytics
author: mamccrea
ms.custom: seodec18
---

# Azure Stream Analytics output to Azure Cosmos DB
Azure Stream Analytics can target [Azure Cosmos DB](https://azure.microsoft.com/services/documentdb/) for JSON output, enabling data archiving and low-latency queries on unstructured JSON data. This document covers some best practices for implementing this configuration.

If you're unfamiliar with Azure Cosmos DB, see the [Azure Cosmos DB learning path](https://azure.microsoft.com/documentation/learning-paths/documentdb/) to get started.
> [!Note]
> At this time, Stream Analytics supports connection to Azure Cosmos DB only through the *SQL API*.
> Other Azure Cosmos DB APIs are not yet supported. If you point Stream Analytics to Azure Cosmos DB accounts created with other APIs, the data might not be properly stored.
## Basics of Azure Cosmos DB as an output target
The Azure Cosmos DB output in Stream Analytics enables writing your stream processing results as JSON output into your Azure Cosmos DB containers.

The following sections detail some of the container options for Azure Cosmos DB.
## Tuning consistency, availability, and latency
To match your application requirements, Azure Cosmos DB allows you to fine-tune the database and containers and make trade-offs between consistency, availability, latency, and throughput.

Depending on what level of read consistency your scenario needs to balance against read and write latency, you can choose a consistency level on your database account. You can improve throughput by scaling up Request Units (RUs) on the container.

Also, by default, Azure Cosmos DB enables synchronous indexing on each CRUD operation to your container. This is another useful option to control write/read performance in Azure Cosmos DB.

For more information, review the [Change your database and query consistency levels](../cosmos-db/consistency-levels.md) article.
## Upserts from Stream Analytics
Stream Analytics integration with Azure Cosmos DB allows you to insert or update records in your container based on a given **Document ID** column. This is also called an *upsert*.

Stream Analytics uses an optimistic upsert approach. Updates happen only when an insert fails with a document ID conflict.

With compatibility level 1.0, Stream Analytics performs this update as a PATCH operation, so it enables partial updates to the document. Stream Analytics adds new properties or replaces an existing property incrementally. However, changes in the values of array properties in your JSON document result in overwriting the entire array. That is, the array isn't merged.

With 1.2, upsert behavior is modified to insert or replace the document. The later section about compatibility level 1.2 further describes this behavior.
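
The contrast between the two upsert modes can be sketched with plain dictionaries. This is a hypothetical simulation of the documented behavior, not SDK code:

```python
def upsert_level_1_0(existing: dict, incoming: dict) -> dict:
    """Simulate the 1.0 PATCH-style upsert: top-level properties are added
    or replaced individually, but an array value is overwritten wholesale."""
    merged = dict(existing)
    merged.update(incoming)
    return merged


def upsert_level_1_2(existing: dict, incoming: dict) -> dict:
    """Simulate the 1.2 upsert: the incoming document replaces the stored one."""
    return dict(incoming)


stored = {"id": "booth-1", "state": "WA", "lanes": [1, 2, 3]}
update = {"id": "booth-1", "lanes": [4]}

# 1.0: 'state' survives the merge, but 'lanes' becomes [4] (arrays aren't merged)
print(upsert_level_1_0(stored, update))
# 1.2: 'state' is gone, because the whole document was replaced
print(upsert_level_1_2(stored, update))
```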

If the incoming JSON document has an existing ID field, that field is automatically used as the **Document ID** column in Azure Cosmos DB. Any subsequent writes are handled as such, leading to one of these situations:
- Unique IDs lead to insert.
- Duplicate IDs and **Document ID** set to **ID** lead to upsert.
- Duplicate IDs and **Document ID** not set lead to error, after the first document.

If you want to save *all* documents, including the ones that have a duplicate ID, rename the ID field in your query (by using the **AS** keyword). Let Azure Cosmos DB create the ID field or replace the ID with another column's value (by using the **AS** keyword or by using the **Document ID** setting).
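
For example, a query like the following (the input and column names are hypothetical) renames the incoming `id` field so that Azure Cosmos DB generates a new document ID for each event:

```SQL
SELECT
    id AS originalId,
    TollBoothId,
    EntryTime
INTO CosmosDBOutput
FROM EntryStream
```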
## Data partitioning in Azure Cosmos DB
Azure Cosmos DB automatically scales partitions based on your workload. So we recommend [unlimited](../cosmos-db/partition-data.md) containers as the approach for partitioning your data. When Stream Analytics writes to unlimited containers, it uses as many parallel writers as the previous query step or input partitioning scheme.
> [!NOTE]
> At this time, Azure Stream Analytics supports only unlimited containers with partition keys at the top level. For example, **/region** is supported. Nested partition keys (for example, **/region/name**) are not supported.

Depending on your choice of partition key, you might receive this _warning_:

"CosmosDB Output contains multiple rows and just one row per partition key. If the output latency is higher than expected, consider choosing a partition key that contains at least several hundred records per partition key."

It's important to choose a partition key property that has a large number of distinct values, and that lets you distribute your workload evenly across these values. As a natural artifact of partitioning, requests that involve the same partition key are limited by the maximum throughput of a single partition.
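
Before you settle on a partition key, it can help to check its cardinality and skew against a sample of your events. Here's a minimal sketch (the event shape and field names are hypothetical):

```python
from collections import Counter


def partition_key_stats(events, key):
    """Count events per candidate partition key value to spot low
    cardinality or heavy skew before committing to the key."""
    counts = Counter(event[key] for event in events)
    hottest_value, hottest_count = counts.most_common(1)[0]
    return {
        "distinct_values": len(counts),
        "hottest_value": hottest_value,
        "hottest_share": hottest_count / len(events),
    }


sample = [{"region": "west"}, {"region": "west"},
          {"region": "east"}, {"region": "north"}]
print(partition_key_stats(sample, "region"))
# {'distinct_values': 3, 'hottest_value': 'west', 'hottest_share': 0.5}
```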

The storage size for documents that belong to the same partition key is limited to 10 GB. An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure that your solution is scalable.

A partition key is also the boundary for transactions in stored procedures and triggers for Azure Cosmos DB. You should choose the partition key so that documents that occur together in transactions share the same partition key value. The article [Partitioning in Azure Cosmos DB](../cosmos-db/partitioning-overview.md) gives more details on choosing a partition key.

For fixed Azure Cosmos DB containers, Stream Analytics allows no way to scale up or out after they're full. They have an upper limit of 10 GB and 10,000 RU/s of throughput. To migrate the data from a fixed container to an unlimited container (for example, one with at least 1,000 RU/s and a partition key), use the [data migration tool](../cosmos-db/import-data.md) or the [change feed library](../cosmos-db/change-feed.md).

The ability to write to multiple fixed containers is being deprecated.
## Improved throughput with compatibility level 1.2
With compatibility level 1.2, Stream Analytics supports native integration to bulk write into Azure Cosmos DB. This integration enables writing effectively to Azure Cosmos DB while maximizing throughput and efficiently handling throttling requests.

The improved writing mechanism is available under a new compatibility level because of a difference in upsert behavior. With levels before 1.2, the upsert behavior is to insert or merge the document. With 1.2, upsert behavior is modified to insert or replace the document.

With levels before 1.2, Stream Analytics uses a custom stored procedure to bulk upsert documents per partition key into Azure Cosmos DB. There, a batch is written as a transaction. Even when a single record hits a transient error (throttling), the whole batch has to be retried. This makes scenarios with even reasonable throttling relatively slow.

The following example shows two identical Stream Analytics jobs reading from the same Azure Event Hubs input. Both Stream Analytics jobs are [fully partitioned](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-parallelization#embarrassingly-parallel-jobs) with a passthrough query and write to identical Azure Cosmos DB containers. Metrics on the left are from the job configured with compatibility level 1.0. Metrics on the right are configured with 1.2. An Azure Cosmos DB container's partition key is a unique GUID that comes from the input event.


The incoming event rate in Event Hubs is two times higher than Azure Cosmos DB containers (20,000 RUs) are configured to take in, so throttling is expected in Azure Cosmos DB. However, the job with 1.2 is consistently writing at a higher throughput (output events per minute) and with a lower average SU% utilization. In your environment, this difference will depend on a few more factors. These factors include choice of event format, input event/message size, partition keys, and query.


With 1.2, Stream Analytics is more intelligent in utilizing 100 percent of the available throughput in Azure Cosmos DB with very few resubmissions from throttling or rate limiting. This provides a better experience for other workloads like queries running on the container at the same time. If you want to see how Stream Analytics scales out with Azure Cosmos DB as a sink for 1,000 to 10,000 messages per second, try [this Azure sample project](https://github.com/Azure-Samples/streaming-at-scale/tree/master/eventhubs-streamanalytics-cosmosdb).

Throughput of Azure Cosmos DB output is identical with 1.0 and 1.1. Because 1.2 is currently not the default, you can [set the compatibility level](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-compatibility-level) for a Stream Analytics job by using the portal or by using the [Create Stream Analytics Job REST API call](https://docs.microsoft.com/rest/api/streamanalytics/stream-analytics-job). We *strongly recommend* that you use compatibility level 1.2 in Stream Analytics with Azure Cosmos DB.
## Azure Cosmos DB settings for JSON output
Using Azure Cosmos DB as an output in Stream Analytics generates the following prompt for information.

|Field | Description|
|------------- | -------------|
|Output alias | An alias to refer to this output in your Stream Analytics query.|
|Subscription |The Azure subscription.|
|Account ID | The name or endpoint URI of the Azure Cosmos DB account.|
|Account key | The shared access key for the Azure Cosmos DB account.|
|Database | The Azure Cosmos DB database name.|
|Collection name pattern | The container name. `MyContainer` is a sample valid input. One container named `MyContainer` must exist. |
|Document ID | Optional. The column name in output events used as the unique key on which insert or update operations must be based. If you leave it empty, all events will be inserted, with no update option.|

After you configure the Azure Cosmos DB output, you can use it in the query as the target of an [INTO statement](https://docs.microsoft.com/stream-analytics-query/into-azure-stream-analytics). When you're using an Azure Cosmos DB output that way, [a partition key needs to be set explicitly](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-parallelization#partitions-in-sources-and-sinks).

The output record must contain a case-sensitive column named after the partition key in Azure Cosmos DB. To achieve greater parallelization, the statement might require a [PARTITION BY clause](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-parallelization#embarrassingly-parallel-jobs) that uses the same column.

Here's a sample query:
```SQL
SELECT TollBoothId, PartitionId
INTO CosmosDBOutput
FROM Input1 PARTITION BY PartitionId
```

## Error handling and retries

If a transient failure, service unavailability, or throttling happens while Stream Analytics is sending events to Azure Cosmos DB, Stream Analytics retries indefinitely to finish the operation successfully. But it doesn't attempt retries for the following failures: