Commit 94cdb95

edit pass: stream-analytics-documentdb-output
1 parent b18ba0f commit 94cdb95

1 file changed: +36, -30 lines changed


articles/stream-analytics/stream-analytics-documentdb-output.md

Lines changed: 36 additions & 30 deletions
@@ -1,5 +1,5 @@
 ---
-title: Azure Stream Analytics output to Cosmos DB
+title: Azure Stream Analytics output to Azure Cosmos DB
 description: This article describes how to use Azure Stream Analytics to save output to Azure Cosmos DB for JSON output, for data archiving and low-latency queries on unstructured JSON data.
 services: stream-analytics
 author: mamccrea
@@ -13,11 +13,11 @@ ms.custom: seodec18
 # Azure Stream Analytics output to Azure Cosmos DB
 Azure Stream Analytics can target [Azure Cosmos DB](https://azure.microsoft.com/services/documentdb/) for JSON output, enabling data archiving and low-latency queries on unstructured JSON data. This document covers some best practices for implementing this configuration.

-If you're unfamiliar with Cosmos DB, see the [Azure Cosmos DB learning path](https://azure.microsoft.com/documentation/learning-paths/documentdb/) to get started.
+If you're unfamiliar with Azure Cosmos DB, see the [Azure Cosmos DB learning path](https://azure.microsoft.com/documentation/learning-paths/documentdb/) to get started.

 > [!Note]
 > At this time, Stream Analytics supports connection to Azure Cosmos DB only through the *SQL API*.
-> Other Azure Cosmos DB APIs are not yet supported. If you point Stream Analytics to the Azure Cosmos DB accounts created with other APIs, the data might not be properly stored.
+> Other Azure Cosmos DB APIs are not yet supported. If you point Stream Analytics to Azure Cosmos DB accounts created with other APIs, the data might not be properly stored.

 ## Basics of Azure Cosmos DB as an output target
 The Azure Cosmos DB output in Stream Analytics enables writing your stream processing results as JSON output into your Azure Cosmos DB containers.
@@ -32,38 +32,44 @@ The following sections detail some of the container options for Azure Cosmos DB.
 ## Tuning consistency, availability, and latency
 To match your application requirements, Azure Cosmos DB allows you to fine-tune the database and containers and make trade-offs between consistency, availability, latency, and throughput.

-Depending on what levels of read consistency your scenario needs against read and write latency, you can choose a consistency level on your database account. You can improve throughput by scaling up Request Units (RUs) on the container. Also by default, Azure Cosmos DB enables synchronous indexing on each CRUD operation to your container. This is another useful option to control the write/read performance in Azure Cosmos DB. For more information, review the [Change your database and query consistency levels](../cosmos-db/consistency-levels.md) article.
+Depending on what levels of read consistency your scenario needs against read and write latency, you can choose a consistency level on your database account. You can improve throughput by scaling up Request Units (RUs) on the container.
+
+Also by default, Azure Cosmos DB enables synchronous indexing on each CRUD operation to your container. This is another useful option to control write/read performance in Azure Cosmos DB.
+
+For more information, review the [Change your database and query consistency levels](../cosmos-db/consistency-levels.md) article.

 ## Upserts from Stream Analytics
 Stream Analytics integration with Azure Cosmos DB allows you to insert or update records in your container based on a given **Document ID** column. This is also called an *upsert*.

-Stream Analytics uses an optimistic upsert approach. Updates happen only when an insert fails with a document ID conflict. With compatibility level 1.0, this update is performed as a PATCH, so it enables partial updates to the document. Addition of new properties or replacing an existing property is performed incrementally. However, changes in the values of array properties in your JSON document result in the entire array getting overwritten. That is, the array isn't merged.
+Stream Analytics uses an optimistic upsert approach. Updates happen only when an insert fails with a document ID conflict.
+
+With compatibility level 1.0, Stream Analytics performs this update as a PATCH operation, so it enables partial updates to the document. Stream Analytics adds new properties or replaces an existing property incrementally. However, changes in the values of array properties in your JSON document result in overwriting the entire array. That is, the array isn't merged.

-With 1.2, upsert behavior is modified to insert or replace the document. This is described further in the later section about compatibility level 1.2.
+With 1.2, upsert behavior is modified to insert or replace the document. The later section about compatibility level 1.2 further describes this behavior.

 If the incoming JSON document has an existing ID field, that field is automatically used as the **Document ID** column in Azure Cosmos DB. Any subsequent writes are handled as such, leading to one of these situations:

 - Unique IDs lead to insert.
 - Duplicate IDs and **Document ID** set to **ID** lead to upsert.
 - Duplicate IDs and **Document ID** not set lead to error, after the first document.

-If you want to save *all* documents, including the ones with a duplicate ID, rename the ID field in your query (by using the **AS** keyword). Let Azure Cosmos DB create the ID field or replace the ID with another column's value (by using the **AS** keyword or by using the **Document ID** setting).
+If you want to save *all* documents, including the ones that have a duplicate ID, rename the ID field in your query (by using the **AS** keyword). Let Azure Cosmos DB create the ID field or replace the ID with another column's value (by using the **AS** keyword or by using the **Document ID** setting).

 ## Data partitioning in Azure Cosmos DB
 Azure Cosmos DB automatically scales partitions based on your workload. So we recommend [unlimited](../cosmos-db/partition-data.md) containers as the approach for partitioning your data. When Stream Analytics writes to unlimited containers, it uses as many parallel writers as the previous query step or input partitioning scheme.

 > [!NOTE]
-> At this time, Azure Stream Analytics supports only unlimited containers with partition keys at the top level. For example, `/region` is supported. Nested partition keys (for example, `/region/name`) are not supported.
+> At this time, Azure Stream Analytics supports only unlimited containers with partition keys at the top level. For example, **/region** is supported. Nested partition keys (for example, **/region/name**) are not supported.

 Depending on your choice of partition key, you might receive this _warning_:

-`CosmosDB Output contains multiple rows and just one row per partition key. If the output latency is higher than expected, consider choosing a partition key that contains at least several hundred records per partition key.`
+"CosmosDB Output contains multiple rows and just one row per partition key. If the output latency is higher than expected, consider choosing a partition key that contains at least several hundred records per partition key."

-It's important to choose a partition key property that has a number of distinct values, and lets you distribute your workload evenly across these values. As a natural artifact of partitioning, requests that involve the same partition key are limited by the maximum throughput of a single partition.
+It's important to choose a partition key property that has a number of distinct values, and that lets you distribute your workload evenly across these values. As a natural artifact of partitioning, requests that involve the same partition key are limited by the maximum throughput of a single partition.

-The storage size for documents that belong to the same partition key is limited to 10 GB. An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure your solution is scalable.
+The storage size for documents that belong to the same partition key is limited to 10 GB. An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure that your solution is scalable.

-A partition key is also the boundary for transactions in stored procedures and triggers for Azure Cosmos DB. You should choose the partition key so that documents that occur together in transactions share the same partition key value. The article [Partitioning in Cosmos DB](../cosmos-db/partitioning-overview.md) gives more details on choosing a partition key.
+A partition key is also the boundary for transactions in stored procedures and triggers for Azure Cosmos DB. You should choose the partition key so that documents that occur together in transactions share the same partition key value. The article [Partitioning in Azure Cosmos DB](../cosmos-db/partitioning-overview.md) gives more details on choosing a partition key.

 For fixed Azure Cosmos DB containers, Stream Analytics allows no way to scale up or out after they're full. They have an upper limit of 10 GB and 10,000 RU/s of throughput. To migrate the data from a fixed container to an unlimited container (for example, one with at least 1,000 RU/s and a partition key), use the [data migration tool](../cosmos-db/import-data.md) or the [change feed library](../cosmos-db/change-feed.md).

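To make the upsert rules in the hunk above concrete, here's a minimal sketch of a query that renames the incoming ID field with **AS** so that every event is saved as a new document. The input and output aliases (`TelemetryInput`, `CosmosOutput`) and the `temperature` field are hypothetical, not part of the article; `EventEnqueuedUtcTime` is the system property Stream Analytics exposes on Event Hubs inputs.

```SQL
-- Minimal sketch (hypothetical aliases and fields): renaming the incoming "id"
-- column lets Azure Cosmos DB generate a fresh document ID for each event, so
-- records that share an ID are inserted separately instead of upserted.
SELECT
    id AS originalId,        -- preserve the source ID under a new name
    temperature,
    EventEnqueuedUtcTime
INTO CosmosOutput
FROM TelemetryInput
```
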
@@ -72,43 +78,43 @@ The ability to write to multiple fixed containers is being deprecated. We don't
 ## Improved throughput with compatibility level 1.2
 With compatibility level 1.2, Stream Analytics supports native integration to bulk write into Azure Cosmos DB. This integration enables writing effectively to Azure Cosmos DB while maximizing throughput and efficiently handling throttling requests.

-The improved writing mechanism is available under a new compatibility level because of a difference in upsert behavior. With levels before 1.2, the upsert behavior was to insert or merge the document. With 1.2, upsert behavior is modified to insert or replace the document.
+The improved writing mechanism is available under a new compatibility level because of a difference in upsert behavior. With levels before 1.2, the upsert behavior is to insert or merge the document. With 1.2, upsert behavior is modified to insert or replace the document.

-Before 1.2, Stream Analytics used a custom stored procedure to bulk upsert documents per partition key into Azure Cosmos DB. There, a batch was written as a transaction. Even when a single record his a transient error (throttling), the whole batch had to be retried. This made scenarios with even reasonable throttling relatively slow.
+With levels before 1.2, Stream Analytics uses a custom stored procedure to bulk upsert documents per partition key into Azure Cosmos DB. There, a batch is written as a transaction. Even when a single record hits a transient error (throttling), the whole batch has to be retried. This makes scenarios with even reasonable throttling relatively slow.

-The following example shows two identical Stream Analytics jobs reading from same Azure Event Hubs input. Both Stream Analytics jobs are [fully partitioned](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-parallelization#embarrassingly-parallel-jobs) with a passthrough query and write to identical Azure Cosmos DB containers. Metrics on the left are from the job configured with compatibility level 1.0. Metrics on the right are configured with 1.2. An Azure Cosmos DB container's partition key is a unique GUID that comes from the input event.
+The following example shows two identical Stream Analytics jobs reading from the same Azure Event Hubs input. Both Stream Analytics jobs are [fully partitioned](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-parallelization#embarrassingly-parallel-jobs) with a passthrough query and write to identical Azure Cosmos DB containers. Metrics on the left are from the job configured with compatibility level 1.0. Metrics on the right are configured with 1.2. An Azure Cosmos DB container's partition key is a unique GUID that comes from the input event.

 ![Comparison of Stream Analytics metrics](media/stream-analytics-documentdb-output/stream-analytics-documentdb-output-3.png)

-The incoming event rate in Event Hubs is two times higher than Azure Cosmos DB containers (20,000 RUs) are configured to take in, so throttling is expected in Azure Cosmos DB. However, the job with 1.2 is consistently writing at a higher throughput (output events per minute) and with a lower average SU% utilization. In your environment, this difference will depend on few more factors such as choice of event format, input event/message size, partition keys, and query.
+The incoming event rate in Event Hubs is two times higher than Azure Cosmos DB containers (20,000 RUs) are configured to take in, so throttling is expected in Azure Cosmos DB. However, the job with 1.2 is consistently writing at a higher throughput (output events per minute) and with a lower average SU% utilization. In your environment, this difference will depend on a few more factors. These factors include choice of event format, input event/message size, partition keys, and query.

 ![Azure Cosmos DB metrics comparison](media/stream-analytics-documentdb-output/stream-analytics-documentdb-output-2.png)

-With 1.2, Stream Analytics is more intelligent in utilizing 100 percent of the available throughput in Azure Cosmos DB with very few resubmissions from throttling/rate limiting. This provides a better experience for other workloads like queries running on the container at the same time. In case you need to try out how Strea Analytics scales out with Azure Cosmos DB as a sink for 1,000 to 10,000 messages per second, here is an [Azure samples project](https://github.com/Azure-Samples/streaming-at-scale/tree/master/eventhubs-streamanalytics-cosmosdb) that lets you do that.
+With 1.2, Stream Analytics is more intelligent in utilizing 100 percent of the available throughput in Azure Cosmos DB with very few resubmissions from throttling or rate limiting. This provides a better experience for other workloads like queries running on the container at the same time. If you want to see how Stream Analytics scales out with Azure Cosmos DB as a sink for 1,000 to 10,000 messages per second, try [this Azure sample project](https://github.com/Azure-Samples/streaming-at-scale/tree/master/eventhubs-streamanalytics-cosmosdb).

-Please note that Cosmos DB output throughput is identical with 1.0 and 1.1. Since 1.2 is currently not the default, you can [set compatibility level](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-compatibility-level) for a Stream Analytics job by using the portal or by using the [create job REST API call](https://docs.microsoft.com/rest/api/streamanalytics/stream-analytics-job). We *strongly recommend* that you use compatibility level 1.2 in Stream Analytics with Azure Cosmos DB.
+Throughput of Azure Cosmos DB output is identical with 1.0 and 1.1. Because 1.2 is currently not the default, you can [set the compatibility level](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-compatibility-level) for a Stream Analytics job by using the portal or by using the [Create Stream Analytics Job REST API call](https://docs.microsoft.com/rest/api/streamanalytics/stream-analytics-job). We *strongly recommend* that you use compatibility level 1.2 in Stream Analytics with Azure Cosmos DB.

+## Azure Cosmos DB settings for JSON output

+Using Azure Cosmos DB as an output in Stream Analytics generates the following prompt for information.

-## Cosmos DB settings for JSON output
-
-Creating Cosmos DB as an output in Stream Analytics generates a prompt for information as seen below. This section provides an explanation of the properties definition.
-
-![documentdb stream analytics output screen](media/stream-analytics-documentdb-output/stream-analytics-documentdb-output-1.png)
+![Information fields for an Azure Cosmos DB output stream](media/stream-analytics-documentdb-output/stream-analytics-documentdb-output-1.png)

 |Field | Description|
 |------------- | -------------|
-|Output alias | An alias to refer this output in your ASA query.|
-|Subscription | Choose the Azure subscription.|
+|Output alias | An alias to refer to this output in your Stream Analytics query.|
+|Subscription | The Azure subscription.|
 |Account ID | The name or endpoint URI of the Azure Cosmos DB account.|
 |Account key | The shared access key for the Azure Cosmos DB account.|
 |Database | The Azure Cosmos DB database name.|
-|Container name | The container name to be used. `MyContainer` is a sample valid input - one container named `MyContainer` must exist. |
-|Document ID | Optional. The column name in output events used as the unique key on which insert or update operations must be based. If left empty, all events will be inserted, with no update option.|
+|Collection name pattern | The container name. `MyContainer` is a sample valid input. One container named `MyContainer` must exist. |
+|Document ID | Optional. The column name in output events used as the unique key on which insert or update operations must be based. If you leave it empty, all events will be inserted, with no update option.|
+
+After you configure the Azure Cosmos DB output, you can use it in the query as the target of an [INTO statement](https://docs.microsoft.com/stream-analytics-query/into-azure-stream-analytics). When you're using an Azure Cosmos DB output that way, [a partition key needs to be set explicitly](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-parallelization#partitions-in-sources-and-sinks).

-After the Cosmos DB output is configured, it can be used in the query as the target of an [INTO statement](https://docs.microsoft.com/stream-analytics-query/into-azure-stream-analytics). When using an Azure Cosmos DB output as such, [a partition key needs to be set explicitly](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-parallelization#partitions-in-sources-and-sinks). The output record must contain a case-sensitive column named after the partition key in Cosmos DB. To achieve greater parallelization, the statement may require a [PARTITION BY clause](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-parallelization#embarrassingly-parallel-jobs) using the same column.
+The output record must contain a case-sensitive column named after the partition key in Azure Cosmos DB. To achieve greater parallelization, the statement might require a [PARTITION BY clause](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-parallelization#embarrassingly-parallel-jobs) that uses the same column.

-**Sample query**:
+Here's a sample query:

 ```SQL
 SELECT TollBoothId, PartitionId
@@ -118,7 +124,7 @@ After the Cosmos DB output is configured, it can be used in the query as the tar

 ## Error handling and retries

-In the event of a transient failure, service unavailability or throttling while sending events to Azure Cosmos DB, Stream Analytics retries indefinitely to successfully complete the operation. However, retries aren't attempted for thes following failures:
+If a transient failure, service unavailability, or throttling happens while Stream Analytics is sending events to Azure Cosmos DB, Stream Analytics retries indefinitely to finish the operation successfully. But it doesn't attempt retries for the following failures:

 - Unauthorized (HTTP error code 401)
 - NotFound (HTTP error code 404)
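
As a companion to the article's sample query, here's a minimal sketch of the fully partitioned passthrough job shape that the compatibility-level comparison above describes; the aliases (`EventHubInput`, `CosmosOutput`) are hypothetical, and the sketch assumes the container's partition key matches the `PartitionId` column that Stream Analytics exposes on partitioned inputs.

```SQL
-- Minimal sketch (hypothetical names): a fully partitioned passthrough query.
-- The output record carries the column the Azure Cosmos DB container uses as
-- its partition key, and PARTITION BY uses that same column.
SELECT *
INTO CosmosOutput
FROM EventHubInput PARTITION BY PartitionId
```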
