Commit e5f9f3e

Merge pull request #231838 from seesharprun/cosmos-analytical-store-cdc
Cosmos DB | CDC for analytical store
2 parents 03670de + 00a9f6c commit e5f9f3e

28 files changed: +286 −0

---
title: Change data capture in analytical store
titleSuffix: Azure Cosmos DB
description: Change data capture (CDC) in Azure Cosmos DB analytical store allows you to efficiently consume a continuous and incremental feed of changed data.
author: Rodrigossz
ms.author: rosouz
ms.reviewer: sidandrews
ms.service: cosmos-db
ms.topic: conceptual
ms.date: 03/23/2023
---

# Change data capture in Azure Cosmos DB analytical store

[!INCLUDE[NoSQL, MongoDB](includes/appliesto-nosql-mongodb.md)]

Change data capture (CDC) in [Azure Cosmos DB analytical store](analytical-store-introduction.md) allows you to efficiently consume a continuous and incremental feed of changed (inserted, updated, and deleted) data from analytical store. The change data capture feature of the analytical store is seamlessly integrated with Azure Synapse and Azure Data Factory, providing you with a scalable no-code experience for high data volume. As the change data capture feature is based on analytical store, it [doesn't consume provisioned RUs, doesn't affect your transactional workloads](analytical-store-introduction.md#decoupled-performance-for-analytical-workloads), provides lower latency, and has lower TCO.

:::image type="content" source="media/analytical-store-change-data-capture/overview-diagram.png" alt-text="Diagram of the analytical store in Azure Cosmos DB and how it, with change data capture, can write to various first and third-party target services.":::

In addition to providing an incremental data feed from analytical store to diverse targets, change data capture supports the following capabilities:

- Applying filters, projections, and transformations on the change feed via source query
- Capturing deletes and intermediate updates
- Filtering the change feed for a specific type of operation (**Insert** | **Update** | **Delete** | **TTL**)
- Each change in the container appears exactly once in the change data capture feed, and the checkpoints are managed internally for you
- Changes can be synchronized from the beginning, from a given timestamp, or from now
- Changes aren't limited to a fixed data retention period
- Multiple change feeds on the same container can be consumed simultaneously

## Features

Change data capture in Azure Cosmos DB analytical store supports the following key features.

### Capturing deletes and intermediate updates

The change data capture feature for the analytical store captures deleted records and intermediate updates. The captured deletes and updates can be applied on sinks that support delete and update operations. Because the `_rid` value uniquely identifies each record, specifying `_rid` as the key column on the sink side causes the update and delete operations to be reflected on the sink.

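If you use a source query (covered later in this article), make sure `_rid` stays in the projection so it remains available as the key column on the sink. Here's a minimal sketch, with hypothetical `ProductId`, `Product`, and `Status` fields:

```sql
SELECT _rid, ProductId, Product, Status
FROM c
```
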
### Filter the change feed for a specific type of operation

You can filter the change data capture feed for a specific type of operation. For example, you can selectively capture only the insert and update operations, ignoring the user-delete and TTL-delete operations.

### Applying filters, projections, and transformations on the change feed via source query

You can optionally use a source query to specify filters, projections, and transformations, which are all pushed down to the columnar analytical store. Here's a sample source query that captures only incremental records with the filter `Category = 'Urban'`. The query projects four fields and applies a simple transformation:

```sql
SELECT ProductId, Product, Segment, concat(Manufacturer, '-', Category) AS ManufacturerCategory
FROM c
WHERE Category = 'Urban'
```

> [!NOTE]
> If you would like to enable source-query based change data capture on Azure Data Factory data flows during the preview, email [[email protected]](mailto:[email protected]) and share your **subscription ID** and **region**. This step isn't necessary to enable source-query based change data capture on an Azure Synapse data flow.

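As another sketch, with hypothetical `Price` and `Quantity` fields, computed projections and filters like the following are also pushed down to the columnar analytical store rather than evaluated in the data flow:

```sql
SELECT ProductId, Price, Quantity, Price * Quantity AS OrderValue
FROM c
WHERE Price > 100
```
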
### Throughput isolation, lower latency, and lower TCO

Operations on the Azure Cosmos DB analytical store don't consume provisioned RUs, so they don't affect your transactional workloads. Change data capture with analytical store also has lower latency and lower TCO. The lower latency is attributed to the analytical store enabling better parallelism for data processing, which also reduces the overall TCO and helps you drive cost efficiencies.

## Scenarios
61+
62+
Here are common scenarios where you could use change data capture and the analytical store.
63+
64+
### Consuming incremental data from Azure Cosmos DB

You can use analytical store change data capture if you're currently using or planning to use:

- Incremental data capture using Azure Data Factory data flows or Copy activity.
- One-time batch processing using Azure Data Factory.
- Streaming Azure Cosmos DB data.
  - The analytical store has up to two minutes of latency to sync transactional store data. You can schedule data flows in Azure Data Factory to run as often as every minute.
  - If you need to stream without that latency, we recommend using the change feed feature of the transactional store.
- Capturing deletes and incremental changes, and applying filters on Azure Cosmos DB data.
  - If you're using Azure Functions triggers or any other option to consume the change feed and would like to capture deletes, incremental changes, or apply transformations, we recommend change data capture in analytical store.

### Incremental feed to analytical platform of your choice
77+
78+
change data capture capability enables end-to-end analytical story providing you with the flexibility to use Azure Cosmos DB data on analytical platform of your choice seamlessly. It also enables you to bring Cosmos DB data into a centralized data lake and join with data from diverse data sources. For more information, see [supported sink types](../data-factory/data-flow-sink.md#supported-sinks). You can flatten the data, apply more transformations either in Azure Synapse Analytics or Azure Data Factory.
79+
80+
## Change data capture on Azure Cosmos DB for MongoDB containers
81+
82+
The linked service interface for the API for MongoDB isn't available within Azure Data Factory data flows yet. You can use your API for MongoDB's account endpoint with the **Azure Cosmos DB for NoSQL** linked service interface as a work around until the Mongo linked service is directly supported.
83+
84+
In the interface for a new NoSQL linked service, select **Enter Manually** to provide the Azure Cosmos DB account information. Here, use the account's NoSQL document endpoint (ex: `https://<account-name>.documents.azure.com:443/`) instead of the Mongo DB endpoint (ex: `mongodb://<account-name>.mongo.cosmos.azure.com:10255/`)
85+
86+
## Next steps

> [!div class="nextstepaction"]
> [Get started with change data capture in the analytical store](get-started-change-data-capture.md)

---
title: Get started with change data capture in analytical store
titleSuffix: Azure Cosmos DB
description: Enable change data capture in Azure Cosmos DB analytical store for an existing account to consume a continuous and incremental feed of changed data.
author: Rodrigossz
ms.author: rosouz
ms.reviewer: sidandrews
ms.service: cosmos-db
ms.topic: how-to
ms.date: 03/23/2023
---

# Get started with change data capture in the analytical store for Azure Cosmos DB

[!INCLUDE[NoSQL, MongoDB](includes/appliesto-nosql-mongodb.md)]

Use change data capture (CDC) in Azure Cosmos DB analytical store as a source to [Azure Data Factory](../data-factory/index.yml) or [Azure Synapse Analytics](../synapse-analytics/index.yml) to capture specific changes to your data.

## Prerequisites

- An existing Azure Cosmos DB account.
  - If you have an Azure subscription, [create a new account](nosql/how-to-create-account.md?tabs=azure-portal).
  - If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin.
  - Alternatively, you can [try Azure Cosmos DB free](try-free.md) before you commit.

## Enable analytical store

First, enable Azure Synapse Link at the account level, and then enable analytical store for the containers that are appropriate for your workload.

1. Enable Azure Synapse Link: [Enable Azure Synapse Link for an Azure Cosmos DB account](configure-synapse-link.md#enable-synapse-link)

1. Enable analytical store for your containers:

    | Option | Guide |
    | --- | --- |
    | **Enable for a specific new container** | [Enable Azure Synapse Link for your new containers](configure-synapse-link.md#new-container) |
    | **Enable for a specific existing container** | [Enable Azure Synapse Link for your existing containers](configure-synapse-link.md#existing-container) |

## Create a target Azure resource using data flows

The change data capture feature of the analytical store is available through the data flow feature of [Azure Data Factory](../data-factory/concepts-data-flow-overview.md) or [Azure Synapse Analytics](../synapse-analytics/concepts-data-flow-overview.md). For this guide, use Azure Data Factory.

> [!IMPORTANT]
> You can alternatively use Azure Synapse Analytics. First, [create an Azure Synapse workspace](../synapse-analytics/quickstart-create-workspace.md), if you don't already have one. Within the newly created workspace, select the **Develop** tab, select **Add new resource**, and then select **Data flow**.

1. [Create an Azure Data Factory](../data-factory/quickstart-create-data-factory.md), if you don't already have one.

    > [!TIP]
    > If possible, create the data factory in the same region where your Azure Cosmos DB account resides.

1. Launch the newly created data factory.

1. In the data factory, select the **Data flows** tab, and then select **New data flow**.

1. Give the newly created data flow a unique name. In this example, the data flow is named `cosmoscdc`.

    :::image type="content" source="media/get-started-change-data-capture/data-flow-name.png" lightbox="media/get-started-change-data-capture/data-flow-name.png" alt-text="Screenshot of a new data flow with the name cosmoscdc.":::

## Configure source settings for the analytical store container

Now create and configure a source to flow data from the Azure Cosmos DB account's analytical store.

1. Select **Add Source**.

    :::image type="content" source="media/get-started-change-data-capture/add-source.png" alt-text="Screenshot of the add source menu option.":::

1. In the **Output stream name** field, enter **cosmos**.

    :::image type="content" source="media/get-started-change-data-capture/source-name.png" alt-text="Screenshot of naming the newly created source cosmos.":::

1. In the **Source type** section, select **Inline**.

    :::image type="content" source="media/get-started-change-data-capture/inline-source-type.png" alt-text="Screenshot of selecting the inline source type.":::

1. In the **Dataset** field, select **Azure - Azure Cosmos DB for NoSQL**.

    :::image type="content" source="media/get-started-change-data-capture/dataset-type-cosmos.png" alt-text="Screenshot of selecting Azure Cosmos DB for NoSQL as the dataset type.":::

1. Create a new linked service for your account named **cosmoslinkedservice**. Select your existing Azure Cosmos DB for NoSQL account in the **New linked service** popup dialog, and then select **Ok**. In this example, we select a pre-existing Azure Cosmos DB for NoSQL account named `msdocs-cosmos-source` and a database named `cosmicworks`.

    :::image type="content" source="media/get-started-change-data-capture/new-linked-service.png" alt-text="Screenshot of the New linked service dialog with an Azure Cosmos DB account selected.":::

1. Select **Analytical** for the store type.

    :::image type="content" source="media/get-started-change-data-capture/linked-service-analytical.png" alt-text="Screenshot of the analytical option selected for a linked service.":::

1. Select the **Source options** tab.

1. Within **Source options**, select your target container and enable **Data flow debug**. In this example, the container is named `products`.

    :::image type="content" source="media/get-started-change-data-capture/container-name.png" alt-text="Screenshot of a source container selected named products.":::

1. Select **Data flow debug**. In the **Turn on data flow debug** popup dialog, retain the default options, and then select **Ok**.

    :::image type="content" source="media/get-started-change-data-capture/enable-data-flow-debug.png" alt-text="Screenshot of the toggle option to enable data flow debug.":::

1. The **Source options** tab also contains other options you may wish to enable. This table describes those options:

    | Option | Description |
    | --- | --- |
    | Capture intermediate updates | Enable this option if you would like to capture the history of changes to items, including the intermediate changes between change data capture reads. |
    | Capture deletes | Enable this option to capture user-deleted records and apply them on the sink. Deletes can't be applied on Azure Data Explorer and Azure Cosmos DB sinks. |
    | Capture transactional store TTLs | Enable this option to capture records deleted by the Azure Cosmos DB transactional store TTL (time-to-live) and apply them on the sink. TTL deletes can't be applied on Azure Data Explorer and Azure Cosmos DB sinks. |
    | Batch size in bytes | Specify the size in bytes if you would like to batch the change data capture feeds. |
    | Extra configs | Extra Azure Cosmos DB analytical store configs and their values. (For example, `spark.cosmos.allowWhiteSpaceInFieldNames -> true`.) |

## Create and configure sink settings for update and delete operations

First, create a straightforward [Azure Blob Storage](../storage/blobs/index.yml) sink and then configure the sink to filter data to only specific operations.

1. Create an [Azure Blob Storage](../storage/common/storage-account-create.md) account and container, if you don't already have one. For the next examples, we use an account named `msdocsblobstorage` and a container named `output`.

    > [!TIP]
    > If possible, create the storage account in the same region where your Azure Cosmos DB account resides.

1. Back in Azure Data Factory, create a new sink for the change data captured from your `cosmos` source.

    :::image type="content" source="media/get-started-change-data-capture/add-sink.png" alt-text="Screenshot of adding a new sink that's connected to the existing source.":::

1. Give the sink a unique name. In this example, the sink is named `storage`.

    :::image type="content" source="media/get-started-change-data-capture/sink-name.png" alt-text="Screenshot of naming the newly created sink storage.":::

1. In the **Sink type** section, select **Inline**. In the **Dataset** field, select **Delta**.

    :::image type="content" source="media/get-started-change-data-capture/sink-dataset-type.png" alt-text="Screenshot of selecting an Inline Delta dataset type for the sink.":::

1. Create a new linked service for your account using **Azure Blob Storage** named **storagelinkedservice**. Select your existing Azure Blob Storage account in the **New linked service** popup dialog, and then select **Ok**. In this example, we select a pre-existing Azure Blob Storage account named `msdocsblobstorage`.

    :::image type="content" source="media/get-started-change-data-capture/new-linked-service-sink-type.png" alt-text="Screenshot of the service type options for a new Delta linked service.":::

    :::image type="content" source="media/get-started-change-data-capture/new-linked-service-sink-config.png" alt-text="Screenshot of the New linked service dialog with an Azure Blob Storage account selected.":::

1. Select the **Settings** tab.

1. Within **Settings**, set the **Folder path** to the name of the blob container. In this example, the container's name is `output`.

    :::image type="content" source="media/get-started-change-data-capture/sink-container-name.png" alt-text="Screenshot of the blob container named output set as the sink target.":::

1. Locate the **Update method** section and change the selections to allow only **delete** and **update** operations. Also, specify the **Key columns** as a **List of columns** using the field `_rid` as the unique identifier.

    :::image type="content" source="media/get-started-change-data-capture/sink-methods-columns.png" alt-text="Screenshot of update methods and key columns being specified for the sink.":::

1. Select **Validate** to ensure you haven't made any errors or omissions. Then, select **Publish** to publish the data flow.

    :::image type="content" source="media/get-started-change-data-capture/validate-publish-data-flow.png" alt-text="Screenshot of the option to validate and then publish the current data flow.":::

## Schedule change data capture execution

After a data flow has been published, you can add a new pipeline to move and transform your data.
1. Create a new pipeline. Give the pipeline a unique name. In this example, the pipeline is named `cosmoscdcpipeline`.

    :::image type="content" source="media/get-started-change-data-capture/new-pipeline.png" alt-text="Screenshot of the new pipeline option within the resources section.":::

1. In the **Activities** section, expand the **Move &amp; transform** option and then select **Data flow**.

    :::image type="content" source="media/get-started-change-data-capture/data-flow-activity.png" alt-text="Screenshot of the data flow activity option within the activities section.":::

1. Give the data flow activity a unique name. In this example, the activity is named `cosmoscdcactivity`.

1. In the **Settings** tab, select the data flow named `cosmoscdc` you created earlier in this guide. Then, select a compute size based on the data volume and required latency for your workload.

    :::image type="content" source="media/get-started-change-data-capture/data-flow-settings.png" alt-text="Screenshot of the configuration settings for both the data flow and compute size for the activity.":::

    > [!TIP]
    > For incremental data sizes greater than 100 GB, we recommend the **Custom** size with a core count of 32 (+16 driver cores).

1. Select **Add trigger**. Schedule this pipeline to execute at a cadence that makes sense for your workload. In this example, the pipeline is configured to execute every five minutes.

    :::image type="content" source="media/get-started-change-data-capture/add-trigger.png" alt-text="Screenshot of the add trigger button for a new pipeline.":::

    :::image type="content" source="media/get-started-change-data-capture/trigger-configuration.png" alt-text="Screenshot of a trigger configuration based on a schedule, starting in the year 2023, that runs every five minutes.":::

    > [!NOTE]
    > The minimum recurrence window for change data capture executions is one minute.

1. Select **Validate** to ensure you haven't made any errors or omissions. Then, select **Publish** to publish the pipeline.

1. Observe the data placed into the Azure Blob Storage container as an output of the data flow using Azure Cosmos DB analytical store change data capture.

    :::image type="content" source="media/get-started-change-data-capture/output-files.png" alt-text="Screenshot of the output files from the pipeline in the Azure Blob Storage container.":::

    > [!NOTE]
    > The initial cluster startup time may take up to three minutes. To avoid cluster startup time in subsequent change data capture executions, configure the data flow cluster **Time to live** value. For more information about the integration runtime and TTL, see [Integration runtime in Azure Data Factory](../data-factory/concepts-integration-runtime.md).

## Next steps
- Review the [overview of Azure Cosmos DB analytical store](analytical-store-introduction.md)