---
title: Get started with change data capture in analytical store
titleSuffix: Azure Cosmos DB
description: Enable change data capture in Azure Cosmos DB analytical store for an existing account to consume a continuous and incremental feed of changed data.
author: Rodrigossz
ms.author: rosouz
ms.reviewer: sidandrews
ms.service: cosmos-db
ms.topic: how-to
ms.date: 03/23/2023
---

# Get started with change data capture in the analytical store for Azure Cosmos DB

[!INCLUDE[NoSQL, MongoDB](includes/appliesto-nosql-mongodb.md)]

Use change data capture (CDC) in Azure Cosmos DB analytical store as a source for [Azure Data Factory](../data-factory/index.yml) or [Azure Synapse Analytics](../synapse-analytics/index.yml) to capture specific changes to your data.

## Prerequisites

- An existing Azure Cosmos DB account.
  - If you have an Azure subscription, [create a new account](nosql/how-to-create-account.md?tabs=azure-portal).
  - If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin.
  - Alternatively, you can [try Azure Cosmos DB free](try-free.md) before you commit.

## Enable analytical store

First, enable Azure Synapse Link at the account level and then enable analytical store for the containers that are appropriate for your workload. If you prefer to script this step, a minimal SDK sketch follows the table below.

1. Enable Azure Synapse Link: [Enable Azure Synapse Link for an Azure Cosmos DB account](configure-synapse-link.md#enable-synapse-link)

1. Enable analytical store for your containers:

    | Option | Guide |
    | --- | --- |
    | **Enable for a specific new container** | [Enable Azure Synapse Link for your new containers](configure-synapse-link.md#new-container) |
    | **Enable for a specific existing container** | [Enable Azure Synapse Link for your existing containers](configure-synapse-link.md#existing-container) |

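If you prefer to enable analytical store programmatically, the following minimal sketch shows one way to do it with the `azure-cosmos` Python SDK. It isn't part of the portal steps above, and the account endpoint, key, database, container, and partition key values are placeholders you'd replace with your own.

```python
# Minimal sketch: create a container with analytical store enabled, assuming
# Azure Synapse Link is already enabled at the account level. The endpoint, key,
# database name, container name, and partition key path are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(
    url="https://<your-account>.documents.azure.com:443/",
    credential="<your-account-key>",
)
database = client.get_database_client("cosmicworks")

# analytical_storage_ttl=-1 retains data in the analytical store indefinitely.
container = database.create_container_if_not_exists(
    id="products",
    partition_key=PartitionKey(path="/categoryId"),
    analytical_storage_ttl=-1,
)
print(f"Analytical store is enabled for container '{container.id}'.")
```
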
## Create a target Azure resource using data flows

The change data capture feature of the analytical store is available through the data flow feature of [Azure Data Factory](../data-factory/concepts-data-flow-overview.md) or [Azure Synapse Analytics](../synapse-analytics/concepts-data-flow-overview.md). For this guide, use Azure Data Factory.

> [!IMPORTANT]
> You can alternatively use Azure Synapse Analytics. First, [create an Azure Synapse workspace](../synapse-analytics/quickstart-create-workspace.md), if you don't already have one. Within the newly created workspace, select the **Develop** tab, select **Add new resource**, and then select **Data flow**.

1. [Create an Azure Data Factory](../data-factory/quickstart-create-data-factory.md), if you don't already have one. If you prefer to script this step, a minimal SDK sketch follows these steps.

    > [!TIP]
    > If possible, create the data factory in the same region where your Azure Cosmos DB account resides.

1. Launch the newly created data factory.

1. In the data factory, select the **Data flows** tab, and then select **New data flow**.

1. Give the newly created data flow a unique name. In this example, the data flow is named `cosmoscdc`.

    :::image type="content" source="media/get-started-change-data-capture/data-flow-name.png" lightbox="media/get-started-change-data-capture/data-flow-name.png" alt-text="Screenshot of a new data flow with the name cosmoscdc.":::

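If you'd rather provision the data factory from code instead of the portal, here's a minimal, hedged sketch using the `azure-mgmt-datafactory` and `azure-identity` Python packages. The subscription ID, resource group, factory name, and region are placeholders; it's an illustration rather than a required part of this guide.

```python
# Minimal sketch: create (or update) a data factory with the Azure SDK for Python.
# Subscription ID, resource group, factory name, and region are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

adf_client = DataFactoryManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Per the tip above, prefer the same region as your Azure Cosmos DB account.
factory = adf_client.factories.create_or_update(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    factory=Factory(location="<cosmos-db-region>"),
)
print(f"Provisioned data factory: {factory.name}")
```

The data flow itself (`cosmoscdc`) is easiest to author in the Data Factory Studio UI as described in the following sections, because the visual editor generates the underlying data flow definition for you.
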
## Configure source settings for the analytical store container

Now create and configure a source to flow data from the Azure Cosmos DB account's analytical store.

1. Select **Add Source**.

    :::image type="content" source="media/get-started-change-data-capture/add-source.png" alt-text="Screenshot of the add source menu option.":::

1. In the **Output stream name** field, enter **cosmos**.

    :::image type="content" source="media/get-started-change-data-capture/source-name.png" alt-text="Screenshot of naming the newly created source cosmos.":::

1. In the **Source type** section, select **Inline**.

    :::image type="content" source="media/get-started-change-data-capture/inline-source-type.png" alt-text="Screenshot of selecting the inline source type.":::

1. In the **Dataset** field, select **Azure - Azure Cosmos DB for NoSQL**.

    :::image type="content" source="media/get-started-change-data-capture/dataset-type-cosmos.png" alt-text="Screenshot of selecting Azure Cosmos DB for NoSQL as the dataset type.":::

1. Create a new linked service for your account named **cosmoslinkedservice**. Select your existing Azure Cosmos DB for NoSQL account in the **New linked service** popup dialog and then select **OK**. In this example, we select a pre-existing Azure Cosmos DB for NoSQL account named `msdocs-cosmos-source` and a database named `cosmicworks`. If you prefer to define the linked service programmatically, a sketch follows these steps.

    :::image type="content" source="media/get-started-change-data-capture/new-linked-service.png" alt-text="Screenshot of the New linked service dialog with an Azure Cosmos DB account selected.":::

1. Select **Analytical** for the store type.

    :::image type="content" source="media/get-started-change-data-capture/linked-service-analytical.png" alt-text="Screenshot of the analytical option selected for a linked service.":::

1. Select the **Source options** tab.

1. Within **Source options**, select your target container and enable **Data flow debug**. In this example, the container is named `products`.

    :::image type="content" source="media/get-started-change-data-capture/container-name.png" alt-text="Screenshot of a source container selected named products.":::

1. Select **Data flow debug**. In the **Turn on data flow debug** popup dialog, retain the default options and then select **OK**.

    :::image type="content" source="media/get-started-change-data-capture/enable-data-flow-debug.png" alt-text="Screenshot of the toggle option to enable data flow debug.":::

1. The **Source options** tab also contains other options you might want to enable. This table describes those options:

| Option | Description |
| --- | --- |
| Capture intermediate updates | Enable this option if you would like to capture the history of changes to items, including the intermediate changes between change data capture reads. |
| Capture Deletes | Enable this option to capture user-deleted records and apply them on the sink. Deletes can't be applied on Azure Data Explorer and Azure Cosmos DB sinks. |
| Capture Transactional store TTLs | Enable this option to capture records deleted by the Azure Cosmos DB transactional store time-to-live (TTL) setting and apply them on the sink. TTL deletes can't be applied on Azure Data Explorer and Azure Cosmos DB sinks. |
| Batch size in bytes | Specify a size in bytes if you would like to batch the change data capture feeds. |
| Extra Configs | Extra Azure Cosmos DB analytical store configs and their values. (Example: `spark.cosmos.allowWhiteSpaceInFieldNames -> true`) |

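The linked service from the previous steps can also be defined from code. The following hedged sketch uses the `azure-mgmt-datafactory` package to register a Cosmos DB linked service named `cosmoslinkedservice`; the subscription ID, resource group, factory name, and connection string are placeholders. The analytical store selection and change data capture options described above belong to the data flow's source settings rather than to the linked service itself.

```python
# Minimal sketch: register an Azure Cosmos DB for NoSQL linked service in the
# data factory. Subscription, resource group, factory, and connection string
# values are placeholders for illustration only.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import CosmosDbLinkedService, LinkedServiceResource

adf_client = DataFactoryManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

cosmos_linked_service = LinkedServiceResource(
    properties=CosmosDbLinkedService(
        connection_string=(
            "AccountEndpoint=https://msdocs-cosmos-source.documents.azure.com:443/;"
            "AccountKey=<account-key>;Database=cosmicworks"
        )
    )
)

adf_client.linked_services.create_or_update(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    linked_service_name="cosmoslinkedservice",
    linked_service=cosmos_linked_service,
)
```
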
## Create and configure sink settings for update and delete operations

First, create a straightforward [Azure Blob Storage](../storage/blobs/index.yml) sink and then configure the sink to filter data to only specific operations.

1. [Create an Azure Blob Storage](../data-factory/quickstart-create-data-factory.md) account and container, if you don't already have one. For the next examples, we use an account named `msdocsblobstorage` and a container named `output`. A small sketch for creating the container programmatically follows these steps.

    > [!TIP]
    > If possible, create the storage account in the same region where your Azure Cosmos DB account resides.

1. Back in Azure Data Factory, create a new sink for the change data captured from your `cosmos` source.

    :::image type="content" source="media/get-started-change-data-capture/add-sink.png" alt-text="Screenshot of adding a new sink that's connected to the existing source.":::

1. Give the sink a unique name. In this example, the sink is named `storage`.

    :::image type="content" source="media/get-started-change-data-capture/sink-name.png" alt-text="Screenshot of naming the newly created sink storage.":::

1. In the **Sink type** section, select **Inline**. In the **Dataset** field, select **Delta**.

    :::image type="content" source="media/get-started-change-data-capture/sink-dataset-type.png" alt-text="Screenshot of selecting an Inline Delta dataset type for the sink.":::

1. Create a new linked service for your account using **Azure Blob Storage** named **storagelinkedservice**. Select your existing Azure Blob Storage account in the **New linked service** popup dialog and then select **OK**. In this example, we select a pre-existing Azure Blob Storage account named `msdocsblobstorage`.

    :::image type="content" source="media/get-started-change-data-capture/new-linked-service-sink-type.png" alt-text="Screenshot of the service type options for a new Delta linked service.":::

    :::image type="content" source="media/get-started-change-data-capture/new-linked-service-sink-config.png" alt-text="Screenshot of the New linked service dialog with an Azure Blob Storage account selected.":::

1. Select the **Settings** tab.

1. Within **Settings**, set the **Folder path** to the name of the blob container. In this example, the container's name is `output`.

    :::image type="content" source="media/get-started-change-data-capture/sink-container-name.png" alt-text="Screenshot of the blob container named output set as the sink target.":::

1. Locate the **Update method** section and change the selections to only allow **delete** and **update** operations. Also, specify the **Key columns** as a **List of columns** using the field `{_rid}` as the unique identifier.

    :::image type="content" source="media/get-started-change-data-capture/sink-methods-columns.png" alt-text="Screenshot of update methods and key columns being specified for the sink.":::

1. Select **Validate** to ensure you haven't made any errors or omissions. Then, select **Publish** to publish the data flow.

    :::image type="content" source="media/get-started-change-data-capture/validate-publish-data-flow.png" alt-text="Screenshot of the option to validate and then publish the current data flow.":::

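If the `output` container doesn't exist yet, you can create it with a few lines of Python using the `azure-storage-blob` package. This is only a convenience sketch; the connection string is a placeholder, and the container name matches the example values used above.

```python
# Minimal sketch: create the blob container that the Delta sink writes to.
# The storage connection string is a placeholder; "output" matches the example.
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")

try:
    blob_service.create_container("output")
    print("Created container 'output'.")
except ResourceExistsError:
    print("Container 'output' already exists.")
```
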
## Schedule change data capture execution

After a data flow has been published, you can add a new pipeline to move and transform your data.

1. Create a new pipeline. Give the pipeline a unique name. In this example, the pipeline is named `cosmoscdcpipeline`.

    :::image type="content" source="media/get-started-change-data-capture/new-pipeline.png" alt-text="Screenshot of the new pipeline option within the resources section.":::

1. In the **Activities** section, expand the **Move & transform** option and then select **Data flow**.

    :::image type="content" source="media/get-started-change-data-capture/data-flow-activity.png" alt-text="Screenshot of the data flow activity option within the activities section.":::

1. Give the data flow activity a unique name. In this example, the activity is named `cosmoscdcactivity`.

1. In the **Settings** tab, select the data flow named `cosmoscdc` that you created earlier in this guide. Then, select a compute size based on the data volume and required latency for your workload.

    :::image type="content" source="media/get-started-change-data-capture/data-flow-settings.png" alt-text="Screenshot of the configuration settings for both the data flow and compute size for the activity.":::

    > [!TIP]
    > For incremental data sizes greater than 100 GB, we recommend the **Custom** size with a core count of 32 (+16 driver cores).

1. Select **Add trigger**. Schedule this pipeline to execute at a cadence that makes sense for your workload. In this example, the pipeline is configured to execute every five minutes.

    :::image type="content" source="media/get-started-change-data-capture/add-trigger.png" alt-text="Screenshot of the add trigger button for a new pipeline.":::

    :::image type="content" source="media/get-started-change-data-capture/trigger-configuration.png" alt-text="Screenshot of a trigger configuration based on a schedule, starting in the year 2023, that runs every five minutes.":::

    > [!NOTE]
    > The minimum recurrence window for change data capture executions is one minute.

1. Select **Validate** to ensure you haven't made any errors or omissions. Then, select **Publish** to publish the pipeline.

1. Observe the data placed into the Azure Blob Storage container as an output of the data flow using Azure Cosmos DB analytical store change data capture. A small verification sketch follows these steps.

    :::image type="content" source="media/get-started-change-data-capture/output-files.png" alt-text="Screenshot of the output files from the pipeline in the Azure Blob Storage container.":::

    > [!NOTE]
    > The initial cluster startup time may take up to three minutes. To avoid cluster startup time in subsequent change data capture executions, configure the data flow cluster **Time to live** value. For more information about the integration runtime and TTL, see [integration runtime in Azure Data Factory](../data-factory/concepts-integration-runtime.md).

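To spot-check the output outside of the portal, you can list the files that the data flow writes to the `output` container. This hedged sketch uses the `azure-storage-blob` package with a placeholder connection string.

```python
# Minimal sketch: list the files written by the change data capture data flow.
# The storage connection string is a placeholder; "output" matches the sink above.
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container_client = blob_service.get_container_client("output")

# Delta output typically includes Parquet data files plus a _delta_log folder.
for blob in container_client.list_blobs():
    print(f"{blob.name}\t{blob.size} bytes")
```
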
## Next steps

- Review the [overview of Azure Cosmos DB analytical store](analytical-store-introduction.md)