Commit f4fa5e5

Merge pull request #178262 from AnithaAdusumilli/patch-36
Updating for Ignite - Synapse Link Custom Partitioning
2 parents f4ddf12 + 7d5da43 commit f4fa5e5


1 file changed

+63
-23
lines changed

articles/cosmos-db/custom-partitioning-analytical-store.md

Lines changed: 63 additions & 23 deletions
@@ -12,40 +12,39 @@ ms.custom: ignite-fall-2021
1212
# Custom partitioning in Azure Synapse Link for Azure Cosmos DB (Preview)
1313
[!INCLUDE[appliesto-sql-api](includes/appliesto-sql-api.md)]
1414

15-
Custom partitioning enables you to partition the analytical store data on fields that are commonly used as filters in analytical queries resulting in improved query performance.
15+
Custom partitioning enables you to partition analytical store data, on fields that are commonly used as filters in analytical queries, resulting in improved query performance.
1616

1717
In this article, you will learn how to partition your data in Azure Cosmos DB analytical store using keys that are critical for your analytical workloads. It also explains how to take advantage of the improved query performance with partition pruning. You will also learn how the partitioned store helps to improve the query performance when your workloads have a significant number of updates or deletes.
1818

1919
> [!IMPORTANT]
2020
> Custom partitioning feature is currently in public preview. This preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).
2121
2222
> [!NOTE]
23-
> Azure Cosmos DB accounts should have Azure Synapse Link enabled to take advantage of custom partitioning. Custom partitioning is currently supported for Azure Synapse Spark 2.0 only.
23+
> Azure Cosmos DB accounts should have [Azure Synapse Link](synapse-link.md) enabled to take advantage of custom partitioning. Custom partitioning is currently supported for Azure Synapse Spark 2.0 only.
2424
2525
## How does it work?
2626

27-
With custom partitioning, you can choose a single field or a combination of fields from your dataset as the analytical store partition key.
27+
Analytical store partitioning is independent of partitioning in the transactional store. By default, the analytical store is not partitioned. If you frequently query the analytical store based on fields such as Date, Time, or Category, you can use custom partitioning to create a separate partitioned store based on these keys. You can choose a single field or a combination of fields from your dataset as the analytical store partition key.
2828

29-
The analytical store partitioning is independent of partitioning in the transactional store. By default, analytical store is not partitioned. If you want to query analytical store frequently based on fields such as Date, Time, Category etc. we recommend that you create a partitioned store based on these keys.
30-
31-
To trigger partitioning, you can periodically execute partitioning job from an Azure Synapse Spark notebook using Azure Synapse Link. You can schedule it to run as a background job at your convenient schedule.
29+
You can trigger partitioning from an Azure Synapse Spark notebook using Azure Synapse Link. You can schedule it to run as a background job once or twice a day, or more often if needed.
3230

3331
> [!NOTE]
3432
> The partitioned store points to the ADLS Gen2 primary storage account that is linked with the Azure Synapse workspace.
3533
3634
:::image type="content" source="./media/custom-partitioning-analytical-store/partitioned-store-architecture.png" alt-text="Architecture of partitioned store in Azure Synapse Link for Azure Cosmos DB" lightbox="./media/custom-partitioning-analytical-store/partitioned-store-architecture.png" border="false":::
3735

38-
The partitioned store contains Azure Cosmos DB analytical data until the last timestamp you ran your partitioning job. When you query your analytical data using the partition key filters in Synapse Spark, Synapse Link will automatically merge most recent data from the analytical store with the data in partitioned store. This way it gives you the latest results. Although it merges the data before querying, the delta isn’t written back to the partitioned store. As the delta between data in analytical store and partitioned store widens, the query times on partitioned data may vary. Triggering partitioning job more frequently will reduce this delta. Each time you execute the partition job, only incremental changes in the analytical store will be processed, instead of the full data set.
36+
The partitioned store contains Azure Cosmos DB analytical data until the last timestamp you ran your partitioning job. When you query your analytical data using the partition key filters in Synapse Spark, Synapse Link will automatically merge the data in partitioned store with the most recent data from the analytical store. This way it gives you the latest results for your queries. Although it merges the data before querying, the delta isn’t written back to the partitioned store. As the delta between data in analytical store and partitioned store widens, the query times on partitioned data may vary. Triggering partitioning job more frequently will reduce this delta. Each time you execute the partition job, only incremental changes in the analytical store will be processed, instead of the full data set.
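
The merge-at-query behavior described above can be pictured with a small toy model. This is only an illustrative sketch, not how Synapse Link is implemented: the `Change` type, the `run_partition_job` and `query` functions, and the integer timestamps are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """Hypothetical change record in the analytical store."""
    doc_id: str
    ts: int                 # change timestamp
    body: Optional[dict]    # None models a delete


def run_partition_job(partitioned: dict, changes: list, last_ts: int) -> int:
    """Fold only the changes newer than last_ts into the partitioned store.

    Returns the new watermark (timestamp of the last change processed).
    """
    new_ts = last_ts
    for change in sorted(changes, key=lambda c: c.ts):
        if change.ts <= last_ts:
            continue  # already processed by an earlier job run
        if change.body is None:
            partitioned.pop(change.doc_id, None)
        else:
            partitioned[change.doc_id] = change.body
        new_ts = change.ts
    return new_ts


def query(partitioned: dict, changes: list, last_ts: int) -> dict:
    """Merge the delta at read time; the partitioned store is left untouched."""
    view = dict(partitioned)
    run_partition_job(view, changes, last_ts)
    return view
```

In this model, the more changes accumulate after the last job run, the more work `query` does on every call, which mirrors why running the partitioning job more often narrows the delta.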
3937

4038
## When to use?
4139

4240
Using partitioned store is optional when querying analytical data in Azure Cosmos DB. You can directly query the same data using Synapse Link with the existing analytical store. You may want to turn on partitioned store if you have following requirements:
41+
* Common analytical query filters that could be used as partition columns
42+
* Low cardinality partition columns
43+
* Partition column distributes data equally across partitions
44+
* High volume of update or delete operations
45+
* Slow data ingestion
4346

44-
* You want to frequently query analytical data filtered on some fields.
45-
46-
* You have high volume of updates/delete operations or data is ingested slowly. Partitioned store provides better query performance in these cases, irrespective of whether you are querying using partition keys or not.
47-
48-
Except for the workloads above, if you are querying live data using query filters that are different from the partition keys, we recommend that you query this directly from the analytical store, especially if the partitioning jobs are not run frequently.
47+
Except for workloads that meet the above requirements, if you are querying live data using query filters that are different from the partition keys, we recommend that you query directly from the analytical store. This is especially true if the partitioning jobs are not scheduled to run frequently.
4948

5049
## Benefits
5150

@@ -55,9 +54,7 @@ Because the data corresponding to each unique partition key is colocated in the
5554

5655
### Flexibility to partition your analytical data
5756

58-
You can have multiple partitioning strategies for a given analytical store container where the analytical store data can be partitioned using separate partition keys. For example, the "store_sales" container can be partitioned using "sold_date" as key and can also be partitioned using "item" as key. You must have two separate partitioning jobs in this case, which will essentially partition the data into two separate partitioned stores. This partitioning strategy is beneficial if some of the queries use "sold_date" as the query filter and some other queries use "item" as the query filter.
59-
60-
The data across different partition keys will be part of the same partitioned store and you can query based on the partition key to pick the corresponding data.
57+
You can have multiple partitioning strategies for a given analytical store container. You could use composite or separate partition keys based on your query requirements. Please see partition strategies for guidance on this.
6158

6259
### Query performance improvements
6360

@@ -77,13 +74,57 @@ If you configured [managed private endpoints](analytical-store-private-endpoints
7774

7875
Similarly, if you configured [customer-managed keys on analytical store](how-to-setup-cmk.md#is-it-possible-to-use-customer-managed-keys-in-conjunction-with-the-azure-cosmos-db-analytical-store), you must directly enable it on the Synapse workspace primary storage account, which is the partitioned store, as well.
7976

77+
## Partitioning strategies
78+
You could use one or more partition keys for your analytical data. If you are using multiple partition keys, below are some recommendations on how to partition the data:
79+
- **Using composite keys:**
80+
81+
Say, you want to frequently query based on Key1 and Key2.
82+
83+
For example, "Query for all records where ReadDate = ‘2021-10-08’ and Location = ‘Sydney’".
84+
85+
In this case, composite keys are more efficient: the lookup first narrows to the records that match the ReadDate, and then to the records that match the Location within that ReadDate.
86+
87+
Sample configuration options:
88+
```python
89+
.option("spark.cosmos.asns.partition.keys", "ReadDate String, Location String") \
90+
.option("spark.cosmos.asns.basePath", "/mnt/CosmosDBPartitionedStore/") \
91+
```
92+
93+
Now, on above partitioned store, if you want to only query based on "Location" filter:
94+
* You may want to query analytical store directly. Partitioned store will scan all records by ReadDate first and then by Location.
95+
So, depending on your workload and cardinality of your analytical data, you may get better results by querying analytical store directly.
96+
* You could also run another partition job to also partition based on ‘Location’ on the same partitioned store.
97+
98+
* **Using multiple keys separately:**
99+
100+
Say, you want to frequently query sometimes based on 'ReadDate' and other times, based on 'Location'.
101+
102+
For example,
103+
- Query for all records where ReadDate = ‘2021-10-08’
104+
- Query for all records where Location = ‘Sydney’
105+
106+
Run two partition jobs with partition keys as defined below for this scenario:
107+
108+
Job 1:
109+
```python
110+
.option("spark.cosmos.asns.partition.keys", "ReadDate String") \
111+
.option("spark.cosmos.asns.basePath", "/mnt/CosmosDBPartitionedStore/") \
112+
```
113+
Job 2:
114+
```python
115+
.option("spark.cosmos.asns.partition.keys", "Location String") \
116+
.option("spark.cosmos.asns.basePath", "/mnt/CosmosDBPartitionedStore/") \
117+
```
118+
Note that with the above partitioning, it's not efficient to frequently query on the "ReadDate" and "Location" filters together. Composite keys will give
119+
better query performance in that case.
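
The trade-off between the two strategies can be sketched with a toy in-memory layout. This is an illustrative model only, assuming a composite store behaves like folders nested by ReadDate and then Location; `build_composite_store` and `lookup` are hypothetical helpers, not part of any Cosmos DB or Synapse API.

```python
def build_composite_store(records):
    """Group records by ReadDate, then by Location, like nested partition folders."""
    store = {}
    for record in records:
        by_location = store.setdefault(record["ReadDate"], {})
        by_location.setdefault(record["Location"], []).append(record)
    return store


def lookup(store, read_date=None, location=None):
    """Return (number of ReadDate partitions visited, matching records)."""
    # Without a ReadDate filter, every top-level partition has to be visited.
    dates = [read_date] if read_date is not None else list(store)
    visited = 0
    hits = []
    for date in dates:
        if date not in store:
            continue
        visited += 1
        for loc, rows in store[date].items():
            if location is None or loc == location:
                hits.extend(rows)
    return visited, hits
```

With records spread over several dates, `lookup(store, "2021-10-08", "Sydney")` visits a single ReadDate partition, while `lookup(store, location="Sydney")` visits all of them. That is the reason a Location-only workload may be better served by the analytical store directly or by a separate Location partition job.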
120+
80121
## Limitations
81122

82123
* Custom partitioning is only available for Azure Synapse Spark. Custom partitioning is currently not supported for serverless SQL pools.
83124

84-
* Currently partitioned store can only point to the primary storage account associated with the Synapse workspace. We do not support selecting custom storage accounts at this point.
125+
* Currently partitioned store can only point to the primary storage account associated with the Synapse workspace. Selecting custom storage accounts is not supported at this point.
85126

86-
* Although the API for MongoDB supports analytical store and Synapse Link, it currently doesn't support custom partitioning.
127+
* Custom partitioning is only available for the SQL API in Azure Cosmos DB. The APIs for MongoDB, Gremlin, and Cassandra are not supported at this time.
87128

88129
## Pricing
89130

@@ -116,13 +157,12 @@ Yes, the partition key for the given container can be changed and the new partit
116157

117158
### Can different partition keys point to the same BasePath?
118159

119-
Yes, since the partition key definition is part of the partitioned store path, different partition keys will have different paths branching from the same BasePath.
120-
121-
Base path format could be specified as: /mnt/partitionedstorename/\<Cosmos_DB_account_name\>/\<Cosmos_DB_database_rid\>/\<Cosmos_DB_container_rid\>/partition=partitionkey/
160+
Yes, you can specify multiple partition keys on the same partitioned store as below:
122161

123-
For example:
124-
/mnt/CosmosDBPartitionedStore/store_sales/…/partition=sold_date/...
125-
/mnt/CosmosDBPartitionedStore/store_sales/…/partition=Date/...
162+
```python
163+
.option("spark.cosmos.asns.partition.keys", "ReadDate String, Location String") \
164+
.option("spark.cosmos.asns.basePath", "/mnt/CosmosDBPartitionedStore/") \
165+
```
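
As a rough sketch, the two options above could be generated from a list of keys rather than written by hand. The option names come from the snippet above; the helpers `partition_options` and `apply_options`, and the assumption that `reader` is any object exposing a chainable `.option(key, value)` method (such as a Spark `DataFrameReader`), are introduced here only for illustration.

```python
def partition_options(keys, base_path="/mnt/CosmosDBPartitionedStore/"):
    """Build the partitioning options for one or more (field, type) pairs.

    keys: list of (field_name, type_name) tuples, e.g. [("ReadDate", "String")].
    """
    if not keys:
        raise ValueError("at least one partition key is required")
    key_spec = ", ".join(f"{name} {type_name}" for name, type_name in keys)
    return {
        "spark.cosmos.asns.partition.keys": key_spec,
        "spark.cosmos.asns.basePath": base_path,
    }


def apply_options(reader, options):
    """Chain .option(key, value) calls onto a reader-like object (hypothetical helper)."""
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader
```

Calling `partition_options([("ReadDate", "String"), ("Location", "String")])` yields the same key string as the snippet above, and calling it once per key with the same base path corresponds to the two-job setup.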
126166

127167
## Next steps
128168
