
Commit 51af1b8

Merge pull request #186524 from Rodrigossz/master
Update modeling-data.md
2 parents 8d267a4 + ab010cf commit 51af1b8

1 file changed: +95 -17 lines changed
articles/cosmos-db/sql/modeling-data.md

Lines changed: 95 additions & 17 deletions
@@ -17,7 +17,7 @@ While schema-free databases, like Azure Cosmos DB, make it super easy to store a

How is data going to be stored? How is your application going to retrieve and query data? Is your application read-heavy, or write-heavy?

-After reading this article, you will be able to answer the following questions:
+After reading this article, you'll be able to answer the following questions:

* What is data modeling and why should I care?
* How is modeling data in Azure Cosmos DB different to a relational database?
@@ -32,7 +32,7 @@ For comparison, let's first see how we might model data in a relational database

:::image type="content" source="./media/sql-api-modeling-data/relational-data-model.png" alt-text="Relational database model" border="false":::

-When working with relational databases, the strategy is to normalize all your data. Normalizing your data typically involves taking an entity, such as a person, and breaking it down into discrete components. In the example above, a person can have multiple contact detail records, as well as multiple address records. Contact details can be further broken down by further extracting common fields like a type. The same applies to address, each record can be of type *Home* or *Business*.
+When working with relational databases, the strategy is to normalize all your data. Normalizing your data typically involves taking an entity, such as a person, and breaking it down into discrete components. In the example above, a person may have multiple contact detail records, as well as multiple address records. Contact details can be further broken down by further extracting common fields like a type. The same applies to address, each record can be of type *Home* or *Business*.

The guiding premise when normalizing data is to **avoid storing redundant data** on each record and rather refer to data. In this example, to read a person, with all their contact details and addresses, you need to use JOINS to effectively compose back (or denormalize) your data at run time.
@@ -69,7 +69,7 @@ Now let's take a look at how we would model the same data as a self-contained en
}
```

-Using the approach above we have **denormalized** the person record, by **embedding** all the information related to this person, such as their contact details and addresses, into a *single JSON* document.
+Using the approach above we've **denormalized** the person record, by **embedding** all the information related to this person, such as their contact details and addresses, into a *single JSON* document.
In addition, because we're not confined to a fixed schema we have the flexibility to do things like having contact details of different shapes entirely.

Retrieving a complete person record from the database is now a **single read operation** against a single container and for a single item. Updating a person record, with their contact details and addresses, is also a **single write operation** against a single item.
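
With the embedded model, those point operations map directly onto the SDK. Here's a minimal sketch using the azure-cosmos Python SDK; the endpoint, key, container names, item id `"1"`, and the `addresses` property are illustrative assumptions, not lines from the article.

```python
from azure.cosmos import CosmosClient

# Illustrative sketch only: endpoint, key, and names are placeholders.
client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
container = client.get_database_client("appdb").get_container_client("people")

# Single read operation: the whole person, with contact details and addresses.
person = container.read_item(item="1", partition_key="1")

# Single write operation: update the embedded data and replace the item.
person["addresses"].append({"type": "Business", "city": "Seattle"})
container.upsert_item(person)
```
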
@@ -80,11 +80,11 @@ By denormalizing data, your application may need to issue fewer queries and upda

In general, use embedded data models when:

-* There are **contained** relationships between entities.
-* There are **one-to-few** relationships between entities.
-* There is embedded data that **changes infrequently**.
-* There is embedded data that will not grow **without bound**.
-* There is embedded data that is **queried frequently together**.
+* There're **contained** relationships between entities.
+* There're **one-to-few** relationships between entities.
+* There's embedded data that **changes infrequently**.
+* There's embedded data that will not grow **without bound**.
+* There's embedded data that is **queried frequently together**.

> [!NOTE]
> Typically denormalized data models provide better **read** performance.
@@ -113,7 +113,7 @@ Take this JSON snippet.
}
```

-This might be what a post entity with embedded comments would look like if we were modeling a typical blog, or CMS, system. The problem with this example is that the comments array is **unbounded**, meaning that there is no (practical) limit to the number of comments any single post can have. This may become a problem as the size of the item could grow infinitely large.
+This might be what a post entity with embedded comments would look like if we were modeling a typical blog, or CMS, system. The problem with this example is that the comments array is **unbounded**, meaning that there's no (practical) limit to the number of comments any single post can have. This may become a problem as the size of the item could grow infinitely large.

As the size of the item grows the ability to transmit the data over the wire as well as reading and updating the item, at scale, will be impacted.
@@ -154,7 +154,7 @@ Comment items:

This model has the three most recent comments embedded in the post container, which is an array with a fixed set of attributes. The other comments are grouped into batches of 100 comments and stored as separate items. The size of the batch was chosen as 100 because our fictitious application allows the user to load 100 comments at a time.
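
As a rough sketch of how an application might page through those stored batches with the azure-cosmos Python SDK, assuming each batch item carries `postId`, `batchNumber`, and `comments` properties (all placeholder names, not the article's exact schema):

```python
# Illustrative sketch: fetch the stored comment batches for one post.
# "comments_container" and the batch item shape are assumptions.
query = "SELECT * FROM c WHERE c.postId = @postId ORDER BY c.batchNumber"
batches = comments_container.query_items(
    query=query,
    parameters=[{"name": "@postId", "value": "post-1"}],
    partition_key="post-1",
)
for batch in batches:
    print(len(batch["comments"]), "comments in this batch")
```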

-Another case where embedding data is not a good idea is when the embedded data is used often across items and will change frequently.
+Another case where embedding data isn't a good idea is when the embedded data is used often across items and will change frequently.

Take this JSON snippet.
@@ -182,7 +182,7 @@ Stock *zaza* may be traded many hundreds of times in a single day and thousands

## Referencing data

-Embedding data works nicely for many cases but there are scenarios when denormalizing your data will cause more problems than it is worth. So what do we do now?
+Embedding data works nicely for many cases but there are scenarios when denormalizing your data will cause more problems than it's worth. So what do we do now?

Relational databases are not the only place where you can create relationships between entities. In a document database, you can have information in one document that relates to data in other documents. We do not recommend building systems that would be better suited to a relational database in Azure Cosmos DB, or any other document database, but simple relationships are fine and can be useful.
@@ -230,7 +230,7 @@ An immediate downside to this approach though is if your application is required

### What about foreign keys?

-Because there is currently no concept of a constraint, foreign-key or otherwise, any inter-document relationships that you have in documents are effectively "weak links" and will not be verified by the database itself. If you want to ensure that the data a document is referring to actually exists, then you need to do this in your application, or through the use of server-side triggers or stored procedures on Azure Cosmos DB.
+Because there's currently no concept of a constraint, foreign-key or otherwise, any inter-document relationships that you have in documents are effectively "weak links" and will not be verified by the database itself. If you want to ensure that the data a document is referring to actually exists, then you need to do this in your application, or through the use of server-side triggers or stored procedures on Azure Cosmos DB.
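
For example, an application-side check with the azure-cosmos Python SDK might look like this sketch; the container names and the `authorId` property are hypothetical, not taken from the article's sample documents.

```python
from azure.cosmos import exceptions

# Illustrative sketch: verify a "weak link" before writing the referring item.
def save_book(books_container, authors_container, book: dict) -> None:
    author_id = book["authorId"]   # hypothetical reference property
    try:
        authors_container.read_item(item=author_id, partition_key=author_id)
    except exceptions.CosmosResourceNotFoundError:
        raise ValueError(f"Author {author_id} does not exist")
    books_container.upsert_item(book)
```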

### When to reference
@@ -321,8 +321,8 @@ Joining documents:

This would work. However, loading either an author with their books, or loading a book with its author, would always require at least two additional queries against the database. One query to the joining document and then another query to fetch the actual document being joined.

-If all this join table is doing is gluing together two pieces of data, then why not drop it completely?
-Consider the following.
+If all this join is doing is gluing together two pieces of data, then why not drop it completely?
+Consider the following example.

```json
Author documents:
@@ -393,7 +393,7 @@ Book documents:

Here we've (mostly) followed the embedded model, where data from other entities are embedded in the top-level document, but other data is referenced.

-If you look at the book document, we can see a few interesting fields when we look at the array of authors. There is an `id` field that is the field we use to refer back to an author document, standard practice in a normalized model, but then we also have `name` and `thumbnailUrl`. We could have stuck with `id` and left the application to get any additional information it needed from the respective author document using the "link", but because our application displays the author's name and a thumbnail picture with every book displayed we can save a round trip to the server per book in a list by denormalizing **some** data from the author.
+If you look at the book document, we can see a few interesting fields when we look at the array of authors. There's an `id` field that is the field we use to refer back to an author document, standard practice in a normalized model, but then we also have `name` and `thumbnailUrl`. We could have stuck with `id` and left the application to get any additional information it needed from the respective author document using the "link", but because our application displays the author's name and a thumbnail picture with every book displayed we can save a round trip to the server per book in a list by denormalizing **some** data from the author.

Sure, if the author's name changed or they wanted to update their photo we'd have to go and update every book they ever published but for our application, based on the assumption that authors don't change their names often, this is an acceptable design decision.
@@ -429,11 +429,89 @@ Review documents:
}
```

## Data modeling for Azure Synapse Link and Azure Cosmos DB analytical store

[Azure Synapse Link for Azure Cosmos DB](../synapse-link.md) is a cloud-native hybrid transactional and analytical processing (HTAP) capability that enables you to run near real-time analytics over operational data in Azure Cosmos DB. Azure Synapse Link creates a tight, seamless integration between Azure Cosmos DB and Azure Synapse Analytics.

This integration happens through [Azure Cosmos DB analytical store](../analytical-store-introduction.md), a columnar representation of your transactional data that enables large-scale analytics without any impact to your transactional workloads. This analytical store is suitable for fast, cost-effective queries on large operational data sets, without copying data or impacting the performance of your transactional workloads. When you create a container with analytical store enabled, or when you enable analytical store on an existing container, all transactional inserts, updates, and deletes are synchronized with the analytical store in near real time; no change feed or ETL jobs are required.

With Azure Synapse Link, you can now directly connect to your Azure Cosmos DB containers from Azure Synapse Analytics and access the analytical store at no request unit (RU) cost. Azure Synapse Analytics currently supports Azure Synapse Link with Synapse Apache Spark and serverless SQL pools. If you have a globally distributed Azure Cosmos DB account, after you enable analytical store for a container, it will be available in all regions for that account.
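
For example, a Synapse Spark pool can load a container's analytical store directly into a DataFrame. The following is a minimal PySpark sketch, not part of the original article; the linked service name `CosmosDb` and the container name `Person` are placeholders.

```python
# Minimal PySpark sketch for an Azure Synapse Spark notebook, where the
# `spark` session is already provided. "CosmosDb" and "Person" are
# placeholder names for the linked service and container.
df = (spark.read
          .format("cosmos.olap")                          # reads the analytical store
          .option("spark.synapse.linkedService", "CosmosDb")
          .option("spark.cosmos.container", "Person")
          .load())

df.printSchema()   # columns produced by automatic schema inference
```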

### Analytical store automatic schema inference

While the Azure Cosmos DB transactional store holds row-oriented, semi-structured data, the analytical store has a columnar, structured format. This conversion is made automatically for you, using the schema inference rules described [here](../analytical-store-introduction.md).

You can minimize the impact of the schema inference conversions, and maximize your analytical capabilities, by using the following techniques.

#### Normalization

Because Azure Synapse Link lets you join across your containers with T-SQL or Spark SQL, denormalizing your data purely to avoid joins is no longer necessary for analytical queries (a join sketch follows at the end of this section). The expected benefits of normalization are:

* Smaller data footprint in both transactional and analytical store.
* Smaller transactions.
* Fewer properties per document.
* Data structures with fewer nested levels.

These last two factors, fewer properties and fewer levels, help the performance of your analytical queries and also reduce the chance that parts of your data aren't represented in the analytical store. As described in the article on automatic schema inference rules, there are limits to the number of levels and properties that are represented in analytical store.

Another important factor for normalization is that serverless SQL pools in Azure Synapse support result sets with up to 1000 columns, and exposing nested columns also counts towards that limit. In other words, both analytical store and Synapse serverless SQL pools have a limit of 1000 properties.

But what should you do, given that denormalization is an important data modeling technique for Azure Cosmos DB? The answer is that you must find the right balance for your transactional and analytical workloads.
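
Here's the join sketch mentioned above: two normalized containers read through Azure Synapse Link and joined in a Spark pool. The `Customer` and `SalesOrder` container names, the `CosmosDb` linked service, and the `customerId` property are placeholders for illustration.

```python
# PySpark sketch: join two normalized containers through their analytical stores.
customers = (spark.read.format("cosmos.olap")
                 .option("spark.synapse.linkedService", "CosmosDb")
                 .option("spark.cosmos.container", "Customer")
                 .load())

orders = (spark.read.format("cosmos.olap")
              .option("spark.synapse.linkedService", "CosmosDb")
              .option("spark.cosmos.container", "SalesOrder")
              .load())

# The join runs in the Spark pool against analytical store data, so it
# doesn't consume request units from the transactional store.
orders.join(customers, orders["customerId"] == customers["id"]).show(10)
```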

#### Partition key

Your Azure Cosmos DB partition key (PK) isn't used in analytical store. And now you can use [analytical store custom partitioning](https://devblogs.microsoft.com/cosmosdb/custom-partitioning-azure-synapse-link/) to create copies of the analytical store partitioned by any PK that you want. Because of this isolation, you can choose a PK for your transactional data with a focus on data ingestion and point reads, while cross-partition queries can be done with Azure Synapse Link. Let's see an example:

In a hypothetical global IoT scenario, *device id* is a good PK since all devices have a similar data volume and with that you won't have a hot partition problem. But if you want to analyze the data of more than one device, like "all data from yesterday" or "totals per city", you may have problems since those are cross-partition queries. Those queries can hurt your transactional performance, since they use part of your throughput in RUs to run. But with Azure Synapse Link, you can run these analytical queries at no RU cost. The analytical store's columnar format is optimized for analytical queries, and Azure Synapse Link leverages this characteristic to allow great performance with Azure Synapse Analytics runtimes.
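
For instance, a "totals per city" rollup can run entirely against the analytical store. A hedged PySpark sketch, assuming an `IoTSignals` container with `city` and `reading` properties (all placeholder names):

```python
from pyspark.sql import functions as F

# PySpark sketch of a cross-partition analytical query served by the
# analytical store instead of the transactional store.
signals = (spark.read.format("cosmos.olap")
               .option("spark.synapse.linkedService", "CosmosDb")
               .option("spark.cosmos.container", "IoTSignals")
               .load())

signals.groupBy("city").agg(F.sum("reading").alias("total")).show()
```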

#### Data types and property names

The automatic schema inference rules article lists the supported data types. While an unsupported data type blocks representation in the analytical store, supported data types may be processed differently by the Azure Synapse runtimes. One example: when using DateTime strings that follow the ISO 8601 UTC standard, Spark pools in Azure Synapse represent these columns as `string` and serverless SQL pools in Azure Synapse represent these columns as `varchar(8000)`.
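
In practice that means casting before applying date logic in Spark. A small illustrative sketch; `df` is a DataFrame loaded as in the earlier sketches and `orderDate` is a placeholder property name.

```python
from pyspark.sql import functions as F

# ISO 8601 DateTime properties arrive in Spark as strings, so cast them
# explicitly before using date semantics. "orderDate" is a placeholder name.
typed = df.withColumn("orderDate", F.to_timestamp("orderDate"))
typed.filter(F.col("orderDate") >= "2021-12-01").show()
```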

Another challenge is that not all characters are accepted by Azure Synapse Spark. While white spaces are accepted, characters like colon, grave accent, and comma are not. Let's say that your document has a property named **"First Name, Last Name"**. This property will be represented in analytical store, and Synapse serverless SQL pools can read it without a problem. But because the property name contains a comma, Azure Synapse Spark can't read any data from the analytical store, including all other properties. At the end of the day, you can't use Azure Synapse Spark when you have even one property that uses these unsupported characters in its name.
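
One way to protect Spark access is to sanitize property names in your application before you write documents. This is a hypothetical helper, purely illustrative, based on the characters called out above; it is not part of any SDK.

```python
import re

# Hypothetical helper: replace characters the text above calls out as
# unsupported (colon, grave accent, comma) in property names.
UNSUPPORTED = re.compile(r"[:`,]")

def sanitize_keys(doc):
    """Return a copy of the document with sanitized property names."""
    if isinstance(doc, dict):
        return {UNSUPPORTED.sub("_", key): sanitize_keys(value) for key, value in doc.items()}
    if isinstance(doc, list):
        return [sanitize_keys(item) for item in doc]
    return doc

print(sanitize_keys({"First Name, Last Name": "Thomas Andersen"}))
# {'First Name_ Last Name': 'Thomas Andersen'}
```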

#### Columns and nested data

All properties in the root level of your Azure Cosmos DB data will be represented in analytical store as a column, and everything in deeper levels of your document data model will be represented as JSON, in nested structures. Nested structures demand extra processing from Azure Synapse runtimes to flatten the data into a structured format, which may be a challenge in big data scenarios.

The document below will have only two columns in analytical store, `id` and `contactDetails`. All other data, such as `email` and `phone`, will require extra processing through SQL functions to be read individually.

```json
{
    "id": "1",
    "contactDetails": [
        {"email": "[email protected]"},
        {"phone": "+1 555 555-5555", "extension": 5555}
    ]
}
```

The document below will have four columns in analytical store, `id`, `email`, `phone`, and `extension`. All data is directly accessible as columns.

```json
{
    "id": "1",
    "email": "[email protected]",
    "phone": "+1 555 555-5555",
    "extension": 5555
}
```
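
To read the nested variant from Spark, the application has to flatten `contactDetails` itself, along these lines (a sketch, assuming the array is surfaced as an array of structs and `df` is loaded as in the earlier sketches):

```python
from pyspark.sql import functions as F

# Illustrative flattening of the nested contactDetails array.
flat = (df
        .select("id", F.explode("contactDetails").alias("contact"))
        .select("id",
                F.col("contact.email").alias("email"),
                F.col("contact.phone").alias("phone"),
                F.col("contact.extension").alias("extension")))
flat.show()
```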

#### RU optimization

Azure Synapse Link allows you to reduce costs from the following perspectives:

* Fewer queries running in your transactional database.
* A PK optimized for data ingestion and point reads, reducing data footprint, hot partition scenarios, and partition splits.
* Data tiering, since analytical TTL (attl) is independent from transactional TTL (tttl). You can keep your transactional data in the transactional store for a few days, weeks, or months, and keep the data in the analytical store for years or forever (see the sketch after this list). The analytical store's columnar format brings natural data compression, from 50% up to 90%, and its cost per GB is ~10% of the transactional store price. Check the [analytical store overview](../analytical-store-introduction.md) to read about the current backup limitations.
* No ETL jobs running in your environment, meaning that you don't need to provision RUs for them.
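
A minimal data-tiering sketch with the `azure-cosmos` Python SDK, assuming a version that exposes the `analytical_storage_ttl` parameter; the endpoint, key, and names are placeholders.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Illustrative sketch: keep items in the transactional store for 30 days (tttl)
# while retaining them in the analytical store indefinitely (attl = -1).
client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
database = client.create_database_if_not_exists("appdb")

database.create_container_if_not_exists(
    id="IoTSignals",
    partition_key=PartitionKey(path="/deviceId"),
    default_ttl=30 * 24 * 60 * 60,   # transactional TTL: 30 days
    analytical_storage_ttl=-1,       # analytical TTL: never expire
)
```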

## Next steps

The biggest takeaways from this article are to understand that data modeling in a schema-free world is as important as ever.

-Just as there is no single way to represent a piece of data on a screen, there is no single way to model your data. You need to understand your application and how it will produce, consume, and process the data. Then, by applying some of the guidelines presented here you can set about creating a model that addresses the immediate needs of your application. When your applications need to change, you can leverage the flexibility of a schema-free database to embrace that change and evolve your data model easily.
+Just as there's no single way to represent a piece of data on a screen, there's no single way to model your data. You need to understand your application and how it will produce, consume, and process the data. Then, by applying some of the guidelines presented here you can set about creating a model that addresses the immediate needs of your application. When your applications need to change, you can leverage the flexibility of a schema-free database to embrace that change and evolve your data model easily.

* To learn more about Azure Cosmos DB, refer to the service's [documentation](https://azure.microsoft.com/documentation/services/cosmos-db/) page.
@@ -445,5 +523,5 @@ Data Modeling and Partitioning - a Real-World Example](how-to-model-partition-ex
* See the learn module on how to [Model and partition your data in Azure Cosmos DB.](/learn/modules/model-partition-data-azure-cosmos-db/)

* Trying to do capacity planning for a migration to Azure Cosmos DB? You can use information about your existing database cluster for capacity planning.
-* If all you know is the number of vcores and servers in your existing database cluster, read about [estimating request units using vCores or vCPUs](../convert-vcore-to-request-unit.md)
+* If all you know is the number of vCores and servers in your existing database cluster, read about [estimating request units using vCores or vCPUs](../convert-vcore-to-request-unit.md)
* If you know typical request rates for your current database workload, read about [estimating request units using Azure Cosmos DB capacity planner](estimate-ru-with-capacity-planner.md)
