articles/cosmos-db/analytical-store-introduction.md
The multi-model operational data in an Azure Cosmos DB container is internally stored in an indexed, row-based "transactional store". The row store format is designed for fast transactional reads and writes with order-of-milliseconds response times, and for operational queries. As your dataset grows, complex analytical queries can be expensive in terms of provisioned throughput on data stored in this format. High consumption of provisioned throughput, in turn, impacts the performance of the transactional workloads that your real-time applications and services depend on.
Traditionally, to analyze large amounts of data, operational data is extracted from Azure Cosmos DB's transactional store and stored in a separate data layer, for example a data warehouse or data lake in a suitable format. The data is later analyzed at large scale using a compute engine such as Apache Spark. This separation of analytical from operational data introduces delays for analysts who want to work with the most recent data.
ETL pipelines also become more complex when they must handle updates to existing operational data, rather than only newly ingested data.
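To make the contrast concrete, here is a toy, non-Azure sketch: an append-only pipeline can blindly add rows to the target, while a pipeline that must also handle updates has to locate and overwrite previously loaded rows by key.

```python
# Toy illustration (not Azure-specific): why updates complicate ETL.
warehouse = {}  # hypothetical analytical target table, keyed by document id

def apply_batch(batch):
    """Merge a batch of changed operational documents into the target.

    Append-only ingestion would only ever add new keys; handling updates
    means each document must be upserted by its key instead.
    """
    for doc in batch:
        warehouse[doc["id"]] = doc  # insert a new id, or overwrite (upsert)

apply_batch([{"id": 1, "qty": 5}])                       # initial ingestion
apply_batch([{"id": 1, "qty": 7}, {"id": 2, "qty": 3}])  # later batch with an update
```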
## Auto-Sync
Auto-Sync is the fully managed capability of Azure Cosmos DB by which inserts, updates, and deletes to operational data are automatically synced from the transactional store to the analytical store in near real time. Auto-sync latency is usually within 2 minutes. For a shared throughput database with a large number of containers, the auto-sync latency of individual containers can be higher and take up to 5 minutes.
At the end of each execution of the automatic sync process, your transactional data is immediately available to Azure Synapse Analytics runtimes:
* Sample scenarios:
  * If your document's first level has 2,000 properties, the sync process represents only the first 1,000 of them.
  * If your documents have five levels with 200 properties in each one, the sync process represents all properties.
  * If your documents have 10 levels with 400 properties in each one, the sync process fully represents the first two levels and only half of the third level.
* The hypothetical document below contains four properties and three levels.
* The levels are `root`, `myArray`, and the nested structure within `myArray`.
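The three sample scenarios above are consistent with a simple model: properties are admitted level by level, shallowest first, until the 1,000-property cap is reached. A minimal sketch of that model (an illustration of the documented limits, not the actual sync implementation):

```python
def represented_per_level(props_per_level, cap=1000):
    """Admit properties level by level, shallowest first, until the cap is hit.

    props_per_level[i] is the number of properties at nesting level i.
    Returns how many properties of each level end up represented.
    """
    remaining = cap
    taken = []
    for count in props_per_level:
        take = min(count, remaining)
        taken.append(take)
        remaining -= take
    return taken

represented_per_level([2000])      # [1000]: only the first 1,000 of level one
represented_per_level([200] * 5)   # [200, 200, 200, 200, 200]: everything fits
represented_per_level([400] * 10)  # [400, 400, 200, 0, ...]: half of level three
```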
* MinKey/MaxKey
* When using DateTime strings that follow the ISO 8601 UTC standard, expect the following behavior:
  * Spark pools in Azure Synapse represent these columns as `string`.
  * SQL serverless pools in Azure Synapse represent these columns as `varchar(8000)`.
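Because both engines expose these values as plain strings, any typed use requires an explicit conversion on the consumer side. For example, in plain Python (the value shown is hypothetical, not taken from a real container):

```python
from datetime import datetime, timezone

# Hypothetical ISO 8601 UTC string as exposed by the analytical store.
raw = "2021-04-05T13:29:01Z"

# The trailing 'Z' is handled explicitly; the result is a timezone-aware datetime.
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
```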
* Properties with `UNIQUEIDENTIFIER (guid)` types are represented as `string` in analytical store and should be converted to `VARCHAR` in **SQL** or to `string` in **Spark** for correct visualization.
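A quick way to see that the string form is lossless (plain Python, independent of any Azure service):

```python
import uuid

guid = uuid.uuid4()        # stands in for a UNIQUEIDENTIFIER property value
as_exposed = str(guid)     # the analytical store exposes the value as a string
round_tripped = uuid.UUID(as_exposed)  # casting back recovers the original value
```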
* SQL serverless pools in Azure Synapse support result sets with up to 1,000 columns, and exposed nested columns also count toward that limit. Consider this limit when designing your data architecture and modeling your transactional data.
* If you rename a property in one or more documents, it's considered a new column. If you execute the same rename across all documents in the collection, all data is migrated to the new column and the old column is represented with `NULL` values.
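The effect of a rename can be illustrated with a plain-Python sketch of column inference (an illustration of the documented behavior, not the sync engine itself):

```python
def to_rows(docs):
    """Infer the column set as the union of property names across documents;
    properties missing from a document surface as None (NULL)."""
    columns = []
    for doc in docs:
        for name in doc:
            if name not in columns:
                columns.append(name)
    return [{c: doc.get(c) for c in columns} for doc in docs]

docs = [
    {"id": 1, "name": "a"},      # original property name
    {"id": 2, "fullName": "b"},  # 'name' renamed to 'fullName' in this document
]
rows = to_rows(docs)
# rows[0] -> {'id': 1, 'name': 'a', 'fullName': None}
# rows[1] -> {'id': 2, 'name': None, 'fullName': 'b'}
```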
### Schema representation
There are two methods of schema representation in the analytical store, applied to all containers in the database account. They trade off the simplicity of the query experience against the convenience of a more inclusive columnar representation for polymorphic schemas.
* Well-defined schema representation, the default option for API for NoSQL and Gremlin accounts.
* Full fidelity schema representation, the default option for API for MongoDB accounts.
> If the Azure Cosmos DB analytical store follows the well-defined schema representation and the specification above is violated by certain items, those items won't be included in the analytical store.
* Expect different behavior in regard to different types in well-defined schema:
  * Spark pools in Azure Synapse represent these values as `undefined`.
  * SQL serverless pools in Azure Synapse represent these values as `NULL`.
* Expect different behavior in regard to explicit `NULL` values:
  * Spark pools in Azure Synapse read these values as `0` (zero), and change to `undefined` as soon as the column has a non-null value.
  * SQL serverless pools in Azure Synapse read these values as `NULL`.
* Expect different behavior in regard to missing columns:
  * Spark pools in Azure Synapse represent these columns as `undefined`.
  * SQL serverless pools in Azure Synapse represent these columns as `NULL`.