articles/cosmos-db/analytical-store-introduction.md
The full fidelity schema representation is designed to handle the full breadth of polymorphic schemas in the schema-agnostic operational data. In this schema representation, no items are dropped from the analytical store even if the well-defined schema constraints (that is, no mixed data type fields and no mixed data type arrays) are violated.
This is achieved by translating the leaf properties of the operational data into the analytical store as JSON `key-value` pairs, where the datatype is the `key` and the property content is the `value`. This JSON object representation allows queries without ambiguity, and you can individually analyze each datatype.
In other words, in the full fidelity schema representation, each datatype of each property of each document will generate a `key-value` pair in a JSON object for that property. Each of them counts toward the 1,000 maximum properties limit.
For example, let's take the following sample document in the transactional store:
```json
{
  "address": {
    "streetNo": 15850,
    "streetName": "NE 40th St.",
    "zip": 98052
  },
  "salary": 1000000
}
```
The nested object `address` is a property at the root level of the document and will be represented as a column. Each leaf property in the `address` object will be represented as a JSON object: `{"object":{"streetNo":{"int32":15850},"streetName":{"string":"NE 40th St."},"zip":{"int32":98052}}}`.
Unlike the well-defined schema representation, the full fidelity method allows variation in datatypes. If another document in this collection has `streetNo` as a string, for example `"123"`, it will be represented in the analytical store as `"streetNo":{"string":"123"}`. In the well-defined schema method, it wouldn't be represented.
##### Datatype map for full fidelity schema

Here's a map of all the property data types and their representations in the analytical store in the full fidelity schema representation:
|Original data type |Suffix |Example |
|---------|---------|---------|
* Spark pools in Azure Synapse will represent these columns as `undefined`.
* SQL serverless pools in Azure Synapse will represent these columns as `NULL`.
##### Using full fidelity schema on Spark
Spark will manage each datatype as a column when loading the data into a `DataFrame`. Let's assume a collection with the documents below.
```json
{
  "_id": "1",
  "item": "Pizza",
  "price": 3.49,
  "rating": 3,
  "timestamp": 1604021952.6790195
},
{
  "_id": "2",
  "item": "Ice Cream",
  "price": 1.59,
  "rating": "4",
  "timestamp": "2022-11-11 10:00 AM"
}
```
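How the collection gets loaded depends on your environment. As a minimal sketch, assuming an Azure Synapse workspace where the Azure Cosmos DB account is exposed through a linked service named `CosmosDBLinkedService` and the documents live in a container named `FoodCollection` (both placeholder names), the analytical store can be read into a `DataFrame` like this:

```python
# Read the container's analytical store into a Spark DataFrame.
# "CosmosDBLinkedService" and "FoodCollection" are placeholder names.
df = spark.read.format("cosmos.olap")\
    .option("spark.synapse.linkedService", "CosmosDBLinkedService")\
    .option("spark.cosmos.container", "FoodCollection")\
    .load()
```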
While the first document has `rating` as a number and `timestamp` as a numeric epoch value, the second document has `rating` and `timestamp` as strings. Assuming that this collection was loaded into a `DataFrame` without any data transformation, the output of `df.printSchema()` is:
```text
root
 |-- _rid: string (nullable = true)
 |-- _ts: long (nullable = true)
 |-- id: string (nullable = true)
 |-- _etag: string (nullable = true)
 |-- _id: struct (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- string: string (nullable = true)
 |-- price: struct (nullable = true)
 |    |-- float64: double (nullable = true)
 |-- rating: struct (nullable = true)
 |    |-- int32: integer (nullable = true)
 |    |-- string: string (nullable = true)
 |-- timestamp: struct (nullable = true)
 |    |-- float64: double (nullable = true)
 |    |-- string: string (nullable = true)
 |-- _partitionKey: struct (nullable = true)
 |    |-- string: string (nullable = true)
```
In the well-defined schema representation, neither `rating` nor `timestamp` of the second document would be represented. In the full fidelity schema, you can use the following examples to individually access each value of each datatype.
In the example below, we can use PySpark to run an aggregation:

```python
df.groupBy(df.item.string).sum().show()
```
In the example below, we can use Spark SQL to run another aggregation:

```python
df.createOrReplaceTempView("Pizza")
sql_results = spark.sql("SELECT sum(price.float64), count(*) FROM Pizza WHERE timestamp.string IS NOT NULL AND item.string = 'Pizza'")
sql_results.show()
```
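Because each datatype lives under its own key, you can also consolidate the variants back into a single column when the analysis calls for it. As a minimal sketch, the following merges the two `rating` representations into one integer column:

```python
from pyspark.sql.functions import coalesce, col

# Prefer the int32 representation; fall back to casting the string one.
ratings = df.withColumn(
    "rating_merged",
    coalesce(col("rating.int32"), col("rating.string").cast("int"))
)
ratings.select("item.string", "rating_merged").show()
```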
##### Using full fidelity schema on SQL
Considering the same documents of the Spark example above, customers can use the following syntax example, where the account, key, and container names are placeholders:

```SQL
SELECT *
FROM OPENROWSET(
       'CosmosDB',
       'Account=<account-name>;Database=<database-name>;Key=<account-key>',
       FoodCollection
) WITH (
       -- expose each datatype key of the timestamp property as its own column
       timestamp varchar(50) '$.timestamp.string',
       timestamp_utc float '$.timestamp.float64'
) AS [FoodItems]
WHERE timestamp is not null or timestamp_utc is not null
```
Starting from the query above, customers can implement transformations using `cast`, `convert`, or any other T-SQL function to manipulate the data. Customers can also hide complex datatype structures by using views.
The MongoDB `_id` field is fundamental to every collection in MongoDB and originally has a hexadecimal representation. As you can see in the table above, the full fidelity schema will preserve its characteristics, creating a challenge for its visualization in Azure Synapse Analytics. For correct visualization, you must convert the `_id` datatype as below:
###### Working with the MongoDB `_id` field in Spark
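As a minimal sketch, assuming the `DataFrame` `df` was loaded as in the earlier example and that the analytical store exposes the raw ObjectId bytes under `_id.objectId`, a small UDF can convert the bytes into the familiar 24-character hexadecimal string:

```python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Convert the raw ObjectId bytes into their hexadecimal string form.
# The _id.objectId path is an assumption based on the full fidelity schema layout.
object_id_to_hex = udf(lambda b: b.hex() if b is not None else None, StringType())

df_converted = df.withColumn("_id_hex", object_id_to_hex(col("_id.objectId")))
df_converted.select("_id_hex").show()
```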