Commit 78a5d95

Merge pull request #217990 from Rodrigossz/main

FFS update

2 parents a4fd2ef + 638d306

File tree: 1 file changed (+119 −9 lines changed)

articles/cosmos-db/analytical-store-introduction.md

Lines changed: 119 additions & 9 deletions
@@ -276,9 +276,9 @@ WITH (num varchar(100)) AS [IntToFloat]
The full fidelity schema representation is designed to handle the full breadth of polymorphic schemas in the schema-agnostic operational data. In this schema representation, no items are dropped from the analytical store even if the well-defined schema constraints (that is, no mixed data type fields nor mixed data type arrays) are violated.

-This is achieved by translating the leaf properties of the operational data into the analytical store with distinct columns based on the data type of values in the property. The leaf property names are extended with data types as a suffix in the analytical store schema such that they can be queried without ambiguity.

This is achieved by translating the leaf properties of the operational data into the analytical store as JSON `key-value` pairs, where the datatype is the `key` and the property content is the `value`. This JSON object representation allows queries without ambiguity, and you can individually analyze each datatype.

-In the full fidelity schema representation, each datatype of each property will generate a column for that datatype. Each of them counts as one of the 1000 maximum properties.

In other words, in the full fidelity schema representation, each datatype of each property of each document will generate a `key-value` pair in a JSON object for that property. Each of them counts toward the limit of 1000 maximum properties.

For example, let's take the following sample document in the transactional store:

@@ -296,11 +296,14 @@ salary: 1000000
}
```

-The leaf property `streetNo` within the nested object `address` will be represented in the analytical store schema as a column `address.object.streetNo.int32`. The datatype is added as a suffix to the column. This way, if another document is added to the transactional store where the value of the leaf property `streetNo` is "123" (note it's a string), the schema of the analytical store automatically evolves without altering the type of a previously written column. A new column is added to the analytical store as `address.object.streetNo.string`, where this value of "123" is stored.

The nested object `address` is a property at the root level of the document and will be represented as a column. Each leaf property in the `address` object will be represented as a JSON object: `{"object":{"streetNo":{"int32":15850},"streetName":{"string":"NE 40th St."},"zip":{"int32":98052}}}`.

-##### Data type to suffix map for full fidelity schema

Unlike the well-defined schema representation, the full fidelity method allows variation in datatypes. If the next document in this collection, continuing the example above, has `streetNo` as a string, it will be represented in the analytical store as `"streetNo":{"string":"15850"}`. In the well-defined schema method, it wouldn't be represented.

##### Datatypes map for full fidelity schema

-Here's a map of all the property data types and their suffix representations in the analytical store in full fidelity schema representation:

Here's a map of all the property data types and their representations in the analytical store in full fidelity schema representation:

|Original data type |Suffix |Example |
|---------|---------|---------|
@@ -324,11 +327,118 @@ Here's a map of all the property data types and their suffix representations in
* Spark pools in Azure Synapse will represent these columns as `undefined`.
* SQL serverless pools in Azure Synapse will represent these columns as `NULL`.

##### Using full fidelity schema on Spark

Spark will manage each datatype as a column when loading the data into a `DataFrame`. Let's assume a collection with the documents below.

```json
{
  "_id": "1",
  "item": "Pizza",
  "price": 3.49,
  "rating": 3,
  "timestamp": 1604021952.6790195
},
{
  "_id": "2",
  "item": "Ice Cream",
  "price": 1.59,
  "rating": "4",
  "timestamp": "2022-11-11 10:00 AM"
}
```
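
For context, such a collection can be read into a `DataFrame` through the same `cosmos.olap` reader that appears later in this article. A minimal sketch, assuming a Synapse linked service named `CosmosDBLinkedService` and a container named `FoodOrders` (both names are hypothetical):

```Python
# Read the container's analytical store through Azure Synapse Link.
# "CosmosDBLinkedService" and "FoodOrders" are hypothetical placeholder names.
df = spark.read.format("cosmos.olap")\
    .option("spark.synapse.linkedService", "CosmosDBLinkedService")\
    .option("spark.cosmos.container", "FoodOrders")\
    .load()
```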

While the first document has `rating` as a number and `timestamp` as a numeric (Unix epoch) value, the second document has `rating` and `timestamp` as strings. Assuming that this collection was loaded into a `DataFrame` without any data transformation, the output of `df.printSchema()` is:

```text
root
 |-- _rid: string (nullable = true)
 |-- _ts: long (nullable = true)
 |-- id: string (nullable = true)
 |-- _etag: string (nullable = true)
 |-- _id: struct (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- string: string (nullable = true)
 |-- price: struct (nullable = true)
 |    |-- float64: double (nullable = true)
 |-- rating: struct (nullable = true)
 |    |-- int32: integer (nullable = true)
 |    |-- string: string (nullable = true)
 |-- timestamp: struct (nullable = true)
 |    |-- float64: double (nullable = true)
 |    |-- string: string (nullable = true)
 |-- _partitionKey: struct (nullable = true)
 |    |-- string: string (nullable = true)
```

In the well-defined schema representation, both `rating` and `timestamp` of the second document wouldn't be represented. In the full fidelity schema, you can use the following examples to individually access each value of each datatype.

In the example below, we can use `PySpark` to run an aggregation:

```Python
df.groupBy(df.item.string).sum().show()
```
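
Because each datatype becomes its own nested field, the aggregation above groups only on the values of `item` that arrived as strings; a document whose `item` was stored with a different datatype would surface under a different nested field of the same column.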

In the example below, we can use Spark SQL to run another aggregation:

```Python
df.createOrReplaceTempView("Pizza")
sql_results = spark.sql("SELECT sum(price.float64),count(*) FROM Pizza where timestamp.string is not null and item.string = 'Pizza'")
sql_results.show()
```

##### Using full fidelity schema on SQL

Considering the same documents as in the Spark example above, customers can use the following syntax example:

```SQL
SELECT rating, timestamp_string, timestamp_utc
FROM OPENROWSET(PROVIDER = 'CosmosDB',
                CONNECTION = 'Account=<your-database-account-name>;Database=<your-database-name>',
                OBJECT = '<your-collection-name>',
                SERVER_CREDENTIAL = '<your-synapse-sql-server-credential-name>')
WITH (
    rating integer '$.rating.int32',
    timestamp_string varchar(50) '$.timestamp.string',
    timestamp_utc float '$.timestamp.float64'
) as HTAP
WHERE timestamp_string is not null or timestamp_utc is not null
```
408+
409+
Starting from the query above, customers can implement transformations using `cast`, `convert` or any other T-SQL function to manipulate your data. Customers can also hide complex datatype structures by using views.

```SQL
create view MyView as
SELECT MyRating = rating, MyTimestamp = convert(varchar(50), timestamp_utc)
FROM OPENROWSET(PROVIDER = 'CosmosDB',
                CONNECTION = 'Account=<your-database-account-name>;Database=<your-database-name>',
                OBJECT = '<your-collection-name>',
                SERVER_CREDENTIAL = '<your-synapse-sql-server-credential-name>')
WITH (
    rating integer '$.rating.int32',
    timestamp_utc float '$.timestamp.float64'
) as HTAP
WHERE timestamp_utc is not null
union all
SELECT MyRating = convert(int, rating_string), MyTimestamp = timestamp_string
FROM OPENROWSET(PROVIDER = 'CosmosDB',
                CONNECTION = 'Account=<your-database-account-name>;Database=<your-database-name>',
                OBJECT = '<your-collection-name>',
                SERVER_CREDENTIAL = '<your-synapse-sql-server-credential-name>')
WITH (
    rating_string varchar(50) '$.rating.string',
    timestamp_string varchar(50) '$.timestamp.string'
) as HTAP
WHERE timestamp_string is not null
```
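
Once the view is in place, a query as simple as `SELECT MyRating, MyTimestamp FROM MyView` sees a single, uniformly typed pair of columns, regardless of the datatype each document arrived with.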

##### Working with the MongoDB `_id` field

-The MongoDB `_id` field is fundamental to every collection in MongoDB and originally has a hexadecimal representation. As you can see in the table above, `Full Fidelity Schema` will preserve its characteristics, creating a challenge for its visualization in Azure Synapse Analytics. For correct visualization, you must convert the `_id` datatype as below:

The MongoDB `_id` field is fundamental to every collection in MongoDB and originally has a hexadecimal representation. As you can see in the table above, full fidelity schema will preserve its characteristics, creating a challenge for its visualization in Azure Synapse Analytics. For correct visualization, you must convert the `_id` datatype as below:

-###### Spark

###### Working with the MongoDB `_id` field in Spark

```Python
from pyspark.sql.types import *
```
@@ -345,7 +455,7 @@ df = spark.read.format("cosmos.olap")\

```Python
df.select("id", "_id.objectId").show()
```

-###### SQL

###### Working with the MongoDB `_id` field in SQL

```SQL
SELECT TOP 100 id=CAST(_id as VARBINARY(1000))
```
@@ -374,7 +484,7 @@ The schema representation type decision must be made at the same time that Synap
> In the command above, replace `create` with `update` for existing accounts.

With PowerShell:

```PowerShell
New-AzCosmosDBAccount -ResourceGroupName MyResourceGroup -Name MyCosmosDBDatabaseAccount -EnableAnalyticalStorage $true -AnalyticalStorageSchemaType "FullFidelity"
```
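
For existing accounts, the analogous update cmdlet (`Update-AzCosmosDBAccount`) can be used in the same way; this assumes the installed Az.CosmosDB module version exposes the same analytical storage parameters.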