
Commit 1c14f4d

Merge pull request #226309 from kecheung/kecheung-patch-1
Fix docs
2 parents 873eaa2 + 85a13b6

File tree

1 file changed: +9 −105 lines changed

articles/synapse-analytics/spark/data-sources/apache-spark-cdm-connector.md

Lines changed: 9 additions & 105 deletions
@@ -6,15 +6,15 @@ ms.author: AvinandaC
 ms.service: synapse-analytics
 ms.topic: conceptual
 ms.subservice: spark
-ms.date: 03/10/2022
+ms.date: 02/03/2023
 author: AvinandaMS
 ---

 # Common Data Model (CDM) Connector for Azure Synapse Spark

 The Synapse Spark Common Data Model (CDM) format reader/writer enables a Spark program to read and write CDM entities in a CDM folder via Spark dataframes.

-For information on defining CDM documents using CDM 1.0 see. [What is CDM and how to use it](/common-data-model/).
+For information on defining CDM documents using CDM 1.2, see [What is CDM and how to use it](/common-data-model/).

 ## High level functionality

@@ -35,6 +35,7 @@ The following capabilities are supported:
 * Supports writing data using user modifiable partition patterns.
 * Supports use of managed identity Synapse and credentials.
 * Supports resolving CDM aliases locations used in imports using CDM adapter definitions described in a config.json.
+* Parallel writes aren't supported and aren't recommended; there is no locking mechanism at the storage layer.

 ## Limitations

@@ -47,7 +48,10 @@ The following scenarios aren't supported:
 * Write support for model.json isn't supported.
 * Executing ```com.microsoft.cdm.BuildInfo.version``` will verify the version.

-Spark 2.4 and Spark 3.1 are supported.
+Spark 2.4, 3.1, and 3.2 are supported.
+
+## Samples
+Check out the [sample code and CDM files](https://github.com/Azure/spark-cdm-connector/tree/spark3.2/samples) for a quick start.

 ## Reading data

@@ -62,8 +66,6 @@ When reading CSV data, the connector uses the Spark FAILFAST option by default.
 .option("entity", "permissive") or .option("mode", "failfast")
 ```

-For example, [here's an example Python sample.](https://github.com/Azure/spark-cdm-connector/blob/master/samples/SparkCDMsamplePython.ipynb)
-
 ## Writing data

 When writing to a CDM folder, if the entity doesn't already exist in the CDM folder, a new entity and definition is created and added to the CDM folder and referenced in the manifest. Two writing modes are supported:
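The implicit write path described in this hunk can be sketched in Scala. This is a minimal, hypothetical example (the storage account, manifest path, and entity name are placeholders) in which the entity definition is derived from the dataframe schema because no explicit definition is supplied:

```scala
import org.apache.spark.sql.SaveMode

// Sketch only: implicit (schema-only) write of an existing dataframe `df`.
// "mystorage" is a placeholder storage account; the Event entity definition
// is created from df's schema and added to the manifest if it doesn't exist.
df.write.format("com.microsoft.cdm")
  .option("storage", "mystorage.dfs.core.windows.net")
  .option("manifestPath", "cdmdata/Contacts/default.manifest.cdm.json")
  .option("entity", "Event")
  .option("format", "parquet")    // CSV is the default format
  .option("compression", "gzip")
  .mode(SaveMode.Append)          // append new files without deleting existing ones
  .save()
```

Running this requires a Synapse Spark session with the CDM connector on the classpath; it is not a standalone program.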
@@ -214,7 +216,7 @@ SaS Token Credential authentication to storage accounts is an extra option for a
 | **Option** |**Description** |**Pattern and example usage** |
 |----------|---------|:---------:|
-| sasToken |The sastoken to access the relative storageAccount with the correct permissions | \<token\>|
+| sasToken |The sastoken to access the relative storageAccount with the correct permissions | \<token\>|

 ### Credential-based access control options
@@ -292,8 +294,7 @@ df.write.format("com.microsoft.cdm")
 .option("manifestPath", "cdmdata/Teams/root.manifest.cdm.json")
 .option("entity", "TeamMembership")
 .option("useCdmStandardModelRoot", true)
-.option("entityDefinitionPath", "core/applicationCommon/TeamMembership.cdm.json/Tea
-mMembership")
+.option("entityDefinitionPath", "core/applicationCommon/TeamMembership.cdm.json/TeamMembership")
 .option("useSubManifest", true)
 .mode(SaveMode.Overwrite)
 .save()
@@ -433,103 +434,6 @@ val df= spark.createDataFrame(spark.sparkContext.parallelize(data, 2), schema)
 +-- ...
 ```

-## Samples
-
-See https://github.com/Azure/spark-cdm-connector/tree/master/samples for sample code and CDM files.
-
-### Examples
-
-The following examples all use appId, appKey and tenantId variables initialized earlier in the code based on an Azure app registration that has been given Storage Blob Data Contributor permissions on the storage for write and Storage Blob Data Reader permissions for read.
-
-#### Read
-
-This code reads the Person entity from the CDM folder with manifest in `mystorage.dfs.core.windows.net/cdmdata/contacts/root.manifest.cdm.json`.
-
-```scala
-val df = spark.read.format("com.microsoft.cdm")
-.option("storage", "mystorage.dfs.core.windows.net")
-.option("manifestPath", "cdmdata/contacts/root.manifest.cdm.json")
-.option("entity", "Person")
-.load()
-```
-
-#### Implicit write – using dataframe schema only
-
-This code writes the dataframe _df_ to a CDM folder with a manifest to `mystorage.dfs.core.windows.net/cdmdata/Contacts/default.manifest.cdm.json` with an Event entity.
-
-Event data is written as Parquet files, compressed with gzip, that are appended to the folder (new files
-are added without deleting existing files).
-
-```scala
-
-df.write.format("com.microsoft.cdm")
-.option("storage", "mystorage.dfs.core.windows.net")
-.option("manifestPath", "cdmdata/Contacts/default.manifest.cdm.json")
-.option("entity", "Event")
-.option("format", "parquet")
-.option("compression", "gzip")
-.mode(SaveMode.Append)
-.save()
-```
-
-#### Explicit write - using an entity definition stored in ADLS
-
-This code writes the dataframe _df_ to a CDM folder with manifest at
-`https://mystorage.dfs.core.windows.net/cdmdata/Contacts/root.manifest.cdm.json` with the entity Person. Person data is written as new CSV files (by default) which overwrite existing files in the folder.
-The Person entity definition is retrieved from
-`https://mystorage.dfs.core.windows.net/models/cdmmodels/core/Contacts/Person.cdm.json`
-
-```scala
-df.write.format("com.microsoft.cdm")
-.option("storage", "mystorage.dfs.core.windows.net")
-.option("manifestPath", "cdmdata/contacts/root.manifest.cdm.json")
-.option("entity", "Person")
-.option("entityDefinitionModelRoot", "cdmmodels/core")
-.option("entityDefinitionPath", "/Contacts/Person.cdm.json/Person")
-.mode(SaveMode.Overwrite)
-.save()
-```
-
-#### Explicit write - using an entity defined in the CDM GitHub
-
-This code writes the dataframe _df_ to a CDM folder with the manifest at `https://_mystorage_.dfs.core.windows.net/cdmdata/Teams/root.manifest.cdm.json` and a submanifest containing the TeamMembership entity, created in a TeamMembership subdirectory. TeamMembership data is written to CSV files (the default) that overwrite any existing data files. The TeamMembership entity definition is retrieved from the CDM CDN, at:
-[https://cdm-schema.microsoft.com/logical/core/applicationCommon/TeamMembership.cdm.json](https://cdm-schema.microsoft.com/logical/core/applicationCommon/TeamMembership.cdm.json)
-
-```scala
-df.write.format("com.microsoft.cdm")
-.option("storage", "mystorage.dfs.core.windows.net")
-.option("manifestPath", "cdmdata/Teams/root.manifest.cdm.json")
-.option("entity", "TeamMembership")
-.option("useCdmStandardModelRoot", true)
-.option("entityDefinitionPath", "core/applicationCommon/TeamMembership.cdm.json/Tea
-mMembership")
-.option("useSubManifest", true)
-.mode(SaveMode.Overwrite)
-.save()
-```
-
-### Other considerations
-
-#### Spark to CDM datatype mapping
-
-The following datatype mappings are applied when converting CDM to/from Spark.
-
-|**Spark** |**CDM**|
-|---------|---------|
-|ShortType|SmallInteger|
-|IntegerType|Integer|
-|LongType |BigInteger|
-|DateType |Date|
-|Timestamp|DateTime (optionally Time, see below)|
-|StringType|String|
-|DoubleType|Double|
-|DecimalType(x,y)|Decimal (x,y) (default scale and precision are 18,4)|
-|FloatType|Float|
-|BooleanType|Boolean|
-|ByteType|Byte|
-
-The CDM Binary datatype isn't supported.
-
 ## Troubleshooting and known issues

 * Ensure the decimal precision and scale of decimal data type fields used in the dataframe match the data type used in the CDM entity definition - requires precision and scale traits are defined on the data type. If the precision and scale aren't defined explicitly in CDM, the default used is Decimal(18,4). For model.json files, Decimal is assumed to be Decimal(18,4).
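The decimal guidance above can be made concrete. A minimal Scala sketch, assuming the CDM entity declares the column as Decimal(18,4) (the default when precision and scale aren't defined explicitly); the column name is hypothetical:

```scala
import org.apache.spark.sql.types._

// Sketch only: declare the dataframe decimal column with the same precision
// and scale as the CDM entity definition, here the default Decimal(18,4).
// A mismatch (e.g. DecimalType(38,18) against Decimal(18,4)) is a common
// source of write-time errors.
val schema = StructType(Seq(
  StructField("amount", DecimalType(18, 4), nullable = true)
))
```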
