articles/synapse-analytics/spark/data-sources/apache-spark-cdm-connector.md
9 additions & 105 deletions
@@ -6,15 +6,15 @@ ms.author: AvinandaC
ms.service: synapse-analytics
ms.topic: conceptual
ms.subservice: spark
-ms.date: 03/10/2022
+ms.date: 02/03/2023
author: AvinandaMS
---
# Common Data Model (CDM) Connector for Azure Synapse Spark
The Synapse Spark Common Data Model (CDM) format reader/writer enables a Spark program to read and write CDM entities in a CDM folder via Spark dataframes.
-For information on defining CDM documents using CDM 1.0 see. [What is CDM and how to use it](/common-data-model/).
+For information on defining CDM documents using CDM 1.2, see [What is CDM and how to use it](/common-data-model/).
## High level functionality
@@ -35,6 +35,7 @@ The following capabilities are supported:
* Supports writing data using user-modifiable partition patterns.
* Supports use of Synapse managed identity and credentials.
* Supports resolving CDM alias locations used in imports using CDM adapter definitions described in a config.json.
+* Parallel writes aren't supported or recommended; there's no locking mechanism at the storage layer.
## Limitations
@@ -47,7 +48,10 @@ The following scenarios aren't supported:
* Write support for model.json isn't supported.
* Executing `com.microsoft.cdm.BuildInfo.version` will verify the version in use, as shown below.
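For instance, a quick check in a Synapse notebook or spark-shell (a minimal sketch; the class comes from the connector package itself):

```scala
// Prints the connector build version; useful to confirm which
// release of the spark-cdm-connector package is on the classpath.
println(com.microsoft.cdm.BuildInfo.version)
```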
-Spark 2.4 and Spark 3.1 are supported.
+Spark 2.4, 3.1, and 3.2 are supported.
+
+## Samples
+Check out the [sample code and CDM files](https://github.com/Azure/spark-cdm-connector/tree/spark3.2/samples) for a quick start.
## Reading data
@@ -62,8 +66,6 @@ When reading CSV data, the connector uses the Spark FAILFAST option by default.
.option("entity", "permissive") or .option("mode", "failfast")
```
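To put the option in context, here's a minimal read sketch. The storage account, container path, and Person entity are placeholder values; the option names follow the connector's documented samples:

```scala
val df = spark.read.format("com.microsoft.cdm")
  .option("storage", "mystorage.dfs.core.windows.net")              // placeholder storage account
  .option("manifestPath", "cdmdata/contacts/root.manifest.cdm.json") // container/path to the manifest
  .option("entity", "Person")                                        // entity name within the manifest
  .option("mode", "permissive")                                      // tolerate malformed CSV rows instead of failing
  .load()
```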
-For example, [here's an example Python sample.](https://github.com/Azure/spark-cdm-connector/blob/master/samples/SparkCDMsamplePython.ipynb)
-
## Writing data
When writing to a CDM folder, if the entity doesn't already exist in the CDM folder, a new entity and definition are created, added to the CDM folder, and referenced in the manifest. Two writing modes are supported: explicit (using a provided entity definition) and implicit (using the dataframe schema); a minimal sketch follows below.
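By way of illustration, an implicit-write sketch. The storage account and manifest path are placeholders, and the Parquet/gzip/append combination mirrors the Event example formerly in the Samples section:

```scala
import org.apache.spark.sql.SaveMode

df.write.format("com.microsoft.cdm")
  .option("storage", "mystorage.dfs.core.windows.net")                // placeholder storage account
  .option("manifestPath", "cdmdata/Contacts/default.manifest.cdm.json")
  .option("entity", "Event")                                          // created implicitly from df's schema
  .option("format", "parquet")                                        // write data files as Parquet
  .option("compression", "gzip")
  .mode(SaveMode.Append)                                              // add new files rather than overwrite
  .save()
```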
@@ -214,7 +216,7 @@ SaS Token Credential authentication to storage accounts is an extra option for a
|**Option**|**Description**|**Pattern and example usage**|
|----------|---------|:---------:|
-| sasToken |The sastoken to access the relative storageAccount with the correct permissions |\<token\>|
+| sasToken | The SAS token for accessing the storage account with the correct permissions. | \<token\> |
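For example, a sketch of a SAS-token read; the token string and paths are placeholders standing in for your own values:

```scala
val df = spark.read.format("com.microsoft.cdm")
  .option("storage", "mystorage.dfs.core.windows.net")               // placeholder storage account
  .option("manifestPath", "cdmdata/contacts/root.manifest.cdm.json")
  .option("entity", "Person")
  .option("sasToken", "<token>")                                     // container- or account-scoped SAS
  .load()
```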
@@ -433,103 +434,6 @@ val df= spark.createDataFrame(spark.sparkContext.parallelize(data, 2), schema)
+-- ...
```
-## Samples
-
-See https://github.com/Azure/spark-cdm-connector/tree/master/samples for sample code and CDM files.
-
-### Examples
-
-The following examples all use appId, appKey and tenantId variables initialized earlier in the code based on an Azure app registration that has been given Storage Blob Data Contributor permissions on the storage for write and Storage Blob Data Reader permissions for read.
-
-#### Read
-
-This code reads the Person entity from the CDM folder with manifest in `mystorage.dfs.core.windows.net/cdmdata/contacts/root.manifest.cdm.json`.
-This code writes the dataframe _df_ to a CDM folder with a manifest to `mystorage.dfs.core.windows.net/cdmdata/Contacts/default.manifest.cdm.json` with an Event entity.
-
-Event data is written as Parquet files, compressed with gzip, that are appended to the folder (new files
-#### Explicit write - using an entity definition stored in ADLS
-
-This code writes the dataframe _df_ to a CDM folder with manifest at
-`https://mystorage.dfs.core.windows.net/cdmdata/Contacts/root.manifest.cdm.json` with the entity Person. Person data is written as new CSV files (by default) which overwrite existing files in the folder.
-#### Explicit write - using an entity defined in the CDM GitHub
-
-This code writes the dataframe _df_ to a CDM folder with the manifest at `https://_mystorage_.dfs.core.windows.net/cdmdata/Teams/root.manifest.cdm.json` and a submanifest containing the TeamMembership entity, created in a TeamMembership subdirectory. TeamMembership data is written to CSV files (the default) that overwrite any existing data files. The TeamMembership entity definition is retrieved from the CDM CDN, at:
-The following datatype mappings are applied when converting CDM to/from Spark.
-
-|**Spark**|**CDM**|
-|---------|---------|
-|ShortType|SmallInteger|
-|IntegerType|Integer|
-|LongType|BigInteger|
-|DateType|Date|
-|Timestamp|DateTime (optionally Time, see below)|
-|StringType|String|
-|DoubleType|Double|
-|DecimalType(x,y)|Decimal (x,y) (default scale and precision are 18,4)|
-|FloatType|Float|
-|BooleanType|Boolean|
-|ByteType|Byte|
-
-The CDM Binary datatype isn't supported.
-
## Troubleshooting and known issues
* Ensure the decimal precision and scale of decimal data type fields used in the dataframe match the data type used in the CDM entity definition; this requires that precision and scale traits are defined on the data type. If the precision and scale aren't defined explicitly in CDM, the default is Decimal(18,4). For model.json files, Decimal is assumed to be Decimal(18,4).
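For instance, a minimal sketch of declaring a dataframe column that matches the CDM default of Decimal(18,4); the `salary` field name is purely illustrative:

```scala
import org.apache.spark.sql.types._

// Precision and scale must line up with the traits on the CDM entity
// definition; absent explicit traits, CDM defaults to Decimal(18,4).
val schema = StructType(Seq(
  StructField("salary", DecimalType(18, 4), nullable = true)
))
```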