articles/synapse-analytics/spark/data-sources/apache-spark-cdm-connector.md
Entity partitions can be in a mix of formats (for example, CSV and Parquet).
When the connector reads CSV data, it uses the Spark `failfast` option by default. If the number of columns isn't equal to the number of attributes in the entity, the connector returns an error.
Alternatively, as of version 0.19, the connector supports permissive mode (for CSV files only). With permissive mode, when a CSV row has fewer columns than the entity schema, the connector assigns null values for the missing columns. When a CSV row has more columns than the entity schema, the row is truncated to the schema column count. Usage is as follows:
```scala
.option("mode", "permissive") or .option("mode", "failfast")
```
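
For context, a full read call might look like the following sketch. It assumes the connector's `com.microsoft.cdm` format name and its `storage` and `manifestPath` options, which this excerpt doesn't show; the account, container, and entity names are placeholders.

```scala
// Hypothetical read of a CDM entity from CSV partitions, tolerating
// short or long rows via permissive mode. Account, container, and
// entity names are placeholders.
val df = spark.read.format("com.microsoft.cdm")
  .option("storage", "myAccount.dfs.core.windows.net")
  .option("manifestPath", "models/crm/core/default.manifest.cdm.json")
  .option("entity", "customer")
  .option("mode", "permissive") // the default for CSV is failfast
  .load()
```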
## Common Data Model alias integration
Common Data Model definition files use aliases in import statements to simplify the imports and to allow the location of the imported content to be late bound at runtime. Using aliases:
* Facilitates easy organization of Common Data Model files so that related Common Data Model definitions can be grouped together at different locations.
* Allows Common Data Model content to be accessed from different deployed locations at runtime.
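
For example, a read that resolves aliases through a *config.json* stored outside the model root might look like the following sketch. The `configPath` option is described in the options table later in this article; the `storage` and `manifestPath` options are assumed from the connector's read API, and all paths are placeholders.

```scala
// Hypothetical read where aliases in the entity definition are resolved
// via adapters defined in /models/config/config.json. Paths and names
// are placeholders.
val df = spark.read.format("com.microsoft.cdm")
  .option("storage", "myAccount.dfs.core.windows.net")
  .option("manifestPath", "models/crm/core/default.manifest.cdm.json")
  .option("entity", "customer")
  .option("configPath", "/models/config")
  .load()
```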
The following options identify the logical entity definition for the entity that's being read or written.

|**Option**|**Description**|**Pattern or example usage**|
|---------|---------|:---------:|
|`entityDefinitionPath`|The location of the entity. It's the file path to the Common Data Model definition file relative to the model root, including the name of the entity in that file.|`<folderPath>/<entityName>.cdm.json/<entityName>`<br/>`"sales/customer.cdm.json/customer"`|
|`configPath`| The container and folder path to a *config.json* file that contains the adapter configurations for all aliases included in the entity definition file and any directly or indirectly referenced Common Data Model files. <br/><br/>This option is not required if *config.json* is in the model root folder.| `<container><folderPath>`|
|`useCdmStandardModelRoot`| Indicates that the model root is located at [https://cdm-schema.microsoft.com/CDM/logical/](https://github.com/microsoft/CDM/tree/master/schemaDocuments). Used to reference entity types defined in the Common Data Model CDN. Overrides `entityDefinitionStorage` and `entityDefinitionModelRoot` (if specified).<br/>|`"useCdmStandardModelRoot"`|
|`cdmSource`|Defines how the `cdm` alias (if it's present in Common Data Model definition files) is resolved. If you use this option, it overrides any `cdm` adapter specified in the *config.json* file. Values are `builtin` or `referenced`. The default value is `referenced`.<br/><br/> If you set this option to `referenced`, the connector uses the latest published standard Common Data Model definitions at `https://cdm-schema.microsoft.com/logical/`. If you set this option to `builtin`, the connector uses the Common Data Model base definitions built in to the Common Data Model object model that the connector is using. <br/><br/> Note: <br/> * The Spark CDM connector might not be using the latest Common Data Model SDK, so it might not contain the latest published standard definitions. <br/> * The built-in definitions include only the top-level Common Data Model content, such as *foundations.cdm.json* or *primitives.cdm.json*. If you want to use lower-level standard Common Data Model definitions, either use `referenced` or include a `cdm` adapter in *config.json*.| `"builtin"\|"referenced"` |
In the preceding example, the full path to the customer entity definition object is `https://myAccount.dfs.core.windows.net/models/crm/core/sales/customer.cdm.json/customer`. In that path, *models* is the container in Azure Data Lake Storage.
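
Putting those options together, an explicit write against that entity definition might look like the following sketch. The `entityDefinitionModelRoot` option is referenced in the table above, and the `storage` and `manifestPath` options are assumed from the connector's write API; the values mirror the preceding example and are placeholders.

```scala
// Hypothetical explicit write: the logical entity definition is
// resolved from a model root in the `models` container. All values
// are placeholders that mirror the preceding example.
df.write.format("com.microsoft.cdm")
  .option("storage", "myAccount.dfs.core.windows.net")
  .option("manifestPath", "models/crm/core/default.manifest.cdm.json")
  .option("entity", "customer")
  .option("entityDefinitionModelRoot", "models/crm/core")
  .option("entityDefinitionPath", "sales/customer.cdm.json/customer")
  .save()
```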
You can use the following options to change folder organization and file format.
|**Option**|**Description**|**Pattern or example usage**|
|---------|---------|:---------:|
|`useSubManifest`|If `true`, causes the target entity to be included in the root manifest via a submanifest. The submanifest and the entity definition are written into an entity folder beneath the root. Default is `false`.|`"true"\|"false"`|
|`format`|Defines the file format. The currently supported file formats are CSV and Parquet. Default is `csv`.|`"csv"\|"parquet"`|
|`delimiter`|CSV only. Defines the delimiter that you're using. Default is a comma.|`"\|"`|
|`columnHeaders`| CSV only. If `true`, adds a first row to data files with column headers. Default is `true`.|`"true"\|"false"`|
|`compression`|Write only. Parquet only. Defines the compression format that you're using. Default is `snappy`.|`"uncompressed" \| "snappy" \| "gzip" \| "lzo"`|
|`dataFolderFormat`|Allows a user-definable data folder structure within an entity folder. Allows you to substitute date and time values into folder names by using `DateTimeFormatter` formatting. Non-formatter content must be enclosed in single quotation marks. Default format is `"yyyy"-"MM"-"dd"`, which produces folder names like *2020-07-30*.|`year "yyyy" / month "MM"` <br/> `"Data"`|
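
As a sketch of how these options combine on write: the option names below come from the preceding table, while the `storage` and `manifestPath` options are assumptions based on the connector's write API, and all paths are placeholders.

```scala
// Hypothetical write of Parquet partitions with gzip compression,
// registered via a submanifest. Storage and path values are
// placeholders.
df.write.format("com.microsoft.cdm")
  .option("storage", "myAccount.dfs.core.windows.net")
  .option("manifestPath", "models/crm/core/default.manifest.cdm.json")
  .option("entity", "customer")
  .option("format", "parquet")
  .option("compression", "gzip")
  .option("useSubManifest", "true")
  .save()
```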
### Save mode
The save mode specifies how the connector handles existing entity data in the Common Data Model folder.

|**Save mode**|**Description**|
|---------|---------|
|`SaveMode.Append`|Appends data that's being written in new partitions alongside the existing partitions.<br/><br/>This mode doesn't support changing the schema. If the schema of the data that's being written is incompatible with the existing entity definition, the connector throws an error.|
|`SaveMode.ErrorIfExists`|Returns an error if partitions already exist.|
For details of how data files are named and organized on write, see the [Folder and file naming and organization](#naming-and-organization-of-folders-and-files) section later in this article.
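
For instance, appending new partitions to an existing entity might look like this sketch. The paths are placeholders, the `storage` and `manifestPath` options are assumed from the connector's write API, and the schema of `df` must be compatible with the existing entity definition.

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical append: new partitions are written alongside existing
// ones. The connector throws an error if the DataFrame schema is
// incompatible with the existing entity definition.
df.write.format("com.microsoft.cdm")
  .option("storage", "myAccount.dfs.core.windows.net")
  .option("manifestPath", "models/crm/core/default.manifest.cdm.json")
  .option("entity", "customer")
  .mode(SaveMode.Append)
  .save()
```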
## Authentication
The connector interprets Common Data Model `DateTime` data type values as UTC.
Common Data Model `DateTimeOffset` values intended for recording local time instants are handled differently in Spark and Parquet than in CSV. CSV and other formats can express a local time instant as a structure that combines a datetime with a UTC offset, such as `2020-03-13 09:49:00-08:00`. Parquet and Spark don't support such structures. Instead, they use a `TIMESTAMP` data type that allows an instant to be recorded in UTC (or in an unspecified time zone).
The Spark CDM connector converts a `DateTimeOffset` value in CSV to a UTC time stamp. This value is persisted as a time stamp in Parquet. If the value is later persisted to CSV, it will be serialized as a `DateTimeOffset` value with a +00:00 offset. There's no loss of temporal accuracy. The serialized values represent the same instant as the original values, although the offset is lost.
Spark systems use their system time as the baseline and normally express time by using that local time. UTC times can always be computed through application of the local system offset. For Azure systems in all regions, the system time is always UTC, so all timestamp values are normally in UTC. When you're using an implicit write, where a Common Data Model definition is derived from a DataFrame, timestamp columns are translated to attributes with the Common Data Model DateTime data type, which implies a UTC time.
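
The offset-to-UTC conversion can be reproduced in plain Spark, independent of the connector. The following sketch parses the example value from above into a `TIMESTAMP`; the UTC instant is preserved while the original -08:00 offset is discarded.

```scala
import org.apache.spark.sql.functions.to_timestamp
import spark.implicits._

// Parse an offset-bearing string into a Spark TIMESTAMP. With
// spark.sql.session.timeZone set to UTC, the result displays as
// 2020-03-13 17:49:00: the same instant, with the offset discarded.
val ts = Seq("2020-03-13 09:49:00-08:00").toDF("raw")
  .select(to_timestamp($"raw", "yyyy-MM-dd HH:mm:ssXXX").as("ts"))
```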