Commit 5202acb

edit pass: apache-spark-cdm-connector
1 parent 515c15d commit 5202acb

1 file changed (+10 -10 lines changed)

articles/synapse-analytics/spark/data-sources/apache-spark-cdm-connector.md

Lines changed: 10 additions & 10 deletions
@@ -64,7 +64,7 @@ Entity partitions can be in a mix of formats (for example, CSV and Parquet). All

When the connector reads CSV data, it uses the Spark `failfast` option by default. If the number of columns isn't equal to the number of attributes in the entity, the connector returns an error.

-Alternatively, as of 0.19, the connector supports permissive mode (only for CSV files). With permissive mode, when a CSV row has a lower number of columns than the entity schema, the connector assigns null values for the missing columns. When a CSV row has more columns than the entity schema, the columns greater than the entity schema column count will be truncated to the schema column count. Usage is as follows:
+Alternatively, as of 0.19, the connector supports permissive mode (only for CSV files). With permissive mode, when a CSV row has fewer columns than the entity schema, the connector assigns null values for the missing columns. When a CSV row has more columns than the entity schema, columns beyond the entity schema column count are truncated to the schema column count. Usage is as follows:

```scala
.option("mode", "permissive") or .option("mode", "failfast")
```
@@ -94,7 +94,7 @@ The connector supports two writing modes:

## Common Data Model alias integration

-Common Data Model definition files use aliases in import statements to simplify the import statements and allow the location of the imported content to be late bound at execution time. Using aliases:
+Common Data Model definition files use aliases in import statements to simplify the imports and allow the location of the imported content to be late bound at runtime. Using aliases:

* Facilitates easy organization of Common Data Model files so that related Common Data Model definitions can be grouped together at different locations.
* Allows Common Data Model content to be accessed from different deployed locations at runtime.
@@ -173,7 +173,7 @@ The following options identify the logical entity definition for the entity that
|`entityDefinitionPath`|The location of the entity. It's the file path to the Common Data Model definition file relative to the model root, including the name of the entity in that file.|`<folderPath>/<entityName>.cdm.json/<entityName>`<br/>`"sales/customer.cdm.json/customer"`|
|`configPath`| The container and folder path to a *config.json* file that contains the adapter configurations for all aliases included in the entity definition file and any directly or indirectly referenced Common Data Model files. <br/><br/>This option is not required if *config.json* is in the model root folder.| `<container><folderPath>`|
|`useCdmStandardModelRoot` | Indicates that the model root is located at [https://cdm-schema.microsoft.com/CDM/logical/](https://github.com/microsoft/CDM/tree/master/schemaDocuments). Used to reference entity types defined in the Common Data Model CDN. Overrides `entityDefinitionStorage` and `entityDefinitionModelRoot` (if specified).<br/>| `"useCdmStandardModelRoot"` |
-|`cdmSource`|Defines how the `cdm` alias (if it's present in Common Data Model definition files) is resolved. If you use this option, it overrides any `cdm` adapter specified in the *config.json* file. Values are `builtin` or `referenced`. The default value is `referenced`.<br/><br/> If you set this option to `referenced`, the connector uses the latest published standard Common Data Model definitions at `https://cdm-schema.microsoft.com/logical/`. If you set this option to `builtin`, the connector uses the Common Data Model base definitions built in to the Common Data Model object model that the connector is using. <br/><br/> Note: <br/> * The Spark CDM connector might not be using the latest Common Data Model SDK, so it might not contain the latest published standard definitions. <br/> * The built-in definitions include only the top-level Common Data Model content, such as *foundations.cdm.json* or *primitives.cdm.json*. If you want to use lower-level standard Common Data Model definitions, either use `referenced` or include a `cdm` adapter in *config.json*.| `"builtin"\|"referenced"` |
+|`cdmSource`|Defines how the `cdm` alias (if it's present in Common Data Model definition files) is resolved. If you use this option, it overrides any `cdm` adapter specified in the *config.json* file. Values are `builtin` or `referenced`. The default value is `referenced`.<br/><br/> If you set this option to `referenced`, the connector uses the latest published standard Common Data Model definitions at `https://cdm-schema.microsoft.com/logical/`. If you set this option to `builtin`, the connector uses the Common Data Model base definitions built into the Common Data Model object model that the connector is using. <br/><br/> Note: <br/> * The Spark CDM connector might not be using the latest Common Data Model SDK, so it might not contain the latest published standard definitions. <br/> * The built-in definitions include only the top-level Common Data Model content, such as *foundations.cdm.json* or *primitives.cdm.json*. If you want to use lower-level standard Common Data Model definitions, either use `referenced` or include a `cdm` adapter in *config.json*.| `"builtin"\|"referenced"` |

In the preceding example, the full path to the customer entity definition object is `https://myAccount.dfs.core.windows.net/models/crm/core/sales/customer.cdm.json/customer`. In that path, *models* is the container in Azure Data Lake Storage.

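As a sketch of how these options combine on an explicit write, using the example paths above (the DataFrame `df` and the manifest path are assumptions, not values given in the article):

```scala
// Illustrative explicit write; df and the manifest path are assumptions.
df.write.format("com.microsoft.cdm")
  .option("storage", "myAccount.dfs.core.windows.net")
  .option("manifestPath", "models/crm/core/default.manifest.cdm.json")
  .option("entity", "customer")
  .option("entityDefinitionModelRoot", "models/crm/core")
  .option("entityDefinitionPath", "sales/customer.cdm.json/customer")
  .save()
```
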
@@ -191,11 +191,11 @@ You can use the following options to change folder organization and file format.

|**Option** |**Description** |**Pattern or example usage** |
|---------|---------|:---------:|
-|`useSubManifest`|If `true`, causes the target entity to be included in the root manifest via a submanifest. The submanifest and the entity definition are written into an entity folder beneath the root. Default is `false`.|`"true"\|"false"` |
-|`format`|Defines the file format. Current supported file formats are CSV and Parquet. Default is `csv`.|`"csv"\|"parquet"` <br/> |
-|`delimiter`|CSV only. Defines the delimiter that you're using. Default is comma. | `"\|"` |
-|`columnHeaders`| CSV only. If `true`, adds a first row to data files with column headers. Default is `true`.|`"true"\|"false"`|
-|`compression`|Write only. Parquet only. Defines the compression format that you're using. Default is `snappy`. |`"uncompressed" \| "snappy" \| "gzip" \| "lzo"` |
+|`useSubManifest`|If `true`, includes the target entity in the root manifest via a submanifest. The submanifest and the entity definition are written into an entity folder beneath the root. Default is `false`.|`"true"\|"false"` |
+|`format`|Defines the file format. Currently supported file formats are CSV and Parquet. Default is `csv`.|`"csv"\|"parquet"` |
+|`delimiter`|CSV only. Defines the delimiter that you're using. The default is a comma. | `"\|"` |
+|`columnHeaders`| CSV only. If `true`, adds a first row with column headers to data files. Default is `true`.|`"true"\|"false"`|
+|`compression`|Write only. Parquet only. Defines the compression format that you're using. The default is `snappy`. |`"uncompressed" \| "snappy" \| "gzip" \| "lzo"` |
|`dataFolderFormat`|Allows a user-definable data folder structure within an entity folder. Allows you to substitute date and time values into folder names by using `DateTimeFormatter` formatting. Non-formatter content must be enclosed in single quotation marks. Default format is `"yyyy"-"MM"-"dd"`, which produces folder names like *2020-07-30*.| `year "yyyy" / month "MM"` <br/> `"Data"`|

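To show how the table's options fit together, here's a sketch of a Parquet write; the account, manifest path, and entity name are illustrative placeholders:

```scala
// Hypothetical write exercising the folder and format options above.
df.write.format("com.microsoft.cdm")
  .option("storage", "myAccount.dfs.core.windows.net")
  .option("manifestPath", "models/default.manifest.cdm.json")
  .option("entity", "Customer")
  .option("format", "parquet")       // write Parquet instead of the default CSV
  .option("compression", "gzip")     // Parquet only
  .option("useSubManifest", "true")  // reference the entity via a submanifest
  .save()
```
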
### Save mode
@@ -208,7 +208,7 @@ The save mode specifies how the connector handles existing entity data in the Co
|`SaveMode.Append` |Appends data that's being written in new partitions alongside the existing partitions.<br/><br/>This mode doesn't support changing the schema. If the schema of the data that's being written is incompatible with the existing entity definition, the connector throws an error.|
|`SaveMode.ErrorIfExists`|Returns an error if partitions already exist.|

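As a sketch, the save mode is passed through Spark's standard writer API; the connector options here are placeholders:

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical append to an existing entity; option values are placeholders.
df.write.format("com.microsoft.cdm")
  .option("storage", "myAccount.dfs.core.windows.net")
  .option("manifestPath", "models/default.manifest.cdm.json")
  .option("entity", "Customer")
  .mode(SaveMode.Append) // add new partitions alongside existing ones
  .save()
```
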
-For details of how data files are named and organized on write, see the [Folder and file naming and organization](#folder-and-file-naming-and-organization) section later in this article.
+For details of how data files are named and organized on write, see the [Naming and organization of folders and files](#naming-and-organization-of-folders-and-files) section later in this article.

## Authentication

@@ -356,7 +356,7 @@ The connector interprets Common Data Model `DateTime` data type values as UTC. I

Common Data Model `DateTimeOffset` values intended for recording local time instants are handled differently in Spark and Parquet from CSV. CSV and other formats can express a local time instant as a structure that comprises a datetime, such as `2020-03-13 09:49:00-08:00`. Parquet and Spark don't support such structures. Instead, they use a `TIMESTAMP` data type that allows an instant to be recorded in UTC (or in an unspecified time zone).

-The Spark CDM connector converts a `DateTimeOffset` value in CSV to a UTC time stamp. This value is persisted as a time stamp in Parquet. If the value is subsequently persisted to CSV, it will be serialized as a `DateTimeOffset` value with a +00:00 offset. There's no loss of temporal accuracy. The serialized values represent the same instant as the original values, although the offset is lost.
+The Spark CDM connector converts a `DateTimeOffset` value in CSV to a UTC time stamp. This value is persisted as a time stamp in Parquet. If the value is later persisted to CSV, it's serialized as a `DateTimeOffset` value with a +00:00 offset. There's no loss of temporal accuracy. The serialized values represent the same instant as the original values, although the offset is lost.
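
The conversion this paragraph describes can be sketched with `java.time`; this mirrors the behavior and is not the connector's internal code:

```scala
import java.time.{OffsetDateTime, ZoneOffset}

// A CSV DateTimeOffset such as 2020-03-13 09:49:00-08:00 becomes a UTC instant.
val local = OffsetDateTime.parse("2020-03-13T09:49:00-08:00")
val utc   = local.withOffsetSameInstant(ZoneOffset.UTC)
// utc is 2020-03-13T17:49Z: the same instant, with the original offset lost.
```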

Spark systems use their system time as the baseline and normally express time by using that local time. UTC times can always be computed through application of the local system offset. For Azure systems in all regions, the system time is always UTC, so all timestamp values are normally in UTC. When you're using an implicit write, where a Common Data Model definition is derived from a DataFrame, timestamp columns are translated to attributes with the Common Data Model DateTime data type, which implies a UTC time.
