
Commit 9e2941a

edit pass: apache-spark-cdm-connector
1 parent 5202acb commit 9e2941a

File tree

1 file changed

articles/synapse-analytics/spark/data-sources/apache-spark-cdm-connector.md

Lines changed: 10 additions & 10 deletions
@@ -181,7 +181,7 @@ In the preceding example, the full path to the customer entity definition object

 If you don't specify a logical entity definition on write, the entity is written implicitly, based on the DataFrame schema.

-When you're writing implicitly, a timestamp column is normally interpreted as a Common Data Model `DateTime` data type. You can override this interpretation to create an attribute of the Common Data Model `Time` data type by providing a metadata object that's associated with the column that specifies the data type. For details, see [Handling Common Data Model time data](#handling-common-data-model-time-data) later in this article.
+When you're writing implicitly, a time stamp column is normally interpreted as a Common Data Model `DateTime` data type. You can override this interpretation to create an attribute of the Common Data Model `Time` data type by providing a metadata object that's associated with the column that specifies the data type. For details, see [Handling Common Data Model time data](#handling-common-data-model-time-data) later in this article.

 Support for writing time data exists for CSV files only. That support currently doesn't extend to Parquet.

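For readers following this hunk, here's a minimal sketch of the implicit write it describes. It assumes the connector's `com.microsoft.cdm` format name and its `storage`, `manifestPath`, `entity`, and `format` options; the account, container, and entity values are placeholders, and the toy DataFrame stands in for real data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// A toy DataFrame; with an implicit write, the Common Data Model
// entity definition is derived from this schema.
val df = Seq(("a", 1), ("b", 2)).toDF("name", "value")

// No logical entity definition is supplied, so the write is implicit.
// Storage account, container, and entity names are placeholders.
df.write.format("com.microsoft.cdm")
  .option("storage", "<accountName>.dfs.core.windows.net")
  .option("manifestPath", "<container>/default.manifest.cdm.json")
  .option("entity", "Event")
  .option("format", "csv") // per the hunk above, Time data is CSV-only
  .save()
```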
@@ -229,7 +229,7 @@ In both cases, no extra connector options are required.

 SAS token credentials are an extra option for authentication to storage accounts. With SAS token authentication, the SAS token can be at the container or folder level. The appropriate permissions are required:

-* Read permissions for a manifest or partition needs only read-level support.
+* Read permissions for a manifest or partition need only read-level support.
 * Write permissions need both read and write support.

 | **Option** |**Description** |**Pattern and example usage** |
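To make the SAS option concrete, here's a hedged sketch of a read authenticated with a container- or folder-level SAS token; `sasToken` is the option the table above goes on to describe, and every value shown is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read with SAS token authentication. The token must grant at least
// read permission on the manifest and partitions; all values are placeholders.
val df = spark.read.format("com.microsoft.cdm")
  .option("storage", "<accountName>.dfs.core.windows.net")
  .option("manifestPath", "<container>/default.manifest.cdm.json")
  .option("entity", "Customer")
  .option("sasToken", "<sasToken>")
  .load()
```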
@@ -350,15 +350,15 @@ The connector doesn't support the Common Data Model `Binary` data type.

 ### Handling Common Data Model Date, DateTime, and DateTimeOffset data

-The Spark CDM connector handles Common Data Model `Date` and `DateTime` data type as normal for Spark and Parquet. In CSV, the connector reads and writes those data types in ISO 8601 format.
+The Spark CDM connector handles Common Data Model `Date` and `DateTime` data types as normal for Spark and Parquet. In CSV, the connector reads and writes those data types in ISO 8601 format.

 The connector interprets Common Data Model `DateTime` data type values as UTC. In CSV, the connector writes those values in ISO 8601 format. An example is `2020-03-13 09:49:00Z`.

 Common Data Model `DateTimeOffset` values intended for recording local time instants are handled differently in Spark and Parquet from CSV. CSV and other formats can express a local time instant as a structure that comprises a datetime, such as `2020-03-13 09:49:00-08:00`. Parquet and Spark don't support such structures. Instead, they use a `TIMESTAMP` data type that allows an instant to be recorded in UTC (or in an unspecified time zone).

 The Spark CDM connector converts a `DateTimeOffset` value in CSV to a UTC time stamp. This value is persisted as a time stamp in Parquet. If the value is later persisted to CSV, it will be serialized as a `DateTimeOffset` value with a +00:00 offset. There's no loss of temporal accuracy. The serialized values represent the same instant as the original values, although the offset is lost.

-Spark systems use their system time as the baseline and normally express time by using that local time. UTC times can always be computed through application of the local system offset. For Azure systems in all regions, the system time is always UTC, so all timestamp values are normally in UTC. When you're using an implicit write, where a Common Data Model definition is derived from a DataFrame, timestamp columns are translated to attributes with the Common Data Model DateTime data type, which implies a UTC time.
+Spark systems use their system time as the baseline and normally express time by using that local time. UTC times can always be computed through application of the local system offset. For Azure systems in all regions, the system time is always UTC, so all time stamp values are normally in UTC. When you're using an implicit write, where a Common Data Model definition is derived from a DataFrame, time stamp columns are translated to attributes with the Common Data Model `DateTime` data type, which implies a UTC time.

 If it's important to persist a local time and the data will be processed in Spark or persisted in Parquet, we recommend that you use a `DateTime` attribute and keep the offset in a separate attribute. For example, you can keep the offset as a signed integer value that represents minutes. In Common Data Model, DateTime values are in UTC, so you must apply the offset to compute local time.

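As a sketch of the recommendation in the last changed paragraph above: the following assumes hypothetical `eventTimeUtc` and `offsetMinutes` columns (the UTC instant plus a signed offset in minutes) and derives local time by applying the stored offset:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical data: a UTC instant plus a signed offset in minutes
// (an original -08:00 offset becomes -480).
val df = Seq(
  (Timestamp.valueOf("2020-03-13 17:49:00"), -480)
).toDF("eventTimeUtc", "offsetMinutes")

// Recompute local time on demand by applying the stored offset to the UTC value.
val withLocalTime = df.withColumn(
  "eventTimeLocal",
  expr("eventTimeUtc + make_interval(0, 0, 0, 0, 0, offsetMinutes, 0)")
)
```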
@@ -368,7 +368,7 @@ In most cases, persisting local time isn't important. Local times are often requ

 Spark doesn't support an explicit `Time` data type. An attribute with the Common Data Model `Time` data type is represented in a Spark DataFrame as a column with a `Timestamp` data type. When the Spark CDM connector reads a time value, the time stamp in the DataFrame is initialized with the Spark epoch date 01/01/1970 plus the time value as read from the source.

-When you use explicit write, you can map a time stamp column to either a `DateTime` or `Time` attribute. If you map a time stamp to a `Time` attribute, the date portion of the timestamp is stripped off.
+When you use explicit write, you can map a time stamp column to either a `DateTime` or `Time` attribute. If you map a time stamp to a `Time` attribute, the date portion of the time stamp is stripped off.

 When you use implicit write, a time stamp column is mapped by default to a `DateTime` attribute. To map a time stamp column to a `Time` attribute, you must add a metadata object to the column in the DataFrame that indicates that the time stamp should be interpreted as a time value. The following code shows how to do this in Scala:

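The Scala sample that the last context line introduces falls outside this hunk. As a sketch, assuming the connector recognizes a column-level `dataType` metadata entry with the value `Time` (as the surrounding text describes), the metadata object can be attached when the schema is built:

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StructField, StructType, TimestampType}

// Assumption: the connector treats column metadata "dataType" -> "Time"
// as a request to emit a Common Data Model Time attribute instead of DateTime.
val timeMetadata = new MetadataBuilder().putString("dataType", "Time").build()

// The column stays a Spark Timestamp; only the attached metadata changes
// how the connector maps it on write.
val schema = StructType(Seq(
  StructField("ATimeColumn", TimestampType, true, timeMetadata)
))
```

A DataFrame created with this schema then writes `ATimeColumn` as a `Time` attribute rather than the default `DateTime`.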
@@ -399,7 +399,7 @@ Here's an example of an explicit write that's defined by a referenced entity def

 ```text
 +-- <CDMFolder>
-|-- default.manifest.cdm.json << with entity ref and partition info
+|-- default.manifest.cdm.json << with entity reference and partition info
 +-- <Entity>
 |-- <entity>.cdm.json << resolved physical entity definition
 |-- <data folder>
@@ -428,7 +428,7 @@ Here's an example of an implicit write in which the entity definition is derived
 +-- <Entity>
 |-- <entity>.cdm.json << resolved physical entity definition
 +-- LogicalDefinition
-| +-- <entity>.cdm.json << logical entity definition(s)
+| +-- <entity>.cdm.json << logical entity definitions
 |-- <data folder>
 |-- <data folder>
 +-- ...
@@ -438,12 +438,12 @@ Here's an example of an implicit write with a submanifest:

 ```text
 +-- <CDMFolder>
-|-- default.manifest.cdm.json << contains reference to sub-manifest
+|-- default.manifest.cdm.json << contains reference to submanifest
 +-- <Entity>
 |-- <entity>.cdm.json << resolved physical entity definition
-|-- <entity>.manifest.cdm.json << sub-manifest with reference to the entity and partition info
+|-- <entity>.manifest.cdm.json << submanifest with reference to the entity and partition info
 +-- LogicalDefinition
-| +-- <entity>.cdm.json << logical entity definition(s)
+| +-- <entity>.cdm.json << logical entity definitions
 |-- <data folder>
 |-- <data folder>
 +-- ...
