|`$schema`| string | The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including `$schema` at the top of your file enables you to invoke schema and resource completions. |||
|`type`| const |`mltable` to abstract the schema definition for tabular data so that it's easier for consumers of the data to materialize the table into a Pandas/Dask/Spark dataframe |`mltable`|`mltable`|
|`paths`| array | A path can be a `file` path, a `folder` path, or a `pattern` of paths, where `pattern` specifies a search pattern that allows globbing (`*` and `**`) of files and folders containing data. Supported URI types are `azureml`, `https`, `wasbs`, `abfss`, and `adl`. See [Core YAML syntax](reference-yaml-core-syntax.md) for more information on how to use the `azureml://` URI format. |`file`, `folder`, `pattern`||
|`transformations`| array | Defined sequence of transformations that are applied to data loaded from the defined paths. |`read_delimited`, `read_parquet`, `read_json_lines`, `read_delta_lake`, `take` (take the first N rows from the dataset), `take_random_sample` (take a random sample of records from the dataset at approximately the specified probability), `drop_columns`, `keep_columns`, ... ||
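Putting these fields together, a minimal MLTable file might look like the following sketch. The file name and option values here are illustrative, not defaults:

```yaml
# Illustrative minimal MLTable file; the path and row count are hypothetical examples
type: mltable
paths:
  - file: ./my_data.csv
transformations:
  - read_delimited:
      delimiter: ','
  - take: 100   # keep only the first 100 rows
```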
These transformations apply to all mltable-artifact files:
- `convert_column_types`
  - `columns`: The name of the column whose type you want to convert.
  - `column_type`: The type you want to convert the column to. For example: string, float, int, or datetime with specified formats.
- `extract_partition_format_into_columns`: Specify the partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type.
    The format should start from the position of the first partition key and continue to the end of the file path. For example, given the path '../Accounts/2022/01/01/data.csv', where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2022-01-01'. Our principle here is to support transforms specific to data delivery, not to get into wider feature-engineering transforms.
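The Accounts example above could be sketched as an MLTable file like this. The path pattern is hypothetical and the exact key nesting may differ:

```yaml
# Hypothetical sketch of the Accounts partition example; exact option layout may vary
type: mltable
paths:
  - pattern: ./Accounts/*/*/*/data.csv
transformations:
  - read_delimited:
      delimiter: ','
  - extract_partition_format_into_columns:
      partition_format: '/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv'
```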
## MLTable transformations: read_delimited
The following transformations are specific to delimited files.
- header: Choose one of the following options: `no_header`, `from_first_file`, `all_files_different_headers`, `all_files_same_headers`. Defaults to `all_files_same_headers`.
- delimiter: The separator used to split columns.
- empty_as_string: Specify if empty field values should be loaded as empty strings. The default (`False`) will read empty field values as nulls. Passing this setting as `True` will read empty field values as empty strings. If the values are converted to numeric or datetime, then this setting has no effect, as empty values will be converted to nulls.
- include_path_column: Boolean to keep path information as a column in the table. Defaults to `False`. This setting is useful when you're reading multiple files and want to know which file a particular record originated from. It also lets you keep useful information encoded in the file path.
- support_multi_line: By default (support_multi_line=`False`), all line breaks, including line breaks in quoted field values, are interpreted as a record break. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may silently produce more records with misaligned field values. Set this to `True` when the delimited files are known to contain quoted line breaks.
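Combining the options above, a `read_delimited` transformation might be written like the following sketch. The folder path and option values are illustrative examples, not defaults:

```yaml
# Hypothetical sketch combining the read_delimited options described above
type: mltable
paths:
  - folder: ./logs
transformations:
  - read_delimited:
      header: all_files_same_headers
      delimiter: "\t"          # tab-separated files in this example
      empty_as_string: false   # empty fields load as nulls
      include_path_column: true
      support_multi_line: false
```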
## MLTable transformations: read_json_lines
Only flat JSON files are supported.
The following transformations are specific to JSON Lines files:
- `include_path_column`: Boolean to keep path information as a column in the MLTable. Defaults to `False`. This setting is useful when you're reading multiple files and want to know which file a particular record originated from. It also lets you keep useful information encoded in the file path.
- `invalid_lines`: How to handle lines that are invalid JSON. Supported values are `error` and `drop`. Defaults to `error`.
- `encoding`: Specify the file encoding. Supported encodings are `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom` and `windows1252`. Defaults to `utf8`.
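A `read_json_lines` transformation using these options might look like the following sketch. The file path is a hypothetical example and the exact option nesting may differ:

```yaml
# Hypothetical sketch; option names are taken from the list above
type: mltable
paths:
  - file: ./data.jsonl
transformations:
  - read_json_lines:
      encoding: utf8
      invalid_lines: drop       # skip lines that are not valid JSON
      include_path_column: true
```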
### Parquet files transformations
If you don't define options for the `read_parquet` transformation, the default options are selected (see below).
- `include_path_column`: Boolean to keep path information as a column in the table. Defaults to `False`. This setting is useful when you're reading multiple files and want to know which file a particular record originated from. It also lets you keep useful information encoded in the file path.
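A `read_parquet` transformation with its single documented option might be sketched as follows; the folder path is a hypothetical example:

```yaml
# Hypothetical sketch for read_parquet; the folder path is illustrative
type: mltable
paths:
  - folder: ./parquet_data
transformations:
  - read_parquet:
      include_path_column: true
```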
## MLTable transformations: read_delta_lake
```yaml
type: mltable
paths:
  - folder: abfss://my_delta_files
transformations:
  - read_delta_lake:
      timestamp_as_of: '2022-08-26T00:00:00Z'
```
### Delta Lake transformations
- `timestamp_as_of`: Datetime string in RFC-3339/ISO-8601 format to be specified for time-travel on the specific Delta Lake data.
- `version_as_of`: Version to be specified for time-travel on the specific Delta Lake data.
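For version-based time travel, a variant of the earlier `read_delta_lake` block could be sketched as follows; the folder URI is carried over from the example above and the version number is a hypothetical example:

```yaml
# Hypothetical variant using version_as_of instead of timestamp_as_of
type: mltable
paths:
  - folder: abfss://my_delta_files
transformations:
  - read_delta_lake:
      version_as_of: 1   # illustrative table version number
```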