
Commit afbb48e

Merge pull request #222970 from SturgeonMi/patch-19
Update reference-yaml-mltable.md
2 parents 83b15cb + 2b87106 commit afbb48e

File tree: 1 file changed, +11 −8 lines

articles/machine-learning/reference-yaml-mltable.md

Lines changed: 11 additions & 8 deletions
@@ -25,7 +25,7 @@ The ideal scenarios to use mltable are:
 - The schema of your data is complex and/or changes frequently.
 - You only need a subset of data. (for example: a sample of rows or files, specific columns, etc.)
 - AutoML jobs requiring tabular data.
-If your scenario does not fit the above, then it is likely that [URIs](reference-yaml-data.md) are a more suitable type.
+If your scenario doesn't fit the above, then it's likely that [URIs](reference-yaml-data.md) are a more suitable type.
 
 The source JSON schema can be found at https://azuremlschemas.azureedge.net/latest/MLTable.schema.json.
 
@@ -41,7 +41,7 @@ The source JSON schema can be found at https://azuremlschemas.azureedge.net/late
 | Key | Type | Description | Allowed values | Default value |
 | --- | ---- | ----------- | -------------- | ------------- |
 | `$schema` | string | The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including `$schema` at the top of your file enables you to invoke schema and resource completions. | | |
-| `type` | const | `mltable` to abstract the schema definition for tabular data so that it is easier for consumers of the data to materialize the table into a Pandas/Dask/Spark dataframe | `mltable` | `mltable`|
+| `type` | const | `mltable` to abstract the schema definition for tabular data so that it's easier for consumers of the data to materialize the table into a Pandas/Dask/Spark dataframe | `mltable` | `mltable`|
 | `paths` | array | Paths can be a `file` path, `folder` path or `pattern` for paths. `pattern` specifies a search pattern to allow globbing(* and **) of files and folders containing data. Supported URI types are `azureml`, `https`, `wasbs`, `abfss`, and `adl`. See [Core yaml syntax](reference-yaml-core-syntax.md) for more information on how to use the `azureml://` URI format. |`file`, `folder`, `pattern` | |
 | `transformations`| array | Defined sequence of transformations that are applied to data loaded from defined paths. |`read_delimited`, `read_parquet` , `read_json_lines` , `read_delta_lake`, `take` to take the first N rows from dataset, `take_random_sample` to take a random sample of records in the dataset approximately by the probability specified, `drop_columns`, `keep_columns`,... ||
 
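To make the keys in the table above concrete, a minimal MLTable file might combine them as follows. This is a sketch only: the glob pattern and the `take` count are illustrative, and the `read_delimited` options are covered later in this article.

```yaml
# Hypothetical minimal MLTable file; the pattern and values are illustrative.
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable
paths:
  # Glob all CSV files in the folder containing this MLTable file.
  - pattern: ./*.csv
transformations:
  - read_delimited:
      delimiter: ','
  # Keep only the first 100 rows when materializing the table.
  - take: 100
```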
@@ -84,6 +84,9 @@ These transformations apply to all mltable-artifact files:
 - `convert_column_types`
   - `columns`: The column name you want to convert type of.
   - `column_type`: The type you want to convert the column to. For example: string, float, int, or datetime with specified formats.
+- `extract_partition_format_into_columns`: Specify the partition format of path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type.
+
+  The format should start from the position of first partition key until the end of file path. For example, given the path '../Accounts/2022/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2022-01-01'. Our principle here is to support transforms specific to data delivery and not to get into wider feature engineering transforms.
 
 ## MLTable transformations: read_delimited
 
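The partition-format example from the added text above might be expressed in an MLTable file roughly as follows. This is a sketch: the glob pattern is illustrative, and the `partition_format` sub-key name is an assumption inferred from the `partition_format='...'` notation in the description, not confirmed by the schema.

```yaml
type: mltable
paths:
  # Illustrative pattern matching paths like ./Accounts/2022/01/01/data.csv
  - pattern: ./**/data.csv
transformations:
  - extract_partition_format_into_columns:
      # 'partition_format' is a hypothetical sub-key name; the format string
      # itself is taken from the example in the text above. It yields a string
      # column 'Department' and a datetime column 'PartitionDate'.
      partition_format: '/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv'
```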
@@ -106,8 +109,8 @@ The following transformations are specific to delimited files.
 - header: user can choose one of the following options: `no_header`, `from_first_file`, `all_files_different_headers`, `all_files_same_headers`. Defaults to `all_files_same_headers`.
 - delimiter: The separator used to split columns.
 - empty_as_string: Specify if empty field values should be loaded as empty strings. The default (`False`) will read empty field values as nulls. Passing this setting as `True` will read empty field values as empty strings. If the values are converted to numeric or datetime, then this setting has no effect, as empty values will be converted to nulls.
-- include_path_column: Boolean to keep path information as column in the table. Defaults to `False`. This setting is useful when you are reading multiple files, and want to know which file a particular record originated from. And you can also keep useful information in file path.
-- support_multi_line: By default (support_multi_line=`False`), all line breaks, including those in quoted field values, will be interpreted as a record break. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may result in silently producing more records with misaligned field values. This setting should be set to `True` when the delimited files are known to contain quoted line breaks.
+- include_path_column: Boolean to keep path information as column in the table. Defaults to `False`. This setting is useful when you're reading multiple files, and want to know which file a particular record originated from. And you can also keep useful information in file path.
+- support_multi_line: By default (support_multi_line=`False`), all line breaks, including those line breaks in quoted field values, will be interpreted as a record break. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may result in silently producing more records with misaligned field values. This setting should be set to `True` when the delimited files are known to contain quoted line breaks.
 
 ## MLTable transformations: read_json_lines
 ```yaml
@@ -138,7 +141,7 @@ transformations:
 Only flat Json files are supported.
 Below are the supported transformations that are specific for json lines:
 
-- `include_path_column` Boolean to keep path information as column in the MLTable. Defaults to False. This setting is useful when you are reading multiple files, and want to know which file a particular record originated from. And you can also keep useful information in file path.
+- `include_path_column` Boolean to keep path information as column in the MLTable. Defaults to False. This setting is useful when you're reading multiple files, and want to know which file a particular record originated from. And you can also keep useful information in file path.
 - `invalid_lines` How to handle lines that are invalid JSON. Supported values are `error` and `drop`. Defaults to `error`.
 - `encoding` Specify the file encoding. Supported encodings are `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom` and `windows1252`. Default is `utf8`.
 
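Putting the three `read_json_lines` options above together, a sketch of an MLTable file might look like the following; the file path is hypothetical and the option values are illustrative.

```yaml
type: mltable
paths:
  # Hypothetical JSON Lines file path.
  - file: ./logs.jsonl
transformations:
  - read_json_lines:
      encoding: utf8
      # Silently skip lines that are not valid JSON instead of erroring.
      invalid_lines: drop
      # Add a column recording which file each record came from.
      include_path_column: true
```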
@@ -156,7 +159,7 @@ transformations:
 ### Parquet files transformations
 If the user doesn't define options for `read_parquet` transformation, default options will be selected (see below).
 
-- `include_path_column`: Boolean to keep path information as column in the table. Defaults to False. This setting is useful when you are reading multiple files, and want to know which file a particular record originated from. And you can also keep useful information in file path.
+- `include_path_column`: Boolean to keep path information as column in the table. Defaults to False. This setting is useful when you're reading multiple files, and want to know which file a particular record originated from. And you can also keep useful information in file path.
 
 ## MLTable transformations: read_delta_lake
 ```yaml
@@ -165,14 +168,14 @@ type: mltable
 paths:
 - folder: abfss://my_delta_files
 
-transforms:
+transformations:
 - read_delta_lake:
     timestamp_as_of: '2022-08-26T00:00:00Z'
 ```
 
 ### Delta lake transformations
 
-- `timestamp_as_of`: Timestamp to be specified for time-travel on the specific Delta Lake data.
+- `timestamp_as_of`: Datetime string in RFC-3339/ISO-8601 format to be specified for time-travel on the specific Delta Lake data.
 - `version_as_of`: Version to be specified for time-travel on the specific Delta Lake data.
 
 ## Next steps
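As a counterpart to the `timestamp_as_of` snippet in the diff above, a sketch using the `version_as_of` alternative might look like this; the version number is illustrative, and the folder path is the one from the snippet above.

```yaml
type: mltable
paths:
- folder: abfss://my_delta_files

transformations:
- read_delta_lake:
    # Time-travel to a specific Delta Lake table version (illustrative value);
    # use either version_as_of or timestamp_as_of, not both.
    version_as_of: 1
```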
