# Customer intent: As an experienced data scientist, I need to package my data into a consumable and reusable object to train my machine learning models.
```
transformations:
  - read_delimited:
      header: all_files_same_headers
```
The important part here is that the MLTable-artifact doesn't have any absolute paths; hence it's self-contained, and all that is needed is stored in that one folder, regardless of whether that folder is stored on your local drive, in your cloud storage, or on a public HTTP server.
This artifact file can be consumed in a command job as follows:
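For example, a minimal sketch of such a job specification — the input name `my_mltable`, the local path `./my_data`, and the inline Python are illustrative assumptions, not the original listing:

```
command: |
  python -c "
  import mltable
  # ${{inputs.my_mltable}} resolves to the delivered MLTable folder
  tbl = mltable.load('${{inputs.my_mltable}}')
  df = tbl.to_pandas_dataframe()
  print(df.head())
  "
inputs:
  my_mltable:
    type: mltable
    path: ./my_data
```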
You can also have an MLTable file stored on your *local machine* but no data files; the underlying data is stored on the cloud. In this case, the MLTable should reference the underlying data with an **absolute expression (i.e. a URI)**:
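A sketch of such an MLTable file — the storage account, container, and file name are illustrative placeholders:

```
paths:
  - file: https://<account>.blob.core.windows.net/<container>/titanic.csv
transformations:
  - read_delimited:
      header: all_files_same_headers
```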
### Supporting multiple files in a table
While the above scenarios create rectangular data, it's also possible to create an mltable-artifact that just contains files:
```
paths:
  - file: http://foo.com/5.csv
```
As outlined above, MLTable can be created from a URI or a local folder path:
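For instance, a minimal sketch with the MLTable Python SDK — the folder path and URI are illustrative:

```
import mltable

# Load from a local folder that contains an MLTable file...
tbl = mltable.load("./my_data")

# ...or from a cloud folder URI (illustrative URI):
# tbl = mltable.load("https://myaccount.blob.core.windows.net/container/my_data")

df = tbl.to_pandas_dataframe()
```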
MLTable-artifacts can yield files that aren't necessarily located in the `mltable`'s storage. They can also **subset or shuffle** the data that resides in the storage, using the `take_random_sample` transform for example. That view is only visible if the MLTable file is actually evaluated by the engine. The user can do that, as described above, by using the MLTable SDK and running `mltable.load`, but that requires Python and the installation of the SDK.
### Support globbing of files
Along with being able to provide a `file` or `folder`, the MLTable artifact file also allows customers to specify a *pattern* to do globbing of files:
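A sketch of what that could look like — the `pattern` key and the wildcard path are assumptions based on the surrounding examples:

```
paths:
  - pattern: ./*.csv
transformations:
  - read_delimited:
      header: all_files_same_headers
```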
### Delimited text: Transformations
The following transformations are *specific to delimited text*; a combined example follows the list.
- `infer_column_types`: Boolean to infer column data types. Defaults to True. Type inference requires that the data source is accessible from the current compute, and currently pulls only the first 200 rows. If the data contains multiple types of value, it's better to provide the desired type as an override via the `set_column_types` argument.
- `encoding`: Specify the file encoding. Supported encodings are 'utf8', 'iso88591', 'latin1', 'ascii', 'utf16', 'utf32', 'utf8bom' and 'windows1252'. Defaults to utf8.
- `header`: the user can choose one of the following options:
  - `no_header`
  - `from_first_file`
  - `all_files_different_headers`
  - `all_files_same_headers` (default)
- `delimiter`: The separator used to split columns.
- `empty_as_string`: Specify if empty field values should be loaded as empty strings. The default (False) reads empty field values as nulls. Passing True reads empty field values as empty strings. If the values are converted to numeric or datetime, this has no effect, as empty values will be converted to nulls.
- `include_path_column`: Boolean to keep path information as a column in the table. Defaults to False. This is useful when you're reading multiple files and want to know which file a particular record originated from, or to keep useful information that's encoded in the file path.
- `support_multi_line`: By default (support_multi_line=False), all line breaks, including those in quoted field values, are interpreted as a record break. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may result in silently producing more records with misaligned field values. Set this to True when the delimited files are known to contain quoted line breaks.
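As promised above, a combined sketch of these options in an MLTable file — the file path and option values are illustrative:

```
paths:
  - file: ./titanic.csv
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'utf8'
      header: all_files_same_headers
      empty_as_string: false
      include_path_column: true
```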
Below are the supported transformations that are specific for json lines:
- `encoding`: Specify the file encoding. Supported encodings are `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom` and `windows1252`. Default is `utf8`.
## Global Transforms
As well as the transforms specific to delimited text, parquet, and Delta, there are other transforms that mltable-artifact files support (a combined sketch follows the list):
- `take`: Takes the first *n* records of the table
- `take_random_sample`: Takes a random sample of the table where each record has a *probability* of being selected. The user can also include a *seed*.
- `keep_columns`: Keeps only the specified columns in the table. This transform supports regex so that users can keep columns matching a particular pattern.
- `filter`: Filter the data, leaving only the records that match the specified expression. **NOTE: This will come post-GA as we need to define the filter query language**.
- `extract_partition_format_into_columns`: Specify the partition format of the path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and run to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.csv', where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
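The combined sketch promised above — column names and parameter values are illustrative:

```
transformations:
  - take_random_sample:
      probability: 0.1
      seed: 42
  - keep_columns: ["PassengerId", "Survived"]
  - take: 100
```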
Our principle here is to support transforms *specific to data delivery*, and not to get into wider feature-engineering transforms.
## Traits
The keen-eyed among you may have spotted that the `mltable` type supports a `traits` section. Traits define fixed characteristics of the table (that is, they are **not** freeform metadata that users can add), and they don't perform any transformations but can be used by the engine (a sketch follows the list below).
- `index_columns`: Set the table index using existing columns. This trait can be used by `partition_by` in the data plane to split data by the index.
- `timestamp_column`: Defines the timestamp column of the table. This trait can be used in filter transforms, or in other data plane operations (SDK) such as drift detection.
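A minimal sketch of a `traits` section — the column names are illustrative:

```
traits:
  index_columns:
    - ID
  timestamp_column: timestamp
```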
Moreover, *in the future* we can use traits to define RAI aspects of the data, for example:
- `sensitive_columns`: Here the user can define certain columns that contain sensitive information.
Again, this isn't a transform but is informing the system of some extra properties in the data.
`articles/machine-learning/how-to-identity-based-data-access.md`
By default, Azure Machine Learning can't communicate with a storage account that's behind a firewall or in a virtual network.
You can configure storage accounts to allow access only from within specific virtual networks. This configuration requires extra steps to ensure data isn't leaked outside of the network. This behavior is the same for credential-based data access. For more information, see [How to configure virtual network scenarios](how-to-access-data.md#virtual-network).
If your storage account has virtual network settings, those settings dictate what identity type and permissions are needed for access. For example, for data preview and data profile, the virtual network settings determine what type of identity is used to authenticate data access.
Datasets package your data into a lazily evaluated consumable object for machine learning tasks like training. Also, with datasets you can [download or mount](how-to-train-with-datasets.md#mount-vs-download) files of any format from Azure storage services like Azure Blob Storage and Azure Data Lake Storage to a compute target.
To create a dataset, you can reference paths from datastores that also use identity-based data access.
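For instance, a minimal sketch using the v1 SDK — the datastore name and path are illustrative assumptions:

```
from azureml.core import Workspace, Dataset, Datastore

ws = Workspace.from_config()

# A datastore registered without stored credentials; access is
# resolved with the caller's Azure AD identity. The name is hypothetical.
datastore = Datastore.get(ws, "my_identity_datastore")

# Reference a path on that datastore to create a file dataset.
dataset = Dataset.File.from_files(path=(datastore, "datasets/images/*.png"))
```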
* If your underlying storage account type is Blob or ADLS Gen 2, your user identity needs the Blob Reader role.
* If your underlying storage is ADLS Gen 1, permissions can be set via the storage's Access Control List (ACL).