#Customer intent: As an experienced data scientist, I need to package my data into a consumable and reusable object to train my machine learning models.
---
> [!IMPORTANT]
> Items in this article marked as "preview" are currently in public preview.
> The preview version is provided without a service level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.
> For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).
## Prerequisites
> Some dataset classes have dependencies on the [azureml-dataprep](https://pypi.org/project/azureml-dataprep/) package, which is only compatible with 64-bit Python. If you develop on __Linux__, these classes rely on .NET Core 2.1, and only specific distributions support them. For more information about the supported distros, read the .NET Core 2.1 column in the [Install .NET on Linux](/dotnet/core/install/linux) article.
> [!IMPORTANT]
> While the package might work on older versions of Linux distros, we don't recommend using a distro that is out of mainstream support. Distros that are out of mainstream support might have security vulnerabilities, because they don't receive the latest updates. We recommend using the latest supported version of your distro that is compatible with .NET Core 2.1.
## Compute size guidance
There are two dataset types, based on how users consume datasets in training: FileDatasets and TabularDatasets.
### FileDataset
A [FileDataset](/python/api/azureml-core/azureml.data.file_dataset.filedataset) references single or multiple files in your datastores or public URLs. If you have cleaned data that is ready for use in training experiments, you can [download or mount](how-to-train-with-datasets.md#mount-vs-download) the files to your compute as a FileDataset object.
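For orientation, here's a minimal sketch of the download-or-mount choice, assuming a FileDataset is already registered under the hypothetical name `my_files`:

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
file_ds = Dataset.get_by_name(ws, name="my_files")  # hypothetical registered FileDataset

# Option 1: download the referenced files to local disk on the compute.
file_ds.download(target_path="./data", overwrite=True)

# Option 2: mount the files (Unix/Linux compute only) and read them in place.
mount_context = file_ds.mount("./data_mount")
mount_context.start()   # files become visible under ./data_mount
# ... training code reads from ./data_mount ...
mount_context.stop()
```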
We recommend FileDatasets for your machine learning workflows, because the source files can be in any format. This enables a wider range of machine learning scenarios, including deep learning.
Create a TabularDataset with [the Python SDK](#create-a-tabulardataset) or Azure Machine Learning studio.
>[!NOTE]
> [Automated ML](../concept-automated-ml.md) workflows generated via the Azure Machine Learning studio currently only support TabularDatasets.
>
>Also, for TabularDatasets generated from [SQL query results](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#azureml-data-dataset-factory-tabulardatasetfactory-from-sql-query), T-SQL (for example, a 'WITH' subquery) or duplicate column names aren't supported. Complex T-SQL queries can cause performance issues, and duplicate column names in a dataset can cause ambiguity issues.
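For context, a minimal sketch of creating a TabularDataset from a SQL query; the datastore name and query are placeholders, and the query deliberately avoids 'WITH' subqueries and duplicate column names:

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
sql_datastore = Datastore.get(ws, "my_sql_datastore")  # hypothetical Azure SQL datastore

# Keep the T-SQL simple: no 'WITH' subqueries, no duplicate column names.
query = (sql_datastore, "SELECT CustomerId, Amount, SaleDate FROM Sales")
tabular_ds = Dataset.Tabular.from_sql_query(query, query_timeout=10)
df = tabular_ds.to_pandas_dataframe()
```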
## Access datasets in a virtual network
To create datasets from a datastore with the Python SDK:
1. Create the dataset by referencing paths in the datastore. You can create a dataset from multiple paths in multiple datastores. There's no hard limit on the number of files or data size from which you can create a dataset.
> [!NOTE]
> For each data path, a few requests are sent to the storage service to check whether it points to a file or a folder. This overhead might lead to degraded performance or failure. A dataset that references one folder with 1,000 files inside is considered referencing one data path. For optimal performance, we recommend creating datasets that reference fewer than 100 paths in datastores.
### Create a FileDataset
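A minimal sketch of the pattern, with a hypothetical datastore name and placeholder paths:

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")  # the workspace's default blob datastore

# Reference a folder and a glob pattern in the datastore; the paths are placeholders.
datastore_paths = [
    (datastore, "animals/cats"),
    (datastore, "animals/dogs/*.jpg"),
]
file_ds = Dataset.File.from_files(path=datastore_paths)

# Optionally register it so the dataset can be reused by name.
file_ds = file_ds.register(workspace=ws, name="animal_images", create_new_version=True)
```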
Filtering capabilities depend on the type of dataset you have.
> [!IMPORTANT]
> Filtering datasets with the [`filter()`](/python/api/azureml-core/azureml.data.tabulardataset#azureml-data-tabulardataset-filter) preview method is an [experimental](/python/api/overview/azure/ml/#stable-vs-experimental) preview feature, and could change at any time.
>
For **TabularDatasets**, you can keep or remove columns with the [keep_columns()](/python/api/azureml-core/azureml.data.tabulardataset#azureml-data-tabulardataset-keep-columns) and [drop_columns()](/python/api/azureml-core/azureml.data.tabulardataset#azureml-data-tabulardataset-drop-columns) methods.
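For illustration, a short sketch that combines these methods on a hypothetical registered TabularDataset (the dataset and column names are assumptions):

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
titanic_ds = Dataset.get_by_name(ws, name="titanic")  # hypothetical TabularDataset

# Keep only the columns needed for training, then drop one of them again.
subset_ds = titanic_ds.keep_columns(["Age", "Fare", "Survived"])
subset_ds = subset_ds.drop_columns(["Fare"])

# filter() is an experimental preview method and could change at any time.
adults_ds = subset_ds.filter(subset_ds["Age"] > 21)
df = adults_ds.to_pandas_dataframe()
```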
> Create and register a TabularDataset from an in-memory Spark dataframe or a Dask dataframe with the public preview methods [`register_spark_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#azureml-data-dataset-factory-tabulardatasetfactory-register-spark-dataframe) and [`register_dask_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#azureml-data-dataset-factory-tabulardatasetfactory-register-dask-dataframe). These methods are [experimental](/python/api/overview/azure/ml/#stable-vs-experimental) preview features, and might change at any time.
>
> These methods upload data to your underlying storage, and as a result incur storage costs.
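As a rough sketch of the Spark-dataframe variant (the workspace, target folder, and `spark_df` dataframe are assumed to exist already; the Dask method follows the same shape):

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# spark_df is an existing in-memory Spark dataframe (not created here).
# The data is written to the datastore, so storage costs apply.
prepared_ds = Dataset.Tabular.register_spark_dataframe(
    dataframe=spark_df,
    target=(datastore, "prepared-data"),  # placeholder folder in the datastore
    name="prepared_training_data",
    show_progress=True,
)
```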
#Customer intent: As a data scientist, I want to prepare my data at scale, and to train my machine learning models from a single notebook using Azure Machine Learning.
---
You can specify an [Azure Machine Learning environment](../concept-environments.md) to use during your Apache Spark session. Only Conda dependencies specified in the environment will take effect. Docker images aren't supported.
>[!WARNING]
> Python dependencies specified in environment Conda dependencies aren't supported in Apache Spark pools. Currently, only fixed Python versions are supported.
> Include `sys.version_info` in your script to check your Python version.
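For example, a quick check you could add at the top of your session script:

```python
import sys

# Print the interpreter version the Spark pool session is actually running.
print(sys.version_info)
```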
This code creates the `myenv` environment, which installs `azureml-core` version 1.20.0 and `numpy` version 1.17.0 before the session starts. You can then include this environment in your Apache Spark session `start` statement.
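A minimal sketch of such an environment definition, with the pinned versions from the paragraph above:

```python
from azureml.core import Environment

# Only the Conda dependencies take effect in the Spark session; Docker settings are ignored.
myenv = Environment(name="myenv")
myenv.python.conda_dependencies.add_pip_package("azureml-core==1.20.0")
myenv.python.conda_dependencies.add_pip_package("numpy==1.17.0")
```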
After you complete the data preparation, and you save your prepared data to storage, stop using your Apache Spark pool with the following command:

```python
%synapse stop
```
## Create a dataset to represent prepared data
When you're ready to consume your prepared data for model training, connect to your storage with an [Azure Machine Learning datastore](how-to-access-data.md), and specify the file or files you want to use with an [Azure Machine Learning dataset](how-to-create-register-datasets.md).
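A minimal sketch of that pattern; the datastore name and file path are placeholders:

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")  # placeholder datastore name

# Point a FileDataset at the prepared file written by the Spark session.
train_ds = Dataset.File.from_files(path=[(datastore, "prepared-data/train.csv")])
input1 = train_ds.as_mount()  # hand the mounted data to a training run as an input
```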
## Use a `ScriptRunConfig` to submit an experiment run to a Synapse Spark pool
If you're ready to automate and productionize your data wrangling tasks, you can submit an experiment run to [an attached Synapse Spark pool](how-to-link-synapse-ml-workspaces.md#attach-a-pool-with-the-python-sdk) with the [ScriptRunConfig](/python/api/azureml-core/azureml.core.scriptrunconfig) object. In a similar way, if you have an Azure Machine Learning pipeline, you can use the [SynapseSparkStep to specify your Synapse Spark pool as the compute target](how-to-use-synapsesparkstep.md) for your pipeline data preparation step. Availability of your data to the Synapse Spark pool depends on your dataset type.
* For a FileDataset, you can use the [`as_hdfs()`](/python/api/azureml-core/azureml.data.filedataset#as-hdfs--) method. When the run is submitted, the dataset is made available to the Synapse Spark pool as a Hadoop Distributed File System (HDFS).
* For a [TabularDataset](how-to-create-register-datasets.md#tabulardataset), you can use the [`as_named_input()`](/python/api/azureml-core/azureml.data.abstract_dataset.abstractdataset#as-named-input-name-) method, as the sketch after this list shows.
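A minimal sketch of submitting such a run; the pool name, script, and dataset names are assumptions rather than the article's exact code:

```python
from azureml.core import Dataset, Experiment, RunConfiguration, ScriptRunConfig, Workspace

ws = Workspace.from_config()
file_ds = Dataset.get_by_name(ws, "raw_files")      # hypothetical FileDataset
tabular_ds = Dataset.get_by_name(ws, "raw_table")   # hypothetical TabularDataset

run_config = RunConfiguration(framework="pyspark")
run_config.target = "synapse-pool"                  # name of the attached Synapse Spark pool

script_config = ScriptRunConfig(
    source_directory="./code",
    script="dataprep.py",
    arguments=[
        "--file_input", file_ds.as_hdfs(),                    # FileDataset exposed through HDFS
        "--tabular_input", tabular_ds.as_named_input("raw"),  # TabularDataset passed by name
    ],
    run_config=run_config,
)

run = Experiment(ws, "synapse-data-wrangling").submit(script_config)
```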
In this article, you learn how to transform and save datasets in the Azure Machine Learning designer, to prepare your own data for machine learning.
You'll use the sample [Adult Census Income Binary Classification](samples-designer.md) dataset to prepare two datasets. One dataset includes adult census information from only the United States, and another dataset includes census information from non-US adults.
In this article, you learn how to:
1. Transform a dataset to prepare it for training.
1. Export the resulting datasets to a datastore.
1. View the results.
This how-to is a prerequisite for the [how to retrain designer models](how-to-retrain-designer.md) article. In that article, you learn how to use the transformed datasets to train multiple models with pipeline parameters.
> [!IMPORTANT]
> If you don't observe the graphical elements mentioned in this document (for example, buttons in studio or designer), you might not have the correct level of permissions for the workspace. Contact your Azure subscription administrator to verify that you have the correct level of access. For more information, visit [Manage users and roles](../how-to-assign-roles.md).
## Transform a dataset
In this section, you learn how to import the sample dataset, and split the data into US and non-US datasets. Visit [how to import data](how-to-designer-import-data.md) for more information about how to import your own data into the designer.
### Import data
### Split the data
In this section, you use the [Split Data component](../algorithm-module-reference/split-data.md) to identify and split rows that contain "United-States" in the "native-country" column.
1. To the left of the canvas, in the component tab, expand the **Data Transformation** section, and find the **Split Data** component.
Now that you set up your pipeline to split the data, you must specify where to persist the datasets.
For the **Split Data** component, the output port order is important. The first output port contains the rows where the regular expression is true. In this case, the first port contains rows for US-based income, and the second port contains rows for non-US-based income.
1. In the component details pane to the right of the canvas, set the following options:
**Datastore type**: Azure Blob Storage
**Datastore**: Select an existing datastore, or select "New datastore" to create a new one
**File format**: csv
> [!NOTE]
> This article assumes that you have access to a datastore registered to the current Azure Machine Learning workspace. Visit [Connect to Azure storage services](how-to-connect-data-ui.md#create-datastores) for datastore setup instructions.
You can create a datastore if you don't have one. For example purposes, this article saves the datasets to the default blob storage account associated with the workspace. It saves the datasets into the `azureml` container, in a new folder named `data`.
1. Select the **Export Data** component connected to the *right*-most port of the **Split Data** component, to open the Export Data configuration pane.