Commit defe5ba

work with data section
1 parent 8020dcf commit defe5ba

File tree

1 file changed: +24 −22 lines


articles/machine-learning/how-to-train-with-datasets.md

Lines changed: 24 additions & 22 deletions
@@ -19,7 +19,7 @@ ms.date: 04/20/2020
 # Train with datasets in Azure Machine Learning
 [!INCLUDE [applies-to-skus](../../includes/aml-applies-to-basic-enterprise-sku.md)]
 
-In this article, you learn how to consume [Azure Machine Learning datasets](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset%28class%29?view=azure-ml-py) in your training experiments. You can use datasets in your local or remote compute target without worrying about connection strings or data paths.
+In this article, you learn how to work with [Azure Machine Learning datasets](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset%28class%29?view=azure-ml-py) in your training experiments. You can use datasets in your local or remote compute target without worrying about connection strings or data paths.
 
 Azure Machine Learning datasets provide a seamless integration with Azure Machine Learning training products like [ScriptRun](https://docs.microsoft.com/python/api/azureml-core/azureml.core.scriptrun?view=azure-ml-py), [Estimator](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.estimator?view=azure-ml-py), [HyperDrive](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.hyperdrive?view=azure-ml-py) and [Azure Machine Learning pipelines](how-to-create-your-first-pipeline.md).
 
@@ -36,6 +36,27 @@ To create and train with datasets, you need:
 > [!Note]
 > Some Dataset classes have dependencies on the [azureml-dataprep](https://docs.microsoft.com/python/api/azureml-dataprep/?view=azure-ml-py) package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, Fedora, and CentOS.
 
+## Work with datasets
+
+You can access existing datasets across experiments within your workspace, and load them into a pandas dataframe for further exploration in your local environment.
+
+The following code uses the [`get_context()`]() method of the [`Run`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py) class to access the existing input TabularDataset, `titanic`, in the training script. It then uses the [`to_pandas_dataframe()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset#to-pandas-dataframe-on-error--null---out-of-range-datetime--null--) method to load that dataset into a pandas dataframe for further data exploration and preparation prior to training.
+
+```Python
+%%writefile $script_folder/train_titanic.py
+
+from azureml.core import Dataset, Run
+
+run = Run.get_context()
+# get the input dataset by name
+dataset = run.input_datasets['titanic']
+
+# load the TabularDataset to pandas DataFrame
+df = dataset.to_pandas_dataframe()
+```
+
+To create a new dataset from an in-memory pandas dataframe, write the prepared data to a local file, such as a parquet file, and create a new dataset from that file.
+
 ## Use datasets directly in training scripts
 
 If you have structured data, create a TabularDataset and use it directly in your training script for your local or remote experiment.
@@ -44,8 +65,6 @@ In this example, you create a [TabularDataset](https://docs.microsoft.com/python
 
 ### Create a TabularDataset
 
-
-
 The following code creates an unregistered TabularDataset from a web url. You can also create datasets from local files or paths in datastores. Learn more about [how to create datasets](https://aka.ms/azureml/howto/createdatasets).
 
 ```Python
@@ -54,7 +73,8 @@ from azureml.core.dataset import Dataset
 web_path ='https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
 titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path)
 ```
-TabularDataset objects provide the ability to load the data into a pandas or spark DataFrame so that you can work with familiar data preparation and training libraries without having to leave your notebook. To leverage this capability, see [how to access input datasets](#access-input-datasets).
+
+TabularDataset objects let you load the data into a pandas or Spark DataFrame so that you can work with familiar data preparation and training libraries without leaving your notebook. To use this capability, see [work with datasets](#work-with-datasets).
 
 ### Configure the estimator
 
@@ -81,24 +101,6 @@ experiment_run = experiment.submit(est)
 experiment_run.wait_for_completion(show_output=True)
 ```
 
-### Access input dataset
-
-You can access and explore existing datasets across experiments within your workspace.
-
-The following code uses the [`get_context()`]() method in the [`Run`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py) class to access the input TabularDataset, `titanic`, in the training script. Then uses the [`to_pandas_dataframe()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset#to-pandas-dataframe-on-error--null---out-of-range-datetime--null--) method to load that dataset into a pandas dataframe for further data exploration and preparation.
-
-```Python
-%%writefile $script_folder/train_titanic.py
-
-from azureml.core import Dataset, Run
-
-run = Run.get_context()
-# get the input dataset by name
-dataset = run.input_datasets['titanic']
-
-# load the TabularDataset to pandas DataFrame
-df = dataset.to_pandas_dataframe()
-```
 ## Mount files to remote compute targets
 
 If you have unstructured data, create a [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py) and either mount or download your data files to make them available to your remote compute target for training. Learn about when to use [mount vs. download](#mount-vs.-download) for your remote training experiments.
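The mount vs. download choice mentioned above is essentially a disk-space tradeoff: downloading copies every file to the compute target up front, while mounting streams files on demand. The helper below is hypothetical (not an Azure Machine Learning API) and merely encodes that rule of thumb; the 20% headroom factor is an invented assumption.

```python
def choose_access_mode(dataset_size_gb: float, free_disk_gb: float) -> str:
    """Hypothetical rule of thumb: 'download' when the data comfortably
    fits on the node's disk, otherwise 'mount' and stream on demand."""
    # keep ~20% disk headroom so a download cannot fill the disk (assumption)
    if dataset_size_gb < 0.8 * free_disk_gb:
        return "download"
    return "mount"

print(choose_access_mode(10, 100))   # fits easily on disk -> "download"
print(choose_access_mode(500, 100))  # larger than local disk -> "mount"
```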
