
Commit b6aadd6

Merge pull request #170993 from nibaccam/upload-df-ga
Data | GA upload + dataframes
2 parents d00e125 + cca6826 · commit b6aadd6

1 file changed

articles/machine-learning/how-to-create-register-datasets.md

Lines changed: 27 additions & 36 deletions
@@ -117,6 +117,8 @@ Use the [`from_files()`](/python/api/azureml-core/azureml.data.dataset_factory.f
 If your storage is behind a virtual network or firewall, set the parameter `validate=False` in your `from_files()` method. This bypasses the initial validation step, and ensures that you can create your dataset from these secure files. Learn more about how to [use datastores and datasets in a virtual network](how-to-secure-workspace-vnet.md#datastores-and-datasets).
 
 ```Python
+from azureml.core import Workspace, Datastore, Dataset
+
 # create a FileDataset pointing to files in 'animals' folder and its subfolders recursively
 datastore_paths = [(datastore, 'animals')]
 animal_ds = Dataset.File.from_files(path=datastore_paths)
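
The virtual-network guidance in the context line above is easy to miss next to the snippet this hunk extends. A minimal sketch of the full `validate=False` call, assuming a local workspace config file; the datastore name is a placeholder:

```Python
from azureml.core import Workspace, Datastore, Dataset

# load the workspace from a local config.json
ws = Workspace.from_config()
datastore = Datastore.get(ws, '<name of your datastore>')

# validate=False skips the upfront validation step so dataset creation
# succeeds even when the storage sits behind a virtual network or firewall
datastore_paths = [(datastore, 'animals')]
animal_ds = Dataset.File.from_files(path=datastore_paths, validate=False)
```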
@@ -126,12 +128,22 @@ web_paths = ['https://azureopendatastorage.blob.core.windows.net/mnist/train-ima
              'https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz']
 mnist_ds = Dataset.File.from_files(path=web_paths)
 ```
-To reuse and share datasets across experiment in your workspace, [register your dataset](#register-datasets).
 
-> [!TIP]
-> Upload files from a local directory and create a FileDataset in a single method with the public preview method, [upload_directory()](/python/api/azureml-core/azureml.data.dataset_factory.filedatasetfactory#upload-directory-src-dir--target--pattern-none--overwrite-false--show-progress-true-). This method is an [experimental](/python/api/overview/azure/ml/#stable-vs-experimental) preview feature, and may change at any time.
->
-> This method uploads data to your underlying storage, and as a result incur storage costs.
+If you want to upload all the files from a local directory and create a FileDataset in a single method call, use [upload_directory()](/python/api/azureml-core/azureml.data.dataset_factory.filedatasetfactory#upload-directory-src-dir--target--pattern-none--overwrite-false--show-progress-true-). This method uploads data to your underlying storage, and as a result incurs storage costs.
+
+```Python
+from azureml.core import Workspace, Datastore, Dataset
+from azureml.data.datapath import DataPath
+
+ws = Workspace.from_config()
+datastore = Datastore.get(ws, '<name of your datastore>')
+ds = Dataset.File.upload_directory(src_dir='<path to your data>',
+                                   target=DataPath(datastore, '<path on the datastore>'),
+                                   show_progress=True)
+```
+
+To reuse and share datasets across experiments in your workspace, [register your dataset](#register-datasets).
 
 ### Create a TabularDataset
 
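The hunk above ends by pointing at dataset registration. A minimal sketch of that step, continuing from the `upload_directory()` example; the name and description are illustrative placeholders:

```Python
# continues from the upload_directory() example: `ws` is the workspace,
# `ds` is the FileDataset returned by Dataset.File.upload_directory()
animal_ds = ds.register(workspace=ws,
                        name='uploaded-files',
                        description='files uploaded from a local directory',
                        create_new_version=True)
```

Registering attaches a name and version to the dataset so other experiments in the workspace can retrieve it without re-creating it.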
@@ -318,36 +330,21 @@ titanic_ds.take(3).to_pandas_dataframe()
 
 ## Create a dataset from pandas dataframe
 
-To create a TabularDataset from an in memory pandas dataframe, write the data to a local file, like a csv, and create your dataset from that file. The following code demonstrates this workflow.
+To create a TabularDataset from an in memory pandas dataframe, use the [`register_pandas_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#register-pandas-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-) method. This method registers the TabularDataset to the workspace and uploads data to your underlying storage, which incurs storage costs.
 
 ```python
-# azureml-core of version 1.0.72 or higher is required
-# azureml-dataprep[pandas] of version 1.1.34 or higher is required
-
-from azureml.core import Workspace, Dataset
-local_path = 'data/prepared.csv'
-dataframe.to_csv(local_path)
-
-# upload the local file to a datastore on the cloud
-
-subscription_id = 'xxxxxxxxxxxxxxxxxxxxx'
-resource_group = 'xxxxxx'
-workspace_name = 'xxxxxxxxxxxxxxxx'
-
-workspace = Workspace(subscription_id, resource_group, workspace_name)
-
-# get the datastore to upload prepared data
-datastore = workspace.get_default_datastore()
+from azureml.core import Workspace, Datastore, Dataset
+import pandas as pd
 
-# upload the local file from src_dir to the target_path in datastore
-datastore.upload(src_dir='data', target_path='data')
+pandas_df = pd.read_csv('<path to your csv file>')
+ws = Workspace.from_config()
+datastore = Datastore.get(ws, '<name of your datastore>')
+dataset = Dataset.Tabular.register_pandas_dataframe(pandas_df, datastore, "dataset_from_pandas_df", show_progress=True)
 
-# create a dataset referencing the cloud location
-dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, ('data/prepared.csv'))])
 ```
-
 > [!TIP]
-> Create and register a TabularDataset from an in memory spark or pandas dataframe with a single method with public preview methods, [`register_spark_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#methods) and [`register_pandas_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#methods). These register methods are [experimental](/python/api/overview/azure/ml/#stable-vs-experimental) preview features, and may change at any time.
+> Create and register a TabularDataset from an in memory spark dataframe or a dask dataframe with the public preview methods, [`register_spark_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#register-spark-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-) and [`register_dask_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#register-dask-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-). These methods are [experimental](/python/api/overview/azure/ml/#stable-vs-experimental) preview features, and may change at any time.
 >
 > These methods upload data to your underlying storage, and as a result incur storage costs.
 
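A minimal sketch of the dask path named in the tip, with the call shape inferred from the linked reference anchor (`dataframe, target, name, ...`); the datastore name and csv path are placeholders, and the method is an experimental preview that may change:

```Python
import dask.dataframe as dd
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, '<name of your datastore>')

# read one or more csv files into a dask dataframe (lazy, out-of-core)
dask_df = dd.read_csv('<path to your csv files>')

# preview method: uploads the data to the datastore and registers
# the result as a TabularDataset in the workspace
dataset = Dataset.Tabular.register_dask_dataframe(dask_df, datastore,
                                                  name="dataset_from_dask_df",
                                                  show_progress=True)
```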
@@ -366,13 +363,7 @@ titanic_ds = titanic_ds.register(workspace=workspace,
 There are many templates at [https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.machinelearningservices](https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.machinelearningservices) that can be used to create datasets.
 
 For information on using these templates, see [Use an Azure Resource Manager template to create a workspace for Azure Machine Learning](how-to-create-workspace-template.md).
-
-
-## Create datasets from Azure Open Datasets
-
-[Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/) are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. Open Datasets are in the cloud on Microsoft Azure and are included in both the SDK and the studio.
-
-Learn how to create [Azure Machine Learning Datasets from Azure Open Datasets](../open-datasets/how-to-create-azure-machine-learning-dataset-from-open-dataset.md).
+
 
 ## Train with datasets
 