
Commit 3353f7a

Merge pull request #101062 from MayMSFT/patch-29
Update how-to-create-register-datasets.md
2 parents 81becee + c2f8d02

File tree

1 file changed: +22 −31 lines


articles/machine-learning/how-to-create-register-datasets.md

Lines changed: 22 additions & 31 deletions
@@ -65,21 +65,10 @@ To create datasets from an [Azure datastore](how-to-access-data.md) by using the
 1. Verify that you have `contributor` or `owner` access to the registered Azure datastore.

-1. Create the dataset by referencing a path in the datastore:
+2. Create the dataset by referencing paths in the datastore.
+
+   > [!Note]
+   > You can create a dataset from multiple paths in multiple datastores. There is no hard limit on the number of files or the amount of data from which you can create a dataset. However, for each data path, a few requests are sent to the storage service to check whether it points to a file or a folder; this overhead can degrade performance or even cause failures. A dataset that references one folder containing 1,000 files is considered to reference one data path. For optimal performance, we recommend creating datasets that reference fewer than 100 paths in datastores.
-
-   ```Python
-   from azureml.core.workspace import Workspace
-   from azureml.core.datastore import Datastore
-   from azureml.core.dataset import Dataset
-
-   datastore_name = 'your datastore name'
-
-   # get existing workspace
-   workspace = Workspace.from_config()
-
-   # retrieve an existing datastore in the workspace by name
-   datastore = Datastore.get(workspace, datastore_name)
-   ```
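The path-count guidance in the note above can be made concrete with a small illustrative helper. This is plain Python, not part of azureml-core; the name `check_path_count` and the 100-path threshold come from the note's recommendation, which is guidance rather than an enforced SDK limit.

```python
# Illustrative helper (hypothetical, not an azureml-core API): count how many
# data paths a prospective dataset would reference. Per the note above, a
# folder counts as a single path even if it holds thousands of files.
RECOMMENDED_MAX_PATHS = 100  # guidance from the docs, not an enforced limit

def check_path_count(datastore_paths):
    """datastore_paths: list of (datastore, path) tuples, as passed to the SDK."""
    n = len(datastore_paths)  # each tuple is one data path
    if n > RECOMMENDED_MAX_PATHS:
        print(f"warning: {n} paths referenced; consider consolidating to "
              f"fewer than {RECOMMENDED_MAX_PATHS} for optimal performance")
    return n
```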

 #### Create a TabularDataset

@@ -88,12 +77,20 @@ You can create TabularDatasets through the SDK or by using Azure Machine Learnin
 Use the [`from_delimited_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#from-delimited-files-path--validate-true--include-path-false--infer-column-types-true--set-column-types-none--separator------header-true--partition-format-none-) method on the `TabularDatasetFactory` class to read files in .csv or .tsv format, and to create an unregistered TabularDataset. If you're reading from multiple files, results will be aggregated into one tabular representation.

 ```Python
-# create a TabularDataset from multiple paths in datastore
-datastore_paths = [
-    (datastore, 'weather/2018/11.csv'),
-    (datastore, 'weather/2018/12.csv'),
-    (datastore, 'weather/2019/*.csv')
-]
+from azureml.core import Workspace, Datastore, Dataset
+
+datastore_name = 'your datastore name'
+
+# get existing workspace
+workspace = Workspace.from_config()
+
+# retrieve an existing datastore in the workspace by name
+datastore = Datastore.get(workspace, datastore_name)
+
+# create a TabularDataset from 3 paths in datastore
+datastore_paths = [(datastore, 'weather/2018/11.csv'),
+                   (datastore, 'weather/2018/12.csv'),
+                   (datastore, 'weather/2019/*.csv')]
 weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
 ```
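As a rough illustration of the aggregation behavior described above, here is a stand-in using only the standard `csv` module (plain Python, not the azureml implementation): data rows from several delimited files sharing a header are combined into one tabular result.

```python
# Stand-in sketch: aggregate rows from multiple delimited files into one table,
# mimicking how reading from multiple files yields one tabular representation.
import csv
import io

def from_delimited_files_demo(files):
    """files: list of file-like objects containing CSV text with a header row."""
    header, rows = None, []
    for f in files:
        reader = csv.reader(f)
        this_header = next(reader)   # first row of each file is the header
        if header is None:
            header = this_header
        rows.extend(reader)          # aggregate all data rows into one table
    return header, rows

nov = io.StringIO("date,temp\n2018-11-01,4\n")
dec = io.StringIO("date,temp\n2018-12-01,-2\n")
header, rows = from_delimited_files_demo([nov, dec])
```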
9996

@@ -154,16 +151,12 @@ Use the [`from_files()`](https://docs.microsoft.com/python/api/azureml-core/azur
 ```Python
 # create a FileDataset pointing to files in 'animals' folder and its subfolders recursively
-datastore_paths = [
-    (datastore, 'animals')
-]
+datastore_paths = [(datastore, 'animals')]
 animal_ds = Dataset.File.from_files(path=datastore_paths)

 # create a FileDataset from image and label files behind public web urls
-web_paths = [
-    'https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
-    'https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz'
-]
+web_paths = ['https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
+            'https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz']
 mnist_ds = Dataset.File.from_files(path=web_paths)
 ```
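How a wildcard entry such as `'weather/2019/*.csv'` selects files can be sketched with `fnmatch` from the standard library. This is only an analogy for the matching behavior, not what `FileDataset` uses internally.

```python
# Sketch of wildcard path selection using stdlib fnmatch (illustrative only;
# the actual dataset path resolution happens in the storage service / SDK).
from fnmatch import fnmatch

available = ['weather/2019/01.csv', 'weather/2019/02.csv',
             'weather/2019/readme.txt', 'weather/2018/12.csv']

# keep only paths matching the wildcard pattern
matched = [p for p in available if fnmatch(p, 'weather/2019/*.csv')]
```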

@@ -246,10 +239,8 @@ The dataset is now available in your workspace under **Datasets**. You can use i
 You can register a new dataset under the same name by creating a new version. A dataset version is a way to bookmark the state of your data so that you can apply a specific version of the dataset for experimentation or future reproduction. Learn more about [dataset versions](how-to-version-track-datasets.md).
 ```Python
 # create a TabularDataset from Titanic training data
-web_paths = [
-    'https://dprepdata.blob.core.windows.net/demo/Titanic.csv',
-    'https://dprepdata.blob.core.windows.net/demo/Titanic2.csv'
-]
+web_paths = ['https://dprepdata.blob.core.windows.net/demo/Titanic.csv',
+            'https://dprepdata.blob.core.windows.net/demo/Titanic2.csv']
 titanic_ds = Dataset.Tabular.from_delimited_files(path=web_paths)

 # create a new version of titanic_ds
 ```
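The create-new-version behavior can be illustrated with a hypothetical in-memory registry (plain Python, not the azureml-core `Dataset.register` API). It only mimics the idea that registering under an existing name with a new-version flag appends a version rather than overwriting.

```python
# Hypothetical sketch of dataset versioning semantics (not the real SDK):
# each register() under the same name appends a new version snapshot.
class DatasetRegistry:
    def __init__(self):
        self._versions = {}  # name -> list of dataset snapshots

    def register(self, name, dataset, create_new_version=False):
        if name in self._versions and not create_new_version:
            raise ValueError(f"dataset '{name}' already registered; "
                             "pass create_new_version=True to add a version")
        self._versions.setdefault(name, []).append(dataset)
        return len(self._versions[name])  # 1-based version number

    def get(self, name, version=None):
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version - 1]

registry = DatasetRegistry()
v1 = registry.register('titanic_ds', {'paths': ['Titanic.csv']})
v2 = registry.register('titanic_ds',
                       {'paths': ['Titanic.csv', 'Titanic2.csv']},
                       create_new_version=True)
```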
