articles/machine-learning/how-to-create-register-datasets.md
To create datasets from an [Azure datastore](how-to-access-data.md) by using the Python SDK:

1. Verify that you have `contributor` or `owner` access to the registered Azure datastore.

2. Create the dataset by referencing paths in the datastore.

    > [!Note]
    > You can create a dataset from multiple paths in multiple datastores. There is no hard limit on the number of files or the data size that a dataset can reference. However, for each data path, a few requests are sent to the storage service to check whether it points to a file or a folder, and this overhead can degrade performance or cause failures. A dataset that references one folder with 1,000 files inside references one data path. For optimal performance, we recommend creating datasets that reference fewer than 100 paths in datastores.
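To make the path-counting rule in the note concrete: each `(datastore, path)` entry counts as one data path, whether it names a single file or a folder/glob that expands to many files. A minimal plain-Python sketch (the `count_data_paths` helper is hypothetical, for illustration only):

```python
def count_data_paths(datastore_paths):
    """Each (datastore, path) entry is one data path, even when the
    path is a folder or glob that expands to many files."""
    return len(datastore_paths)

# three data paths, although the glob may match hundreds of files
paths = [('workspaceblobstore', 'weather/2018/11.csv'),
         ('workspaceblobstore', 'weather/2018/12.csv'),
         ('workspaceblobstore', 'weather/2019/*.csv')]

assert count_data_paths(paths) < 100  # stay within the recommended limit
```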
You can create TabularDatasets through the SDK or by using Azure Machine Learning studio.

Use the [`from_delimited_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#from-delimited-files-path--validate-true--include-path-false--infer-column-types-true--set-column-types-none--separator------header-true--partition-format-none-) method on the `TabularDatasetFactory` class to read files in .csv or .tsv format, and to create an unregistered TabularDataset. If you're reading from multiple files, the results are aggregated into one tabular representation.

```Python
from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'

# get existing workspace
workspace = Workspace.from_config()

# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from multiple paths in the datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]

weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
```
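The aggregation behavior described above — multiple delimited files combined into one tabular representation — can be sketched locally with pandas. This is an analogy for illustration, not the azureml implementation; the in-memory buffers stand in for the weather CSV files:

```python
import io
import pandas as pd

# two in-memory stand-ins for 'weather/2018/11.csv' and 'weather/2018/12.csv'
november = io.StringIO("date,temp\n2018-11-01,10\n2018-11-02,12\n")
december = io.StringIO("date,temp\n2018-12-01,3\n")

# like from_delimited_files(), read each file and aggregate into one table
weather = pd.concat([pd.read_csv(f) for f in (november, december)],
                    ignore_index=True)
```

After aggregation, `weather` holds all three rows from the two files under a single `date`/`temp` schema, which mirrors how a TabularDataset presents multiple matched files as one table.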
The dataset is now available in your workspace under **Datasets**. You can use it in your experiments and pipelines.

You can register a new dataset under the same name by creating a new version. A dataset version is a way to bookmark the state of your data so that you can apply a specific version of the dataset for experimentation or future reproduction. Learn more about [dataset versions](how-to-version-track-datasets.md).

```Python
# create a TabularDataset from Titanic training data
web_paths = ['https://dprepdata.blob.core.windows.net/demo/Titanic.csv',
             'https://dprepdata.blob.core.windows.net/demo/Titanic2.csv']
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_paths)

# register a new version of the dataset under the same name
titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='new titanic training data',
                                 create_new_version=True)
```