Commit 7977763

Merge pull request #86872 from MayMSFT/patch-16
Update how-to-create-register-datasets.md
2 parents be1f50a + 9736f10 commit 7977763

1 file changed: +23 -2 lines changed

articles/machine-learning/service/how-to-create-register-datasets.md

Lines changed: 23 additions & 2 deletions
@@ -41,9 +41,11 @@ To create and work with datasets, you need:
4141
4242
## Dataset Types
4343

44-
Datasets are categorized into various types based on how users consume them in training. Currently we support [TabularDatasets](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) which represent data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas DataFrame. A `TabularDataset` object can be created from csv, tsv, parquet files, SQL query results etc. For a complete list, please visit our documentation.
44+
Datasets are categorized into types based on how users consume them in training. The available dataset types are:
45+
* [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files. This lets you materialize the data into a pandas DataFrame. A `TabularDataset` object can be created from CSV, TSV, and Parquet files, SQL query results, and other sources. For a complete list, see the [documentation](https://aka.ms/tabulardataset-api-reference).
46+
* FileDataset references one or more files in your datastores or at public URLs. This lets you download or mount the files to your compute. The files can be of any format, which enables a wider range of machine learning scenarios, including deep learning.
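
The difference between the two types can be pictured with a plain-Python analogy. This sketch does not use the Azure ML SDK; the file name and the sample rows are made up for illustration. It only contrasts parsing a file into named columns (the tabular view) with referencing files by path and reading them as opaque bytes (the file view).

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file, written only for this illustration.
    sample = Path(tmp) / "passengers.csv"
    sample.write_text("name,age\nLaina,26\nFlorence,38\n", encoding="utf-8")

    # Tabular view: the file is parsed into records with named columns.
    with sample.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    # File view: files are referenced by path and read as raw bytes,
    # with no assumption about their format.
    file_refs = sorted(Path(tmp).glob("*.csv"))
    raw_sizes = {ref.name: len(ref.read_bytes()) for ref in file_refs}

print(rows[0]["name"], list(raw_sizes))
```

The tabular view only works for formats that can be parsed into rows, while the file view works for any format, which is why the latter suits image or audio data.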
4547

46-
To find out more about upcoming API changes, see [What is Azure Machine Learning service?](https://aka.ms/tabular-dataset)
48+
To find out more about upcoming API changes, see [here](https://aka.ms/tabular-dataset).
4749

4850
## Create datasets
4951

@@ -97,6 +99,25 @@ titanic_ds.take(3).to_pandas_dataframe()
9799
1|2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38.0|1|0|PC 17599|71.2833|C85|C
98100
2|3|1|3|Heikkinen, Miss. Laina|female|26.0|0|0|STON/O2. 3101282|7.9250||S
99101

102+
### Create FileDatasets
103+
Use the `from_files()` method on the `FileDatasetFactory` class to load files in any format and create an unregistered `FileDataset`.
104+
105+
```Python
106+
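# create a FileDataset from multiple paths in a datastore
# (assumes `datastore` is an existing Datastore object and `Dataset` is imported from azureml.core)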
107+
datastore_paths = [
108+
(datastore, 'animals/dog/1.jpg'),
109+
(datastore, 'animals/dog/2.jpg'),
110+
(datastore, 'animals/dog/*.jpg')
111+
]
112+
animal_ds = Dataset.File.from_files(path=datastore_paths)
113+
114+
# create a FileDataset from image and label files behind public web urls
115+
web_paths = [
116+
'https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
117+
'https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz'
118+
]
119+
mnist_ds = Dataset.File.from_files(path=web_paths)
120+
```
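
The first path list above mixes two explicit files with a wildcard over the same folder. Assuming the wildcard follows ordinary shell-style globbing (an assumption, since the text does not say how patterns are resolved or whether duplicates are removed), the explicit entries are already covered by the pattern. The sketch below checks this with the standard library only; the `is_pattern` helper is hypothetical and the datastore object is left out.

```python
from fnmatch import fnmatch

# Relative paths mirroring the datastore example (datastore object omitted).
paths = [
    "animals/dog/1.jpg",
    "animals/dog/2.jpg",
    "animals/dog/*.jpg",
]


def is_pattern(path):
    """Hypothetical helper: treat a path as a pattern if it has glob characters."""
    return any(ch in path for ch in "*?[")


patterns = [p for p in paths if is_pattern(p)]
literals = [p for p in paths if not is_pattern(p)]

# Literal paths that at least one wildcard entry already matches.
redundant = [p for p in literals if any(fnmatch(p, pat) for pat in patterns)]
print(redundant)
```

If the service resolves patterns this way, passing only the wildcard entry would reference the same files.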
100121
## Register datasets
101122

102123
To complete the creation process, register your datasets with a workspace:
