Commit 6857584

Merge pull request #88856 from nibaccam/dataset-def
Data | Datasets-timeseries trait
2 parents 7c5b1cd + 3eebbd8 commit 6857584

1 file changed: +12 -10 lines changed

articles/machine-learning/service/how-to-create-register-datasets.md

Lines changed: 12 additions & 10 deletions
@@ -41,8 +41,10 @@ To create and work with datasets, you need:

 ## Dataset Types

-Datasets are categorized into various types based on how users consume them in training. List of Dataset types:
-* [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas DataFrame. A `TabularDataset` object can be created from csv, tsv, parquet files, SQL query results etc. For a complete list, please visit our [documentation](https://aka.ms/tabulardataset-api-reference). A timestamp can be specified from a column in the data or the path pattern data is stored in to enable a timeseries trait, which allows for easy and efficient filtering by time.
+Datasets are categorized into two types based on how users consume them in training.
+
+* [TabularDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas DataFrame. A `TabularDataset` object can be created from csv, tsv, parquet files, SQL query results etc. For a complete list, please visit our [documentation](https://aka.ms/tabulardataset-api-reference).
+
 * [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) references single or multiple files in your datastores or public urls. This provides you with the ability to download or mount the files to your compute. The files can be of any format, which enables a wider range of machine learning scenarios including deep learning.

 To find out more about upcoming API changes, see [here](https://aka.ms/tabular-dataset).
@@ -75,11 +77,11 @@ datastore = Datastore.get(workspace, datastore_name)

 ### Create TabularDatasets

-TabularDatasets can be created via the SDK or by using the workspace landing page (preview).
+TabularDatasets can be created via the SDK or by using the workspace landing page (preview). A timestamp can be specified from a column in the data or the path pattern data is stored in to enable a timeseries trait, which allows for easy and efficient filtering by time.

-#### SDK
+#### Using the SDK

-Use the `from_delimited_files()` method on `TabularDatasetFactory` class to read files in csv or tsv format, and create an unregistered TabularDataset. If you are reading from multiple files, results will be aggregated into one tabular representation.
+Use the [`from_delimited_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#from-delimited-files-path--validate-true--include-path-false--infer-column-types-true--set-column-types-none--separator------header--promoteheadersbehavior-all-files-have-same-headers--3---partition-format-none-) method on `TabularDatasetFactory` class to read files in csv or tsv format, and create an unregistered TabularDataset. If you are reading from multiple files, results will be aggregated into one tabular representation.

 ```Python
 # create a TabularDataset from multiple paths in datastore
@@ -104,8 +106,7 @@ titanic_ds.take(3).to_pandas_dataframe()
 1|2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38.0|1|0|PC 17599|71.2833|C85|C
 2|3|1|3|Heikkinen, Miss. Laina|female|26.0|0|0|STON/O2. 3101282|7.9250||S

-
-Use the `with_timestamp_columns()` method on `TabularDataset` class to enable easy and efficient filtering by time. More examples and details can be found [here](http://aka.ms/azureml-tsd-notebook).
+Use the [`with_timestamp_columns()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py#with-timestamp-columns-fine-grain-timestamp--coarse-grain-timestamp-none--validate-false-) method on `TabularDataset` class to enable easy and efficient filtering by time. More examples and details can be found [here](http://aka.ms/azureml-tsd-notebook).

 ```Python
 # create a TabularDataset with timeseries trait
@@ -124,7 +125,7 @@ data_slice = dataset.time_between(datetime(2019, 1, 1), datetime(2019, 2, 1))
 data_slice = dataset.time_recent(timedelta(weeks=1, days=1))
 ```

-#### Workspace landing page
+#### Using the workspace landing page

 Sign in to the [workspace landing page](https://ml.azure.com) to create a dataset via the web experience. Currently, the workspace landing page only supports the creation of TabularDatasets.
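
Note on the `time_between()` and `time_recent()` calls shown in the hunk above: the diff does not show how these filters behave, so the following is a minimal plain-Python sketch of the idea, not the SDK's implementation. The helpers `rows_between` and `rows_recent`, the sample rows, and the column name `datetime` are all hypothetical, and treating both window bounds as inclusive is an assumption of this sketch.

```python
from datetime import datetime, timedelta


def rows_between(rows, column, start, end):
    """Keep rows whose timestamp lies in [start, end] (inclusive bounds assumed)."""
    return [row for row in rows if start <= row[column] <= end]


def rows_recent(rows, column, delta, now=None):
    """Keep rows no older than `delta`, measured back from `now`."""
    now = now or datetime.now()
    return [row for row in rows if now - delta <= row[column] <= now]


# Hypothetical sample data; the column name "datetime" is illustrative only.
sample = [
    {"datetime": datetime(2018, 12, 31), "value": 1},
    {"datetime": datetime(2019, 1, 15), "value": 2},
    {"datetime": datetime(2019, 2, 1), "value": 3},
    {"datetime": datetime(2019, 3, 1), "value": 4},
]

# Same window as the time_between() call in the hunk header.
january = rows_between(sample, "datetime", datetime(2019, 1, 1), datetime(2019, 2, 1))

# Same span as the time_recent() call; `now` is pinned so the result is repeatable.
last_eight_days = rows_recent(
    sample, "datetime", timedelta(weeks=1, days=1), now=datetime(2019, 3, 2)
)
```

Pinning `now` keeps the example deterministic. A real dataset with the timeseries trait would presumably apply the filter against the configured timestamp column without loading every row into memory first.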

@@ -136,7 +137,7 @@ First, select **Datasets** in the **Assets** section of the left pane. Then, se

 ### Create FileDatasets

-Use the `from_files()` method on `FileDatasetFactory` class to load files in any format, and create an unregistered FileDataset.
+Use the [`from_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.filedatasetfactory?view=azure-ml-py#from-files-path--validate-true-) method on `FileDatasetFactory` class to load files in any format, and create an unregistered FileDataset.

 ```Python
 # create a FileDataset from multiple paths in datastore
@@ -154,11 +155,12 @@ web_paths = [
 ]
 mnist_ds = Dataset.File.from_files(path=web_paths)
 ```
+
 ## Register datasets

 To complete the creation process, register your datasets with workspace:

-Use the `register()` method to register datasets to your workspace so they can be shared with others and reused across various experiments.
+Use the [`register()`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py#register-workspace--name--description-none--tags-none--visible-true--exist-ok-false--update-if-exist-false-) method to register datasets to your workspace so they can be shared with others and reused across various experiments.

 ```Python
 titanic_ds = titanic_ds.register(workspace = workspace,
