
Commit 93700b0

Merge pull request #106274 from nibaccam/patch-6
Data | add code for creating dataset from dataframe
2 parents 52f858c + f237078 commit 93700b0

File tree

1 file changed: +30 -6 lines changed


articles/machine-learning/how-to-create-register-datasets.md

Lines changed: 30 additions & 6 deletions
@@ -55,7 +55,7 @@ To learn more about upcoming API changes, see [Dataset API change notice](https:
 
 ## Create datasets
 
-By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost. You can create both `TabularDataset` and `FileDataset` data sets by using the Python SDK or workspace landing page (preview).
+By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost. You can create both `TabularDataset` and `FileDataset` data sets by using the Python SDK or https://ml.azure.com.
 
 For the data to be accessible by Azure Machine Learning, datasets must be created from paths in [Azure datastores](how-to-access-data.md) or public web URLs.
 

@@ -72,8 +72,6 @@ To create datasets from an [Azure datastore](how-to-access-data.md) by using the
 
 #### Create a TabularDataset
 
-You can create TabularDatasets through the SDK or by using Azure Machine Learning studio.
-
 Use the [`from_delimited_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#from-delimited-files-path--validate-true--include-path-false--infer-column-types-true--set-column-types-none--separator------header-true--partition-format-none-) method on the `TabularDatasetFactory` class to read files in .csv or .tsv format, and to create an unregistered TabularDataset. If you're reading from multiple files, results will be aggregated into one tabular representation.
 
 ```Python
@@ -94,10 +92,10 @@ datastore_paths = [(datastore, 'ather/2018/11.csv'),
 weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
 ```
 
-By default, when you create a TabularDataset, column data types are inferred automatically. If the inferred types don't match your expectations, you can specify column types by using the following code. If your storage is behind a virtual network or firewall, include the parameters `validate=False` and `infer_column_types=False` in your `from_delimited_files()` method. This bypasses the initial validation check and ensures that you can create your dataset from these secure files. You can also [learn more about supported data types](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.datatype?view=azure-ml-py).
+By default, when you create a TabularDataset, column data types are inferred automatically. If the inferred types don't match your expectations, you can specify column types by using the following code. The parameter `infer_column_type` is only applicable for datasets created from delimited files. You can also [learn more about supported data types](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.datatype?view=azure-ml-py).
 
-> [!NOTE]
->The parameter `infer_column_type` is only applicable for datasets created from delimited files.
+> [!IMPORTANT]
+> If your storage is behind a virtual network or firewall, only creation of a dataset via the SDK is supported. To create your dataset, be sure to include the parameters `validate=False` and `infer_column_types=False` in your `from_delimited_files()` method. This bypasses the initial validation check and ensures that you can create your dataset from these secure files.
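As an editor's aside (not part of the commit): a minimal sketch of what the VNet-safe call shape looks like. Only the keyword arguments are built here; the live `Workspace`/`Datastore` objects are omitted, and the datastore path in the commented call is hypothetical.

```python
# The two parameters that let dataset creation succeed when the storage
# account is behind a virtual network or firewall: skip the reachability
# validation and skip reading file contents to infer column types.
secure_kwargs = dict(validate=False, infer_column_types=False)

# With a real workspace and datastore in scope (not shown), the call
# would look like this:
# titanic_ds = Dataset.Tabular.from_delimited_files(
#     path=(datastore, 'titanic/*.csv'),  # hypothetical path
#     **secure_kwargs,
# )
print(secure_kwargs)
```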
 
 ```Python
 from azureml.data.dataset_factory import DataType
@@ -116,6 +114,32 @@ titanic_ds.take(3).to_pandas_dataframe()
 1|2|True|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38.0|1|0|PC 17599|71.2833|C85|C
 2|3|True|3|Heikkinen, Miss. Laina|female|26.0|0|0|STON/O2. 3101282|7.9250||S
 
+
+To create a dataset from an in-memory pandas DataFrame, write the data to a local file, like a .csv, and create your dataset from that file. The following code demonstrates this workflow.
+
+```python
+local_path = 'data/prepared.csv'
+dataframe.to_csv(local_path)
+
+# upload the local file to a datastore on the cloud
+# azureml-core of version 1.0.72 or higher is required
+# azureml-dataprep[pandas] of version 1.1.34 or higher is required
+from azureml.core import Workspace, Dataset
+
+subscription_id = 'xxxxxxxxxxxxxxxxxxxxx'
+resource_group = 'xxxxxx'
+workspace_name = 'xxxxxxxxxxxxxxxx'
+
+workspace = Workspace(subscription_id, resource_group, workspace_name)
+
+# get the datastore to upload prepared data
+datastore = workspace.get_default_datastore()
+
+# upload the local file from src_dir to the target_path in datastore
+datastore.upload(src_dir='data', target_path='data')
+
+# create a dataset referencing the cloud location
+dataset = Dataset.Tabular.from_delimited_files(datastore.path('data/prepared.csv'))
+```
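One note on the added snippet (editor's aside, not part of the commit): `to_csv()` writes the DataFrame index as an extra unnamed column by default, and that column then surfaces in the resulting TabularDataset. A local, pandas-only sketch of the round trip, with the Azure upload omitted and illustrative data:

```python
import os
import tempfile

import pandas as pd

# A small DataFrame standing in for the prepared data (illustrative values).
dataframe = pd.DataFrame({'name': ['Cumings', 'Heikkinen'], 'survived': [True, True]})

local_path = os.path.join(tempfile.mkdtemp(), 'prepared.csv')

# index=False keeps the pandas index out of the file; without it, the
# round-tripped data (and the resulting TabularDataset) gains an extra
# unnamed column holding the index values.
dataframe.to_csv(local_path, index=False)

round_tripped = pd.read_csv(local_path)
print(list(round_tripped.columns))
```

Passing `index=False` as above, or dropping the stray column after `to_pandas_dataframe()`, keeps the dataset's columns identical to the original DataFrame's.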
+
 Use the [`from_sql_query()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#from-sql-query-query--validate-true--set-column-types-none-) method on the `TabularDatasetFactory` class to read from Azure SQL Database:
 
 ```Python
