Commit 3cf0be7

code examples
1 parent f629290 commit 3cf0be7

File tree

1 file changed
+27 −3 lines changed

articles/machine-learning/how-to-create-register-datasets.md

Lines changed: 27 additions & 3 deletions
@@ -117,6 +117,8 @@ Use the [`from_files()`](/python/api/azureml-core/azureml.data.dataset_factory.f
 If your storage is behind a virtual network or firewall, set the parameter `validate=False` in your `from_files()` method. This bypasses the initial validation step and ensures that you can create your dataset from these secure files. Learn more about how to [use datastores and datasets in a virtual network](how-to-secure-workspace-vnet.md#datastores-and-datasets).

 ```Python
+from azureml.core import Workspace, Datastore, Dataset
+
 # create a FileDataset pointing to files in 'animals' folder and its subfolders recursively
 datastore_paths = [(datastore, 'animals')]
 animal_ds = Dataset.File.from_files(path=datastore_paths)
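
For the virtual network scenario described in this hunk, a minimal sketch of the same call with `validate=False`; the datastore name `'secure_datastore'` is a placeholder, not a name from the article:

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, 'secure_datastore')  # placeholder: a datastore behind a VNet/firewall

# validate=False skips the initial validation step, which would otherwise
# fail if the storage behind the virtual network can't be reached
animal_ds = Dataset.File.from_files(path=[(datastore, 'animals')], validate=False)
```
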
@@ -130,6 +132,14 @@ mnist_ds = Dataset.File.from_files(path=web_paths)
 If you want to upload all the files from a local directory, create a FileDataset in a single method with [upload_directory()](/python/api/azureml-core/azureml.data.dataset_factory.filedatasetfactory#upload-directory-src-dir--target--pattern-none--overwrite-false--show-progress-true-). This method uploads data to your underlying storage, and as a result incurs storage costs.

 ```Python
+from azureml.core import Workspace, Datastore, Dataset
+from azureml.data.datapath import DataPath
+
+ws = Workspace.from_config()
+datastore = Datastore.get(ws, '<name of your datastore>')
+ds = Dataset.File.upload_directory(src_dir='<path to your data>',
+                                   target=DataPath(datastore, '<path on the datastore>'),
+                                   show_progress=True)

 ```
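
The `upload_directory()` signature linked in this hunk also exposes a `pattern` parameter; a hedged sketch of uploading only the CSV files, reusing the placeholders from the snippet above:

```python
from azureml.core import Workspace, Datastore, Dataset
from azureml.data.datapath import DataPath

ws = Workspace.from_config()
datastore = Datastore.get(ws, '<name of your datastore>')

# pattern restricts the upload to matching files; here, CSVs only
csv_ds = Dataset.File.upload_directory(src_dir='<path to your data>',
                                       target=DataPath(datastore, '<path on the datastore>'),
                                       pattern='*.csv',
                                       show_progress=True)
```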

@@ -323,15 +333,29 @@ titanic_ds.take(3).to_pandas_dataframe()
 You can create and register TabularDatasets from a pandas or spark dataframe.

 To create a TabularDataset from an in-memory pandas dataframe,
-use the [`register_pandas_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#register-pandas-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-) method. This method registers the TabularDataset to the workspace and uploads data to your underlying storage.
+use the [`register_pandas_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#register-pandas-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-) method. This method registers the TabularDataset to the workspace and uploads data to your underlying storage, which incurs storage costs.

 ```python
+from azureml.core import Workspace, Datastore, Dataset
+import pandas as pd
+
+pandas_df = pd.read_csv('<path to your csv file>')
+ws = Workspace.from_config()
+datastore = Datastore.get(ws, '<name of your datastore>')
+dataset = Dataset.Tabular.register_pandas_dataframe(pandas_df, datastore, "dataset_from_pandas_df", show_progress=True)
+
 ```

-You can also create a TabularDataset from a spark dataframe with the
-[`register_spark_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#register-spark-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-) method. This method registers the TabularDataset to the workspace and uploads data to your underlying storage.
+You can also create a TabularDataset from a readily available spark dataframe with the
+[`register_spark_dataframe()`](/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory#register-spark-dataframe-dataframe--target--name--description-none--tags-none--show-progress-true-) method. This method registers the TabularDataset to the workspace and uploads data to your underlying storage, which incurs storage costs.

 ```python
+from azureml.core import Workspace, Datastore, Dataset
+
+ws = Workspace.from_config()
+datastore = Datastore.get(ws, '<name of your datastore>')
+dataset = Dataset.Tabular.register_spark_dataframe(spark_df, datastore, "dataset_from_spark_df", show_progress=True)
+
 ```
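
The spark snippet above assumes a `spark_df` is already in hand. A minimal sketch of producing one, assuming an active Spark environment such as Azure Databricks; the CSV path is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.csv('<path to your csv file>', header=True, inferSchema=True)
```

Either registration can then be sanity-checked by fetching the dataset back by name, mirroring the `take(3).to_pandas_dataframe()` call in this hunk's context line:

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, name='dataset_from_pandas_df')
print(dataset.take(3).to_pandas_dataframe())
```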

 ## Register datasets
