articles/machine-learning/how-to-access-data.md
+3 -3 (3 additions, 3 deletions)
@@ -6,8 +6,8 @@ services: machine-learning
6
6
ms.service: machine-learning
7
7
ms.subservice: core
8
8
ms.topic: conceptual
9
-
ms.author: keli19
10
-
author: likebupt
9
+
ms.author: sihhu
10
+
author: MayMSFT
11
11
ms.reviewer: nibaccam
12
12
ms.date: 02/27/2020
13
13
ms.custom: seodec18
@@ -25,7 +25,7 @@ You can create datastores from [these Azure Storage solutions](#matrix). For uns
25
25
26
26
## Prerequisites
27
27
You'll need:
28
-
- An Azure subscription. If you don’t have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).
28
+
- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).
29
29
30
30
- An Azure storage account with an [Azure blob container](https://docs.microsoft.com/azure/storage/blobs/storage-blobs-overview) or [Azure file share](https://docs.microsoft.com/azure/storage/files/storage-files-introduction).
articles/machine-learning/how-to-create-register-datasets.md
+16 -5 (16 additions, 5 deletions)
@@ -31,10 +31,10 @@ With Azure Machine Learning datasets, you can:
31
31
* Share data and collaborate with other users.
32
32
33
33
## Prerequisites
34
-
34
+
'
35
35
To create and work with datasets, you need:
36
36
37
-
* An Azure subscription. If you don’t have one, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).
37
+
* An Azure subscription. If you don't have one, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).
38
38
39
39
* An [Azure Machine Learning workspace](how-to-manage-workspace.md).
40
40
@@ -43,6 +43,18 @@ To create and work with datasets, you need:
43
43
> [!NOTE]
44
44
> Some dataset classes have dependencies on the [azureml-dataprep](https://docs.microsoft.com/python/api/azureml-dataprep/?view=azure-ml-py) package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, Fedora, and CentOS.
45
45
46
+
## Compute size guidance
47
+
48
+
When creating a dataset, review your compute processing power and the size of your data in memory.
49
+
The size of your data in storage is not the same as the size of data in a dataframe. For example, data in CSV files can expand up to 10x in a dataframe, so a 1 GB CSV file can become 10 GB in a dataframe.
50
+
51
+
The main factor is how large the dataset is in memory, that is, as a dataframe. We recommend a compute target with at least 2x as much RAM as the in-memory size of your data. So if your dataframe is 10 GB, you want a compute target with 20+ GB of RAM to ensure that the dataframe can comfortably fit in memory and be processed.
52
+
If your data is compressed, it can expand further; 20 GB of relatively sparse data stored in compressed parquet format can expand to ~800 GB in memory. Since Parquet files store data in a columnar format, if you only need half of the columns, then you only need to load ~400 GB in memory.
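The sizing guidance above can be turned into a quick back-of-the-envelope calculation. The following is a minimal sketch, not part of the Azure Machine Learning SDK: the helper name `required_ram_gb` and its parameters are hypothetical, and the expansion factors (10x for CSV, 40x for the sparse Parquet example) are taken from the figures quoted above rather than measured.

```python
def required_ram_gb(storage_gb, expansion_factor=10.0, column_fraction=1.0, headroom=2.0):
    """Estimate the RAM (in GB) to provision for a dataset.

    Hypothetical helper for illustration only.
    storage_gb:       size of the data as stored (for example, CSV or Parquet files)
    expansion_factor: how much the data grows when loaded into a dataframe
    column_fraction:  share of the columns actually loaded (useful for columnar formats)
    headroom:         multiplier over the in-memory size (the guidance above suggests 2x)
    """
    if storage_gb < 0:
        raise ValueError("storage_gb must be non-negative")
    if expansion_factor <= 0 or headroom <= 0:
        raise ValueError("expansion_factor and headroom must be positive")
    if not 0 < column_fraction <= 1:
        raise ValueError("column_fraction must be in (0, 1]")

    in_memory_gb = storage_gb * expansion_factor * column_fraction
    return in_memory_gb * headroom


# 1 GB CSV file: up to 10 GB as a dataframe, so plan for about 20 GB of RAM.
csv_ram = required_ram_gb(1)

# 20 GB of sparse, compressed Parquet that expands to ~800 GB (a 40x factor),
# loading only half of the columns: ~400 GB in memory, so about 800 GB of RAM.
parquet_ram = required_ram_gb(20, expansion_factor=40, column_fraction=0.5)

print(csv_ram, parquet_ram)
```

Treat the result as a rough upper bound; actual expansion depends on column types and how sparse the data is.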
53
+
54
+
If you're using Pandas, there's no reason to have more than 1 vCPU since that's all it will use. You can easily parallelize to many vCPUs on a single Azure Machine Learning compute instance/node via Modin and Dask/Ray, and scale out to a large cluster if needed, by simply changing `import pandas as pd` to `import modin.pandas as pd`.
55
+
56
+
If you can't get a big enough virtual machine for the data, you have two options. First, use a framework like Spark or Dask to process the data 'out of memory': the dataframe is loaded into RAM partition by partition and processed, with the final result gathered at the end. Second, if this is too slow, Spark or Dask allow you to scale out to a cluster, which can still be used interactively.
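To illustrate the partition-by-partition pattern described above, here is a minimal sketch that uses only the Python standard library. It is not the Spark or Dask API; those frameworks manage partitioning and scheduling for you. The function names and the in-memory sample data are hypothetical and exist only for this example.

```python
import csv
import io
from itertools import islice


def iter_partitions(rows, partition_size):
    """Yield lists of at most partition_size rows, so only one partition is in RAM at a time."""
    iterator = iter(rows)
    while True:
        partition = list(islice(iterator, partition_size))
        if not partition:
            return
        yield partition


def partial_stats(partition, column):
    """Compute a partial result (sum and count) for a single partition."""
    values = [float(row[column]) for row in partition]
    return sum(values), len(values)


def out_of_core_mean(csv_file, column, partition_size=2):
    """Process the file partition by partition and gather the final result at the end."""
    reader = csv.DictReader(csv_file)
    total, count = 0.0, 0
    for partition in iter_partitions(reader, partition_size):
        part_sum, part_count = partial_stats(partition, column)
        total += part_sum
        count += part_count
    if count == 0:
        raise ValueError("no rows to aggregate")
    return total / count


# Hypothetical sample data standing in for a file too large to load at once.
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")
mean_value = out_of_core_mean(sample, "value", partition_size=2)
print(mean_value)
```

The same shape (a per-partition partial result plus a final combine step) is what lets a framework distribute the work across a cluster when a single machine is too slow.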
57
+
46
58
## Dataset types
47
59
48
60
There are two dataset types, based on how users consume them in training:
@@ -55,9 +67,9 @@ To learn more about upcoming API changes, see [Dataset API change notice](https:
55
67
56
68
## Create datasets
57
69
58
-
By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost. You can create both `TabularDataset` and `FileDataset` data sets by using the Python SDK or https://ml.azure.com.
70
+
By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost. You can create both `TabularDataset` and `FileDataset` data sets by using the Python SDK or at https://ml.azure.com.
59
71
60
-
For the data to be accessible by Azure Machine Learning, datasets must be created from paths in [Azure datastores](how-to-access-data.md) or public web URLs.
72
+
For the data to be accessible by Azure Machine Learning, datasets must be created from paths in [Azure datastores](how-to-access-data.md) or public web URLs.
61
73
62
74
### Use the SDK
63
75
@@ -69,7 +81,6 @@ To create datasets from an [Azure datastore](how-to-access-data.md) by using the
69
81
> [!Note]
70
82
> You can create a dataset from multiple paths in multiple datastores. There is no hard limit on the number of files or data size that you can create a dataset from. However, for each data path, a few requests will be sent to the storage service to check whether it points to a file or a folder. This overhead may lead to degraded performance or failure. A dataset referencing one folder with 1000 files inside is considered to reference one data path. For optimal performance, we recommend creating datasets that reference fewer than 100 paths in datastores.
71
83
72
-
73
84
#### Create a TabularDataset
74
85
75
86
Use the [`from_delimited_files()`](https://docs.microsoft.com/python/api/azureml-core/azureml.data.dataset_factory.tabulardatasetfactory?view=azure-ml-py#from-delimited-files-path--validate-true--include-path-false--infer-column-types-true--set-column-types-none--separator------header-true--partition-format-none-) method on the `TabularDatasetFactory` class to read files in .csv or .tsv format, and to create an unregistered TabularDataset. If you're reading from multiple files, results will be aggregated into one tabular representation.