
Commit 0be8cd4

wording
1 parent bd69ca7 commit 0be8cd4

File tree

2 files changed: +16 -12 lines changed

articles/machine-learning/how-to-access-data.md

Lines changed: 3 additions & 3 deletions
@@ -6,8 +6,8 @@ services: machine-learning
 ms.service: machine-learning
 ms.subservice: core
 ms.topic: conceptual
-ms.author: keli19
-author: likebupt
+ms.author: sihhu
+author: MayMSFT
 ms.reviewer: nibaccam
 ms.date: 02/27/2020
 ms.custom: seodec18
@@ -25,7 +25,7 @@ You can create datastores from [these Azure Storage solutions](#matrix). For uns
 
 ## Prerequisites
 You'll need:
-- An Azure subscription. If you dont have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).
+- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).
 
 - An Azure storage account with an [Azure blob container](https://docs.microsoft.com/azure/storage/blobs/storage-blobs-overview) or [Azure file share](https://docs.microsoft.com/azure/storage/files/storage-files-introduction).
 

articles/machine-learning/how-to-create-register-datasets.md

Lines changed: 13 additions & 9 deletions
@@ -34,7 +34,7 @@ With Azure Machine Learning datasets, you can:
 
 To create and work with datasets, you need:
 
-* An Azure subscription. If you dont have one, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).
+* An Azure subscription. If you don't have one, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree).
 
 * An [Azure Machine Learning workspace](how-to-manage-workspace.md).
 
@@ -43,6 +43,18 @@ To create and work with datasets, you need:
 > [!NOTE]
 > Some dataset classes have dependencies on the [azureml-dataprep](https://docs.microsoft.com/python/api/azureml-dataprep/?view=azure-ml-py) package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, Fedora, and CentOS.
 
+## Compute size guidance
+
+Consider your compute processing power and the size of your data in memory.
+The size of your data in storage is not the same as the size of data in a dataframe. For example, data in CSV files can expand up to 10x in a dataframe, so a 1 GB CSV file can become 10 GB in a dataframe.
+
+The main factor is how large the dataset is in memory, that is, as a dataframe. We recommend that your compute have RAM at least 2x the size of the dataframe. So if your dataframe is 10 GB, you want a compute target with 20+ GB of RAM so the dataframe can comfortably fit in memory and be processed.
+If your data is compressed, it can expand further; 20 GB of relatively sparse data stored in compressed Parquet format can expand to ~800 GB in memory. Because Parquet stores data in a columnar format, if you only need half of the columns, then you only need to load ~400 GB in memory.
+
+If you're using Pandas, there's no reason to have more than 1 vCPU, since that's all it will use. You can easily parallelize to many vCPUs on a single Azure Machine Learning compute instance/node via Modin and Dask/Ray, and scale out to a large cluster if needed, by simply changing `import pandas as pd` to `import modin.pandas as pd`.
+
+If you can't get a big enough VM for the data, you have two options: use a framework like Spark or Dask to perform the processing on the data "out of memory" (the dataframe is loaded into RAM partition by partition and processed, with the final result gathered at the end), or scale out to a Spark or Dask cluster, which can still be used interactively.
+
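The sizing guidance above can be checked directly: pandas reports a dataframe's exact in-memory footprint, and the rough rows x columns x bytes-per-dtype estimate gets close to it. This is a minimal sketch with illustrative sizes, not part of the commit.

```python
import numpy as np
import pandas as pd

# A small dataframe to illustrate; real datasets are much larger.
df = pd.DataFrame({
    "id": np.arange(1_000_000, dtype=np.int64),   # 8 bytes per value
    "value": np.random.rand(1_000_000),           # float64, 8 bytes per value
})

# Exact in-memory size in bytes, as reported by pandas (includes the index).
in_memory_bytes = df.memory_usage(deep=True).sum()

# Rough estimate: rows x columns x bytes per dtype (both columns are 8-byte).
estimate_bytes = len(df) * 2 * 8

print(f"measured: {in_memory_bytes / 1e6:.1f} MB, "
      f"estimated: {estimate_bytes / 1e6:.1f} MB")
```

Comparing the measured number against available RAM (aiming for the 2x headroom recommended above) tells you whether a compute target is large enough before you submit a run.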
 ## Dataset types
 
 There are two dataset types, based on how users consume them in training:
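The one-line Modin swap mentioned in the guidance above can be written defensively, so the same script runs in parallel when Modin is available and falls back to single-threaded pandas when it isn't. A sketch (assuming Modin's drop-in pandas API):

```python
try:
    # Drop-in parallel replacement for pandas; spreads work across vCPUs.
    import modin.pandas as pd
except ImportError:
    # Fall back to plain, single-threaded pandas if Modin isn't installed.
    import pandas as pd

df = pd.DataFrame({"x": range(10)})
total = int(df["x"].sum())
print(total)  # 45 with either backend
```

Because Modin mirrors the pandas API, the rest of the code is unchanged whichever import succeeds.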
@@ -59,14 +71,6 @@ By creating a dataset, you create a reference to the data source location, along
 
 For the data to be accessible by Azure Machine Learning, datasets must be created from paths in [Azure datastores](how-to-access-data.md) or public web URLs.
 
-Be sure to check how large the dataset is in-memory (usually as a dataframe) - you typically want a machine with ~2x this size of RAM so you
-
-If you're using Pandas, there's no reason to have more than 1 vCPU since that's all it will use. You can easily parallelize to many vCPUs on a single AMLS compute instance/node via Modin and Dask/Ray (and scale out to a large cluster if needed) by simply changing `import pandas as pd` to `import modin.pandas as pd`.
-
-If you can't get a big enough VM for the data, you have two options: use a framework like Spark or Dask to perform the processing on the data 'out of memory', i.e. the dataframe is loaded into RAM partition by partition and processed, with the final result being gathered at the end. If this is too slow, Spark or Dask allow you to scale out to a cluster which can still be used interactively.
-
-Note: the size of your data in storage (i.e. 1 GB CSV file) is not the same as the size of data in a dataframe. This can be computed by (rows) x (columns) x (bytes/dtype). For CSV files, the data usually expands ~2-10x in a dataframe, so the 1 GB CSV file becomes 2-10 GB. If your data is compressed, it can expand further - 20 GB of relatively sparse data stored in compressed parquet format can expand to ~800 GB in memory. Since Parquet stores data in a columnar format, if you only need 1/2 of the columns then you only need to load ~400 GB in memory.
-
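The "out of memory", partition-by-partition processing this commit describes can be sketched on a single machine with pandas' chunked CSV reader; the file and chunk size here are illustrative, and Spark or Dask apply the same pattern across a cluster.

```python
import csv
import tempfile

import pandas as pd

# Write a small CSV to stand in for a file too big to load at once.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as f:
    writer = csv.writer(f)
    writer.writerow(["value"])
    writer.writerows([[i] for i in range(1000)])
    path = f.name

# Process the file partition by partition; only one chunk is in RAM
# at a time, and partial results are gathered at the end.
total = 0
for chunk in pd.read_csv(path, chunksize=100):
    total += int(chunk["value"].sum())

print(total)  # sum of 0..999 = 499500
```

Each iteration holds only 100 rows in memory, which is why this approach works even when the full dataframe would not fit in RAM.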
 ### Use the SDK
 
 To create datasets from an [Azure datastore](how-to-access-data.md) by using the Python SDK:
