
Commit bd69ca7

add compute size guidance
1 parent 42afda6 commit bd69ca7

File tree

1 file changed

+10 -3 lines changed


articles/machine-learning/how-to-create-register-datasets.md

Lines changed: 10 additions & 3 deletions
@@ -55,9 +55,17 @@ To learn more about upcoming API changes, see [Dataset API change notice](https:
## Create datasets

-By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost. You can create both `TabularDataset` and `FileDataset` data sets by using the Python SDK or workspace landing page (preview).
+By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost. You can create both `TabularDataset` and `FileDataset` datasets by using the Python SDK or Azure Machine Learning studio at https://ml.azure.com.

-For the data to be accessible by Azure Machine Learning, datasets must be created from paths in [Azure datastores](how-to-access-data.md) or public web URLs.
+For the data to be accessible by Azure Machine Learning, datasets must be created from paths in [Azure datastores](how-to-access-data.md) or public web URLs.
+
+Be sure to check how large the dataset is in memory (usually as a dataframe); you typically want a machine with roughly 2x that much RAM so you have headroom left over for processing.
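For example, a minimal sketch of checking the in-memory size, assuming a pandas dataframe loaded from a sample of the data (the URL is a placeholder):

```python
import pandas as pd

# Hypothetical example: load a sample of the data to inspect its footprint.
df = pd.read_csv("https://example.com/sample.csv")  # placeholder URL

# Actual in-memory size of the dataframe, in GB.
size_gb = df.memory_usage(deep=True).sum() / 1e9

# Rule of thumb from above: provision a VM with ~2x the in-memory size.
print(f"dataframe: {size_gb:.2f} GB, suggested RAM: {2 * size_gb:.2f} GB")
```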
+
+If you're using pandas, there's no reason to have more than 1 vCPU, since that's all it will use. You can easily parallelize to many vCPUs on a single Azure Machine Learning compute instance/node via Modin and Dask/Ray (and scale out to a large cluster if needed) by simply changing `import pandas as pd` to `import modin.pandas as pd`.
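A minimal sketch of that drop-in swap, assuming the `modin` package (with a Ray or Dask backend) is installed and using placeholder file and column names:

```python
# Before: single-core pandas
# import pandas as pd

# After: Modin runs the same dataframe API in parallel across all vCPUs on the node.
import modin.pandas as pd

df = pd.read_csv("./data/large_file.csv")   # placeholder path
print(df.groupby("some_column").size())     # placeholder column name
```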
+
+If you can't get a big enough VM for the data, you have two options: use a framework like Spark or Dask to process the data 'out of memory' (the dataframe is loaded into RAM partition by partition and processed, with the final result gathered at the end), or, if that is too slow, scale out to a Spark or Dask cluster, which can still be used interactively.
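A sketch of the out-of-memory approach, using Dask as an example; the file pattern and column names are placeholders:

```python
import dask.dataframe as dd

# Dask builds a lazy, partitioned dataframe; nothing is loaded into RAM yet.
ddf = dd.read_csv("./data/part-*.csv", blocksize="256MB")  # placeholder path

# Transformations run partition by partition; only the small aggregated
# result is gathered into memory when .compute() is called.
result = ddf.groupby("category")["value"].mean().compute()  # placeholder columns
print(result)
```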
+
+Note: the size of your data in storage (for example, a 1 GB CSV file) is not the same as the size of the data in a dataframe. It can be estimated as (rows) x (columns) x (bytes per dtype). For CSV files, the data usually expands ~2-10x in a dataframe, so a 1 GB CSV file can become 2-10 GB in memory. If your data is compressed, it can expand further: 20 GB of relatively sparse data stored in compressed Parquet format can expand to ~800 GB in memory. Because Parquet stores data in a columnar format, if you only need half of the columns, then you only need to load ~400 GB into memory.
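As a worked example of that estimate, with purely illustrative numbers:

```python
rows = 10_000_000        # 10 million rows
cols = 20                # 20 numeric columns
bytes_per_value = 8      # float64/int64 values are 8 bytes each

est_gb = rows * cols * bytes_per_value / 1e9
print(f"estimated in-memory size: {est_gb:.1f} GB")  # ~1.6 GB
```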

### Use the SDK

@@ -69,7 +77,6 @@ To create datasets from an [Azure datastore](how-to-access-data.md) by using the
> [!Note]
> You can create a dataset from multiple paths in multiple datastores. There is no hard limit on the number of files or data size that you can create a dataset from. However, for each data path, a few requests are sent to the storage service to check whether it points to a file or a folder. This overhead can lead to degraded performance or failure. A dataset referencing one folder with 1000 files inside is considered to reference one data path. For optimal performance, we recommend creating datasets that reference fewer than 100 paths in datastores.
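For instance, a minimal sketch of creating a dataset from a couple of datastore paths; the workspace config, datastore name, and file patterns are placeholders:

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()                     # assumes a config.json is present
datastore = Datastore.get(ws, "workspaceblobstore")  # placeholder datastore name

# Each tuple is one data path; a folder (or glob) still counts as a single path.
paths = [
    (datastore, "weather/2018/*.csv"),  # placeholder paths
    (datastore, "weather/2019/*.csv"),
]
dataset = Dataset.Tabular.from_delimited_files(path=paths)
```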
-
#### Create a TabularDataset

You can create TabularDatasets through the SDK or by using Azure Machine Learning studio.
