articles/machine-learning/how-to-create-register-datasets.md
Lines changed: 10 additions & 3 deletions
@@ -55,9 +55,17 @@ To learn more about upcoming API changes, see [Dataset API change notice](https:
## Create datasets
- By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost. You can create both `TabularDataset` and `FileDataset` data sets by using the Python SDK or workspace landing page (preview).
+ By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost. You can create both `TabularDataset` and `FileDataset` data sets by using the Python SDK or at https://ml.azure.com.
- For the data to be accessible by Azure Machine Learning, datasets must be created from paths in [Azure datastores](how-to-access-data.md) or public web URLs.
+ For the data to be accessible by Azure Machine Learning, datasets must be created from paths in [Azure datastores](how-to-access-data.md) or public web URLs.
+ Be sure to check how large the dataset is in memory (usually as a dataframe) - you typically want a machine with roughly 2x this size of RAM so you have headroom to process the data.
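One way to get that estimate is to load a sample and extrapolate. A minimal sketch follows; the file name and total row count are placeholder assumptions, not part of the article:

```python
# Minimal sketch: estimate full in-memory size from a sample.
# "data.csv" and the 25,000,000 row count are placeholder assumptions.
import pandas as pd

sample = pd.read_csv("data.csv", nrows=100_000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

total_rows = 25_000_000
est_gb = bytes_per_row * total_rows / 1e9
print(f"estimated in-memory size: ~{est_gb:.1f} GB, target RAM: ~{2 * est_gb:.0f} GB")
```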
+ If you're using pandas, there's no reason to have more than 1 vCPU, since that's all it will use. You can easily parallelize across many vCPUs on a single Azure Machine Learning compute instance or node via Modin and Dask/Ray (and scale out to a large cluster if needed) by simply changing `import pandas as pd` to `import modin.pandas as pd`.
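As a rough sketch of what that swap looks like (assuming `modin[ray]` or `modin[dask]` is installed on the compute instance, and `large_file.csv` is a placeholder path):

```python
# Modin is a drop-in replacement for the pandas API, so the rest of the
# code stays the same while work is spread across all available vCPUs.
import modin.pandas as pd  # was: import pandas as pd

df = pd.read_csv("large_file.csv")                        # read is parallelized across cores
summary = df.groupby("category").mean(numeric_only=True)  # aggregations run in parallel too
print(summary.head())
```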
+ If you can't get a big enough VM for the data, you have two options. The first is to use a framework like Spark or Dask to process the data 'out of memory': the dataframe is loaded into RAM and processed partition by partition, with the final result gathered at the end. If this is too slow, the second option is to scale out to a Spark or Dask cluster, which can still be used interactively.
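To illustrate the out-of-memory approach, here is a minimal Dask sketch (Spark follows the same idea; the `logs-*.csv` pattern and the column names are placeholders):

```python
import dask.dataframe as dd

# Nothing is loaded into RAM yet; the data is split into partitions.
df = dd.read_csv("logs-*.csv")

# Operations build a lazy task graph rather than executing immediately.
daily_mean = df.groupby("date")["value"].mean()

# compute() streams partitions through memory and gathers the small result.
result = daily_mean.compute()
print(result.head())
```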
+ Note: the size of your data in storage (for example, a 1 GB CSV file) is not the same as the size of the data in a dataframe. The in-memory size can be estimated as (rows) x (columns) x (bytes per dtype). For CSV files, the data usually expands roughly 2-10x in a dataframe, so a 1 GB CSV file can become 2-10 GB. If your data is compressed, it can expand further: 20 GB of relatively sparse data stored in compressed Parquet format can expand to ~800 GB in memory. Because Parquet stores data in a columnar format, if you only need half of the columns, you only need to load ~400 GB into memory.
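A quick back-of-envelope version of that calculation (the figures below are illustrative only):

```python
# (rows) x (columns) x (bytes per dtype), assuming 8-byte float64/int64 columns.
rows = 50_000_000
columns = 20
bytes_per_value = 8

estimated_bytes = rows * columns * bytes_per_value
print(f"~{estimated_bytes / 1e9:.1f} GB in a dataframe")      # ~8.0 GB
print(f"aim for ~{2 * estimated_bytes / 1e9:.0f} GB of RAM")  # ~16 GB
```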
### Use the SDK
@@ -69,7 +77,6 @@ To create datasets from an [Azure datastore](how-to-access-data.md) by using the
> [!Note]
> You can create a dataset from multiple paths in multiple datastores. There is no hard limit on the number of files or the data size that you can create a dataset from. However, for each data path, a few requests are sent to the storage service to check whether it points to a file or a folder. This overhead can lead to degraded performance or failure. A dataset that references one folder with 1,000 files inside is considered to reference one data path. For optimal performance, we recommend creating datasets that reference fewer than 100 paths in datastores.
-
#### Create a TabularDataset
You can create TabularDatasets through the SDK or by using Azure Machine Learning studio.
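For context, a minimal sketch of creating and registering a `TabularDataset` with the Python SDK; the datastore path and dataset name here are placeholder assumptions:

```python
from azureml.core import Workspace, Dataset

# Connect to the workspace (reads the local config.json by default).
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Create a TabularDataset from delimited files in the datastore.
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'weather/2019/*.csv'))

# Register it so it can be retrieved by name later.
dataset = dataset.register(workspace=ws, name='weather-2019', create_new_version=True)

# Load into a pandas dataframe when needed.
df = dataset.to_pandas_dataframe()
```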