
Commit 35288f8

edit
1 parent f9cba5a commit 35288f8

File tree

1 file changed: +1 -3 lines changed

articles/machine-learning/how-to-read-write-data-v2.md

Lines changed: 1 addition & 3 deletions
@@ -968,9 +968,7 @@ Files are read in *blocks* of 1-4 MB in size. Files smaller than a block are rea

 For small files, the latency interval mostly involves handling the requests to storage, instead of data transfers. Therefore, we offer these recommendations to increase the file size:

-- For unstructured data (images, video, etc.), archive (zip/tar) small files together, to store them as a larger file that can be read in multiple chunks. These larger archived files can be opened in the compute resource, and [PyTorch Archive DataPipes](https://meta-pytorch.org/data/0.9/dp_tutorial.html)
-
-can extract the smaller files.
+- For unstructured data (images, video, etc.), archive (zip/tar) small files together, to store them as a larger file that can be read in multiple chunks. These larger archived files can be opened in the compute resource, and [PyTorch Archive DataPipes](https://meta-pytorch.org/data/0.9/dp_tutorial.html) can extract the smaller files.
 - For structured data (CSV, parquet, etc.), examine your ETL process, to make sure that it coalesces files to increase size. Spark has `repartition()` and `coalesce()` methods to help increase file sizes.

 If you can't increase your file sizes, explore your [Azure Storage options](#azure-storage-options).
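To make the archive recommendation in the added bullet concrete, here is a minimal sketch of reading member files out of tar archives with torchdata's DataPipes (the API the commit links to). The folder name `archives/`, the `*.tar` mask, and the read loop are illustrative assumptions, not part of the commit or the linked tutorial.

```python
# A minimal sketch, assuming torchdata is installed and the small files were
# packed into tar archives under an illustrative local folder "archives/".
from torchdata.datapipes.iter import FileLister, FileOpener

# List the archives, open each as a binary stream, and let the tar-loading
# datapipe yield the member files one by one.
dp = FileLister(root="archives/", masks="*.tar")
dp = FileOpener(dp, mode="b")
dp = dp.load_from_tar()  # .load_from_zip() handles zip archives

for member_path, stream in dp:
    # member_path names a file inside an archive; stream holds its bytes.
    payload = stream.read()
```

Reading a handful of large archives this way replaces thousands of per-file storage requests with a few sequential block reads, which is the latency win the bullet describes.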
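For the structured-data bullet, a PySpark sketch of the `coalesce()` approach might look like the following; the storage paths and the target partition count of 8 are placeholder assumptions.

```python
# A minimal PySpark sketch: read many small CSV files and rewrite them as
# fewer, larger Parquet files. Paths and the partition count are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-small-files").getOrCreate()

df = spark.read.csv("path/to/many-small-files/", header=True)

# coalesce(8) narrows the DataFrame to 8 partitions without a full shuffle,
# so the write produces at most 8 output files.
df.coalesce(8).write.mode("overwrite").parquet("path/to/larger-files/")
```

`coalesce()` avoids a shuffle but can only reduce the partition count; `repartition()` triggers a full shuffle and can rebalance partitions to an evenly sized target in either direction.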
