
Commit ad973ab

ML storage (#260)
This could very well grow into a full page with recommendations for I/O profiling and monitoring tools, but I don't think we have much ready at the moment, so we could take this as it is now and revisit it in the future?
1 parent c52d979 commit ad973ab

2 files changed: +18 lines, -1 line
Lines changed: 2 additions & 0 deletions
@@ -1,4 +1,6 @@
 JAX
+NPZ
 nvitop
 NVRTC
 placeholders
+Zarr

docs/software/ml/index.md

Lines changed: 16 additions & 1 deletion
@@ -6,7 +6,22 @@ Most ML workloads are containerized to ensure portability, reproducibility, and

Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs.

First-time users are recommended to consult the [LLM tutorials][ref-tutorials-ml] to get familiar with the concepts of the Machine Learning platform in a series of hands-on examples.

## Optimizing Data Loading for Machine Learning

Efficient data management is essential for scalable machine learning workloads. Before beginning ML training, ensure your data access strategy maximizes throughput and minimizes impact on shared filesystems.

Key recommendations:

- **Avoid reading many small files or random small chunks from large files.** Instead, structure your datasets so that each access reads large, contiguous blocks of at least 100 MB per read request.
- **Use formats designed for efficient sequential access.** For computer vision, formats like [TFRecords](https://www.tensorflow.org/tutorials/load_data/tfrecord) or [WebDataset](https://github.com/webdataset/webdataset) are commonly used. For other modalities, [HDF5](https://www.h5py.org/), [Zarr](https://zarr.readthedocs.io/), or even simple formats like a single NPZ file per training step can be effective, provided the data is chunked to match your reading pattern and the chunks are large enough (see the first sketch after this list).
- **Customize data packaging for your model.** Preprocessing and chunking data to fit the specific variables required by your ML model can improve performance, but may require creating a tailored copy of the dataset.
- **Monitor your data loading performance.** The most reliable metric is the raw throughput of your dataloader without the training model in the loop (see the second sketch below). Be aware, however, that this does not fully capture the impact on the shared filesystem.
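
The first two recommendations can be made concrete with a small sketch. The following is a minimal illustration (not part of this commit) of writing and reading a Zarr array whose chunks match the access pattern; the file name, array shape, and chunk size are assumptions chosen only to land above the ~100 MB-per-read guideline.

```python
import numpy as np
import zarr

# Illustrative sizes: one float32 sample of dimension 1024 is 4 KiB,
# so 32768 samples per chunk gives 128 MiB chunks, comfortably above
# the ~100 MB-per-read guideline.
SAMPLE_DIM = 1024
SAMPLES_PER_CHUNK = 32_768
N_SAMPLES = 4 * SAMPLES_PER_CHUNK

# Write: chunk along the sample axis, so that one chunk later
# corresponds to one large, contiguous read.
store = zarr.open(
    "train_data.zarr",
    mode="w",
    shape=(N_SAMPLES, SAMPLE_DIM),
    chunks=(SAMPLES_PER_CHUNK, SAMPLE_DIM),
    dtype="float32",
)
rng = np.random.default_rng(0)
for start in range(0, N_SAMPLES, SAMPLES_PER_CHUNK):
    store[start:start + SAMPLES_PER_CHUNK] = rng.random(
        (SAMPLES_PER_CHUNK, SAMPLE_DIM), dtype="float32")

# Read: each slice maps onto exactly one 128 MiB chunk, i.e. one
# large sequential read instead of thousands of small ones.
data = zarr.open("train_data.zarr", mode="r")
for start in range(0, N_SAMPLES, SAMPLES_PER_CHUNK):
    block = data[start:start + SAMPLES_PER_CHUNK]  # plain numpy array
```

The same chunking logic applies to HDF5 or per-step NPZ files: the point is that the on-disk chunk size and the read request size coincide.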
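
For the last recommendation, here is a minimal, framework-agnostic sketch of measuring raw dataloader throughput; the function name and parameters are hypothetical, and batches are assumed to be single array-like objects (numpy arrays or PyTorch tensors).

```python
import time

def measure_dataloader_throughput(loader, max_batches=100):
    """Drain `loader` with no model in the loop and report throughput."""
    n_samples = 0
    n_bytes = 0
    start = time.perf_counter()
    for i, batch in enumerate(loader):
        if i >= max_batches:
            break
        n_samples += len(batch)
        # numpy arrays and PyTorch tensors both expose .nbytes;
        # for other batch types this term is simply skipped.
        n_bytes += getattr(batch, "nbytes", 0)
    elapsed = time.perf_counter() - start
    print(f"{n_samples / elapsed:.1f} samples/s, "
          f"{n_bytes / elapsed / 1e6:.1f} MB/s")
```

Run this against your dataloader before adding the model: if the numbers here are already low, no amount of GPU-side optimization will help.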
Optimizing data access not only improves training speed but also reduces contention on shared storage resources, benefiting all users.

For detailed information on storage options and best practices for managing data on CSCS systems, refer to the [Storage Guide][ref-guides-storage].

## Running ML applications with containers (recommended)
