Short answer: No, not fundamentally. The timing mechanics and what gets measured are preserved.
The core measurement sequence in main.py _train() is completely unchanged:
stats.start_loading() # timer starts
for batch in loader.next(): # blocks until worker delivers batch
stats.batch_loaded() # "load time" = elapsed since start_loading
stats.start_compute()
framework.compute(batch, computation_time) # sleep() simulating GPU
stats.batch_processed() # "compute time" = elapsed since start_compute
comm.barrier() # allreduce simulation
stats.start_loading() # timer resets for next stepNone of this was touched. The sleep() function, model(), and compute() are also unchanged.
The concern about whether data can be fetched during the GPU sleep: yes, it still happens, through PyTorch's DataLoader prefetch mechanism. The actual configuration (read_threads: 4, multiprocessing_context: spawn) means 4 independent worker processes are always fetching ahead. While the computation_time: 0.323 s sleep is running in the main process, all 4 workers are fetching S3 objects for the next batch. This is not explicitly coded in the training loop — it's a fundamental property of PyTorch's DataLoader prefetch buffer.
This behavior is identical to how it worked with local filesystem reads. The readers are just a plug-in to the DataLoader worker; whether they call open("/local/file") or s3dlio.get_many(["s3://bucket/obj"]) doesn't change the timing instrumentation.
The script dlio_s3dlio_train.sh invokes workload=unet3d_h100_s3dlio which specifies:
framework: pytorch
reader:
data_loader: pytorch
read_threads: 4
multiprocessing_context: spawn
storage:
storage_type: s3
storage_library: s3dlioThis routes through TorchDataLoader → TorchDataset.__getitem__ → reader.read_index(image_idx, step).
The critical finding is: NPZReaderS3Iterable.next() is never called in this path. The PyTorch DataLoader calls read_index() per sample, not next(). So the bulk prefetch in next() is entirely inert in this actual use case.
NPZReaderS3Iterable.read_index():
filename, _ = self.global_index_map[image_idx]
if filename not in self._object_cache:
self._object_cache.update(self._prefetch([filename])) # single S3 GET
return super().read_index(image_idx, step)- First sample from a file: one
s3dlio.get_many([uri])call → array in worker's cache - All subsequent samples from same file: in-memory lookup, no I/O
- This happens inside the DataLoader worker, inside the prefetch mechanism — same timing semantics as a local file
open()
NPZReaderS3Iterable.next() fetches all objects for the epoch in parallel before yielding any batch. This would be a timing concern because:
- The first step's "load time" would include the entire epoch's S3 I/O (potentially minutes)
- Subsequent steps would show ≈ 0 load time (data is already in memory)
- Per-step statistics would be meaningless
However, this code path is only triggered by the TFDataLoader, which calls reader.next() through TensorflowDataset._generator(). Since data_loader: pytorch is in use, this code is never reached. It would be worth a code comment to warn future users, but it is not currently a problem.
The configs/dlio/workload/unet3d_h100_s3dlio.yaml file still contains the hardcoded endpoint and personal paths that were cleaned from tests/object-store/. Specifically:
endpoint_url: http://172.16.1.40:9000source /home/eval/Documents/Code/mlp-storage/.envin the comments
This was outside the scope of the previous cleanup pass and is a separate issue from timing correctness.
| Aspect | Status |
|---|---|
| Training loop timing structure | Unchanged ✅ |
compute() / GPU sleep simulation |
Unchanged ✅ |
| Prefetch during GPU sleep | Unchanged — still works via PyTorch DataLoader ✅ |
stats.batch_loaded / batch_processed markers |
Unchanged ✅ |
| S3 I/O mechanism | Replaces disk I/O inside DataLoader workers, invisible to timer ✅ |
next() bulk prefetch |
Only fires in TF data loader path — irrelevant to PyTorch config ✅ |
| Potential timing distortion | None in PyTorch mode ✅ |
The benchmark measures the same thing as before: the time the main training process spends waiting for the DataLoader to deliver a batch, with the same definitions of "load time" and "compute time" as in the original code.