Skip to content

Latest commit

 

History

History
107 lines (72 loc) · 4.88 KB

File metadata and controls

107 lines (72 loc) · 4.88 KB

Analysis: Does Object Storage Change the Timing Loop?

Short answer: No, not fundamentally. The timing mechanics and what gets measured are preserved.


1. The Training Loop Is Unmodified

The core measurement sequence in main.py _train() is completely unchanged:

stats.start_loading()          # timer starts
for batch in loader.next():    # blocks until worker delivers batch
    stats.batch_loaded()       # "load time" = elapsed since start_loading
    stats.start_compute()
    framework.compute(batch, computation_time)  # sleep() simulating GPU
    stats.batch_processed()    # "compute time" = elapsed since start_compute
    comm.barrier()             # allreduce simulation
    stats.start_loading()      # timer resets for next step

None of this was touched. The sleep() function, model(), and compute() are also unchanged.


2. "Fetch While Sleeping" Still Works — Unchanged

The concern about whether data can be fetched during the GPU sleep: yes, it still happens, through PyTorch's DataLoader prefetch mechanism. The actual configuration (read_threads: 4, multiprocessing_context: spawn) means 4 independent worker processes are always fetching ahead. While the computation_time: 0.323 s sleep is running in the main process, all 4 workers are fetching S3 objects for the next batch. This is not explicitly coded in the training loop — it's a fundamental property of PyTorch's DataLoader prefetch buffer.

This behavior is identical to how it worked with local filesystem reads. The readers are just a plug-in to the DataLoader worker; whether they call open("/local/file") or s3dlio.get_many(["s3://bucket/obj"]) doesn't change the timing instrumentation.


3. Which Code Path Is Actually Used (Your Config)

The script dlio_s3dlio_train.sh invokes workload=unet3d_h100_s3dlio which specifies:

framework: pytorch
reader:
  data_loader: pytorch
  read_threads: 4
  multiprocessing_context: spawn
storage:
  storage_type: s3
  storage_library: s3dlio

This routes through TorchDataLoaderTorchDataset.__getitem__reader.read_index(image_idx, step).

The critical finding is: NPZReaderS3Iterable.next() is never called in this path. The PyTorch DataLoader calls read_index() per sample, not next(). So the bulk prefetch in next() is entirely inert in this actual use case.


4. What the S3 Iterable Readers Actually Do in PyTorch Mode

NPZReaderS3Iterable.read_index():

filename, _ = self.global_index_map[image_idx]
if filename not in self._object_cache:
    self._object_cache.update(self._prefetch([filename]))  # single S3 GET
return super().read_index(image_idx, step)
  • First sample from a file: one s3dlio.get_many([uri]) call → array in worker's cache
  • All subsequent samples from same file: in-memory lookup, no I/O
  • This happens inside the DataLoader worker, inside the prefetch mechanism — same timing semantics as a local file open()

5. One Potential Concern: The next() Bulk Prefetch (TF Path Only)

NPZReaderS3Iterable.next() fetches all objects for the epoch in parallel before yielding any batch. This would be a timing concern because:

  • The first step's "load time" would include the entire epoch's S3 I/O (potentially minutes)
  • Subsequent steps would show ≈ 0 load time (data is already in memory)
  • Per-step statistics would be meaningless

However, this code path is only triggered by the TFDataLoader, which calls reader.next() through TensorflowDataset._generator(). Since data_loader: pytorch is in use, this code is never reached. It would be worth a code comment to warn future users, but it is not currently a problem.


6. One Actual Issue Found (Not Timing-Related)

The configs/dlio/workload/unet3d_h100_s3dlio.yaml file still contains the hardcoded endpoint and personal paths that were cleaned from tests/object-store/. Specifically:

  • endpoint_url: http://172.16.1.40:9000
  • source /home/eval/Documents/Code/mlp-storage/.env in the comments

This was outside the scope of the previous cleanup pass and is a separate issue from timing correctness.


Summary

Aspect Status
Training loop timing structure Unchanged ✅
compute() / GPU sleep simulation Unchanged ✅
Prefetch during GPU sleep Unchanged — still works via PyTorch DataLoader ✅
stats.batch_loaded / batch_processed markers Unchanged ✅
S3 I/O mechanism Replaces disk I/O inside DataLoader workers, invisible to timer ✅
next() bulk prefetch Only fires in TF data loader path — irrelevant to PyTorch config ✅
Potential timing distortion None in PyTorch mode ✅

The benchmark measures the same thing as before: the time the main training process spends waiting for the DataLoader to deliver a batch, with the same definitions of "load time" and "compute time" as in the original code.