DALI TFRecord pipeline sharding on multi-GPU environment #12792
Replies: 1 comment
-
Here is one working example, obtained by changing the pipeline a little bit. Comparing the DALI dataloader output with the PyTorch dataloader, it looks like an epoch in the DALI dataloader is different from an epoch in the PyTorch dataloader.
-
Hey guys, I have a self-contained, working example of PyTorch Lightning with a DALI TFRecord pipeline in a multi-GPU environment, and I have some questions regarding GPU sharding and training with PyTorch Lightning.
questions:
1. The output of make_dali_dataloader is a list of Dict[str, torch.Tensor], and the list length matches the number of shards in both the example 1 dataloader and the example 2 dataloader. Is that expected? And is each Dict[str, torch.Tensor] in the output list the data for one shard?
2. Should I use global_rank to retrieve the sharded data in the process_batch function? Which one of the following two process_batch methods should I use, and if both are wrong, can you provide an example? (See the sketch after this list.)
3. When the number of shards in make_dali_dataloader matches the number of GPU devices (the 1st make_dali_dataloader), the total number of training examples comes to about 1 epoch. But when the number of shards in make_dali_dataloader does not match the GPU devices, the total can be more than 1 epoch: in my case 1 epoch should be 1k examples, but the 2nd make_dali_dataloader returns a total of 2.8k across 8 GPUs. How is one epoch defined in DDP training?
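To make question 2 concrete: the iterator hands training_step a list of Dict[str, torch.Tensor] (one entry per pipeline), and the two process_batch variants I am weighing are roughly along the following lines. This is a simplified sketch, not the exact methods from my script, and the "image"/"label" keys are placeholders.

```python
# Simplified sketch of the two process_batch variants in question 2
# (placeholders, not the exact methods from the script).
from typing import Dict, List

import torch

Batch = List[Dict[str, torch.Tensor]]  # what DALIGenericIterator yields per step

# Variant A: index the list with this process's global rank
# (assumes the dataloader built one pipeline per rank in every process).
def process_batch_by_rank(batch: Batch, global_rank: int):
    data = batch[global_rank]
    return data["image"], data["label"]

# Variant B: always take the first entry
# (assumes each rank built only its own single-shard pipeline, so len(batch) == 1).
def process_batch_first(batch: Batch):
    data = batch[0]
    return data["image"], data["label"]

# Inside a LightningModule.training_step this would be called as, e.g.:
#     images, labels = process_batch_by_rank(batch, self.trainer.global_rank)
```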
Major dependency versions (running on Ubuntu 18.04):

Self-contained script to reproduce:
Thank you very much for reading my questions!