-
Dear all,

Thanks a lot for providing us with uproot4. I really enjoy using it. I was wondering if it's possible to read events with a GPU. Currently, I'm reading 0.5 million events like

`up4_events = uproot.open("./isoMuon/P4f_zzorww_isoMu.root:zzorww;1", num_workers=8)`

and it takes about 220 ms. I also have an Nvidia P100 in my system and I'm really curious how much I can reduce this 220 ms reading time.

Cheers,
-
There are a number of things to think about; the first is that the data have to get from disk to the GPU somehow. They will usually get there via RAM, though supercomputers can go directly from network to GPUs (and they have internal networks that are faster than disk access; it's usually the other way around if the data have to go through the internet).

Here's a good summary of the storage latency hierarchy: https://blog.codinghorror.com/the-infinite-space-between-words/

Whether you go from disk to RAM to GPU or directly from disk to GPU (such a thing would require special hardware), the bottleneck is going to be the disk (unless your operating system has already cached those parts of the disk in RAM for you; be aware of that in tests, I use vmtouch to control for it).

A first suggestion if you want to do some computations on the GPU would be to read the data with Uproot as usual and then copy them from NumPy arrays (RAM) to CuPy arrays (GPU global memory).

The data for a TBranch of a ROOT file are split into chunks called TBaskets. Combining these chunks into a single, contiguous array requires an extra data-copy, and you could save some time by making that data-copy also be the copy from RAM to GPU. If your TBranch has an uproot.AsDtype interpretation, you can provide a pre-allocated CuPy array via uproot.AsDtype.inplace; if you pass the resulting AsDtypeInPlace as a new interpretation to uproot.TBranch.array, Uproot will fill that array instead of creating a new one. If the total time is dominated by the disk read, this won't be a noticeable improvement, but it's a possible help if all the other stars align.

On top of that, your data are probably compressed, and the decompression stage has to happen in RAM. There aren't many decompression algorithms written for GPUs, and I suspect it's because compression/decompression requires sequential access to deal with variable-length structures, which doesn't fit the GPU's model of parallelism very well. If your data are compressed with the slowest algorithm, LZMA, then that can even dominate over disk-reading.

If somebody someday manages to write LZMA decompressors on GPUs, that would be a breakthrough. It's hard, though. I just did another search for such things, and the closest I found were "LZMA-like compression ratios," meaning not the specific LZMA format, but another one. That doesn't help if your data are in a format whose only decompressors are CPU-bound. It still looks like decompression is CPU-bound for the foreseeable future.

On top of that, if your data are not simple types but have to be interpreted with uproot.AsObjects, then it's even worse: they're being interpreted with Python code. There's a project in development to replace that Python code with considerably faster, though still CPU-bound, Forth code (https://arxiv.org/abs/2102.13516). We should see factors of 100× speedup for those cases, though the algorithms are still (necessarily!) very GPU-unfriendly. To be clear, this does not apply to AsDtype, AsJagged, AsStridedObjects, etc.

Once the data are on GPUs, you'll need tools to perform calculations on them there. CuPy has a good set of tools (which can be used with Numba), but only if your data are non-jagged uproot.AsDtype. Awkward Array backends for GPUs are in development, but not ready yet. If your data are strictly tabular, then that's not an issue for you.
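For concreteness, here's a minimal sketch of the "read with Uproot as usual, then copy to CuPy" route. It assumes a flat, AsDtype-interpreted branch, and the branch name `muon_pt` is a placeholder, not something from your file:

```python
import uproot
import cupy as cp   # GPU arrays; needs a CUDA-capable device (your P100 qualifies)

with uproot.open("./isoMuon/P4f_zzorww_isoMu.root") as file:
    tree = file["zzorww"]

    # Uproot reads and decompresses on the CPU and returns a NumPy array in RAM.
    pt_cpu = tree["muon_pt"].array(library="np")   # "muon_pt" is a placeholder branch name

# Explicit host-to-device copy into GPU global memory.
pt_gpu = cp.asarray(pt_cpu)

# From here on, the arithmetic runs on the GPU.
mean_pt = float(cp.mean(pt_gpu))
```

The `cp.asarray` call is the host-to-device copy; everything before it (disk read, decompression, TBasket concatenation) is still CPU work, which is why the paragraphs above matter for the total time.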
So, to summarize: uproot.AsDtype.inplace with a CuPy array is one thing you can do, and it can make a difference if your data are not compressed and they are in your operating system's cache or on a very fast disk (e.g. NVMe). But if you have a more typical situation, the disk and/or decompression are bottlenecks that the data have to pass through before they can even get to the GPU.
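A rough sketch of that AsDtype.inplace route, for reference. The exact form of `.inplace` below is my assumption, so check the uproot documentation, and the branch name is again a placeholder:

```python
import uproot
import cupy as cp

with uproot.open("./isoMuon/P4f_zzorww_isoMu.root") as file:
    tree = file["zzorww"]
    branch = tree["muon_pt"]   # placeholder name; must be a flat, AsDtype-interpreted branch

    # Pre-allocate the destination on the GPU, one element per entry.
    out = cp.empty(tree.num_entries, dtype=branch.interpretation.to_dtype)

    # Assumption: AsDtype.inplace(out) returns an AsDtypeInPlace that fills `out`
    # instead of allocating a new array; pass it as the interpretation.
    branch.array(interpretation=branch.interpretation.inplace(out))
```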
-
Hi @jpivarski, thank you very much for your detailed and comprehensive explanations! Indeed, as a HEP fellow, I have

Cheers,