-
Dear all,

Thanks a lot for providing us with uproot4. I really enjoy using it. I was wondering if it's possible to read events with a GPU. Currently, I'm reading 0.5 million events like

`up4_events = uproot.open("./isoMuon/P4f_zzorww_isoMu.root:zzorww;1", num_workers=8)`

and it takes about 220 ms. I also have an Nvidia P100 in my system and I'm really curious how much I can reduce this 220 ms reading time.

Cheers,
-
There are a number of things to think about; the first is that the data have to get from disk to the GPU somehow. They will usually get there via RAM, though supercomputers can go directly from network to GPUs (and they have internal networks that are faster than disk access; it's usually the other way around if the data have to go through the internet).

Here's a good summary of the storage latency hierarchy: https://blog.codinghorror.com/the-infinite-space-between-words/

Whether you go from disk to RAM to GPU or directly from disk to GPU (such a thing would require special hardware), the bottleneck is going to be the disk (unless your operating system has already cached those parts of the disk in RAM for you; be aware of that in tests, I use vmtouch to control for it).

A first suggestion if you want to do some computations on the GPU would be to read the data with Uproot as usual and then copy them from NumPy arrays (RAM) to CuPy arrays (GPU global memory).

The data for a TBranch of a ROOT file are split into chunks called TBaskets. Combining these chunks into a single, contiguous array requires an extra data-copy, and you could save some time by making that data-copy also be the copy from RAM to GPU. If your TBranch has an uproot.AsDtype interpretation, you can provide a pre-allocated CuPy array via uproot.AsDtype.inplace; if you pass the resulting AsDtypeInPlace as a new interpretation to uproot.TBranch.array, Uproot will fill that array instead of creating a new one. If the total time is dominated by the disk read, this won't be a noticeable improvement, but it's a possible help if all the other stars align.

On top of that, your data are probably compressed, and the decompression stage has to happen in RAM. There aren't many decompression algorithms written for GPUs, and I suspect it's because compression/decompression requires sequential access to deal with variable-length structures, which doesn't fit the GPU's model of parallelism very well. If your data are compressed with the slowest algorithm, LZMA, then that can even dominate over disk-reading.

If somebody someday manages to write LZMA decompressors on GPUs, that would be a breakthrough. It's hard, though. I just did another search for such things, and the closest I found were "LZMA-like compression ratios," meaning not the specific LZMA format, but another one. That doesn't help if your data are in a format whose only decompressors are CPU-bound. It still looks like decompression is CPU-bound for the foreseeable future.

On top of that, if your data are not simple types but have to be interpreted with uproot.AsObjects, then it's even worse: they're being interpreted with Python code. There's a project in development to replace that Python code with considerably faster, though still CPU-bound, Forth code (https://arxiv.org/abs/2102.13516). We should see factors of 100× speedup for those cases, though the algorithms are still (necessarily!) very GPU-unfriendly. To be clear, this does not apply to AsDtype, AsJagged, AsStridedObjects, etc.

Once the data are on GPUs, you'll need tools to perform calculations on them there. CuPy has a good set of tools (which can be used with Numba), but only if your data are non-jagged uproot.AsDtype. Awkward Array backends for GPUs are in development, but not ready yet. If your data are strictly tabular, then that's not an issue for you.
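For concreteness, here's a minimal sketch of the "read with Uproot as usual, then copy to CuPy" route. It assumes a flat, AsDtype-interpreted branch, and the branch name `muon_pt` is a placeholder, not something from your file:

```python
import uproot
import cupy as cp   # GPU arrays; needs a CUDA-capable device (your P100 qualifies)

with uproot.open("./isoMuon/P4f_zzorww_isoMu.root") as file:
    tree = file["zzorww"]

    # Uproot reads and decompresses on the CPU and returns a NumPy array in RAM.
    pt_cpu = tree["muon_pt"].array(library="np")   # "muon_pt" is a placeholder branch name

# Explicit host-to-device copy into GPU global memory.
pt_gpu = cp.asarray(pt_cpu)

# From here on, the arithmetic runs on the GPU.
mean_pt = float(cp.mean(pt_gpu))
```

The `cp.asarray` call is the host-to-device copy; everything before it (disk read, decompression, TBasket concatenation) is still CPU work, which is why the paragraphs above matter for the total time.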
So, to summarize: uproot.AsDtype.inplace with a CuPy array is one thing you can do, and it can make a difference if your data are not compressed and they are in your operating system's cache or on a very fast disk (e.g. NVMe). But if you have a more typical situation, the disk and/or decompression are bottlenecks that the data have to pass through before they can even get to the GPU.
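A rough sketch of that AsDtype.inplace route, for reference. The exact form of `.inplace` below is my assumption, so check the uproot documentation, and the branch name is again a placeholder:

```python
import uproot
import cupy as cp

with uproot.open("./isoMuon/P4f_zzorww_isoMu.root") as file:
    tree = file["zzorww"]
    branch = tree["muon_pt"]   # placeholder name; must be a flat, AsDtype-interpreted branch

    # Pre-allocate the destination on the GPU, one element per entry.
    out = cp.empty(tree.num_entries, dtype=branch.interpretation.to_dtype)

    # Assumption: AsDtype.inplace(out) returns an AsDtypeInPlace that fills `out`
    # instead of allocating a new array; pass it as the interpretation.
    branch.array(interpretation=branch.interpretation.inplace(out))
```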
-
Hi @jpivarski, thank you very much for your detailed and comprehensive explanations! Indeed, as a HEP fellow, I have

Cheers,