-
I do not really do interactive work with large datasets; to me, interactivity is mostly important for exploring high-level data that has already been condensed and fits within a single computer's resource limits. I'll answer in a more generic way: it sounds like you could benefit from a workflow system that supports HPC environments and offers a few other crucial features like provenance, reproducibility and shareability. Such a workflow system is usually combined with containerisation, since you need to be able to distribute the software, including all its dependencies, in a unified way, even to heterogeneous HPC systems. I highly recommend two frameworks: nextflow if you want an all-in-one package (supports SLURM, PBS, SGE, etc.) and the Common Workflow Language if you want to do something more modular. The former offers workflow descriptions and a job runner/manager in one single command-line application (Java 8+); the latter is just a standard for describing tools and workflows in, e.g., YAML files, and the runner is then your choice (e.g. Toil), as long as it implements one of the CWL standards. Again, this is not about interactive play with terabytes of data spread over multiple nodes, but about more sophisticated workflow management, which in my opinion is essential in science for several good reasons.
-
Awkward Arrays need to be able to work better with Dask. That's a to-do item. I have tried to just interpret lazy Awkward Arrays as Dask delayed arrays, but you quickly run into Dask's assumptions about what an array has to provide. On the other hand, you can also use Dask by decorating functions, and those functions can contain any operations you want to compute. This might be the best option for using Dask at the moment. In particular, this is how Coffea uses Dask with Awkward Arrays: Coffea has a "Processor" abstraction on top of Dask that does the map (work with arrays) + reduce (combine final histograms) process. That's how a lot of physicists scale up their work.

I should point out that uproot.lazy allows you to treat a collection of ROOT files as a single lazy array; you don't need to combine them for that. However, this function does need to open all the files immediately, to determine the lengths of all the TTrees, in order to index them into a single abstract array. If you have enough files, just opening them all is prohibitive, even without reading their contents. Coffea has a "NanoEvents" framework that goes a step further in laziness, allowing you to address a collection of files without opening them all. It has to find out the number of events in each file first, though. (I don't remember how it does that.)

I've been learning more about the Parquet file format recently, and it has a nice "dataset" abstraction that pulls the metadata out of each file in a collection and puts it all in a single metadata-only file. That way, you don't have to open the individual files to find out such things as how many entries each one has. I've added the ability to read and write these from Awkward Array (scikit-hep/awkward#706 and scikit-hep/awkward#368 (comment)). If it's interesting, I'll write some documentation for Parquet datasets.
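To make the "decorate functions" idea above concrete, here is a minimal sketch of the map/reduce pattern (not Coffea itself); the file paths, the TTree name "Events", and the branch name "pt" are placeholders:

```python
import awkward as ak
import dask
import numpy as np
import uproot

@dask.delayed
def process_one(path):
    # map step: open one file, read one branch, return a partial histogram
    events = uproot.open(path)["Events"].arrays(["pt"])
    pt = ak.to_numpy(ak.flatten(events["pt"], axis=None))
    counts, _ = np.histogram(pt, bins=50, range=(0, 200))
    return counts

files = ["sample1.root", "sample2.root"]          # placeholder paths
partial = [process_one(path) for path in files]   # nothing is read yet
total = dask.delayed(sum)(partial)                # reduce step: add the partial histograms
counts = total.compute()                          # runs locally or on a Dask cluster
```

Coffea's Processor wraps essentially this pattern, with chunking within files and accumulator types handled for you, so starting from Coffea is probably easier than starting from raw dask.delayed.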
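For the uproot.lazy point, the call itself looks like this (again with placeholder file, TTree, and branch names):

```python
import uproot

# one lazy array over many files; every file is opened up front to get the TTree lengths
events = uproot.lazy(["sample1.root:Events", "sample2.root:Events"])

print(len(events))        # total number of entries across all files
print(events["pt"][:10])  # reads only the baskets needed for these entries
```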
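And for the Parquet "dataset" idea, one way to build the metadata-only file today is through pyarrow directly; this is a sketch under the assumption that you can produce your data as a sequence of Awkward Arrays with a common schema (the awkward functions from #706 may be more convenient once documented):

```python
import awkward as ak
import pyarrow.parquet as pq

# write each chunk as its own Parquet file under dataset/, collecting per-file metadata
# (chunks_of_awkward_arrays is a placeholder iterable; all chunks must share one schema)
metadata_collector = []
for array in chunks_of_awkward_arrays:
    table = ak.to_arrow_table(array)
    pq.write_to_dataset(table, root_path="dataset", metadata_collector=metadata_collector)

# write a single _metadata file recording the schema and row counts of every part
pq.write_metadata(table.schema, "dataset/_metadata", metadata_collector=metadata_collector)

# later: entry counts are available without touching the individual part files
print(pq.read_metadata("dataset/_metadata").num_rows)
```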
-
I'm curious to see what the "pros" do for handling large datasets. In my case, I have ~1k datasets, and each dataset is ~20k ROOT files. Each dataset is 4-5 GB on disk. In other words, there's 4-5 TB distributed across ~20 million files.
My current strategy is not very sophisticated, though I guess it works: I work on an HPC cluster, and I generally have a SLURM job run over each dataset and do something, then try to pull it all together at the end into something manageable.
That said, in other areas it seems that tools like Dask allow for decently interactive handling of 1 TB+ inside Jupyter notebooks, so in principle it might be possible to interactively explore the whole set relatively quickly. But it doesn't seem like Dask is well suited to physics data / Awkward Arrays.
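If it were, I imagine the notebook side would look roughly like this (a rough sketch using dask_jobqueue, which I haven't actually tried on this data; the resource numbers are placeholders):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# ask SLURM for workers from inside the notebook (placeholder resources)
cluster = SLURMCluster(cores=4, memory="8GB", walltime="01:00:00")
cluster.scale(jobs=20)
client = Client(cluster)

# any dask.delayed / Dask collection work submitted from here runs on those workers
```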
So how might one interactively explore this much data in a Jupyter notebook?
Is it possible or feasible?
Should I first try to concatenate everything into one massive ROOT file, and then use either uproot.iterate or uproot.lazy to work with that file? What is the best way to utilize the HPC environment through uproot/awkward?
What other tips/tricks should I be aware of?
I'd love to get some insights on how others would handle this situation. And if others find it valuable, too, maybe it could make it into the docs.
Thanks for any input.