-
I do not really do interactive work with large datasets; to me, interactivity is mostly important for exploring high-level data that has already been condensed and fits within a single computer's resource limits. I'll answer in a more generic way: it sounds like you could benefit from a workflow system that supports HPC environments and offers a few other crucial features like provenance, reproducibility and shareability. Such a workflow system is usually combined with containerisation, since you need to be able to distribute the software, including all its dependencies, in a unified way, even to heterogeneous HPC systems. I highly recommend two frameworks: nextflow if you want an all-in-one package (supports SLURM, PBS, SGE, etc.) and the Common Workflow Language if you want to do something more modular. The former offers workflow descriptions and a job runner/manager in one single command-line application (Java 8+); the latter is just a standard for describing tools and workflows in, e.g., YAML files, and the runner is then your choice (e.g. Toil), as long as it implements one of the CWL standards. Again, this is not about interactive play with terabytes of data spread over multiple nodes, but about more sophisticated workflow management, which in my opinion is essential in science for several good reasons.
-
Awkward Arrays need to be able to work better with Dask. That's a to-do item. I have tried to just interpret lazy Awkward Arrays as Dask delayed arrays, but you quickly run into Dask's assumptions about what an array has to provide. On the other hand, you can also use Dask by decorating functions, and those functions can contain any operations you want to compute. This might be the best option for using Dask at the moment. In particular, this is how Coffea uses Dask with Awkward Arrays: Coffea has a "Processor" abstraction on top of Dask that does the map (work with arrays) + reduce (combine final histograms) process. That's how a lot of physicists scale up their work.

I should point out that uproot.lazy allows you to treat a collection of ROOT files as a single lazy array; you don't need to combine them for that. However, this function does need to open all the files immediately, to determine the lengths of all the TTrees, in order to index them into a single abstract array. If you have enough files, just opening them all is prohibitive, even without reading their contents. Coffea has a "NanoEvents" framework that goes a step further in laziness, allowing you to address a collection of files without opening them all. It has to find out the number of events in each file first, though. (I don't remember how it does that.)

I've been learning more about the Parquet file format recently, and it has a nice "dataset" abstraction that pulls the metadata out of each file in a collection and puts it all in a single metadata-only file. That way, you don't have to open the individual files to find out such things as how many entries each one has. I've added the ability to read and write these from Awkward Array (scikit-hep/awkward#706 and scikit-hep/awkward#368 (comment)). If it's interesting, I'll write some documentation for Parquet datasets.
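To make the "decorate functions" idea above concrete, here is a minimal sketch of the map/reduce pattern (not Coffea itself); the file paths, the TTree name "Events", and the branch name "pt" are placeholders:

```python
import awkward as ak
import dask
import numpy as np
import uproot

@dask.delayed
def process_one(path):
    # map step: open one file, read one branch, return a partial histogram
    events = uproot.open(path)["Events"].arrays(["pt"])
    pt = ak.to_numpy(ak.flatten(events["pt"], axis=None))
    counts, _ = np.histogram(pt, bins=50, range=(0, 200))
    return counts

files = ["sample1.root", "sample2.root"]          # placeholder paths
partial = [process_one(path) for path in files]   # nothing is read yet
total = dask.delayed(sum)(partial)                # reduce step: add the partial histograms
counts = total.compute()                          # runs locally or on a Dask cluster
```

Coffea's Processor wraps essentially this pattern, with chunking within files and accumulator types handled for you, so starting from Coffea is probably easier than starting from raw dask.delayed.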
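For the uproot.lazy point, the call itself looks like this (again with placeholder file, TTree, and branch names):

```python
import uproot

# one lazy array over many files; every file is opened up front to get the TTree lengths
events = uproot.lazy(["sample1.root:Events", "sample2.root:Events"])

print(len(events))        # total number of entries across all files
print(events["pt"][:10])  # reads only the baskets needed for these entries
```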
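And for the Parquet "dataset" idea, one way to build the metadata-only file today is through pyarrow directly; this is a sketch under the assumption that you can produce your data as a sequence of Awkward Arrays with a common schema (the awkward functions from #706 may be more convenient once documented):

```python
import awkward as ak
import pyarrow.parquet as pq

# write each chunk as its own Parquet file under dataset/, collecting per-file metadata
# (chunks_of_awkward_arrays is a placeholder iterable; all chunks must share one schema)
metadata_collector = []
for array in chunks_of_awkward_arrays:
    table = ak.to_arrow_table(array)
    pq.write_to_dataset(table, root_path="dataset", metadata_collector=metadata_collector)

# write a single _metadata file recording the schema and row counts of every part
pq.write_metadata(table.schema, "dataset/_metadata", metadata_collector=metadata_collector)

# later: entry counts are available without touching the individual part files
print(pq.read_metadata("dataset/_metadata").num_rows)
```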
-
I'm curious to see what the "pros" do for handling large datasets. In my case, I have ~1k datasets, and each dataset is ~20k ROOT files. Each dataset is 4-5 GB on disk. In other words, there's 4-5 TB distributed across ~20 million files.
My current strategy is not very sophisticated, though I guess it works: I work on an HPC cluster, and I generally have a SLURM job run over each dataset and do something, then try to pull it all together at the end into something manageable.
That said, in other areas it seems that tools like Dask allow for decently interactive handling of 1 TB+ inside Jupyter notebooks, so in principle it might be possible to interactively explore the whole set relatively quickly. But it doesn't seem like Dask is well suited to physics data / Awkward Arrays.
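If it were, I imagine the notebook side would look roughly like this (a rough sketch using dask_jobqueue, which I haven't actually tried on this data; the resource numbers are placeholders):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# ask SLURM for workers from inside the notebook (placeholder resources)
cluster = SLURMCluster(cores=4, memory="8GB", walltime="01:00:00")
cluster.scale(jobs=20)
client = Client(cluster)

# any dask.delayed / Dask collection work submitted from here runs on those workers
```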
So how might one interactively explore this much data in a Jupyter notebook?
Is it possible or feasible?
Should I first try to concatenate everything into one massive ROOT file, and then use either uproot.iterate or uproot.lazy to work with that file? What is the best way to utilize the HPC environment through uproot/awkward?
What other tips/tricks should I be aware of?
I'd love to get some insights on how others would handle this situation. And if others find it valuable, too, maybe it could make it into the docs.
Thanks for any input.