Is there a way to use dask to help reduce memory usage? #2341
Replies: 2 comments 2 replies
-
Rewriting in numba will likely give you the benefit of fewer allocations; Awkward Array natively requires a new array to be allocated for each operation (we do have non-public support for numexpr, but this can only help for ufunc operations at the moment). If you are just worried about total memory usage, you can delete the intermediate arrays once you're finished with them. This should lead to a reduction in total usage. Note that tracking and reporting memory usage can be a bit complex, so don't be surprised if you don't necessarily see what you're expecting. Dask can help here because it can do the lifetime tracking itself (I.e delete arrays once they're no longer needed). It also can optimise the case where you are reading more columns from your data source than are actually needed, although this can also be done by hand. |
Beta Was this translation helpful? Give feedback.
-
This is a good question. The fact that you're doing 3-way combinations of collections with "jet" in their name (there are often a large number of jets per event) is suggestive that this is going to use a lot of memory. "$n$ choose 3" can be a big number when The simplest thing you can do is ignore the memory use and have Dask break it into smaller chunks. That is, run Further down in this response, I noticed that pfcands[start:stop] as an argument to this function, you'd have to subtract fatjetscands.pFCandsIdx - start in the slice of If you want to go memory-hunting, the easiest way to find the biggest problem spot is to get It might help to know that you can run >>> jets = ak.Array([
... [
... {"pt": 1, "eta": 2, "phi": 3},
... {"pt": 1, "eta": 2, "phi": 3},
... {"pt": 1, "eta": 2, "phi": 3},
... {"pt": 1, "eta": 2, "phi": 3},
... ],
... [
... {"pt": 1, "eta": 2, "phi": 3},
... ],
... [
... {"pt": 1, "eta": 2, "phi": 3},
... {"pt": 1, "eta": 2, "phi": 3},
... {"pt": 1, "eta": 2, "phi": 3},
... {"pt": 1, "eta": 2, "phi": 3},
... {"pt": 1, "eta": 2, "phi": 3},
... ]
... ])
>>> combo_jets = ak.combinations(jets, 3)
>>> combo_jets.show(type=True)
type: 3 * var * (
{
pt: int64,
eta: int64,
phi: int64
},
{
pt: int64,
eta: int64,
phi: int64
},
{
pt: int64,
eta: int64,
phi: int64
}
)
[[({pt: 1, eta: 2, phi: 3}, {pt: 1, ...}, {...}), ..., ({...}, {...}, ...)],
[],
[({pt: 1, eta: 2, phi: 3}, {pt: 1, ...}, {...}), ..., ({...}, {...}, ...)]] instead of doing each jagged numeric field separately: >>> ak.combinations(jets.pt, 3).show(type=True)
type: 3 * var * (
int64,
int64,
int64
)
[[(1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)],
[],
[(1, 1, 1), (1, 1, 1), (1, 1, 1), (...), ..., (1, 1, 1), (1, 1, 1), (1, 1, 1)]]
>>> ak.combinations(jets.eta, 3).show(type=True)
type: 3 * var * (
int64,
int64,
int64
)
[[(2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)],
[],
[(2, 2, 2), (2, 2, 2), (2, 2, 2), (...), ..., (2, 2, 2), (2, 2, 2), (2, 2, 2)]]
>>> ak.combinations(jets.phi, 3).show(type=True)
type: 3 * var * (
int64,
int64,
int64
)
[[(3, 3, 3), (3, 3, 3), (3, 3, 3), (3, 3, 3)],
[],
[(3, 3, 3), (3, 3, 3), (3, 3, 3), (...), ..., (3, 3, 3), (3, 3, 3), (3, 3, 3)]] Maybe you did each field separately because you thought it would use less memory. If so, it's worth knowing that the opposite is likely to be true, for two reasons: (a) all the fields in a record share the same You can see this implementation detail by looking at <ListOffsetArray len='3'>
<offsets><Index dtype='int64' len='4'>[ 0 4 4 14]</Index></offsets>
<content><RecordArray is_tuple='true' len='14'>
<content index='0'>
<IndexedArray len='14'>
<index><Index dtype='int64' len='14'>
[0 0 0 1 5 5 5 5 5 5 6 6 6 7]
</Index></index>
<content><RecordArray is_tuple='false' len='10'>
<content index='0' field='pt'>
<NumpyArray dtype='int64' len='10'>[1 1 1 1 1 1 1 1 1 1]</NumpyArray>
</content>
<content index='1' field='eta'>
<NumpyArray dtype='int64' len='10'>[2 2 2 2 2 2 2 2 2 2]</NumpyArray>
</content>
<content index='2' field='phi'>
<NumpyArray dtype='int64' len='10'>[3 3 3 3 3 3 3 3 3 3]</NumpyArray>
</content>
</RecordArray></content>
</IndexedArray>
</content>
<content index='1'>
<IndexedArray len='14'>
<index><Index dtype='int64' len='14'>
[1 1 2 2 6 6 6 7 7 8 7 7 8 8]
</Index></index>
<content><RecordArray is_tuple='false' len='10'>
<content index='0' field='pt'>
<NumpyArray dtype='int64' len='10'>[1 1 1 1 1 1 1 1 1 1]</NumpyArray>
</content>
<content index='1' field='eta'>
<NumpyArray dtype='int64' len='10'>[2 2 2 2 2 2 2 2 2 2]</NumpyArray>
</content>
<content index='2' field='phi'>
<NumpyArray dtype='int64' len='10'>[3 3 3 3 3 3 3 3 3 3]</NumpyArray>
</content>
</RecordArray></content>
</IndexedArray>
</content>
<content index='2'>
<IndexedArray len='14'>
<index><Index dtype='int64' len='14'>
[2 3 3 3 7 8 9 8 9 9 8 9 9 9]
</Index></index>
<content><RecordArray is_tuple='false' len='10'>
<content index='0' field='pt'>
<NumpyArray dtype='int64' len='10'>[1 1 1 1 1 1 1 1 1 1]</NumpyArray>
</content>
<content index='1' field='eta'>
<NumpyArray dtype='int64' len='10'>[2 2 2 2 2 2 2 2 2 2]</NumpyArray>
</content>
<content index='2' field='phi'>
<NumpyArray dtype='int64' len='10'>[3 3 3 3 3 3 3 3 3 3]</NumpyArray>
</content>
</RecordArray></content>
</IndexedArray>
</content>
</RecordArray></content>
</ListOffsetArray> The outer RecordArray is the 3-tuple of combinations fields >>> combo_jets.layout.content.content("0").index
<Index dtype='int64' len='14'>
[0 0 0 1 5 5 5 5 5 5 6 6 6 7]
</Index>
>>> combo_jets.layout.content.content("1").index
<Index dtype='int64' len='14'>
[1 1 2 2 6 6 6 7 7 8 7 7 8 8]
</Index>
>>> combo_jets.layout.content.content("2").index
<Index dtype='int64' len='14'>
[2 3 3 3 7 8 9 8 9 9 8 9 9 9]
</Index> and inside of that is the input RecordArray (not a copy). >>> record_in_0 = combo_jets.layout.content.content("0").content
>>> record_in_1 = combo_jets.layout.content.content("1").content
>>> record_in_2 = combo_jets.layout.content.content("2").content
>>> record_in_0 is record_in_1
True
>>> record_in_0 is record_in_2
True
>>> record_in_1 is record_in_2
True In other words, this was all designed to minimize memory usage if you naively call Calling >>> combo_numeric = ak.combinations(jets.eta, 3)
>>> combo_numeric.layout
<ListOffsetArray len='3'>
<offsets><Index dtype='int64' len='4'>[ 0 4 4 14]</Index></offsets>
<content><RecordArray is_tuple='true' len='14'>
<content index='0'>
<NumpyArray dtype='int64' len='14'>[2 2 2 2 2 2 2 2 2 2 2 2 2 2]</NumpyArray>
</content>
<content index='1'>
<NumpyArray dtype='int64' len='14'>[2 2 2 2 2 2 2 2 2 2 2 2 2 2]</NumpyArray>
</content>
<content index='2'>
<NumpyArray dtype='int64' len='14'>[2 2 2 2 2 2 2 2 2 2 2 2 2 2]</NumpyArray>
</content>
</RecordArray></content>
</ListOffsetArray>
>>> combo_numeric.layout.content.content("0").data
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
>>> nparray_in_0 = combo_numeric.layout.content.content("0").data
>>> nparray_in_1 = combo_numeric.layout.content.content("1").data
>>> nparray_in_2 = combo_numeric.layout.content.content("2").data
>>> np.shares_memory(nparray_in_0, nparray_in_1)
False
>>> np.shares_memory(nparray_in_0, nparray_in_2)
False
>>> np.shares_memory(nparray_in_1, nparray_in_2)
False It is a trade-off: the IndexedArray saves us from rearranging/duplicating its You are interested in three fields, the You do a lot of flattening and unflattening, and I don't know why. You have to flatten and unflatten if an in-between step is Since the ListOffsetArray has an Oh, wait—I think I get it—the In that case, I guess you have to flatten it, but if you go a low-level route, manipulating ListOffsetArray directly instead of using the high-level And then there's also Numba. Someone once asked me a similar question, but they were doing 5-way combinations, and the intermediate arrays with |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Hello!
I have created a function that uses awkward arrays to calculate an energy correlation function:
but because of the size of some of these arrays I run into memory problems pretty quickly. I assumed my only option was to rewrite in numba, but it was suggested to me that I might be able to use dask to keep my code array-oriented rather than loopy. Is there currently an easy way to incorporate dask into awkward centric code, like mine above?
Any advice is greatly appreciated!
Beta Was this translation helpful? Give feedback.
All reactions