Best way to create a new deeply nested array with shape and depth matching an existing array? #3101

raymondEhlers · 2024-05-05T08:24:19Z

raymondEhlers
May 5, 2024

To start, a bit of context to avoid the X-Y problem: I have a deeply nested input array which I need to use for a calculation in numba. The final output of this calculation should be of the same shape + depth as the input array. I do the numba calculation and return the output in a flattened np array. I'm now left with converting that output into the same shape + depth as the input array.

However, unflattening is where I run into trouble. For one level of depth, unflatten + ak.num works very nicely. However, as I started adding levels of depth, it appears that I need to make one call to unflatten per level of depth (not sure if this is best, but it's how I've managed to make it work). As I add each level of depth, it becomes more and more difficult to reason about and find the right arguments to ak.num to calculate the values that ak.unflatten is looking for.

As a concrete example:

In [115]: input_array
Out[115]: <Array [[[[0, 1, -4], [...], ..., [2]]], ...] type='2 * var * var * var * i...'>

In [116]: input_array.to_list()
Out[116]:
[[[[0, 1, -4], [2, 3], [1, -4], [0], [1], [-4], [3], [2]]],
 [[[0, 1, -4], [2, 3], [1, -4], [0], [1], [-4], [3], [2]]]]

In [117]: input_array.type.show()
2 * var * var * var * int64

In [118]: output_array
Out[118]:
array([0, 1, 2, 3, 4, 1, 2, 0, 1, 2, 4, 3, 0, 1, 2, 3, 4, 1, 2, 0, 1, 2,
       4, 3])

In [119]: output_array_same_shape = ??

I've tried every combination of unflatten, num, and sum that I can think of but I'm unable to find the right way to get output_array into the same depth + shape as input_array. The closest I had towards matching the docs was:

In [1] output_array_2 = ak.unflatten(output_array, ak.sum(ak.sum(ak.num(input_array, axis=-1), axis=1), axis=1))

In [2] output_array_3 = ak.unflatten(output_array_2, ak.flatten(ak.sum(ak.num(input_array, axis=-1), axis=1), axis=1))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[97], line 1
----> 1 res3 = ak.unflatten(res2, ak.flatten(ak.sum(ak.num(constituents_user_index, axis=-1), axis=1), axis=1))

File ~/software/dev/mammoth/.venv-3.11/lib/python3.11/site-packages/awkward/_dispatch.py:62, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
     60 # Failed to find a custom overload, so resume the original function
     61 try:
---> 62     next(gen_or_result)
     63 except StopIteration as err:
     64     return err.value

File ~/software/dev/mammoth/.venv-3.11/lib/python3.11/site-packages/awkward/operations/ak_unflatten.py:90, in unflatten(array, counts, axis, highlevel, behavior, attrs)
     87 yield (array,)
     89 # Implementation
---> 90 return _impl(array, counts, axis, highlevel, behavior, attrs)

File ~/software/dev/mammoth/.venv-3.11/lib/python3.11/site-packages/awkward/operations/ak_unflatten.py:207, in _impl(array, counts, axis, highlevel, behavior, attrs)
    204     return out
    206 if axis == 0 or maybe_posaxis(layout, axis, 1) == 0:
--> 207     out = unflatten_this_layout(layout)
    209 else:
    211     def recursively_apply_to_content(
    212         action, layout, depth, depth_context, lateral_context, options, **kwargs
    213     ):

File ~/software/dev/mammoth/.venv-3.11/lib/python3.11/site-packages/awkward/operations/ak_unflatten.py:186, in _impl.<locals>.unflatten_this_layout(layout)
    167 position = (
    168     index_nplike.searchsorted(
    169         current_offsets,
   (...)
    175     - 1
    176 )
    177 if (
    178     current_offsets.size is not unknown_length
    179     and layout.length is not unknown_length
   (...)
    184     )
    185 ):
--> 186     raise ValueError(
    187         "structure imposed by 'counts' does not fit in the array or partition "
    188         f"at axis={axis}"
    189     )
    191 offsets = current_offsets[: position + 1]
    192 current_offsets = current_offsets[
    193     position:
    194 ] - index_nplike.shape_item_as_index(layout.length)

ValueError: structure imposed by 'counts' does not fit in the array or partition at axis=0

This error occurred while calling

    ak.unflatten(
        <Array [[0, 1, 2, 3, 4, ..., 1, 2, 4, 3], ...] type='2 * var * int64'>
        <Array [3, 2, 2, 1, 1, 1, 1, ..., 2, 1, 1, 1, 1, 1] type='16 * int64'>
    )

Even if this worked and I could come up with a recursive function, it still seems far too complicated for something that is conceptually so simple. I've considered something like full_like, but I run into awkward-array immutability. I keep running into this problem each time I revisit this code base, so I suspect I'm missing something fundamental. What is the best way to create such a deeply nested array?

Thanks!

Answered by jpivarski

May 6, 2024


raymondEhlers · 2024-05-05T08:41:29Z

raymondEhlers
May 5, 2024
Author

As a brief follow-up, I did finally find a formulation that works here (copied below for completeness), but my question remains: conceptually, this seems like something I would expect to be a one- or two-liner rather than something that I need to carefully think through each time I end up in a similar, but not exactly the same, situation. Thanks!

In [157]: counts = ak.sum(ak.num(input_array, axis=-1), axis=1)

In [160]: output_array_2 = ak.unflatten(output_array, ak.sum(counts, axis=1))

In [158]: output_array_3 = ak.unflatten(output_array_2, ak.flatten(counts), axis=1)

In [159]: output_array_4 = ak.unflatten(output_array_3, ak.num(input_array, axis=1))
# output_array_4 is now in the same shape as input_array!

0 replies

agoose77 · 2024-05-05T15:43:06Z

agoose77
May 5, 2024
Maintainer

Applying numba at the final (axis=-1) axis is a common pattern. The best way to do this tersely is to use ak.transform:

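import awkward as ak
import numpy as np

def kernel(inputs, output):
    # One output value per inner list
    for i, input_ in enumerate(inputs):
        output[i] = np.prod(input_)

def transform(layout, depth, **kwargs):
    if depth == 2 and layout.is_list:  # Act at list of outer lists
        # One output slot per list at this depth (int64 dtype is an assumption)
        output = np.empty(len(layout), dtype=np.int64)
        kernel(ak.Array(layout), output)
        return ak.to_layout(output)

# `array` is your input array
result = ak.transform(transform, array)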

5 replies

raymondEhlers May 5, 2024
Author

Thanks for your quick response! I wasn't familiar with how to use ak.transform, so I can imagine this being helpful in some contexts, but I don't see how to apply it to my particular case. I've frequently used the workflow of inputs -> flat output -> unflatten, so I assumed that was fairly universal, but it seems I may have still fallen into the XY problem 😂. So I'll add some more details of my particular problem to help clarify why I don't see how to use it here.

I have a set of unique indices that identify particles, stored in two separate lists: the input_particles_user_index and the constituents_user_index. Those indices are stored at two different depths:

In [1]: input_particles_user_index.show()
[[[0, 1, -4, 2, 3]],
 [[0, 1, -4, 2, 3]]]
In [2]: input_particles_user_index.type.show()
2 * var * var * int64
# NOTE: The dimensions correspond to n_events * n_jets/event * n_particles/jet * one index/particle
In [5]: constituents_user_index.show()
[[[[0, 1, -4], [2, 3], [1, -4], [0], [1], [-4], [3], [2]]],
 [[[0, 1, -4], [2, 3], [1, -4], [0], [1], [-4], [3], [2]]]]
In [6]: constituents_user_index.type.show()
2 * var * var * var * int64

I need to match up the values stored in these two lists, and store the index where each value is located in the input_particles_user_index. Which is to say, my desired output is:

In [8]: desired_values.show()
[[[[0, 1, 2], [2, 3], [1, 2], [0], [1], [2], [3], [2]]],
 [[[0, 1, 2], [2, 3], [1, 2], [0], [1], [2], [3], [2]]]]

Note that this is a simplified test case - in general, I cannot just go through and e.g. just blindly map all -4 -> 2, as was done in this case. Other events will have different mappings, so I believe I need a more general solution. My solution at the moment looks like:

#@nb.njit  # type: ignore[misc]
def _find_constituent_indices_via_user_index_for_subjets(
    input_particles_user_indices: ak.Array, constituents_user_index: ak.Array, number_of_constituents: int
) -> ak.Array:
    output = np.ones(number_of_constituents, dtype=np.int64) * -1
    output_counter = 0
    for event_user_index, event_constituents_user_index in zip(input_particles_user_indices, constituents_user_index):  # noqa: B905
        for input_jets_user_index, jet_constituents_user_index in zip(event_user_index, event_constituents_user_index):  # noqa: B905
            for subjet_constituents_user_index in jet_constituents_user_index:
                for jet_constituent_index in subjet_constituents_user_index:
                    print(f"{jet_constituent_index=}, {event_user_index=}")
                    for i_original_constituent, user_index in enumerate(input_jets_user_index):
                        if jet_constituent_index == user_index:
                            output[output_counter] = i_original_constituent
                            output_counter += 1
                            break
                    else:
                        _msg = "Could not find match " + str(jet_constituent_index)
                        print(_msg)  # noqa: T201
                        # NOTE: Can't pass the message directly with numba since it would have to be a compile time constant (as of Mar 2023).
                        #       As an alternative, we print the message, and then we raise the exception. As long as we don't catch it, it
                        #       achieves basically the same thing.
                        raise ValueError

    return output


def find_constituent_indices_via_user_index_for_subjets(input_particles_user_indices: ak.Array, constituents_user_index: ak.Array) -> ak.Array:
    res = _find_constituent_indices_via_user_index_for_subjets(
        input_particles_user_indices=input_particles_user_indices,
        constituents_user_index=constituents_user_index,
        number_of_constituents=ak.count(constituents_user_index),
    )

    # How do I make res (a flat np array) have the same shape as constituents_user_index?

    return res_in_the_right_shape

find_constituent_indices_via_user_index_for_subjets(input_particles_user_index, constituents_user_index)
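To make the matching step itself easier to follow, here is a plain-Python sketch of what the innermost loops do for a single jet. `match_user_indices` is a hypothetical name used only for illustration; it is not numba-compatible as written and is not meant to replace the function above.

```python
def match_user_indices(input_user_index, subjets_user_index):
    """Hypothetical per-jet helper mirroring the innermost loops above."""
    # Map each user index to the position of its first occurrence,
    # which matches the `break` on the first match in the loop version.
    position_of = {}
    for position, user_index in enumerate(input_user_index):
        position_of.setdefault(user_index, position)
    # A KeyError here plays the role of the "Could not find match" branch.
    return [[position_of[v] for v in subjet] for subjet in subjets_user_index]


# One jet from the test case in this post.
jet_particles = [0, 1, -4, 2, 3]
jet_subjets = [[0, 1, -4], [2, 3], [1, -4], [0], [1], [-4], [3], [2]]

matched = match_user_indices(jet_particles, jet_subjets)
flat = [i for subjet in matched for i in subjet]
print(flat)
```

For this jet, the flattened result is the same sequence as the first 12 entries of the flat output_array shown in the original question.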

Based on your response, I could imagine using ak.transform in the case of just operating on constituents_user_index, but since I need to zip together an additional input at a different depth, I don't see how to implement it. I've used similar approaches for e.g. jet matching where I need to zip lists together. Is this possible using ak.transform?

jpivarski May 6, 2024
Maintainer

I am getting lost in the indexes. Is there a reason why the Numba function is insufficient? (It will probably be faster than an equivalent array-oriented approach: if the array-oriented approach is not easier, then it's not doing its job.)

In general, ak.transform can deal with two arrays having different depths. The function that you write as the transformation can be any function, and it can stop the recursive descent when one input has type var * int64 and the other has type int64. You could even run a Numba function inside of the transformation, which would liberate your _find_constituent_indices_via_user_index_for_subjets from having to descend a specific number of nesting levels (fewer nested for loops, and a fixed number of nested for loops for data that might have additional levels of depth, i.e. a more reusable function).

If your desired output has the same structure (nested lists of the same lengths) as the outer N list levels and then you do something different for the inner M (or M in one input argument and M + 1 in the other), then using ak.transform to descend through the outer N list levels and Numba to deal with M (and M + 1) is ideal. ak.transform's job is to create functions that change some inner bit of some Awkward Arrays and leave the outer bits untouched, which it sounds like is what you're trying to do here.

raymondEhlers May 6, 2024
Author

> I am getting lost in the indexes. Is there a reason why the Numba function is insufficient? (It will probably be faster than an equivalent array-oriented approach: if the array-oriented approach is not easier, then it's not doing its job.)

Sorry, I know this function is rather difficult to read! (I often don't clean up until I know it works, but I should have once I decided to post the issue.) Other than the readability / non-generalizability (which I can live with), the numba function works perfectly fine - the issue that I ran into was how to convert the output (a flat np array) into the right shape. The "solution" that I came up with in #3101 (comment) was the product of an hour or more of trying different combinations. Even now, I don't have good intuition for why those combinations are the right ones, other than that they apparently give the right answer. (And I skipped the array builder since I know it can hurt performance + I already had another array with the right structure, so I figured I could copy it.) So I thought there must be a better way - hence this discussion post.

> In general, ak.transform can deal with two arrays having different depths. The function that you write as the transformation can be any function, and it can stop the recursive descent when one input has type var * int64 and the other has type int64. You could even run a Numba function inside of the transformation, which would liberate your _find_constituent_indices_via_user_index_for_subjets from having to descend a specific number of nesting levels (fewer nested for loops, and a fixed number of nested for loops for data that might have additional levels of depth, i.e. a more reusable function).
>
> If your desired output has the same structure (nested lists of the same lengths) as the outer N list levels and then you do something different for the inner M (or M in one input argument and M + 1 in the other), then using ak.transform to descend through the outer N list levels and Numba to deal with M (and M + 1) is ideal. ak.transform's job is to create functions that change some inner bit of some Awkward Arrays and leave the outer bits untouched, which it sounds like is what you're trying to do here.

And based on what I'm hearing from you both, it sounds like ak.transform is the better way. (I'm still a little surprised there isn't some sort of ak.shape_like(flat_array, ak_array_with_desired_structure), but I suppose what you're saying is that ak.transform covers that functionality and more. And perhaps this doesn't fit so well with the underlying layouts(?)).

At the moment, I don't see exactly how to implement my case, but I think this is due to my lack of familiarity with ak.transform, so I will do some reading and testing and report back on my solution.

Thanks to you and Angus for your help!

jpivarski May 6, 2024
Maintainer

What you're proposing as ak.shape_like is essentially ak.unflatten, if it worked on multiple levels at once. And if it took an input array's structure directly, instead of asking for counts (which could be taken from ak.num).

Actually,

def shape_like(flat_array, array_with_desired_structure):
    flat_array_as_layout = ak.to_layout(flat_array)
    # flat_array needs to be flat
    assert flat_array_as_layout.is_numpy

    def transformation(layout, **kwargs):
        # this function only makes sense for non-branching layouts
        assert not layout.is_record and not layout.is_union

        if layout.is_numpy:
            # they need to have the same number of numerical values
            assert len(layout) == len(flat_array)
            return flat_array_as_layout

    return ak.transform(transformation, array_with_desired_structure)

and

>>> flat_array = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]
>>> array_with_desired_structure = [[[0, 1, 2], [], [3, 4]], [], [[5], None], [[None, None], [6, 7, 8, 9]]]
>>> shape_like(flat_array, array_with_desired_structure).show(type=True)
type: 4 * var * option[var * ?float64]
[[[0, 1.1, 2.2], [], [3.3, 4.4]],
 [],
 [[5.5], None],
 [[None, None], [6.6, 7.7, 8.8, 9.9]]]

Answer selected by raymondEhlers

raymondEhlers May 6, 2024
Author

Thanks for your follow-up! Between your example and reading the ak.transform docs, it all finally clicks, both for my example and for the general approach of ak.transform. On my own, I couldn't fully follow the logic, so this is perfect. Thanks again!
