-
RE the lazy aspect to this question, I assume what's happening here is that |
-
That's right. Assigning an array

    array_of_3vectors["mass"] = 125.0

can set a

    array_of_lists_of_lists["new_field"] = array_of_lists

will broadcast each element from the shallower

This is a fundamental problem with our design of lazy arrays: the idea is that it will be materialized when needed, but it can be hard for users to guess when it's needed. (Maybe you didn't know that field-assignment broadcasts, for instance.) There's a long history of bug-fixes for lazy arrays being materialized "too early," but without a clear definition of when is "too early." In some cases, it was obvious: some code was just checking something that could be determined from type alone, such as the value of the

But for broadcasting, we can't delay that. To delay a calculation, we have to be able to say what its Form is going to be without actually performing the calculation. There are even some slices in which that's not possible: the Form of the output depends on the specific values of the array. (That is, if some lists are empty or well-aligned or something, the output would have a different Form than if they weren't, and in order to make a new VirtualArray, we have to predict the Form it would have upon evaluation.)

Dask solves the lazy array problem differently, and that's where we're directing new effort on this:

One difference is that a Dask collection has a

Development of an Awkward-Dask collection is starting next month.

As for solving your problem: are you trying to make a record array with virtual fields? Instead of assigning new fields into an existing object, you could create the array all at once by calling the

    array = ak.Array({"field1": ak.virtual(...), "field2": ak.virtual(...)})

or

    array = ak.zip({"field1": ak.virtual(...), "field2": ak.virtual(...)}, depth_limit=1)

If the fields were lists of virtual data (i.e. the list offsets are not virtual, but the contents are), then you could use
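To make the "type alone" versus "needs the values" distinction concrete, here is a toy, pure-Python sketch. It is not how Awkward Array implements VirtualArray; `LazyColumn`, `describe`, and `broadcast_per_list` are hypothetical names invented for this illustration. The point is only that a metadata question can be answered without running the generator, while broadcasting a per-list value into the lists has to look at the actual list boundaries.

```python
class LazyColumn:
    """Toy stand-in for a virtual array (hypothetical, not Awkward's API)."""

    def __init__(self, generate, length):
        self._generate = generate  # zero-argument callable producing a list of lists
        self.length = length       # declared up front, like metadata known without data
        self._data = None
        self.calls = 0             # how many times the generator has run

    @property
    def materialized(self):
        return self._data is not None

    def data(self):
        # Run the generator on first access only, then reuse the result.
        if self._data is None:
            self.calls += 1
            self._data = self._generate()
        return self._data


def describe(column):
    # Metadata-only question: answered without touching the generator.
    return f"column of length {column.length}"


def broadcast_per_list(lists_column, per_list_values):
    # Broadcasting one value per list needs each list's real length,
    # so the lazy column has to be materialized here.
    return [
        [value] * len(sublist)
        for sublist, value in zip(lists_column.data(), per_list_values)
    ]


lazy = LazyColumn(lambda: [[1, 2, 3], [], [4, 5]], length=3)

summary = describe(lazy)
touched_by_describe = lazy.materialized    # generator has not run yet

result = broadcast_per_list(lazy, [10, 20, 30])
touched_by_broadcast = lazy.materialized   # generator ran during broadcasting
```

In this toy model the field assignment in the original question corresponds to `broadcast_per_list`: the operation itself is what forces the read, regardless of whether the result is used afterwards.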
-
The key thing about the memory leak is that it happens when a lazy array is used as a function argument but not when a non-lazy array is used as a function argument. I think what's happening is that Python can't see reference cycles like A → B → A when A is a Python object and B is a C++ object. All of the mallocs and frees are appropriately paired up (because the C++ code uses

@nsmith- found a similar issue when a VirtualArray's cache contained the VirtualArray itself. The cache is a Python mapping (to let you use

Our solution with the VirtualArray cache problem was to make the ArrayCache hold only a weak reference to the Python cache object, and then connect the strong reference through

What you've found is the same story, but with ArrayGenerator's hold on Python objects as arguments to the function, rather than ArrayCache's hold on a Python object as a MutableMapping. I think @nsmith- predicted yet another way of forming cycles: by the ArrayGenerator's function holding a closure to the array itself, but holding arguments is effectively the same.

That sounded unlikely to come up in practice: why would the function to fill in an array need to reference the holder where that function would go? Furthermore, in a purely immutable world, it wouldn't be possible to set up that situation: you wouldn't be able to reference a VirtualArray in one of the arguments that constructs the VirtualArray because it hasn't been made yet. @nsmith-'s example of a closure referencing itself is possible because Python's global namespaces are mutable. That's how you can write a recursive function: Python doesn't attempt to evaluate a reference to

    array["virtual"] = something_involving(array)

but building it all at once with the

Just as with the vagueness about when a lazy array materializes, this memory leak is a consequence of a bad design decision: if we had implemented the array nodes in Python, rather than C++, then none of this would have ever come up.

As the attempt to patch VirtualArray caches shows, trying to patch this with weak references will bring in its own problems. Just as the vagueness of lazy array materialization will be solved by relying instead on Dask, the problems introduced by C++ are being addressed by refactoring Awkward Array's middle layer from C++ to Python. There's a talk on that; it's an ongoing project, but the gist is below:

We replace the C++ box with a Python box:

The motivation described in that talk was for JAX (hence the "differentiated kernels" that become possible), but many other issues boiled down to the same thing. JAX and Dask can both be supported better if those libraries' tracer objects can "see" down to the level of kernels, but the long-standing problem of reference cycles would also be fixed by letting Python's garbage collector "see" down to this level. Kernel functions, by the way, do zero memory management (purely borrowed references).

So the memory leak you found is real, but our fix for it is going to be this refactoring.
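As an illustration of why moving the nodes into Python would help, here is a minimal standard-library sketch of a cycle A → B → A in which both objects are plain Python objects. `Node` and `make_cycle` are hypothetical names for this example, and the sketch does not reproduce the C++ case; it only shows that when the collector can traverse every object in the cycle, the cycle is found and freed even though the reference counts never drop to zero.

```python
import gc
import weakref


class Node:
    """Plain Python object that can point at another Node."""

    def __init__(self, name):
        self.name = name
        self.other = None


def make_cycle():
    # Build A -> B -> A, then drop every outside strong reference.
    a = Node("A")
    b = Node("B")
    a.other = b
    b.other = a
    # A weak reference lets us observe A without keeping it alive.
    return weakref.ref(a)


# Turn off automatic collection so the two checks below are deterministic.
gc.disable()
try:
    ref = make_cycle()

    # Reference counting alone cannot free the pair: each holds the other.
    alive_before = ref() is not None

    # The cycle collector can traverse both objects, so it frees them.
    gc.collect()
    alive_after = ref() is not None
finally:
    gc.enable()
```

If one of the two objects were opaque to the collector (as assumed above for a C++ node holding a Python reference), the `gc.collect()` step would have nothing to traverse through, which matches the steadily growing resident set size reported in the question.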
-
Thanks for the insightful discussion. I decided to give the

    import functools

    import awkward as ak


    class VirtualArrayCopier:
        def __init__(self, array):
            # Keep each field separately so a fresh array can be rebuilt on demand.
            self.data = {f: array[f] for f in ak.fields(array)}
            self.behavior = array.behavior

        def __setitem__(self, key, value):
            self.data[key] = value

        def get(self):
            # Build the array all at once instead of assigning fields into it.
            array = ak.Array(self.data)
            array.behavior = self.behavior
            return array

        def wrap_with_copy(self, func):
            # Pass a freshly built array as the first argument of func.
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                return func(self.get(), *args, **kwargs)

            return wrapper

By using the
-
Version of Awkward Array
1.4.0 (and git main)
Description and code to reproduce
When inserting a virtual array into an array created with ak.from_buffers with lazy=True, the generator function of the virtual array gets called right away. However, this happens only for the first virtual array that is inserted. I would expect the generator function not to be called at all, as the virtual array's content is not required at this point.

Code example:

This will print called1, even though it shouldn't. If lazy=False in the ak.from_buffers call, it doesn't print anything, like it should.

In the same context I noticed a continuous increase in memory usage when the array is also used as part of the args parameter. To me this looks a lot like something isn't being freed.

Code example:

The resident set size, which is printed, will continue to increase with every loop. This does not happen (there is almost no change to the number) if array["virtual"] is only set once. It also does not happen if big instead of array is used in the args parameter.