Examples dump: Gitter
This is a dump of Gitter's Awkward Array channel (https://gitter.im/Scikit-HEP/awkward-array) up to 2022-07-06, for the sake of mining for tutorial examples. Gitter channels are all public.
Date: 2020-07-27 00:11:57 From: Jim Pivarski (@jpivarski)
Test!
Date: 2020-07-27 18:48:06 From: Franz Király (@fkiraly)
seems to be working - new gitter?
Date: 2020-07-27 18:48:23 From: Jim Pivarski (@jpivarski)
I'm about to press "send" on my response on GitHub; be ready in a sec.
Date: 2020-07-27 18:50:27 From: Jim Pivarski (@jpivarski)
Okay! Is my comment on GitHub in the right ballpark or am I missing something?
Date: 2020-07-27 18:51:49 From: Franz Király (@fkiraly)
sounds to me like we are not talking about the same thing?
Date: 2020-07-27 18:51:57 From: Jim Pivarski (@jpivarski)
No?
Date: 2020-07-27 18:52:02 From: Franz Király (@fkiraly)
let me perhaps explain with two examples
Date: 2020-07-27 18:52:28 From: Franz Király (@fkiraly)
(A) unequally sampled time series
Date: 2020-07-27 18:52:45 From: Jim Pivarski (@jpivarski)
That's my (2)
Date: 2020-07-27 18:52:51 From: Franz Király (@fkiraly)
how?
Date: 2020-07-27 18:53:35 From: Franz Király (@fkiraly)
I thought in (2) you say you don't want a data-valued index
Date: 2020-07-27 18:53:44 From: Jim Pivarski (@jpivarski)
If you didn't have the convenience of Pandas/xarray, you'd handle unequally sampled x-y data with two NumPy arrays, right?
Date: 2020-07-27 18:54:01 From: Franz Király (@fkiraly)
two pandas.series
Date: 2020-07-27 18:54:09 From: Jim Pivarski (@jpivarski)
I mean if you didn't have Pandas.
Date: 2020-07-27 18:54:11 From: Franz Király (@fkiraly)
or in a pandas.DataFrame, with a lot of nans
Date: 2020-07-27 18:54:29 From: Franz Király (@fkiraly)
with integer index?
Date: 2020-07-27 18:54:54 From: Franz Király (@fkiraly)
I mean if you didn't have Pandas.
Date: 2020-07-27 18:55:11 From: Jim Pivarski (@jpivarski)
One array would be the "x" coordinates (time) and another would be the "y" coordinates. It wouldn't be convenient because the library wouldn't know that these two arrays are related; you'd have to impose that knowledge yourself.
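A minimal sketch of the two-array baseline Jim describes (illustrative values, not from the chat): two NumPy arrays hold the times and the values, and nothing but convention ties them together.

```python
import numpy as np

# Two plain NumPy arrays standing in for unequally sampled x-y data
# (hypothetical values): the library has no idea they are related.
t = np.array([0.0, 0.4, 1.1, 2.7])   # sample times, irregular spacing
y = np.array([3.1, 2.9, 3.4, 3.0])   # values observed at those times

# Any selection must be applied to both arrays by hand to keep them aligned.
mask = t < 2.0
t_sel, y_sel = t[mask], y[mask]
```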
Date: 2020-07-27 18:55:42 From: Franz Király (@fkiraly)
no, that's not what unequally sampled means in our parlance
Date: 2020-07-27 18:55:47 From: Franz Király (@fkiraly)
should have been more precise
Date: 2020-07-27 18:55:53 From: Franz Király (@fkiraly)
well, can mean that
Date: 2020-07-27 18:55:55 From: Franz Király (@fkiraly)
give me a sec
Date: 2020-07-27 18:56:13 From: Franz Király (@fkiraly)
so, there is the case where you have a multi-variate series, but different variables are observed at different time points
Date: 2020-07-27 18:56:41 From: Franz Király (@fkiraly)
then there is also the case where you have non-equally-spaced series, which is what you probably understood - indices do not form a regular progression
Date: 2020-07-27 18:56:46 From: Franz Király (@fkiraly)
I mean the first
Date: 2020-07-27 18:57:48 From: Franz Király (@fkiraly)
the key problem in representing time series is that, sure, you can always "impose the knowledge" as you say, but that puts an extreme burden on the modelling interface - if there are multiple related representations, you end up writing lengthy conventions on what the "meaning" of the various inputs/outputs is, which can easily render the interface unusable for all practical purposes
Date: 2020-07-27 18:58:25 From: Jim Pivarski (@jpivarski)
Oh, I wasn't suggesting that we should do data analysis at a lower level, I was just making sure that we had a common baseline of understanding.
Date: 2020-07-27 18:58:42 From: Franz Király (@fkiraly)
in the end we can all use numpy.array as the basis for all computations, with increasingly intricate "decoding based on imposed knowledge", so it's less an argument-in-principle than a point of trade-off
Date: 2020-07-27 18:59:09 From: Franz Király (@fkiraly)
ok, looks like we agree on that
Date: 2020-07-27 18:59:54 From: Jim Pivarski (@jpivarski)
So, if each dependent variable (variable that depends on time), call them "x", "y", and "z", is sampled at different time points "t", then at the lowest level, you'd have to have arrays t_x, x, t_y, y, t_z, z.
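The six-array layout just described, written out with hypothetical values: each dependent variable carries its own time axis, related to its values only by naming convention.

```python
import numpy as np

# Each variable is sampled at its own time points (hypothetical values).
t_x = np.array([1.0, 2.0, 3.0]); x = np.array([98.6, 99.7, 100.3])
t_y = np.array([0.5, 1.5]);      y = np.array([22.8, 23.6])
t_z = np.array([0.1, 0.9, 1.7]); z = np.array([7.2, 7.5, 7.1])
```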
Date: 2020-07-27 18:59:59 From: Franz Király (@fkiraly)
point is, the interface for us (at sktime) is already becoming unnecessarily complicated due to lack of "easy-to-use" data container
Date: 2020-07-27 19:00:13 From: Franz Király (@fkiraly)
So, if each dependent variable (variable that depends on time), call them "x", "y", and "z", is sampled at different time points "t", then at the lowest level, you'd have to have arrays
t_x,x,t_y,y,t_z,z.
Date: 2020-07-27 19:00:14 From: Franz Király (@fkiraly)
correct
Date: 2020-07-27 19:00:28 From: Jim Pivarski (@jpivarski)
(I recently had to do that in a data analysis; I had a time array for every data array. Obviously, taht wasn't great, but it worked.)
Date: 2020-07-27 19:00:51 From: Franz Király (@fkiraly)
sure, but you wouldn't want to make this your interface signature
Date: 2020-07-27 19:00:55 From: Franz Király (@fkiraly)
now imagine you have many such series
Date: 2020-07-27 19:01:07 From: Jim Pivarski (@jpivarski)
So Pandas isn't helping you a lot here, either, because even if you made a single t index in a single DataFrame, it would have a lot of NaN values (as you said above).
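A quick pandas illustration of the point about the single shared index (hypothetical values): aligning two variables observed at different times onto one DataFrame fills the gaps with NaN.

```python
import pandas as pd

# Two variables observed at different times (hypothetical values).
temp = pd.Series([98.6, 99.7], index=[1.0, 2.0], name="temperature")
pres = pd.Series([22.8, 23.6], index=[0.5, 1.5], name="pressure")

# Forcing them onto a single shared time index fills the gaps with NaN:
# four rows, half of the cells missing.
df = pd.concat([temp, pres], axis=1)
```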
Date: 2020-07-27 19:01:15 From: Jim Pivarski (@jpivarski)
(I had 3.)
Date: 2020-07-27 19:01:21 From: Franz Király (@fkiraly)
t_x_1, x_1, t_y_1, y_1, t_x_2, x_2
Date: 2020-07-27 19:01:37 From: Franz Király (@fkiraly)
So Pandas isn't helping you a lot here, either, because even if you made a single
tindex in a single DataFrame, it would have a lot of NaN values (as you said above).
Date: 2020-07-27 19:01:39 From: Franz Király (@fkiraly)
precisely
Date: 2020-07-27 19:02:03 From: Franz Király (@fkiraly)
now imagine you have many multivariate series of this kind, and it just gets worse
Date: 2020-07-27 19:02:10 From: Jim Pivarski (@jpivarski)
Yeah. So far, it looks like you need something more than even Pandas, which isn't what I had been thinking. I had been thinking you wanted basically what Pandas is, but with complex data types in the "x", "y", "z".
Date: 2020-07-27 19:02:24 From: Franz Király (@fkiraly)
it's a way to get around it
Date: 2020-07-27 19:02:37 From: Franz Király (@fkiraly)
another way would be:
Date: 2020-07-27 19:02:38 From: Jim Pivarski (@jpivarski)
What does Awkward offer for this situation?
Date: 2020-07-27 19:02:46 From: Franz Király (@fkiraly)
- allowing indexing by arbitrary sets
Date: 2020-07-27 19:03:05 From: Franz Király (@fkiraly)
- allowing to query values by any tuples of arbitrary length
Date: 2020-07-27 19:03:16 From: Franz Király (@fkiraly)
- named columns and indices, preferably
Date: 2020-07-27 19:03:39 From: Franz Király (@fkiraly)
awkward offers the "ragged array" way of addressing this
Date: 2020-07-27 19:03:52 From: Franz Király (@fkiraly)
it can be used inside an indexed data structure
Date: 2020-07-27 19:04:01 From: Franz Király (@fkiraly)
where you manually add the indexing
Date: 2020-07-27 19:04:24 From: Franz Király (@fkiraly)
simply put the ts in one dimension, and the x,y etc in another dimension, and 3rd index being time steps
Date: 2020-07-27 19:04:56 From: Franz Király (@fkiraly)
would be nice if there was awkward array and awkward data frame though
Date: 2020-07-27 19:05:01 From: Franz Király (@fkiraly)
Nd
Date: 2020-07-27 19:06:16 From: Jim Pivarski (@jpivarski)
What would the data structure look like if it were JSON?
Date: 2020-07-27 19:06:25 From: Franz Király (@fkiraly)
for sktime, we were looking at awkward array as it seemed to be perhaps the most solid way of getting ragged arrays in python
Date: 2020-07-27 19:06:31 From: Franz Király (@fkiraly)
What would the data structure look like if it were JSON?
Date: 2020-07-27 19:06:36 From: Franz Király (@fkiraly)
I don't speak json that well
Date: 2020-07-27 19:06:48 From: Franz Király (@fkiraly)
I can tell you in math
Date: 2020-07-27 19:06:56 From: Jim Pivarski (@jpivarski)
Okay, that then.
Date: 2020-07-27 19:07:25 From: Franz Király (@fkiraly)
comfortable with sets, tuples etc?
Date: 2020-07-27 19:07:30 From: Jim Pivarski (@jpivarski)
Yes.
Date: 2020-07-27 19:07:51 From: Franz Király (@fkiraly)
Ok, there are index sets S_1, ..., S_k
Date: 2020-07-27 19:08:03 From: Jim Pivarski (@jpivarski)
Okay.
Date: 2020-07-27 19:08:24 From: Franz Király (@fkiraly)
indices are of the kind (s_1,..., s_j), with j\le k, and s_i\in S_i
Date: 2020-07-27 19:08:24 From: Jim Pivarski (@jpivarski)
What would be an example S_i?
Date: 2020-07-27 19:08:41 From: Franz Király (@fkiraly)
S_i is typically the inhabitant set of a type
Date: 2020-07-27 19:08:49 From: Franz Király (@fkiraly)
or a collection of strings
Date: 2020-07-27 19:09:13 From: Franz Király (@fkiraly)
often also "the integers" or "the reals" or "all possible datetime indices"
Date: 2020-07-27 19:09:40 From: Franz Király (@fkiraly)
the data structure stores values at some of these indices (not necessarily all in the cartesian product)
Date: 2020-07-27 19:09:58 From: Franz Király (@fkiraly)
you would have sth like val( (s_1,...,s_j) )
Date: 2020-07-27 19:10:16 From: Franz Király (@fkiraly)
the type of this can and will in general depend on the choice of s_i
Date: 2020-07-27 19:10:25 From: Franz Király (@fkiraly)
so it's a type heterogeneous data structure
Date: 2020-07-27 19:10:30 From: Franz Király (@fkiraly)
Example
Date: 2020-07-27 19:10:51 From: Franz Király (@fkiraly)
S_1 = strings
Date: 2020-07-27 19:10:54 From: Franz Király (@fkiraly)
S_2 = strings
Date: 2020-07-27 19:11:12 From: Franz Király (@fkiraly)
S_3 = datetimes
Date: 2020-07-27 19:11:30 From: Franz Király (@fkiraly)
you have keys/indices
Date: 2020-07-27 19:11:56 From: Franz Király (@fkiraly)
("series-measurement", "temperature", "2018-01-12")
Date: 2020-07-27 19:12:24 From: Franz Király (@fkiraly)
("forecast", "pressure", "2020")
Date: 2020-07-27 19:13:03 From: Franz Király (@fkiraly)
you might also have dependency of admissible values of S_j on S_i for i<j
Date: 2020-07-27 19:13:48 From: Franz Király (@fkiraly)
val( ("series-measurement", "temperature", "2018-01-12") ) = 37 degrees Celsius
Date: 2020-07-27 19:14:30 From: Franz Király (@fkiraly)
in a multivariate series panel setting you might have a fourth index, for instance
Date: 2020-07-27 19:14:31 From: Jim Pivarski (@jpivarski)
In this example, is the second string a subcategory of the first string?
Date: 2020-07-27 19:14:36 From: Franz Király (@fkiraly)
no
Date: 2020-07-27 19:14:55 From: Franz Király (@fkiraly)
first string = variable that tells you which multivariate series it is
Date: 2020-07-27 19:15:01 From: Franz Király (@fkiraly)
second string = variable within the multivariate series
Date: 2020-07-27 19:15:20 From: Jim Pivarski (@jpivarski)
That's what I meant.
Date: 2020-07-27 19:15:33 From: Franz Király (@fkiraly)
no, it's a "real index"
Date: 2020-07-27 19:15:59 From: Jim Pivarski (@jpivarski)
If you were representing this with Pandas, would the first string tell you what DataFrame to look in and the second string tell you what column to look in?
Date: 2020-07-27 19:16:06 From: Franz Király (@fkiraly)
yes
Date: 2020-07-27 19:16:15 From: Jim Pivarski (@jpivarski)
Okay.
Date: 2020-07-27 19:16:44 From: Franz Király (@fkiraly)
with the added expectation, if there is another integer index, that the columns you find in the data frame are always the same
Date: 2020-07-27 19:17:08 From: Franz Király (@fkiraly)
you also may want to be able to look for values at "higher nodes"
Date: 2020-07-27 19:17:10 From: Franz Király (@fkiraly)
like
Date: 2020-07-27 19:17:29 From: Jim Pivarski (@jpivarski)
If you have both ("series-measurement", "temperature") and ("series-measurement", "pressure"), would they be defined at the same time samples?
Date: 2020-07-27 19:17:48 From: Franz Király (@fkiraly)
no
Date: 2020-07-27 19:17:52 From: Jim Pivarski (@jpivarski)
Okay.
Date: 2020-07-27 19:17:52 From: Franz Király (@fkiraly)
not in general
Date: 2020-07-27 19:18:04 From: Franz Király (@fkiraly)
I know there's "nice" subcases
Date: 2020-07-27 19:18:15 From: Franz Király (@fkiraly)
but I'm trying to construct a "useful" case that's badly supported
Date: 2020-07-27 19:18:54 From: Franz Király (@fkiraly)
you may also want to have val( ("series-measurement", "temperature-unit") ) which gives "Celsius"
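The index model Franz has been describing can be sketched as a plain dict keyed by tuples (all values hypothetical): tuples may have different lengths, not every combination of index sets is populated, and the value type varies by key.

```python
from datetime import date

# Keys are tuples over the index sets S_1 (series), S_2 (variable),
# S_3 (time); some keys are shorter, and value types differ per key.
val = {
    ("series-measurement", "temperature", date(2018, 1, 12)): 37.0,
    ("forecast", "pressure", "2020"): 1013.2,               # coarser time key
    ("series-measurement", "temperature-unit"): "Celsius",  # shorter tuple
}

# "Which time indices exist for this slice?" becomes a scan over the keys.
times = sorted(k[2] for k in val
               if len(k) == 3 and k[:2] == ("series-measurement", "temperature"))
```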
Date: 2020-07-27 19:18:59 From: Jim Pivarski (@jpivarski)
So you wouldn't want a DataFrame for every first string value (unless you're okay with lots of NaNs). If you're trying to avoid NaNs, you'd have groups of groups of Series.
Date: 2020-07-27 19:19:15 From: Jim Pivarski (@jpivarski)
What's val(...)?
Date: 2020-07-27 19:19:20 From: Franz Király (@fkiraly)
value at the index/key
Date: 2020-07-27 19:19:24 From: Jim Pivarski (@jpivarski)
Okay.
Date: 2020-07-27 19:19:28 From: Franz Király (@fkiraly)
So you wouldn't want a DataFrame for every first string value (unless you're okay with lots of NaNs).
Date: 2020-07-27 19:19:45 From: Franz Király (@fkiraly)
No, I would prefer sth where I can freely index without assumption of which keys have values assigned!
Date: 2020-07-27 19:20:28 From: Franz Király (@fkiraly)
The typical assumptions here are that the indices with assigned values:
- form some kind of Cartesian product
- are of equal length, as a tuple
Date: 2020-07-27 19:21:43 From: Markus Löning (@mloning)
Would it perhaps be easier to set up a quick phone call (and prepare some slides perhaps) to discuss this?
Date: 2020-07-27 19:21:47 From: Franz Király (@fkiraly)
e.g., the set [n_1] x [n_2] x ... x [n_d] for some integers n_i, in numpy, where [a] denotes range(a)
Date: 2020-07-27 19:22:04 From: Jim Pivarski (@jpivarski)
Okay. So a hypothetical implementation (to check my understanding): suppose you had a Python dict
{"series-measurement": ..., "forecast": ...}
and for each value of that, you had
{"temperature": ..., "pressure": ...}
and for each value of that, you had Pandas Series with a time index and data values.
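Jim's hypothetical layout as code (illustrative values): dict, then dict, then a pandas Series with a time index. The missing piece Franz confirms below is comfortable querying, so every lookup has to defend against undefined keys at every level.

```python
import pandas as pd

# dict -> dict -> Series-with-time-index (hypothetical values).
data = {
    "series-measurement": {
        "temperature": pd.Series([98.6, 99.7], index=[1.0, 2.0]),
        "pressure": pd.Series([22.8, 23.6, 20.2], index=[0.5, 1.5, 2.5]),
    },
    "forecast": {
        "pressure": pd.Series([1013.2], index=[2020]),
    },
}

def lookup(series, variable, time):
    """Return the value, or None for any undefined key at any level."""
    try:
        return data[series][variable].loc[time]
    except KeyError:
        return None
```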
Date: 2020-07-27 19:22:39 From: Jim Pivarski (@jpivarski)
It's my understanding that the piece you'd be lacking is comfortable querying. (Not having to check for undefined values.)
Date: 2020-07-27 19:23:03 From: Jim Pivarski (@jpivarski)
Undefined keys in the dicts and undefined index values in the Series.
Date: 2020-07-27 19:23:24 From: Franz Király (@fkiraly)
It's my understanding that the piece you'd be lacking is comfortable querying. (Not having to check for undefined values.)
Date: 2020-07-27 19:23:26 From: Franz Király (@fkiraly)
correct
Date: 2020-07-27 19:23:42 From: Franz Király (@fkiraly)
but you also want to be able to check for defined indices at any slice
Date: 2020-07-27 19:23:52 From: Franz Király (@fkiraly)
e.g., what is my set of time indices
Date: 2020-07-27 19:24:05 From: Franz Király (@fkiraly)
but that's perhaps a "data frame" function, not an "array" function, so I can live without
Date: 2020-07-27 19:25:04 From: Franz Király (@fkiraly)
though in time series analysis it's sometimes important to be quickly able to gather the indices with an observed value
Date: 2020-07-27 19:25:13 From: Franz Király (@fkiraly)
Would it perhaps be easier to set up a quick phone call (and prepare some slides perhaps) to discuss this?
Date: 2020-07-27 19:25:21 From: Jim Pivarski (@jpivarski)
One part I glossed over, because I think I got it, is that your time queries are intervals: "2020" is an interval, but I think the data are defined at instants of time.
Date: 2020-07-27 19:25:24 From: Franz Király (@fkiraly)
@mloning, are you volunteering to prepare a slide deck :-)
Date: 2020-07-27 19:25:36 From: Franz Király (@fkiraly)
is that your time queries are intervals: "2020" is an interval, but I think the data are defined at instants of time.
Date: 2020-07-27 19:25:51 From: Franz Király (@fkiraly)
in the model I outlined, it would be interpreted as instants
Date: 2020-07-27 19:26:03 From: Jim Pivarski (@jpivarski)
The queries?
Date: 2020-07-27 19:26:04 From: Franz Király (@fkiraly)
though of course you may also have intervals/segments, but that's a completely different story
Date: 2020-07-27 19:26:07 From: Franz Király (@fkiraly)
albeit an important one
Date: 2020-07-27 19:26:14 From: Franz Király (@fkiraly)
The queries?
Date: 2020-07-27 19:26:39 From: Franz Király (@fkiraly)
yes, in the queries you have to give a key - whether it's interpreted as a time point or a time period is not too important for the "core" data structure
Date: 2020-07-27 19:26:48 From: Franz Király (@fkiraly)
would be nice to have some sugar that deals with that kind of interpretation, too
Date: 2020-07-27 19:26:52 From: Franz Király (@fkiraly)
but perhaps not crucial
Date: 2020-07-27 19:27:20 From: Franz Király (@fkiraly)
"2020" was more to illustrate that the different series may have different index sets
Date: 2020-07-27 19:27:45 From: Jim Pivarski (@jpivarski)
I see: different time granularities.
Date: 2020-07-27 19:27:48 From: Franz Király (@fkiraly)
yes
Date: 2020-07-27 19:28:04 From: Franz Király (@fkiraly)
and the 3rd index might not even be time, or exist, depending what you choose for 2nd index
Date: 2020-07-27 19:28:24 From: Franz Király (@fkiraly)
(though it's the same set in my model above if it exists)
Date: 2020-07-27 19:29:37 From: Jim Pivarski (@jpivarski)
Hmmm. Getting back to the original question of "Can Awkward do that?" it's sounding like "No." Awkward Array doesn't have any facilities for index-lookup, for which you'd want some sort of tree, or sorted data to bisection-search over.
Date: 2020-07-27 19:30:03 From: Franz Király (@fkiraly)
well, the trees would be very shallow and very simple
Date: 2020-07-27 19:30:12 From: Franz Király (@fkiraly)
the above outlines the "luxury" version
Date: 2020-07-27 19:30:26 From: Franz Király (@fkiraly)
the "simple" version is simply: integer or datetime indices, possibly sparse
Date: 2020-07-27 19:30:55 From: Franz Király (@fkiraly)
awkward array currently has functionality for the index set being integers, and equal length indices, if I'm not mistaken
Date: 2020-07-27 19:31:09 From: Jim Pivarski (@jpivarski)
Also, on the dynamic nature of these queries: having the type of a tuple entry depend on previous values in the tuple is possible using unions, but UnionArray's space and time overhead grow with the complexity of the type.
Date: 2020-07-27 19:31:38 From: Franz Király (@fkiraly)
it sounds like the type of a tuple entry depends on previous values
Date: 2020-07-27 19:31:52 From: Franz Király (@fkiraly)
that may happen - already in a data frame, the column type depends on the column you choose (First index)
Date: 2020-07-27 19:31:54 From: Jim Pivarski (@jpivarski)
If you were planning to use Awkward in Pandas because Pandas has indexing, the Awkward Array would have to be in the Pandas index, rather than the column, which is not even the current implementation that I'm thinking of dismantling.
Date: 2020-07-27 19:32:20 From: Franz Király (@fkiraly)
currently, we have been using awkward array for the 2nd and 3rd index, I believe?
Date: 2020-07-27 19:32:24 From: Franz Király (@fkiraly)
@mloning?
Date: 2020-07-27 19:32:34 From: Franz Király (@fkiraly)
1st index telling you "this is a column that has series in it"
Date: 2020-07-27 19:32:44 From: Franz Király (@fkiraly)
2nd index being row index, 3rd index being time index
Date: 2020-07-27 19:32:57 From: Jim Pivarski (@jpivarski)
awkward array currently has functionality for the index set being integers, and equal length indices, if I'm not mistaken
It has functionality for data structures being ragged, but that's different from being able to query these structures.
Date: 2020-07-27 19:33:15 From: Franz Király (@fkiraly)
sure
Date: 2020-07-27 19:33:27 From: Franz Király (@fkiraly)
we wrote some of the query functionality
Date: 2020-07-27 19:33:31 From: Jim Pivarski (@jpivarski)
Okay.
Date: 2020-07-27 19:33:57 From: Franz Király (@fkiraly)
I think the prototype was using ExtensionArray and in it an awkward array
Date: 2020-07-27 19:34:07 From: Markus Löning (@mloning)
yes
Date: 2020-07-27 19:34:16 From: Jim Pivarski (@jpivarski)
Okay. How does it work?
Date: 2020-07-27 19:34:26 From: Franz Király (@fkiraly)
is patrick here?
Date: 2020-07-27 19:34:45 From: Franz Király (@fkiraly)
qualitatively:
Date: 2020-07-27 19:34:59 From: Markus Löning (@mloning)
It's a nested structure, so you have a multidimensional ak array inside the column, we also tried it out with numpy
Date: 2020-07-27 19:35:08 From: Franz Király (@fkiraly)
you index with row index and time index, and it looks up the value in the awkward array internally
Date: 2020-07-27 19:35:25 From: Franz Király (@fkiraly)
to the user, it looks like rows with series in them
Date: 2020-07-27 19:35:41 From: Franz Király (@fkiraly)
sort of
Date: 2020-07-27 19:37:00 From: Jim Pivarski (@jpivarski)
If you ran ak.to_json(...) on one of the rows, what would the result look like?
Date: 2020-07-27 19:38:01 From: Markus Löning (@mloning)
Patrick would know this ...
Date: 2020-07-27 19:38:02 From: Franz Király (@fkiraly)
not at a computer with IDE
Date: 2020-07-27 19:38:21 From: Franz Király (@fkiraly)
phone call with patrick?
Date: 2020-07-27 19:38:35 From: Jim Pivarski (@jpivarski)
I'd be up for that. I have a Zoom room, if that helps.
Date: 2020-07-27 19:38:46 From: Markus Löning (@mloning)
Yes that would be great!
Date: 2020-07-27 19:39:09 From: Franz Király (@fkiraly)
wait
Date: 2020-07-27 19:39:20 From: Franz Király (@fkiraly)
none of the rows are awkward arrays
Date: 2020-07-27 19:39:25 From: Franz Király (@fkiraly)
so the question doesn't make sense
Date: 2020-07-27 19:39:30 From: Markus Löning (@mloning)
We need to schedule with Patrick
Date: 2020-07-27 19:39:35 From: Franz Király (@fkiraly)
ah, yes, probably
Date: 2020-07-27 19:39:51 From: Jim Pivarski (@jpivarski)
Okay, I can also wait until everybody's ready.
Date: 2020-07-27 19:40:06 From: Franz Király (@fkiraly)
he's not around today, probably
Date: 2020-07-27 19:40:11 From: Jim Pivarski (@jpivarski)
I was asking about the conversion to JSON just to get a more concrete understanding.
Date: 2020-07-27 19:40:27 From: Franz Király (@fkiraly)
perhaps we can get there differently
Date: 2020-07-27 19:40:33 From: Franz Király (@fkiraly)
what are you trying to understand
Date: 2020-07-27 19:40:57 From: Franz Király (@fkiraly)
our requirements & use case?
Date: 2020-07-27 19:41:09 From: Franz Király (@fkiraly)
in the specific prototype exercise we were looking for a way that's
Date: 2020-07-27 19:41:12 From: Jim Pivarski (@jpivarski)
I'm trying to understand what, exactly, Awkward buys you.
Date: 2020-07-27 19:41:16 From: Franz Király (@fkiraly)
- low-maintenance for us
Date: 2020-07-27 19:41:20 From: Franz Király (@fkiraly)
- preferably off-shelf
Date: 2020-07-27 19:41:43 From: Jim Pivarski (@jpivarski)
As opposed to other off-the-shelf libraries.
Date: 2020-07-27 19:41:51 From: Markus Löning (@mloning)
mainly support for ragged arrays
Date: 2020-07-27 19:42:11 From: Franz Király (@fkiraly)
- supporting the "samples of univariate series (not necessarily equally sampled)" and/or the "multi-variate series (not necessarily equally sampled)" use case
Date: 2020-07-27 19:42:18 From: Franz Király (@fkiraly)
no off-shelf package supports that
Date: 2020-07-27 19:42:20 From: Jim Pivarski (@jpivarski)
Yeah, and that's the part I haven't gotten just yet: in the conversation about indexing and querying above, I didn't see where ragged arrays come in.
Date: 2020-07-27 19:42:33 From: Franz Király (@fkiraly)
the problem can be solved via a detour over ragged arrays
Date: 2020-07-27 19:42:39 From: Franz Király (@fkiraly)
2D, to be precise
Date: 2020-07-27 19:42:51 From: Markus Löning (@mloning)
you may have multiple subjects (e.g. patients) with multivariate time series data (e.g. blood pressure/heart rate), where time series have unequal length across subjects/variables
Date: 2020-07-27 19:42:55 From: Franz Király (@fkiraly)
use case 1, "samples of univariate series" - 1st index = sample, 2nd index = time
Date: 2020-07-27 19:43:11 From: Franz Király (@fkiraly)
use case 2, "multi-variate series" - 1st index = variable, 2nd index = time
Date: 2020-07-27 19:43:23 From: Franz Király (@fkiraly)
as the series can be of different length, you get ragged arrays
Date: 2020-07-27 19:43:55 From: Jim Pivarski (@jpivarski)
So I'll focus on case 1 for the moment.
Date: 2020-07-27 19:44:16 From: Franz Király (@fkiraly)
for a 3rd use case, "samples of multi-variate series", you get 3D ragged arrays: 1st index = sample, 2nd index = variable, 3rd index = time
Date: 2020-07-27 19:44:43 From: Franz Király (@fkiraly)
So I'll focus on case 1 for the moment.
Date: 2020-07-27 19:44:44 From: Franz Király (@fkiraly)
sure
Date: 2020-07-27 19:44:52 From: Franz Király (@fkiraly)
go ahead
Date: 2020-07-27 19:46:11 From: Jim Pivarski (@jpivarski)
[{"quantity": "temperature", "data": [98.6, 99.7, 100.3], "time": [1, 2, 3]},
{"quantity": "pressure", "data": [22.8, 23.6, 20.2, 25.5], "time": [0.5, 1.5, 2.5, 3.0]}]
Date: 2020-07-27 19:47:11 From: Franz Király (@fkiraly)
this is no.2
Date: 2020-07-27 19:47:12 From: Franz Király (@fkiraly)
not no.1
Date: 2020-07-27 19:47:28 From: Jim Pivarski (@jpivarski)
Okay. Then I'll focus on case 2! :)
Date: 2020-07-27 19:48:33 From: Franz Király (@fkiraly)
yes - each list element carries the same data information as a pandas.series
Date: 2020-07-27 19:49:13 From: Jim Pivarski (@jpivarski)
I'm going to keep thinking about it, but I wouldn't be inclined to store data this way because I wouldn't want to put data of different types (different units) in the same arrays, as we have here. (Under this structure, the [98.6, 99.7, 100.3, 22.8, 23.6, 20.2, 25.5] is all one internal array.)
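A simplified sketch of what "all one internal array" means here (not Awkward's actual classes, just the idea): the ragged "data" field is stored as one flat content array plus offsets marking where each list begins and ends.

```python
import numpy as np

# One flat content array for all series, plus offsets into it.
content = np.array([98.6, 99.7, 100.3, 22.8, 23.6, 20.2, 25.5])
offsets = np.array([0, 3, 7])  # list i is content[offsets[i]:offsets[i+1]]

def get_list(i):
    return content[offsets[i]:offsets[i + 1]]
```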
Date: 2020-07-27 19:50:19 From: Jim Pivarski (@jpivarski)
I'd be inclined to solve the original problem using regular Python classes for the bookkeeping with NumPy arrays for the big data.
Date: 2020-07-27 19:51:04 From: Jim Pivarski (@jpivarski)
Unless the number of variables n is close to or bigger than the number of samples N.
Date: 2020-07-27 19:51:24 From: Franz Király (@fkiraly)
sure, and that way you may end up re-implementing a minor variant of pandas...
Date: 2020-07-27 19:51:34 From: Jim Pivarski (@jpivarski)
When I said, "what Awkward buys you," one of the comparisons to be made is vs. Python itself.
Date: 2020-07-27 19:51:51 From: Franz Király (@fkiraly)
ah
Date: 2020-07-27 19:52:02 From: Jim Pivarski (@jpivarski)
Aren't you re-implementing a minor variant of Pandas, whether your base is Awkward or Python?
Date: 2020-07-27 19:52:19 From: Jim Pivarski (@jpivarski)
Since you're building all the indexing and querying yourself.
Date: 2020-07-27 19:52:27 From: Franz Király (@fkiraly)
I guess the two main - potential - benefits are (a) speed and (b) pre-existing abstraction
Date: 2020-07-27 19:52:59 From: Franz Király (@fkiraly)
the nested format (series within pandas) is slower
Date: 2020-07-27 19:53:23 From: Jim Pivarski (@jpivarski)
Then maybe you're near n ~ N...
Date: 2020-07-27 19:53:52 From: Franz Király (@fkiraly)
I would guess that would also hold for "naive native python construct"
Date: 2020-07-27 19:54:16 From: Jim Pivarski (@jpivarski)
one moment...
Date: 2020-07-27 19:54:18 From: Franz Király (@fkiraly)
n/N can be similar
Date: 2020-07-27 19:58:16 From: Jim Pivarski (@jpivarski)
That could explain it, and that would be a good reason for building it on Awkward. The advantage comes from putting all the (let's say) thousand time samples of the (let's say) thousand variables into one array of length 1 million. A thousand smaller NumPy arrays is not as space-efficient.
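The space argument can be made concrete with a toy comparison (hypothetical sizes): the payload is identical, but the thousand small arrays each pay a fixed per-object overhead, and Python-level iteration over them is slow compared with one vectorized operation.

```python
import sys
import numpy as np

# Same 8 MB payload, two layouts (hypothetical sizes).
one_big = np.zeros(1_000_000)
many_small = [np.zeros(1_000) for _ in range(1_000)]

# Per-object overhead (header, dtype, strides) is paid a thousand times.
overhead = sum(sys.getsizeof(a) - a.nbytes for a in many_small)
```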
Date: 2020-07-27 19:59:18 From: Jim Pivarski (@jpivarski)
If all of the quantities (temperature, pressure, etc.) have different data types, this will not work well because they'll be in a UnionArray, which is actually composed of different arrays for each of the types.
Date: 2020-07-27 19:59:45 From: Jim Pivarski (@jpivarski)
Enforcing all of your quantities to be floating point with the same precision would be advantageous.
Date: 2020-07-27 19:59:51 From: Franz Király (@fkiraly)
patrick made some systematic runtime and efficiency experiments. The entire data container project came about for two reasons: (i) inefficiencies we noticed in timing, and (ii) an awkward user experience for the more hacky data container type, if I remember correctly.
Date: 2020-07-27 20:00:34 From: Markus Löning (@mloning)
Enforcing all of your quantities to be floating point with the same precision would be advantageous.
yes that's what you'd do when working with numpy
Date: 2020-07-27 20:00:36 From: Franz Király (@fkiraly)
have different data types, this will not work well because they'll be in a UnionArray, which is actually composed of different arrays for each of the types.
but ultimately there's a small number of "interesting" types, which could be handled by putting them in buckets
Date: 2020-07-27 20:02:03 From: Jim Pivarski (@jpivarski)
Maybe I'm exaggerating the difficulty. If you have 5 data types, you'll have 7 arrays: one for each data type, one array of 8-bit "tags" and another array of 32-to-64-bit "indexes" for random access.
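The tags-plus-index layout Jim describes can be sketched with plain NumPy (a simplified model, not Awkward's UnionArray class itself): one content array per type, an 8-bit tags array saying which type each element is, and an index array giving its position within that type's content.

```python
import numpy as np

# Union layout sketch with two types (hypothetical values).
float_content  = np.array([98.6, 22.8])
string_content = np.array(["Celsius", "hPa"])
tags  = np.array([0, 1, 0, 1], dtype=np.int8)   # 0 -> floats, 1 -> strings
index = np.array([0, 0, 1, 1], dtype=np.int64)  # position within that content

def item(i):
    contents = [float_content, string_content]
    return contents[tags[i]][index[i]]
```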
Date: 2020-07-27 20:02:23 From: Jim Pivarski (@jpivarski)
Is the primary concern the speed of lookup?
Date: 2020-07-27 20:02:50 From: Jim Pivarski (@jpivarski)
As opposed to manipulation?
Date: 2020-07-27 20:04:26 From: Jim Pivarski (@jpivarski)
(Because UnionArrays are also a stumbling block for manipulation. The originally motivating use case had homogeneous data; UnionArrays were added so that there would be appropriate output types for some operations, but not expected to be relied upon heavily.)
Date: 2020-07-27 20:04:53 From: Franz Király (@fkiraly)
Is the primary concern the speed of lookup? As opposed to manipulation?
Date: 2020-07-27 20:04:55 From: Franz Király (@fkiraly)
both are
Date: 2020-07-27 20:04:57 From: Franz Király (@fkiraly)
- speed
Date: 2020-07-27 20:05:00 From: Franz Király (@fkiraly)
- ease of use
Date: 2020-07-27 20:05:05 From: Jim Pivarski (@jpivarski)
What kinds of manipulations?
Date: 2020-07-27 20:05:40 From: Franz Király (@fkiraly)
read/write, with sub-setting, indexing, slicing
Date: 2020-07-27 20:05:50 From: Franz Király (@fkiraly)
look-up by-value
Date: 2020-07-27 20:06:03 From: Franz Király (@fkiraly)
and by-condition
Date: 2020-07-27 20:06:42 From: Franz Király (@fkiraly)
more generally, the numpy kind of operation on pooled array structure
Date: 2020-07-27 20:06:57 From: Jim Pivarski (@jpivarski)
Okay, most of that is querying. By "manipulation," I meant doing computations that can change the structure of the data. (Computations via NumPy ufuncs that do not change the structure are basically pass-throughs.)
Date: 2020-07-27 20:08:08 From: Franz Király (@fkiraly)
that may happen in models - it's typically some merge/aggregate operation and/or matrix operation on the entire thing, or slices
Date: 2020-07-27 20:08:12 From: Jim Pivarski (@jpivarski)
Okay. And back to the original question of whether Awkward needs to be a Pandas ExtensionArray to do that: if you have a layer on top of Awkward anyway, to translate queries from the space of sets you described into ragged arrays, then couldn't that in-between layer also define an ExtensionArray?
Date: 2020-07-27 20:08:20 From: Franz Király (@fkiraly)
or dynamic programming algorithms
Date: 2020-07-27 20:08:38 From: Franz Király (@fkiraly)
then couldn't that in-between layer also define an ExtensionArray?
Date: 2020-07-27 20:08:40 From: Franz Király (@fkiraly)
that's what we did
Date: 2020-07-27 20:09:01 From: Jim Pivarski (@jpivarski)
Oh! So if I remove Awkward's built-in ExtensionArray support, it already wouldn't affect you. Right?
Date: 2020-07-27 20:09:01 From: Franz Király (@fkiraly)
but it creates a maintenance burden
Date: 2020-07-27 20:09:25 From: Franz Király (@fkiraly)
Oh! So if I remove Awkward's built-in ExtensionArray support, it already wouldn't affect you. Right?
It would increase our maintenance burden if we went with awkward array
Date: 2020-07-27 20:09:53 From: Franz Király (@fkiraly)
optimally, we have minimal maintenance burden when using awkward array, but it does not sound like that would happen...
Date: 2020-07-27 20:10:49 From: Jim Pivarski (@jpivarski)
One irreducible thing is that you have to do a translation from a space of sets into ragged arrays. That part has to be there and maintained by you.
Date: 2020-07-27 20:11:01 From: Franz Király (@fkiraly)
point is, since we currently don't really have capacity for creating or maintaining data containers (sktime is primarily a modelling framework toolbox), how off-the-shelf a dependency is is a key factor
Date: 2020-07-27 20:11:32 From: Franz Király (@fkiraly)
I think @mloning's idea of a phone call with patrick might be a good idea to explore this fully
Date: 2020-07-27 20:11:43 From: Jim Pivarski (@jpivarski)
The thing I'm wondering is whether Awkward's ExtensionArray layer even helps, given that you need to have a layer in between, too.
But we can pick this up on a phone call.
Date: 2020-07-27 20:11:57 From: Markus Löning (@mloning)
makes sense
Date: 2020-07-27 20:11:57 From: Jim Pivarski (@jpivarski)
Are you in UK time?
Date: 2020-07-27 20:12:03 From: Markus Löning (@mloning)
Yes
Date: 2020-07-27 20:12:03 From: Franz Király (@fkiraly)
well that's precisely one of the questions we've been wondering too
Date: 2020-07-27 20:12:07 From: Franz Király (@fkiraly)
yes
Date: 2020-07-27 20:12:46 From: Franz Király (@fkiraly)
but you have to evaluate this against alternatives that satisfy the functional requirement
Date: 2020-07-27 20:12:52 From: Jim Pivarski (@jpivarski)
I can start as early as 7am Chicago time which is 13:00 for you.
Date: 2020-07-27 20:13:08 From: Markus Löning (@mloning)
We should also think about our requirements again given that there isn't a solution that meets all of them at the moment and perhaps see which one we'd be prepared to give up first
Date: 2020-07-27 20:13:15 From: Jim Pivarski (@jpivarski)
Except for Thursday, when I have another meeting at that time.
Date: 2020-07-27 20:13:28 From: Franz Király (@fkiraly)
I'd say, please consider me optional, but patrick as required
Date: 2020-07-27 20:13:33 From: Franz Király (@fkiraly)
and markus
Date: 2020-07-27 20:14:10 From: Franz Király (@fkiraly)
I have your email, assuming it's the one you just posted
Date: 2020-07-27 20:14:18 From: Markus Löning (@mloning)
@fkiraly it'd be good to have you there :)
Date: 2020-07-27 20:14:23 From: Markus Löning (@mloning)
we'll be in touch!
Date: 2020-07-27 20:14:29 From: Markus Löning (@mloning)
Thanks a lot!
Date: 2020-07-27 20:14:31 From: Jim Pivarski (@jpivarski)
Okay, see you later!
Date: 2020-07-27 20:14:49 From: Franz Király (@fkiraly)
@fkiraly it'd be good to have you there :)
but if I'm there I can less easily complain about the decision you made
Date: 2020-07-28 06:02:19 From: Patrick Rockenschaub (@prockenschaub)
Seems I am late to the party...
Date: 2020-07-28 06:03:15 From: Patrick Rockenschaub (@prockenschaub)
I will try to put a real-world example together and show-case the way we were thinking about this for the last few months
Date: 2020-07-28 06:04:47 From: Patrick Rockenschaub (@prockenschaub)
I should also be available any day this week after 1pm UK time
Date: 2020-08-17 14:00:32 From: Alexander Held (@alexander-held)
Hi, I noticed that I couldn't add to an array via `+=`, for example:
```
import awkward1 as ak
arr = [[1,2,3], [4]]
ones = [[1,1,1], [1]]
arr = ak.from_iter(arr)
ones = ak.from_iter(ones)
print(arr + ones)
arr += ones
print(arr)
```
results in
```
Traceback (most recent call last):
File "test.py", line 10, in <module>
arr += ones
File "[...]/lib/python3.8/site-packages/numpy/lib/mixins.py", line 39, in func
return ufunc(self, other, out=(self,))
TypeError: operand type(s) all returned NotImplemented from __array_ufunc__(<ufunc 'add'>, '__call__', <Array [[1, 2, 3], [4]] type='2 * var * int64'>, <Array [[1, 1, 1], [1]] type='2 * var * int64'>, out=(<Array [[1, 2, 3], [4]] type='2 * var * int64'>,)): 'Array', 'Array', 'Array'
```
It's very easy to work around this but I got curious: is there a deeper reason for why this does not work?
Date: 2020-08-17 14:53:40 From: Jim Pivarski (@jpivarski)
There's a deeper reason: Awkward Arrays are immutable. Perhaps the error message should be more explicit, though.
Date: 2020-08-17 14:58:38 From: Alexander Held (@alexander-held)
ah this makes sense, thanks Jim!
Date: 2020-08-17 15:12:03 From: Henry Schreiner (@henryiii)
Even just changing the contents, not the structure?
Date: 2020-08-17 15:12:21 From: Henry Schreiner (@henryiii)
(I feel this was discussed somewhere already)
Date: 2020-08-17 16:38:15 From: Jim Pivarski (@jpivarski)
Some element like array["column", 1000, "subcolumn", 12] might also be array[999, "another-column"] or even another_array[12]. If you assign it in place, even without changing structure, you can have some confusing long-distance behaviors.
I know NumPy does that, too, and view vs copy in NumPy is a little nuisance. The structure of an Awkward array is sufficiently more complicated (and hidden, under the layout property) that view vs copy would be a big nuisance.
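[Editor's note: the NumPy view-vs-copy nuisance Jim alludes to can be shown in a few lines of plain NumPy, no Awkward needed.]

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
row = a[1]        # basic slicing returns a *view*, not a copy
row[0] = 99       # mutating the view also mutates `a`
print(a[1, 0])    # 99 -- a long-distance effect of the in-place assignment

b = a[a > 2]      # boolean indexing returns a *copy*
b[0] = -1         # so this does NOT affect `a`
```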
Date: 2020-08-19 12:03:44 From: Markus Löning (@mloning)
@jpivarski FYI https://data-apis.org/blog/announcing_the_consortium/
Date: 2020-08-21 08:08:41 From: Riccardo De Maria (@rdemaria)
Hello, I am interested in the topic of mutability. My domain is simulations, so I like to mutate state, and immutability leads to performance issues. I understand that the memory layout depends on the content, but I don't understand why it is difficult to allow only mutations that do not change the type and memory layout of the data. For instance, a double can be replaced by another double, or an array of doubles can be replaced by an array of doubles of the same length. Am I missing something?
Date: 2020-08-21 12:10:52 From: Jim Pivarski (@jpivarski)
The main thing is that a given double may be in three or four places within an array, or in several arrays, and changing it in one place might change its value unexpectedly in multiple places. Whereas it can be hard in NumPy to know whether a given operation copies or views the original data (the latter leading to unexpected long-distance mutations), most Awkward operations copy AND view—parts of the array are copied (if they need to be changed) and other parts are viewed. This, also, is motivated by performance: it's called structural sharing, and it allows slices, rearrangements, and restructuring of an array of big records without having to apply those operations to all fields of the big records. The fields of the unmodified array ARE the fields of the modified array. The operation doesn't get fully pushed down until you look at one of the fields.
Date: 2020-08-21 12:13:12 From: Jim Pivarski (@jpivarski)
However, it is possible to modify Awkward arrays in place, but not by performing Awkward operations. When NumPy arrays are wrapped as Awkward arrays, they are not copied, and you can still modify the NumPy array in place and get long distance changes in all the Awkward arrays that view the same data. This tutorial explains how to do that:
Date: 2020-08-21 12:13:15 From: Jim Pivarski (@jpivarski)
https://awkward-array.org/how-to-convert-numpy.html#mutability-of-awkward-arrays-from-numpy
Date: 2020-08-21 13:44:18 From: Riccardo De Maria (@rdemaria)
Thanks for the explanation it is clear now!
Date: 2020-10-20 15:54:30 From: Sven Dildick (@dildick)
Hi Jim, I would like to use awkward-arrays to make plots of efficiencies for an object to match to others objects.
Date: 2020-10-20 15:54:49 From: Jim Pivarski (@jpivarski)
Okay.
Date: 2020-10-20 15:54:54 From: Sven Dildick (@dildick)
All these object properties are stored in awkward-arrays.
Date: 2020-10-20 15:55:08 From: Sven Dildick (@dildick)
https://github.com/gem-sw/GEMCode/blob/for-CMSSW_11_1_X/GEMValidation/scripts/objects.py
Date: 2020-10-20 15:55:43 From: Sven Dildick (@dildick)
I was zipping the corresponding properties together in new objects called "clct" or "sim_muon" or other relevant names
Date: 2020-10-20 15:56:45 From: Jim Pivarski (@jpivarski)
okay
Date: 2020-10-20 15:56:52 From: Sven Dildick (@dildick)
The idea that the sim_muon contains indices of objects to which it was previously matched, e.g. "sim_id_gem_cluster"
Date: 2020-10-20 15:57:15 From: Sven Dildick (@dildick)
this can be of arbitrary length
Date: 2020-10-20 15:57:32 From: Jim Pivarski (@jpivarski)
If you have these in an interactive prompt (highly recommended while developing), you can print what they look like here.
Date: 2020-10-20 16:00:51 From: Sven Dildick (@dildick)
for instance the sim_muon looks like <Array [[], [], ... sim_id_gem_cluster: 5}]]] type='1000 * var * var * {"phi": f...'>
Date: 2020-10-20 16:01:28 From: Sven Dildick (@dildick)
And sim_muon.sim_id_gem_cluster looks like <Array [[], [], [[2, 5, 6, ... 3, 4], [2, 5]]] type='1000 * var * var * int64'>
Date: 2020-10-20 16:02:50 From: Sven Dildick (@dildick)
another example is sim_muon.eta which looks like <Array [[], [], [[1.63, ... [2.23, 2.23]]] type='1000 * var * var * float32'>
Date: 2020-10-20 16:03:50 From: Sven Dildick (@dildick)
There is also a cluster, which is a bunch of zipped awkward-arrays gem_cluster
Date: 2020-10-20 16:07:21 From: Jim Pivarski (@jpivarski)
(Sorry—distracted by email again. I'm looking now.)
Date: 2020-10-20 16:07:50 From: Sven Dildick (@dildick)
A first thing I wanted to figure out was how to get all sim_muons for which any of the elements in sim_id_gem_cluster point to gem_cluster which has (gem_cluster.station == 2 & gem_cluster.ring == 1)
Date: 2020-10-20 16:08:13 From: Jim Pivarski (@jpivarski)
Okay, let me make something similar.
Date: 2020-10-20 16:08:29 From: Sven Dildick (@dildick)
So the starting point is the muon. I have to go through the indices to get to matched objects. I want all muons for which the matched objects satisfy a specific criterium
Date: 2020-10-20 16:10:02 From: Jim Pivarski (@jpivarski)
What does the gem_cluster array look like?
Date: 2020-10-20 16:10:30 From: Jim Pivarski (@jpivarski)
It has fields "station" and "ring" (at least), but is it singly jagged or doubly?
Date: 2020-10-20 16:11:38 From: Jim Pivarski (@jpivarski)
Taking a wild guess, can you do
gem_cluster[sim_muon.sim_id_gem_cluster]
?
Date: 2020-10-20 16:12:04 From: Sven Dildick (@dildick)
gem_clusters looks like
Date: 2020-10-20 16:12:06 From: Sven Dildick (@dildick)
"bx" : tree["gem_cluster_bx"],
"pad" : tree["gem_cluster_pad"],
"isodd" : tree["gem_cluster_isodd"],
"size" : tree["gem_cluster_size"],
"region" : tree["gem_cluster_region"],
"station" : tree["gem_cluster_station"],
"roll" : tree["gem_cluster_roll"],
"layer" : tree["gem_cluster_layer"],
"chamber" : tree["gem_cluster_chamber"]
})```
Date: 2020-10-20 16:12:19 From: Sven Dildick (@dildick)
It does not have ring, sorry, but you can use region instead
Date: 2020-10-20 16:12:20 From: Jim Pivarski (@jpivarski)
I'm guessing that `gem_cluster` has the same length as `sim_muon` and the same level of jaggedness, but a different number of elements in each nested list.
Date: 2020-10-20 16:12:52 From: Jim Pivarski (@jpivarski)
I saw that in the code, but I don't know how deeply jagged it is without knowing the same about `tree["gem_cluster_bx"]`, etc.
Date: 2020-10-20 16:13:05 From: Sven Dildick (@dildick)
sim_muon is of type `1000 * var` followed by `{}`
Date: 2020-10-20 16:13:17 From: Jim Pivarski (@jpivarski)
Does the above code (in black) return a value or raise an error?
Date: 2020-10-20 16:13:31 From: Sven Dildick (@dildick)
whereas sim_id_gem_cluster is of type `1000 * var * var`
Date: 2020-10-20 16:13:49 From: Jim Pivarski (@jpivarski)
I see.
Date: 2020-10-20 16:14:31 From: Sven Dildick (@dildick)
`ValueError: in ListArray64, jagged slice inner length differs from array inner length`
Date: 2020-10-20 16:14:47 From: Sven Dildick (@dildick)
this is what `gem_cluster[sim_muon.sim_id_gem_cluster]` returns
Date: 2020-10-20 16:17:53 From: Jim Pivarski (@jpivarski)
I lost connection for a bit; back now.
Date: 2020-10-20 16:19:35 From: Jim Pivarski (@jpivarski)
Yeah, you got the error because they have different depths of jaggedness; `sim_muon` is singly jagged (`ndim` is 2, like a jagged 2-d array) and `gem_cluster` is doubly jagged (`ndim` is 3).
Date: 2020-10-20 16:20:31 From: Jim Pivarski (@jpivarski)
I'm trying to think of why your data have that structure. It may be right, but it always takes some time to think it through.
Date: 2020-10-20 16:21:52 From: Jim Pivarski (@jpivarski)
Okay, this is right: the `gem_cluster` and the `sim_muon.sim_id_gem_cluster` are both doubly jagged. They ought to fit together.
Date: 2020-10-20 16:22:43 From: Jim Pivarski (@jpivarski)
But they don't because `gem_cluster`'s first level of grouping has nothing to do with muons, does it?
Date: 2020-10-20 16:23:37 From: Jim Pivarski (@jpivarski)
Actually, I understand why `sim_muon.sim_id_gem_cluster` is doubly jagged: each event has multiple muons and each muon has multiple clusters (which I'm taking to be like "hits").
Date: 2020-10-20 16:23:37 From: Sven Dildick (@dildick)
It does not, the first level is per-event
Date: 2020-10-20 16:24:17 From: Sven Dildick (@dildick)
Indeed, each muon can have multiple clusters (of hits)
Date: 2020-10-20 16:24:28 From: Jim Pivarski (@jpivarski)
I don't understand why `gem_cluster` has two levels of jaggedness. I would think each event has a bunch of clusters. How/why are the clusters grouped?
Date: 2020-10-20 16:26:51 From: Sven Dildick (@dildick)
So maybe I mistyped, but this is `gem_cluster`
Date: 2020-10-20 16:26:54 From: Sven Dildick (@dildick)
```
In [10]: gem_cluster
Out[10]: <Array [[], [], ... layer: 2, size: 1}]] type='1000 * var * {"chamber": int32, "...'>
```
Date: 2020-10-20 16:26:55 From: Jim Pivarski (@jpivarski)
And then my second question will be, what do the indexes in sim_id_gem_cluster refer to? Are they indexes in the whole event or indexes in each group in each event?
Date: 2020-10-20 16:27:11 From: Sven Dildick (@dildick)
they are indices in the event
Date: 2020-10-20 16:27:24 From: Sven Dildick (@dildick)
ultimately, the clusters are not grouped other than per event
Date: 2020-10-20 16:27:28 From: Jim Pivarski (@jpivarski)
Oh, so it only has one level. gem_cluster.type is 1000 * var * {only flat stuff}, right?
Date: 2020-10-20 16:27:53 From: Sven Dildick (@dildick)
yes
Date: 2020-10-20 16:28:11 From: Sven Dildick (@dildick)
e.g. gem_cluster.station is of type <Array [[], [], [1, 1, ... 2, 2, 2, 2, 2, 2]] type='1000 * var * int32'>
Date: 2020-10-20 16:28:36 From: Sven Dildick (@dildick)
so event 1 has no clusters, event 2 has no clusters, event 3 has clusters in at least station 1
Date: 2020-10-20 16:28:46 From: Jim Pivarski (@jpivarski)
Okay, I understand your data. This seems like the sort of thing that ought to be easy but it's not obvious to me right now... Thinking...
Date: 2020-10-20 16:30:22 From: Jim Pivarski (@jpivarski)
I made these to try it interactively (I always have to):
>>> sim_id_gem_cluster = ak.Array([[], [], [[2, 5, 6, 3]]])
>>> sim_id_gem_cluster
<Array [[], [], [[2, 5, 6, 3]]] type='3 * var * var * int64'>
>>> gem_cluster = ak.Array([{"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}, {"id": 6}])
>>> gem_cluster
<Array [{id: 0}, {id: 1}, ... {id: 5}, {id: 6}] type='7 * {"id": int64}'>
Date: 2020-10-20 16:31:38 From: Jim Pivarski (@jpivarski)
Actually, the gem_cluster was wrong; it's more like
>>> gem_cluster = ak.Array([[], [], [{"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}, {"id": 6}]])
>>> gem_cluster
<Array [[], ... {id: 4}, {id: 5}, {id: 6}]] type='3 * var * {"id": int64}'>
Date: 2020-10-20 16:35:56 From: Jim Pivarski (@jpivarski)
The thing we're going to want to do is to make the gem_cluster one level deeper by putting all of its contents into a single list.
Date: 2020-10-20 16:37:32 From: Jim Pivarski (@jpivarski)
No, that's not it. That took advantage of the fact that my example sim_id_gem_cluster had only one list in the last event. Using a more general one now:
>>> sim_id_gem_cluster = ak.Array([[], [], [[2, 5], [6, 2]]])
>>> sim_id_gem_cluster
<Array [[], [], [[2, 5], [6, 2]]] type='3 * var * var * int64'>
Date: 2020-10-20 16:38:12 From: Sven Dildick (@dildick)
Right, it's first per event, then per muon
Date: 2020-10-20 16:38:35 From: Sven Dildick (@dildick)
And two muons can match to the same clusters
Date: 2020-10-20 16:38:55 From: Jim Pivarski (@jpivarski)
What should be happening here is that gem_cluster, having one less level of depth, should be broadcasting to fit the deeper one. But broadcasting only automatically happens for math (NumPy ufuncs and such), not for slicing. So we might need to do an explicit broadcasting.
Date: 2020-10-20 16:40:52 From: Sven Dildick (@dildick)
I suppose unless we zip everything together?
Date: 2020-10-20 16:41:08 From: Sven Dildick (@dildick)
e.g. zip together the sim_muon properties and the cluster properties
Date: 2020-10-20 16:41:25 From: Jim Pivarski (@jpivarski)
And they don't broadcast (I'm using ak.broadcast_arrays) because they have different lengths, yeah. So far, this "clearly you want to be able to do this" problem is looking kinda hard.
We always have an escape valve of using Numba, which I'll recommend if I run out of time or the solution to this looks horrendously complex.
I don't see a use of zipping here.
Date: 2020-10-20 16:41:44 From: Jim Pivarski (@jpivarski)
Or maybe... (and that was my blind spot)...
Date: 2020-10-20 16:42:31 From: Jim Pivarski (@jpivarski)
No, zipping does broadcasting. That's equivalent to what I was trying with ak.broadcast_arrays.
Date: 2020-10-20 16:43:16 From: Sven Dildick (@dildick)
So would it be better to zip all arrays directly into a single "event" structure, without defining sim_muon and cluster separately?
Date: 2020-10-20 16:44:34 From: Jim Pivarski (@jpivarski)
Whether everything is in one array called "events" or in separate arrays like this, the problems are the same. (It's essentially renaming: named Python objects become projections of the Awkward Array.)
Date: 2020-10-20 16:48:10 From: Jim Pivarski (@jpivarski)
So—the gem_cluster is
[[], [], [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}]]
but we want to turn it into
[[], [], [[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}],
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}]]]
because then each muon would get to pick from the full set of ids.
Date: 2020-10-20 16:49:33 From: Jim Pivarski (@jpivarski)
Basically, we'd like to "duplicate" the full set of clusters for each of the muons in each event. The first two events have zero muons; the last event has two; that's why I made this structure.
Date: 2020-10-20 16:49:46 From: Jim Pivarski (@jpivarski)
I can do that in a low-level way, but I'm at a loss for how to do that at high level.
Date: 2020-10-20 16:50:32 From: Jim Pivarski (@jpivarski)
Which means we're missing some high-level functions.
Date: 2020-10-20 16:51:03 From: Sven Dildick (@dildick)
okay, maybe I can repost my question on stackoverflow, in case you want to think about it
Date: 2020-10-20 16:51:43 From: Jim Pivarski (@jpivarski)
And, by the way, you can also do it in Numba and it's just for loops. That's probably the most familiar to you. Do you know where you'd find an example of that?
Date: 2020-10-20 16:52:45 From: Jim Pivarski (@jpivarski)
StackOverflow would be good, but I think this is going to lead to a feature request: a high-level function for what I'm about to give you a low-level recipe for. (You can self-answer with the low-level recipe, adapted to your minimally working example. That doesn't prohibit other, high-level answers later.)
Date: 2020-10-20 16:52:46 From: Sven Dildick (@dildick)
I did not know that...
Date: 2020-10-20 16:53:12 From: Sven Dildick (@dildick)
A relevant example would be nice :-)
Date: 2020-10-20 16:53:39 From: Jim Pivarski (@jpivarski)
You might want to shoot me after you see it in Numba, as in "Why are we wasting our time with trying to do it in slices?" It's so that both are options in the future, so that there are choices.
Date: 2020-10-20 16:56:48 From: Jim Pivarski (@jpivarski)
Dang, I can't find any Numba examples. There's one I was thinking of in GitHub Issues, and I can't find it. It was answering a totally different question.
Date: 2020-10-20 16:58:16 From: Jim Pivarski (@jpivarski)
Well, there's this: https://github.com/jpivarski-talks/2020-06-08-uproot-awkward-columnar-hats/blob/d95042f9e7d0e169cacd711d9751993646b0f7a2/02-columnar-analysis-awkward-array.ipynb
Date: 2020-10-20 16:58:40 From: Jim Pivarski (@jpivarski)
You can self-answer multiple times on StackOverflow, once with Numba when you get it.
Date: 2020-10-20 16:59:52 From: Jim Pivarski (@jpivarski)
I'll have to answer later, but I'll put a low-level answer to your question here, so that you have more fodder.
Date: 2020-10-20 17:00:12 From: Sven Dildick (@dildick)
okay, thanks for your time
Date: 2020-10-20 19:25:56 From: Jim Pivarski (@jpivarski)
@dildick My other meeting just finished and I have a low-level solution for you. As a reminder, my version of your arrays are
>>> sim_id_gem_cluster
<Array [[], [], [[2, 5], [6, 2]]] type='3 * var * var * int64'>
>>> gem_cluster
<Array [[], ... {id: 4}, {id: 5}, {id: 6}]] type='3 * var * {"id": int64}'>
>>> gem_cluster.tolist()
[[], [], [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}]]
The problem is that each muon in each event (2 levels of jaggedness) indexes a cluster in each event (1 level of jaggedness). To be able to slice gem_cluster with sim_id_gem_cluster, you need a full copy of each event's clusters for every muon. In other words, we need to make gem_cluster look like this:
[[], [], [[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}],
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}]]]
Date: 2020-10-20 19:33:59 From: Jim Pivarski (@jpivarski)
That's the part that unfortunately doesn't have a high-level function. It can be made by creating an index that repeats as many times as the muons:
>>> index = np.repeat(np.arange(len(sim_id_gem_cluster)), ak.num(sim_id_gem_cluster))
>>> index
array([2, 2])
and then using that index to build a (low-level "layout") IndexedArray:
>>> indexedarray = ak.layout.IndexedArray64(ak.layout.Index64(index), gem_cluster.layout)
>>> ak.to_list(indexedarray)
[[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}],
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}]]
(If you look at that indexedarray, you'll see its internal structure; that's why I pass it through ak.to_list.)
Then we can build a doubly jagged array using parts of the sim_id_gem_cluster:
>>> new_gem_cluster_layout = ak.layout.ListOffsetArray64(sim_id_gem_cluster.layout.offsets, indexedarray)
>>> ak.to_list(new_gem_cluster_layout)
[[], [], [[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}],
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}]]]
And that's what we wanted to make, so wrap it up as a new (high-level) array:
>>> new_gem_cluster = ak.Array(new_gem_cluster_layout)
>>> new_gem_cluster
<Array [[], ... {id: 4}, {id: 5}, {id: 6}]]] type='3 * var * var * {"id": int64}'>
Date: 2020-10-20 19:36:32 From: Jim Pivarski (@jpivarski)
Now we can simply slice it.
>>> new_gem_cluster[sim_id_gem_cluster]
<Array [[], ... {id: 5}], [{id: 6}, {id: 2}]]] type='3 * var * var * {"id": int64}'>
>>> new_gem_cluster[sim_id_gem_cluster].tolist()
[[], [], [[{'id': 2}, {'id': 5}], [{'id': 6}, {'id': 2}]]]
This act of "copying" the whole set of clusters for each muon to pick from is not expensive: the IndexedArray is functioning as a set of pointers. We've actually just built a set of pointers with the same multiplicity as the muons that point to the set of all clusters in the same event.
Date: 2020-10-20 19:39:43 From: Jim Pivarski (@jpivarski)
@dildick Actually, instead of a StackOverflow post, could you make the above a feature request for the missing high-level function? You shouldn't have to build low-level layouts by hand to solve a problem like this. There ought to be some high-level function for making indexes in sim_id_gem_cluster match up with items in gem_cluster to make new_gem_cluster in one line. The hard part will be determining what a natural interface should be—how you would expect such a function to be named and parameterized. That's where I need to get input from you, because I don't know what would be the most "natural" or "guessable" for you.
Date: 2020-10-20 19:40:03 From: Jim Pivarski (@jpivarski)
For something completely different, here's how you could do it with Numba:
Date: 2020-10-20 19:47:47 From: Jim Pivarski (@jpivarski)
>>> import numba as nb
>>> @nb.jit
... def select(builder, gem_cluster, sim_id_gem_cluster):
... for clusters, muons in zip(gem_cluster, sim_id_gem_cluster):
... builder.begin_list()
... for muon in muons:
... builder.begin_list()
... for i in muon:
... builder.append(clusters[i])
... builder.end_list()
... builder.end_list()
... return builder
...
>>> select(ak.ArrayBuilder(), gem_cluster, sim_id_gem_cluster).snapshot()
<Array [[], ... {id: 5}], [{id: 6}, {id: 2}]]] type='3 * var * var * {"id": int64}'>
Date: 2020-10-20 19:51:52 From: Jim Pivarski (@jpivarski)
You can drop the @nb.jit to make this run slowly, but more easily debuggable. With @nb.jit, it tries to compile the contents, which means that any type errors yield lots of output.
The ak.ArrayBuilder is a way to make structured arrays gradually, which is a better fit for for-loop-style code. Every begin_list must be accompanied by an end_list, and append adds at the right level of depth. An ak.ArrayBuilder is not an ak.Array; that's what the .snapshot() is for.
Date: 2020-10-20 19:55:09 From: Jim Pivarski (@jpivarski)
(For all but the most complex problems, I want it to be possible to do it with slices and also to do it with Numba, so that whichever is more natural for a given problem may be used. There are some limitations in Numba, such as "all indexing must be simple indexing: i must be an int." Thus, you're forced to have three nested for loops in this example. They're compiled, so they're fast, but verbose.)
Date: 2020-10-22 11:32:08 From: Alexander Held (@alexander-held)
Hi, I have a workflow that runs against the HEAD of some dependencies daily, including awkward1 (https://github.com/alexander-held/cabinetry/blob/master/.github/workflows/dependencies-head.yml#L34-L56). It started failing last night when installing awkward1. How do I correctly install the current awkward1 master?
When running this
```
docker run -it --rm python:3.8-slim bash
apt-get update
apt-get install git
git clone --recursive https://github.com/scikit-hep/awkward-1.0.git
cd awkward-1.0/
pip install .
```
the installation fails with
```
CMake Error: CMake was unable to find a build program corresponding to "Unix Makefiles". CMAKE_MAKE_PROGRAM is not set. You probably need to select a different build tool.
[...]
ERROR: Failed building wheel for awkward1
Failed to build awkward1
ERROR: Could not build wheels for awkward1 which use PEP 517 and cannot be installed directly
```
I also tried the same, going back one commit to e1fe6f2497ae007bdddcdbd60a08f7bbec4e1029, with the same error.
Date: 2020-10-22 12:42:28 From: Jim Pivarski (@jpivarski)
Python 3.9, pybind11 2.6.0, and Arrow 2.0.0 have all introduced breaking changes, and this branch: https://github.com/scikit-hep/awkward-1.0/pull/482 has been keeping up with those changes, but it's not ready to merge into master yet because Azure hasn't updated all VMs with Python 3.9 yet and NumPy is failing to compile, because NumPy doesn't have wheels for Python 3.9 yet.
Date: 2020-10-22 12:44:14 From: Jim Pivarski (@jpivarski)
You're running on Python 3.8, so you might only be seeing the pybind11 2.6.0 update (which did go into master because it's a submodule and I was having git problems, and I couldn't see how to update the submodule version in a branch without also updating it in master) and there was one bug-fix to catch up to the latest pybind11. So that means that master will fail to build.
Date: 2020-10-22 12:45:12 From: Chris Burr (@chrisburr)
@alexander-held That looks like you need to install make at the same time as git?
Date: 2020-10-22 12:46:26 From: Jim Pivarski (@jpivarski)
But also, awkward-1.0 master will fail to build. But you're right, @chrisburr, that's a CMake error complaining about the lack of make. I don't see why that would suddenly happen.
Date: 2020-10-22 12:47:55 From: Jim Pivarski (@jpivarski)
I'm going to merge the one Awkward fix for pybind11 2.6.0 into master, so that it will build, when you're ready. This thing with everybody's versions updating at once might still be a problem, though. (An Awkward test will fail because of Arrow.)
Date: 2020-10-22 13:17:56 From: Nicholas Smith (@nsmith-)
but is there a strong motivation to use the head branch of another library in your library's CI?
Date: 2020-10-22 13:18:26 From: Nicholas Smith (@nsmith-)
from a user point of view, its only necessary to make sure the latest release is compatible, no?
Date: 2020-10-22 13:31:04 From: Henry Schreiner (@henryiii)
You can have different branches with different submodule checkouts, the only issue is that you have to type git submodule update when you switch branches. But you’d have to do that when you checkout old tags, etc., anyway.
Date: 2020-10-22 13:32:56 From: Henry Schreiner (@henryiii)
What was the one fix? One of the tests I did before release was compile Awkward against pybind11 master.
Date: 2020-10-22 13:34:55 From: Henry Schreiner (@henryiii)
I think there were some warnings, but it compiled either with no changes or maybe with something so obvious that I did it and then immediately forgot it.
Date: 2020-10-22 13:36:32 From: Jim Pivarski (@jpivarski)
This was the fix: I think it was broken before, but didn't actually complain (not a warning, as the comment says, but an error): https://github.com/scikit-hep/awkward-1.0/pull/482/commits/cade2bdc4be6aa9c2fb7fc17b18d0426b02e40ed
Date: 2020-10-22 13:46:46 From: Henry Schreiner (@henryiii)
Ah, I thought that was something different. That likely is the extended code correctness checking, and I may have tested awkward right before that fix went in. 2.5 would be fine with the new code, too, I believe.
By the way, never use raw POSIX ssize_t. If you ever support Windows + PyPy, it will break. Use py::ssize_t instead.
Date: 2020-10-22 13:53:34 From: Jim Pivarski (@jpivarski)
The pybind11 documentation uses ssize_t as fields of the buffer_info struct. Because of that, NumpyArray has ssize_t fields, which causes friction with the int64_t used everywhere else in Awkward. I run Windows 32-bit and 64-bit tests and ensure that every interface between those pybind-friendly ssize_t integers and Awkward-friendly int64_t integers is marked with an explicit cast (by getting Windows 32-bit down to zero warnings). But I'm not yet testing the Windows + PyPy combination, which might turn up more warnings.
Date: 2020-10-22 13:58:55 From: Alexander Held (@alexander-held)
Thank you! Glad to hear it's expected that it won't build currently, I wanted to be sure that the issue is not in the CI.
Date: 2020-10-22 14:01:00 From: Alexander Held (@alexander-held)
@nsmith- The motivation is to alert me early on, before a release of a dependency goes out. That way I can prepare ahead of time instead of getting surprised and quickly having to fix things on my end. (credit to Matthew Feickert, who introduced this). It's already been useful when pyhf had an API update that broke my code, so I could prepare to release an update quickly after the next pyhf release.
Date: 2020-10-22 14:09:54 From: Henry Schreiner (@henryiii)
@jpivarski The problem here is you are using py::cast<T> to convert from a C++ type to another C++ type, which is unsupported. What’s actually happening is it’s converting the C++ type to py::object, then back down to C++. Then you are giving it to py::make_tuple, which converts it back to py::object. Somewhere in that the new correctness check is throwing an error, I expect. The new form is much better - you convert once to py::object, then pass that one. I’m pretty sure you could drop the py::cast entirely if you want to, though - that’s part of the job of py::make_tuple; it triggers the non-explicit conversion as needed.
Date: 2020-10-22 14:29:30 From: Jim Pivarski (@jpivarski)
That's what I thought. (I didn't write the original code, but deciphering it, I thought it wanted to make the int64_t from length() into a py::object as part of a Python tuple.) I didn't know that py::make_tuple would cast its arguments, though it makes sense that it does. I'm happy with casting it because we will always want the items of that tuple to be Python objects, so it's more explicit than need be, but expresses the intention.
Date: 2020-10-22 14:41:47 From: Henry Schreiner (@henryiii)
No problem with explicit being better than implicit. :) (I think the original version had either two or three implicit casts, I lost count. :) )
Date: 2020-10-22 14:43:54 From: Henry Schreiner (@henryiii)
py::make_tuple always makes a py::tuple, which is not templated, so it must be a tuple of Python objects. There is a PR that would allow the above to be written as py::tuple(py::cast(…)), though (it would require explicit casting in the py::tuple constructor)
Date: 2020-10-22 14:58:36 From: Nicholas Smith (@nsmith-)
but then you are relying on the upstream project master to always build, which is not always a reasonable expectation
Date: 2020-10-22 14:59:21 From: Nicholas Smith (@nsmith-)
isn't the method here to pin your code to a major release of the upstream package so any API change doesn't affect you? and then update the pin when a new release is out in a PR to test
Date: 2020-10-22 15:00:50 From: Alexander Held (@alexander-held)
It's true that it makes sense only if the master builds most of the time. In this specific setup that is generally the case. I'm also not checking against all dependencies, just the main ones that I expect may cause issues.
Date: 2020-10-22 15:02:10 From: Alexander Held (@alexander-held)
And for pinning: yes, in general that seems like a nicer way of handling things. In practice I'm using some pyhf stuff that's somewhat fragile and deep inside the codebase (the last thing broke in a pyhf patch release 0.5.2->0.5.3), so pinning major version isn't enough. I could do it for the other dependencies though now that I think about it...
Date: 2020-10-22 15:03:51 From: Jim Pivarski (@jpivarski)
This might not be true of every project, but it is a goal for Awkward's master branch to always work: it must compile and the tests must run. That's a requirement for merging a branch, as stated in the CONTRIBUTING.md.
Date: 2020-10-22 15:05:05 From: Jim Pivarski (@jpivarski)
Right now is an exceptional time because of the Python 3.9 release (and other packages releasing breaking changes, only some of which are related to Python 3.9).
Date: 2020-10-22 15:06:07 From: Jim Pivarski (@jpivarski)
So I think it's a reasonable thing for pyhf to build against Awkward master, given that we have this rule to say it should always work, but it's not working now for other reasons.
Date: 2020-10-22 15:08:16 From: Jim Pivarski (@jpivarski)
Actually, since this is all going to get cleared up when (a) NumPy releases a version compatible with Python 3.9 and (b) Python 3.9 is added to all the wheels (so that I can merge the branch that adapts to these things), maybe you can just wait, too? Trying to move parts of that fix-it-all PR into master, while the other necessary pieces are third-party dependencies I don't control, doesn't sound very productive. In about a week, it could start working again.
Date: 2020-10-22 15:21:09 From: Alexander Held (@alexander-held)
Hi Jim, sorry if I didn't make this clear before: I'm more than happy to wait for the situation to be resolved, I don't need this to work urgently at all. I only intended to see whether the issue is on my end (and requires some updates to the CI) or expected from awkward1.
Date: 2020-10-22 20:32:41 From: alesaggio (@alesaggio)
Hi everyone, I have a question about awkward arrays. I have started using them recently so apologies if I’m asking something trivial. :)
In my events I am selecting tri-jet objects by doing the following:
trijets = jets.argchoose(3)
What I would like to do is get an index that keeps track of these combinations. In other words, if e.g. one event has 5 jets, then for the same event trijets will contain 7 elements ([(0,1,2),(0,1,3),(0,1,4),(0,2,4),(1,2,3),(1,3,4),(2,3,4)]); I would need, for the same event, the following: [(0,1,2,3,4,5,6)]. Is there any way to accomplish this? I have tried with argmatch() but it doesn't work on my trijets objects (I get *** AttributeError: no column named 'argmatch', I guess because trijets is a JaggedArray and not a JaggedArrayMethod). For context, I need this because I am doing chi-square minimisation for object reconstruction in my analysis. I can compute the chi-square with no problem, but to trace back the particles that minimise it I need to store the indices as well... I would be extremely grateful if you could help! And thanks in advance :)
Date: 2020-10-22 20:38:42 From: Jim Pivarski (@jpivarski)
@alesaggio Since you're just starting, I would recommend starting with Awkward 1, in which the function is named ak.argcombinations. I'm not sure that a function returning [(0, 1, 2, 3, 4, 5, 6)] is what you want: that would be one element of a 7-tuple. Tuples have a fixed number of slots in each entry, so such a thing would be unable to deal with any events with fewer or more than 7 particles. I'm unclear on what you mean by "an index that keeps track of the combinations." Since you're using the "arg" form, you're already getting indexes (drop "arg" to get the actual combinations, rather than indexes to them).
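For concreteness, the index triples that the "arg" form returns for an event with 5 jets correspond to itertools.combinations over the jet indices. A pure-Python sketch (illustrating the indexing, not the Awkward API itself):

```python
from itertools import combinations

# Index triples for one event with 5 jets, as ak.argcombinations(jets, 3)
# would enumerate them, reproduced here with the standard library.
n_jets = 5
triples = list(combinations(range(n_jets), 3))

print(len(triples))  # C(5, 3) = 10 combinations, not 7
print(triples[:3])   # [(0, 1, 2), (0, 1, 3), (0, 1, 4)]
```

Note that 5 jets give C(5, 3) = 10 triples; each triple is a tuple of indexes into the original jets list, which is exactly the "index that keeps track of the combinations."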
Date: 2020-10-22 21:05:16 From: alesaggio (@alesaggio)
Hi @jpivarski , thanks, indeed I don't need a fixed number of slots for each event. The length will depend on the number of trijet combinations. Just to bring an example, imagine that I have a W decaying leptonically and a top decaying hadronically, so I would need to match all possible leptons+MET to the W mass and all possible combinations of trijet objects to the top mass. In the end, when I build some chi-square to minimize, I also want to know the three jets and the one lepton that minimised it. Plus, I need to know which objects they were assigned to. I guess tracing back the lepton is not a problem, however it is not trivial for the jets because I'm dealing with combinations of them.
I guess in the end I would need something along the lines of trijets.cross(leptons), which however doesn't work...
Date: 2020-10-22 23:01:01 From: Jim Pivarski (@jpivarski)
ak.cartesian([ak.combinations(jets, 3), leptons])
would give you lists (per event) of objects like
((jet1, jet2, jet3), lepton1)
and then if you
comb_trijets, comb_lepton = ak.unzip(ak.cartesian([ak.combinations(jets, 3), leptons]))
comb_j1, comb_j2, comb_j3 = ak.unzip(comb_trijets)
then comb_j1, comb_j2, comb_j3, and comb_lepton would be jagged arrays of jet1, jet2, jet3, and lepton, sufficiently duplicated to represent all combinations, ready to be applied in formulae. Then when you need to find the best combination (according to some criterion), you can ak.argmax(·, axis=1) to maximize over each list (with the criterion as the ·). Any of these can be used with "arg" to get the indexes, rather than the objects.
Major hint: try this with small samples that have a minimal number of fields in an interactive environment like Jupyter. Anyone, no matter how experienced, will have to tinker with it to express what they want.
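The shape of that recipe can be tinkered with in pure Python first. A toy sketch of the enumerate-then-argmax pattern for one event, with hypothetical jet/lepton values and a stand-in scoring function (the real analysis would use ak.cartesian, ak.combinations, and ak.argmax on full arrays):

```python
from itertools import combinations, product

# One toy event: jet and lepton "pt" values (made-up numbers).
jets = [40.0, 30.0, 20.0, 10.0]
leptons = [25.0, 15.0]

# All (trijet, lepton) pairings, as ak.cartesian over trijet combinations
# and leptons would enumerate them within this event.
pairings = list(product(combinations(jets, 3), leptons))

# A toy criterion to maximize (a stand-in for a chi-square-based score).
def score(trijet, lepton):
    return sum(trijet) + lepton

# ak.argmax(..., axis=1) corresponds to picking the index of the best
# pairing within each event's list.
best_index = max(range(len(pairings)), key=lambda i: score(*pairings[i]))
best_trijet, best_lepton = pairings[best_index]
```

The index returned per event is what lets you trace back which jets and which lepton minimized (or maximized) the criterion.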
Date: 2020-10-23 09:41:52 From: alesaggio (@alesaggio)
Thanks @jpivarski . This might be indeed what I need. I'm having issues with the ak.combinations function though:
trijets = ak.combinations(jets,3)
<Array [[(['hadronFlavour', ... )]] type='702 * var * (var * string, var * strin...'>
but I get error when trying to access the p4 (and its components):
trijets.p4.pt
*** AttributeError: no field named 'p4'
(https://github.com/scikit-hep/awkward-1.0/blob/0.3.1/src/awkward1/highlevel.py#L1095)
Date: 2020-10-23 12:36:35 From: Jim Pivarski (@jpivarski)
Look at the type attribute to see what fields the object has. If p4 isn't among them, then it would have to complain like this.
Date: 2020-10-23 12:38:49 From: alesaggio (@alesaggio)
actually I think it
Date: 2020-10-23 12:39:50 From: alesaggio (@alesaggio)
it's because my jets object was an awkward0 JaggedArray while trijets is awkward1; I just converted it and it seems like it's working!
Date: 2020-10-23 12:40:40 From: alesaggio (@alesaggio)
thanks a lot for the help
Date: 2020-10-23 13:41:42 From: Jim Pivarski (@jpivarski)
Great!
Date: 2020-10-23 14:55:11 From: alesaggio (@alesaggio)
Hi again, I just stumbled upon the following:
(Pdb) (comb_j1.p4+comb_j2.p4)["fMass"]
<Array [[21.2], [15.8, ... 36.8, 30.5, 18.6]] type='966 * var * float64'>
(Pdb) comb_j1.p4["fMass"]+comb_j2.p4["fMass"]
<Array [[21.2], [15.8, ... 36.8, 30.5, 18.6]] type='966 * var * float64'>
It seems like awkward1 doesn't recognize Lorentz vectors and therefore doesn't "combine" them correctly... Is there a way to solve this?
Date: 2020-10-23 14:56:44 From: Jim Pivarski (@jpivarski)
This is the "vector" project: https://github.com/scikit-hep/vector
which is in development, but it might have the features you need. (Since it's early, it would have to be opt-in for now.)
Date: 2020-10-23 14:59:50 From: alesaggio (@alesaggio)
I see, I will have a look at this then, thanks!
Date: 2020-11-01 22:19:06 From: Yi-Mu "Enoch" Chen (@yimuchen)
Can I ask about the expected performance penalty for using numba to construct arrays instead of using built-in methods? Currently, I'm trying to calculate a quantile of a doubly nested structure (10K * var * var → 10K * var), and the function I end up using is very slow (more than 1000x slower than numpy with regular structures). I was wondering if this is to be expected, and if so, what I can do to speed up this function?
I've also tried numba with the mean function, and I get similar slowdowns. It looks like the slowdown is mainly due to the construction of a doubly nested structure, as the slowdown is only ~30x when I'm computing arrays of just 2 dimensions.
import awkward1 as ak
import numpy as np
import numba

@numba.jit()
def quantile(x, q=0.5, builder=ak.ArrayBuilder()):
    builder.begin_list()
    for l in x:
        builder.begin_list()
        for s in l:
            s = ak.to_numpy(s)
            if len(s):
                builder.real(np.quantile(s, q))
            else:
                builder.real(0)
        builder.end_list()
    builder.end_list()
    return builder

numpy_x = np.random.rand(10000, 10, 20)
x = ak.from_numpy(numpy_x)
np.quantile(numpy_x, q=0.75, axis=-1)  # 10 ms
quantile(x, q=0.75, builder=ak.ArrayBuilder()).snapshot()  # 40 s
Date: 2020-11-02 14:53:35 From: Jim Pivarski (@jpivarski)
The slow thing here is likely the ArrayBuilder, which is dynamically typed.
Date: 2020-11-02 14:58:31 From: Jim Pivarski (@jpivarski)
To diagnose, maybe you could try removing the ArrayBuilder lines, just to see how much they affected the speed. You'd have to put the output somewhere, into a NumPy array, for instance. Otherwise, LLVM might optimize away the unused calculation.
Date: 2020-11-02 14:59:32 From: Jim Pivarski (@jpivarski)
ArrayBuilder is general but slow. Perhaps we need a more specialized builder of lists of lists of numbers. Particularly if the depth is known, a lot of the slow-down can be avoided.
Date: 2020-11-02 18:56:35 From: Nicholas Smith (@nsmith-)
but 1000x slower?
Date: 2020-11-02 20:47:21 From: Jim Pivarski (@jpivarski)
It depends on "slower than what?"
Date: 2020-11-02 21:00:46 From: Jim Pivarski (@jpivarski)
Actually, I just tried this and there's a ton of warnings because @numba.jit couldn't compile it. It comes down to calling ak.to_numpy in the function—none of the ak.* functions can be used in a compiled function. So in this case, the "slower than what?" is slower than pure Python, with no Numba acceleration at all.
Date: 2020-11-02 22:13:08 From: Yi-Mu "Enoch" Chen (@yimuchen)
So, when using numba, I'm guessing the option forceobj=True doesn't help with performance? I tried removing the ak.to_numpy statement, but the compiler doesn't like np.quantile(s) with s being an awkward array (though the function works if I remove numba.jit).
Date: 2020-11-02 22:26:24 From: Nicholas Smith (@nsmith-)
you can try numba.njit to be sure it is always in "no-python" mode
Date: 2020-11-02 22:28:53 From: Jim Pivarski (@jpivarski)
forceobj=True is the opposite of helping performance. It used to be the default, which was bad because normally when people use Numba, they want to speed things up, so now it's opt-in. (There are cases where you'd want to use Python objects in Numba, especially mixed-mode, but forceobj=True forces it to be objects/slow everywhere.)
@numba.njit is a shortcut for passing nopython=True, which is becoming the new default.
Date: 2020-11-02 22:31:46 From: Jim Pivarski (@jpivarski)
Having flat Awkward Arrays be recognized as NumPy arrays within a Numba-compiled function would be a good feature request. I'm not sure, at the moment, how to do it, but that would be a better interface than to require users to type ak.to_numpy. It won't be possible to implement all ak.* functions, so the easiest rule to express is, "No ak.* functions will work." But in functions that expect NumPy arrays, we should at least try to implicitly convert the Awkward Arrays into NumPy.
Date: 2020-11-02 22:37:32 From: Jim Pivarski (@jpivarski)
To solve your specific problem now, perhaps the best thing to do is to ak.pad_none to make the jagged array rectangular with None values, then ak.fill_none to fill the None values with np.nan, convert it into a NumPy array (because it's rectilinear now), and use np.nanquantile, which ignores np.nan values.
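The pad-then-nanquantile idea, sketched with plain NumPy on a single jagged list of lists (the full Awkward version uses ak.pad_none and ak.fill_none; the values here are made up):

```python
import numpy as np

# One jagged structure: variable-length inner lists (toy values).
jagged = [[1.0, 2.0, 3.0], [4.0, 5.0], []]

# Pad every inner list to a common length with NaN, mirroring
# ak.pad_none(...) followed by ak.fill_none(..., np.nan).
width = max(len(row) for row in jagged)
padded = np.array([row + [np.nan] * (width - len(row)) for row in jagged])

# np.nanquantile ignores the NaN padding; an all-NaN (empty) row yields
# NaN and a RuntimeWarning.
q = np.nanquantile(padded, 0.5, axis=1)
# q[0] is the median of [1, 2, 3], q[1] of [4, 5], q[2] is NaN.
```

The NaN entries in the result mark lists that were originally empty, which matters for reconstructing the jagged shape afterward.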
Date: 2020-11-02 22:37:57 From: Jim Pivarski (@jpivarski)
@yimuchen ^^^ (Sorry, I forgot to at-sign you in.)
Date: 2020-11-02 23:49:50 From: Yi-Mu "Enoch" Chen (@yimuchen)
@jpivarski Ah, That's a much more elegant solution than the one I was using (ak.sort + ak.argsort to calculate quantiles by indices), thanks!
Date: 2020-11-03 14:49:54 From: Yi-Mu "Enoch" Chen (@yimuchen)
@jpivarski A question about storing the results. Originally we had, say, a 100 × var × var awkward array; after the padding we now have a 100×10×20 numpy array that we used to calculate the quantile over the final axis. The quantile result is now a 100×10 numpy/awkward array with nan padding the second axis. How can I reduce this back into a 100 × var awkward array? When running the masking operation x[~np.isnan(x)], this reduces the array to a 1-dimensional array, which is not what I want.
Date: 2020-11-03 15:32:35 From: Jim Pivarski (@jpivarski)
What we're missing is a good way to turn regular dimensions into jagged dimensions. Starting with
>>> import awkward1 as ak
>>> import numpy as np
>>> original = ak.Array([[[0, 1, 2], []], [[3, 4]], [], [[5], [6, 7, 8, 9]]])
>>> original
<Array [[[0, 1, 2], []], ... 5], [6, 7, 8, 9]]] type='4 * var * var * int64'>
>>> padded = np.asarray(
... ak.fill_none(
... ak.pad_none( # yikes! this is a mouthful!
... ak.fill_none(
... ak.pad_none(original, 2, axis=1),
... []),
... 4, axis=2),
... np.nan))
>>> padded
array([[[ 0., 1., 2., nan],
[nan, nan, nan, nan]],
[[ 3., 4., nan, nan],
[nan, nan, nan, nan]],
[[nan, nan, nan, nan],
[nan, nan, nan, nan]],
[[ 5., nan, nan, nan],
[ 6., 7., 8., 9.]]])
>>> quantiles = np.nanquantile(padded, 0.75, axis=2)
/home/jpivarski/miniconda3/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1389: RuntimeWarning: All-NaN slice encountered
result = np.apply_along_axis(_nanquantile_1d, axis, a, q,
>>> quantiles
array([[1.5 , nan],
[3.75, nan],
[ nan, nan],
[5. , 8.25]])
>>> regular_slice = ak.from_numpy(~np.isnan(quantiles), regulararray=True)
>>> regular_slice
<Array [[True, False], ... [True, True]] type='4 * 2 * bool'>
>>> jagged_slice = ak.Array(regular_slice.layout.toListOffsetArray64(False))
>>> ak.Array(quantiles)[jagged_slice]
<Array [[1.5], [3.75], [], [5, 8.25]] type='4 * var * float64'>
It's the second-to-last step that needs a high-level interface. The slicing rules are different for regular slices (4 * 2 * bool) and jagged slices (4 * var * bool), but the only way we currently have for turning the one into the other is to unwrap regular_slice (extract its layout), call an internal method (which converts a RegularArray into a ListOffsetArray), and wrap it back up as an ak.Array.
If you're wondering why the slicing rules are different between regular and jagged slices, it's because NumPy's rules don't extend into the jagged array behavior we want (we want to be able to select particles with an array of booleans derived from operations like jagged > XYZ) and Awkward Array must adhere to NumPy's rules where they overlap. They only truly overlap when the slice has regular dimensions, like a NumPy array with a fixed shape, so that's why we make a distinction between regular and jagged dimensions, even when an irregular-typed dimension (var) might in practice be regular.
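As a plain-Python mental model of that final jagged-boolean slice: each list keeps only the elements whose mask entry is True, so the NaN quantiles fall away per event (values copied from the worked example above):

```python
import math

# Per-event quantiles with NaN padding, as in the example above.
quantiles = [[1.5, float("nan")], [3.75, float("nan")],
             [float("nan"), float("nan")], [5.0, 8.25]]

# A jagged boolean slice keeps, in each list, the elements whose mask is
# True, producing variable-length output.
selected = [[v for v in row if not math.isnan(v)] for row in quantiles]
# selected == [[1.5], [3.75], [], [5.0, 8.25]]
```

This is the semantics that ak.Array(quantiles)[jagged_slice] implements at array scale, without Python loops.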
Date: 2020-11-03 15:36:13 From: Yi-Mu "Enoch" Chen (@yimuchen)
Ah yes, the second to last line is the ingredient that I was missing. Thanks!
Date: 2020-11-03 15:37:10 From: Jim Pivarski (@jpivarski)
It can become a feature request for "converting regular axes into irregular ones and vice-versa." Such a function would have multiple uses.
Date: 2020-11-12 10:36:20 From: Simon B. (@sbuse)
Hello everybody, I'm currently working on a project where I would like to use awkward (1) arrays and uproot (4) to work with a large collection of ROOT files. The issue is that I have to save the awkward arrays into some format that allows the ROOT-like structure, meaning non-table-like data. I've been trying to read into the topic but I'm absolutely lost about which file type (hdf5, arrow, parquet, ...) to use for saving, and whether it is even possible at the moment.
Date: 2020-11-12 13:28:32 From: Jim Pivarski (@jpivarski)
"non table like data" → not HDF5, but Arrow and Parquet are good options. Of these two, Parquet is a file format. (Arrow's IPC format can be saved in files, but that's not common or compressed.) The functions are ak.to_parquet and ak.from_parquet.
Parquet does change the data because it assumes that every level of structure is "nullable" (i.e. "None" values can appear anywhere in the structure), but this distinction can usually be ignored: the type changes, but the values don't. The Parquet I/O goes through the pyarrow package, which hasn't implemented all data structures—WE are the motivating use-case. While we agitate to get the missing structures implemented, the ak.to_parquet function has an explode_records parameter to work around one: lists of records.
It's also possible to get 100% fidelity through the to/from arrayset functions, but these are intended to build file format interfaces. Currently, pickle goes through these functions, so you can pickle your Awkward Arrays without any changes, though this isn't an efficient way to store it (uncompressed, can include inaccessible data, which can lead to surprising file sizes).
Date: 2020-11-12 13:48:43 From: Simon B. (@sbuse)
Thanks for the quick answer. So it looks like the Parquet file format is the answer to my problem. I tried to run the minimal ak.to_parquet example but unfortunately this directly kills my jupyter notebook kernel.
import awkward1 as ak1
array1 = ak1.Array([[1, 2, 3], [], [4, 5], [], [], [6, 7, 8, 9]])
ak1.to_parquet(array1, "array1.parquet")
Date: 2020-11-12 13:49:45 From: Jim Pivarski (@jpivarski)
"Kills it?" That's not a big example, either: it couldn't be running out of memory.
Date: 2020-11-12 13:50:36 From: Jim Pivarski (@jpivarski)
% python
Python 3.8.5 | packaged by conda-forge | (default, Jul 31 2020, 02:39:48)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import awkward1 as ak1
>>> array1 = ak1.Array([[1, 2, 3], [], [4, 5], [], [], [6, 7, 8, 9]])
>>> ak1.to_parquet(array1, "array1.parquet")
>>>
% lsl array1.parquet
-rw-rw-r-- 1 jpivarski jpivarski 736 Nov 12 07:50 array1.parquet
Date: 2020-11-12 13:50:54 From: Jim Pivarski (@jpivarski)
What's your pyarrow version?
Date: 2020-11-12 13:51:16 From: Jim Pivarski (@jpivarski)
>>> import pyarrow
>>> pyarrow.__version__
'2.0.0'
Date: 2020-11-12 13:53:37 From: Simon B. (@sbuse)
Okay, I have to update that. The one I have is version 0.17.1.
I'll try again with the newer version.
Date: 2020-11-12 13:54:55 From: Jim Pivarski (@jpivarski)
Yeah. Between 0.x and 1.x, there are known backward incompatibilities. Within a few months of releasing 1.0, they released 2.0, but I didn't directly observe any backward incompatible changes there.
Date: 2020-11-12 13:55:22 From: Jim Pivarski (@jpivarski)
What do you mean by "killed the kernel?" There was no error message/exception stack trace?
Date: 2020-11-12 13:56:31 From: Jim Pivarski (@jpivarski)
We can't make Awkward compatible with pyarrow 0.x and 1+, but if the behavior for 0.x is so lethal, we should explicitly check the version and raise an error with upgrade instructions.
Date: 2020-11-12 13:58:09 From: Simon B. (@sbuse)
Unfortunately i don't get an error message besides the :
The kernel appears to have died. It will restart automatically.
Date: 2020-11-12 13:59:39 From: Jim Pivarski (@jpivarski)
So the actual error must be happening in pyarrow's Cython side, since a Python-side error would at least give a stack trace (and your example is too small to be a memory error, which can also kill a process). I'll put in an explicit version check.
Date: 2020-11-12 13:59:46 From: Chris Burr (@chrisburr)
Did you check the log of jupyter itself? Nasty errors sometimes manage to avoid the jupyter log capture
Date: 2020-11-12 13:59:46 From: Jim Pivarski (@jpivarski)
Thanks for the feedback!
Date: 2020-11-12 14:00:09 From: Jim Pivarski (@jpivarski)
Oh—could you just do that example in a Python terminal?
Date: 2020-11-12 14:05:04 From: Simon B. (@sbuse)
Python 3.6.10 (default, Jan 16 2020, 09:12:04) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import awkward1 as ak1
>>> array1 = ak1.Array([[1, 2, 3], [], [4, 5], [], [], [6, 7, 8, 9]])
>>> ak1.to_parquet(array1, "array1.parquet")
[1] 30969 illegal hardware instruction (core dumped) python3
Date: 2020-11-12 14:07:43 From: Jim Pivarski (@jpivarski)
Wow—it looks like it wasn't compiled for your hardware. (You can get this kind of error if a compiled binary includes AVX instructions and you don't have AVX in your processor.) That might have been a mistake made in the 0.x era that doesn't have anything to do with changes in format, but hopefully they've fixed it now.
Date: 2020-11-12 14:09:07 From: Chris Burr (@chrisburr)
@sbuse How did you install arrow and awkward1? And what CPU do you have? (cat /proc/cpuinfo and look for "model name")
Date: 2020-11-12 14:12:08 From: Simon B. (@sbuse)
I'm working on the cluster of my university and installed awkward1 with pip install awkward1 --user . The pyarrow was preinstalled.
Date: 2020-11-12 14:12:23 From: Chris Burr (@chrisburr)
Does import pyarrow work?
Date: 2020-11-12 14:12:29 From: Simon B. (@sbuse)
yes
Date: 2020-11-12 14:21:29 From: Simon B. (@sbuse)
sbuse@farm-ui:~> python3
Python 3.6.10 (default, Jan 16 2020, 09:12:04) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> pyarrow.__version__
'2.0.0'
>>> import awkward1 as ak1
>>> array1 = ak1.Array([[1, 2, 3], [], [4, 5], [], [], [6, 7, 8, 9]])
>>> ak1.to_parquet(array1, "array1.parquet")
[1] 8655 illegal hardware instruction (core dumped) python3
Date: 2020-11-12 14:51:21 From: Chris Burr (@chrisburr)
@sbuse Can you get a backtrace with gdb to check which package contains the bad instruction? Something like this should do it:
$ gdb --args python -c 'import awkward1 as ak1; array1 = ak1.Array([[1, 2, 3], [], [4, 5], [], [], [6, 7, 8, 9]]); ak1.to_parquet(array1, "array1.parquet")'
> run
> bt
Date: 2020-11-12 17:29:58 From: Simon B. (@sbuse)
Okay, this prints 27 statements like this:
#7 0x00007ffff79ba905 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#8 0x00007ffff797525b in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#9 0x00007ffff79f8b06 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#10 0x00007ffff7a000f4 in _PyFunction_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#11 0x00007ffff7974efe in _PyObject_FastCallDict () from /usr/lib64/libpython3.6m.so.1.0
#12 0x00007ffff7975761 in _PyObject_Call_Prepend () from /usr/lib64/libpython3.6m.so.1.0
#13 0x00007ffff797525b in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#14 0x00007ffff79bd11d in ?? () from /usr/lib64/libpython3.6m.so.1.0
#15 0x00007ffff79ba94e in ?? () from /usr/lib64/libpython3.6m.so.1.0
#16 0x00007ffff797525b in PyObject_Call () from /usr/lib64/libpython3.6m.so.1.0
#17 0x00007ffff79f8b06 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#18 0x00007ffff79fed85 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#19 0x00007ffff79fe476 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#20 0x00007ffff79f741a in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.6m.so.1.0
#21 0x00007ffff79f66b1 in PyEval_EvalCodeEx () from /usr/lib64/libpython3.6m.so.1.0
#22 0x00007ffff79f63db in PyEval_EvalCode () from /usr/lib64/libpython3.6m.so.1.0
#23 0x00007ffff7a78e22 in ?? () from /usr/lib64/libpython3.6m.so.1.0
#24 0x00007ffff7a78dca in PyRun_StringFlags () from /usr/lib64/libpython3.6m.so.1.0
#25 0x00007ffff7a78d1c in PyRun_SimpleStringFlags () from /usr/lib64/libpython3.6m.so.1.0
#26 0x00007ffff7a7fe2d in Py_Main () from /usr/lib64/libpython3.6m.so.1.0
#27 0x0000555555554de5 in main ()
(gdb)
Date: 2020-11-12 17:31:29 From: Jim Pivarski (@jpivarski)
This might be some sort of stack trace.
Date: 2020-11-12 17:32:10 From: Jim Pivarski (@jpivarski)
Does it end with something different?
Date: 2020-11-12 17:33:58 From: Chris Burr (@chrisburr)
@sbuse Can you check what the CPU is by running cat /proc/cpuinfo and looking for "model name"
Date: 2020-11-12 17:34:42 From: Jim Pivarski (@jpivarski)
Just speculating, but maybe it's infinite recursion? Normally, the message would be "stack overflow" or something like that, but maybe instead of detecting that it was out of stack space, it tried to execute some memory that wasn't part of the stack and that's the illegal instruction. (If it did that, that would be a security vulnerability. Injection attacks could use that to take over the process.) Using the wrong version of pyarrow with Awkward could plausibly lead to an infinite recursion.
Date: 2020-11-12 17:35:42 From: Simon B. (@sbuse)
@chrisburr AMD Opteron(tm) Processor 6136 is the model name
Date: 2020-11-12 17:35:45 From: Jim Pivarski (@jpivarski)
These are a bunch of Python C API calls, which is—I think—what Cython generates...
Date: 2020-11-12 17:36:31 From: Chris Burr (@chrisburr)
Can you edit the stack trace again to show the first few lines? When you added the others you lost the first few frames.
Date: 2020-11-12 17:37:46 From: Simon B. (@sbuse)
(gdb) bt
#0 0x00007fffbf6b10dc in parquet::arrow::(anonymous namespace)::NodeToSchemaField(parquet::schema::Node const&, parquet::internal::LevelInfo, parquet::arrow::(anonymous namespace)::SchemaTreeContext*, parquet::arrow::SchemaField const*, parquet::arrow::SchemaField*) () from /home/uzh/sbuse/.local/lib/python3.6/site-packages/pyarrow/libparquet.so.200
#1 0x00007fffbf6b2fe1 in parquet::arrow::(anonymous namespace)::ListToSchemaField(parquet::schema::GroupNode const&, parquet::internal::LevelInfo, parquet::arrow::(anonymous namespace)::SchemaTreeContext*, parquet::arrow::SchemaField const*, parquet::arrow::SchemaField*) () from /home/uzh/sbuse/.local/lib/python3.6/site-packages/pyarrow/libparquet.so.200
#2 0x00007fffbf6b1557 in parquet::arrow::(anonymous namespace)::NodeToSchemaField(parquet::schema::Node const&, parquet::internal::LevelInfo, parquet::arrow::(anonymous namespace)::SchemaTreeContext*, parquet::arrow::SchemaField const*, parquet::arrow::SchemaField*) () from /home/uzh/sbuse/.local/lib/python3.6/site-packages/pyarrow/libparquet.so.200
Date: 2020-11-12 17:43:47 From: Chris Burr (@chrisburr)
libparquet.so.200 contains SSE4.1 instructions which aren't supported by that CPU
Date: 2020-11-12 17:48:59 From: Chris Burr (@chrisburr)
It's probably a bug which should be reported upstream as it doesn't appear to have been intentional from my quick search
Date: 2020-11-12 18:22:17 From: Simon B. (@sbuse)
Thanks guys for the investigative effort. It looks like this is non-trivial stuff and I'm really not enough of an expert to comment on what to do or what the cause is. I will probably just use PyROOT to write some ROOT files after I have done my analysis with awkward arrays.
Date: 2020-11-12 18:22:43 From: Jim Pivarski (@jpivarski)
Did you still get the error with pyarrow 2.0?
Date: 2020-11-12 18:22:58 From: Jim Pivarski (@jpivarski)
I thought we were only diagnosing pyarrow 0.17.
Date: 2020-11-12 18:23:07 From: Chris Burr (@chrisburr)
This is 2.0.0
Date: 2020-11-12 18:23:43 From: Chris Burr (@chrisburr)
You can probably use pip install --user "pyarrow<2" to use an older version (unless there are awkward related issues for that?)
Date: 2020-11-12 18:24:18 From: Jim Pivarski (@jpivarski)
I think @sbuse saw this error with both versions of pyarrow.
Date: 2020-11-12 18:24:36 From: Jim Pivarski (@jpivarski)
But there are definitely Awkward issues with pyarrow < 1.
Date: 2020-11-12 18:25:02 From: Jim Pivarski (@jpivarski)
I'll be complaining about this soon: https://github.com/scikit-hep/awkward-1.0/issues/531
Date: 2020-11-12 18:25:45 From: Jim Pivarski (@jpivarski)
I can post a JIRA item for the Arrow project, but I won't be able to directly reproduce it: I'll just include what's quoted here.
Date: 2020-11-12 18:27:15 From: Chris Burr (@chrisburr)
If you want to include them, my steps for checking the binary were:
mkdir pyarrow-check
cd pyarrow-check/
wget https://files.pythonhosted.org/packages/f9/a0/f2941d8274435f403698aee63da0d171552a9acb348d37c7e7ff25f1ae1f/pyarrow-2.0.0-cp36-cp36m-manylinux1_x86_64.whl
unzip pyarrow-2.0.0-cp36-cp36m-manylinux1_x86_64.whl
objdump -d pyarrow/libparquet.so.200 > binary.asm
awk '/[ \t](mpsadbw|phminposuw|pmulld|pmuldq|dpps|dppd|blendps|blendpd|blendvps|blendvpd|pblendvb|pblenddw|pminsb|pmaxsb|pminuw|pmaxuw|pminud|pmaxud|pminsd|pmaxsd|roundps|roundss|roundpd|roundsd|insertps|pinsrb|pinsrd|pinsrq|extractps|pextrb|pextrd|pextrw|pextrq|pmovsxbw|pmovzxbw|pmovsxbd|pmovzxbd|pmovsxbq|pmovzxbq|pmovsxwd|pmovzxwd|pmovsxwq|pmovzxwq|pmovsxdq|pmovzxdq|ptest|pcmpeqq|pcmpgtq|packusdw|pcmpestri|pcmpestrm|pcmpistri|pcmpistrm|crc32|popcnt|movntdqa|extrq|insertq|movntsd|movntss|lzcnt)[ \t]/' binary.asm
Date: 2020-11-12 22:55:52 From: Jim Pivarski (@jpivarski)
@sbuse Do you use TensorFlow? When I searched for this on JIRA, I came up with
https://issues.apache.org/jira/browse/ARROW-4272
which says
Tensorflow publishes incompatible Python packages (wheels) that can make other libraries such as Arrow crash (#). I'm not 100% sure it's the problem here but that is quite likely.
In that case, installing everything with Conda fixed the problem.
Date: 2020-11-13 04:42:13 From: Chris Burr (@chrisburr)
In that case tensorflow was in the backtrace so it's not that
Date: 2020-11-13 04:43:52 From: Chris Burr (@chrisburr)
And conda is a possible solution as the shared library doesn't contain the offending instructions:
$ mamba create --name test pyarrow
$ objdump -d ~/miniconda3/envs/test/lib/libparquet.so.200 > binary.asm
$ awk '/[ \t](mpsadbw|phminposuw|pmulld|pmuldq|dpps|dppd|blendps|blendpd|blendvps|blendvpd|pblendvb|pblenddw|pminsb|pmaxsb|pminuw|pmaxuw|pminud|pmaxud|pminsd|pmaxsd|roundps|roundss|roundpd|roundsd|insertps|pinsrb|pinsrd|pinsrq|extractps|pextrb|pextrd|pextrw|pextrq|pmovsxbw|pmovzxbw|pmovsxbd|pmovzxbd|pmovsxbq|pmovzxbq|pmovsxwd|pmovzxwd|pmovsxwq|pmovzxwq|pmovsxdq|pmovzxdq|ptest|pcmpeqq|pcmpgtq|packusdw|pcmpestri|pcmpestrm|pcmpistri|pcmpistrm|crc32|popcnt|movntdqa|extrq|insertq|movntsd|movntss|lzcnt)[ \t]/' binary.asm
24f165: f3 48 0f b8 d0 popcnt %rax,%rdx
24f216: f3 48 0f b8 c6 popcnt %rsi,%rax
24f53a: f3 48 0f b8 d0 popcnt %rax,%rdx
24f60a: f3 48 0f b8 c9 popcnt %rcx,%rcx
Date: 2020-11-13 04:44:31 From: Chris Burr (@chrisburr)
Which also points towards this being a bug in pyarrow's wheel deployment
Date: 2020-11-13 12:55:47 From: Jim Pivarski (@jpivarski)
From Wes McKinney, Arrow lead developer: https://twitter.com/wesmckinn/status/1148350953793490944?lang=en
Date: 2020-11-13 13:02:24 From: Chris Burr (@chrisburr)
It's a common sentiment
Date: 2020-11-13 13:02:30 From: Chris Burr (@chrisburr)
Conda has a lot of problems that need working on
Date: 2020-11-13 13:02:40 From: Chris Burr (@chrisburr)
but Python wheels are fundamentally flawed once you start doing anything non-trivial
Date: 2020-11-20 11:21:17 From: Simon B. (@sbuse)
Hi everybody. I ran into a bit of weird behavior with awkward arrays and wonder how best to work around it. I observed the following:
ak.Array([[1],[2],[]])*ak.Array([[1],[2],[]]) --> works as expected: [[1], [4], []]
ak.Array([[1],[2],[]])*np.array([[1],[2],[3]]) --> works as expected: [[1], [4], []]
ak.Array([[1],[2],[]])*ak.Array([[1],[2],[3]]) --> does not work!
Why does it work with the NumPy array and not with the awkward arrays? I can make it work with awkward arrays when I fill in Nones, but unfortunately it does not work with a mixture of empty lists and None.
ak.Array([[1],[2],[None]])*ak.Array([[1],[2],[3]]) --> works: [[1], [4], [None]]
ak.Array([[1],[2],[None]])*ak.Array([[1],[2],[]]) ---> does not work
I wonder if this behavior is on purpose?
Date: 2020-11-20 13:12:07 From: Jim Pivarski (@jpivarski)
Yes, it's described here: https://awkward-array.readthedocs.io/en/latest/_auto/ak.broadcast_arrays.html
Date: 2020-11-20 13:12:51 From: Jim Pivarski (@jpivarski)
The difference is that the NumPy array has regular-list type and the Awkward Array (made from Python lists) has variable-list type.
Date: 2020-11-20 13:13:22 From: Jim Pivarski (@jpivarski)
>>> ak.type(np.array([[1],[2],[3]]))
3 * 1 * int64
>>> ak.type(ak.Array([[1],[2],[3]]))
3 * var * int64
Date: 2020-11-20 13:18:00 From: Jim Pivarski (@jpivarski)
To return the same result as NumPy when dealing with NumPy-like arrays (the ones with regular-list type), we follow NumPy's broadcasting rules in those cases. For emulating for-loop-like code, however, the direction of these rules must be reversed, so broadcasting is different for variable-list type.
Date: 2020-11-20 13:19:19 From: Jim Pivarski (@jpivarski)
>>> print(ak.Array(np.array([[1], [2], [3]])) + ak.Array(np.array([10, 20, 30])))
[[11, 21, 31], [12, 22, 32], [13, 23, 33]]
>>> print(ak.Array([[1], [2], [3]]) + ak.Array([10, 20, 30]))
[[11], [22], [33]]
Date: 2020-11-20 13:20:54 From: Jim Pivarski (@jpivarski)
The top one has regular-type arrays:
>>> ak.type(ak.Array(np.array([[1], [2], [3]])))
3 * 1 * int64
>>> ak.type(ak.Array(np.array([10, 20, 30])))
3 * int64
and the bottom one has variable-type arrays:
>>> ak.type(ak.Array([[1], [2], [3]]))
3 * var * int64
>>> ak.type(ak.Array([10, 20, 30]))
3 * int64
Date: 2020-11-20 13:23:33 From: Jim Pivarski (@jpivarski)
To see why we need left-broadcasting, imagine that the one-dimensional array is a value-per-event variable, like MET, and the jagged array is a value-per-particle variable, like jet pT, and you want to subtract jet pT from MET. You'd want it to work like this:
>>> met = ak.Array([50, 30, 100])
>>> jetpt = ak.Array([[40, 20, 50], [], [60, 30]])
>>> met - jetpt
<Array [[10, 30, 0], [], [40, 70]] type='3 * var * int64'>
Date: 2020-11-20 13:28:28 From: Jim Pivarski (@jpivarski)
(Your message went away...) If they're both nested, their structures must match.
Date: 2020-11-20 13:28:37 From: Simon B. (@sbuse)
Okay so the problem just appears if both arrays are nested.
Date: 2020-11-20 13:28:52 From: Simon B. (@sbuse)
Yes exactly! okay
Date: 2020-11-20 13:29:45 From: Simon B. (@sbuse)
I have another question but let me prepare it a second.
Date: 2020-11-20 13:30:07 From: Jim Pivarski (@jpivarski)
Right. Implicit broadcasting creates a dimension by duplicating elements. If you already have the right number of dimensions, then they have to match.
Date: 2020-11-20 13:43:59 From: Simon B. (@sbuse)
I have a case where I want to select numbers in multiple nested arrays. I thought I would first build a mask with only Trues and Falses and then select the values by masking. So let's say from the array a I need the values at the positions of the 40 and the 30; I would create a mask like
a = ak.Array([[40, 20, 50], [], [60, 30]])
mask = ak.Array([[True, False, False], [], [False, True]])
and then select my numbers by a[mask], b[mask], c[mask] etc. The way I build the mask is by broadcasting an array with only False in it and then setting the required indices to True by converting it to a list. I feel like this is an inelegant, slow solution, and I'm curious if you have a better idea.
Date: 2020-11-20 13:54:06 From: Jim Pivarski (@jpivarski)
I don't understand what you mean by that, so I'm going to make a few suggestions to see if one sounds like something you can use.
Starting with
>>> a = ak.Array([[40, 20, 50], [], [60, 30]])
>>> mask = ak.Array([[True, False, False], [], [False, True]])
>>> a[mask]
<Array [[40], [], [30]] type='3 * var * int64'>
If it's easier to build an integer index than a mask, that's an option:
>>> indexes = ak.Array([[0], [], [1]])
>>> a[indexes]
<Array [[40], [], [30]] type='3 * var * int64'>
If you want cuts to be composable by not changing the lengths of lists but nevertheless "striking out" data, you can apply it as a None-mask:
>>> a.mask[mask]
<Array [[40, None, None], [], [None, 30]] type='3 * var * ?int64'>
>>> a.mask[mask].mask[mask] # slicing twice is not an error; useful feature for bookkeeping
<Array [[40, None, None], [], [None, 30]] type='3 * var * ?int64'>
Date: 2020-11-20 13:55:18 From: Jim Pivarski (@jpivarski)
If it's about building these indexes or masks quickly, Numba is an option.
Date: 2020-11-20 13:56:37 From: Jim Pivarski (@jpivarski)
If the indexes or masks come from minimizing some other quantity, there's ak.argmin and ak.argmax.
Date: 2020-11-20 14:06:54 From: Simon B. (@sbuse)
Thanks a lot, the indexes are a really good idea. Numba I will also have to try one day; the speed is really impressive!
Date: 2020-11-20 14:08:52 From: Jim Pivarski (@jpivarski)
I have a few examples of this floating around, but perhaps not on the documentation server: it's sometimes a good idea to use Numba to make the mask/index, rather than making the array itself. (The JITed function is simpler, the mask/index can be reused on more than one array, etc.)
Date: 2020-11-25 08:20:26 From: Simon B. (@sbuse)
Hi everybody, can someone double check this minimal example? Unfortunately I'm stuck at Numba version 0.50.1 since there is no version of llvmlite (>0.33) that runs on the system I'm working on (openSUSE 15.2).
print("Numba version:"+str(nb.__version__))
print("Awkward version:"+str(ak.__version__))
a = ak.Array([True,False,False])
@nb.jit(nopython=True)
def do_something(array):
    for i in array:
        if i:
            pass  # do something
do_something(a)
Error:
Numba version:0.50.1
Awkward version:0.4.4
Error:
LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
'Boolean' object has no attribute 'bitwidth'
File "<ipython-input-14-265d4c1ad00a>", line 7:
def do_something(array):
for i in array:
^
Date: 2020-11-25 08:22:10 From: Simon B. (@sbuse)
Is there really a problem with awkward booleans in this version of Numba?
Date: 2020-11-25 16:09:30 From: Jim Pivarski (@jpivarski)
No problem with Numba (the minimum Numba version we can use is 0.50.0, so you're okay with 0.50.1). This is a bug. It looked pretty simple so I fixed it right away:
https://github.com/scikit-hep/awkward-1.0/pull/559
I had assumed that all Numba types have a bitwidth field, but numba.types.Boolean does not. Downstream of this error, I further learned that booleans are 1-bit integers in LLVM (not 8-bit, as they are in Awkward Array, apart from the mask of BitMaskedArray), so that required yet another step: if boolean, insert a compare-with-zero LLVM instruction.
Evidently, booleans were lacking tests, so I added one.
Date: 2020-11-25 17:37:30 From: Simon B. (@sbuse)
Thanks for the fix!
Date: 2020-12-01 23:45:40 From: Jim Pivarski (@jpivarski)
(Sorry that I'm reposting this everywhere; I want everyone to be warned.)
The Awkward/Uproot name transition is done, at least at the level of release candidates. If you do
pip install "awkward>=1.0.0rc1" "uproot>=4.0.0rc1"
you'll get Awkward 1.x and Uproot 4.x. (They don't strictly depend on each other, so you could do one, the other, or both.)
If you do
pip install "awkward1>=1.0.0rc1" "uproot4>4.0.0rc1"
you'll get thin awkward1 and uproot4 packages that just bring in the appropriate awkward and uproot and pass names through. This is so that uproot4.whatever still works.
If you do
pip install awkward0 uproot3 # or just uproot3
you'll get the old Awkward 0.x and Uproot 3.x that you can import ... as .... This also brings in uproot3-methods, which is a new name just to avoid compatibility issues with old packages that we saw last week.
All of the above are permanent; they will continue to work after Awkward 1.x and Uproot 4.x are full releases (not release candidates). However, the following will bring in old packages before the full release and new packages after the full release.
pip install awkward uproot
So it is only the full release that will break scripts, and only when users pip install --update. I plan to take that step this weekend, when there might be fewer people actively working. It also gives everyone a chance to provide feedback or take action with import ... as ....
Date: 2020-12-03 09:22:10 From: Simon B. (@sbuse)
Good morning everybody. I stumbled over something unexpected. This time it is about memory allocation. My code is doing something similar to the following. I iterate in a loop over events and keep overwriting ak.Arrays. The problem is that the required memory constantly grows so it looks like the arrays are kept in memory. Please have a look at the following.
import numpy as np
import awkward1 as ak
import tracemalloc
import gc
print("Awkward version:"+str(ak.__version__))
tracemalloc.start()
a = np.random.rand(50000)
for i in a:
    # here i do something but i always overwrite
    b = ak.Array([i])
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    # print the 3 largest memory allocations
    print(stat)
tracemalloc.stop()
and the print:
Awkward version:0.4.4
.local/lib/python3.6/site-packages/awkward1/highlevel.py:285: size=2588 KiB, count=50000, average=53 B
.local/lib/python3.6/site-packages/awkward1/_util.py:154: size=2588 KiB, count=50000, average=53 B
.local/lib/python3.6/site-packages/awkward1/_util.py:149: size=2588 KiB, count=50000, average=53 B
For comparison, please run the same with overwriting a NumPy array. In the case of NumPy the used memory is about 400 kB, and with Awkward it is 2.5 MB. Does anyone know how I can force the arrays to be freed from memory? Deleting the array (del b) at the end of the loop and invoking gc.collect() unfortunately does not work.
Date: 2020-12-03 12:21:54 From: Jim Pivarski (@jpivarski)
The memory won't go away in a "malloc" sense until it's garbage collected. Like all Python modules, Awkward only drops the reference count to zero—it doesn't actually free the memory. You could insert
Date: 2020-12-03 12:21:58 From: Jim Pivarski (@jpivarski)
Date: 2020-12-03 12:22:23 From: Jim Pivarski (@jpivarski)
gc.collect()
Date: 2020-12-03 12:23:58 From: Jim Pivarski (@jpivarski)
to force garbage collection, but that's usually only advisable for debugging, not production. The garbage collector will invoke itself when your process runs out of its allowed memory (which can be set with the Unix ulimit command).
Date: 2020-12-03 12:42:41 From: Simon B. (@sbuse)
I tried gc.collect() and it does not free any space... The problem I have is that I really run out of memory quickly since I have to create and overwrite a lot of arrays. I could start the job with more memory, but ultimately there will be a limit on how many times I can overwrite arrays before the memory is full...
Date: 2020-12-03 13:07:25 From: Jim Pivarski (@jpivarski)
That sounds like an actual memory leak, then. (Oh! I see where you said that you did gc.collect(); sorry I missed that—I was reading/writing on a phone.) This does sound like an actual memory leak. I did fix a memory leak recently. Let me see...
Date: 2020-12-03 13:10:15 From: Jim Pivarski (@jpivarski)
I don't think it was in the step between 0.4.4 and 0.4.5, but there was a batch of changes there, right before the name change and 1.0 transition.
You could update awkward1==0.4.5 or switch to awkward>=1.0.0rc1.
Date: 2020-12-03 13:10:20 From: Jim Pivarski (@jpivarski)
https://github.com/scikit-hep/awkward-1.0/releases
Date: 2020-12-03 13:22:30 From: Simon B. (@sbuse)
The minimal example with the new version (awkward>=1.0.0rc1) gives exactly the same result, even when I insert a delete command and gc.collect().
print("Awkward version:"+str(ak.__version__))
tracemalloc.start()
a= np.random.rand(50000)
for i in a:
    # here i do something but i always overwrite
    b = ak.Array([i])
    del b
    gc.collect()
Awkward version:1.0.0rc1
python3.6/site-packages/awkward/highlevel.py:268: size=2588 KiB, count=50001, average=53 B
python3.6/site-packages/awkward/_util.py:151: size=2588 KiB, count=50000, average=53 B
python3.6/site-packages/awkward/_util.py:146: size=2588 KiB, count=50000, average=53 B
python3.6/site-packages/awkward/_util.py:141: size=2588 KiB, count=50000, average=53 B
Date: 2020-12-03 13:24:11 From: Simon B. (@sbuse)
I will try to run my analysis with the new version too, but this doesn't look promising.
Date: 2020-12-03 13:44:46 From: Jim Pivarski (@jpivarski)
I did a dirty but conclusive-enough experiment (watching htop while running commands in a Zoom meeting). The following consistently increased memory linearly: 100 MB in 80 seconds.
% python -i -c 'import awkward as ak; import numpy as np; import gc'
>>> for i in range(1000000):
... a = ak.ArrayBuilder()
... a.integer(i)
... del a
... tmp = gc.collect()
...
and the following did not increase by even 10 MB in 80 seconds (where 10 MB is the level of noise—other applications allocating and freeing memory on my system).
% python -i -c 'import awkward as ak; import numpy as np; import gc'
>>> for i in range(1000000):
... a = np.array([i])
... b = ak.Array(a)
... del a
... del b
... tmp = gc.collect()
...
The first is a simplified version of yours: your example creates arrays from Python data, which internally invoke the ArrayBuilder. The second makes Awkward Arrays by wrapping NumPy arrays. Your example,
% python -i -c 'import awkward as ak; import numpy as np; import gc'
>>> for i in range(1000000):
... a = ak.Array([i])
... del a
... tmp = gc.collect()
...
is pretty much a combination of the two steps: ArrayBuilder, then wrap as ak.Array. Doing the above explicitly accumulated 120 MB in 80 seconds.
So it sounds like this is a memory leak in ArrayBuilder, very likely the GrowableBuffer that gets allocated is somehow not getting freed. The memory that GrowableBuffer allocates is in a std::shared_ptr that should be kept alive only by the fact that the ArrayBuilder is held as a Python reference. In my first example, del a should have dropped the Python reference count to zero and gc.collect() should have deleted the ArrayBuilder, then GrowableBuffer instance, which should have dropped the std::shared_ptr reference count to zero to immediately free the memory. That doesn't seem to be happening, but I know where to look.
I'll post an issue.
Date: 2020-12-03 13:47:57 From: Jim Pivarski (@jpivarski)
https://github.com/scikit-hep/awkward-1.0/issues/567
Date: 2020-12-03 14:08:57 From: Simon B. (@sbuse)
Thanks a lot for trying it yourself. It was a bit sad to see my code crashing over and over without understanding why.
Date: 2020-12-03 16:00:58 From: Jonas Rübenach (@jrueb)
If you want to use numpy functionality on awkward arrays like np.cumsum and np.digitize, what's the best way to do it? Convert to numpy?
Date: 2020-12-03 16:13:20 From: Jim Pivarski (@jpivarski)
@jrueb Currently, yes. The best way to do that is np.asarray(your_awkward_array), which views the underlying buffer if possible. Ideally, we would have Awkward overloads for each of these, though most of them would probably just do that conversion. Some, like np.max ↔ ak.max, do new things with the NumPy function. I should probably make a pack of NumPy overloads that only convert to NumPy, just so that np.cumsum works without your intervention.
Date: 2020-12-03 16:38:00 From: Jonas Rübenach (@jrueb)
I see, thanks. I've been thinking about the first parameter of np.digitize. Ignoring the second parameter (which is a numpy array in my case), it's just like an ufunc. How can I get this working if I have a multi-dimensional awkward array?
Date: 2020-12-03 16:44:31 From: Jim Pivarski (@jpivarski)
@jrueb If the nested lists are all equal-length, then np.asarray(your_awkward_array) should return the corresponding NumPy array without errors. Then np.digitize will recognize it.
Date: 2020-12-03 16:45:03 From: Jonas Rübenach (@jrueb)
But if not?
Date: 2020-12-03 16:46:06 From: Jim Pivarski (@jpivarski)
If not, you'll get an error message saying that this is the case. But if your lists have different lengths, they can't be a NumPy array. What you can do in this case is ak.pad_none and ak.fill_none to make them have equal lengths, or ak.flatten, if that's appropriate, or possibly something else, depending on what your intention is.
Date: 2020-12-04 19:05:04 From: alesaggio (@alesaggio)
Hi, I am facing something weird with conversion from awkward0 to awkward1. I attach a test file to reproduce the issue here [1]. Below is the code to reproduce the behavior:
import pickle
import awkward1 as ak
with open('test.p', 'rb') as f:
    leptons = pickle.load(f)
lep_ak1 = ak.from_awkward0(leptons.p4)
print("order is different for second event:")
print(leptons.p4.fPt)
print(lep_ak1.fPt)
The problem is that when converting to awkward1, the leptons inside the second event (in this case) get reshuffled, not preserving the original order. I am using: awkward '0.13.0', awkward1 '0.4.5'
Thanks for any help! [1] https://cernbox.cern.ch/index.php/s/rC4xtbIMQiXez6A
Date: 2020-12-04 19:06:30 From: Jim Pivarski (@jpivarski)
Oh, pickle can be confused by the fact that the awkward module name has changed. I'll try to find a way to load it.
Date: 2020-12-04 19:08:41 From: alesaggio (@alesaggio)
Ok, just to be sure, the issue happens without pickle, I just used it to create the file in order to reproduce it
Date: 2020-12-04 19:09:16 From: Jim Pivarski (@jpivarski)
@alesaggio Actually—I don't know which modules you have installed. I think it might be impossible to load an Awkward 0 array pickled when awkward meant Awkward 0 and unpickled when awkward means Awkward 1. I think they won't pass through the name change.
Date: 2020-12-04 19:10:15 From: Jim Pivarski (@jpivarski)
You might have pre-name-change libraries installed and I have post-name-change libraries, and your problem is a different one.
Date: 2020-12-04 19:12:02 From: Jim Pivarski (@jpivarski)
To get the array to me, you could use Awkward 0's save function. That ought to work.
Date: 2020-12-04 19:13:22 From: Jim Pivarski (@jpivarski)
The problem with pickle is that it hard-codes Python class and module names in the file, and usually those don't change. Pickle also has problems with Python class versions changing: anything that changes the meaning of a class with a given name. (It doesn't save the class definition in the pickle file, just the objects and the name of the class.)
Date: 2020-12-04 19:16:39 From: alesaggio (@alesaggio)
Ah, sorry about that. Here's the new file: https://cernbox.cern.ch/index.php/s/3iMlJDcFxDEDM99
Date: 2020-12-04 19:36:03 From: Jim Pivarski (@jpivarski)
The first thing I had to do was fix the awkward0 package to load files made when the name was awkward:
>>> import awkward0
>>> awkward0.__version__
'0.15.1'
>>> old = awkward0.load("newTest.awkd")
>>> old
<JaggedArray [[<Row 23098> <Row 23099> <Row 23100>] [<Row 73133> <Row 73134> <Row 73132>]] at 0x7ffa82ff2070>
Now I'm moving on to your actual problem.
Date: 2020-12-04 19:38:50 From: Jim Pivarski (@jpivarski)
Yeah, I see it:
>>> oldp4 = old.p4
>>> oldp4
<JaggedArrayMethods [[PtEtaPhiMassLorentzVector(pt=67.984, eta=2.4785, phi=2.541, mass=-0.099731) PtEtaPhiMassLorentzVector(pt=46.521, eta=1.8037, phi=-2.2109, mass=-0.013321) PtEtaPhiMassLorentzVector(pt=17.046, eta=1.4587, phi=0.81519, mass=-0.014069)] [PtEtaPhiMassLorentzVector(pt=61.624, eta=-0.098343, phi=-0.27698, mass=0.10571) PtEtaPhiMassLorentzVector(pt=20.262, eta=-1.7275, phi=-2.1987, mass=0.10571) PtEtaPhiMassLorentzVector(pt=10.989, eta=-0.36774, phi=1.5681, mass=-0.0023746)]] at 0x7ffae10783a0>
>>> newp4 = ak.from_awkward0(oldp4)
>>> newp4
<Array [[{fPt: 68, ... fMass: 0.106}]] type='2 * var * struct[["fPt", "fEta", "f...'>
>>> oldp4.pt
<JaggedArray [[67.98397 46.520737 17.046204] [61.623806 20.262285 10.989361]] at 0x7ffae10789d0>
>>> newp4.fPt
<Array [[68, 46.5, 17], [11, 61.6, 20.3]] type='2 * var * float32'>
Date: 2020-12-04 19:48:24 From: alesaggio (@alesaggio)
Actually, I am noticing the following. To build the leptons, I use the JaggedCandidateArray.candidatesfromoffsets function from coffea.analysis_objects. I create a lepton dictionary, then create the leptons object with this function and then I sort them by pt, like in the following
leptons = Jca.candidatesfromoffsets(offsets, **lepton_dict)
leptons = leptons[leptons.pt.argsort()]
I am noticing that if I don't sort the leptons by pt, then the ordering is what is seen after the conversion to awkward1.
Date: 2020-12-04 19:50:35 From: Jim Pivarski (@jpivarski)
I've found the issue: in the implementation of ak.from_awkward0, I had forgotten that Awkward 0 "Table" combined the responsibilities of "RecordArray" and "IndexedArray", with the latter in a hidden field named _view. I'm making ak.from_awkward0 aware of Table._view right now.
Date: 2020-12-04 19:54:11 From: alesaggio (@alesaggio)
Ah, glad you found it so quickly, thanks a lot!
Date: 2020-12-04 19:54:54 From: Jim Pivarski (@jpivarski)
(The hardest thing is remembering how to use Awkward 0!)
Date: 2020-12-04 19:57:13 From: Jim Pivarski (@jpivarski)
>>> import awkward0
>>> import awkward as ak
>>> old = awkward0.load("newTest.awkd")
>>> oldp4 = old.p4
>>> newp4 = ak.from_awkward0(oldp4)
>>> oldp4.pt
<JaggedArray [[67.98397 46.520737 17.046204] [61.623806 20.262285 10.989361]] at 0x7f1a5f88f8e0>
>>> newp4.fPt
<Array [[68, 46.5, 17], [61.6, 20.3, 11]] type='2 * var * float32'>
Date: 2020-12-04 19:57:15 From: alesaggio (@alesaggio)
for me it's the other way around still, but gradually adjusting to awkward1 :P
Date: 2020-12-04 19:57:47 From: alesaggio (@alesaggio)
great!
Date: 2020-12-04 20:00:06 From: Jim Pivarski (@jpivarski)
It's on its way: https://github.com/scikit-hep/awkward-1.0/pull/573
Date: 2020-12-04 20:11:40 From: alesaggio (@alesaggio)
Works like a charm, thanks :)
Date: 2020-12-05 19:21:12 From: Jim Pivarski (@jpivarski)
(Sorry for the reposting, if you saw this message elsewhere.)
Probably the last message about the Awkward Array/Uproot name transition: it's done. The new versions have moved from release candidates to full releases. Now when you
pip install awkward uproot
without qualification, you get the new ones. I think I've "dotted all the 'i's of packaging" to get the right dependencies and tested all the cases I could think of on a blank AWS instance.
-
pip install awkward0 uproot3 returns the old versions (Awkward 0.x and Uproot 3.x). The prescription for anyone who needs the old packages is import awkward0 as awkward and import uproot3 as uproot. -
pip install awkward1 uproot4 returns thin wrappers of the new ones, which point to whatever the latest awkward and uproot are. They pass through to the new libraries, so scripts written with import awkward1, uproot4 don't need to be changed (though you'll probably want to, for simplicity). -
uproot-methods no longer causes trouble because there's an uproot3-methods in the dependency chain: awkward0 → uproot3-methods → uproot3. The latest uproot-methods (no qualification) now excludes Awkward 1.x so that they can't be used together by mistake.
Date: 2021-01-04 18:24:50 From: aswanthkrishna (@aswanthkrishna)
when trying to install awkward[cuda] with pip getting this error. ' Could not find a version that satisfies the requirement awkward-cuda-kernels (from versions: ) No matching distribution found for awkward-cuda-kernels' how to fix?
Date: 2021-01-04 18:25:40 From: Jim Pivarski (@jpivarski)
Which version is awkward?
Date: 2021-01-04 18:26:58 From: Jim Pivarski (@jpivarski)
Also, note that the CUDA plugin is very alpha-stage right now.
Date: 2021-01-04 18:33:38 From: Jim Pivarski (@jpivarski)
If it's listing zero possible versions, it could be because you're not on Linux. We're only developing the CUDA plugin for Linux. Macs don't have Nvidia GPUs and Windows is just generally difficult to support. The main target for the CUDA plugin is large-scale computing clusters.
Date: 2021-01-04 18:35:27 From: aswanthkrishna (@aswanthkrishna)
awkward is version 1.0.2rc4. I am running it on linux AWS instance. my main motive is to use jagged arrays with cupy. is it possible at this point?
Date: 2021-01-04 18:40:32 From: Jim Pivarski (@jpivarski)
It depends on what you mean by "use". Jagged arrays have been loaded on a GPU and some simple things have been done with it (ak.num and ufuncs).
Date: 2021-01-04 18:41:19 From: Jim Pivarski (@jpivarski)
Could it be that your Linux doesn't satisfy "manylinux2014"?
https://pypi.org/project/awkward-cuda-kernels/1.0.2rc4/#files
Date: 2021-01-04 18:47:14 From: Henry Schreiner (@henryiii)
Pip needs to be pretty new to pick up manylinux2014 too
Date: 2021-01-04 18:48:10 From: Henry Schreiner (@henryiii)
Try to update pip or at least check the version. I’m thinking it’s 19.something for manylinux 2014.
Date: 2021-01-04 18:52:03 From: aswanthkrishna (@aswanthkrishna)
that solved it. thank you very much :)
Date: 2021-01-05 08:20:17 From: aswanthkrishna (@aswanthkrishna)
can i access a jagged array from a jitted cuda kernel with numba?
Date: 2021-01-05 15:50:07 From: Jim Pivarski (@jpivarski)
Yes!
>>> import awkward as ak
>>> import numba as nb
>>> import numpy as np
>>> @nb.njit
... def manual_sum(array):
... out = np.zeros(len(array), np.float64)
... for index, sublist in enumerate(array):
... for item in sublist:
... out[index] += item
... return out
...
>>> manual_sum(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
array([6.6, 0. , 9.9])
but only if the array is in main memory, not if it's on the GPU. (I only say this because you were looking into the CUDA plugin earlier. You can move an array from main memory to GPU and back with ak.to_kernels.)
Date: 2021-01-05 18:31:17 From: aswanthkrishna (@aswanthkrishna)
Love it. Thanks for this awesome library! Are there plans to support Numba CUDA kernels on GPU arrays?
Date: 2021-01-05 18:32:42 From: Jim Pivarski (@jpivarski)
Yes. I've had off-and-on conversations with Graham Markall at Nvidia about it.
Date: 2021-01-05 18:33:49 From: Jim Pivarski (@jpivarski)
Numba's nb.cuda.jit currently cannot accept any extension types as arguments or return values, but he's working on adding that to support Numba-compiled UDFs in RAPIDS.ai's cuDF.
Date: 2021-01-05 18:34:23 From: Jim Pivarski (@jpivarski)
Awkward Array may be the first outside-of-Nvidia project to take advantage of that new feature.
Date: 2021-01-07 22:01:24 From: Chris Lee-Messer (@cleemesser)
Hello, yesterday I watched J.P.'s SciPy2020 presentation. Awkward looks amazing! Congrats. I would like to use awkward to access a datastructure on disk as an awkward array but I'm having trouble figuring it out. It is a sequence of blocks which I can easily mmap as a numpy array with dtype int16.
Date: 2021-01-07 22:01:55 From: Chris Lee-Messer (@cleemesser)
import numpy as np
arr = np.memmap(fname, dtype=np.int16, mode="r", shape=(n_blocks, blocksize))
Date: 2021-01-07 22:09:55 From: Chris Lee-Messer (@cleemesser)
Inside each block, there is additional structure to the integers. It is a ragged array, usually with mostly the same lengths: something like
[ 256 * int16,
256 * int16,
1 * int16,
1 * int16]
That is my attempt to approximate the type/datashape formatting, which I know I need to learn. All the blocks have the same format. I feel like this should be easy, perhaps using the ak.from_buffer interface, but I don't see a lot of examples of building an array in this situation. Can you give me a hint? I know all the important dimensions ahead of time: n_blocks, blocksize, and the dimensions of the ragged array inside each block.
Date: 2021-01-07 22:10:24 From: Chris Lee-Messer (@cleemesser)
Thanks in advance if you can help
Date: 2021-01-07 23:21:42 From: Jim Pivarski (@jpivarski)
There isn't an ak.from_buffer, but there's an ak.from_numpy. Any NumPy array can be wrapped as an Awkward Array. I haven't tried it on a memmapped array, but I don't see why it wouldn't work. If the Awkward Array has the same structure as a NumPy array, there isn't a strong advantage to using it, but if you're using it in a context that builds structure, then there's a good reason for it.
Date: 2021-01-07 23:38:06 From: Chris Lee-Messer (@cleemesser)
Thank you. I can create the awkward array that mimics my numpy array with no problems, but it is not clear to me how to add the structure portion.
akarr = ak.from_numpy(arr) # following on my example above, works fine but does not contain the block structure, just the information
In the example on creating an HDF5 version of an awkward array and reading it back, there is an ak.from_buffers() example that uses an awkward Form to specify the layout of the data. I was hoping I could use something like that with my memmapped file.
Date: 2021-01-07 23:52:05 From: Chris Lee-Messer (@cleemesser)
My guess was it would be something like this:
import awkward as ak
import numpy as np
mm_arr = np.memmap("test.data", dtype=np.int16, mode="r", shape=(n_blocks, blocksize)) # linear memmap
mmref = mm_arr._mmap # this might instead use np.int8 or byte
datasize = 2 * n_blocks*blocksize
form = """{ <insert the right form definition here to define the nested structure of this array>}"""
# I don't understand what to point into the above form exactly though.
# the equivalent information to form = """ n_blocks * [ 256 * int16, 256*int16, 1 * int16, 1 * int16]"""
akarr = ak.from_buffers(ak.forms.Form.fromjson(form), datasize, mmref)
Date: 2021-01-08 00:14:56 From: Chris Lee-Messer (@cleemesser)
The example uses a ListOffsetArray64 with content of a NumpyArray:
In[164]: form, length, container = ak.to_buffers(dld, container=group)
In[165]: form
{
"class": "ListOffsetArray64",
"offsets": "i64",
"content": {
"class": "NumpyArray",
"itemsize": 8,
"format": "l",
"primitive": "int64",
"form_key": "node1"
},
"form_key": "node0"
}
So I'm guessing I need to somehow make a nesting of ListOffsetArray64 with content ListOffsetArray with content NumpyArray
Date: 2021-01-08 01:00:56 From: Jim Pivarski (@jpivarski)
Usually, we get data that already has a data structure that we have to deal with. In your case, you're starting with flat arrays and want to add structure. How you do that will depend strongly on what it is you're trying to do. ak.unflatten might be useful: it takes flattened data plus the number of items in each list and fills them into lists. That will give you variable-length lists. You mentioned a variable number of regular-length lists: you might want to reshape the data in NumPy first and then run ak.unflatten on that.
Date: 2021-01-08 01:11:38 From: Jim Pivarski (@jpivarski)
Oh, ak.from_buffers is nothing like np.frombuffer: I hadn't noticed that similarity. (Fortunately, ours is spelled differently!) ak.from_buffers is a somewhat low-level function because it reveals the ListOffsetArray/NumpyArray/etc. structure inside of a high-level ak.Array.
Usually, you wouldn't be hand-crafting Forms and buffers to feed into ak.from_buffers (although I've suggested exactly that for some Python <--> C++ interface projects). ak.from_buffers is a tool that can be used to build backends.
The example with HDF5 in the tutorial was about building such a backend: there wasn't anything special about HDF5, only that it groups collections of named arrays. Notice that in that example, it wasn't necessary to build a Form or the collection of buffers: ak.from_buffers was intended to be used with ak.to_buffers, for reading from and writing to the new backend.
Date: 2021-01-08 02:45:20 From: Chris Lee-Messer (@cleemesser)
Ooo, ak.unflatten looks like the ticket :-) I will start experimenting. And thank you for the insight on ak.from_buffers!
Date: 2021-01-13 15:16:07 From: Jonas Rübenach (@jrueb)
Why does ak.Array({})["a"] raise a ValueError and not a KeyError? Generally I would expect to get a KeyError here. ValueError is so general, it's hard to catch only the case where the field does not exist.
Date: 2021-01-13 15:18:17 From: Henry Schreiner (@henryiii)
FYI, if it helps, boost-histogram also plans to throw a KeyError https://github.com/scikit-hep/boost-histogram/issues/387
Date: 2021-01-13 15:31:46 From: Jim Pivarski (@jpivarski)
All of the user errors in Awkward Array raise ValueError and all of the internal errors raise RuntimeError.
In previous projects, I've tried to add fine-grained exception types, including creating exception types for custom things, but the distinctions between various cases were not always clear-cut: the choice of exception type was subjective and arbitrary.
That's particularly true of the ValueError/TypeError distinction in Awkward Array, since some array operations fail because the array types are wrong, but these distinct array types are not distinct Python types.
As for KeyError, situations like array["a"] arguably should raise KeyError, but not array[5], since that should be IndexError. How about array["a", 5] or array[5, "a"]? What if the extraction of field "a" only fails on index 5 and not index 4?
As a user of interfaces that can raise different exception types, I always have to test it to find out which exception type it's going to raise because it's not guessable. A much more useful way of getting fine granularity in exception catching is to narrow the try .. except to just one step, rather than picking a fine-grained exception type. Also, it has happened several times that the type of exception an operation raises changes in some version of the software, and then my try .. except fails to catch it.
Date: 2021-01-13 16:06:27 From: Jonas Rübenach (@jrueb)
Okay, I see the ambiguity with IndexError. In my situation having just one line in the try block did not solve it because it was something like this
arr = ak.Array({"a": ak.virtual(lambda: np.ones(-1), length=1)})
arr["a"] # ValueErrorI will use ak.fields to test if the field is present from now on.
Date: 2021-01-22 10:57:13 From: Diego Ramírez García (@ramirezdiego)
Hi,
sorry in advance if I missed something trivial, but how should I do to print the awkward1 version I import?
>>> import awkward1
>>> awkward1.__version__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'awkward1' has no attribute '__version__'
Date: 2021-01-22 10:59:05 From: Diego Ramírez García (@ramirezdiego)
I experience the same issue with the most recent uproot4, by the way.
Thank you for your help.
Date: 2021-01-22 13:12:29 From: Alexander Held (@alexander-held)
You may have a recent version of awkward1. In that case, it only is a wrapper around awkward (with the new API). The relevant version would then be awkward.__version__ (after importing awkward instead).
Date: 2021-01-22 13:15:13 From: Alexander Held (@alexander-held)
see also this comment https://gitter.im/Scikit-HEP/awkward-array?at=5fc6d5a487cac84fcd01b55e for some more details about this transition
Date: 2021-01-22 13:29:47 From: Diego Ramírez García (@ramirezdiego)
Thank you, @alexander-held! It makes then sense to set my environment to do
pip install "awkward>=1.0.0rc1" "uproot>=4.0.0rc1"
from now on.
Date: 2021-01-22 13:49:29 From: Alexander Held (@alexander-held)
yes, and I think you can drop the "rc1" now, since the proper versions have been released since:
pip install "awkward>=1.0.0" "uproot>=4.0.0"
Date: 2021-01-22 13:59:33 From: Diego Ramírez García (@ramirezdiego)
Perfect, thanks again.
Date: 2021-01-22 15:19:04 From: Jim Pivarski (@jpivarski)
The new awkward1 is a thin wrapper around awkward, but I haven't tested that names starting with underscores get passed through. That's an oversight. Also, which version should it return? The version of the thin wrapper or the awkward library that it wraps? (Ultimately, it's better to use awkward directly; this stub was to avoid breaking scripts that spell the module name "awkward1".)
Date: 2021-01-22 15:40:24 From: Diego Ramírez García (@ramirezdiego)
👍 !
Date: 2021-01-22 16:11:36 From: Henry Schreiner (@henryiii)
I would return awkward.__version__; the __version__ is really just a shortcut for users and is not the canonical version of the package; the actual version of the package can be accessed via importlib.metadata
Date: 2021-01-28 12:56:58 From: Jonas Rübenach (@jrueb)
Thanks for always fixing bugs so quickly. It really helps a lot.
Date: 2021-03-02 15:03:24 From: alesaggio (@alesaggio)
Hi, I have a question regarding the argcombinations function in awkward. I am trying to compute the invariant mass of jet combinations. In awkward0, I used to do it as follows:
jet_pairs = jets.argchoose(2)
mjj = (jets[jet_pairs.i0].p4 + jets[jet_pairs.i1].p4).mass
In awkward, I am replacing the function argchoose with argcombinations:
jet_pairs = ak.argcombinations(jets, 2)
However, I can’t seem to access the first and second indices of the pair with .i0 and .i1 like in awkward0 anymore. Do you know of a good way to do this?
Date: 2021-03-02 15:04:21 From: Jim Pivarski (@jpivarski)
.slot0 and .slot1. The names with "i" were too short and ambiguous.
Date: 2021-03-02 15:04:44 From: Jim Pivarski (@jpivarski)
I'm a little wary of "slot," too, since it's not easy to guess what it means.
Date: 2021-03-02 15:05:41 From: Jim Pivarski (@jpivarski)
You might actually prefer to ak.unzip them, instead of explicitly naming slots.
Date: 2021-03-02 15:06:23 From: Jim Pivarski (@jpivarski)
jet_pairs = ak.argcombinations(jets, 2)
jet1, jet2 = ak.unzip(jet_pairs)  # or do both functions in one step
Date: 2021-03-02 15:08:25 From: alesaggio (@alesaggio)
thanks a lot! That works. Is there a place where I could find these slot fields in the documentation?
Date: 2021-03-02 15:08:59 From: Jim Pivarski (@jpivarski)
They also use up a lot of the left-bar space here: https://awkward-array.readthedocs.io/en/latest/_auto/ak.Array.html#ak-array-slot0
Date: 2021-03-02 15:11:54 From: alesaggio (@alesaggio)
Many thanks :)
Date: 2021-03-03 16:13:08 From: alesaggio (@alesaggio)
Hi again! I was wondering, is there an equivalent function of numpy.clip() in awkward?
Date: 2021-03-03 16:16:53 From: Jim Pivarski (@jpivarski)
There isn't one, so you'd have to do what NumPy says that it's "equivalent to but faster than."
np.minimum(a_max, np.maximum(a, a_min))
If NumPy had made this function a ufunc, we would get it automatically. Since they didn't, we'd have to wrap it just as we wrap ufuncs to get the performance advantage that they cite.
Date: 2021-03-03 16:26:19 From: alesaggio (@alesaggio)
Great, thanks a lot!
Date: 2021-03-03 16:27:35 From: alesaggio (@alesaggio)
While we are at it, I would have another question. When masking an array, its dimensionality gets reduced, e.g.:
myarr = <Array [5., 10., 15., 20., 25.] type='5 * float64'>
mask = myarr > 10
myarr[mask] = <Array [15., 20., 25.] type='3 * float64'>
But if I needed to retain the dimensionality of myarr, i.e.:
myarr[mask] = <Array [[], [], [15.], [20.], [25.]] type='5 * var * float64'>
what would be a good way to do it? I tried to ak.unflatten both myarr and mask to increase the dimensionality, but the result is the same. I managed to find a workaround by declaring both the array and the mask as awkward0 arrays and then converting them back to awkward1, something like this:
ak.from_awkward0(awkward0.JaggedArray.fromcounts(counts, myarr))
and this does what I want. However I believe there should be a smarter way of doing it without transitioning back and forth from awkward0?
Date: 2021-03-03 17:06:50 From: Jim Pivarski (@jpivarski)
Maybe what you're looking for is ak.Array.mask and ak.singletons?
>>> myarr = ak.Array([5.0, 10, 15, 20, 25])
>>> myarr
<Array [5, 10, 15, 20, 25] type='5 * float64'>
>>> myarr[myarr > 10]
<Array [15, 20, 25] type='3 * float64'>
>>> myarr.mask[myarr > 10]
<Array [None, None, 15, 20, 25] type='5 * ?float64'>
>>> ak.singletons(myarr.mask[myarr > 10])
<Array [[], [], [15], [20], [25]] type='5 * var * float64'>
Date: 2021-03-03 17:19:16 From: alesaggio (@alesaggio)
That's exactly what I was looking for, many thanks again! :)
Date: 2021-03-05 13:53:05 From: Daniel Holmberg (@deinal)
Hey I'm using awkward-1.0.2, and I have an array
events
<Array [{Jet_pt: [1.68e+03, 1.33e+03, ... -1]}] type='100 * {"Jet_pt": var * flo...'>
I'd like to assign new arrays to a list of observables:
result = events[['GenJet_pt', 'GenJet_eta', 'Jet_pt', 'Jet_eta']]
result[['GenJet_pt', 'GenJet_eta']] = result[['GenJet_pt', 'GenJet_eta']][events.Jet_genJetIdx]
However, it raises the error TypeError: only fields may be assigned in-place (by field name). Is it at all possible to assign new arrays to multiple fields using awkward?
My approach after that was to create two separate arrays, where I reorder GenJets separately and then do a side-by-side concatenation with Jets.
gen_jets = events[['GenJet_pt', 'GenJet_eta']][events.Jet_genJetIdx]
jets = events[['Jet_pt', 'Jet_eta']]
result = ak.concatenate((jets, gen_jets), axis=1)
That throws: ValueError: cannot broadcast records because keys don't match. I have successfully used ak.concatenate to do vertical concatenation using axis=0, but I suppose the similarities with pandas end there.
I get my desired result by switching to pandas, but I would like to use only awkward. Does anyone know if it's possible?
gen_jets = events[['GenJet_pt', 'GenJet_eta']][events.Jet_genJetIdx]
gen_jets_df = ak.to_pandas(gen_jets)
jets = events[['Jet_pt', 'Jet_eta']]
jets_df = ak.to_pandas(jets)
result = pd.concat((jets_df, gen_jets_df), axis=1)
Date: 2021-03-05 15:10:08 From: Jim Pivarski (@jpivarski)
A record with four fields, GenJet_pt, GenJet_eta, Jet_pt, Jet_eta, is different from concatenating them into two fields pt and eta with gen-jets and jets concatenated. In the latter case, you can no longer tell whether a given object was from the gen-jets or the jets. But assuming that you want the latter, you could rename the fields and therefore make them concatenatable.
generic1 = ak.zip({"pt": events["Jet_pt"], "eta": events["Jet_eta"]})
generic2 = ak.zip({"pt": events["GenJet_pt"], "eta": events["GenJet_eta"]})
Date: 2021-03-05 15:11:06 From: Jim Pivarski (@jpivarski)
In fact, you could also tag them before concatenating:
generic1["is_gen"] = False
generic2["is_gen"] = True
result = ak.concatenate((generic1, generic2), axis=1)
Date: 2021-03-05 15:11:15 From: Jim Pivarski (@jpivarski)
I haven't tested any of this, but give it a try!
Date: 2021-03-05 15:45:26 From: Daniel Holmberg (@deinal)
Hmm, okay so I get (as a dataframe): [screenshot omitted]
events[['GenJet_pt', 'GenJet_eta']][events.Jet_genJetIdx], and then do a side-by-side concat with Jets. Ideally it could be done without concatenate, by doing the reorder in-place so to speak, but then I run into the problem that I can't assign multiple fields at once. The result should look like this: [screenshot omitted]
To address the error I received previously TypeError: only fields may be assigned in-place (by field name) I could assign them one by one instead:
events['Gen_jet_pt'] = events['GenJet_pt'][events.Jet_genJetIdx]
events['Gen_jet_eta'] = events['GenJet_eta'][events.Jet_genJetIdx]
Date: 2021-03-05 15:47:22 From: Jim Pivarski (@jpivarski)
ak.zip is the equivalent of assigning fields all at once, though you might need to set the depth_limit, depending on how deeply you want it to zip.
Date: 2021-03-05 15:48:22 From: Daniel Holmberg (@deinal)
Okay thx, I'll look into it
Date: 2021-03-05 15:48:34 From: Jim Pivarski (@jpivarski)
You want to match GenJet ids to Jet ids... That sounds like a Pandas merge ("JOIN" operations are a major part of Pandas, but not Awkward Array, at least not yet).
Date: 2021-03-05 15:49:38 From: Jim Pivarski (@jpivarski)
Although there have been some circumstances where I wanted a group by and managed it through ak.run_lengths. This documentation page describes how to do that (scroll down).
Date: 2021-03-05 15:57:44 From: Daniel Holmberg (@deinal)
Yeah I played around with pd.merge, but it wasn't ideal for my needs. If I remember correctly it was because the indices got screwed up since it was a pd.MultiIndex dataframe. Anyway, thanks for the tips!
Date: 2021-03-05 15:59:18 From: Jim Pivarski (@jpivarski)
In Pandas, you can reset_index to turn the MultiIndex into ordinary columns, and then set_index to make the id an index. One DataFrame for GenJets, another DataFrame for Jets, and when they're both indexed by ids, pd.merge with left_index=True, right_index=True will JOIN on ids.
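That Pandas recipe can be sketched with toy tables (the column names are invented for illustration):

```python
import pandas as pd

# two flat tables keyed by a shared id (stand-ins for GenJets and Jets)
gen_jets = pd.DataFrame({"id": [0, 1, 2], "gen_pt": [10.0, 20.0, 30.0]}).set_index("id")
jets = pd.DataFrame({"id": [1, 2, 3], "pt": [21.0, 29.0, 40.0]}).set_index("id")

# JOIN on the shared index; only ids present in both survive (inner join)
joined = pd.merge(gen_jets, jets, left_index=True, right_index=True)
```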
Date: 2021-03-05 16:00:34 From: Jim Pivarski (@jpivarski)
Afterwards, you can set_index with a list of the two columns that make it jagged and complain to me that there isn't an ak.from_pandas that reverses ak.to_pandas. Such a thing would be useful here.
Date: 2021-03-05 22:04:52 From: Nicholas Smith (@nsmith-)
@deinal are these NanoAOD? The names look very familiar. You might find https://coffeateam.github.io/coffea/notebooks/nanoevents.html useful. If not, there is also a discussion about resolving cross-references in general in https://github.com/scikit-hep/awkward-1.0/issues/492
Date: 2021-03-06 14:50:09 From: Daniel Holmberg (@deinal)
@nsmith- Yes they are. Thanks for the links!
Date: 2021-03-17 10:28:31 From: DevilsZ (@DevilsZ)
Hello all, I'm a beginner uproot user and just started to learn array-based analysis with awkward-array. My question is how to apply cuts (event selection, in HEP terms) to arrays efficiently.
-
Require a higher pT cut for the primary muon and a lower pT cut for the secondary muon. For this I'm using the following code, but I am not sure this usage is OK and want to know if there are smarter ways:
mumu = event[ak.num(event.mu) == 2]  # Here events is a set of arrays taken from a TTree
mumu = mumu[ak.all(mumu.mu.pt > 25000.0, axis=1)]
mumu = mumu[ak.any(mumu.mu.pt > 27000.0, axis=1)]
-
Flavor tagging in the jet array. In my TBranch, there is a variable corresponding to the b-tag score. I want to require 2 or more tagged jets for each event. I guess if I want to require 1 or more, I just need to use the ak.any function, but I have no idea about '2 or more'.
Date: 2021-03-17 10:28:38 From: DevilsZ (@DevilsZ)
Thank you in advance
Date: 2021-03-17 13:42:47 From: Jim Pivarski (@jpivarski)
Your number (1) is a good solution to the problem, though it depends on the fact that you only wanted to put cuts on the first and second ("any" and "all" from a set of exactly two). It wouldn't generalize to a third particle, and that's your issue in part (2). To generalize, you'll want to do more with ak.num (which I know you already know about) than ak.any and ak.all.
First of all, be aware that your initial selection,
mumu = event[ak.num(event.mu) == 2]
is excluding events with more than two muons, which you probably don't want to do because track/particle collections include fakes. You probably want "at least two good muons" for a definition of good that may be pt > 25000 in your case. (Are your momenta in MeV/c?) The pt > 25000 could be seen as particle selection and "at least two good muons AND at least one with pt > 27000" is event selection.
Let me mock up a similar case:
>>> events = ak.Array([{"mu": [{"pt": 1}, {"pt": 2}, {"pt": 3}]}, {"mu": [{"pt": 1}, {"pt": 2}, {"pt": 3}, {"pt": 4}]}, {"mu": [{"pt": 2}]}])
You've produced jagged arrays of booleans,
>>> print(events.mu.pt > 1)
[[False, True, True], [False, True, True, True], [True]]
but then you immediately reduced them to non-jagged arrays of booleans with ak.any and ak.all. Maybe you don't know that you can use these jagged arrays of booleans as a selector of particles (not events) if the lengths of all the lists match:
>>> print(events[events.mu.pt > 1])
[{mu: [{pt: 2}, {pt: 3}]}, {mu: [{pt: 2}, {pt: 3}, {pt: 4}]}, {mu: [{pt: 2}]}]
We still have three events (the number we started with), but have removed particles from the events. Then we can ask questions about the number that remain:
>>> print(ak.num(events[events.mu.pt > 1]))
[{mu: 2}, {mu: 3}, {mu: 1}]
>>> print(ak.num(events[events.mu.pt > 1].mu))
[2, 3, 1]
>>> print(ak.num(events[events.mu.pt > 1].mu) >= 2)
[True, True, False]
Moreover, we can make different event-level selections through the number that match different particle-level selections:
>>> print(ak.num(events[events.mu.pt > 1].mu) >= 2)
[True, True, False]
>>> print(ak.num(events[events.mu.pt > 2].mu) >= 2)
[False, True, False]
Here, we can change both the pt cut and the num cut to express different predicates, and form a set of event-level cuts as a logical-AND of them. You should be able to use that for both of your problems, (1) and (2).
It's also worth noting that there's an ak.argsort function that can sort deeply nested lists. (Its default axis is -1, which means "deepest level.")
>>> ak.argsort(events.mu.pt, ascending=False)
<Array [[2, 1, 0], [3, 2, 1, 0], [0]] type='3 * var * int64'>
>>> print(events.mu[ak.argsort(events.mu.pt, ascending=False)])
[[{pt: 3}, {pt: 2}, {pt: 1}], [{pt: 4}, {pt: 3}, {pt: 2}, {pt: 1}], [{pt: 2}]]
Just as a jagged array of booleans can deeply slice the events, a jagged array of integers can deeply rearrange them, so that each list of muons is sorted according to the criterion you pass to ak.argsort. If you sort your data once (it's more expensive than other operations), then you can refer to the highest, second-highest, third-highest, etc. pt with a simple slice (which is less expensive than doing ak.num all the time, so it's a balance).
Date: 2021-03-17 13:44:08 From: Jim Pivarski (@jpivarski)
>>> sorted = events.mu[ak.argsort(events.mu.pt, ascending=False)]
>>> sorted[:, 0]
<Array [{pt: 3}, {pt: 4}, {pt: 2}] type='3 * {"pt": int64}'>
>>> sorted.mask[ak.num(sorted) >= 2][:, 1]
<Array [{pt: 2}, {pt: 3}, None] type='3 * ?{"pt": int64}'>
In that last example, I used ak.Array.mask, which turns events or particles that don't pass a cut into None rather than removing them, which can be helpful if you're having trouble with arrays that have different lengths, due to different cuts being applied.
Date: 2021-03-18 03:07:52 From: DevilsZ (@DevilsZ)
Hi Jim,
Many thanks for your instructions! I now understood how to write the codes.
And also thank you for letting me know about this argsort function. Seems quite useful!
By the way, I want to learn one more thing. Could you also let me know how to do the following? I want to calculate the invariant mass of two muons (it is sometimes "two electrons" or "a muon and an electron"), and the product of their charges. For the product of muon charges, I use the following lines:
events_mumu = events[ak.num(events.mu) == 2]
events_mumu = events_mumu[ak.prod(events_mumu.mu.charge, axis=1) < 0]
This can also be used for two electrons, and maybe a similar way can be applied to the invariant mass calculation for same-flavor lepton pairs. But I want to know how I can do this for electron and muon pairs. They belong to different arrays, as shown below (I just show a part of the variables for simplicity):
print('tot events: ', ak.type(events))
28780 * {"eventNum": uint64, "mu": var * {"pt": float32, "charge": float32}, "el": var * {"pt": float32, "charge": float32}, "jet": var * {"pt": float32, "btag_score": float32}}
Thank you very much in advance again.
Date: 2021-03-18 10:11:43 From: Alexander Held (@alexander-held)
I found this answer very insightful, it contains a lot of useful examples for analysis usage. I'm saving this so I can find it myself again in the future, but was wondering if there is a good place to put this in the awkward-array docs or so? Maybe it would just go into https://awkward-array.org/how-to-filter.html in the end. It's like a mini tutorial, and almost too sad to have it disappear into the history of this chat.
Date: 2021-03-18 13:15:58 From: Jim Pivarski (@jpivarski)
If you're going to be restricting to events with exactly two muons (or two electrons, or two leptons in general), then you can address each one individually:
first_muon = events_mumu[:, 0]
second_muon = events_mumu[:, 1]
With them in separate arrays like this, you don't need to use a general reducer (ak.prod) to find the product of their charges; you can just do
opposite_sign = first_muon.charge * second_muon.charge < 0
(It's because the expression events_mumu[:, 0] is itself a reducer; it turns the jagged array events_mumu into a one-dimensional array first_muon, with one value per event.)
In this form, it's much easier to mix objects from different collections:
opposite_sign = first_muon.charge * first_electron.charge < 0
If you don't restrict yourself to a fixed number of particles, you can still do it by concatenating (ak.concatenate) the collections at axis=1. The following combines a jagged array of muons for each event with a jagged array of electrons for each event (without mixing different events):
leptons = ak.concatenate([events.mu, events.el], axis=1)
If the muons and electrons have a different data type (e.g. they have different isolation variables), then you'll only be able to access the fields that have the same names from this leptons object. That is, you could say leptons.pt because both the muons and electrons have a pt field (hopefully, it means the same thing!), but if the electrons have an eOverP variable and the muons don't, you won't be able to access it. Do any cuts on particle-specific variables before combining the collections.
Keep in mind that the first_muon/second_muon technique is much easier, so if your analysis lets you do that, do that. Functions like ak.concatenate and ak.prod, which don't make assumptions about how many particles you have, are for the more general case when you have to weaken your assumptions.
Date: 2021-03-18 13:29:46 From: Jim Pivarski (@jpivarski)
Oh, and one last thing: be aware that
events_mumu = events[ak.num(events.mu) == 2]
is throwing away events that have two good muons plus some noise that was misreconstructed as a muon. In most analyses that require two good muons, you don't want to introduce a dependence on what else might be going on in the event to give you a fake third muon; you want to accept the event regardless of whether there's a fake third muon. Misreconstructed particles usually have low momenta (and in hadron collisions, they could be real particles coming from the underlying event, also with low momenta).
That's why I mentioned ak.argsort before. I think the selection you want to make is
inclusive_two_muons = events[ak.num(events.mu) >= 2]
order = ak.argsort(inclusive_two_muons.mu.pt, ascending=False)
inclusive_two_muons["mu"] = inclusive_two_muons.mu[order]
first_muon = inclusive_two_muons.mu[:, 0]
second_muon = inclusive_two_muons.mu[:, 1]
Now you have the highest-pt and second-highest-pt muons in each selected event.
If you do want to exclude a legitimate third muon (because there's some background process that creates them, though I can't imagine what background would have more muons than the signal), you could create a veto cut against the third muon being above some threshold
third_muon_pt = ak.fill_none(ak.max(inclusive_two_muons.mu[:, 2:].pt, axis=1), 0)
where inclusive_two_muons.mu[:, 2:].pt is a jagged array with 0 items if the event had exactly two muons (it's everything after the first two), then we take the ak.max of them (which might be None if the list is empty), then we do ak.fill_none to replace any None values with 0. The result is a one-dimensional array with 0 if there is no third muon and the third muon's pt if there is one.
All of this depends on what you want in your analysis, of course.
Date: 2021-03-18 13:35:02 From: Jim Pivarski (@jpivarski)
I know; I think about the fact that the tutorial documentation isn't done whenever I answer a question like this. However, it's much easier to answer questions about specific problems than to write documentation that helps people with generic, imagined problems. That kind of thing tends to become a walkthrough of the features, which the reference documentation already covers, since it describes each function in isolation.
I haven't figured out how to solve this problem. I'm just hoping I get enough information out there, in the form of Gitter (transitory), Slack (transitory), GitHub Discussions (permanent), and StackOverflow (permanent) that people can find what they're looking for.
Date: 2021-03-19 08:26:27 From: DevilsZ (@DevilsZ)
Hi Jim,
Thank you again. Well, both the separate-arrays approach and concatenation seem useful and something I should know. For this case the separate arrays look OK, but the other approach could be used in other cases. Thanks, I noted these.
is throwing away events that have two good muons plus some noise that was misreconstructed as a muon.
Yes, true, thank you for this attention. Fortunately this time I only have good muons (electrons) in my rootfile, maybe due to a derivation step. But this is just a lucky case, so I will keep this in mind for future cases.
Finally, many thanks for your prompt replies and kind support!
Date: 2021-03-20 18:35:23 From: Jim Pivarski (@jpivarski)
After some experience with this, I'm leaning more and more toward recommending Numba. Unlike JAX and PyPy, Numba's compiler behaves more like a traditional C compiler, except that the language is a subset of Python. "Subset" is the important word, though. Search for "Numba supported features," as well as examples, and start from a small, do-one-thing-only function, because it will be most likely to compile. It has to get the types right, and Python is not very verbose about types. Then, the features that work in Awkward Array in Numba are simple iteration and extraction of fields. The code will look like a for loop in C, unlike the ak.combinations solution. For what it's worth, I'm working on Lorentz vectors in Numba right now, but they aren't ready yet. You'll have to compute delta R manually.
Date: 2021-03-20 18:43:45 From: Jim Pivarski (@jpivarski)
Another good technique is to use Numba to make arrays of integers or booleans that can be used to slice an array, mixing the two formalisms.
Date: 2021-03-23 16:22:21 From: Simon B. (@sbuse)
Hello everybody, I came across a behavior that I don't understand. Can someone have a look and tell me how to fix it?
print(ak.__version__)
A = ak.Record({"a":[1.1, 2.2, 3.3,4.4], "b":[0], "c":[4.4, 5.5]})
A.a
def f(b,switch=True):
if switch:
b["a"] = b.a[b.a<3]
else:
b["a"] = b.a[b.a>1]
return b
x = f(A)
print(x.a)
y = f(A,switch=False)
print(y.a)
results in:
1.2.0rc2
[1.1, 2.2]
[1.1, 2.2]
It looks like the record A gets modified by the function call, but I thought that a copy would have been created within the function.
Date: 2021-03-23 16:25:13 From: Jim Pivarski (@jpivarski)
The first one changed the array in-place, so the second one doesn't have the extra items you're expecting.
Date: 2021-03-23 16:26:11 From: Jim Pivarski (@jpivarski)
>>> b = ak.Record({"a":[1.1, 2.2, 3.3,4.4], "b":[0], "c":[4.4, 5.5]})
>>> b.a[b.a < 3]
<Array [1.1, 2.2] type='2 * float64'>
>>> b.a[b.a > 1]
<Array [1.1, 2.2, 3.3, 4.4] type='4 * float64'>
>>> b["a"] = b.a[b.a < 3]
>>> b["a"]
<Array [1.1, 2.2] type='2 * float64'>
>>> b.a[b.a > 1]
<Array [1.1, 2.2] type='2 * float64'>
Date: 2021-03-23 16:27:30 From: Simon B. (@sbuse)
for me it is just unexpected that changing b within the function changes A outside the function
Date: 2021-03-23 16:28:13 From: Jim Pivarski (@jpivarski)
In Python, all objects passed as function arguments are passed by reference.
Date: 2021-03-23 16:28:27 From: Henry Schreiner (@henryiii)
Everything in Python is passed "by ref"… Jim beat me to it. :)
Date: 2021-03-23 16:28:58 From: Henry Schreiner (@henryiii)
If it’s not mutable, it may seem like “value” passing, but it’s always by reference.
Date: 2021-03-23 16:30:34 From: Jim Pivarski (@jpivarski)
This b["a"] = ... is the only kind of mutability that Awkward Array supports. You can get an unconnected copy (something to modify independently) using c = ak.Array(b). They will share numerical array buffers, but those aren't mutable, so it's as independent of a copy as you want to be. (Even b["a"] = ... doesn't change numerical array buffers, it only replaces them and attaches the new one to the ak.Array Python object.)
Date: 2021-03-23 16:31:16 From: Simon B. (@sbuse)
perfect, i will try that. Thanks!
Date: 2021-03-23 16:41:54 From: Jim Pivarski (@jpivarski)
The difference (in brief) is just that ak.Array is iterable with a length; an ak.Record represents one item from a record array. That's why it has a different class. However, ak.Records can have fields that are ak.Arrays, as in your examples.
Date: 2021-03-31 09:16:23 From: Simon B. (@sbuse)
Is there an easy way to get from an array A of True/False values to an array B with the indices of the True positions?
A = ak.Array([[True,False,True],[False,True],[False,False,False]])
B = ak.Array([[0,2],[1],[]])
Date: 2021-03-31 12:19:40 From: Daniel Holmberg (@deinal)
ak.where is probably what you want to use. I just tried ak.where(A), but it doesn't work since "subarray lengths are not regular" (at least in awkward 1.0.2). As a workaround you could do
B = ak.where(A, ak.local_index(A), ak.broadcast_arrays(A, -1)[1])
B = B[B != -1]
print(B)
[[0, 2], [1], []]
Date: 2021-03-31 12:54:50 From: Jim Pivarski (@jpivarski)
I was thinking of nonzero, but where with one argument is the same as nonzero. Internally, the first step of slicing is to turn arrays of booleans into index positions, so there ought to be a way to get this functionality publicly. I'll look into it.
Date: 2021-03-31 12:55:08 From: Jim Pivarski (@jpivarski)
Meanwhile, this work-around looks like it will work.
Date: 2021-04-01 07:56:56 From: Simon B. (@sbuse)
Thanks guys!
Date: 2021-04-02 07:55:45 From: agoose77 (@agoose77:matrix.org)
Hi all. I'm considering an effort to eliminate my per-event top-level Python loop by moving it into numba, but one of the problems I have is where functions need to act on an entire regular-shaped array e.g. FFT deconvolution.
The data might have shape var * var * var * int64, but this is really n_events * n_waveforms_in_event * 512 * int64.
If it were just a per-waveform operation, I could flatten this along axis=-2, and then zero-copy convert it to NumPy before passing to Numba, but I need to retain the structure of the array inside jitted fn, which will need to operate on these data at the event level. I will also have other arrays that structurally align with these waveforms, e.g. n_events * n_waveforms_in_event * int64. Does anyone have any suggestions? All I can think of is either using objmode inside of Numba, or re-implementing a lot of awkward inside of numba, and passing in the offsets and underlying flat array.
Date: 2021-04-02 09:58:48 From: agoose77 (@agoose77:matrix.org)
☝️ Edit: Hi all. I'm considering eliminating my per-event top-level Python loop by moving it into Numba, but one of the sticking points is functions that need to act on an entire regular-shaped array e.g. FFT deconvolution.
One example is my peak-finding stage which runs over each event (first axis) and deconvolves the detector signals after baseline removal. The entire waveform array might have shape var * var * var * int64 when loaded into Awkward, but this is really n_events * n_waveforms_in_event * 512 * int64.
Currently the code looks something like
waveforms = ...
for event_waveforms in waveforms:
    correct_polarity(event_waveforms)
    subtract_baseline(event_waveforms)
    event_sources = deconvolve(event_waveforms)
    peaks = find_peaks(event_sources)

If it were just a per-waveform operation, I could flatten this along axis=-2, and then zero-copy convert it to NumPy before passing to Numba, but I need to retain the structure of the array inside jitted fn, which will need to operate on these data at the event level. I will also have other arrays that structurally align with these waveforms, e.g. n_events * n_waveforms_in_event * int64. Does anyone have any suggestions? All I can think of is either using objmode inside of Numba, or re-implementing a lot of awkward inside of numba, and passing in the offsets and underlying flat array.
Effectively what I want is something like apply_along_axis where I can apply this Numba fn along the -2 axis, and with the ability to convert the awkward array to a regular array.
All of this might be slightly moot because to FFT I probably need to use objmode anyway.
Date: 2021-04-02 13:22:45 From: Jim Pivarski (@jpivarski)
If it is actually regular (even if the list types are "var"), then ak.to_numpy will convert it.
Date: 2021-04-02 13:24:25 From: Jim Pivarski (@jpivarski)
For completeness, if you need to turn one dimension regular/irregular, there's ak.to_regular/ak.from_regular, though that's not really your case—you need the whole thing to be regular to get it into Numba as a NumPy array.
Date: 2021-04-02 13:26:23 From: Jim Pivarski (@jpivarski)
Also, objmode in Numba is like not using Numba. It exists so that the dividing line between compiled and non-compiled doesn't have to be one-way: you can, for instance, have a fast loop go into a rare condition using
Date: 2021-04-02 13:27:08 From: Jim Pivarski (@jpivarski)
with numba.objmode: and then come back out. As long as you spend little time there, it doesn't impact the total time budget.
Date: 2021-04-02 13:28:40 From: Jim Pivarski (@jpivarski)
Since you need FFT, you should first find out if such a function is available in lowered Numba. There's a project that's converting all of the SciPy functions over. There's also a Numba Gitter and
Date: 2021-04-02 13:28:50 From: Jim Pivarski (@jpivarski)
Discourse.
Date: 2021-04-02 13:39:04 From: agoose77 (@agoose77:matrix.org)
Unfortunately it is not regular, the inner dimension varies in size.
Yes, that's effectively why I'll probably need objmode for this particular use case. It looks like there's not an option for numba right now, unless I call into the underlying scipy routines with FFI.
Date: 2021-04-02 14:04:17 From: Jim Pivarski (@jpivarski)
More options: ak.pad_none to make the inner dimension a consistent size, followed by ak.fill_none to fill the "None" values with something that could be meaningful (0?).
Date: 2021-04-02 14:07:42 From: agoose77 (@agoose77:matrix.org)
Interesting, what does that do under the hood w.r.t. allocation?
Date: 2021-04-02 14:11:37 From: Jim Pivarski (@jpivarski)
In an array structure, if a buffer needs to change, it is replaced with a new buffer, but most of the buffers in the input can be reused in the output, so they are. In the example (way above) with a ListOffsetArray containing an "offsets" index and a "content," a calculation such as a ufunc that would change the "content" can reuse the same "offsets." Similar things are done with padding and filling. The padded array reuses the content but introduces a new IndexedOptionArray that specifies missing values with -1 in the index. Filling has to copy the content into a new buffer, which can be done because now we know the structure of what is being filled.
Date: 2021-04-02 14:22:56 From: agoose77 (@agoose77:matrix.org)
That's ingenious, and excellent news.
Date: 2021-04-02 20:26:01 From: Angus Hollands (@agoose77:matrix.org)
What is the advised way to call a numpy function that doesn't modify the structure of an awkward array, but also doesn't have an awkward implementation, e.g. searchsorted? So far my best guess is flatten -> transform -> unflatten
Date: 2021-04-02 21:26:39 From: Jim Pivarski (@jpivarski)
Ideally, we'll need a default __array_function__ to cover all of these cases, and then it would just work: https://github.com/scikit-hep/awkward-1.0/issues/630 For now: flattening, transforming, and unflattening, or possibly getting the underlying data from the layout, transforming, and rebuilding (the low-level equivalent of flattening and unflattening).
Date: 2021-04-02 21:31:38 From: Angus Hollands (@agoose77:matrix.org)
Thanks Jim, any pointers in the docs / examples I can refer to for the low-level route?
Date: 2021-04-02 21:34:02 From: Angus Hollands (@agoose77:matrix.org)
Whilst looking at this I went the route of an array-builder with numba, and I hit an error with what I suspect is type unification:
import awkward as ak
import numba as nb
import numpy as np
jit = nb.jit
@jit
def _searchsorted_outer(b, a, v, depth):
    for x in v:
        b.begin_list()
        _searchsorted_impl(b, a, x, depth - 1)
        b.end_list()

@jit
def _searchsorted_inner(b, a, v):
    for z in v:
        y = np.searchsorted(a, z)
        b.append(y)

@jit
def _searchsorted_impl(b, a, v, depth):
    if depth > 1:
        _searchsorted_outer(b, a, v, depth)
    else:
        _searchsorted_inner(b, a, v)

@jit
def searchsorted(b, a, v):
    _searchsorted_impl(b, a, v, v.ndim)

channels = ak.from_iter([[0], [2, 3], [8], [9, 20]])
index = np.array([0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
builder = ak.ArrayBuilder()
searchsorted(builder, index, channels)
builder.snapshot()

When I change to nopython with jit = nb.njit, the pipeline seems to struggle with the fact that the innermost dimension yields values rather than arrays (Invalid use of getiter with parameters (int64)). Although this is a Numba error, I wonder if it's something anyone here has seen before. I can't think of a way to hint to Numba that this isn't actually happening (like with an assert in TypeScript)
Date: 2021-04-02 21:37:59 From: Jim Pivarski (@jpivarski)
I talk about type unification here: https://youtu.be/X_BJrmofRWQ?t=2614
Date: 2021-04-02 21:40:56 From: Jim Pivarski (@jpivarski)
As for breaking down an array into low-level components, the .layout attribute of an ak.Array gives you a tree of objects that are all subclasses of ak.layout.Content. You can navigate through these to get down to the ak.layou.NumpyArray objects, which are castable as NumPy arrays (np.asarray(that_object)), and if the operation you're performing doesn't change its length, you can re-use most of the layout to build up a new tree and wrap it as an ak.Array.
That's what most of the Awkward operations are doing—adding a default would be to find a way to do it automatically for NumPy functions we haven't explicitly handled.
Date: 2021-04-02 21:43:42 From: Angus Hollands (@agoose77:matrix.org)
Ah, thinking about it again I'm confident that this is a basic unification issue rather than unexpected behaviour - I'll drop in the numba channel.
Date: 2021-04-02 21:45:28 From: Angus Hollands (@agoose77:matrix.org)
Thanks for this - so this would be if I'm calling my implementation from pure Python, right? and within numba I'd need something else like the code above. Gotcha.
Date: 2021-04-02 21:46:27 From: Jim Pivarski (@jpivarski)
This would be outside of Numba, though breaking something down into NumPy arrays and passing the NumPy arrays into a Numba-compiled function would work.
Date: 2021-04-02 21:48:21 From: Angus Hollands (@agoose77:matrix.org)
Fab, that matches my understanding 🙂
Date: 2021-04-02 21:49:31 From: Angus Hollands (@agoose77:matrix.org)
Interestingly, your same talk proposes an alternative to just code-gen the implementation. Which, with a simple caching scheme, would be quite feasible too 🥳
Date: 2021-04-03 08:21:03 From: Angus Hollands (@agoose77:matrix.org)
Before I open a PR; I think the axis parameter on ak.flatten should default to None, so that in the general case all arrays are completely flattened. This surprised me today with some old code that didn't provide the axis and was flattening a regular Awkward Array. My rationale is that where users want to flatten w.r.t. a specific axis, that axis should be specified as it is for any regular numpy function taking axes. Special casing axis=1 assumes a particular preferred dimension otherwise. Does anyone have any strong feelings on this?
Date: 2021-04-03 21:10:20 From: Jim Pivarski (@jpivarski)
In functional programming environments (languages like LISP and Scala or frameworks like Spark), some functions have common meanings across environments, such as "map," "reduce," and "filter," but also "flatten." In all the ones I've seen, "flatten" has always meant one level only, and that one-level flattening with "map" is so important that it often gets a special name: "flatmap." (And that operation is one of the two most fundamental to monads, as a side note.) Coming from one of those environments, it would be very surprising if flatten flattened more than one level.
Also, flattening with axis=None is a bit dangerous because it also flattens records. If you have records with "pt," "eta," and "phi" (and maybe you forgot to extract one), flattening with axis=None would mix them all in the output, possibly a histogram. If they're at different scales, it might be hard to notice in the histogram. On the other hand, if you were expecting complete flattening and didn't pass an argument, you get an error message instead of wrong answers.
Date: 2021-04-05 15:34:54 From: agoose77 (@agoose77:matrix.org)
Well, having watched a few of your talks, I can see that there is consistency here. I can't make a good case for setting axis=None besides "flatten means flat", which is clearly me drawing a parallel with ravel().
Date: 2021-04-05 15:37:38 From: Jim Pivarski (@jpivarski)
Maybe there should be an ak.ravel that does this—generalizing from the NumPy version by flattening variable-length structures, but refusing to flatten records because that's the dangerous case. It could overload np.ravel so that np.ravel(awk_array) would return a flat, contiguous array. We could even have the output be of np.ndarray type (unlike what ak.flatten does).
Date: 2021-04-05 15:39:00 From: agoose77 (@agoose77:matrix.org)
Yes, np.ravel also does not flatten the (top level) record structure. I think that might be the right solution here.
Date: 2021-04-05 15:40:29 From: agoose77 (@agoose77:matrix.org)
Incidentally, how can one "simplify" the layout of a sliced RecordArray so that it can be flattened? When I call flatten on this
<IndexedArray64>
<index><Index64 i="[5 7 8 17 31 33 34 52 61 65 ... 9950 9953 9956 9958 9971 9976 9977 9991 9992 9993]" offset="0" length="1932" at="0x7fccfa51f390"/></index>
<content><RecordArray>
<field index="0" key="x">
<ListOffsetArray64>
<offsets><Index64 i="[0 182 382 562 737 920 1084 1258 1401 1560 ... 1798744 1798905 1799098 1799231 1799406 1799594 1799746 1799925 1800098 1800284]" offset="0" length="10001" at="0x7fccfb8372c0"/></offsets>
<content><NumpyArray format="d" shape="1800284" data="120.487 120.526 120.37 120.723 120.281 ... 121.714 125.143 114.857 121.714 114.857" at="0x7fccf0000050"/></content>
</ListOffsetArray64>
</field>
<field index="1" key="y">
<ListOffsetArray64>
<offsets><Index64 i="[0 182 382 562 737 920 1084 1258 1401 1560 ... 1798744 1798905 1799098 1799231 1799406 1799594 1799746 1799925 1800098 1800284]" offset="0" length="10001" at="0x7fccfb84ab80"/></offsets>
<content><NumpyArray format="d" shape="1800284" data="119.143 120.857 122.571 117.429 124.286 ... 213.429 213.429 215.143 215.143 216.857" at="0x7fccf0dbc360"/></content>
</ListOffsetArray64>
</field>
<field index="2" key="z">
<ListOffsetArray64>
<offsets><Index64 i="[0 182 382 562 737 920 1084 1258 1401 1560 ... 1798744 1798905 1799098 1799231 1799406 1799594 1799746 1799925 1800098 1800284]" offset="0" length="10001" at="0x7fccf1b78670"/></offsets>
<content><NumpyArray format="d" shape="1800284" data="66.0266 66.193 66.0262 66.0254 66.027 ... 76.9452 77.4641 55.6267 77.2068 54.852" at="0x7fcce8dbc360"/></content>
</ListOffsetArray64>
</field>
</RecordArray></content>
</IndexedArray64>
it fails because it is an array of records
Date: 2021-04-05 15:41:20 From: Angus Hollands (@agoose77:matrix.org)
The only way I can see thus far is to zip the flattened xyz arrays.
Date: 2021-04-05 15:42:15 From: Jim Pivarski (@jpivarski)
axis=None disregards all boundaries:
>>> ak.flatten(ak.Array([[{"x": [1, 2], "y": 3.3}], [], [{"x": [4], "y": 5.5}]]), axis=None)
<Array [1, 2, 4, 3.3, 5.5] type='5 * float64'>
Date: 2021-04-05 15:42:37 From: Jim Pivarski (@jpivarski)
Ah, you want to not disregard the record field boundaries, right?
Date: 2021-04-05 15:43:25 From: agoose77 (@agoose77:matrix.org)
Yes, I would like to retain the record top-level structure, but flatten the contents.
Date: 2021-04-05 15:46:59 From: Jim Pivarski (@jpivarski)
And you have a special case in which each field has the same structure (all list lengths are the same across all fields)?
Date: 2021-04-05 15:47:28 From: Jim Pivarski (@jpivarski)
Without that condition, it's not an operation that has a meaningful result.
Date: 2021-04-05 15:49:55 From: Jim Pivarski (@jpivarski)
>>> record_of_arrays = ak.Array([{"x": [1, 2], "y": [1.1, 2.2]}, {"x": [3], "y": [3.3]}])
>>> record_of_arrays
<Array [{x: [1, 2], y: [1.1, ... 3], y: [3.3]}] type='2 * {"x": var * int64, "y"...'>
>>> ak.unzip(record_of_arrays)
(<Array [[1, 2], [3]] type='2 * var * int64'>, <Array [[1.1, 2.2], [3.3]] type='2 * var * float64'>)
>>> [ak.flatten(x, axis=1) for x in ak.unzip(record_of_arrays)]
[<Array [1, 2, 3] type='3 * int64'>, <Array [1.1, 2.2, 3.3] type='3 * float64'>]
>>> dict(zip(ak.fields(record_of_arrays), [ak.flatten(x, axis=1) for x in ak.unzip(record_of_arrays)]))
{'x': <Array [1, 2, 3] type='3 * int64'>, 'y': <Array [1.1, 2.2, 3.3] type='3 * float64'>}
>>> ak.zip(dict(zip(ak.fields(record_of_arrays), [ak.flatten(x, axis=1) for x in ak.unzip(record_of_arrays)])))
<Array [{x: 1, y: 1.1}, ... {x: 3, y: 3.3}] type='3 * {"x": int64, "y": float64}'>
Date: 2021-04-05 15:51:59 From: Angus Hollands (@agoose77:matrix.org)
Yes indeed, the ravelled dimensions match!
Date: 2021-04-05 15:52:09 From: Angus Hollands (@agoose77:matrix.org)
So a manual unzip-flatten-zip it is then 🙂 Thanks Jim.
Date: 2021-04-05 15:52:53 From: Jim Pivarski (@jpivarski)
That's right. Zipping and unzipping are not expensive operations, and with ak.fields, you don't have to know all the field names or explicitly write them down.
Date: 2021-04-06 13:00:16 From: Angus Hollands (@agoose77:matrix.org)
What's the most elegant way to "compact" the memory (via a copy) of an array? E.g., I have a Record that was taken from a RecordArray. I want to create a compact copy so I don't keep the full buffers of the RecordArray in memory (and also, so that I can call to_numpy on the array).
Date: 2021-04-06 13:00:23 From: Angus Hollands (@agoose77:matrix.org)
The layout in this case is
<Record at="7">
<RecordArray>
<field index="0" key="charge">
<ListOffsetArray64>
<offsets><Index64 i="[0 182 382 562 737 920 1084 1258 1401 1560 ... 16662 16839 17013 17188 17367 17502 17679 17852 18033 18210]" offset="0" length="101" at="0x00000cee8430"/></offsets>
<content><NumpyArray format="d" shape="18210" data="5.57846 4.82189 6.69512 6.88845 7.85191 ... 0.0704975 0.0589356 0.0621146 0.108709 0.0730918" at="0x00000db0ae30"/></content>
</ListOffsetArray64>
</field>
<field index="1" key="x">
<ListOffsetArray64>
<offsets><Index64 i="[0 182 382 562 737 920 1084 1258 1401 1560 ... 16662 16839 17013 17188 17367 17502 17679 17852 18033 18210]" offset="0" length="101" at="0x00000c2baad0"/></offsets>
<content><NumpyArray format="d" shape="18210" data="120.487 120.526 120.37 120.723 120.281 ... 114.857 114.857 114.857 114.857 121.714" at="0x00000cf60690"/></content>
</ListOffsetArray64>
</field>
<field index="2" key="y">
<ListOffsetArray64>
<offsets><Index64 i="[0 182 382 562 737 920 1084 1258 1401 1560 ... 16662 16839 17013 17188 17367 17502 17679 17852 18033 18210]" offset="0" length="101" at="0x00000b8db0a0"/></offsets>
<content><NumpyArray format="d" shape="18210" data="119.143 120.857 122.571 117.429 124.286 ... 215.143 218.571 218.571 218.571 218.571" at="0x00000dac3bb0"/></content>
</ListOffsetArray64>
</field>
<field index="3" key="z">
<ListOffsetArray64>
<offsets><Index64 i="[0 182 382 562 737 920 1084 1258 1401 1560 ... 16662 16839 17013 17188 17367 17502 17679 17852 18033 18210]" offset="0" length="101" at="0x00000c2fca50"/></offsets>
<content><NumpyArray format="d" shape="18210" data="66.0266 66.193 66.0262 66.0254 66.027 ... 82.9237 83.9219 80.545 77.9856 61.8755" at="0x00000dae74f0"/></content>
</ListOffsetArray64>
</field>
</RecordArray>
</Record>
Date: 2021-04-06 13:29:00 From: Jim Pivarski (@jpivarski)
You're in need of this missing feature: https://github.com/scikit-hep/awkward-1.0/issues/746
The work-around at the moment is to turn it into an Arrow Array and back again.
Date: 2021-04-06 13:29:41 From: Jim Pivarski (@jpivarski)
I don't know if an ak.Record can become an Arrow Array (since it's a scalar), but a one-element record array can be. Then take the only element from that.
Date: 2021-04-06 13:29:58 From: Jim Pivarski (@jpivarski)
ak.packed is what the not-yet-existing function will be called.
Date: 2021-04-06 13:31:33 From: Angus Hollands (@agoose77:matrix.org)
Ah, I knew it sounded familiar, that's exactly it. Thanks Jim 🙂 It's not too tricky to special case this for now until I have more time to contribute :)
Date: 2021-04-06 15:20:59 From: Angus Hollands (@agoose77:matrix.org)
Are there any pre-existing mixins that treat records as ND arrays (just with named columns)? I.e., a position RecordArray that behaves like a 2D numpy array (and therefore supports subtraction and addition by a constant or array)?
Date: 2021-04-06 15:29:55 From: Jim Pivarski (@jpivarski)
Interestingly, RecordArray used to pass all ufuncs down, but that was removed because when people are overloading RecordArrays to mean things like position, naive addition would be wrong on non-Cartesian coordinates and then if the overload fails, the answer would be wrong without errors.
RecordArrays convert into NumPy structured arrays in ak.to_numpy, and these support (naive) addition.
But if you're specifically interested in vectors, you'll probably want to use the new Vector library, which is a suite of behaviors that do addition, subtraction, etc. in non-naive ways for many coordinate systems. It's a vector library for only 2D, 3D (Euclidean), and 4D (Minkowski/Lorentz) spaces.
In general, there aren't tools for turning record fields into list items, which would essentially be turning columns into rows (like Pandas's stack/unstack). @nsmith- has mentioned that this sort of thing can be useful. Without a specialized function, it can be hacked by creating a length-1 list for each field with np.newaxis and concatenating those length-1 lists at their appropriate axis (ak.concatenate). This assumes that the types of all fields are compatible (if not, you'll get a union array). In principle, you should get a regular axis, but ak.concatenate might not realize that and give you an axis of type var.
Date: 2021-04-06 15:35:44 From: Angus Hollands (@agoose77:matrix.org)
A clear answer as ever Jim 😂
Date: 2021-04-06 15:37:23 From: Angus Hollands (@agoose77:matrix.org)
Yes, effectively I want to do a zero-cost np.stack(). I had been avoiding named components for a similar reason: numpy's recarray effectively prevents the user from doing this kind of thing unless they first take a pure float view. This makes sense given the possibility of heterogeneous data types, but it's not always the most user friendly 🤔
Date: 2021-04-06 15:37:54 From: Angus Hollands (@agoose77:matrix.org)
I did see the vector library, and I plan to use it. I was hoping to avoid installing another git-based dependency, but maybe I'll give it a chance 😛
Date: 2021-04-06 15:39:15 From: Jim Pivarski (@jpivarski)
A stack would not be zero cost, actually. Record fields ("columns") are separate buffers, but the data in a list ("rows," even if they're jagged rows) are contiguously in a single buffer, so that means a copy will be necessary. In the work-around I described above, ak.concatenate would be doing that copy.
Date: 2021-04-06 15:49:44 From: Angus Hollands (@agoose77:matrix.org)
Ah, to clarify, I mean that it would be nice to have some kind of zero-cost abstraction over the fact that these are separate buffers. Even numpy doesn't let you do this as it assumes your underlying storage is one buffer, so it's definitely a wishlist item vs a missing feature 🙂
Date: 2021-04-06 15:54:40 From: Angus Hollands (@agoose77:matrix.org)
Despite not using it much yet, I think the behaviour API in awkward is something I've been wanting for a while; a way to treat arrays (of records) and their items (records) as objects. In the pure-numpy world, doing this is non-trivial and means I've tended toward wide arrays (e.g. [x, y, z, q, ...] columns) instead of nested deep ones ({'position': [x, y, z], 'charge': q}). It was a particularly nifty idea to leverage the subscript operator for behaviour definitions; not many libraries take advantage of that syntax
Date: 2021-04-06 16:36:48 From: Jim Pivarski (@jpivarski)
Just due to the layout of RecordArray and ListOffsetArray, any conversions that merge fields of a RecordArray will have to copy data. That's not something that can change as a future feature (without changing those layouts, which I don't think will ever happen; they're too basic).
Date: 2021-04-07 09:44:40 From: Angus Hollands (@agoose77:matrix.org)
If I have a 1D array from uproot as an Array, what is the best way to reshape it (if I know that it is a flattened representation of a square array)? I could either unflatten and generate the counts, or wrap-unwrap with ak.to_numpy, but I wondered if there were a way that I could modify the underlying shape attribute (in a new copy, given immutability) of the NumpyArray?
Date: 2021-04-07 13:23:21 From: Jim Pivarski (@jpivarski)
Unflattening and to/from NumPy are the right ways to do that. Actually, as long as the NumPy conversion didn't involve a copy, you can use it as a backdoor to changing the array in place: https://awkward-array.org/how-to-convert-numpy.html#mutability-of-awkward-arrays-from-numpy
Date: 2021-04-12 15:10:47 From: Lukas (@lukasheinrich)
I posted a Q&A to the github discussions (https://github.com/scikit-hep/awkward-1.0/discussions/818) but maybe actually it's better for chat (since I guess the answer is trivial) .. after setting parameters to introduce named records a la Point is it possible to re-instantiate an existing awkward array to pick up those behaviors?
Date: 2021-04-12 15:13:47 From: Angus Hollands (@agoose77:matrix.org)
Do you mean assign a behavior to an existing array?
Date: 2021-04-12 15:21:39 From: Lukas (@lukasheinrich)
yeah.. once I set parameters any type of indexing picks it up correctly e.g. array[0].. except for the top-level object where only array[:] seems to pick it up
Date: 2021-04-12 15:56:22 From: Jim Pivarski (@jpivarski)
Right, the problem is that the top-level, which is the only level that instantiates itself as a Python class, has already instantiated itself. You can pass an ak.Array as the only argument to the ak.Array constructor, so that would be a quick way to get it to notice that some new behaviors have been defined and it should use the new subclass.
Date: 2021-04-12 15:58:43 From: Jim Pivarski (@jpivarski)
I haven't caught up to this GitHub Discussion in my email yet. Is this something that you could self-answer there, @lukasheinrich?
Date: 2021-04-12 16:50:18 From: Lukas (@lukasheinrich)
ok - I will self-answer.. seems like a trivial slice or a new construction (which I gather is zero-copy) is the solution
Date: 2021-04-13 12:51:42 From: elusian (@elusian:mozilla.org)
Hi all. I'm using uproot and awkward to analyse a TTree. One of the arrays I have has shape N_events * var * float32. I also have a second array of shape N_events * int32 which I would like to use to index one float per event and get an array of shape N_events * float32. In code, this means
>>> a = ak.from_iter([[0., 1., 2., 3.], [4., 5.], [6., 7.], [8., 9., 10., 11.], [12., 13.]])
>>> index = ak.from_iter([2, 0, 1, 3, 0])
>>> something(a, index)
<Array [2, 4, 7, 11, 12] type='5 * float64'>
a[index] is wrong, as that selects on the first dimension. Many examples in the documentation show how to do this kind of nested indexing, but all of them work when the index innermost dimension is variable, while mine isn't.
>>> a = ak.from_iter([[0., 1., 2., 3.], [4., 5.], [6., 7.], [8., 9., 10., 11.], [12., 13.]])
>>> index = ak.from_iter([2, 0, 1, 3, 0])[:, np.newaxis]
>>> index
<Array [[2], [0], [1], [3], [0]] type='5 * 1 * int64'>
>>> index2 = ak.from_iter([[2], [0], [1], [3], [0]])
>>> index2
<Array [[2], [0], [1], [3], [0]] type='5 * var * int64'>
>>> a[index2]
<Array [[2], [4], [7], [11], [12]] type='5 * var * float64'>
>>> a[index]
<Array [[[6, 7]], [[0, 1, 2, 3]], [[4, 5]], [[8, 9, 10, 11]], [[0, 1, 2, 3]]] type='5 * 1 * var * float64'>
I'm probably missing something obvious, but is there any way to do this for a fixed dimension array? Thanks in advance
Date: 2021-04-13 13:18:57 From: Jim Pivarski (@jpivarski)
Do you mean this?
>>> a[np.arange(len(a)), index]
<Array [2, 4, 7, 11, 12] type='5 * float64'>
Like NumPy advanced indexing in multiple dimensions, but jagged.
Date: 2021-04-13 13:39:58 From: elusian (@elusian:mozilla.org)
Thank you, this does work
Date: 2021-04-22 11:05:20 From: Angus Hollands (@agoose77:matrix.org)
I'm looking at the broadcasting behaviour of Awkward, and as an aside noticed that
ak.Array(
[
np.r_[0, 1, 0],
np.r_[1, 0, 0],
np.r_[0, 0, 1],
]
)
has the layout 3 * var * int64. I can see that the inner dimension is constant; is it just that there is no constraint to indicate that the rows of these inner arrays are the same length? I.e., it cannot be shown without iterating over the rows and taking the length, so it's not recorded as such in the layout?
Date: 2021-04-22 11:10:47 From: Angus Hollands (@agoose77:matrix.org)
Additionally, I don't get the result I expect when I evaluate this matrix multiplication
transform = ak.Array(
[
[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
]
)
vector = ak.from_numpy(np.c_[[4, 5, 6]])
result = transform @ vector
The layout of the result indicates that the underlying array has the wrong length:
<ListOffsetArray64>
<offsets><Index64 i="[0 3 6 9]" offset="0" length="4" at="0x0000109bb690"/></offsets>
<content><NumpyArray format="l" shape="1" data="15" at="0x000011236120"/></content>
</ListOffsetArray64>
Date: 2021-04-22 11:37:40 From: Angus Hollands (@agoose77:matrix.org)
Raised an issue here :)
Date: 2021-04-22 13:00:57 From: Angus Hollands (@agoose77:matrix.org)
and added a PR to test for the bug.
Date: 2021-04-22 14:10:23 From: Jim Pivarski (@jpivarski)
This is because the NumPy arrays are nested within a Python list. It's the top-level type that determines which constructor signature is going to apply. Any NumPy arrays encountered within a variable-length list are treated as general iterables, and that's why you get the type (not layout) 3 * var * int64.
Date: 2021-04-22 14:15:44 From: Angus Hollands (@agoose77:matrix.org)
Fab, that was my understanding too given the documentation 🙂
Date: 2021-04-22 14:38:08 From: Jim Pivarski (@jpivarski)
Hmmm. With all variable-length arrays, it's fine:
>>> np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]]) @ np.array([[4], [5], [6]])
array([[5],
[4],
[6]])
>>> ak.Array([[0, 1, 0], [1, 0, 0], [0, 0, 1]]) @ ak.Array([[4], [5], [6]])
<Array [[5], [4], [6]] type='3 * var * int64'>
but with a fixed-length vector...
>>> ak.from_numpy(np.c_[[4, 5, 6]])
<Array [[4], [5], [6]] type='3 * 1 * int64'>
>>> ak.Array([[0, 1, 0], [1, 0, 0], [0, 0, 1]]) @ ak.from_numpy(np.c_[[4, 5, 6]])
...
ValueError: in ListOffsetArray64 attempting to get 0, offsets[i] != offsets[i + 1] and offsets[i + 1] > len(content)
Date: 2021-04-22 14:40:27 From: Angus Hollands (@agoose77:matrix.org)
That's the traceback that I see too.
Date: 2021-04-22 14:40:30 From: Jim Pivarski (@jpivarski)
This matrix-multiplication is the only operation implemented in Numba (to get it done quickly; I wasn't sure if it would ever be used). The idea is that you can make big arrays of matrices and vectors and multiply them all, even if they have different dimensions (the dimensions of each matrix have to match the dimensions of each vector).
Date: 2021-04-22 14:40:59 From: Angus Hollands (@agoose77:matrix.org)
I recall reading that, it's a cool idea to get-things-done-fast(er) :)
Date: 2021-04-22 14:41:19 From: Angus Hollands (@agoose77:matrix.org)
I'm not currently using matmul; I was trying to implement a broadcasting dot product which I've managed now :)
Date: 2021-04-22 14:41:47 From: Angus Hollands (@agoose77:matrix.org)
That doesn't solve the fact that there is a bug, but I'm not quite ready to look at solving it yet :/
Date: 2021-04-22 14:42:55 From: Jim Pivarski (@jpivarski)
Actually, the exception is only raised when trying to print it out. The multiplication happens without raising any errors, but it makes an invalid result.
>>> tmp = ak.Array([[0, 1, 0], [1, 0, 0], [0, 0, 1]]) @ ak.from_numpy(np.c_[[4, 5, 6]])
>>> # no errors yet
Date: 2021-04-22 14:43:26 From: Angus Hollands (@agoose77:matrix.org)
Sure, it only has a length of 1 in the underlying buffer it seems
Date: 2021-04-22 14:43:26 From: Angus Hollands (@agoose77:matrix.org)
but requests i>0 offsets
Date: 2021-04-22 14:45:33 From: Jim Pivarski (@jpivarski)
Using the variable-length example, which is correct (because it agrees with NumPy, and it looks like the right result when done by hand), the layout ought to be
>>> (ak.Array([[0, 1, 0], [1, 0, 0], [0, 0, 1]]) @ ak.Array([[4], [5], [6]])).layout
<ListOffsetArray64>
<offsets><Index64 i="[0 1 2 3]" offset="0" length="4" at="0x55ed36a43620"/></offsets>
<content><NumpyArray format="l" shape="3" data="5 4 6" at="0x55ed3688d1e0"/></content>
</ListOffsetArray64>
but it is
>>> (ak.Array([[0, 1, 0], [1, 0, 0], [0, 0, 1]]) @ ak.from_numpy(np.c_[[4, 5, 6]])).layout
<ListOffsetArray64>
<offsets><Index64 i="[0 3 6 9]" offset="0" length="4" at="0x55ed36855e10"/></offsets>
<content><NumpyArray format="l" shape="1" data="15" at="0x55ed36a5c9d0"/></content>
</ListOffsetArray64>
So both the offsets and the content are wrong.
Date: 2021-04-22 14:47:03 From: Angus Hollands (@agoose77:matrix.org)
Hmm, quite. I would assume it's in the preparation code that determines the results shapes based on that
Date: 2021-04-22 14:49:41 From: Jim Pivarski (@jpivarski)
The fixed-length arrays aren't even entering the matmul function. Somehow, it's not being recognized as matrix multiplication, and if any other code path takes over, it's almost certainly going to be wrong.
Date: 2021-04-22 15:10:27 From: Jim Pivarski (@jpivarski)
This does it:
% git diff
diff --git a/src/awkward/_connect/_numpy.py b/src/awkward/_connect/_numpy.py
index 1b19ac4..e3f993c 100644
--- a/src/awkward/_connect/_numpy.py
+++ b/src/awkward/_connect/_numpy.py
@@ -290,6 +290,11 @@ matmul_for_numba.numbafied = None
def getfunction_matmul(inputs):
+ inputs = [
+ ak._util.recursively_apply(
+ x, (lambda _: _), pass_depth=False, numpy_to_regular=True
+ )
+ for x in inputs
+ ]
+
if len(inputs) == 2 and all(
isinstance(x, ak._util.listtypes)
and isinstance(x.content, ak._util.listtypes)
diff --git a/src/awkward/_util.py b/src/awkward/_util.py
index 3d7fc34..390264e 100644
--- a/src/awkward/_util.py
+++ b/src/awkward/_util.py
@@ -563,7 +563,10 @@ def broadcast_and_apply( # noqa: C901
):
return False
elif isinstance(x, ak.layout.RegularArray):
- my_offsets = nplike.arange(0, len(x.content), x.size)
+ if x.size == 0:
+ my_offsets = nplike.empty(0, dtype=np.int64)
+ else:
+ my_offsets = nplike.arange(0, len(x.content), x.size)
if offsets is None:
offsets = my_offsets
elif not nplike.array_equal(offsets, my_offsets):
%
%
% python
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import awkward as ak
>>> import numpy as np
>>> ak.Array([[0, 1, 0], [1, 0, 0], [0, 0, 1]]) @ ak.Array([[4], [5], [6]])
<Array [[5], [4], [6]] type='3 * var * int64'>
>>> ak.Array([[0, 1, 0], [1, 0, 0], [0, 0, 1]]) @ ak.from_numpy(np.c_[[4, 5, 6]])
<Array [[5], [4], [6]] type='3 * var * int64'>
Have you started a PR? Should I try to put these two bug-fixes into it? I haven't looked closely at your tests.
Date: 2021-04-22 15:13:00 From: Angus Hollands (@agoose77:matrix.org)
I have started a PR which is the one linked earlier - currently just two very dumb tests
Date: 2021-04-22 15:13:22 From: Angus Hollands (@agoose77:matrix.org)
They don't test much variation in input; just the case that one of the arrays is a numpy array
Date: 2021-04-22 15:13:28 From: Angus Hollands (@agoose77:matrix.org)
(well, an awkward.from_numpy equivalent)
Date: 2021-04-22 15:13:51 From: Angus Hollands (@agoose77:matrix.org)
I'm happy to push this as a fix to the branch to save you forking
Date: 2021-04-22 15:14:13 From: Jim Pivarski (@jpivarski)
I'll put what I have into your PR. One moment...
Date: 2021-04-22 15:14:40 From: Angus Hollands (@agoose77:matrix.org)
fab!
Date: 2021-04-22 15:26:20 From: Angus Hollands (@agoose77:matrix.org)
off-topic, I'm also trying to do something finicky in Awkward and getting a little stuck:
I have some indices in an array (j) which has the type 10000 * ?int64. I want to index into the second axis of another array directions which has type 10000 * var * 3 * float64. I seem to be unable to use directions[range(len(directions)), j] because j contains Nones. Any ideas?
Date: 2021-04-22 15:26:42 From: Angus Hollands (@agoose77:matrix.org)
High level, this is just a jagged lookup table.
Date: 2021-04-22 15:27:10 From: Jim Pivarski (@jpivarski)
I had to change a test, so be sure to check it.
Date: 2021-04-22 15:28:14 From: Angus Hollands (@agoose77:matrix.org)
Ugh, my mistake. Thanks for catching that.
Date: 2021-04-22 15:29:04 From: Jim Pivarski (@jpivarski)
Integer arrays with Nones are allowed as indexes. Oh, not in a NumPy-like each-tuple-item-is-a-dimension mode, but as Awkward nested indexes (because allowing Nones in an index array is an Awkward thing).
Date: 2021-04-22 15:29:44 From: Angus Hollands (@agoose77:matrix.org)
Yes, that's where I am at
Date: 2021-04-22 15:29:59 From: Jim Pivarski (@jpivarski)
Here's the question: do you want the Nones in the integer array to turn into Nones in the output, or do you want them to be ignored?
Date: 2021-04-22 15:31:39 From: Angus Hollands (@agoose77:matrix.org)
I was typing a long reply but I decided it was moot 🙂 In this particular case, the Nones correspond to entries in the indexed array that are empty (size 0)
Date: 2021-04-22 15:31:45 From: Angus Hollands (@agoose77:matrix.org)
I'd want them to be None
Date: 2021-04-22 15:33:03 From: Angus Hollands (@agoose77:matrix.org)
Relatedly, is ak.where(mask) supposed to support non-regular array convertible arrays? I am using ak.firsts(ak.local_index(mask)[mask]) as a work-around.
Date: 2021-04-22 15:34:03 From: Jim Pivarski (@jpivarski)
Okay, then you need to turn range(len(directions)), j into a single nested array by repeating the j's.
Date: 2021-04-22 15:35:47 From: Jim Pivarski (@jpivarski)
>>> j = ak.Array([5, None, 2])
>>> j[np.newaxis][np.zeros(3, np.int64)]
<Array [[5, None, 2], ... 2], [5, None, 2]] type='3 * 3 * ?int64'>
Date: 2021-04-22 15:38:04 From: Jim Pivarski (@jpivarski)
Apparently, it does:
>>> import awkward as ak
>>> ak.where(ak.Array([[True, False], [], [False, True]]), ak.Array([[1, 2], [], [3, 4]]), ak.Array([[1.1, 2.2], [], [3.3, 4.4]]))
<Array [[1, 2.2], [], [3.3, 4]] type='3 * var * float64'>
Date: 2021-04-22 15:38:45 From: Angus Hollands (@agoose77:matrix.org)
what happens for non "regular" arrays?
ak.where(
ak.Array(
[
[False, False, True],
[False],
[False, True],
[False, False, True],
[False, False, True],
[False],
[False, True, False],
[False, False, True],
[False, False, True],
[False],
]
)
)
Date: 2021-04-22 15:41:32 From: Jim Pivarski (@jpivarski)
My example was not regular, but it was the 3-argument form of ak.where. It could be that the one-argument form has not been generalized in this way. (Like np.where, 3-arguments and 1-argument do completely different things and should be viewed as distinct functions.)
Date: 2021-04-22 15:42:37 From: Jim Pivarski (@jpivarski)
Actually, considering what the one-argument form is supposed to do, i.e. be the same as np.nonzero, it's unclear how you can return a tuple of indexes for each dimension if they're irregular. Let me think...
Date: 2021-04-22 15:44:33 From: Jim Pivarski (@jpivarski)
I guess the output of this is supposed to be
(ak.Array([0, 2, 3, 4, 6, 7, 8]), ak.Array([2, 1, 2, 2, 1, 2, 2]))
so it may be possible in general.
Date: 2021-04-22 15:47:39 From: Jim Pivarski (@jpivarski)
Yes, it would be possible in general. It would be equivalent to (but faster than) padding to the longest list at each level:
>>> np.where(ak.to_numpy(ak.fill_none(ak.pad_none(bools, 3), False)))
(array([0, 2, 3, 4, 6, 7, 8]), array([2, 1, 2, 2, 1, 2, 2]))
Date: 2021-04-22 15:48:29 From: Angus Hollands (@agoose77:matrix.org)
ah, using pad_none was obvious in hindsight 😭
Date: 2021-04-22 15:49:04 From: Jim Pivarski (@jpivarski)
But actually implementing this so that pad_none isn't necessary should be a feature request.
Date: 2021-04-22 15:49:04 From: Angus Hollands (@agoose77:matrix.org)
Sorry, I didn't read the code entirely and missed that you were using the three arg form.
Date: 2021-04-22 15:54:30 From: Angus Hollands (@agoose77:matrix.org)
I'm not entirely sure what I would do with this , could you possibly elaborate?
Date: 2021-04-22 16:00:22 From: Jim Pivarski (@jpivarski)
I guess I don't know what you're trying to slice.
Date: 2021-04-22 16:06:39 From: Jim Pivarski (@jpivarski)
If you originally had
>>> want_to_slice = ak.Array([[1.1, 2.2], [3.3, 4.4, 5.5], [6.6, 7.7], [8.8, 9.9], [10, 11, 12, 13]])
>>> no_nones = ak.Array([1, 0, 0, 1, 1])
>>> want_to_slice[np.arange(len(want_to_slice)), no_nones]
<Array [2.2, 3.3, 6.6, 9.9, 11] type='5 * float64'>
but now have None values in the array you're using to slice,
>>> yes_nones = ak.Array([1, None, 0, None, 1])
then just attempting it gives you an error message saying that you can't mix the irregular-style and one-array-per-dimension style of slicing (because one-array-per-dimension assumes that they form a rectangle).
ValueError: cannot mix missing values in slice with NumPy-style advanced indexing
But you can achieve the same thing by turning your integers with Nones into a nested, variable length list:
>>> want_to_slice[ak.from_regular(yes_nones[:, np.newaxis])]
<Array [[2.2], [None], [6.6], [None], [11]] type='5 * var * ?float64'>
That's admittedly different from my first suggestion.
Date: 2021-04-22 16:07:00 From: Jim Pivarski (@jpivarski)
(Okay, now I've got to switch gears to Uproot. I hope this helps; bye!)
Date: 2021-04-22 16:12:10 From: Angus Hollands (@agoose77:matrix.org)
Thanks a bunch Jim, appreciate it.
Date: 2021-04-22 16:15:48 From: Angus Hollands (@agoose77:matrix.org)
That's clever - the from_regular drops the var broadcasting!
Date: 2021-04-22 16:39:12 From: Angus Hollands (@agoose77:matrix.org)
I've never seen the fact that NumPy actually broadcasts the advanced indices before, cool!
Date: 2021-04-22 16:43:14 From: Angus Hollands (@agoose77:matrix.org)
So, is this the "extension to NumPy indexing" mentioned for awkward0?
Date: 2021-04-22 16:46:57 From: Jim Pivarski (@jpivarski)
Slicing is described in detail here: https://awkward-array.readthedocs.io/en/latest/_auto/ak.Array.html#ak-array-getitem
The fact that Awkward 0 had a lot of corner cases that we couldn't get right was a major motivation for Awkward 1, which is a lot more complete. There were some slides at PyHEP 2019 about that.
Date: 2021-04-22 16:48:48 From: Angus Hollands (@agoose77:matrix.org)
Oh yikes, that looks like an important doc not to miss
Date: 2021-04-22 17:19:39 From: Angus Hollands (@agoose77:matrix.org)
For posterity if this chat log turns up, here's a brief demonstration of the integer indexing modes: https://nbviewer.jupyter.org/gist/agoose77/362d0b87a08c74cd3661223d171cd25e
The important note with the ak.from_regular case is that regular Awkward arrays behave as NumPy predicts: ND index arrays just describe the shape of the output and index against the zeroth axis. When the index array is irregular, indicated by var, the final case documented here applies: multi-dimensional indexing similar to take_along_axis.
Date: 2021-04-23 07:55:06 From: sterbini (@sterbini)
Hi, I started to explore the awkward package with the aim to store settings of accelerators at CERN as parquet files. Thanks to all the community for this nice tool! I have some questions, in particular I would like to know if I am using the tool correctly.
Below you can find an example of the python dictionary I would like to cast as awkward array.
Date: 2021-04-23 07:55:26 From: sterbini (@sterbini)
my_dict = {'device1': {'value': {'property1':np.arange(3, dtype=np.uint8),
'property2':np.arange(5, dtype=np.uint16)},
'header': {'acqStamp':1622969203.245,'cycleStamp':1622969203.200},
'exception': ''},
'device2': {'value': np.random.rand(1),
'header': {'acqStamp':1622969203.211,'cycleStamp':1622969203.200},
'exception': ''},
'device3': {'value': np.random.rand(1),
'header': {'acqStamp':1622969203.212,'cycleStamp':1622969203.200},
'exception': ''}}
Date: 2021-04-23 07:58:30 From: sterbini (@sterbini)
I can do it by
my_ak_array=ak.Array([my_dict])
Let's assume now that I want to extract for all devices the field header->acqStamp (indeed this is present in all devices, e.g. device1,...,device3). At the moment I do it with
[my_ak_array[ii].header.acqStamp for ii in ak.type(my_ak_array).keys()]
The previous command seems to me a bit verbose. Is there a simpler way to achieve the same result? Thanks!
Date: 2021-04-23 08:28:20 From: sterbini (@sterbini)
The second question is related to the type preservation. To my understanding, the initial typing of the np.array (e.g., np.uint8) is lost when I cast the dictionary to ak.Array. So when I try to convert back the object to numpy the round-trip does not give me the original type. See the example below.
# original dictionary
my_dict['device1']['value']['property1'].dtype # it gives me: dtype('uint8')
gives me a dtype('uint8'), but when
# I cast
my_ak_array=ak.Array([my_dict])
# I convert back
ak.to_numpy(my_ak_array['device1']['value']['property1']).dtype # it gives me: dtype('int64')
gives me a dtype('int64'). Is there a way to maintain the numpy type in the round-trip? Thanks!
Date: 2021-04-23 08:45:27 From: sterbini (@sterbini)
The round-trip consistency is very important for us when we want to set back a configuration in the accelerator.
Date: 2021-04-23 13:25:01 From: Jim Pivarski (@jpivarski)
I'll give more detailed answers when I get to a computer, but the reason you're losing types is that the arrays are not being ingested as NumPy arrays, but are being iterated over in Python (and Python integers don't have fixed-width types). If the top-level object passed to the ak.Array constructor
Date: 2021-04-23 13:26:48 From: Jim Pivarski (@jpivarski)
is a NumPy array, it will be taken as-is, without iteration, which scales better to large datasets and preserves types. If the top-level object is a Python type, like dict or list, then it proceeds to iterate through everything.
Date: 2021-04-23 13:28:08 From: Jim Pivarski (@jpivarski)
You can construct your nested data in stages by wrapping the NumPy arrays as Awkward Arrays, then gluing those arrays together with ak.zip.
Date: 2021-04-23 13:28:49 From: Jim Pivarski (@jpivarski)
https://awkward-array.readthedocs.io/en/latest/_auto/ak.zip.html
Date: 2021-04-23 13:30:29 From: Jim Pivarski (@jpivarski)
Note the depth_limit parameter, which lets you choose between making structs of arrays that look like arrays of structs or structs of arrays that look like structs of arrays. (Regardless of the interface structure, the internal structure is always columnar.)
Date: 2021-04-23 13:32:47 From: sterbini (@sterbini)
No worry Jim, I can wait until you are in front of a PC. Thanks already!
Date: 2021-04-23 13:34:26 From: Jim Pivarski (@jpivarski)
my_ak_array[ak.fields(my_ak_array), "header", "acqStamp"]
Date: 2021-04-23 13:34:59 From: Jim Pivarski (@jpivarski)
https://awkward-array.readthedocs.io/en/latest/_auto/ak.Array.html#nested-projection
Date: 2021-04-23 13:35:38 From: Jim Pivarski (@jpivarski)
I just wanted to answer while still thinking about it.
Date: 2021-04-23 14:39:24 From: Jim Pivarski (@jpivarski)
Looking at my_dict, I can't tell what is expected to scale to large datasets and what would remain small. Like, if you have arbitrarily many "devices," then perhaps these should be identified by index position, rather than string name lookup. The main assumption in Awkward Array is that something scales to billions of "rows" (identically typed, unnamed, identified by index position) but at most thousands of "columns" (named record fields, identified by string name, can have different types). I can't tell from the names in my_dict which of your data are the "billions" and which are the "dozens/hundreds/maybe thousands."
Date: 2021-04-23 14:44:03 From: Jim Pivarski (@jpivarski)
Well, in your example, "device1" has a different type than "device2" and "device3", so I guess they have to be fields.
Date: 2021-04-23 15:01:33 From: Jim Pivarski (@jpivarski)
So here's a way to build up your structure (assuming that you still want this structure, now that you're working with columnar data instead of Python dicts):
>>> dataset = ak.Array([{"device1": {}, "device2": {}, "device3": {}}])
>>> dataset["device1", "header"] = ak.Array([{"acqStamp": 1622969203.245, "cycleStamp": 1622969203.200}])
>>> dataset.type
1 * {"device2": {}, "device3": {}, "device1": {"header": {"acqStamp": float64, "cycleStamp": float64}}}
>>> dataset.tolist()
[{'device2': {}, 'device3': {}, 'device1': {'header': {'acqStamp': 1622969203.245, 'cycleStamp': 1622969203.2}}}]
>>> dataset["device1", "value"] = ak.Array({
... "property1": ak.Array(np.arange(3, dtype=np.uint8)[np.newaxis]),
... "property2": ak.Array(np.arange(5, dtype=np.uint16)[np.newaxis]),
... })
>>> # Note that the NumPy types are preserved because we ingested them as NumPy arrays, instead of iterating over them.
>>> dataset.type
1 * {"device2": {}, "device3": {}, "device1": {"header": {"acqStamp": float64, "cycleStamp": float64}, "value": {"property1": 3 * uint8, "property2": 5 * uint16}}}
>>> dataset.tolist()
[{'device2': {}, 'device3': {}, 'device1': {'header': {'acqStamp': 1622969203.245, 'cycleStamp': 1622969203.2}, 'value': {'property1': [0, 1, 2], 'property2': [0, 1, 2, 3, 4]}}}]
and similarly for "device2", "device3", etc. Be sure to use strings in square brackets so that you attach arrays to the one dataset object, rather than creating temporaries and attaching to those (same problem as NumPy and Pandas).
Also, your "property1" and "property2" are viewed as a struct of arrays in the original structure, so to preserve that, I used np.newaxis to make them length-1 objects. In Awkward Array, everything is a struct of arrays internally, even if they're viewed and operated upon as though they were an array of structs, so you might have chosen a data structure that is no longer necessary.
Date: 2021-04-23 15:06:47 From: Jim Pivarski (@jpivarski)
>>> ak.zip({"property1": np.arange(10, dtype=np.uint8), "property2": np.arange(10, dtype=np.uint16)}).tolist()
[{'property1': 0, 'property2': 0}, {'property1': 1, 'property2': 1}, {'property1': 2, 'property2': 2}, {'property1': 3, 'property2': 3}, {'property1': 4, 'property2': 4}, {'property1': 5, 'property2': 5}, {'property1': 6, 'property2': 6}, {'property1': 7, 'property2': 7}, {'property1': 8, 'property2': 8}, {'property1': 9, 'property2': 9}]
>>> ak.zip({"property1": np.arange(10, dtype=np.uint8)[np.newaxis], "property2": np.arange(10, dtype=np.uint16)[np.newaxis]}, depth_limit=1).tolist()
[{'property1': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 'property2': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]
The first looks like a bunch of {property1: #, property2: #} records and the latter looks like two arrays, but they have the same performance characteristics. (The first of these does not have a bunch of "property1" strings or individually allocated objects inside of it.)
This means that you're not discouraged from changing the structure of the data a lot—projecting out columns and gathering them together again as records—so you might pick a different data structure, knowing that.
Date: 2021-04-23 16:01:46 From: Angus Hollands (@agoose77:matrix.org)
^ this was something I took a while to get my head around, thinking about records as distinct from the memory that backs them.
Date: 2021-04-23 16:08:11 From: Angus Hollands (@agoose77:matrix.org)
I have some positions in a record array, with fields (N * var) x, y, z, and some direction vectors in a 2D jagged Awkward Array (N * var * 3). If I want to elegantly take the dot product of these, am I best attaching a record structure to the normals (by pulling out the columns into a record), and then leveraging vector or something?
What I'm actually doing is taking a subset of points for all events which have the right label (involved_points = points.mask[label == i]), and pulling the corresponding normal vector in each event by first padding them (padded_normals = ak.pad_none(normals, n_labels_max, axis=1)) and then accessing them with n_label = n[:, i]
Date: 2021-04-23 16:08:23 From: Angus Hollands (@agoose77:matrix.org)
Right now I'm having to pad_none and manually implement the dot.
Date: 2021-04-23 16:10:34 From: Jim Pivarski (@jpivarski)
Vector's view of the world is that components are fields of a record:
>>> import awkward as ak
>>> import vector
>>> vector.register_awkward()
>>> array = ak.Array([
... [{"px": 1.1, "py": 1, "pz": 3, "E": 5}, {"px": 1.1, "py": 1, "pz": 3, "E": 5},{"px": 1.1, "py": 1, "pz": 3, "E": 5}],
... [],
... [{"px": 1.1, "py": 1, "pz": 3, "E": 5}, {"px": 1.1, "py": 1, "pz": 3, "E": 5}]],
... with_name = "Momentum4D")
>>> array
<MomentumArray4D [[{px: 1.1, py: 1, pz: 3, ... E: 5}]] type='3 * var * Momentum4...'>
>>> array.dot(array)
<Array [[13.8, 13.8, 13.8], ... [13.8, 13.8]] type='3 * var * float64'>
Date: 2021-04-23 16:11:18 From: sterbini (@sterbini)
Thank you very much! It is very clear. I will test it immediately
Date: 2021-04-23 16:11:30 From: Jim Pivarski (@jpivarski)
But in another thread, I'm told there's something wrong with the latest release of Vector that has been fixed in its main GitHub branch, so pip-install from GitHub: https://adamj.eu/tech/2019/03/11/pip-install-from-a-git-repository/
Date: 2021-04-23 16:12:37 From: Jim Pivarski (@jpivarski)
What you need for Vector to work is to have the Vector behaviors in the ak.behaviors dict (that's what register_awkward() does) and to have the record named "Momentum2D", "Momentum3D", or "Momentum4D".
Date: 2021-04-23 16:12:51 From: Jim Pivarski (@jpivarski)
With appropriate field names.
Date: 2021-04-23 16:13:16 From: Angus Hollands (@agoose77:matrix.org)
Thanks for the heads up. I think the question is maybe; in NumPy I'd probably have points and normals in an ND array and just use the broadcasting. here, I get the feeling I should instead leverage the record + behaviour model.
Date: 2021-04-23 16:13:44 From: Jim Pivarski (@jpivarski)
If the field names are "pt", "eta", "phi", "M" or another combination, it will be interpreted in the correct coordinate system. (That's why it needs names, to determine the coordinate system.)
Date: 2021-04-23 16:14:09 From: Angus Hollands (@agoose77:matrix.org)
Right, so the positions already have the required fields, but the normals are just a jagged array of regular 2D arrays
Date: 2021-04-23 16:14:58 From: Angus Hollands (@agoose77:matrix.org)
So if using the behaviours model is best then I'll convert the normals to an array of records and then use vector.
Date: 2021-04-23 16:15:18 From: Jim Pivarski (@jpivarski)
There is a fundamental question of whether to make data rows or columns (same question as for @sterbini's data).
Date: 2021-04-23 16:16:04 From: Angus Hollands (@agoose77:matrix.org)
Yeah, I keep running into it more generally with NumPy. The performance aspect is less relevant with Awkward because everything is "just columns"
Date: 2021-04-23 16:16:22 From: Jim Pivarski (@jpivarski)
Columns allow arbitrary types, but are looked up by string name, so there shouldn't be more than ~thousands of them. Rows are unnamed, which projects like xarray attempt to address.
Date: 2021-04-23 16:16:42 From: Angus Hollands (@agoose77:matrix.org)
But behaviour-wise, indeed.
Date: 2021-04-23 16:17:41 From: Jim Pivarski (@jpivarski)
Also, the row vs column question implies a different memory layout: if the data are
1000000000 * 3 * float64
then all three coordinates are in the same buffer; if they're
1000000000 * {x: float64, y: float64, z: float64}
then they're in three separate buffers.
Date: 2021-04-23 16:18:18 From: Angus Hollands (@agoose77:matrix.org)
Yes, indeed; I follow that having records implies fields are stored separately
Date: 2021-04-23 16:18:46 From: Jim Pivarski (@jpivarski)
The closest semantic equivalent is NumPy's structured arrays, but those are all in the same buffer regardless. They're arrays of structs.
Date: 2021-04-23 16:18:56 From: Angus Hollands (@agoose77:matrix.org)
My gut feeling in this case is that I should just forget about microperformance concerns and just do what is elegant.
Date: 2021-04-23 16:19:13 From: Angus Hollands (@agoose77:matrix.org)
Yeah, I have been ruminating that one lately, because I like the notion of structured containers, but that enforces a particular layout
Date: 2021-04-23 16:19:30 From: Jim Pivarski (@jpivarski)
I agree. As long as the performance differences are micro! :)
Date: 2021-04-23 16:19:56 From: Angus Hollands (@agoose77:matrix.org)
It seems that you can't really maintain the abstraction that "this is a 2D chunk of data" with "this is a record with fields"
Date: 2021-04-23 16:20:33 From: Jim Pivarski (@jpivarski)
Oh! The reason I was saying that was to point out that Vector does NumPy arrays, too, if you find that useful. (NumPy structured arrays.) So if you have a reason to want the single-buffer approach, without jaggedness, that's available by choosing NumPy over Awkward in Vector.
Date: 2021-04-23 16:20:34 From: Angus Hollands (@agoose77:matrix.org)
At least, in terms of the fact that you have to define operations between rows (e.g. dot product)
Date: 2021-04-23 16:21:15 From: Angus Hollands (@agoose77:matrix.org)
Oh, yes, that makes sense. In this case I do have jagged data so it's all good :)
Date: 2021-04-23 16:21:57 From: Angus Hollands (@agoose77:matrix.org)
I think part of the problem for my mental model is that the more you think about minimising allocation and improving access patterns, the more you lose the expressiveness of array programming
Date: 2021-04-23 16:22:23 From: Jim Pivarski (@jpivarski)
Because NumPy's structured array is one buffer regardless, you can swap between representations without changing the memory layout:
>>> array = np.array([(1, 2, 3, 4), (1.1, 2.2, 3.3, 4.4)],
... dtype=[("px", float), ("py", float), ("pz", float), ("E", float)])
>>> array
array([(1. , 2. , 3. , 4. ), (1.1, 2.2, 3.3, 4.4)],
dtype=[('px', '<f8'), ('py', '<f8'), ('pz', '<f8'), ('E', '<f8')])
>>> array.view(float)
array([1. , 2. , 3. , 4. , 1.1, 2.2, 3.3, 4.4])
>>> array.view(float).reshape(-1, 4)
array([[1. , 2. , 3. , 4. ],
       [1.1, 2.2, 3.3, 4.4]])
But if you were to do the equivalent in Awkward Array, turning an inner dimension into named fields or vice versa, you'd have to copy and/or concatenate.
Date: 2021-04-23 16:22:31 From: Angus Hollands (@agoose77:matrix.org)
Short of something that optimises the temporaries away and can choose the best memory layout, it's just not worth worrying about (too much, that is)
Date: 2021-04-23 16:22:47 From: Angus Hollands (@agoose77:matrix.org)
Yes, that's one useful feature.
Date: 2021-04-23 16:23:05 From: Angus Hollands (@agoose77:matrix.org)
The only thing you can't do is view fortran ordered data with a structured array
Date: 2021-04-23 16:23:39 From: Angus Hollands (@agoose77:matrix.org)
My memory is a bit fuzzy on this, but in short, you can't optimise for columnar access
Date: 2021-04-23 16:23:59 From: Jim Pivarski (@jpivarski)
Yeah. The biggest performance difference is in the step from Python objects to NumPy/Awkward/C++/etc; the differences within the latter category only matter at the extremes.
Date: 2021-04-23 16:24:48 From: Angus Hollands (@agoose77:matrix.org)
I really appreciate the help you've given recently. I feel a little like I'm somewhere not-too-nice on the Dunning Kruger curve after starting with Awkward.
Date: 2021-04-23 16:25:37 From: Jim Pivarski (@jpivarski)
Right, Fortran ordering would prevent the interpretation as an array of structs. Since NumPy is requiring a particular memory layout, they can do certain reinterpretations without modifying the buffer, but others are impossible.
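A quick sketch of that restriction (the dtype and array names are hypothetical): the C-ordered array can be reinterpreted as records, but the Fortran-ordered copy cannot, because its last axis is not contiguous:

```python
import numpy as np

rec = np.dtype([("x", np.float64), ("y", np.float64)])

c_arr = np.zeros((4, 2))              # C order: each row is contiguous
f_arr = np.asfortranarray(c_arr)      # Fortran order: each column is contiguous

c_view = c_arr.view(rec)              # OK: each row reinterprets as one record

try:
    f_arr.view(rec)                   # fails: last axis is not contiguous
    viewed_fortran = True
except ValueError:
    viewed_fortran = False
```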
Date: 2021-04-23 16:25:38 From: Angus Hollands (@agoose77:matrix.org)
It's been really interesting changing the mental model to use Awkward. The more I use it, the more I get a sense that it's well built, so kudos to y'all
Date: 2021-04-23 16:25:52 From: Jim Pivarski (@jpivarski)
Thanks, and good luck!
Date: 2021-04-23 16:28:23 From: sterbini (@sterbini)
Thanks @jpivarski. It's working like a charm. I also tested the write/read to Parquet and the round-trip maintains the types.
Date: 2021-04-23 16:28:58 From: Jim Pivarski (@jpivarski)
I'm glad to hear it!
Date: 2021-04-23 16:34:27 From: sterbini (@sterbini)
Question: using my_ak_array[ak.fields(my_ak_array), "header", "acqStamp"] is not fully equivalent to [my_ak_array[ii].header.acqStamp for ii in ak.type(my_ak_array).keys()]. The second one is a list and can be plotted directly, while I cannot plot the first one directly. Am I missing something?
Date: 2021-04-23 16:37:21 From: Jim Pivarski (@jpivarski)
I just recreated it from my_dict and by assigning to dataset.
Date: 2021-04-23 16:38:15 From: Jim Pivarski (@jpivarski)
In general, you'll need to do some sort of projection to plot anything. It was a design choice to not have it automatically flatten everything down to numbers, because that could hide mistakes. You'll have to say exactly what you want to plot.
Date: 2021-04-23 16:38:57 From: Jim Pivarski (@jpivarski)
Plotters generally only recognize NumPy arrays, so it's also sometimes necessary to call ak.to_numpy on something to get a plotter to recognize it, and that's the function that complains if the data are not rectilinear.
Date: 2021-04-23 16:40:00 From: Jim Pivarski (@jpivarski)
So, if you're wanting to plot device1, value, property1, that's
>>> ak.to_numpy(my_ak_array.device1.value.property1)
array([[0, 1, 2]])
>>> ak.to_numpy(dataset.device1.value.property1)
array([[0, 1, 2]], dtype=uint8)
Date: 2021-04-23 16:41:44 From: Jim Pivarski (@jpivarski)
If you want to do all the devices, that's complicated by the fact that they're different types. See:
>>> my_ak_array[ak.fields(my_ak_array), "value"].tolist()
[{'device1': {'property1': [0, 1, 2], 'property2': [0, 1, 2, 3, 4]}, 'device2': [0.06441346929713354], 'device3': [0.7973273860848885]}]
Date: 2021-04-23 16:42:23 From: Jim Pivarski (@jpivarski)
Device1 is split into property1 and property2, whereas device2 and device3 have flatter data types.
Date: 2021-04-23 16:42:57 From: sterbini (@sterbini)
Ok, in my case it works if I do plt.plot(my_array[my_array.fields, 'header', 'acqStamp'].tolist()[0].values())
Date: 2021-04-23 16:42:58 From: Jim Pivarski (@jpivarski)
If you don't care about any of this and just want all the numbers at the leaf nodes of the tree structure, you can do ak.flatten with axis=None:
Date: 2021-04-23 16:43:16 From: sterbini (@sterbini)
is working
Date: 2021-04-23 16:43:20 From: Jim Pivarski (@jpivarski)
>>> ak.flatten(my_ak_array[ak.fields(my_ak_array), "value"], axis=None)
<Array [0, 1, 2, 0, 1, ... 3, 4, 0.0644, 0.797] type='10 * float64'>
Date: 2021-04-23 16:44:16 From: Jim Pivarski (@jpivarski)
I've been using tolist in the examples here so that we can see what we're doing on small samples, but be aware that turning things back into Python objects is a performance cost if you're going to scale this up.
Date: 2021-04-23 16:45:04 From: sterbini (@sterbini)
Perfect!
Date: 2021-04-23 16:45:12 From: sterbini (@sterbini)
thanks again!
Date: 2021-04-23 16:46:34 From: Jim Pivarski (@jpivarski)
You can put the 0 (which selects the first item) inside the slice.
>>> my_ak_array[my_ak_array.fields, "header", "acqStamp", 0]
<Record ... device3: 1.62e+09} type='{"device1": float64, "device2": float64, "d...'>
Date: 2021-04-23 16:47:00 From: sterbini (@sterbini)
interesting
Date: 2021-04-23 16:47:28 From: sterbini (@sterbini)
Last question: if I try to use
ak.Array([{'a': datetime.datetime(2021, 4, 1)}])
Date: 2021-04-23 16:47:48 From: Jim Pivarski (@jpivarski)
And since you're plotting this across columns (fields, which you used dict's values() for), you could unzip it to get the same effect:
>>> ak.unzip(my_ak_array[my_ak_array.fields, "header", "acqStamp", 0])
(1622969203.245, 1622969203.211, 1622969203.212)
but this is already a hint that your devices might want to be rows, not columns.
Date: 2021-04-23 16:48:34 From: sterbini (@sterbini)
there is an error. I think it's normal, since only list+dict+str+tuple+numpy can be translated, is that correct?
Date: 2021-04-23 16:49:24 From: Jim Pivarski (@jpivarski)
If it's about supporting date-time types, @ianna is working on that right now: https://github.com/scikit-hep/awkward-1.0/pull/835
You might want to add a message on that thread to let her know you're a potential user, and to keep up with updates.
Date: 2021-04-23 16:50:05 From: sterbini (@sterbini)
Perfect (again!). Thanks a lot!
Date: 2021-04-23 16:51:26 From: sterbini (@sterbini)
Done!
Date: 2021-04-23 16:51:59 From: Jim Pivarski (@jpivarski)
NumPy has a date-time type, which @ianna is adding to Awkward, but the others are
- numbers, booleans
- lists of variable length
- lists of fixed length (a la NumPy)
- strings (which are lists with special methods)
- dicts → records, so the field names and their types must match across an array
- tuples → records without field names, so the number of items and their types must match across an array
Good to hear that that worked for you.
Date: 2021-04-23 17:12:29 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: by "devices [as] rows", do you mean that the Record that contains them would contain only primitive fields, i.e. as an array of records?
Date: 2021-04-23 17:14:48 From: Jim Pivarski (@jpivarski)
It's the same question we were talking about: the choice between
>>> ak.Array([{"device1": {"x": 1}, "device2": {"x": 2}}])
<Array [... x: 1}, device2: {x: 2}}] type='1 * {"device1": {"x": int64}, "device...'>
and
>>> ak.Array([{"x": 1}, {"x": 2}])
<Array [{x: 1}, {x: 2}] type='2 * {"x": int64}'>
or maybe
>>> ak.Array([[{"x": 1}, {"x": 2}]])
<Array [[{x: 1}, {x: 2}]] type='1 * var * {"x": int64}'>
Date: 2021-04-23 17:15:47 From: Jim Pivarski (@jpivarski)
It's the question of whether @sterbini wants to address these devices as individuals with individual names or as a collection, like events/jets/tracks/hits.
Date: 2021-04-23 17:16:52 From: Jim Pivarski (@jpivarski)
If they have different types (as in @sterbini's original example), then they really should be columns. But if not, and most plots cut across all the devices, then they probably shouldn't be.
Date: 2021-04-23 17:17:13 From: agoose77 (@agoose77:matrix.org)
Ah, I see - where devices have the same structure just treat them as rows
Date: 2021-04-23 17:18:03 From: Jim Pivarski (@jpivarski)
Yeah. At one point, @nsmith- suggested that we should have functions to convert between the two representations: stack and unstack. Unfortunately, it's still an open issue. (They haven't been written.)
Date: 2021-04-24 15:39:44 From: Angus Hollands (@agoose77:matrix.org)
Is there any construct to split an existing axis by introducing a new dimension? Something like
ak.concatenate((
arr.position[arr.label == 0][:, np.newaxis, ],
arr.position[arr.label == 1][:, np.newaxis, ],
arr.position[arr.label == 2][:, np.newaxis, ],
), axis=1)
except, instead of manually partitioning like this, one would use ak.argsort(arr.label) and find the loci of the different label transitions (e.g. sorted arr.label would be 0000112222233)?
Date: 2021-04-24 15:39:58 From: Angus Hollands (@agoose77:matrix.org)
It's a bit esoteric, and quite possible to do like this, all things considered
Date: 2021-04-25 14:32:47 From: Jim Pivarski (@jpivarski)
Isn't this actually just sorting on labels? Does sort/argsort not do it?
Date: 2021-04-25 16:41:57 From: Angus Hollands (@agoose77:matrix.org)
Wow, this example was poor, on reflection! I think this is just a simple NumPy list-comp situation; I've just gotten used to there being useful functions in Awkward, so I thought it worth checking.
Date: 2021-04-26 08:10:33 From: Angus Hollands (@agoose77:matrix.org)
Hmm, I'm still not sure if this is possible without a copy. I'm trying to insert an additional dimension into an array, but I'm not introducing any new data, I just want to reorder it - Instead of the 1st axis being flat with length A+B+C, I want to insert a new dimension with length 3, and the 2nd axis corresponding lengths A, B, and C.
Date: 2021-04-26 08:54:41 From: Angus Hollands (@agoose77:matrix.org)
I think what I need to do is compose a ListOffsetArray onto an IndexedArray64, where the latter sorts the array by label, and the former slices this ordered array. Because I want to act on an interior axis, I believe that I need to make use of something like recursively_apply
Date: 2021-04-26 12:28:49 From: Angus Hollands (@agoose77:matrix.org)
I've implemented something that works by first sorting the labels, and then unflattening (them, or the labelled data in my actual use-case) according to the number of each label:
@nb.njit
def split_runs_inner(arr):
    last = arr[0]
    counts = [0]
    for i, x in enumerate(arr):
        if x != last:
            counts.append(1)
            last = x
        else:
            counts[-1] += 1
    return counts

@nb.njit
def split_runs(arr):
    counts = []
    for i, x in enumerate(arr):
        counts.append(split_runs_inner(x))
    return counts

with_new_dim = ak.unflatten(labels, split_runs(labels), axis=1)
I wish I could drop the hard-coded axis=1 (outer-inner) implementation in favour of something which could take an axis parameter. This could be done by preparing a list of operations, i.e. ops=[outer, ..., inner], which are picked out by depth, but that would introduce N loops, and although they're in jitted code, I am certain that there must be a better way. I don't need axis != 1, but I feel like this code isn't the most elegant way to do this.
Date: 2021-04-26 13:49:50 From: Angus Hollands (@agoose77:matrix.org)
Haha, of course this feature already exists in ak.run_lengths 🤦 I've said it before, but I find it reassuring that the set of high level operations that Awkward provides manages to cover most of what you need to do. Although this does solve my problem, I do anticipate needing to understand how to deal with arbitrarily nested arrays with numba in future. Any tips on that would be much appreciated :)
Date: 2021-04-26 17:33:28 From: sterbini (@sterbini)
Dear Jim, I am back with a new problem. If I try this
import numpy as np
import awkward as ak
my_dict = {'device1': {'value': {'property1':np.arange(3, dtype=np.uint8),
'property2':np.arange(5, dtype=np.uint16)},
'header': {'acqStamp':1622969203.245,'cycleStamp':1622969203.200},
'exception': [1]},
'device2': {'value': np.arange(3, dtype=np.uint8),
'header': {'acqStamp':1622969203.211,'cycleStamp':1622969203.200},
'exception': [2]},
'device3': {'value': np.random.rand(1),
'header': {'acqStamp':1622969203.212,'cycleStamp':1622969203.200},
'exception': []}}
for device in my_dict.keys():
    dataset = ak.Array([{device: {} for device in my_dict.keys()}])
for device in my_dict.keys():
    # print(device)
    dataset_device = ak.Array([{field: {} for field in my_dict[device].keys()}])
    for field in my_dict[device].keys():
        if type(my_dict[device][field]) == dict:
            # print(my_dict[device][field])
            dataset_field = ak.Array([{device_property: {} for device_property in my_dict[device][field].keys()}])
            for device_property in my_dict[device][field].keys():
                # print(device_property)
                try:
                    dataset_field[device_property] = my_dict[device][field][device_property][np.newaxis]
                except:
                    dataset_field[device_property] = my_dict[device][field][device_property]
        else:
            # print(my_dict[device][field])
            try:
                dataset_field = my_dict[device][field][np.newaxis]
            except:
                dataset_field = my_dict[device][field]
        dataset_device[field] = dataset_field
    dataset[device] = dataset_device
dataset
it fails. From what I understand it seems related to the fact that device3's exception is an empty list. I think I am missing something...
Date: 2021-04-26 17:34:59 From: agoose77 (@agoose77:matrix.org)
@sterbini: what are you trying to do with this example?
Date: 2021-04-26 17:37:52 From: sterbini (@sterbini)
@agoose77:matrix.org Hi, I would like to cast my_dict into an Awkward Array, preserving the types of all the elements in the dictionary. The idea then is to store it in a Parquet file.
Date: 2021-04-26 17:38:21 From: Angus Hollands (@agoose77:matrix.org)
How do you load my_dict originally? uproot?
Date: 2021-04-26 17:40:25 From: Angus Hollands (@agoose77:matrix.org)
Is this all of your data, or are there other fields that you've omitted for simplicity?
Date: 2021-04-26 17:40:28 From: Angus Hollands (@agoose77:matrix.org)
@sterbini:
Date: 2021-04-26 17:53:19 From: sterbini (@sterbini)
- This is a simplified example of the data. The structure is very similar: one would typically have ~20 devices, and each device has three fields (value, header and exception). Inside value there can be a lot of different data.
- This is not obtained with uproot but via a transformation into Python dictionaries of Java objects (namely, from the control system of our accelerator).
Date: 2021-04-26 17:54:14 From: Angus Hollands (@agoose77:matrix.org)
How many fields will your device parameters have? At the moment they are flat (single rows).
Date: 2021-04-26 17:59:19 From: Angus Hollands (@agoose77:matrix.org)
From the examples you've given so far, it doesn't necessarily feel like you'd need / want to use awkward over something like JSON / pickle.
Date: 2021-04-26 18:01:01 From: sterbini (@sterbini)
Each device->value can have up to ~50 fields; a field can also be a 2D array or a list of 2D arrays.
Date: 2021-04-26 18:04:53 From: sterbini (@sterbini)
We thought that Awkward is good for our use case because we can have a very large number of these acquisitions. In fact, each acquisition is a snapshot of the machine, and we can have 1e4-1e5 of those acquisitions. So Parquet + Awkward seemed a very powerful tool for us.
Date: 2021-04-26 18:06:20 From: sterbini (@sterbini)
In addition, Awkward maintains the data types: we tried with Parquet + pandas and we did not manage to maintain the data types.
Date: 2021-04-26 18:21:36 From: Angus Hollands (@agoose77:matrix.org)
I see
Date: 2021-04-26 20:38:37 From: Jim Pivarski (@jpivarski)
@sterbini, your code is hard to read in a thread, so I'll answer here. I see that you're trying to replicate the structure of the dict in Awkward Array. Perhaps instead of the assignment method, you could use ak.ArrayBuilder like this:
builder = ak.ArrayBuilder()
with builder.record():
    for device_name, device_value in my_dict.items():
        builder.field(device_name)
        with builder.record():
            builder.field("value")
            value = device_value["value"]
            if isinstance(value, np.ndarray):
                builder.append(ak.from_numpy(value))
            elif isinstance(value, dict):
                with builder.record():
                    for property_name, property_value in value.items():
                        if isinstance(property_value, np.ndarray):
                            builder.field(property_name)
                            builder.append(ak.from_numpy(property_value))
                        else:
                            raise NotImplementedError
            else:
                raise NotImplementedError
            builder.field("header")
            header = device_value["header"]
            with builder.record():
                builder.field("acqStamp")
                builder.real(header["acqStamp"])
                builder.field("cycleStamp")
                builder.real(header["cycleStamp"])
            builder.field("exception")
            with builder.list():
                for x in device_value["exception"]:
                    builder.integer(x)
Now when you do
array = builder.snapshot()
you get an array with the same structure as my_dict. The advantage of this is that the code that walks through the record (~20 devices) can be put in a loop over 1e4-1e5 such things. The above code makes an array with type
1 * {"device1": {"value": {"property1": var * uint8, "property2": var * uint16}, "header": {"acqStamp": float64, "cycleStamp": float64}, "exception": var * int64}, "device2": {"value": var * uint8, "header": {"acqStamp": float64, "cycleStamp": float64}, "exception": var * int64}, "device3": {"value": var * float64, "header": {"acqStamp": float64, "cycleStamp": float64}, "exception": var * unknown}}
but if it were in a loop, the 1 * could become 100000 *.
HOWEVER, there is a serious performance issue to consider. The builder.append statements insert parts of an existing Awkward Array into an array that's being built, but if you create new arrays for each of the 1e4-1e5 instances, then they can't share a buffer. They won't be columnar. Actually, I think the above code might raise an error after 255 records because it has to treat data in different buffers as being an inhomogeneous type ("union array").
So if you do put the above into a loop over many instances, start by gathering all of the property1/property2/value data into arrays that cut across the large dimension, the 1e4-1e5. For instance, if you know there's going to be 100000 of the above, define
device1_value_property1 = ak.from_numpy(np.empty((3 * 100000, 3), dtype=np.uint8))
device1_value_property2 = ak.from_numpy(np.empty((5 * 100000, 5), dtype=np.uint16))
device2_value = ak.from_numpy(np.empty((3 * 100000, 3), dtype=np.uint8))
device3_value = ak.from_numpy(np.empty((1 * 100000, 1), dtype=np.float64))
You can fill these because their underlying NumPy arrays are mutable, as described here. Then when you're appending parts of these arrays into the ArrayBuilder, you can use the at argument of ArrayBuilder.append to link items from these single arrays into the array that's being built, instead of constructing new structures in a non-columnar way.
Date: 2021-04-26 20:54:28 From: Jim Pivarski (@jpivarski)
Actually, this is exactly what the ArrayBuilder is doing for all the other fields: the acqStamp values are all members of the same buffer in memory, the cycleStamp are all in their own buffer, and the exception numbers are all in their own buffer. When used in a loop, ArrayBuilder recognizes when the structure of the nth record is the same as all preceding records and re-uses the same buffer. The reason you don't have to preallocate arrays to use it is because it prospectively allocates them for you. (ak.from_iter, which gets called when you pass Python lists and dicts to the ak.Array constructor, calls ArrayBuilder. It's all the same algorithm.)
The reason you'd have to do that manually for the deviceN_value_propertyMs is because ArrayBuilder.append links an existing array into the new one that is being built, and linking in 1e4-1e5 distinct arrays would have terrible performance (if the rule against union arrays larger than 255 doesn't kick in first). The up-front allocation is to ensure that all of these thousands of items share a buffer, which in turn was because you're saying, "Use this NumPy buffer instead of creating your own," which in turn was because you didn't want it to allocate buffers with generic integer types: np.int64 instead of np.uint8 or np.uint16.
Maybe there's a different way to satisfy the last condition: you could do this entirely with a loop (no explicit preallocation on your part) if you used TypedArrayBuilder, rather than ArrayBuilder, which lets you specify the types before filling it with data. The only problem with that is that TypedArrayBuilder isn't complete: it has been implemented (https://github.com/scikit-hep/awkward-1.0/pull/769) but it doesn't have a high-level interface yet.
But if being a first user of TypedArrayBuilder (so that you can specify that the properties have np.uint8/np.uint16/etc.) sounds like a better option to you than manually ensuring that the properties are columnar, you can get in touch with @ianna in a GitHub Discussion (https://github.com/scikit-hep/awkward-1.0/discussions).
Date: 2021-04-27 10:35:10 From: sterbini (@sterbini)
@jpivarski I tested your implementation (using the ak.ArrayBuilder) and it is way faster than mine (x500). I think it is very adequate for our use case. Thanks again for your support and your prompt reply!
Date: 2021-04-27 12:10:05 From: Jim Pivarski (@jpivarski)
@sterbini Okay, I'm glad that works. But if you get performance issues when you fill many of these (or hit a wall at 255), it's probably the issue above and we can look into fixing that. TypedArrayBuilder will have mostly the same interface as ArrayBuilder, so starting with ArrayBuilder would provide a good upgrade path, if need be.
Date: 2021-04-28 13:50:47 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: is there a clean way to "drop" a mask on an Array? I noticed that the option type is distinct from a Union[..., NoneType] content type, and so fill_none doesn't touch it.
Date: 2021-04-28 16:47:33 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org I'm not sure what you mean because Union[..., NoneType] is a mypy/Python type. Are you meaning the equivalent Awkward type, optional[union[...]]?
Date: 2021-04-28 16:48:28 From: Angus Hollands (@agoose77:matrix.org)
Yes, quite right!
Date: 2021-04-28 16:48:59 From: Jim Pivarski (@jpivarski)
>>> a = ak.Array([1, 2, 3, None, [4, 5, 6]])
>>> a
<Array [1, 2, 3, None, [4, 5, 6]] type='5 * ?union[int64, var * int64]'>
>>> ak.fill_none(a, -1)
<Array [1, 2, 3, -1, [4, 5, 6]] type='5 * union[int64, var * int64]'>
Date: 2021-04-28 16:49:36 From: Jim Pivarski (@jpivarski)
ak.fill_none eliminates the optional (represented in Datashape notation as a ? before the union).
Date: 2021-04-28 16:50:13 From: Angus Hollands (@agoose77:matrix.org)
Hmm, I wasn't observing that
Date: 2021-04-28 16:50:30 From: Jim Pivarski (@jpivarski)
Or if this question doesn't have anything do to with union-types (because that part is orthogonal),
>>> a = ak.Array([1, 2, 3, None])
>>> a
<Array [1, 2, 3, None] type='4 * ?int64'>
>>> ak.fill_none(a, -1)
<Array [1, 2, 3, -1] type='4 * int64'>
Date: 2021-04-28 16:50:59 From: Angus Hollands (@agoose77:matrix.org)
ah, ok I now have this type 100 * union[var * ?bool, bool] after calling fill_none
Date: 2021-04-28 16:52:04 From: Jim Pivarski (@jpivarski)
>>> a = ak.Array([True, False, [True, False, None]])
>>> a
<Array [True, False, [True, False, None]] type='3 * union[bool, var * ?bool]'>
Date: 2021-04-28 16:53:15 From: Jim Pivarski (@jpivarski)
Ah, the problem is that ak.fill_none doesn't have an axis parameter. And anyway, it would be difficult to say what the right axis is, since the list-depth is mixed by the union.
Date: 2021-04-28 16:54:03 From: Jim Pivarski (@jpivarski)
Unions were included in the type-system, but there are several things that just aren't expressible for unions. They really cause problems.
Date: 2021-04-28 16:54:29 From: Angus Hollands (@agoose77:matrix.org)
Haha, I can see that, I already have a head-ache :)
Date: 2021-04-28 16:54:35 From: Jim Pivarski (@jpivarski)
Oh, wait a minute: ak.fill_none doesn't have an axis parameter because it applies to all axes?
Date: 2021-04-28 16:54:45 From: Jim Pivarski (@jpivarski)
>>> ak.fill_none(a, False)
<Array [True, False, [True, False, False]] type='3 * union[bool, var * bool]'>
Date: 2021-04-28 16:54:48 From: Angus Hollands (@agoose77:matrix.org)
It does, which is why I expected it to just work
Date: 2021-04-28 16:54:59 From: Angus Hollands (@agoose77:matrix.org)
Let me prepare a reproducer
Date: 2021-04-28 16:56:39 From: Angus Hollands (@agoose77:matrix.org)
arr = ak.Array(
    [
        [True, False, None],
        None,
        None,
        [True, False, None],
        [True, False, None],
        None,
        [True, None, False],
        [True, False, None],
        [True, False, None],
        None,
    ]
)
Date: 2021-04-28 16:56:47 From: Angus Hollands (@agoose77:matrix.org)
ak.fill_none(arr, False)
Date: 2021-04-28 16:58:53 From: Jim Pivarski (@jpivarski)
>>> a = ak.Array([True, None, [False, None]])
>>> a
<Array [True, None, [False, None]] type='3 * ?union[bool, var * ?bool]'>
>>> ak.fill_none(a, False)
<Array [True, False, [False, None]] type='3 * union[bool, var * ?bool]'>
>>> ak.fill_none(ak.fill_none(a, False), False)
<Array [True, False, [False, False]] type='3 * union[bool, var * bool]'>
Date: 2021-04-28 16:59:54 From: Jim Pivarski (@jpivarski)
It does to one level and stops, so it can be applied twice to get both levels. I'm not sure I like that behavior—I think using the explicit axis concept everywhere would be better—but you'd be out of luck if you had to try to specify an axis here.
Date: 2021-04-28 17:01:06 From: Jim Pivarski (@jpivarski)
Or maybe not. Maybe if this function took an explicit axis, then it would still have to be applied twice, but the first time you'd have to say axis=0 and the second time you'd have to say axis=1.
Date: 2021-04-28 17:04:58 From: Angus Hollands (@agoose77:matrix.org)
purely from a UX perspective, could it also accept axis=None?
Date: 2021-04-28 17:05:28 From: Angus Hollands (@agoose77:matrix.org)
I don't know enough of the internals to know how gnarly that would be to implement vs a multi-pass approach.
Date: 2021-04-28 17:05:31 From: Jim Pivarski (@jpivarski)
To fill_none at all axes? That would make sense.
Date: 2021-04-28 17:06:40 From: Jim Pivarski (@jpivarski)
The thing that saves it from being horrendous is that the number of items (lists, numbers, whatever) stays the same at each level that is being transformed.
Date: 2021-04-28 17:06:53 From: Jim Pivarski (@jpivarski)
That's what's making ak.drop_none horrendous.
Date: 2021-04-28 17:07:42 From: Jim Pivarski (@jpivarski)
(ak.flatten has the same complication, and somehow that has been solved, but I haven't figured out how to adapt ak.flatten to just drop Nones without also exploding lists.)
Date: 2021-04-28 17:16:18 From: Angus Hollands (@agoose77:matrix.org)
Yeah, I can see how that quickly balloons into something rather tricky
Date: 2021-05-04 16:52:19 From: alesaggio (@alesaggio)
Hi, I am trying to evaluate a BDT trained with TMVA on numpy arrays built from awkward. Here is a gist with the various steps describing the issue: https://gist.github.com/alesaggio/783e774bff2827b17d36f3150732c2e3 I spotted two problems that may be related:
- Evaluating the BDT on the ndarray built from awkward fails;
- Evaluating the BDT on the first entry of the ndarray fails as well, unless a new array is created which is a copy of the original one.
Do you have an idea of what is going on and how to possibly make it work? I am using ROOT 6.22/06 and awkward 1.2.0rc2. I am using this TMVA.Experimental feature as it seems the only option to support batch processing (and the BDT is externally provided as an xml).
Date: 2021-05-04 18:06:03 From: Jim Pivarski (@jpivarski)
I think this has to do with TMVA's interface, but np.array(awkward_array) does make a copy. Yes, just tested it:
>>> awkward_array = ak.Array([1, 2, 3, 4, 5])
>>> numpy_array = np.array(awkward_array)
>>> awkward_array
<Array [1, 2, 3, 4, 5] type='5 * int64'>
>>> numpy_array[2:] = 99
>>> numpy_array
array([ 1, 2, 99, 99, 99])
>>> awkward_array
<Array [1, 2, 3, 4, 5] type='5 * int64'>
If you want to view an array without copying it, the NumPy word for that is np.asarray (as opposed to np.array):
>>> awkward_array = ak.Array([1, 2, 3, 4, 5])
>>> numpy_array = np.asarray(awkward_array)
>>> awkward_array
<Array [1, 2, 3, 4, 5] type='5 * int64'>
>>> numpy_array[2:] = 99
>>> numpy_array
array([ 1, 2, 99, 99, 99])
>>> awkward_array
<Array [1, 2, 99, 99, 99] type='5 * int64'>
So from the very first step in your gist, the array named new has nothing to do with Awkward Array; it's a brand new NumPy array.
It might have something to do with strides: it's possible to make a NumPy (or Awkward) array in which the logical order of items in the array is not the same as the physical order of the items in memory. In fact, I bet taking the transpose does this. Yes, it does:
>>> # make a "C-contiguous" array; the interpreted order is the same as the order in memory
>>> rectangle = np.arange(3*5).reshape(3, 5)
>>> rectangle
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> rectangle.shape, rectangle.strides
((3, 5), (40, 8))
>>> rectangle.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> # convert it into raw bytes (order="A" leaves the order as-is) and back again
>>> np.frombuffer(rectangle.tobytes(order="A"), dtype=np.int64)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
>>> # transpose it
>>> transpose = rectangle.T
>>> transpose
array([[ 0, 5, 10],
[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14]])
>>> transpose.shape, transpose.strides
((5, 3), (8, 40))
>>> # Strides are how many bytes between each item, in each dimension
>>> # notice that the transpose has (8, 40) whereas the original had (40, 8).
>>> # That means that the inner dimension is taking bigger steps than the outer dimension.
>>> # It's not C-contiguous; it happens to be Fortran-contiguous.
>>> transpose.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> # viewing the raw data, we see the ORIGINAL order, not the transposed order
>>> np.frombuffer(transpose.tobytes(order="A"), dtype=np.int64)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
>>> # because this "transpose" is not a new array; it's the original array viewed with different strides
>>> rectangle.ctypes.data == transpose.ctypes.data
True
>>> transpose[1:, 2:] = 99
>>> transpose
array([[ 0, 5, 10],
[ 1, 6, 99],
[ 2, 7, 99],
[ 3, 8, 99],
[ 4, 9, 99]])
>>> rectangle
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 99, 99, 99, 99]])
So when you took the transpose of this NumPy (not Awkward) array, NumPy did not create a new array, it just viewed the same data in a different order. When you did
(Pdb) bdt.Compute(np.array(new.T[0], copy=True))
you copied the transposed NumPy array into another new array, and this one is C-contiguous. By default, new arrays in NumPy are C-contiguous. I suspect that TMVA doesn't convert arrays that aren't C-contiguous. The error message isn't clear on that, but it is good that it checks for a case that it doesn't handle, because simply ignoring the distinction would yield wrong results, which is harder to interpret than even an unclear error message.
Date: 2021-05-05 09:55:30 From: alesaggio (@alesaggio)
Hi Jim, this looks interesting, thanks for the explanation. I guess one way out is to try and build my arrays without having to use the transpose. I will keep digging a bit more and hopefully find a way to make it work in the multidimensional case as well. Thanks again!
Date: 2021-05-05 14:27:36 From: Jim Pivarski (@jpivarski)
Just copy the transposed array, as you have been doing. If TMVA doesn't accept non-contiguous arrays, the solution is to make a contiguous copy. Overall, this might not even be slower: compiled loops over data with constant-sized strides (i.e. known at compile time, and therefore necessarily something like C-contiguous or Fortran-contiguous) are faster than loops over data whose strides are stored in a variable. The compile-time constant allows the compiler to do some loop-unrolling and auto-vectorization that it can't do when the stride isn't a compile-time constant (I have just waved my hands with that explanation). It may be that one memory copy + a fully optimized loop is faster than zero-copy + an unoptimized loop. So don't assume that your data copy,
bdt.Compute(np.array(new.T[0], copy=True))
must be avoided. It might even be the fastest way to run. (If that is, in fact, true, the best solution would be for TMVA to internally copy a non-C-contiguous array as a first step of bdt.Compute.)
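For readers mining this thread: the contiguity check and copy described above can be sketched in plain NumPy (array contents here are illustrative). np.ascontiguousarray is the standard way to get a C-contiguous buffer, copying only when necessary:

```python
import numpy as np

# a transposed array is a strided view, not C-contiguous
rect = np.arange(15).reshape(3, 5)
t = rect.T
assert not t.flags["C_CONTIGUOUS"]

# np.ascontiguousarray returns a C-contiguous copy (or the array itself
# if it is already contiguous), safe to hand to a library that only
# accepts C-contiguous buffers
c = np.ascontiguousarray(t)
assert c.flags["C_CONTIGUOUS"]
assert (c == t).all()
```

This is equivalent in effect to the np.array(..., copy=True) in the pdb line above, since freshly copied NumPy arrays are C-contiguous by default.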
Date: 2021-05-12 14:24:42 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: is .mask supposed to support 2D numpy arrays given that the normal subscript operator does?
Date: 2021-05-12 14:24:43 From: Angus Hollands (@agoose77:matrix.org)
E.g.
https://gist.github.com/agoose77/ccb0c37890caa48297b499d3b885ee62
Date: 2021-05-12 14:50:15 From: Jim Pivarski (@jpivarski)
That's the sort of thing that hasn't been decided. There's no equivalent of mask in NumPy, so we get to define what it does. This case of a 2D slice might not be easy to implement (given how mask is implemented, as a recursive descent that acts at one level, not as general as the __getitem__ implementation), but I can see how it was expected.
Date: 2021-05-12 15:35:43 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I understand you. I suppose it's tricky because we (or I would) want the mask layout to broadcast to the array layout, which is not the same as the __getitem__ behaviour
Date: 2021-05-15 14:08:35 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: it's the weekend so please feel free not to read this any time soon.
I have a regular array of shape (N,) and I want to index into it with a Jagged array full of indices. I would like the behaviour of numpy advanced indexing (where each index at any level of nesting pulls out the same index in my regular array, independent of position), but for a jagged input.
Is my best bet to just flatten->index->unflatten this?
Date: 2021-05-16 12:27:59 From: Jim Pivarski (@jpivarski)
I don't understand what you're trying to do, but unflatten might not work if the array of counts is not one-dimensional. (I'm not sure what that would mean.) If what you want is to unflatten with index.ravel() and then put its regular dimensions back in, you could do that with another ak.unflatten (to a constant) and, if necessary, ak.to_regular that new axis because it's constant. Or you could drop out of the high-level view and add the ak.layout.RegularArray manually.
Date: 2021-05-16 12:32:41 From: Angus Hollands (@agoose77:matrix.org)
☝️ Edit: i.e.
def np_index_jagged(arr, ind):
    assert arr.ndim == 2
    return ak.unflatten(arr[ak.flatten(ind)], ak.num(ind))
Date: 2021-05-16 12:37:00 From: Angus Hollands (@agoose77:matrix.org)
Ah, I can't un-edit my edit on Matrix 😕 Yes, I also want to only assert that the arr is 1D. At a higher-level, I just want the counts of the index to be treated (for any number of dimensions) as a 1D index. It would be nice to generalise this for a non-1D arr, but that's not something I need right now.
Date: 2021-05-17 10:11:49 From: Angus Hollands (@agoose77:matrix.org)
I would really like to take a second and say a big thank-you to all of the Awkward contributors. Since doing most of my analysis with Awkward now, I feel a lot more productive and, moreover, I feel like my code is more expressive and readable. It's not something I was actively aware of before, but I certainly feel happier working in this way than I did beforehand.
Date: 2021-05-17 12:21:30 From: Jim Pivarski (@jpivarski)
:)
Date: 2021-05-17 13:17:15 From: Angus Hollands (@agoose77:matrix.org)
Hey @jpivarski , I've noticed that the fix for matmul in https://github.com/scikit-hep/awkward-1.0/pull/847 has introduced potentially a new bug. I'm now unable to matmul between Vector3DArrays and VectorObject3D objects. Do you have any thoughts here?
Date: 2021-05-17 13:31:32 From: Angus Hollands (@agoose77:matrix.org)
It's clear that it's just a non-Awkward aware type being passed into recursively_apply. I'm not sure on the expected semantics in Awkward for handling non awkward friendly arguments. I would assume that recursively_apply should error when this happens, and instead the caller should make sure not to call recursively_apply for these types.
Date: 2021-05-17 15:48:03 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org Actually, what should happen is non-Awkward types should be passed through this input without modification, like
inputs = [
    ak._util.recursively_apply(
        x, (lambda _: _), pass_depth=False, numpy_to_regular=True
    )
    if isinstance(x, ak.layout.Content)
    else x
    for x in inputs
]
which might be easier to read as a traditional for loop (would require a new temporary name, like newinputs).
The Numba code is supposed to handle array_of_vectors @ vector_object combinations, which it JIT-compiles as a special case. For this reason, the broadcasting code (ak._util.broadcast_and_apply) passes non-Awkward objects through, but ak._util.recursively_apply was never intended to do that, so it needs to be protected. As to whether ak._util.recursively_apply should pass through or error out on non-Awkward types, I don't know. It's an internal function, so it's just a matter of which seems easier to maintain, rather than which would be a good interface for users.
Date: 2021-05-17 16:06:30 From: Angus Hollands (@agoose77:matrix.org)
Right, that's what I was aiming for :P
Date: 2021-05-17 16:07:35 From: Jim Pivarski (@jpivarski)
I thought I'd just put that change in a PR so that you can add a test for the case you just encountered.
Date: 2021-05-17 16:07:49 From: Jim Pivarski (@jpivarski)
My copy of Awkward is building, so that I can properly test...
Date: 2021-05-17 16:09:00 From: Angus Hollands (@agoose77:matrix.org)
Thanks a lot Jim, let me know if I can do anything.
Date: 2021-05-17 16:12:29 From: Jim Pivarski (@jpivarski)
I just created this for you to finish off with a test: https://github.com/scikit-hep/awkward-1.0/pull/868
Date: 2021-05-17 16:13:50 From: Jim Pivarski (@jpivarski)
I've given you write access so that you can directly work on this branch.
Date: 2021-05-17 16:15:51 From: Angus Hollands (@agoose77:matrix.org)
Thanks, I'll take a look :)
Date: 2021-05-17 21:06:48 From: Henry Schreiner (@henryiii)
Does Awkward 1 not support lexsort, but Awkward 0 did? Someone upgraded awkward and now gets:
TypeError: no implementation found for 'numpy.lexsort' on types that implement __array_function__: [<class 'awkward.highlevel.Array'>]
Date: 2021-05-17 21:08:27 From: Jim Pivarski (@jpivarski)
If lexsort ever worked before, it was definitely an accident. We never implemented an overload for np.lexsort.
Date: 2021-05-17 21:11:11 From: Jim Pivarski (@jpivarski)
Oh! If it was one-dimensional in Awkward 0, it would have been an np.ndarray. The Python type differed for different array-element types. In that case, np.lexsort would have worked because it was a plain NumPy array. Now the Python type is always ak.Array and some NumPy functions aren't going to recognize that. If that's the case here, just cast with np.asarray or ak.to_numpy.
Date: 2021-05-19 14:38:59 From: alesaggio (@alesaggio)
Hi, I am a bit stuck with manipulating awkward arrays and I would gladly appreciate any help :). Given the following array:
<Array [[1],[2],[3]] type='3 * var * int64'>
I need to extend its inner elements by repeating them N times, in order to have something like (e.g. for N=5):
<Array [[1, 1, 1, 1, 1], [2, 2, 2, 2, 2], [3, 3, 3, 3, 3]] type='3 * var * int64'>
Could you help me understand what the best way to achieve this is? I tried broadcasting with a dummy array of the desired shape without any luck so far
Date: 2021-05-19 14:43:51 From: Angus Hollands (@agoose77:matrix.org)
@alesaggio: you can make your lhs a regular array, and then broadcast np.broadcast_arrays(ak.to_regular(x), np.zeros(5)). There may be a more concise way that doesn't involve a ufunc, I don't know :)
Date: 2021-05-19 14:46:56 From: alesaggio (@alesaggio)
ah, I was missing the ak.to_regular() function! This works, thanks a lot :)
Date: 2021-05-19 14:47:22 From: Jim Pivarski (@jpivarski)
@alesaggio @agoose77:matrix.org I was approaching this from a completely different angle, but I like your broadcast-based solution better. There's no ufunc here, and this is more concise than what I had in mind. Note that it's the first of the two return values from broadcast_arrays that you want:
output = np.broadcast_arrays(ak.to_regular(input), np.zeros(5, np.int64))[0]
Date: 2021-05-19 14:47:54 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: feel free to share your soln for posterity!
Date: 2021-05-19 14:48:32 From: Jim Pivarski (@jpivarski)
I wasn't done making it work; now I've closed the window. I was trying to do it as a slice, and for that, I had to construct a slice argument with ak.unflatten.
Date: 2021-05-19 14:49:02 From: Angus Hollands (@agoose77:matrix.org)
Also, offtopic - I'm parsing a binary data format and I've noticed that np.frombuffer seems to be a bottleneck for me vs struct for small numbers of bytes (in this case, using structured arrays). Is this something you have any understanding of?
Date: 2021-05-19 14:49:30 From: alesaggio (@alesaggio)
Got it, thanks @jpivarski !
Date: 2021-05-19 14:51:54 From: Angus Hollands (@agoose77:matrix.org)
Right, I follow*
Date: 2021-05-19 14:53:16 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org We don't make much effort to optimize for small arrays, many times, just for large arrays, few times. struct.unpack would only work on small amounts of data, but for that use-case, it probably is much faster than np.frombuffer. The whole idea of columnar data is to do few operations on big arrays, so you need to be in that regime for good performance.
Date: 2021-05-19 14:54:25 From: Angus Hollands (@agoose77:matrix.org)
Thanks Jim, that was my case. I'm hoping to present a talk to my group on using Awkward - right now we have data in MIDAS format, and it's quite easy to parse but I don't want to write this in C++ if I can help it
Date: 2021-05-19 14:55:24 From: Jim Pivarski (@jpivarski)
Actually, if you have data that could be described as "non-columnar," then you might want to parse it into columns using AwkwardForth.
Date: 2021-05-19 14:55:57 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: tbh I did think about that
Date: 2021-05-19 14:56:00 From: Jim Pivarski (@jpivarski)
https://indico.cern.ch/event/948465/contributions/4324131/
Date: 2021-05-19 14:56:12 From: Angus Hollands (@agoose77:matrix.org)
Can it parse imperatively i.e. read n bytes of x dtype?
Date: 2021-05-19 14:57:14 From: Jim Pivarski (@jpivarski)
Yes (this is a thread about non-columnar → columnar using AwkwardForth). Here's another reference: https://github.com/scikit-hep/awkward-1.0/wiki/AwkwardForth-documentation
Date: 2021-05-19 14:58:30 From: Jim Pivarski (@jpivarski)
It would be something like
n input #d-> output
where n puts the number on the stack and d is float64.
Date: 2021-05-19 15:00:44 From: Jim Pivarski (@jpivarski)
The biggest difference between struct.unpack and np.frombuffer is that struct.unpack can read mixed types, such as an int32, followed by three floats, followed by a raw bytestring, etc., while np.frombuffer requires everything in the stream to have the same numeric data type. AwkwardForth can be seen as a generalization of struct.unpack to a full, Turing-complete language. Admittedly, it's a strange one, but the implementation can be fast: https://skilldrick.github.io/easyforth/
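To make the contrast concrete, here is a small sketch (the record layout is hypothetical). struct.unpack handles mixed types in one call; np.frombuffer needs either a homogeneous dtype or, as a middle ground, a structured dtype describing the layout:

```python
import struct
import numpy as np

# a hypothetical record: an int32 followed by three float64s, little-endian
raw = struct.pack("<iddd", 7, 1.0, 2.0, 3.0)

# struct.unpack: mixed types in one call (fast for small records)
n, x, y, z = struct.unpack("<iddd", raw)

# np.frombuffer: a structured dtype describes the same mixed layout
# (packed, not aligned, to match the "<" struct format)
dtype = np.dtype([("n", "<i4"), ("xyz", "<f8", (3,))])
rec = np.frombuffer(raw, dtype=dtype)[0]
# rec["n"] == 7 and rec["xyz"] holds [1.0, 2.0, 3.0]
```

The frombuffer route pays a fixed setup cost per call, which is why struct wins for many small reads while frombuffer wins for a few large ones.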
Date: 2021-05-19 15:01:28 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: yeah I'm going to finish this work up and re-work it in forth
Date: 2021-05-19 15:01:57 From: Angus Hollands (@agoose77:matrix.org)
Because I'm pretty sure this is probably going to be a "simple" forth program
Date: 2021-05-19 15:02:50 From: Jim Pivarski (@jpivarski)
Good; most of these programs should be rather simple, or else generated algorithmically out of simple pieces.
Date: 2021-05-19 15:15:24 From: Angus Hollands (@agoose77:matrix.org)
Hehe, this is actually quite fun
Date: 2021-05-19 15:28:16 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: is there support for constants (e.g. if I'm looking for a sentinel string), or am I better generating the forth programmatically to do that?
Date: 2021-05-19 15:30:48 From: Angus Hollands (@agoose77:matrix.org)
also, is there a way to write temporary arrays rather than just input and output??
Date: 2021-05-19 15:54:58 From: Angus Hollands (@agoose77:matrix.org)
☝️ Edit: also, is there a way to write temporary arrays rather than just input and output? (er, sorry for the double punctuation).
Date: 2021-05-19 17:01:44 From: Jim Pivarski (@jpivarski)
Since it's intended to be generated, constants should be burned into the code. As for "sentinel strings," there isn't any string-handling (like string equality).
Date: 2021-05-19 17:02:47 From: Jim Pivarski (@jpivarski)
Standard Forth has temporary variables and temporary arrays; I've implemented temporary variables but not arrays. If there's a strong use-case for it, adding temporary arrays would be a matter of adding another part of Standard Forth.
Date: 2021-05-19 17:28:41 From: Angus Hollands (@agoose77:matrix.org)
Thanks. God, this is so much more elegant and readable than writing it imperatively.
Date: 2021-05-19 17:30:42 From: Jim Pivarski (@jpivarski)
:) Well, it is imperative, but Forth is a unique basis vector among programming languages.
Date: 2021-05-19 17:31:09 From: Angus Hollands (@agoose77:matrix.org)
Hehe, yes, to be more precise it's much nicer than operating at the collection level
Date: 2021-05-19 17:40:13 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: why do the type declarations change between inputs and outputs?
Date: 2021-05-19 18:30:38 From: Jim Pivarski (@jpivarski)
Inputs don't have types: they're raw buffers. What types they're presumed to have is determined by which parser commands you apply to them. It's a raw string.
Outputs do have types. They are columns that will become an Awkward Array. That's why the output declarations have types and the input declarations don't.
Date: 2021-05-19 18:35:28 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: how does one locally break out of a loop?
Date: 2021-05-19 18:36:22 From: Angus Hollands (@agoose77:matrix.org)
I thought exit, but it seems to quit the entire program
Date: 2021-05-19 18:37:18 From: Angus Hollands (@agoose77:matrix.org)
or maybe I'm getting something else wrong ...
Date: 2021-05-19 19:03:55 From: Jim Pivarski (@jpivarski)
exit breaks out of the function or program, I think, so it can be used as a "break" if the loop is in a function. (https://forth-standard.org/standard/core/EXIT)
Normally, you'd use a do-loop, while-repeat, repeat-until.
Date: 2021-05-19 19:54:39 From: Angus Hollands (@agoose77:matrix.org)
Yes, I think that's what I'll try next :)
Date: 2021-05-19 19:54:40 From: Angus Hollands (@agoose77:matrix.org)
Thanks!
Date: 2021-05-19 21:44:51 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: nice, the VM drops the read time to 200 ms for an 8 MB file (70k events)
Date: 2021-05-19 21:45:13 From: Angus Hollands (@agoose77:matrix.org)
I was getting ~2-4 s by doing it using struct + numpy
Date: 2021-05-19 21:45:17 From: Jim Pivarski (@jpivarski)
From what? I'm curious.
Date: 2021-05-19 21:45:34 From: Jim Pivarski (@jpivarski)
I see (2-4 sec).
Date: 2021-05-19 21:49:48 From: Angus Hollands (@agoose77:matrix.org)
Yes, here's the (horrible) vm code 😛 https://gist.github.com/agoose77/ee6ab19242ec9f571a39b9641dc6ee13
Date: 2021-05-19 21:50:09 From: Angus Hollands (@agoose77:matrix.org)
It's fab, I can just directly chuck that into awkward
Date: 2021-05-19 21:50:36 From: Jim Pivarski (@jpivarski)
Awesome! And you got to use gist's Forth syntax highlighter.
Date: 2021-05-19 21:51:01 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: tbh the main highlight (no pun intended) 😆
Date: 2021-05-19 21:51:14 From: Angus Hollands (@agoose77:matrix.org)
I love generating code but it's sad to lose the syntax highlighter
Date: 2021-05-19 21:51:51 From: Jim Pivarski (@jpivarski)
I think we need to have a Python wrapper around ForthMachine that takes a Form and source code and builds the ak.Array from the Form. Maybe using ak.from_buffers conventions.
Date: 2021-05-19 21:52:44 From: Angus Hollands (@agoose77:matrix.org)
It seems like that would be a nice feature
Date: 2021-05-19 21:52:45 From: Jim Pivarski (@jpivarski)
Or maybe there's nothing to do. You just need to make the Form's form_keys equal to the output names.
Date: 2021-05-19 21:52:56 From: Angus Hollands (@agoose77:matrix.org)
I've not used Form yet
Date: 2021-05-19 21:53:10 From: Angus Hollands (@agoose77:matrix.org)
I'm doing this currently
outputs = vm.outputs.copy()
num = outputs.pop("num")
events = ak.unflatten(ak.zip(outputs), num)
Date: 2021-05-19 21:53:29 From: Angus Hollands (@agoose77:matrix.org)
Sorry for the spam on Gitter
Date: 2021-05-19 21:54:13 From: Angus Hollands (@agoose77:matrix.org)
Would that remove the need to zip first?
Date: 2021-05-19 21:55:37 From: Jim Pivarski (@jpivarski)
>>> array = ak.Array([[{"x": 0.0, "y": []}, {"x": 1.1, "y": [1]}], [], [{"x": 3.3, "y": [1, 2, 3]}]])
>>> array.type
3 * var * {"x": float64, "y": var * int64}
>>> form, length, buffers = ak.to_buffers(array)
>>> form
{
"class": "ListOffsetArray64",
"offsets": "i64",
"content": {
"class": "RecordArray",
"contents": {
"x": {
"class": "NumpyArray",
"itemsize": 8,
"format": "d",
"primitive": "float64",
"form_key": "node2"
},
"y": {
"class": "ListOffsetArray64",
"offsets": "i64",
"content": {
"class": "NumpyArray",
"itemsize": 8,
"format": "l",
"primitive": "int64",
"form_key": "node4"
},
"form_key": "node3"
}
},
"form_key": "node1"
},
"form_key": "node0"
}
>>> length
3
>>> buffers
{'part0-node0-offsets': array([0, 2, 2, 3], dtype=int64), 'part0-node2-data': array([0. , 1.1, 3.3]), 'part0-node3-offsets': array([0, 0, 1, 4], dtype=int64), 'part0-node4-data': array([1, 1, 2, 3])}
Date: 2021-05-19 21:56:22 From: Jim Pivarski (@jpivarski)
You can make your own Form using ak.forms.Form.fromjson (might get renamed with an underscore between "from" and "json").
Date: 2021-05-19 21:56:42 From: Angus Hollands (@agoose77:matrix.org)
Ah yeah I recall this
Date: 2021-05-19 21:57:27 From: Jim Pivarski (@jpivarski)
With YAML → JSON, you can make a YAML description of where each Forth output should go in the array.
That's why I'm thinking that there might be nothing to do: it's these few pieces.
Date: 2021-05-19 21:58:07 From: Angus Hollands (@agoose77:matrix.org)
Another really nice thing about this approach is that I don't have to worry about memory handling as much - I can just stop the vm when we have enough results
Date: 2021-05-19 21:58:09 From: Jim Pivarski (@jpivarski)
Oh, your structure is pretty simple. The Form would be basically a verbose list.
Date: 2021-05-19 22:16:17 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org For your case, if you make offsets instead of counts (your num; note that the AwkwardForth word +<- is good for that), then you could do this:
>>> # the output buffers
>>> outputs = {
... "list-offsets": np.array([0, 3, 3, 5]),
... "x-content": np.array([1.1, 2.2, 3.3, 4.4, 5.5]),
... "y-content": np.array([1, 2, 3, 4, 5]),
... }
>>>
>>> # make the length in Forth, either as a final value on the stack
>>> # or in a variable (forthmachine.variables["length"])
>>> length = 3
>>>
>>> # maybe this is a YAML from somewhere
>>> form = ak.forms.Form.fromjson("""
... {
... "class": "ListOffsetArray64",
... "offsets": "i64",
... "form_key": "list-offsets",
... "content": {
... "class": "RecordArray",
... "contents": {
... "x": {
... "class": "NumpyArray",
... "primitive": "float64",
... "form_key": "x-content"
... },
... "y": {
... "class": "NumpyArray",
... "primitive": "int64",
... "form_key": "y-content"
... }
... }
... }
... }
... """)
>>>
>>> array = ak.from_buffers(form, length, outputs, key_format="{form_key}")
>>> array
<Array [[{x: 1.1, y: 1}, ... x: 5.5, y: 5}]] type='3 * var * {"x": float64, "y":...'>
>>> array.tolist()
[[{'x': 1.1, 'y': 1}, {'x': 2.2, 'y': 2}, {'x': 3.3, 'y': 3}], [], [{'x': 4.4, 'y': 4}, {'x': 5.5, 'y': 5}]]
>>> array.type
3 * var * {"x": float64, "y": int64}
This could be written up somewhere as "the right way" to do this, as it's generalizable.
Date: 2021-05-19 22:23:06 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: right good idea
Date: 2021-05-19 22:23:26 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: do you have any best practice regarding stack manipulation vs creating variables
Date: 2021-05-19 22:23:52 From: Angus Hollands (@agoose77:matrix.org)
e.g. when you have flow control
Date: 2021-05-19 22:23:53 From: Angus Hollands (@agoose77:matrix.org)
I feel like it's easy to lose track of what the current stack item is
Date: 2021-05-19 22:25:47 From: Jim Pivarski (@jpivarski)
The Forth way of doing things is to do as much with the stack as possible, though that's harder to figure out. (An interesting thing about Forth is that its adherents blame the programmer!)
Date: 2021-05-19 22:26:39 From: Jim Pivarski (@jpivarski)
For performance, stack manipulation is assignment to a fixed-size array, and variable assignment is assignment to a std::vector item. I don't know how much C++ compiles away the indirection.
Date: 2021-05-19 22:27:46 From: Jim Pivarski (@jpivarski)
They're semantically different, too. If you have a recursive or looping function, you can't use variables because you don't know how deep the recursion is going to go. Variables are a fixed set of named globals.
Date: 2021-05-19 22:58:56 From: Angus Hollands (@agoose77:matrix.org)
Right, that makes sense. Keep it local.
Date: 2021-05-19 23:01:58 From: Angus Hollands (@agoose77:matrix.org)
I know it's late in the day, but I've perhaps found a bug - https://gist.github.com/agoose77/9b596b306f1b725aa0c353551e1af3b7
Date: 2021-05-19 23:03:47 From: Angus Hollands (@agoose77:matrix.org)
If I replace these lines with this one, a later halt is raised, suggesting that data is not being skipped by the correct amount.
Date: 2021-05-19 23:04:47 From: Angus Hollands (@agoose77:matrix.org)
I'm going to dig into it because I can't share the data 😿
Date: 2021-05-19 23:10:11 From: Angus Hollands (@agoose77:matrix.org)
ffs it's a floordiv issue, apologies
Date: 2021-05-20 13:33:09 From: Angus Hollands (@agoose77:matrix.org)
Has it occurred to anyone to override the default __str__ of tracebacks? I'm thinking
Well, this is ... Awkward
Clearly a missed opportunity otherwise 😆
Date: 2021-05-25 14:29:57 From: Andrew Naylor (@asnaylor)
Hi, I wonder if someone here might be able to help me. I want to expand a 1D array to match the shape/layout of a jagged array, the proper Awkward Array way.
My example is here; each event has many pulses but only 1 timestamp, and I'd like the timestamp array to match the shape/layout of the jagged pulse array:
>>> root_tree = uproot.open(file_path)['Events']
>>> data = root_tree.arrays(['pulseArea_phd', 'triggerTimeStamp'])
>>> print(data['pulseArea_phd'])
[[340, 1.83, 1.07, 1.71, 3.24, 0.54, ... 0.772, 0.72, 0.692, 0.944, 0.801, 0.699]]
>>> print(data['triggerTimeStamp'])
[1615520134, 1615520134, 1615520134, ... 1615520185, 1615520185, 1615520185]
My hacky solution is this:
>>> print(data['triggerTimeStamp']* (data['pulseArea_phd'] >= 0))
[[1615520134, 1615520134, 1615520134, ... 1615520185, 1615520185, 1615520185]]
But this is not the correct way to do this
Date: 2021-05-25 14:41:46 From: Angus Hollands (@agoose77:matrix.org)
@asnaylor: it sounds like a similar variant of a recent question - you want to broadcast the timestamps to the pulse?
Date: 2021-05-25 14:42:56 From: Andrew Naylor (@asnaylor)
@agoose77:matrix.org yes that's it
Date: 2021-05-25 14:44:58 From: Angus Hollands (@agoose77:matrix.org)
@asnaylor: try
timestamp, pulse = np.broadcast_arrays(data['triggerTimeStamp'], data['pulseArea_phd'])
Date: 2021-05-25 14:48:15 From: Andrew Naylor (@asnaylor)
ah perfect, thank you @agoose77:matrix.org
Date: 2021-05-25 14:48:24 From: Andrew Naylor (@asnaylor)
that was the function i needed
Date: 2021-05-25 14:51:21 From: Angus Hollands (@agoose77:matrix.org)
@asnaylor: I've never really needed it in my NumPy code, but Awkward really benefits (I suspect due to the additional indexing modes needing some explicitness sometimes)
Date: 2021-05-28 11:51:37 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: is broadcasting of None a special case vs other primitive types? I want to be able to zip a None-containing (20000*option[var*uint16]) array with some other structures, but the broadcasting causes all fields to be None where the option array is None. E.g.
>>> np.broadcast_arrays(ak.Array([[1,2,3],[4]]), [None])
>>> [<Array [None, None] type='2 * option[var * int64]'>,
<Array [None, None] type='2 * option[var * bool]'>]
Date: 2021-05-28 11:58:52 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org I think what's happening is that "None" is being taken like an empty list—empty lists can cause a broadcast to lose information. But this isn't what we want—we might need to tweak the broadcasting rules.
Date: 2021-05-28 12:00:57 From: Angus Hollands (@agoose77:matrix.org)
Right, it also happens with np.broadcast_arrays(ak.Array([None] * 3), np.arange(3)) if that provides another angle
Date: 2021-05-28 12:52:22 From: Angus Hollands (@agoose77:matrix.org)
Hey Jim, sorry if I'm overloading the awkward issues at the moment - just had some time to work on it.
One thing that's been confusing me up until recently was the distinction between ? and option in the type string. Having looked at the source, I can see that it's just a different representation of the option type for array-like entries vs primitives. Is this a useful distinction? I'd almost prefer one or the other in both cases, e.g.
>>> ak.Array([[1],[2],None])
3 * ?[var * int64]
or
>>> ak.Array([[1],[2],None])
3 * option[var * int64]
Do you have any thoughts on this?
Date: 2021-05-28 12:54:06 From: Jim Pivarski (@jpivarski)
"?" and "option" are synonyms, though "option" can take square-bracketed arguments and "?" can't. I'm just following Datashape syntax. https://datashape.readthedocs.io/en/latest/
Date: 2021-05-28 12:54:48 From: Angus Hollands (@agoose77:matrix.org)
Well that answers that one then! :)
Date: 2021-06-01 15:15:48 From: Jonas Rübenach (@jrueb)
Is there an equivalent to np.around?
Date: 2021-06-01 16:46:37 From: Jim Pivarski (@jpivarski)
I had to look it up: np.around. The short answer is no, but this is almost a ufunc and really should be usable. I'm going to suggest a trick using Numba:
>>> import numba as nb
>>> @nb.vectorize([nb.float64(nb.float64)])
... def around(x):
... return np.around(x, 2)
...
>>> array = ak.Array([[1.1111111, 2.222222, 3.3333333], [], [4.4444444, 5.5555555]])
>>> array.tolist()
[[1.1111111, 2.222222, 3.3333333], [], [4.4444444, 5.5555555]]
>>> around(array).tolist()
[[1.11, 2.22, 3.33], [], [4.44, 5.56]]
Above, Numba is creating an actual ufunc using nb.vectorize, which Awkward Array recognizes as a ufunc and propagates it down into the nested list in this example. The list of supported signatures must be given (only nb.float64(nb.float64) here), which doesn't include the decimals parameter (2) because that's what breaks its ufuncyness.
This is the sort of thing we need to have in tutorials on the awkward-array.org page, which I keep meaning to get to...
Date: 2021-06-02 10:03:23 From: Jonas Rübenach (@jrueb)
Ok, nice. Thanks
Date: 2021-06-03 16:36:07 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I've implemented a groupby function that, at present, places each group under a field in a record array. I don't really use much pandas these days - do you think this is "worse" than returning a non-record array, with the first dimension populated by each group subarray?
Date: 2021-06-03 17:31:33 From: Angus Hollands (@agoose77:matrix.org)
Also, I am returning a RecordArray, which has fields with different lengths. I want to be able to wrap the first item of this record array in the return value, rather than requiring the user to call groupby(...)[0]. I can see that this is basically what __getitem__ does, but I haven't had any luck with ak._util.wrap. Any ideas?
def groupby(group, array, sorter=None, as_record: bool = True):
    """Group an array into sub-arrays using a given group index.

    :param group: group index array
    :param array: array to apply groupby to
    :param sorter: optional pre-computed sort key
    :param as_record: return the groups as fields in a RecordArray
    """
    if sorter is None:
        sorter = np.argsort(group)
    inner = ak.layout.IndexedArray64(ak.layout.Index64(sorter), array.layout)
    offsets, fields = _find_group_offsets(group[sorter])
    if as_record:
        layouts = [
            ak.layout.ListOffsetArray64(ak.layout.Index64(offsets[i : i + 2]), inner)
            for i in range(len(offsets) - 1)
        ]
        outer = ak.layout.RecordArray(layouts, fields)
    else:
        outer = ak.layout.ListOffsetArray64(ak.layout.Index64(offsets), inner)
    return ak.Array(outer)
Date: 2021-06-03 17:39:37 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org The fields of a RecordArray can't have different lengths. (Or, the underlying contents can, but in the high-level presentation, they'll be truncated to the shortest.)
Date: 2021-06-03 17:41:06 From: Jim Pivarski (@jpivarski)
The output of a group-by shouldn't be in a RecordArray, since it can scale with the length of the array (consider the case in which group-by keys are unique or nearly unique: every item goes into a different group), but the number of fields of a RecordArray must not scale with the length of an array.
Date: 2021-06-03 17:42:41 From: Jim Pivarski (@jpivarski)
Also, the field names of a RecordArray shouldn't depend on the contents of the data array (and those names can only be strings, so what if you're grouping by integers?).
Date: 2021-06-03 17:44:19 From: Jim Pivarski (@jpivarski)
Here's a structure that would be appropriate for the output of a group-by (expressed as a Datashape, in which the type of the keys of the group-by is K and the type of the values in the group is V and the number of groups is N):
N * (K, var * V)
Date: 2021-06-03 17:44:41 From: Jim Pivarski (@jpivarski)
The parentheses are a tuple (RecordArray with recordlookup=None).
Date: 2021-06-03 17:45:55 From: Jim Pivarski (@jpivarski)
See ak.run_lengths for a helper function that can help build a group-by (along with ak.argsort and ak.unflatten).
Date: 2021-06-03 17:46:14 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I was optimising for the small n case. Also, can't you have different sized fields? Just don't broadcast that far. I can't recall the example now, but I have datasets that do this. They just need to be inside an array.
Date: 2021-06-03 17:46:32 From: Angus Hollands (@agoose77:matrix.org)
This structure looks better actually, encoding the name in a tuple
Date: 2021-06-03 17:47:00 From: Jim Pivarski (@jpivarski)
You can't have different sized fields directly in a record, but some of the fields can contain variable-length lists.
Date: 2021-06-03 18:16:45 From: Angus Hollands (@agoose77:matrix.org)
Yes, I would need to have a leading 1 dimension otherwise
Date: 2021-06-03 18:39:53 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: ah, I'd forgotten about run_lengths, that reduces the number of functions :)
Date: 2021-06-07 14:39:27 From: agoose77 (@agoose77:matrix.org)
@jpivarski: I just went to unzip a record array with a union type, and noticed that there isn't any mention of what should be expected to happen in that case. For convenience, I'd want the existing behaviour: take the common subset. But I also wonder whether there should be a "strict" behaviour? Perhaps by default (although that would be a breaking change).
Date: 2021-06-07 14:43:32 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org With unions, there should be quite a few cases that can't be unzipped. It should only be unzippable if all the variants of the union are records with the same set of field names or tuples with the same number of fields. I think any other case should raise an exception (and if it doesn't now, I wouldn't call it a "breaking change" to introduce that—I'd call it a bug fix!).
Date: 2021-06-07 14:47:11 From: agoose77 (@agoose77:matrix.org)
@jpivarski: OK, I'll introduce a reproducer :)
Date: 2021-06-07 14:53:25 From: agoose77 (@agoose77:matrix.org)
Tracking it here, will add a test 🙂 https://github.com/scikit-hep/awkward-1.0/issues/898#
Date: 2021-06-07 16:23:41 From: Angus Hollands (@agoose77:matrix.org)
off-topic, has anyone already established a standard/convention for describing the structure of record arrays in docstrings? I'm currently using a Markdown list with the field names but wondered if this is an already solved problem
Date: 2021-06-07 22:19:26 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski have you done any benchmarking for when making a copy is better than adding another layer of indexing? I have a case where I need to rewrite groupby such that the groups are predetermined. I can either 'unflatten' an indexed view of the data, or concatenate the views directly into a new array. Clearly at one level of Indirection, if only a few passes happen over the data I suspect the copy is worse, but as the number of indices increase I assume this swaps over. Maybe this is related to the idea of a packed function?
Date: 2021-06-07 22:24:21 From: Jim Pivarski (@jpivarski)
This was such a case: https://github.com/scikit-hep/awkward-1.0/pull/261
Although it can be situation-dependent: adding an IndexedArray layer trades an upfront copy with future indirection. It was worthwhile in the above PR because the layer we were protecting from an upfront copy was a RecordArray, records can be very wide (dozens to hundreds of fields), and often the user will only access a few of those fields after the operation is over. That's why most operations are eager, but ones involving RecordArrays are sometimes lazy, like the above.
Date: 2021-06-07 22:26:47 From: Angus Hollands (@agoose77:matrix.org)
Right. I've been thinking about the lazy vs upfront w.r.t multiple passes on the data, so I suspect it needs benchmarking for my particular case!
Date: 2021-06-07 22:28:35 From: Jim Pivarski (@jpivarski)
If you know you're going to access it, you should probably construct it eagerly. The "lazy carry" implemented in the above PR is a workaround for users (very reasonably!) wanting to compute combinations first and decide what fields to read later.
Date: 2021-06-07 22:46:01 From: Angus Hollands (@agoose77:matrix.org)
Fab, that is reasonable. Does this mean that indexing into a non record array just produces a copy? I never checked myself and am now afk
Date: 2021-06-07 22:50:32 From: Jim Pivarski (@jpivarski)
If it's a simple enough slice, nothing is copied; otherwise, it results in a copy of a buffer associated with the root of the tree. It does not result in a cascading copy through the tree. For instance, if you do an advanced slice (slice by array) on a ListArray, it will create new starts and stops, but it will not change the content downward.
Date: 2021-06-08 08:05:07 From: Angus Hollands (@agoose77:matrix.org)
OK, looking at the above PR, it seems like the short answer is eager lazy where eagerness is not inevitably required (+ some additional rules)?
Date: 2021-06-08 13:29:29 From: Jim Pivarski (@jpivarski)
You mentioned slices, and d-dimensional slices can always be performed by changing no more than d dimensions of nested tree nodes.
Date: 2021-06-08 13:30:42 From: Angus Hollands (@agoose77:matrix.org)
☝️ Edit: OK, looking at the above PR, it seems like the short answer is lazy where eagerness is not inevitably required (+ some additional rules)?
Date: 2021-06-08 13:31:20 From: Angus Hollands (@agoose77:matrix.org)
The more I look at the PRs the more its clear how much work was done in the last year or two!
Date: 2021-06-08 13:32:23 From: Angus Hollands (@agoose77:matrix.org)
I opened a Discussion relating to implementing ufunc-like operations on records, but the wider question for me is how you would advise structuring an analysis; with NumPy I often don't assume how many dimensions an array might have, whilst with Awkward that is something I currently need to pre-determine whenever I need to operate with Numba (or solve with one of the approaches in the Discussion) . Do you have any strong thoughts here? Is it a case of "stop generalising everything and fix your dimensions", or is this something where the sorts of tricks above are reasonable?
Date: 2021-06-08 19:49:38 From: Jim Pivarski (@jpivarski)
(Replies to the above have gone to the Discussion, since it's a more permanent place.)
Date: 2021-06-09 08:05:27 From: Angus Hollands (@agoose77:matrix.org)
I've observed that indexing into an n-D record array with flat NumpyArrays seems ~44 times slower than indexing the same n-D numpy table. I find myself surprised at this performance difference; I would expect something more like ~3 times given the width of the record array (3 fields).
<RegularArray size="4">
<content><RegularArray size="64">
<content><RecordArray length="1024">
<field index="0" key="u">
<NumpyArray format="h" shape="1024" data="65 66 67 68 65 ... 64 64 64 64 64" at="0x000003c10f80"/>
</field>
<field index="1" key="v">
<NumpyArray format="h" shape="1024" data="31 31 31 31 30 ... 68 67 66 65 64" at="0x000003c45190"/>
</field>
<field index="2" key="region">
<NumpyArray format="h" shape="1024" data="0 0 0 0 0 ... 0 0 0 0 0" at="0x000003293200"/>
</field>
</RecordArray></content>
</RegularArray></content>
</RegularArray>
Date: 2021-06-09 08:20:56 From: Angus Hollands (@agoose77:matrix.org)
Interestingly, this is not an issue in the Numba loop-unrolled equivalent
Date: 2021-06-09 08:21:20 From: Angus Hollands (@agoose77:matrix.org)
well, perhaps not interestingly, Numba is using bare arrays
Date: 2021-06-09 13:21:22 From: Jim Pivarski (@jpivarski)
Are you measuring the O(1) time or the O(n) time? Awkward Arrays have slower startup time (the O(1) overhead involved in getting ready to perform a slice) than NumPy, but in situations in which they're doing the same thing, the same O(n) time that scales with the size of the array n. If your arrays are really this small, n = 1024, then the access time would be dominated by the startup overhead. All of the optimization effort went into the O(n) scaling, not the O(1) startup.
In short, you're not supposed to repeatedly call __getitem__ on an array in a hot loop (something that is called many times, on the scale of n), outside of Numba. If you're outside of Numba, you'd want to do it in a single slice.
The reason for the choice to focus on O(n) times only is because the O(1) is fundamentally limited by the speed of Python, anyway. It could be faster, but you'd eventually run into a speed limit that we can't lift (because it's Python) and then you'd have to turn the many __getitems__ into a single slice or use Numba anyway.
Date: 2021-06-09 13:35:25 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: sure, I believe I was scaling the size of the array and observing the speed difference which was where I was surprised.
Date: 2021-06-09 13:38:06 From: Jim Pivarski (@jpivarski)
At some scale for shape, probably in the millions, the wall time will go from being constant with respect to shape to linear. Above that point, the linear slope of time-vs-shape is 44 times larger than time-vs-shape for NumPy?
Date: 2021-06-09 13:49:53 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org This is an example of an O(n) scaling issue that we addressed (Awkward Array was performing O(n) operations to prepare for an uncommon but possible case; the optimization skips this preparation if it's not needed): https://github.com/scikit-hep/awkward-1.0/issues/442
Here's another example of a scaling study, but this one was not implemented because Awkward Array legitimately needs to do more work than NumPy here (not doing so would cause unwanted behavior): https://github.com/scikit-hep/awkward-1.0/issues/852
Date: 2021-06-09 13:55:48 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: not sure what I was observing after re-benchmarking:
Date: 2021-06-09 13:56:26 From: Angus Hollands (@agoose77:matrix.org)
🤷♂️
Date: 2021-06-09 13:57:13 From: Jim Pivarski (@jpivarski)
What is t/s, is it time in seconds or time per something?
Date: 2021-06-09 13:57:51 From: Angus Hollands (@agoose77:matrix.org)
Sorry, spacing was missing "t /s" (time in seconds)
Date: 2021-06-09 13:59:00 From: Jim Pivarski (@jpivarski)
Although this is a semi-log plot, it looks like the slopes are the same, but Awkward's O(1) setup time is so large that it continues to dominate at N=1e7.
Date: 2021-06-09 13:59:26 From: Jim Pivarski (@jpivarski)
Actually, no, I can't judge the slope on a semi-log plot.
Date: 2021-06-09 13:59:42 From: Angus Hollands (@agoose77:matrix.org)
This is basically doing this MAP[addr.asad, addr.aget, addr.cobo] where MAP is an ND regular array and addr is a record array
Date: 2021-06-09 14:00:47 From: Angus Hollands (@agoose77:matrix.org)
🤦 yep, force of habit
Date: 2021-06-09 14:01:25 From: Angus Hollands (@agoose77:matrix.org)
Bit disappointing that you can't visually take an x-axis logarithm in your head, though 😉 /s
Date: 2021-06-09 14:01:57 From: Jim Pivarski (@jpivarski)
So addr.asad, addr.aget, and addr.cobo are all arrays of integers with the same length and MAP is the array whose layout is given above?
Date: 2021-06-09 14:02:17 From: Jim Pivarski (@jpivarski)
Oh, and the bottom is Numba, not NumPy.
Date: 2021-06-09 14:02:47 From: Angus Hollands (@agoose77:matrix.org)
I might need to redefine the problem 😅
Date: 2021-06-09 14:03:25 From: Angus Hollands (@agoose77:matrix.org)
The indices come from this addr object with some redundant fields 107949 * deconvolution["aget": uint8, "amplitude": float64, "asad": uint8, "channel": uint8, "cobo": uint8, "index": uint16, "time": float64]
Date: 2021-06-09 14:03:39 From: Angus Hollands (@agoose77:matrix.org)
The lookup MAP is 4 * 4 * 64 * int16
Date: 2021-06-09 14:04:03 From: Angus Hollands (@agoose77:matrix.org)
@nb.njit
def nb_addr(data):
    res = np.zeros(len(data), dtype=np.int_)
    for i, addr in enumerate(data):
        res[i] = MAP[addr["asad"]][addr["aget"]][addr["cobo"]]
    return res
vs
MAP[addr.asad, addr.aget, addr.cobo]
Date: 2021-06-09 14:05:36 From: Jim Pivarski (@jpivarski)
As __getitem__ descends through the tree, it is making intermediate arrays to slice the next level because it doesn't know yet what type that next level is going to be. NumPy knows that all levels are rectilinear and can therefore make some optimizations (in NumPy, I doubt it's more expensive to slice from the last dimension than the first, but in Awkward Array, it's very asymmetric).
Numba is a different story entirely: absolutely everything was pared down to iterate over the data as fast as possible (probably faster than NumPy, but not faster than NumPy-in-Numba), but it's only iteration: all other computation is up to you. That's the tradeoff.
Date: 2021-06-09 14:07:56 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: OK, that seems reasonable. I couldn't face making this too fast so I've involved ArrayBuilder anyway 😉
Date: 2021-06-09 14:11:49 From: Jim Pivarski (@jpivarski)
This is getitem_next for a RegularArray:
It takes the slice array in (originally addr.asad) and combines it with the next-dimension slice (addr.aget) to make a new slice to apply to the next level down. That combination happens in the kernel kernel::RegularArray_getitem_next_array_64. That new array-slice would work in any node type (ListArray, RecordArray, UnionArray, etc.). When NumPy implements the same kind of slice, it doesn't have to worry about the next dimension possibly not being regular.
There's a potential optimization here: if we know that the array is regular everywhere, we could remove some steps. Actually, we'd be pretty much replacing it with a NumPy array—that's what we'd have to do. Without that optimization, we can encourage users to use NumPy for NumPy-like cases and Awkward Array when NumPy can't do it. (A similar argument can be made for using Pandas when you can't do it in pure NumPy because of the overhead that Pandas introduces, but there are a lot of things you can do in Pandas that you can't do in NumPy.)
Date: 2021-06-09 14:16:26 From: Angus Hollands (@agoose77:matrix.org)
I see the thought process. I suppose it's not urgent but I might open an issue to track it?
Date: 2021-06-09 14:20:39 From: Jim Pivarski (@jpivarski)
This would be quite low in a list of priorities, and I don't want to lose track of all the issues asking for new features. I suppose if it's labeled "performance," not many issues have that label and it won't get too confusing.
Date: 2021-06-09 14:22:59 From: Angus Hollands (@agoose77:matrix.org)
I was thinking of using such a label. How do you feel about that?
Date: 2021-06-09 14:28:22 From: Jim Pivarski (@jpivarski)
Yeah, I changed my mind mid-sentence. As long as it's labeled "performance," we can filter for it or against it when looking for things to fix. Having it as an issue would be a place to put your plot, though it would help to make a version of the plot with linear axes (though that would require going much higher in n, so that the Numba or NumPy version is visible compared to the non-Numba Awkward Array) or log-log. In log-log, the asymptotic slope would have to be 1, since it will scale as n to the first power, but at sufficiently high n, the vertical offset tells us the real (lin-lin) slope of the curve.
Date: 2021-06-09 14:34:13 From: Angus Hollands (@agoose77:matrix.org)
I'll keep the log-log I posted above, and I can look at scaling up in order to produce a linear counterpart 🙂 Are you suggesting taking a linear fit to the high N region in log-log space, and determining the y intercept for the scaling factor between numba and awkward?
Date: 2021-06-09 14:37:56 From: Jim Pivarski (@jpivarski)
Nothing fancy. Just that an issue is a good place to put this information, so that it's not lost.
Date: 2021-06-09 14:47:51 From: Angus Hollands (@agoose77:matrix.org)
Righto :)
Date: 2021-06-09 14:50:49 From: Angus Hollands (@agoose77:matrix.org)
Related to the unflatten issue, the problem that I'm solving is a groupby with pre-defined groups. I want therefore the grouped dimension to have a regular size.
The best solution that I've come up with is:
- generate regular-array of run-lengths (custom numba fn)
- create a ListOffsetArray over the axis+1 contents which indexes with the run-length derived offsets
- create another ListOffsetArray which replaces the existing axis with offsets of arange(...)*3, and content of the aforementioned ListOffsetArray.
Does this seem like the best way to do this?
Date: 2021-06-09 14:52:08 From: Angus Hollands (@agoose77:matrix.org)
For posterity, the run lengths are given by
@njit_at_dim(1)
def run_lengths_of(data, builder):
    count = 0
    i = 0
    # pre-defined integer groups
    for group in range(3):
        for i in range(i, len(data)):
            if data[i] != group:
                break
            count += 1
        builder.integer(count)
        count = 0
    return builder
Date: 2021-06-09 14:54:35 From: Jim Pivarski (@jpivarski)
Yes, that looks good. Only comment: if the output array is flat (i.e. can be a NumPy array), make a NumPy array instead of using an ArrayBuilder. You know the size of the array of integers before entering the for loop, which is all you need to allocate with np.empty(len(data)*3) and fill it up. That avoids a lot of overhead, including allocate-only-once, which would still be an issue with TypedArrayBuilder.
Date: 2021-06-09 14:56:56 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: yeah, I'm suffering from the itch to generalise everything at the moment which was why I opened the Discussion 😕 I think maybe I need to be a bit more concrete about how many dimensions my inputs have etc, e.g. do they operate at array of event or event level.
Date: 2021-06-09 15:28:22 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: here's a weird one (at least, weird to me): https://mybinder.org/v2/gist/agoose77/90d13a395af9f61f80328ed041cfa789/07755625965352f9ee22ea6374f31e2bc75d8ae5
Date: 2021-06-09 15:28:38 From: Angus Hollands (@agoose77:matrix.org)
offline link - https://mybinder.org/v2/gist/agoose77/90d13a395af9f61f80328ed041cfa789
Date: 2021-06-10 12:25:21 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: is there a quick rule to work out why a concatenate operation on an internal axis (!=0) is producing a union type, when the types look the same?
Date: 2021-06-10 12:35:07 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org There is a chance that there's a bug: the general strategy is that it builds a UnionArray and then calls simplify_uniontype(), but if the latter isn't called, then compatible types won't get merged. What are the types that look like they should have been merged? Or the full example?
Date: 2021-06-10 12:37:06 From: Angus Hollands (@agoose77:matrix.org)
1138 * var * var * pad["u": int64, "v": int64, "time": float64, "amplitude": float64, "region": int64]
1138 * var * var * pad["u": int64, "v": int64, "time": float64, "amplitude": float64, "region": int64]
Date: 2021-06-10 12:38:30 From: Angus Hollands (@agoose77:matrix.org)
I promise that I'm not on a bug hunting mission at the moment 😆
Date: 2021-06-10 12:48:07 From: Jim Pivarski (@jpivarski)
If the two pad types have different parameters, that could be the reason why. In principle, it ought to be
- getting a request to simplify the contents in concatenate: https://github.com/scikit-hep/awkward-1.0/blob/3f36053f0a2d40c6920aa27c65d9fb08e0677495/src/awkward/operations/structure.py#L1641-L1643
- which calls simplify_uniontype: https://github.com/scikit-hep/awkward-1.0/blob/3f36053f0a2d40c6920aa27c65d9fb08e0677495/src/libawkward/array/UnionArray.cpp#L523
- which checks to see if your nested lists are mergeable: https://github.com/scikit-hep/awkward-1.0/blob/3f36053f0a2d40c6920aa27c65d9fb08e0677495/src/libawkward/array/ListOffsetArray.cpp#L1036-L1118 or https://github.com/scikit-hep/awkward-1.0/blob/3f36053f0a2d40c6920aa27c65d9fb08e0677495/src/libawkward/array/ListArray.cpp#L977-L1058
- which eventually gets down into the check of mergeable records: https://github.com/scikit-hep/awkward-1.0/blob/3f36053f0a2d40c6920aa27c65d9fb08e0677495/src/libawkward/array/RecordArray.cpp#L1131-L1212
You should be able to call mergeable from Python by getting into the layout of your arrays and checking the mergeable property of each level to see if that fails at some point. The first step in this is to isolate where the issue is.
Date: 2021-06-10 12:54:10 From: Angus Hollands (@agoose77:matrix.org)
So, the innermost layouts are mergeable, but the LHS and RHS have a different number of layouts, and I wonder if that's the cause
Date: 2021-06-10 13:49:18 From: Angus Hollands (@agoose77:matrix.org)
Another query - I can't seem to figure out how to broadcast two arrays by right broadcasting. The code for ak.broadcast_arrays is quite involved so I thought I'd just ask up front.
I have two arrays var * 3 * var and 3 * 1, and I want to broadcast them such that the RHS is effectively given a leading 1 axis, e.g. 1 * 3 * 1. This would behave as-NumPy. My reasoning is that I'm acting on all events, not per-event. Passing left_broadcasting=False to the function doesn't prevent it from erroring - it seems as though there is no direct way to force a particular kind of broadcasting when it can infer from the types whether to left or right?
Date: 2021-06-10 13:49:31 From: Angus Hollands (@agoose77:matrix.org)
So, do I just need to construct the slice manually myself?
Date: 2021-06-10 14:51:02 From: Jim Pivarski (@jpivarski)
The "left" and "right" are about where new dimensions are inserted to make the number of dimensions in two arrays match. See this for details: https://awkward-array.readthedocs.io/en/latest/_auto/ak.broadcast_arrays.html
Date: 2021-06-10 15:47:52 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski that matches my understanding but it doesn't seem to consider my preference (I want to left pad the shape by 1 to broadcast over all events)
Date: 2021-06-10 16:33:35 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org That's an unusual case: we're usually padding on the right, aligning left for axis!=-1 reducers and sorting, and left-broadcasting when not constrained to do otherwise by NumPy's precedent. It comes from a situation in which the most important objects are at the beginning of the lists, not the ends (e.g. physics objects sorted in decreasing order of pT). If it's an arbitrary choice for you and you're putting the "most important" objects at the ends of lists, rather than the beginnings, then it might be easier to rethink your workflow to go along the grain.
Otherwise, you can left-pad by axis>0 concatenation, but you have to prepare the array full of the right number of Nones to concatenate.
Date: 2021-06-10 18:45:28 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I'm not sure that I quite follow. I understand that in many HEP physical analyses, the leading dimension (outermost) is usually event-like, and everything moves in from there. I have the same scenario. The data I am computing are ultimately Cartesian coordinates of charge quanta in a TPC. These elements are grouped by detector region, to account for different scale factors required to compute their locations. In my mind, that means var (n events) * var (n regions) * var (n elements) * element[...]. You mention location "in" lists, which is where I am a bit lost.
Date: 2021-06-10 20:06:02 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org I'm talking about objects in a dimension, such as jets in an event. The first jets have the highest pT and are the most likely to be real jets, and it trails off from there. That's why we want to left-align, because we'd be matching the first jet in one event with the first jet in the next event, second and second, etc. If we right-aligned, we'd be matching noise with noise, and the first jet of one event might be matched with the third or fourth jet of the next event, depending on how many noise jets there are.
That's all.
Date: 2021-06-10 20:12:44 From: Angus Hollands (@agoose77:matrix.org)
Right I see, this is within the dimension. So, is the left/right broadcasting a different phenomenon than the NumPy left-padding by 1 in the shape? Is there a code sample of this anywhere? I'm unable to visualise how this works :/
Date: 2021-06-10 20:25:29 From: Jim Pivarski (@jpivarski)
Left and right broadcasting is a number of dimensions issue. If the axes are regular (like NumPy), then a 2 * 3 * T1 array has to be broadcasted with a 3 * T2 array if the number of dimensions differ at all. That's right broadcasting.
>>> ak.Array(np.array([[1, 2, 3], [4, 5, 6]])), ak.Array(np.array([10, 20, 30]))
(<Array [[1, 2, 3], [4, 5, 6]] type='2 * 3 * int64'>,
<Array [10, 20, 30] type='3 * int64'>)
>>> ak.Array(np.array([[1, 2, 3], [4, 5, 6]])) + ak.Array(np.array([10, 20, 30]))
<Array [[11, 22, 33], [14, 25, 36]] type='2 * 3 * int64'>
If the axes are irregular (optimized for physics cases), then a 2 * var * T1 array has to be broadcasted with a 2 * T2 array if the number of dimensions differ at all. This is left broadcasting.
>>> ak.Array([[1, 2, 3], [4, 5, 6]]), ak.Array([10, 20])
(<Array [[1, 2, 3], [4, 5, 6]] type='2 * var * int64'>,
<Array [10, 20] type='2 * int64'>)
>>> ak.Array([[1, 2, 3], [4, 5, 6]]) + ak.Array([10, 20])
<Array [[11, 12, 13], [24, 25, 26]] type='2 * var * int64'>
Date: 2021-06-10 20:30:37 From: Jim Pivarski (@jpivarski)
The left and right alignment of items within a dimension is not related to broadcasting. It's for operations like reducers and sorting. When there's a variable number of items in each list, we combine them as though they were indented with left-justification, not right.
>>> ak.sum(ak.Array([[ 1, 2, 4, 8],
... [16, 32],
... [64, 128, 256]]), axis=-1)
<Array [15, 48, 448] type='3 * int64'>
>>> ak.sum(ak.Array([[ 1, 2, 4, 8],
... [16, 32],
... [64, 128, 256]]), axis=-2)
<Array [81, 162, 260, 8] type='4 * int64'>
Date: 2021-06-10 20:33:32 From: Angus Hollands (@agoose77:matrix.org)
Fab, this makes perfect sense
Date: 2021-06-10 20:49:46 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I'm sure I've just been at this too long - looking at unflatten, I'm wondering whether we need to handle the depth differently. We treat axis as though depth==axis+1, but the IndexedArray layouts add to depth but not to layout
Date: 2021-06-10 20:51:25 From: Angus Hollands (@agoose77:matrix.org)
Ah, ffs, you're already ahead of me - the depth doesn't increase inside these layouts
Date: 2021-06-10 20:55:13 From: Angus Hollands (@agoose77:matrix.org)
This is really well thought out; there's lots of places where it would seem obvious to do something "simpler" and fall foul later on
Date: 2021-06-10 20:59:23 From: Angus Hollands (@agoose77:matrix.org)
And the run_lengths implementation is really simple, but it never occurred to me to use the transition points like that in pure NumPy
Date: 2021-06-12 15:14:48 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I know it's a weekend so feel free to ignore this.
w.r.t ak.packed for UnionArray, is it better that I call shallow_simplify on the array first (to potentially merge any nested union arrays) and then project the various contents?
Date: 2021-06-12 15:59:32 From: Angus Hollands (@agoose77:matrix.org)
Hmm, this doesn't do what I expected: calling simplify on
one0 = ak.layout.NumpyArray(np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5], dtype=np.float64))
one1 = ak.layout.NumpyArray(np.array([4, 5], dtype=np.int64))
onetags = ak.layout.Index8(np.array([0, 0, 0, 0, 1, 1], dtype=np.int8))
oneindex = ak.layout.Index64(np.array([0, 1, 2, 5, 0, 1], dtype=np.int64))
layout = ak.layout.UnionArray8_64(onetags, oneindex, [one0, one1])
produces a single NumpyArray and collapses the types into float
Date: 2021-06-12 16:05:07 From: Angus Hollands (@agoose77:matrix.org)
I might move this to the PR actually
Date: 2021-06-15 16:40:40 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: what is nbytes expected to return for a NumpyArray layout?
Date: 2021-06-15 16:55:52 From: Jim Pivarski (@jpivarski)
The number of bytes in just the data buffer, not any Python objects, like NumPy.
>>> np.array([1, 2, 3, 4, 5]).nbytes
40
>>> ak.Array(np.array([1, 2, 3, 4, 5])).nbytes
40
Date: 2021-06-15 16:57:21 From: Angus Hollands (@agoose77:matrix.org)
If I do from_numpy, does Awkward not consider the memory because it's not owned?
Date: 2021-06-15 16:58:43 From: Jim Pivarski (@jpivarski)
It includes everything in the array (Awkward doesn't even know if it has exclusive ownership), but it doesn't count overlaps.
Date: 2021-06-15 16:59:33 From: Jim Pivarski (@jpivarski)
>>> x = ak.Array([1, 2, 3, 4, 5])
>>> ak.zip((x, x))
<Array [(1, 1), (2, 2), ... 3), (4, 4), (5, 5)] type='5 * (int64, int64)'>
>>> x.nbytes
40
>>> ak.zip((x, x)).nbytes
40
Date: 2021-06-15 16:59:37 From: Angus Hollands (@agoose77:matrix.org)
Hmm, okay. This sample returns a tiny size for some reason
Date: 2021-06-15 16:59:41 From: Angus Hollands (@agoose77:matrix.org)
>>> ak.from_numpy(np.random.random(size=(4, 100*1024*1024//8//4))).nbytes
32
Date: 2021-06-15 17:00:43 From: Jim Pivarski (@jpivarski)
>>> np.random.random(size=(4, 100*1024*1024//8//4)).nbytes
104857600
>>> ak.from_numpy(np.random.random(size=(4, 100*1024*1024//8//4))).nbytes
32
>>> ak.Array(np.random.random(size=(4, 100*1024*1024//8//4))).nbytes
32
There's something wrong with that.
Date: 2021-06-15 17:01:03 From: Angus Hollands (@agoose77:matrix.org)
OK, that's good, I'll look at filing a fix
Date: 2021-06-15 17:01:12 From: Jim Pivarski (@jpivarski)
Maybe it's not following RegularArrays?
Date: 2021-06-15 17:02:09 From: Angus Hollands (@agoose77:matrix.org)
Oh hmm
Date: 2021-06-15 17:02:24 From: Angus Hollands (@agoose77:matrix.org)
>>> ak.from_numpy(np.random.random(size=(4, 100 * 1024 * 1024 // 16 // 4))).layout
<NumpyArray format="d" shape="4 1638400" data="0x b0fb8ce3 5758a83f 8463bcf0 ec47e63f ... cafa7e66 44fad53f a20b36d2 6227d53f" at="0x7f278a9fe010"/>
Date: 2021-06-15 17:02:39 From: Jim Pivarski (@jpivarski)
Oh, no RegularArrays are fine, multidimensional NumpyArrays aren't:
>>> ak.from_numpy(np.random.random(size=(4, 100*1024*1024//8//4)), regulararray=False).nbytes
32
>>> ak.from_numpy(np.random.random(size=(4, 100*1024*1024//8//4)), regulararray=True).nbytes
104857600
Date: 2021-06-15 17:02:41 From: Angus Hollands (@agoose77:matrix.org)
It's doing shape[0]*itemsize
Date: 2021-06-15 17:02:48 From: Jim Pivarski (@jpivarski)
That's it.
Date: 2021-06-15 17:03:00 From: Angus Hollands (@agoose77:matrix.org)
Date: 2021-06-15 17:03:14 From: Angus Hollands (@agoose77:matrix.org)
I can modify that to take the product of the shape
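The arithmetic of the fix being discussed can be sketched with plain NumPy (this mirrors the calculation, not Awkward's actual implementation): itemsize times the product of the full shape recovers the true buffer size, whereas itemsize times `shape[0]` gives the tiny number seen above.

```python
import numpy as np

# The array from the example above: 4 x 3276800 float64 values.
a = np.empty((4, 100 * 1024 * 1024 // 8 // 4))

wrong = a.itemsize * a.shape[0]             # 8 * 4 = 32, the buggy answer
right = a.itemsize * int(np.prod(a.shape))  # 104857600, the full buffer

print(wrong, right)  # 32 104857600
```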
Date: 2021-06-15 17:04:00 From: Jim Pivarski (@jpivarski)
Or just use bytelength:
Date: 2021-06-15 17:04:57 From: Jim Pivarski (@jpivarski)
If the strides are such that it skips unreachable data, bytelength would count that. (I think that's what we want...)
Date: 2021-06-15 17:06:20 From: Angus Hollands (@agoose77:matrix.org)
Nice, that also work
Date: 2021-06-15 17:06:21 From: Angus Hollands (@agoose77:matrix.org)
works*
Date: 2021-06-15 17:06:53 From: Angus Hollands (@agoose77:matrix.org)
Except, won't we want the full contiguous memory size because it's used to guide the cache eviction?
Date: 2021-06-15 17:07:08 From: Angus Hollands (@agoose77:matrix.org)
I don't really know what's going on with the cache stuff, I just read somewhere that it depends upon the size of the arrays in bytes
Date: 2021-06-15 17:08:05 From: Angus Hollands (@agoose77:matrix.org)
We target C++11 right?
Date: 2021-06-15 17:08:29 From: Jim Pivarski (@jpivarski)
I think we'd want to use the full contiguous memory size because of its use in deciding cache eviction. (What you read is right: it decides when a cache is full based on how many megabytes it contains, which it gets from numbytes.)
However, this isn't what NumPy does:
>>> np.arange(5*7).reshape(5, 7)
array([[ 0, 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12, 13],
[14, 15, 16, 17, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27],
[28, 29, 30, 31, 32, 33, 34]])
>>> np.arange(5*7).reshape(5, 7)[:, -3:]
array([[ 4, 5, 6],
[11, 12, 13],
[18, 19, 20],
[25, 26, 27],
[32, 33, 34]])
>>> np.arange(5*7).reshape(5, 7).nbytes
280
>>> np.arange(5*7).reshape(5, 7)[:, -3:].nbytes
120
Date: 2021-06-15 17:08:43 From: Angus Hollands (@agoose77:matrix.org)
hahaha oh fab
Date: 2021-06-15 17:08:45 From: Jim Pivarski (@jpivarski)
Yes. No C++14.
Date: 2021-06-15 17:09:28 From: Angus Hollands (@agoose77:matrix.org)
Oh, I read cxx_std_11 in the CMakeLists and assumed
Date: 2021-06-15 17:09:34 From: Jim Pivarski (@jpivarski)
It's evidently multiplying the itemsize by the product of the shape. So I guess that's what we have to do. That makes numbytes less useful as a way to measure memory usage.
Date: 2021-06-15 17:09:55 From: Jim Pivarski (@jpivarski)
I meant, "yes, C++11 only; C++14 is not allowed."
Date: 2021-06-15 17:22:09 From: Angus Hollands (@agoose77:matrix.org)
Ah fab
Date: 2021-06-15 17:23:03 From: Angus Hollands (@agoose77:matrix.org)
(I was checking about range support)
Date: 2021-06-15 17:23:26 From: Angus Hollands (@agoose77:matrix.org)
I just re-read your message, it made sense the first time round
Date: 2021-06-15 17:24:29 From: Angus Hollands (@agoose77:matrix.org)
I think it's good to have two measures of memory usage - actual and theoretical. The latter is kinda what you'd get from calling packed, but not exactly.
Date: 2021-06-15 19:44:35 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I was writing a PR to add a context manager for deprecations, but then I realised we're basically implementing features that already exist in the warnings module. Is the deprecations_as_warnings public api something we can break?
Date: 2021-06-15 20:20:46 From: Jim Pivarski (@jpivarski)
deprecations_as_warnings is not currently a public API. It was a public API back in February when there were active deprecations, and the fill_none change will activate it again, but while there are no active deprecations in any release, you're free to change it.
I don't know the warnings standard library module very well. I might have reproduced some features without knowing it. However, keep in mind that whatever we add to Awkward Array has to work in Python 2.7—be sure that the warnings features you're looking at aren't new additions in Python 3.
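A minimal sketch of what the standard warnings machinery already provides for this use case (the `AwkwardDeprecationWarning` class and `deprecations_as_errors` name here are hypothetical, not Awkward's API; everything used is also available in Python 2.7):

```python
import warnings
from contextlib import contextmanager

class AwkwardDeprecationWarning(DeprecationWarning):
    """Hypothetical warning class for a library's deprecations."""

@contextmanager
def deprecations_as_errors():
    # Temporarily escalate matching deprecation warnings to exceptions,
    # restoring the previous warning filters on exit.
    with warnings.catch_warnings():
        warnings.simplefilter("error", AwkwardDeprecationWarning)
        yield

# Inside the context, a deprecated call raises instead of warning:
try:
    with deprecations_as_errors():
        warnings.warn("old API", AwkwardDeprecationWarning)
except AwkwardDeprecationWarning:
    print("raised as an error")
```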
Date: 2021-06-15 20:28:25 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: will do - I'll make sure to check against py2!
Date: 2021-06-16 21:29:08 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I've been running into some problems when using Awkward with Dask concerning the incompatibility of forms (when using partitioned) / schemas (when using Parquet).
Date: 2021-06-16 21:29:28 From: Angus Hollands (@agoose77:matrix.org)
From your knowledge, does the parquet schema depend upon the order of the fields in a record array?
Date: 2021-06-16 21:30:25 From: Angus Hollands (@agoose77:matrix.org)
I will try and answer this myself, but I'm on a bit of a deadline so expert knowledge is appreciated :P
Date: 2021-06-16 21:35:54 From: Angus Hollands (@agoose77:matrix.org)
OK, so it looks like it does. I think this might be coming from ak.zip, but I'll do some more digging.
Date: 2021-06-17 12:39:44 From: Jim Pivarski (@jpivarski)
The JSON representation of RecordForms unfortunately uses JSON objects ({...}) for the fields (a bad choice relative to two lists, because some systems can change the order). Python 3.6+ maintains the order, so the problem is partially fixed.
But I don't think this problem affects Parquet reading: we get fields from Parquet by string name, not order position. The Forms order issue (in Python 2.7 or 3.5) would just mean that the correct RecordArray comes out with field names AND values in the same wrong order (i.e. the right name still maps to the right value).
What's the symptom?
Date: 2021-06-17 13:59:14 From: Angus Hollands (@agoose77:matrix.org)
When deserialising arrays in a dataset, the partitioned array failed because the partitions didn't have the same schemas
Date: 2021-06-17 14:00:15 From: Angus Hollands (@agoose77:matrix.org)
That doesn't sound quite right so it might have been related to it. I'm using dask and it's been a busy 24 hrs 😂
Date: 2021-06-17 14:00:45 From: Angus Hollands (@agoose77:matrix.org)
I'm now using hdf5 as the parquet stuff is a little unstable rn
Date: 2021-06-17 14:01:00 From: Angus Hollands (@agoose77:matrix.org)
It's on my to do list, but I'm not quite ready at the moment to tackle it
Date: 2021-06-17 14:10:35 From: Jim Pivarski (@jpivarski)
Partitions have to have the same schemas. If that didn't raise an error message, it should have.
Date: 2021-06-17 14:28:05 From: Angus Hollands (@agoose77:matrix.org)
Yeah that sounds about right
Date: 2021-06-17 18:19:23 From: Angus Hollands (@agoose77:matrix.org)
Haha Jim you really have a PDF for everything https://github.com/jpivarski-talks/2021-05-21-dasksummit-awkward-collection
Date: 2021-06-17 21:54:38 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I'm running into a problem concerning partitioning and could use some input
Date: 2021-06-17 21:55:33 From: Angus Hollands (@agoose77:matrix.org)
I've implemented a "dataset" on top of to_buffers (& h5py), and load it using a partitioned virtual array.
Date: 2021-06-17 22:03:41 From: Angus Hollands (@agoose77:matrix.org)
I think the schemas are varying between RegularArrays and ListOffsetArrays. Is there any application-agnostic way to resolve this?
Date: 2021-06-17 22:07:09 From: Angus Hollands (@agoose77:matrix.org)
The main issue is that I don't have all of the partitions that are written to disk in memory at the same time/process. I'm currently thinking my options are either:
Date: 2021-06-17 22:07:51 From: Angus Hollands (@agoose77:matrix.org)
- run another pass over the data and cast to the first form
- manually specify the form in the writer
- convert at read-time to a common form
Date: 2021-06-17 22:08:47 From: Angus Hollands (@agoose77:matrix.org)
I also am guessing that ak.packed is what has introduced these bugs, because it can change the layout according to the data
Date: 2021-06-17 22:12:43 From: Angus Hollands (@agoose77:matrix.org)
The data are always regular in the last axis, but something (presumably packed) is sometimes losing that information
Date: 2021-06-17 22:59:08 From: Jim Pivarski (@jpivarski)
Regular and irregular lists are different types, so if ak.packed is not making a distinction between them, then that's a bug. An input method, like ak.from_iter or ak.from_numpy, always makes data of one type or the other, so it shouldn't be the case that a partition of data that "happens to be regular" would make the partitions differ on this point: the input method would always create regular or irregular typed data regardless of the content of the input.
Date: 2021-06-18 09:57:19 From: Angus Hollands (@agoose77:matrix.org)
Right, that makes sense - I was spitballing after a long day trying to move my analysis to awkward + dask. I've filed an issue now that I have worked out the cause!
Date: 2021-06-18 12:26:01 From: Lukas (@lukasheinrich)
hi
Date: 2021-06-18 12:26:11 From: Lukas (@lukasheinrich)
is there an easy way to rename fields of an existing array
Date: 2021-06-18 12:28:44 From: Angus Hollands (@agoose77:matrix.org)
@lukasheinrich: I think the easiest way is just to use ak.with_name and create a new RecordArray. Awkward arrays are supposed to be immutable (which they mostly are, apart from a few areas), so anything you do needs ultimately to create a new array.
Date: 2021-06-18 12:39:07 From: Lukas (@lukasheinrich)
do you have an example?
Date: 2021-06-18 12:39:14 From: Lukas (@lukasheinrich)
how to use with_name?
Date: 2021-06-18 12:39:44 From: Angus Hollands (@agoose77:matrix.org)
Sure
Date: 2021-06-18 12:40:50 From: Angus Hollands (@agoose77:matrix.org)
ak.with_name just adds / replaces the __record__ (name) of the first RecordArray it encounters. It will visit the entire layout of the array, starting from the root (outermost dimension)
Date: 2021-06-18 12:41:07 From: Angus Hollands (@agoose77:matrix.org)
with_new_name = ak.with_name(array, "MyArray")
Date: 2021-06-18 12:42:37 From: Angus Hollands (@agoose77:matrix.org)
If your RecordArray is nested inside of some other records, then you need to build a new array:
to_rename = array.some.nested.recordarray
with_new_name = ak.with_name(to_rename, "MyArray")
new_array = ak.with_field(array, with_new_name, ("some", "nested", "recordarray"))
Date: 2021-06-18 12:43:18 From: Angus Hollands (@agoose77:matrix.org)
The latter code just pulls out the record array that you want to rename, gives it the new name, and then rebuilds the original array such that the renamed recordarray is in the right location
Date: 2021-06-18 15:19:29 From: Lukas (@lukasheinrich)
hm - does this just change the name of the record array type?
Date: 2021-06-18 15:19:44 From: Lukas (@lukasheinrich)
does this actually change the field names of the RecordArray?
Date: 2021-06-18 15:19:57 From: Lukas (@lukasheinrich)
i.e imagine I have this
Date: 2021-06-18 15:27:01 From: Lukas (@lukasheinrich)
a = ak.zip({'some_clunky_name': ak.Array([[1,2,3],[],[4,5]]), 'ok_name': ak.Array([[1,2,3],[],[4,5]])})
<ListOffsetArray64>
<offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x7fe04712a600"/></offsets>
<content><RecordArray length="5">
<field index="0" key="some_clunky_name">
<NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x7fe045af1800"/>
</field>
<field index="1" key="ok_name">
<NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x7fe04710fe00"/>
</field>
</RecordArray></content>
</ListOffsetArray64>
Date: 2021-06-18 15:27:18 From: Lukas (@lukasheinrich)
and I want to change some_clunky_name -> to something better
Date: 2021-06-18 15:27:36 From: Lukas (@lukasheinrich)
(but in reality it will be deeply nested.. is there an easy way to change the field?)
Date: 2021-06-18 15:31:02 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org and @lukasheinrich ak.with_name changes the name of the record (e.g. rename "Electron" as "Muon"). It does not change the names of the fields.
Date: 2021-06-18 15:32:12 From: Jim Pivarski (@jpivarski)
To change the field names, re-zipping it is probably the best option.
Date: 2021-06-18 15:32:34 From: Jim Pivarski (@jpivarski)
ak.unzip and ak.fields are your friends, here.
Date: 2021-06-18 15:34:07 From: Jim Pivarski (@jpivarski)
>>> a = ak.zip({'some_clunky_name': ak.Array([[1,2,3],[],[4,5]])})
>>> ak.fields(a)
['some_clunky_name']
>>> ak.unzip(a)
(<Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>,)
>>> b = ak.zip(dict(zip(["better_name"], ak.unzip(a))))
>>> b
<Array [[{better_name: 1, ... better_name: 5}]] type='3 * var * {"better_name": ...'>
>>> b.layout
<ListOffsetArray64>
<offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x56208dee9320"/></offsets>
<content><RecordArray length="5">
<field index="0" key="better_name">
<NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x56208df4a170"/>
</field>
</RecordArray></content>
</ListOffsetArray64>
Date: 2021-06-18 16:35:20 From: Angus Hollands (@agoose77:matrix.org)
@lukasheinrich @jpivarski sorry, another example of not reading the question. Running a bit low on sleep the last week!
Date: 2021-06-19 11:53:04 From: Andrew Naylor (@asnaylor)
I'm trying to run a numba function over awkward-arrays on dask. The function works fine outside of dask but dask complains of a Typing Error when trying to run the function with dask.delayed:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
During: typing of argument at <ipython-input-6-0f4baa26940d> (75)
File "<ipython-input-6-0f4baa26940d>", line 75:
<source missing, REPL/exec in use?>
This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of <class 'awkward.highlevel.Array'>
- argument 1: cannot determine Numba type of <class 'awkward.highlevel.Array'>
I'd read online that sometimes with Numba you need to explicitly define the signature for the input and outputs. How do you do that for awkward arrays?
Date: 2021-06-19 15:05:29 From: Jim Pivarski (@jpivarski)
Not in this case (nb.vectorize is the only one that sometimes needs it, depending on what you're doing). In this case, it's because the remote workers don't have the Awkward-Numba definitions. That's an installation/configuration thing, but you can try to force it with this line:
Date: 2021-06-19 15:05:32 From: Jim Pivarski (@jpivarski)
Date: 2021-06-19 15:05:48 From: Jim Pivarski (@jpivarski)
ak.numba.register()
Date: 2021-06-19 15:06:54 From: Jim Pivarski (@jpivarski)
If it's installed on the remote Dask workers, but the "entry point" isn't set correctly for some reason, this will fix it. If it's just not installed on the remote Dask workers, then this will give a more useful error message.
Date: 2021-06-19 15:34:22 From: Andrew Naylor (@asnaylor)
Ah, thank you for such a quick response @jpivarski that line of magic fixed it, the function works on dask now
Date: 2021-06-22 17:11:14 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I am going AFK now, so if you decide that you are happy to merge https://github.com/scikit-hep/awkward-1.0/pull/935 today then feel free! If you want to make any changes, also feel free! I used the test-suite for most commits locally during development, and I made changes very slowly, so I'm relatively hopeful we haven't introduced (many) bugs! Given that it passes the test suite, I feel as though we can merge and then anything new can be fixed relatively quickly.
Date: 2021-06-22 17:22:40 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org It looks good and I've enabled auto-merge. Congrats on finishing it; it looked like a big project!
Date: 2021-06-22 21:43:00 From: Angus Hollands (@agoose77:matrix.org)
Thanks Jim, appreciate the reviews
Date: 2021-06-22 21:43:21 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: how do you feel about possibly making fsspec a dependency for Awkward?
Date: 2021-06-22 21:46:38 From: Angus Hollands (@agoose77:matrix.org)
I think API wise, it would be nice if we don't require the user to pass in a FS and can make use of the fsspec URIs
Date: 2021-06-22 21:48:47 From: Angus Hollands (@agoose77:matrix.org)
i.e. s3://...
Date: 2021-06-22 21:49:07 From: Angus Hollands (@agoose77:matrix.org)
However, I am struggling to think of a way to support this and maintain backwards compatibility.
Date: 2021-06-22 21:54:57 From: Angus Hollands (@agoose77:matrix.org)
In particular, I'm not sure what path-like operations we can implement, given things like https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining
Date: 2021-06-22 22:21:10 From: Jim Pivarski (@jpivarski)
No, it shouldn't be a dependency.
For one thing, Awkward is a foundational library that should have as few dependencies as possible (because dependencies end up being a problem for someone, somewhere, no matter how good package managers are).
For another, the library isn't primarily focused on I/O, for which a dependency on fsspec (another foundational library) would make sense. ak.from_parquet is an I/O function, but nearly all of the actual I/O part of it is done by pyarrow: we just rearrange the Arrow Arrays into Awkward Arrays. And even then, Arrow isn't a dependency, either.
Date: 2021-06-23 08:05:58 From: Angus Hollands (@agoose77:matrix.org)
I was thinking of more of a soft dependency, i.e. required for from_parquet, but having taken another look, I see that the URL chaining feature (as one example) is a non-fs feature, so it's not really applicable to pyarrow.
Date: 2021-06-23 16:26:00 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski do you have any spare cycles to look at an arrow / awkward bug with me? I'm trying to fix it, but I don't know enough about arrow to know how to go about solving it.
Date: 2021-06-23 16:44:23 From: Jim Pivarski (@jpivarski)
I have 15 minutes before a meeting.
Date: 2021-06-23 16:44:27 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org
Date: 2021-06-23 16:44:38 From: Angus Hollands (@agoose77:matrix.org)
Thanks
Date: 2021-06-23 16:45:19 From: Angus Hollands (@agoose77:matrix.org)
It's this issue @jpivarski https://github.com/scikit-hep/awkward-1.0/issues/932
Date: 2021-06-23 16:45:48 From: Angus Hollands (@agoose77:matrix.org)
The bug is actually in reading the generated file, e.g. with
g = pq.ParquetFile(f)
print(
    g.read_row_group(0, [".list.item.addr.cobo"])
)
for the given example
Date: 2021-06-23 16:46:24 From: Angus Hollands (@agoose77:matrix.org)
So, it would seem that it's an Arrow bug, but I'm not sure whether Arrow supports this kind of schema, and/or if we're just not writing it correctly
Date: 2021-06-23 16:47:38 From: Jim Pivarski (@jpivarski)
If it works without lazy and fails with lazy, it's our problem.
Date: 2021-06-23 16:48:15 From: Angus Hollands (@agoose77:matrix.org)
Yes, that's what confuses me - Arrow can read the file in one go, just not by specifying the column
Date: 2021-06-23 16:48:17 From: Jim Pivarski (@jpivarski)
I'm going to start by reproducing it in a real file, not BytesIO, since I don't think we've ever tested that before.
Date: 2021-06-23 16:48:33 From: Angus Hollands (@agoose77:matrix.org)
For the record I discovered it on a real file
Date: 2021-06-23 16:48:46 From: Angus Hollands (@agoose77:matrix.org)
It makes me think that I don't understand Arrow well enough, and that perhaps the final column isn't a column at all
Date: 2021-06-23 16:50:09 From: Jim Pivarski (@jpivarski)
Yes, this is a problem with laziness. Even though it's an ArrowInvalid error, we're somehow asking for the wrong thing in the ak.from_parquet implementation.
Date: 2021-06-23 16:51:22 From: Angus Hollands (@agoose77:matrix.org)
Hmm, OK, then I probably don't understand Arrow / our serialisation well enough ;)
Date: 2021-06-23 16:53:53 From: Jim Pivarski (@jpivarski)
One very complicated thing about lazy-reading Parquet is that Parquet doesn't have the equivalent of Awkward/Arrow's tree structure: instead of, say, a ListArray node with offsets/starts/stops in one array and generic content below it, Parquet has only the leaves of this tree with definition and repetition levels to define the list structures above it. When we eagerly read from Parquet, we just ask Arrow to take care of the full conversion and then do the easy job of converting from Arrow to Awkward. When we lazily read from Parquet, we sometimes need to get a ListArray/ListOffsetArray node by itself, without reading every field of a RecordArray nested within it. To make the ListArray/ListOffsetArray, we need offsets/starts/stops, so we need to ask Parquet to read just one field—the information is not available in any other way. So that's why the code path is different for lazy versus eager.
Date: 2021-06-23 16:54:49 From: Angus Hollands (@agoose77:matrix.org)
Sure
Date: 2021-06-23 16:55:12 From: Angus Hollands (@agoose77:matrix.org)
I did wonder if we should make the eager implementation use the same code path, but I imagine it will be slow
Date: 2021-06-23 16:55:30 From: Jim Pivarski (@jpivarski)
I'm trying to find a more minimal reproducer of your example, if possible.
Date: 2021-06-23 16:55:58 From: Angus Hollands (@agoose77:matrix.org)
I did try to myself, but removing anything makes it work 🤣
Date: 2021-06-23 16:56:06 From: Jim Pivarski (@jpivarski)
For the reason I described above, the lazy code path can't be the same as the eager code path.
Date: 2021-06-23 16:56:36 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: no, I mean removing the eager code path entirely, and having it effectively materialize a lazy array
Date: 2021-06-23 16:59:38 From: Jim Pivarski (@jpivarski)
That can be made to avoid any double-reading (by passing a no-eviction "cache" to the lazy code path), but it would probably involve a lot more requests to the filesystem—one per column, rather than one for all the columns you know you need in one batch—and thus be slower. It would certainly make the eager case unnecessarily complicated, though I agree that outside users wouldn't see that complication, and the complication has to exist for the lazy case, anyway.
Date: 2021-06-23 17:00:25 From: Angus Hollands (@agoose77:matrix.org)
Sometimes programming is hard, and I suppose this is just one of those cases
Date: 2021-06-23 17:00:45 From: Angus Hollands (@agoose77:matrix.org)
I think we can add some more cases to the test suite to catch these bugs though. I'll do that when I'm next free.
Date: 2021-06-23 17:01:19 From: Jim Pivarski (@jpivarski)
Yes, your example is minimal, and it doesn't depend on the alphabetical order of the field names (which is used to pick the column to read for the sake of making ListArray/ListOffsetArray and was key to figuring out a previous bug).
Date: 2021-06-23 17:03:33 From: Jim Pivarski (@jpivarski)
According to the stack trace, this last leaves Awkward on line 3497 of src/awkward/operations/convert.py (main branch). If you're lucky, the lazy and eager code paths might both go through this, in which case you'd be able to put a print-out there to say what the column name is and (in the case of the eager one, which is successful), what the Arrow schema of the read data is. To get it to go through the same code path, you might have to explicitly supply a columns argument.
Date: 2021-06-23 17:06:54 From: Angus Hollands (@agoose77:matrix.org)
Thanks, it appears that it's the actual read_row_group call on the Parquet file itself that fails
Date: 2021-06-23 17:07:36 From: Angus Hollands (@agoose77:matrix.org)
: large_list<item: struct<addr: struct<aget: int64 not null, cobo: int64 not null> not null> not null> not null
child 0, item: struct<addr: struct<aget: int64 not null, cobo: int64 not null> not null> not null
child 0, addr: struct<aget: int64 not null, cobo: int64 not null> not null
child 0, aget: int64 not null
child 1, cobo: int64 not null
the path it takes is ".list.item.addr.cobo", which fails, whereas ".list.item.addr" succeeds
Date: 2021-06-23 17:08:36 From: Angus Hollands (@agoose77:matrix.org)
The error is odd - it seems to think that the column is a struct, and then compares it to the addr struct
Date: 2021-06-23 17:08:51 From: Angus Hollands (@agoose77:matrix.org)
so I wonder if I misunderstand what it's supposed to do
Struct child array #0 does not match type field: struct<cobo: int64 not null> vs struct<aget: int64 not null, cobo: int64 not null>
Date: 2021-06-23 17:09:04 From: Jim Pivarski (@jpivarski)
Can you list all the columns in the file? Maybe use the Parquet schema or the Arrow schema (from the pyarrow ParquetFile object).
Date: 2021-06-23 17:10:15 From: Jim Pivarski (@jpivarski)
We're probably generating column names incorrectly. Actually, I don't see how ".list.item.addr" can be a column in the Parquet file because it's not a leaf node.
Date: 2021-06-23 17:12:46 From: Angus Hollands (@agoose77:matrix.org)
Yeah, I think that might be it - it doesn't fail when you only have one field in awkward terms, which would follow if our name is pulling out the struct rather than the leaf, because it would be equal to itself
Date: 2021-06-23 17:13:50 From: Angus Hollands (@agoose77:matrix.org)
I'm struggling to find information on how to assemble this kind of table by hand
Date: 2021-06-23 17:19:30 From: Jim Pivarski (@jpivarski)
In your new function, https://github.com/scikit-hep/awkward-1.0/blob/d9b8082314911041acc7b1c1b63313d66c2d3315/src/awkward/operations/convert.py#L3465-L3487, there's an example of getting
file = pyarrow.parquet.ParquetFile(filename)
print(file.schema)
print(file.schema_arrow)
to see what the columns are to find out what's wrong with our derivation of the column name.
Date: 2021-06-23 17:21:58 From: Angus Hollands (@agoose77:matrix.org)
Yeah I'm looking at it now although not making much headway :P
Date: 2021-06-23 17:22:47 From: Jim Pivarski (@jpivarski)
What columns are in the Parquet schema? (I'm in my meeting, but splitting my attention, as usual.)
Date: 2021-06-23 17:24:10 From: Angus Hollands (@agoose77:matrix.org)
<pyarrow._parquet.ParquetSchema object at 0x7f030bb87980>
required group field_id=0 schema {
required group field_id=1 (List) {
repeated group field_id=2 list {
required group field_id=3 item {
required group field_id=4 addr {
required int64 field_id=5 aget;
}
}
}
}
}
Date: 2021-06-23 17:24:35 From: Jim Pivarski (@jpivarski)
What does the Arrow schema look like?
Date: 2021-06-23 17:24:37 From: Angus Hollands (@agoose77:matrix.org)
and f.schema.names has ["aget"]
Date: 2021-06-23 17:24:52 From: Jim Pivarski (@jpivarski)
(I forgot that the ParquetSchema doesn't print out full column names.)
Date: 2021-06-23 17:25:25 From: Jim Pivarski (@jpivarski)
Or does the pyarrow.parquet.ParquetFile have some way to dump full column names?
Date: 2021-06-23 17:25:33 From: Angus Hollands (@agoose77:matrix.org)
: large_list<item: struct<addr: struct<aget: int64 not null> not null> not null> not null
child 0, item: struct<addr: struct<aget: int64 not null> not null> not null
child 0, addr: struct<aget: int64 not null> not null
child 0, aget: int64 not null
-- field metadata --
PARQUET:field_id: '5'
-- field metadata --
PARQUET:field_id: '4'
-- field metadata --
PARQUET:field_id: '3'
-- field metadata --
PARQUET:field_id: '1'
Date: 2021-06-23 17:25:45 From: Angus Hollands (@agoose77:matrix.org)
I don't think so. I find the whole thing a bit of a black box, to be honest
Date: 2021-06-23 17:26:18 From: Angus Hollands (@agoose77:matrix.org)
The empty : is just because we use "" for a column name. I've modified that in the source to use "root" instead and the bug doesn't disappear
Date: 2021-06-23 17:26:38 From: Jim Pivarski (@jpivarski)
Part of the information is there. The "list", "item", "addr", and "aget" names are built into a column name by putting dots between them. I was just hoping that it would write them out explicitly.
Date: 2021-06-23 17:27:05 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: yes that's my view - it looks like we're doing things right
Date: 2021-06-23 17:31:55 From: Jim Pivarski (@jpivarski)
f = pyarrow.parquet.ParquetFile("some.parquet")
f.read_row_group(0, [".list.item.addr.aget"])
reproduces the issue outside of Awkward's codebase. If you put a non-existent column name there, nothing happens, so the error message means that the column name is right.
Date: 2021-06-23 17:32:40 From: Angus Hollands (@agoose77:matrix.org)
Yes, this is what I'm finding too. I will perhaps open an issue on Arrow
Date: 2021-06-23 17:33:49 From: Jim Pivarski (@jpivarski)
Be sure that you can generate the file without Awkward, so that it's clear that it's an Arrow thing.
Date: 2021-06-23 17:35:02 From: Jim Pivarski (@jpivarski)
You can use ak.to_arrow_table to see the Arrow table and find a way to make it using only pyarrow. If the same bug occurs, then it would definitely be an Arrow bug.
Date: 2021-06-23 17:35:07 From: Angus Hollands (@agoose77:matrix.org)
Yes, I can load it from a pydict
Date: 2021-06-23 17:36:47 From: Jim Pivarski (@jpivarski)
It also works if the column name for the pyarrow.Table is not ""; I just tried it with "outer", and still get the error message.
Date: 2021-06-23 17:43:28 From: Angus Hollands (@agoose77:matrix.org)
Yes, I find that too :/
Date: 2021-06-23 17:43:59 From: Jim Pivarski (@jpivarski)
Did you manage to make it from a pydict and reproduce it entirely without Awkward?
Date: 2021-06-23 17:45:53 From: Angus Hollands (@agoose77:matrix.org)
Yes
Date: 2021-06-23 17:48:36 From: Jim Pivarski (@jpivarski)
Okay, then it's not our problem anymore. When it's fixed in Arrow, we can increase our minimum version of pyarrow.
Do you know where Arrow's JIRA is and how to submit a ticket? It's very different from GitHub.
Date: 2021-06-23 17:48:45 From: Angus Hollands (@agoose77:matrix.org)
Haha yes, I suffered JIRA before :P
Date: 2021-06-23 17:48:48 From: Angus Hollands (@agoose77:matrix.org)
I'm filing an issue now
Date: 2021-06-23 17:48:56 From: Jim Pivarski (@jpivarski)
:)
Date: 2021-06-25 11:59:14 From: Angus Hollands (@agoose77:matrix.org)
It appears that this is a bug (from JIRA)
Date: 2021-06-25 12:00:54 From: Angus Hollands (@agoose77:matrix.org)
I think I now understand why it doesn't work - if Arrow stores data in columns, then it doesn't make a lot of sense storage-wise to try and read only one field. I think we can modify Awkward to do the right thing and load the column + read the field, rather than waiting on Arrow to update (and introducing a high lower-bound on the arrow version)
Date: 2021-06-25 15:02:37 From: Jim Pivarski (@jpivarski)
I just commented on the issue to emphasize the importance of being able to read individual struct columns. The whole Dremel/Parquet paradigm was designed to have this columnar granularity at all levels; it would be a shame to drop that because of this metadata-manipulation issue in Arrow. (If they don't deal with it, we might need to go to fastparquet to work around it!)
I've never understood why Arrow makes a distinction between Tables, which are structs at top-level, and all other nested structs. It probably comes from a Pandas point of view, in which columns are distinct from whatever kinds of Python objects are nested within. But it's an artificial distinction: RecordArrays can be totally self-similar, nested within each other (as the Dremel/Parquet model shows, which predated Arrow).
When (I strongly hope it's a "when," not an "if") they fix this, we'll definitely take that version of Arrow as a minimum version. This is really important.
Date: 2021-06-25 15:31:52 From: Angus Hollands (@agoose77:matrix.org)
I can't put my finger on it, but I don't quite get pyarrow. It doesn't feel very intuitive, but that might just be the docs / bindings
Date: 2021-06-25 15:35:05 From: Angus Hollands (@agoose77:matrix.org)
Just about to drive, but will give your comment a read when home
Date: 2021-06-25 16:49:57 From: Jim Pivarski (@jpivarski)
I don't think it's intended for high-level users; I think it's middleware. Most of what's written about it describes it as supporting interoperability between things like Pandas, R, and Spark, and about SQL-like applications built on top of it (e.g. Gandiva). Thus, the DataFrames and SQL are the user-interfaces—I think pyarrow is about gluing in more user-interfaces through Python, which is exactly what Awkward Array is doing.
I hope I'm not mischaracterizing Arrow, but this is my impression from what I've read.
Date: 2021-06-28 16:06:23 From: Andrew Naylor (@asnaylor)
saw this with dask + uproot/awkward today:
File "flange.py", line 195, in fill_histograms
arrays = uproot_tree.arrays(variables)
File "/global/homes/a/asnaylor/.conda/envs/uproot-py38/lib/python3.8/site-packages/uproot/behaviors/TBranch.py", line 1119, in arrays
_ranges_or_baskets_to_arrays(
File "/global/homes/a/asnaylor/.conda/envs/uproot-py38/lib/python3.8/site-packages/uproot/behaviors/TBranch.py", line 3478, in _ranges_or_baskets_to_arrays
uproot.source.futures.delayed_raise(*obj)
File "/global/homes/a/asnaylor/.conda/envs/uproot-py38/lib/python3.8/site-packages/uproot/source/futures.py", line 46, in delayed_raise
raise exception_value.with_traceback(traceback)
File "/global/homes/a/asnaylor/.conda/envs/uproot-py38/lib/python3.8/site-packages/uproot/behaviors/TBranch.py", line 3422, in basket_to_array
basket_arrays[basket.basket_num] = interpretation.basket_array(
File "/global/homes/a/asnaylor/.conda/envs/uproot-py38/lib/python3.8/site-packages/uproot/interpretation/objects.py", line 143, in basket_array
if awkward_can_optimize(self, form):
File "/global/homes/a/asnaylor/.conda/envs/uproot-py38/lib/python3.8/site-packages/uproot/interpretation/objects.py", line 43, in awkward_can_optimize
return awkward._connect._uproot.can_optimize(interpretation, form)
AttributeError: module 'awkward._connect' has no attribute '_uproot'
I've not had an issue before using uproot/awkward and dask. I tested the code with a few files and it worked; moving to testing it with 1000+ files produced this error message. It suggests an issue with loading the branches; I think it could be to do with loading two variables (vector<vector<int>> & vector<vector<float>>) which take a long time to load in
Date: 2021-06-28 16:09:45 From: Lukas (@lukasheinrich)
I have a question on the commutativity of field-access and index-access
Date: 2021-06-28 16:09:47 From: Jim Pivarski (@jpivarski)
@asnaylor That is experimental code that will be replaced. The reason it only applies to two types is because those were the experimental cases (hard-coded in C++ that will be replaced with dynamically generated AwkwardForth).
However, its status as experimental code is a separate thing from not being able to find the awkward._connect._uproot module. Could it be a very old version of Awkward Array?
Date: 2021-06-28 16:10:01 From: Lukas (@lukasheinrich)
(oh sorry I didn't want to interject)
Date: 2021-06-28 16:11:30 From: Andrew Naylor (@asnaylor)
awkward 1.2.2
Date: 2021-06-28 16:11:42 From: Andrew Naylor (@asnaylor)
uproot 4.0.7
Date: 2021-06-28 16:11:54 From: Jim Pivarski (@jpivarski)
@lukasheinrich Field access and index access are commutative up to the depth of the record. I can make an example that illustrates that exception to full commutativity, unless your question isn't about that.
Date: 2021-06-28 16:12:17 From: Lukas (@lukasheinrich)
this might be related to some non-functionality I observe in trying to do columnar access using Arrow data (from cloud resources)
Date: 2021-06-28 16:13:10 From: Lukas (@lukasheinrich)
what I see is
In [14]: a.TrigMatchedObjects[0].HLT_mu26_ivarmedium[0]
Out[14]: <Array [... m_persIndex: 0}] type='1 * {"m_persKey": int32, "m_persIndex": int32}'>
Date: 2021-06-28 16:13:25 From: Lukas (@lukasheinrich)
while
Date: 2021-06-28 16:13:28 From: Lukas (@lukasheinrich)
In [16]: a[0].TrigMatchedObjects[0].HLT_mu26_ivarmedium
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-9d4ee0403026> in <module>
----> 1 a[0].TrigMatchedObjects[0].HLT_mu26_ivarmedium
~/Code/atlas100TB/parquet/venv/lib/python3.7/site-packages/awkward/highlevel.py in __getitem__(self, where)
1753 2
1754 """
-> 1755 return ak._util.wrap(self.layout[where], self._behavior)
1756
1757 def __setitem__(self, where, what):
ValueError: scalar Record can only be sliced by field name (string); try "0"
Date: 2021-06-28 16:13:35 From: Andrew Naylor (@asnaylor)
@jpivarski i don't think it's a very old version
Date: 2021-06-28 16:13:38 From: Lukas (@lukasheinrich)
so I guess I hit such a case
Date: 2021-06-28 16:13:52 From: Jim Pivarski (@jpivarski)
This module was added 10 months ago (just checked).
Date: 2021-06-28 16:14:28 From: Jim Pivarski (@jpivarski)
Awkward 1.2.2 is April 12 of this year.
Date: 2021-06-28 16:14:56 From: Jim Pivarski (@jpivarski)
If you only saw this with Dask, maybe the versions on the remote workers differ from the versions on the head node?
Date: 2021-06-28 16:16:03 From: Lukas (@lukasheinrich)
the structure List[Record[Record[List[List[Record]]]]]
Date: 2021-06-28 16:16:04 From: Jim Pivarski (@jpivarski)
There really needs to be a module named ak._connect._uproot if Awkward Array is newer than 10 months ago. This hasn't been touched since it was first added.
Date: 2021-06-28 16:16:13 From: Andrew Naylor (@asnaylor)
could be but i'm using the same kernel. I'm using mpirun to generate the dask-mpi workers
Date: 2021-06-28 16:17:54 From: Jim Pivarski (@jpivarski)
For a structure like List[Record[Record[List[List]]]], field access of the first record can come either first or second. It can commute anywhere to the left of its natural position, but not to the right of it.
Date: 2021-06-28 16:18:11 From: Jim Pivarski (@jpivarski)
That isn't some kind of technical limitation; logically it has to be this way.
Date: 2021-06-28 16:18:52 From: Jim Pivarski (@jpivarski)
You should be able to look at this module in your script. If it's not there, it's some sort of installation problem.
Date: 2021-06-28 16:20:27 From: Andrew Naylor (@asnaylor)
(uproot-py38) user@cori:~/> python
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import awkward as ak
>>> ak.__version__
'1.2.2'
>>> ak._connect
ak._connect
>>> ak._connect._uproot
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'awkward._connect' has no attribute '_uproot'
>>>
Date: 2021-06-28 16:22:08 From: Jim Pivarski (@jpivarski)
What if you import awkward._connect._uproot
Date: 2021-06-28 16:22:09 From: Jim Pivarski (@jpivarski)
?
Date: 2021-06-28 16:23:44 From: Jim Pivarski (@jpivarski)
Uproot does the explicit import: https://github.com/scikit-hep/uproot4/blob/fb978c505eca58347a740122c5cfdf27fd09353a/src/uproot/interpretation/objects.py#L29-L43
Even if a new Uproot was paired with an old Awkward Array, it would still work because awkward_can_optimize would return False.
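A minimal sketch of the guarded optional-import pattern that makes this safe, as described above. The function name is illustrative, not Uproot's actual code: if the optional module can't be imported, the feature reports "not available" instead of raising later.

```python
def can_optimize():
    """Report whether the optional awkward._connect._uproot module is usable."""
    try:
        import awkward._connect._uproot  # noqa: F401  (optional dependency)
    except (ImportError, AttributeError):
        return False  # missing module means "don't optimize", not a crash
    return True
```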
Date: 2021-06-28 16:23:46 From: Andrew Naylor (@asnaylor)
When i'm in interactive i'm not seeing it. I tried it on another machine with the latest uproot and awkward and still no dice :
$ python
Python 3.8.2 (default, Mar 9 2020, 16:03:22)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import awkward as ak
>>> ak.__version__
'1.3.0'
>>> ak._connect._
ak._connect._autograd ak._connect._jax ak._connect._numba ak._connect._numexpr ak._connect._numpy
>>> ak._connect._uproot
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'awkward._connect' has no attribute '_uproot'
Date: 2021-06-28 16:24:16 From: Andrew Naylor (@asnaylor)
I get:
>>> import awkward._connect._uproot
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/anaylor/temp/tmp/looking-at-lz-data-python-session/LZ_Python_Kernel/lib/python3.8/site-packages/awkward/_connect/_uproot.py", line 8, in <module>
import uproot4
ModuleNotFoundError: No module named 'uproot4'
>>>
Date: 2021-06-28 16:25:28 From: Jim Pivarski (@jpivarski)
This is the biggest clue, the fact that something is referencing uproot4. I'm looking into it.
Date: 2021-06-28 16:26:26 From: Andrew Naylor (@asnaylor)
Thanks
Date: 2021-06-28 16:26:51 From: Lukas (@lukasheinrich)
I'm trying to paste an image in a thread, but it seems Gitter doesn't support this
Date: 2021-06-28 16:26:56 From: Lukas (@lukasheinrich)
In [31]: a["TrigMatchedObjects"]["HLT_mu26_ivarmedium"]["m_persKey"]
Out[31]: <Array [[[980095599]], ... [[980095599]]] type='2000 * var * var * int32'>
In [32]: a["TrigMatchedObjects"]["HLT_mu26_ivarmedium"]["m_persKey"][0][0][0]
Out[32]: 980095599
In [33]: a["TrigMatchedObjects"]["HLT_mu26_ivarmedium"][0]["m_persKey"][0][0]
Out[33]: 980095599
In [34]: a["TrigMatchedObjects"]["HLT_mu26_ivarmedium"][0][0]["m_persKey"][0]
Out[34]: 980095599
In [35]: a["TrigMatchedObjects"]["HLT_mu26_ivarmedium"][0][0][0]["m_persKey"]
Out[35]: 980095599
In [36]: a["TrigMatchedObjects"][0]["HLT_mu26_ivarmedium"][0][0]["m_persKey"]
Out[36]: 980095599
In [37]: a[0]["TrigMatchedObjects"][0]["HLT_mu26_ivarmedium"][0]["m_persKey"]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Date: 2021-06-28 16:27:06 From: Jim Pivarski (@jpivarski)
I don't know why that's only affecting your case, but I'll modernize that right away.
Date: 2021-06-28 16:27:19 From: Lukas (@lukasheinrich)
ok thanks, that makes it clear
Date: 2021-06-28 16:28:05 From: Andrew Naylor (@asnaylor)
Yeah that's very strange
Date: 2021-06-28 16:30:06 From: Jim Pivarski (@jpivarski)
Still, though, even if this was always failing with a ModuleNotFoundError, awkward_can_optimize would return False. It would not attempt to use the module if it can't import it.
Date: 2021-06-28 16:30:26 From: Jim Pivarski (@jpivarski)
I don't understand what's happening in your case, but I am going to get rid of the references to uproot4.
Date: 2021-06-28 16:31:36 From: Andrew Naylor (@asnaylor)
Also can't find uproot4 reference when doing github advanced search https://github.com/search?q=uproot4+repo%3Ascikit-hep%2Fawkward-1.0&type=code it should search through the code to find all references
Date: 2021-06-28 16:33:57 From: Jim Pivarski (@jpivarski)
I know. GitHub didn't find it for me, either.
Date: 2021-06-28 16:34:46 From: Jim Pivarski (@jpivarski)
I accidentally committed the fix to Awkward's main branch—I meant for it to be a PR, but too many things are going on at once. The fix will go into the 1.4.0 release that we're working on now.
Date: 2021-06-28 16:35:29 From: Andrew Naylor (@asnaylor)
Thanks, i'll see if doing a pip install uproot4 will fix it for now and then update to the latest 1.4.0 when released
Date: 2021-06-28 16:36:43 From: Jim Pivarski (@jpivarski)
Great. It's still not understood, though: any failed attempt to import awkward._connect._uproot should have resulted in awkward_can_optimize returning False. You should never have seen this.
Date: 2021-06-28 16:59:58 From: Andrew Naylor (@asnaylor)
What's strange is that my virtualenv could do awkward._connect._uproot once i did pip install uproot4 but my conda environment was still complaining it didn't exist
Date: 2021-06-28 17:03:17 From: Jim Pivarski (@jpivarski)
You could take my slip-up to be an advantage: installing Awkward from the main of GitHub would contain the proposed fix. See if it helps. (It would be versioned as 1.4.0rc1, though technically it would be rc2.)
Thankfully, the tests on main are passing.
Date: 2021-06-30 15:38:22 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: what is the best way to handle the case where layout.axis_wrap_if_negative raises a ValueError because the layout is not deep enough? I want to pass-through in such a case; is it better to catch the error and move on, or is purelist_depth somehow useful (EAFP vs LBYL)
Date: 2021-06-30 15:46:34 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org Look Before you Leap in this case. (Also, that's generally what the Awkward codebase does; I think try-catch blocks are syntactically clunky.) This function can raise exceptions for two cases:
(std::invalid_argument is ValueError) and using an if-statement guard with purelist_depth (or minmax_depth, what this function does) would make it more obvious what the code is guarding against. The normal flow of this is to keep calling axis_wrap_if_negative at each level of depth, and when it reaches zero-depth without the axis being equivalent to the depth, then that's the user's error (axis corresponds to a dimension that doesn't exist).
Date: 2021-06-30 15:47:04 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: thanks. This relates to _packed - it doesn't handle negative indices very well right now
Date: 2021-06-30 15:47:49 From: Angus Hollands (@agoose77:matrix.org)
I'm not actually sure what the best solution is; there isn't a nice one-pass way to only pack the expected layouts in the presence of unions.
Date: 2021-06-30 15:48:33 From: Angus Hollands (@agoose77:matrix.org)
I'm thinking that, if one wants to support negative axes (which I think we do, because we use it in unpack to minimise overhead), one needs to do two passes, first to identify the layouts and second to actually perform the simplification
Date: 2021-06-30 15:49:22 From: Jim Pivarski (@jpivarski)
Unions might give you strange answers for purelist_depth (I don't remember exactly what it does, but it might return -1 for ambiguous cases). The minmax_depth always gives you the minimum depth (for all possible branches) and the maximum depth (for all possible branches) at this point.
Date: 2021-06-30 15:49:39 From: Angus Hollands (@agoose77:matrix.org)
Ah, that's perfect!
Date: 2021-06-30 15:49:58 From: Jim Pivarski (@jpivarski)
I'm not worried about the time spent packing and unpacking, but it would likely be simpler and easier to read if it's one pass.
Date: 2021-06-30 15:50:36 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I'm thinking more about the perf of copying everything below unflatten when we don't absolutely need to
Date: 2021-06-30 15:51:11 From: Angus Hollands (@agoose77:matrix.org)
minmax should give the ability to gate on the depth, which is fab
Date: 2021-06-30 15:53:25 From: Jim Pivarski (@jpivarski)
See the implementation of axis_wrap_if_negative above: it uses minmax_depth, so you can carve out exactly the same case that it raises exceptions on.
Date: 2021-06-30 15:58:27 From: Angus Hollands (@agoose77:matrix.org)
Thanks a bunch, that should do it.
Date: 2021-07-01 15:18:18 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org I was going to add an ak.unflatten example to next week's PyHEP tutorial, but it seems to be broken—I think just for axis=0. Are there any tests for axis=0 and does this look like an easy fix?
>>> ak.unflatten(array, lumilengths, axis=0)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-96-d302af0752f5> in <module>
----> 1 ak.unflatten(array, lumilengths, axis=0)
~/miniconda3/lib/python3.8/site-packages/awkward/operations/structure.py in unflatten(array, counts, axis, highlevel, behavior)
1962 # the `counts` array, which is computed through these layouts, aligns with
1963 # the layout to be unflattened (#910)
-> 1964 layout = _packed(layout, axis=axis - 1, highlevel=False)
1965
1966 if isinstance(counts, (numbers.Integral, np.integer)):
~/miniconda3/lib/python3.8/site-packages/awkward/operations/structure.py in _packed(array, axis, highlevel, behavior)
2298 )
2299
-> 2300 out = apply(layout, 1, axis)
2301
2302 return ak._util.maybe_wrap_like(out, array, behavior, highlevel)
~/miniconda3/lib/python3.8/site-packages/awkward/operations/structure.py in apply(layout, depth, posaxis)
2118 if posaxis is not None:
2119 # If a particular axis was given
-> 2120 posaxis = layout.axis_wrap_if_negative(posaxis)
2121 # Do not proceed past that axis
2122 if posaxis < depth - 1:
ValueError: axis == -1 exceeds the min depth == 1 of this array
(https://github.com/scikit-hep/awkward-1.0/blob/1.4.0rc3/src/libawkward/Content.cpp#L1722)
If this isn't an easy fix, let's not rush it. I'll just not do an ak.unflatten example. This example was to take an array in which entries are events to one in which entries are luminosity blocks, the next level up under runs. I wanted to show how easy that is (usually!).
Date: 2021-07-01 15:18:55 From: Jim Pivarski (@jpivarski)
If needed, I can provide the 25 MB array (as a Parquet file).
Date: 2021-07-01 15:19:02 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: that should be an easy fix, try https://github.com/scikit-hep/awkward-1.0/pull/972
Date: 2021-07-01 15:19:24 From: Angus Hollands (@agoose77:matrix.org)
I was tying myself in a knot to shoe-horn in support for something we don't need
Date: 2021-07-01 15:20:34 From: Jim Pivarski (@jpivarski)
Compiling now.
Date: 2021-07-01 15:21:13 From: Jim Pivarski (@jpivarski)
If this works, is #972 ready to merge and include in 1.4.0 (to be released today or tomorrow)?
Date: 2021-07-01 15:21:36 From: Angus Hollands (@agoose77:matrix.org)
#972 is ready to go - I didn't add any new tests, because it only modifies private functions.
Date: 2021-07-01 15:22:09 From: Angus Hollands (@agoose77:matrix.org)
I'm not sure how the tests didn't catch the axis=0 case. Perhaps we don't already have a test for that.
Date: 2021-07-01 15:23:54 From: Jim Pivarski (@jpivarski)
Since it makes changes to ak.unflatten, #972 is tested by existing tests (which apparently didn't have any axis=0, because otherwise main would be broken).
Date: 2021-07-01 15:24:26 From: Jim Pivarski (@jpivarski)
I see that you're introducing a LayoutTransformer as a kind of more general recursively_apply. (recursively_apply still exists, right?)
Date: 2021-07-01 15:24:31 From: Angus Hollands (@agoose77:matrix.org)
Yes, I had assumed we might already have such a test! We can add one too.
Date: 2021-07-01 15:25:10 From: Angus Hollands (@agoose77:matrix.org)
Yes, it does - this just replaces the underlying implementation
Date: 2021-07-01 15:26:24 From: Angus Hollands (@agoose77:matrix.org)
Hold off on merging for a few minutes though
Date: 2021-07-01 15:26:37 From: Angus Hollands (@agoose77:matrix.org)
I want to see if I've missed a really obvious way to do this
Date: 2021-07-01 15:28:18 From: Jim Pivarski (@jpivarski)
I'll wait for the test. Meanwhile, I'm happy with it: it seems to have unflattened my array (and I'm digging into the details and finding no errors).
Date: 2021-07-01 15:28:46 From: Jim Pivarski (@jpivarski)
Wait—is recursively_apply defined in terms of LayoutTransformer now?
Date: 2021-07-01 15:28:51 From: Angus Hollands (@agoose77:matrix.org)
Yes
Date: 2021-07-01 15:29:22 From: Jim Pivarski (@jpivarski)
If so, that's a nice code reuse (and recursively_apply is heavily tested by many other functions, so passing tests is a good demonstration).
Date: 2021-07-01 15:30:25 From: Jim Pivarski (@jpivarski)
I'll be watching #972. Let me know on GitHub when it's done and I'll merge it. It's looking good to me.
Date: 2021-07-01 15:34:00 From: Angus Hollands (@agoose77:matrix.org)
Actually I think this can be done really cleanly with a purely functional interface
Date: 2021-07-01 15:34:31 From: Angus Hollands (@agoose77:matrix.org)
We just need to pull out generic_transform into a free function, and the user transform function will call it like generic_transform(transform, ...)
Date: 2021-07-01 15:34:49 From: Angus Hollands (@agoose77:matrix.org)
I've got to head off for an hour or so, I'll make the change later and update the PR
Date: 2021-07-01 15:35:20 From: Jim Pivarski (@jpivarski)
Don't make too many last-minute changes: there's time pressure for 1.4.0 and I'd like to include this.
Date: 2021-07-01 15:35:52 From: Angus Hollands (@agoose77:matrix.org)
e.g.
def transform(layout, depth, user):
    if depth == 1:
        return layout.simplify()
    else:
        return generic_transform(transform, layout, depth, user)
Date: 2021-07-01 15:35:59 From: Jim Pivarski (@jpivarski)
Since this works, I'd like to merge it. Since it's internal code, things can be renamed and refactored later without hurting anybody's downstream projects.
Date: 2021-07-01 15:36:06 From: Angus Hollands (@agoose77:matrix.org)
I will re-test the test suite + my own use case to ensure it's valid
Date: 2021-07-01 15:36:22 From: Angus Hollands (@agoose77:matrix.org)
OK, if you need to get this release out, I can rip it out and replace it in master :)
Date: 2021-07-01 15:37:12 From: Jim Pivarski (@jpivarski)
You mean some refactoring you started on locally?
Date: 2021-07-01 15:37:42 From: Angus Hollands (@agoose77:matrix.org)
Sorry, I'm not quite sure what you're asking?
Date: 2021-07-01 15:38:39 From: Jim Pivarski (@jpivarski)
Unless there's something actively wrong with the PR as it is, I'd like to merge it and do another release candidate. That, by itself, will take more than an hour. If you want to do refactoring, like the transform/generic_transform thing above, then that can be a new PR after 1.4.0.
Date: 2021-07-01 15:38:50 From: Angus Hollands (@agoose77:matrix.org)
OK, that sounds fine to me.
Date: 2021-07-01 15:39:07 From: Angus Hollands (@agoose77:matrix.org)
I have another suggestion
Date: 2021-07-01 15:39:22 From: Angus Hollands (@agoose77:matrix.org)
I think your bug can easily be fixed if we pack the entire layout instead of up to the axis
Date: 2021-07-01 15:39:27 From: Jim Pivarski (@jpivarski)
Since it's internal, there's no rush to get the refactoring into 1.4.0. But since ak.unflatten is broken in main, there is a rush to get #972 into 1.4.0.
Date: 2021-07-01 15:40:14 From: Angus Hollands (@agoose77:matrix.org)
i.e. line 1964 in ~/miniconda3/lib/python3.8/site-packages/awkward/operations/structure.py becomes
packed = _packed(layout, axis=None, highlevel=False)
Date: 2021-07-01 15:40:59 From: Angus Hollands (@agoose77:matrix.org)
If that fixes the bug then I can update this PR to improve the internals and make the packing less aggressive
Date: 2021-07-01 15:41:27 From: Angus Hollands (@agoose77:matrix.org)
On second thoughts, the axis parameter is not ideal, and even though it's internal ... just make the release for now
Date: 2021-07-01 15:41:48 From: Jim Pivarski (@jpivarski)
Packing the entire layout is an alternative to PR #972? (It wouldn't be faster to try an alternative, so we'd only consider an alternative if we had some reason to be wary of #972.)
Date: 2021-07-01 15:42:00 From: Jim Pivarski (@jpivarski)
Is there any reason to be wary of it?
Date: 2021-07-01 15:42:32 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: the only reason we don't pack the entire layout is because I think it's wasteful to make copies we don't absolutely need to
Date: 2021-07-01 15:42:39 From: Angus Hollands (@agoose77:matrix.org)
But it should not change behaviour
Date: 2021-07-01 15:43:08 From: Angus Hollands (@agoose77:matrix.org)
It does mean that layouts will differ between awkward versions after calling unflatten, but that's already the case for the version that introduces ak.packed
Date: 2021-07-01 15:43:51 From: Jim Pivarski (@jpivarski)
We don't guarantee stability of layouts/forms from one version to the next, only stability of types.
Date: 2021-07-01 15:44:13 From: Angus Hollands (@agoose77:matrix.org)
That was my guess
Date: 2021-07-01 15:44:34 From: Jim Pivarski (@jpivarski)
So if the new unflatten produces different layouts with equivalent data (same .tolist() and same .type), then that's not considered an interface-breaking change.
Date: 2021-07-01 15:45:21 From: Angus Hollands (@agoose77:matrix.org)
I'll be back at my pc in about 35 minutes and I expect the change to my pr will take all of ten minutes, to keep you informed!
Date: 2021-07-01 15:45:40 From: Jim Pivarski (@jpivarski)
Okay, then I'll wait.
Date: 2021-07-01 16:45:05 From: Angus Hollands (@agoose77:matrix.org)
Nearly done.
Date: 2021-07-01 16:58:11 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I've updated the PR. It adds less code, and is still recursive so it's not functionally very different. I think this approach is slightly cleaner because there's less boilerplate, and it's simple recursion. If you prefer the previous implementation then I can revert the changes.
Date: 2021-07-01 16:58:15 From: Angus Hollands (@agoose77:matrix.org)
https://github.com/scikit-hep/awkward-1.0/pull/972
Date: 2021-07-01 17:16:01 From: Jim Pivarski (@jpivarski)
Okay, I've tested it (still works), and I've scanned through the diff. There are no tests—in fact, one test was removed—and it looks like the main difference is that the LayoutTransformer class was replaced by a pure function. That's fine, but I actually didn't have an opinion either way: functions are not better than classes, but if this class didn't provide any advantage by encapsulating state, then it didn't need to be a class.
Date: 2021-07-01 17:17:23 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: that's pretty much it. I felt that the classes were just adding a level of indirection that didn't make much sense given that most usage would not be as a new subclass.
Date: 2021-07-01 17:17:37 From: Angus Hollands (@agoose77:matrix.org)
The new PR is easier to read and is KISS
Date: 2021-07-01 17:18:25 From: Angus Hollands (@agoose77:matrix.org)
As I'm sure you're aware, I removed the test because it was testing an internal implementation detail, rather than the public API.
Date: 2021-07-01 17:18:34 From: Angus Hollands (@agoose77:matrix.org)
Is there anything I can do to help with the release?
Date: 2021-07-01 17:19:55 From: Jim Pivarski (@jpivarski)
Since it fixes the bug, looks easy to read as it is, is tested by previously existing tests, and has your seal of approval, then I'll take it.
As for the release, this is it. I should be able to just create a tag and get a new deployment, if it hasn't broken since 1.4.0rc3. I'll have to wait for this to pass tests and get merged, first.
Date: 2021-07-01 17:20:16 From: Angus Hollands (@agoose77:matrix.org)
OK, fingers crossed that things go smoothly!
Date: 2021-07-01 17:36:10 From: Jim Pivarski (@jpivarski)
It's started: https://github.com/scikit-hep/awkward-1.0/actions/runs/990980447 (It will take about an hour.)
Date: 2021-07-01 18:07:23 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: could you test something for me? what does ak.flatten(ak.from_numpy(np.zeros((3, 3, 5))), axis=None) give for you?
Date: 2021-07-01 18:13:35 From: Jim Pivarski (@jpivarski)
Darn:
>>> ak.flatten(ak.from_numpy(np.zeros((3, 3, 5)), regulararray=True), axis=None)
<Array [0, 0, 0, 0, 0, 0, ... 0, 0, 0, 0, 0, 0] type='45 * float64'>
>>> ak.flatten(ak.from_numpy(np.zeros((3, 3, 5)), regulararray=False), axis=None)
<Array [[[0, 0, 0, 0, 0], ... [0, 0, 0, 0, 0]]] type='3 * 3 * 5 * float64'>
ak.flatten with axis=None is not flattening multidimensional NumpyArrays.
Date: 2021-07-01 18:14:43 From: Jim Pivarski (@jpivarski)
Date: 2021-07-01 18:15:25 From: Jim Pivarski (@jpivarski)
ak._util.completely_flatten's NumpyArray handling: https://github.com/scikit-hep/awkward-1.0/blob/34eae106ae61dfdd22397ff112ba9a5611928166/src/awkward/_util.py#L572-L576
Date: 2021-07-01 18:15:41 From: Jim Pivarski (@jpivarski)
It should .reshape(-1) after the .asarray.
Date: 2021-07-01 18:15:42 From: Angus Hollands (@agoose77:matrix.org)
OK, do we need to implement flatten in nplike, or can we do something else here
Date: 2021-07-01 18:15:47 From: Angus Hollands (@agoose77:matrix.org)
maybe reshape
Date: 2021-07-01 18:15:58 From: Angus Hollands (@agoose77:matrix.org)
OK, I'll make a fix
Date: 2021-07-01 18:16:40 From: Jim Pivarski (@jpivarski)
Thank you! Ianna's sorting PR (#946) is also in flight, so there's time to get this in.
Date: 2021-07-01 18:23:36 From: Angus Hollands (@agoose77:matrix.org)
This passes my tests locally
Date: 2021-07-01 18:23:36 From: Angus Hollands (@agoose77:matrix.org)
https://github.com/scikit-hep/awkward-1.0/pull/974
Date: 2021-07-01 18:24:09 From: Angus Hollands (@agoose77:matrix.org)
I haven't run the jax, awkward0, autograd, or numexpr tests, but this is definitely correct 🤣
Date: 2021-07-01 19:12:47 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: is ak.mask supposed to work for regular arrays?
Date: 2021-07-01 19:13:08 From: Jim Pivarski (@jpivarski)
It should...
Date: 2021-07-01 19:14:07 From: Angus Hollands (@agoose77:matrix.org)
Just trying an older version of Awkward:
array = ak.Array(
    ak.layout.RegularArray(ak.layout.NumpyArray(np.r_[1, 2, 3, 4, 5, 6, 7, 8, 9]), 3)
)
ak.mask(array, array == 2)
Date: 2021-07-01 19:15:32 From: Jim Pivarski (@jpivarski)
I tried some examples that worked:
>>> array
<Array [[[0, 1, 2, 3, 4, ... 26, 27, 28, 29]]] type='2 * 3 * 5 * int64'>
>>> array = ak.from_numpy(np.arange(2*3*5).reshape(2, 3, 5), regulararray=True)
>>> array
<Array [[[0, 1, 2, 3, 4, ... 26, 27, 28, 29]]] type='2 * 3 * 5 * int64'>
>>> array.mask[[False, True]]
<Array [None, ... 24], [25, 26, 27, 28, 29]]] type='2 * option[3 * 5 * int64]'>
>>> array.mask[[[False, True, False], [True, True, False]]]
<Array [[None, [5, 6, 7, ... 23, 24], None]] type='2 * var * option[5 * int64]'>
I have to look closely at what's different about yours.
Date: 2021-07-01 19:16:11 From: Angus Hollands (@agoose77:matrix.org)
Try ak.mask(array, array==3)
Date: 2021-07-01 19:16:30 From: Angus Hollands (@agoose77:matrix.org)
I think it's the mask itself that's causing the issue
Date: 2021-07-01 19:16:42 From: Angus Hollands (@agoose77:matrix.org)
Is this the fancy indexing rules?
Date: 2021-07-01 19:16:53 From: Angus Hollands (@agoose77:matrix.org)
I.e. regular is behaving as numpy, which should be 1d
Date: 2021-07-01 19:16:56 From: Jim Pivarski (@jpivarski)
array.mask[something] is exactly the same as ak.mask(array, something).
Date: 2021-07-01 19:17:09 From: Angus Hollands (@agoose77:matrix.org)
Yeah I realised, same error ha
Date: 2021-07-01 19:17:59 From: Angus Hollands (@agoose77:matrix.org)
The "bug" is caused by having regular arrays in the index. I'm not sure if this is expected to work or not, because we don't treat jagged arrays in the same way as regular ones
Date: 2021-07-01 19:18:16 From: Angus Hollands (@agoose77:matrix.org)
My gut is telling me this is actually "expected"
Date: 2021-07-01 19:18:35 From: Jim Pivarski (@jpivarski)
An ak.Index must be one-dimensional.
Date: 2021-07-01 19:19:01 From: Jim Pivarski (@jpivarski)
More likely, it's the implementation of ak.mask in an unconsidered case.
Date: 2021-07-01 19:19:07 From: Angus Hollands (@agoose77:matrix.org)
Right, what I mean is that the mask argument itself is the bit that's throwing the ValueError
Date: 2021-07-01 19:19:56 From: Angus Hollands (@agoose77:matrix.org)
i.e. this works
array = ak.Array(
    ak.layout.RegularArray(ak.layout.NumpyArray(np.r_[1, 2, 3, 4, 5, 6, 7, 8, 9]), 3)
)
ak.mask(array, ak.from_regular(array == 2))
Date: 2021-07-01 19:21:01 From: Jim Pivarski (@jpivarski)
What hadn't been considered is that the slicing array would have regular dimensions.
Date: 2021-07-01 19:21:41 From: Jim Pivarski (@jpivarski)
>>> array.mask[array % 5 == 0]
raises the exception but
>>> array.mask[ak.from_regular(ak.from_regular(array % 5 == 0, axis=1), axis=2)]
<Array [[[0, None, None, ... None, None]]] type='2 * var * var * ?int64'>
does not.
Date: 2021-07-01 19:22:08 From: Jim Pivarski (@jpivarski)
For my array ak.from_numpy(np.arange(2*3*5).reshape(2, 3, 5), regulararray=True).
Date: 2021-07-01 19:22:12 From: Angus Hollands (@agoose77:matrix.org)
Yes, indeed.
Date: 2021-07-01 21:07:48 From: Angus Hollands (@agoose77:matrix.org)
I just spent a minute looking at this issue, and the cause seems quite simple: the result of binop(x, 3) is a NumpyArray in this case. The getfunction expects a flat NumpyArray rather than a multidimensional one
Date: 2021-07-01 21:10:11 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: is a reasonable solution here something like
if isinstance(layoutmask, ak.layout.NumpyArray):
    if layoutmask.ndim > 1:
        layoutmask = layoutmask.toRegularArray()
        # somehow return back to caller
Date: 2021-07-01 21:11:38 From: Jim Pivarski (@jpivarski)
Since it works for RegularArrays, I'd use the function in convert.py (forgot its name) that converts NumpyArrays into RegularArrays.
Date: 2021-07-01 21:12:11 From: Angus Hollands (@agoose77:matrix.org)
haha there's always a function for it
Date: 2021-07-01 21:12:31 From: Angus Hollands (@agoose77:matrix.org)
fab.
Date: 2021-07-01 21:13:35 From: Jim Pivarski (@jpivarski)
Actually, it's not in convert.py: toRegularArray() is a method on NumpyArray.
Date: 2021-07-01 21:13:46 From: Jim Pivarski (@jpivarski)
Date: 2021-07-01 21:13:49 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: yes but you recall correctly - regularize_numpyarray
Date: 2021-07-01 21:14:07 From: Jim Pivarski (@jpivarski)
Is it in multiple places? That's an oversight.
Date: 2021-07-01 21:14:08 From: Angus Hollands (@agoose77:matrix.org)
ah
Date: 2021-07-01 21:14:11 From: Angus Hollands (@agoose77:matrix.org)
Ah, that's a useful arg
Date: 2021-07-01 21:14:31 From: Angus Hollands (@agoose77:matrix.org)
No, I mean in convert.py we have regularize_numpyarray, which visits the layouts and replaces NumpyArrays with RegularArrays
Date: 2021-07-01 21:15:07 From: Jim Pivarski (@jpivarski)
For a whole tree, whereas toRegularArray only does one NumpyArray node?
Date: 2021-07-01 21:15:24 From: Angus Hollands (@agoose77:matrix.org)
Yes, it's only defined on the NumpyArray layout
Date: 2021-07-01 21:15:43 From: Jim Pivarski (@jpivarski)
There's also this, which goes in the opposite direction: https://github.com/scikit-hep/awkward-1.0/blob/edd85bc820011b74c307c9fd9f3caa7909e6b00c/src/awkward/_connect/_numpy.py#L109-L133
Hopefully, that's the only one.
Date: 2021-07-01 21:16:07 From: Angus Hollands (@agoose77:matrix.org)
Well, technically it's defined on all of the list types
Date: 2021-07-01 21:16:24 From: Angus Hollands (@agoose77:matrix.org)
But the high-level functions invoke it across the layout tree rather than on a particular content
Date: 2021-07-01 21:17:29 From: Angus Hollands (@agoose77:matrix.org)
Date: 2021-07-01 21:19:33 From: Angus Hollands (@agoose77:matrix.org)
I think the numpy_to_regular argument might be all that we need here though
Date: 2021-07-01 21:20:01 From: Jim Pivarski (@jpivarski)
Okay, sounds good.
Date: 2021-07-01 21:20:58 From: Angus Hollands (@agoose77:matrix.org)
Is there any downside long term to removing nd NumpyArrays, and only having regulararrays?
Date: 2021-07-01 21:21:01 From: Angus Hollands (@agoose77:matrix.org)
Just curious
Date: 2021-07-01 22:48:08 From: Jim Pivarski (@jpivarski)
NumpyArray follows the full data model of np.ndarray so that np.ndarrays can be absorbed with zero copy. RegularArrays require their content to be C-contiguous, which is an unnecessary transformation for NumPy arrays in some situations. Consider, for instance, Fortran order, or a range-slice on the inner dimensions of an n-dimensional NumPy array.
Alternatively, RegularArray could have been made smarter, with something like a stride in addition to its one-element shape (called "size"). The stride couldn't be a number of bytes, like NumPy's, since we don't know if the RegularArray's content is a NumpyArray. It could be a number of items. Well, that could have worked, though RegularArrays would be more complicated.
Since that complication has to go somewhere, it was easier putting that complication into the NumpyArray node, since its interpretation of shape and strides could then be exactly the same as NumPy's, which made it easier to test for exact conformance. If the "stride-like thing" had been moved out of NumpyArray into RegularArray, there would have been one benefit: conversion to RegularArrays (now required) would always be zero-copy.
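Jim's contiguity point can be illustrated with plain NumPy (no Awkward needed): Fortran-ordered arrays and inner-dimension range-slices are valid np.ndarrays, but they are not C-contiguous, so a container that required C-contiguous content would have to copy them.

```python
import numpy as np

# A 2x3x5 array in C (row-major) order: each inner row is contiguous in
# memory, so nesting it as regular lists could be zero-copy.
c_order = np.arange(2 * 3 * 5).reshape(2, 3, 5)
print(c_order.flags["C_CONTIGUOUS"])  # True

# The same data in Fortran (column-major) order is NOT C-contiguous, so a
# container demanding C-contiguous content would be forced to copy it.
f_order = np.asfortranarray(c_order)
print(f_order.flags["C_CONTIGUOUS"])  # False

# A range-slice on an inner dimension is also non-contiguous: the strides
# still skip over the elements that were sliced away.
inner_slice = c_order[:, :, 1:4]
print(inner_slice.flags["C_CONTIGUOUS"])  # False
print(inner_slice.strides)  # strides are in bytes, exactly as NumPy defines them
```

This is why keeping the full shape/strides data model inside NumpyArray (rather than a smarter RegularArray) made it possible to absorb any np.ndarray with zero copy.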
Date: 2021-07-02 13:01:12 From: Angus Hollands (@agoose77:matrix.org)
That makes sense, I can see how it makes things conceptually simpler to keep feature parity too
Date: 2021-07-06 15:43:18 From: Angus Hollands (@agoose77:matrix.org)
I couldn't make it to PyHEP this year. How did your talk go @jpivarski ?
Date: 2021-07-06 16:21:59 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org I think it went well—good that I had a lot of questions—but this meant that I spent a lot more time on the first stuff, Uproot, than the later stuff, Awkward Array.
Date: 2021-07-06 16:26:16 From: Angus Hollands (@agoose77:matrix.org)
I suppose that's to be expected ;)
Date: 2021-07-06 16:26:40 From: Angus Hollands (@agoose77:matrix.org)
It looked like an interesting notebook to run through (I took a peek)
Date: 2021-07-06 20:21:59 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski RE coercion, one thing I needed it for was repartitioning an array from its serialised (pickled) form
Date: 2021-07-06 20:23:34 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org When you pickled a PartitionedArray with different Forms among the partitions, there was a problem somewhere? Barring bugs, I think there shouldn't be.
Date: 2021-07-06 20:23:38 From: Angus Hollands (@agoose77:matrix.org)
One of the issues was key ordering in records, which I imagine we can fix. But for the case where one generates a series of partitions by some generator, it's conceivable that the forms would ultimately differ. What do you think the solution to this is?
Date: 2021-07-06 20:24:17 From: Jim Pivarski (@jpivarski)
I think the partitions of a PartitionedArray should be allowed to have different Forms.
Date: 2021-07-06 20:24:30 From: Jim Pivarski (@jpivarski)
What problem does it cause?
Date: 2021-07-06 20:25:42 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski I ran into a problem whereby I couldn't lazy load a partitioned dataset. Perhaps I need to dig into this first because I had a lot of custom code
Date: 2021-07-06 20:27:12 From: Jim Pivarski (@jpivarski)
No, you're right: ak.to_buffers complains that the partitions of a PartitionedArray have different Forms.
Date: 2021-07-06 20:28:05 From: Jim Pivarski (@jpivarski)
Using my example from GitHub:
>>> onetwo = ak.Array(ak.partition.IrregularlyPartitionedArray([one.layout, two.layout]))
>>> onetwo
<Array [1.1, 2.2, 3.3, 4.4, ... 3.3, 4.4, 5.5] type='10 * ?float64'>
>>> pickle.dumps(onetwo)
ValueError: the Form of partition 1:
{
    "class": "UnmaskedArray",
    "content": {
        "class": "NumpyArray",
        "itemsize": 8,
        "format": "d",
        "primitive": "float64",
        "form_key": "node1"
    },
    "form_key": "node0"
}
differs from the first Form:
{
    "class": "BitMaskedArray",
    "mask": "u8",
    "content": {
        "class": "NumpyArray",
        "itemsize": 8,
        "format": "d",
        "primitive": "float64",
        "form_key": "node1"
    },
    "valid_when": false,
    "lsb_order": true,
    "form_key": "node0"
}
Date: 2021-07-06 20:30:17 From: Angus Hollands (@agoose77:matrix.org)
That's the one!
Date: 2021-07-06 20:30:49 From: Jim Pivarski (@jpivarski)
I hadn't considered to_buffers. It expects to represent all of the data in a collection using a single Form. That assumption would not be easy to relax because the interface returns a single Form object.
Date: 2021-07-06 20:33:13 From: Jim Pivarski (@jpivarski)
The data manipulation functions were all written with the idea in mind that only the Types need to be the same, so all those loops over partitions allow for different code paths for each partition, but to_buffers was built with the idea that there would be only one Form for the user to save somewhere.
Date: 2021-07-06 20:36:04 From: Jim Pivarski (@jpivarski)
It's possible that this could be expanded with new JSON syntax:
{
    "class": "PartitionedArray",
    "forms": [
        {
            "partitions": [0, 3, 4, 6],
            "form": { ... }
        },
        {
            "partitions": [1, 2, 5],
            "form": { ... }
        }
    ]
}
but that's fairly major.
Date: 2021-07-06 20:37:53 From: Angus Hollands (@agoose77:matrix.org)
Hmm
Date: 2021-07-06 20:39:08 From: Jim Pivarski (@jpivarski)
Alternatively, to_buffers's handling of PartitionedArrays with different Forms could coerce all partitions to the most general Form, but that's going against the spirit of to_buffers, which is supposed to leave the data as-is, as much as possible.
Date: 2021-07-06 20:40:10 From: Angus Hollands (@agoose77:matrix.org)
I agree here - to_buffers seems like it should be lightweight to me
Date: 2021-07-06 20:40:32 From: Angus Hollands (@agoose77:matrix.org)
And requiring a common form imposes a hard constraint on the form whereas, as you point out, it's really an implementation detail
Date: 2021-07-06 20:40:35 From: Jim Pivarski (@jpivarski)
PartitionedArrays are supposed to allow different Forms:
>>> ak.is_valid(onetwo)
True
and here is the implementation, which does check for different Types:
Date: 2021-07-06 20:41:16 From: Angus Hollands (@agoose77:matrix.org)
So I'm leaning personally towards extending forms to capture this idea
Date: 2021-07-06 20:44:36 From: Jim Pivarski (@jpivarski)
I'm looking at that new Form syntax, and we've never had anything in the Form that scales with the length of the array before. The partitions field scales with the number of partitions. I suppose that the only set of partitions that doesn't need to be fully spelled out is the most common one, which would dramatically reduce that scaling in the most typical case: 99% of partitions have one Form, a few outliers have a few different Forms. That puts a lot of logic into the JSON-parsing, which needs to be supported forever.
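The elision idea above (spell out only the outlier partitions, let the most common Form be the default) can be sketched in plain Python. Everything here is hypothetical: the function name, the string Form descriptors, and the output shape are illustrative, not part of Awkward's API.

```python
from collections import Counter

def group_partitions_by_form(partition_forms):
    """Group partition indices by a (hypothetical) Form descriptor, eliding
    the most common Form's partition list as a catch-all default.

    `partition_forms` maps partition index -> hashable Form descriptor.
    Returns (default_form, [{"form": ..., "partitions": [...]}, ...]).
    """
    counts = Counter(partition_forms.values())
    default_form, _ = counts.most_common(1)[0]
    outliers = {}
    for index, form in sorted(partition_forms.items()):
        if form != default_form:
            outliers.setdefault(form, []).append(index)
    return default_form, [
        {"form": form, "partitions": parts} for form, parts in outliers.items()
    ]

# 99% of partitions share one Form; only the outliers are spelled out,
# so the serialized Form no longer scales with the number of partitions.
forms = {i: "bitmasked" for i in range(100)}
forms[17] = "unmasked"
forms[42] = "unmasked"
default, exceptions = group_partitions_by_form(forms)
print(default)     # bitmasked
print(exceptions)  # [{'form': 'unmasked', 'partitions': [17, 42]}]
```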
Date: 2021-07-06 20:46:48 From: Jim Pivarski (@jpivarski)
That was the first thing tested in the v2 Forms, that they can accept old Form JSON (https://github.com/scikit-hep/awkward-1.0/blob/main/tests/test_0958-new-forms-must-accept-old-form-json.py), since we want v1 pickled arrays to be readable in Awkward 2.0 (unlike 0.x → 1.0).
Date: 2021-07-06 20:52:11 From: Angus Hollands (@agoose77:matrix.org)
Would using a lookup table here ameliorate things?
Date: 2021-07-06 20:52:55 From: Angus Hollands (@agoose77:matrix.org)
Either just for the schemas, or even for the buffers as well (I.e. Different partitions with same form read a different region of the shared buffer)
Date: 2021-07-06 20:55:29 From: Lukas (@lukasheinrich)
i'll not try to catch up on this discussion :)
Date: 2021-07-06 20:55:35 From: Lukas (@lukasheinrich)
but happy to follow from here on
Date: 2021-07-06 20:55:51 From: Lukas (@lukasheinrich)
I added another comment on the issue
Date: 2021-07-06 20:56:59 From: Angus Hollands (@agoose77:matrix.org)
Ah, fab, I was just replying in github
Date: 2021-07-06 20:57:30 From: Angus Hollands (@agoose77:matrix.org)
@lukasheinrich in your example, as Jim says, doesn't your ROOT file guarantee that the type isn't unknown, regardless of the electron multiplicity?
Date: 2021-07-06 20:57:51 From: Lukas (@lukasheinrich)
known or unknown?
Date: 2021-07-06 20:58:38 From: Lukas (@lukasheinrich)
maybe it does for ROOT, but for JSON it for sure doesn't, right?
Date: 2021-07-06 21:00:01 From: Jim Pivarski (@jpivarski)
If the data source is untyped Python objects or untyped JSON objects, it can't fill in unknowns.
Date: 2021-07-06 21:00:07 From: Angus Hollands (@agoose77:matrix.org)
For JSON no, it uses from_iter under the hood iirc
Date: 2021-07-06 21:00:37 From: Lukas (@lukasheinrich)
right, so at least for this already ak.cast / ak.coerce_type would be useful
Date: 2021-07-06 21:00:58 From: Lukas (@lukasheinrich)
but @nikoladze should say how he came across this issue when doing ROOT->parquet
Date: 2021-07-06 21:22:58 From: Angus Hollands (@agoose77:matrix.org)
@lukasheinrich out of interest, why are you reading JSON?
Date: 2021-07-06 21:24:12 From: Angus Hollands (@agoose77:matrix.org)
from_iter is a Python-level loop, so it's not hugely performant.
Date: 2021-07-06 21:24:21 From: Lukas (@lukasheinrich)
we're not reading JSON this was just an example to work around my limited understanding of ROOT and demonstrate the usefulness for normalizing types in general
Date: 2021-07-06 21:25:36 From: Lukas (@lukasheinrich)
i.e. I think being able to cast an array into a type with which it's compatible is generally useful
Date: 2021-07-06 21:25:57 From: Lukas (@lukasheinrich)
as @jpivarski says, it's a natural extension of ak.values_astype
Date: 2021-07-06 21:28:54 From: Jim Pivarski (@jpivarski)
from_iter and from_json should be extended to use LayoutBuilder, rather than ArrayBuilder, if given a Form, but that will probably wait until v2 because it's in the part of the code that will be refactored to Python; might as well only write it once.
Date: 2021-07-07 12:52:20 From: Lukas (@lukasheinrich)
is there any nice way to view the "size" of a given ak.RecordArray?
Date: 2021-07-07 12:52:36 From: Lukas (@lukasheinrich)
kind of like Dask shows the in-memory size of a dataframe
Date: 2021-07-07 12:52:48 From: Jim Pivarski (@jpivarski)
ak.Array.nbytes, perhaps?
Date: 2021-07-07 12:53:28 From: Lukas (@lukasheinrich)
perfect
Date: 2021-07-07 12:53:52 From: Jim Pivarski (@jpivarski)
That's the number of bytes in the array buffers without double-counting, and it doesn't count the Python/C++ objects (the layout tree nodes). So for big arrays, it only counts the part that scales.
Date: 2021-07-07 12:55:24 From: Lukas (@lukasheinrich)
ok that's what I'm interested in
Date: 2021-07-07 20:20:03 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: is it unreasonable to expect ak.flatten(recordarray, None) to leave the record structure intact?
Date: 2021-07-07 20:21:55 From: Angus Hollands (@agoose77:matrix.org)
I see that we explicitly don't do that (https://github.com/scikit-hep/awkward-1.0/blob/main/src/awkward/_util.py#L566-L570) but it surprised me, and wondered whether my expectation is unreasonable
Date: 2021-07-07 22:52:34 From: Jim Pivarski (@jpivarski)
For other functions, like reducers, axis=None means "apply to all levels, regardless of axis." That's why the same thing applies here: every level is flattened. Some time ago, you were asking for that to be the default (unless my memory is wrong).
Date: 2021-07-08 09:05:44 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: yes, I was interested in establishing axis=None as the default for flatten. However, in this case, I'm actually asking about why the RecordArray layout is lost.
I typically use flatten when I want to convert an array into a flat representation for e.g. histogramming. I expected flat = ak.flatten(arr) to produce a record array that could then be decomposed into the fields for a 2D histogram, e.g. hist.fill(flat.x, flat.y).
For example, take this array:
arr = ak.Array([{'x': 1, 'y': 'hi'}])
If we flatten with axis=None, then it actually merges the contents and loses the record structure:
>>> ak.flatten(arr, axis=None)
<Array [1, 104, 105] type='3 * int64'>
I think this is the wrong behaviour, not only for this case where we have non-mergeable types, but also in the general case - I see the record as part of the type, not the structure. The current implementation is more like an ak.erase, which would erase all structure and type information besides the dtype.
After taking another look, completely_flatten is a part of the problem, as it doesn't propagate the RecordArray to the root, so that information is lost. We also don't handle records in https://github.com/scikit-hep/awkward-1.0/blob/a70a535a818d54c6dcdb60711dff09a283441541/src/awkward/operations/structure.py#L1833-L1840
I wonder what should happen in the case that there are nested records? Intuitively, I'd expect that all RecordArray layouts are pushed to the root, and their contents would be completely flat. Maybe flatten needs to take a new keep_records parameter that can be used to restore the current behaviour.
Date: 2021-07-08 12:30:21 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org This behavior is described in a few places in the documentation, always with massive warnings about the fact that you probably don't want all the fields of your records mixed together: https://awkward-array.org/how-to-restructure-flatten.html#ak-flatten-with-axis-none
The fact that axis=None completely flattens everything, including records, is a large part of why I was apprehensive about making it the default.
A different option, which flattens all levels of lists but keeps record fields (and nested record fields) unflattened, could be integrated into the axis argument as axis="records", because it's an option that would only apply in one axis case. That would have to be integrated into completely_flatten, as you've pointed out.
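The proposed "flatten lists but keep record fields separate" behaviour can be sketched on plain Python objects (lists for list-type arrays, dicts for records). This is not Awkward's API, just an illustration of the semantics under discussion.

```python
def flatten_all_lists(data):
    # Recursively flatten every level of list nesting into one flat list,
    # like axis=None does for the list dimensions.
    if isinstance(data, list):
        out = []
        for item in data:
            flattened = flatten_all_lists(item)
            out.extend(flattened if isinstance(flattened, list) else [flattened])
        return out
    return data

def flatten_keeping_records(data):
    # Sketch of the proposed axis="records": lists are fully flattened, but
    # record (dict) fields stay separate instead of being merged together.
    if isinstance(data, dict):
        return {key: flatten_keeping_records(value) for key, value in data.items()}
    return flatten_all_lists(data)

# A record of jagged fields: each field flattens independently,
# and the record structure survives at the root.
data = {"x": [[1, 2], [3]], "y": [["hi"], ["lo", "bye"]]}
print(flatten_keeping_records(data))
# {'x': [1, 2, 3], 'y': ['hi', 'lo', 'bye']}
```

With these semantics, the flattened fields remain usable as separate histogram inputs instead of being merged into one untyped stream.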
Date: 2021-07-08 12:31:16 From: Angus Hollands (@agoose77:matrix.org)
Gah, I forgot you'd been updating the user docs.
Date: 2021-07-08 12:35:38 From: Angus Hollands (@agoose77:matrix.org)
It feels like we'd want a new function for this, to avoid adding a string option, e.g. flatten_records, but I also see the argument for just adding another case to axis, even if it's a string parameter. I can look at this long term, and I'll open an issue for now.
Date: 2021-07-08 12:42:54 From: Angus Hollands (@agoose77:matrix.org)
Hmm, maybe this is where ravel comes in. To my mind, the existing axis=None behaviour best fits the notion of ravelling, because ravel doesn't take an axis parameter. Given that flatten and unflatten are already well defined concepts in Awkward (and strongly necessitate the idea of an axis), maybe we could deprecate the existing axis=None behaviour and schedule its replacement with axis="records". At the same time, one could add an ak.ravel which only does ak.flatten(axis=None). I.e.
def flatten(array, axis=None):
    if axis is None:
        return ravel(array)
    ...

def ravel(array):
    ...
Date: 2021-07-08 12:46:11 From: Jim Pivarski (@jpivarski)
ravel does sound like the right name to use. The deprecation process will be painful, though, because flattening with axis=None is widely used and there's a lot of documentation that will need to change. The more that's written and talked about a concept, the harder it is to change.
Date: 2021-07-08 12:51:43 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: Yes I agree. I am happy to take the lead on it down the road. Although it will break some workflows, I suspect it won't be terrible because most of the time you don't want to merge record fields (in my opinion!), i.e. most people are probably using ak.flatten only on non-record arrays, for which the behaviour will be unchanged.
Date: 2021-07-08 12:52:08 From: Angus Hollands (@agoose77:matrix.org)
With a long deprecation window, and updated docs, hopefully things won't be awful!
Date: 2021-07-12 10:01:15 From: Angus Hollands (@agoose77:matrix.org)
Hey @jpivarski, do you have any suggestions about how to help w.r.t the JIRA report for Parquet single field reading? I can't tell whether it is something that's being worked on, or if it's stuck in limbo land (as these things sometimes do).
Date: 2021-07-12 14:56:31 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org It is likely being worked on, though the timescale could be a few months. (That's typical for fixes they've made in the past.)
Date: 2021-07-12 15:08:36 From: Angus Hollands (@agoose77:matrix.org)
Fair enough, I'll just keep my eyes on it then :)
Date: 2021-07-14 22:27:15 From: Raymond Ehlers (@raymondEhlers)
Hi All, is there a way to pretty print the result of ak.type? It's admittedly a bit trivial, but some of my type strings are now 8 lines long, and it's getting more difficult to parse at a glance. It seems that there's no formal dependence on datashape, but something like https://github.com/blaze/datashape/blob/c9d2bd75414a69d94498e7340ef9dd5fce903007/datashape/coretypes.py#L1363 ? Thanks!
Date: 2021-07-15 00:23:14 From: Jim Pivarski (@jpivarski)
@raymondEhlers There isn't yet a way to pretty print a type string. It wouldn't be too hard, but maybe that should go into version 2 because the type printing has already been translated into Python. There's also a temporary v1 → v2 converter, so if a type pretty-printer is implemented in v2 (Python), then we could benefit from pretty-printing now by doing a quick v1 → v2 conversion (it's zero-copy).
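As a stopgap, one could pretty-print the type *string* itself with a naive bracket-aware indenter. This is a sketch only (the function name and formatting choices are made up here); a real implementation would walk Awkward's Type objects, as discussed above.

```python
def pretty_type(typestr, indent="    "):
    """Naively pretty-print a nested type string: break after opening
    brackets and commas, indenting by nesting depth.
    """
    out, depth, line = [], 0, ""
    for ch in typestr:
        if ch in "{[(":
            out.append(line + ch)   # break after an opening bracket
            depth += 1
            line = indent * depth
        elif ch in "}])":
            out.append(line)        # closing bracket starts its own line
            depth -= 1
            line = indent * depth + ch
        elif ch == ",":
            out.append(line + ",")  # one field per line
            line = indent * depth
        elif ch == " " and line == indent * depth:
            pass                    # drop leading spaces after a break
        else:
            line += ch
    out.append(line)
    return "\n".join(s for s in (part.rstrip() for part in out) if s)

print(pretty_type("2 * {x: int64, y: var * float64}"))
# 2 * {
#     x: int64,
#     y: var * float64
# }
```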
Date: 2021-07-15 08:28:18 From: Raymond Ehlers (@raymondEhlers)
Okay, I see - thanks Jim! From my perspective, it's super low priority - I just wanted to check in case I had overlooked it
Date: 2021-07-15 09:08:18 From: Angus Hollands (@agoose77:matrix.org)
Long term, maybe we should have an internal representation of the datashape AST? i.e. each type will have a datashape attribute that holds a Datashape object.
Date: 2021-07-15 09:10:03 From: Angus Hollands (@agoose77:matrix.org)
I can only see the Blaze repo realisation of datashapes in a Python library. Perhaps this is something that would benefit from being developed? (Not suggesting you take this on, I might be interested)
Date: 2021-07-15 09:28:56 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: thanks for merging the fixes yesterday. From looking at the two possible reducer paths in ListOffsetArray, it would seem that we expect starts to start at 0 (awkward_ListOffsetArray_reduce_nonlocal_nextstarts_64 handles this for the inner axis case). Although the bugs have now been fixed, I wonder whether we need to explicitly state this to avoid undefined behaviour slipping through in future (as conventionally starts can be non-zero based).
Date: 2021-07-15 11:04:51 From: Angus Hollands (@agoose77:matrix.org)
On another topic, I was thinking about v2 with only the kernels being C++. At this point, is there an argument for implementing kernels in numba (making numba a dependency ofc)? From a dev perspective it might lower the barrier to development (even though kernels shouldn't really change much vs the higher layers), and would make it easier to add CUDA support. I don't know how beneficial the parallelism feature would be to the kinds of kernels we have, but it might be useful.
Also, with v2, does it now make sense to make kernel modules (like nplike) so that we don't have per-kernel switch stmts at the call site?
I'm just musing over this stuff because I'm sure there have already been thoughts in this direction on your end.
Date: 2021-07-15 12:19:08 From: Jim Pivarski (@jpivarski)
Oh yeah, Blaze is dead. Dask was the one sub-project that came out of that and now has a life of its own. But Datashape was made explicitly for DyND (which is like Awkward Array), so it seemed appropriate to give the language, at least, life beyond the formal project.
Date: 2021-07-15 12:24:34 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I agree with the rationale; NumPy had grown since its first release, and perhaps now datashapes might make sense (i.e. masked arrays), but with Awkward it is more reasonable to start with them from the get-go.
Date: 2021-07-15 12:43:00 From: Jim Pivarski (@jpivarski)
"Internal representation of the Datashape AST": actually, this is our Type objects. So we do have one. In fact, Reik's Datashape parser (ak.type.from_string?) generates Type objects, just as a parser to an AST would.
Date: 2021-07-15 12:56:12 From: Jim Pivarski (@jpivarski)
"Kernels in Numba": it's possible, but I'd like to avoid it. (1) We don't want a strict dependency in Numba (and hence LLVM). A dependence on Numba would be a reasonable trade for a dependence on precompiled code, but there are some non-kernels that will have to be precompiled, so we don't get to make that trade. (2) For prototyping, we just write the function inline, since speed is not a concern during development, and then "crystallize" it as a kernel when done. (3) One could imagine numba-kernels in addition to cpu-kernels, but that means maintaining another copy of a few hundred functions. (4) If a kennel is written in Numba first, we'd be tempted to rely on the fact that it can be JIT-compiled, which adds abilities that aren't available when statically compiling. The implementation of matrix multiplication (currently in Numba only) has this problem: it uses JIT instead of columnar techniques and I don't know how to make it columnar. (I haven't put in the effort because I don't know if anybody's using it. But this is illustrative if the problem we would introduce if we allowed all kennels to be JIT-compiled; we'd find it hard to work without those superpowers!)
Date: 2021-07-15 12:56:40 From: Jim Pivarski (@jpivarski)
Also, with v2, does it now make sense to make kernel modules (like nplike) so that we don't have per-kernel switch statements at the call site?
Date: 2021-07-15 12:56:42 From: Angus Hollands (@agoose77:matrix.org)
Hmm, just looking at this again and you're right; we cover the full subset we need
Date: 2021-07-15 12:56:46 From: Angus Hollands (@agoose77:matrix.org)
(from_datashape)
Date: 2021-07-15 12:57:23 From: Jim Pivarski (@jpivarski)
That's how v2 kernel-calling is already being implemented: through nplike (no decision at call-site).
Date: 2021-07-15 13:01:52 From: Angus Hollands (@agoose77:matrix.org)
Nice, that's all my questions 🥳
Date: 2021-07-15 16:22:40 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: how would you feel about adding https://github.com/jpivarski-talks/2021-07-06-pyhep-uproot-awkward-tutorial as a binder link on the main repo?
Date: 2021-07-15 18:38:45 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org Maybe in the "Getting help" section, which is unfortunately copy-pasted in several places (since we don't know which one somebody will find first): GitHub README.md, PyPI (README-pypi.md), and Quickstart (docs-src/quickstart.md). These three places would need a new bullet point (perhaps as the last one; it will stand out visually).
- Tutorial for particle physicists, also available in Binder.
Date: 2021-07-15 20:17:57 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: in _form_to_layout (see here, in which cases is length expected to be None?
To fix #1007, we need to know the expected length of the array in order to support empty buffers.
Date: 2021-07-15 20:20:04 From: Angus Hollands (@agoose77:matrix.org)
Perhaps it's not well defined, in which case I will just fall back to 1 if length is None else length
Date: 2021-07-15 20:21:54 From: Jim Pivarski (@jpivarski)
For full generality, the length is required to produce an array from buffers (ak.from_buffers, which calls _form_to_layout). But before I recognized this, pickled arrays and arrays that were otherwise saved with the old ak.to_arrayset (no longer exists) didn't write out the length. To allow the cases that can be read back to be read back without backward incompatibility, length is internally allowed to be None even though the public API insists that it must be an integer.
Date: 2021-07-15 20:22:38 From: Angus Hollands (@agoose77:matrix.org)
Right, OK. So in these cases, from_buffers would already fail for the case that I'm fixing
Date: 2021-07-15 20:23:32 From: Angus Hollands (@agoose77:matrix.org)
What do you prefer in this instance — In the case that no length is given and we require one because the array has no itemsize, I can either raise a ValueError, or I can use a default 1.
Date: 2021-07-15 20:24:24 From: Angus Hollands (@agoose77:matrix.org)
I'm erring towards raising a ValueError
Date: 2021-07-15 20:25:11 From: Jim Pivarski (@jpivarski)
ValueError if there is no way to produce a correct array when the length is not known.
Date: 2021-07-15 20:25:49 From: Angus Hollands (@agoose77:matrix.org)
Thanks! (I went for the cool emoji, but it is decidedly uncool)
Date: 2021-07-15 20:26:18 From: Jim Pivarski (@jpivarski)
Since allowing None was just to accept data that had been pickled before ak.to_buffers, if the array can't be reconstructed without the length now, it couldn't have been reconstructed without the length then, so nothing is lost.
Date: 2021-07-15 20:26:32 From: Angus Hollands (@agoose77:matrix.org)
Yes, my thoughts too
Date: 2021-07-15 20:48:09 From: Angus Hollands (@agoose77:matrix.org)
Did you mean to merge this, or just approve? https://github.com/scikit-hep/awkward-1.0/pull/1008
Date: 2021-07-15 20:55:38 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org Yes, let's take it as a convention that if I approve your PR and it passes tests, then you can squash-and-merge it. (I know it can be ambiguous, so let's establish that that's what I mean by "Approve.")
Date: 2021-07-15 20:57:56 From: Angus Hollands (@agoose77:matrix.org)
OK, that sounds reasonable. I am still happy to leave you to "merge on test success" if that suits you better, but I'll take your statement above as the new consensus. It's good that you've spelled it out, because there are a myriad of ways one can show "approval" without intending to merge ;)
Date: 2021-07-15 21:01:03 From: Jim Pivarski (@jpivarski)
Yeah, it's also a concurrency problem because I cycle through tabs and activities (as I imagine you do, too), so I'm often in a position of asking, "Are you done with this PR? Can I merge it yet?" I don't want to merge it while they're adding one last commit. In principle, "Draft" should mean "still working" and "Approve" should mean "go ahead and merge," but everyone would have to be consistent with that if that's what we want them to mean. (And I haven't been entirely consistent, either. Consistency is hard.)
Date: 2021-07-15 21:04:17 From: Angus Hollands (@agoose77:matrix.org)
Hmm, that's a good point. Shall I settle on finishing a PR with a "ready to ship 🚀" comment to make it more explicit?
Date: 2021-07-15 21:11:40 From: Jim Pivarski (@jpivarski)
I'm good with that convention. "Approve" from me, "🚀" from you, and the tests pass = squash-and-merge.
Date: 2021-07-15 21:12:29 From: Angus Hollands (@agoose77:matrix.org)
Agreed. To confirm my understanding, that's Angus->Jim->Angus if all tests pass :)
Date: 2021-07-15 22:17:22 From: Jim Pivarski (@jpivarski)
Right!
Date: 2021-07-16 21:05:43 From: Angus Hollands (@agoose77:matrix.org)
Hey Jim, is there much thought at the moment as to whether the API surface of the v1 layouts will be replicated for v2?
Date: 2021-07-16 21:08:15 From: Angus Hollands (@agoose77:matrix.org)
I'm looking at ak.flatten, and it is feeling increasingly like flatten(axis=None) is just for i in reversed(range(layout.ndim - 1)): layout = layout.flatten(i), which seems naive.
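A minimal plain-Python sketch of that repeated one-level flattening idea (the helper names here are made up, not Awkward's internals):

```python
def flatten_once(nested):
    # Merge one level of nesting: a list of lists becomes a flat list.
    return [item for sub in nested for item in sub]

def flatten_all(nested, ndim):
    # flatten(axis=None) as ndim-1 successive one-level flattens.
    for _ in range(ndim - 1):
        nested = flatten_once(nested)
    return nested

assert flatten_all([[[1, 2], [3]], [[4]]], ndim=3) == [1, 2, 3, 4]
```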
Date: 2021-07-16 21:08:40 From: Angus Hollands (@agoose77:matrix.org)
I'm used to most of the work being at the operations level rather than low-level methods
Date: 2021-07-16 21:11:11 From: Angus Hollands (@agoose77:matrix.org)
I suppose that the current implementation is probably more efficient, as it only needs to perform one "new" allocation when merging contents
Date: 2021-07-17 02:17:33 From: Jim Pivarski (@jpivarski)
v2 will have the same layout interface as v1 unless we find good reason to change it. The methods on these classes are considered internal details (and probably should have started with underscores), but they're very stable. Anyway, if you need to change it now, we'll work around it in the refactoring. These aren't separate projects that need to work for a range of different versions on each.
Date: 2021-07-19 14:30:55 From: Angus Hollands (@agoose77:matrix.org)
I noticed that we might need to clarify when NumPy broadcasting is chosen over left-broadcasting — I found myself needing to read the source code of broadcast_and_apply to learn that NumPy right-broadcasting is only used when all of the inputs are purelist regular. Is my understanding correct? If so, I'll make a PR to update the docs.
Date: 2021-07-19 14:40:00 From: Angus Hollands (@agoose77:matrix.org)
Also, I ran into RuntimeError: undefined operation: NumpyArray::getitem_next_jagged(array) for ndim == 2 today. I was indexing a jagged array of type 1 * var * 512 * float64 with an index of type 1 * var * var * int64. This is supported when the jagged array has a RegularArray layout instead of a 2D NumpyArray layout, so is this just not implemented yet rather than unpermitted?
Date: 2021-07-19 15:50:04 From: Jim Pivarski (@jpivarski)
Unless something has changed since I originally defined it, it would use NumPy-like broadcasting if all arrays are regular and Awkward-like broadcasting if all arrays are irregular, and would raise an error if there's a mixture.
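As a plain-NumPy illustration of the "all arrays regular" case (this is standard NumPy behavior, not Awkward-specific code): shapes align from the trailing axis, i.e. right-broadcasting.

```python
import numpy as np

# NumPy (right) broadcasting: a (3,) array is treated as (1, 3)
# against a (2, 3) array, so the new dimension is created on the left
# of the shorter shape -- equivalently, shapes align from the right.
a = np.ones((2, 3))
b = np.arange(3)          # shape (3,)
assert (a + b).shape == (2, 3)
```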
Date: 2021-07-19 15:52:45 From: Jim Pivarski (@jpivarski)
If you hit an undefined operation, that means that we didn't expect the code path to reach that method. It was defined to satisfy C++'s static typing, but no code was written because dynamically, you shouldn't have gotten there. The error is likely upstream of this point. (And it will get easier to debug when that part of the codebase is in Python. We've done a few rounds finding missing cases in jagged/missing slices already. Part of that was written by Nick, who hit some unexpected cases.)
Date: 2021-07-19 15:54:16 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: OK, I'll hang on until v2 is more mature and take it up from there
Date: 2021-07-19 15:56:15 From: Angus Hollands (@agoose77:matrix.org)
Another question - when zipping n * var * record with n * var * 512 * int32, we don't want the RegularArray to be erased for depth_limit=2 right?
Date: 2021-07-19 16:01:31 From: Angus Hollands (@agoose77:matrix.org)
It seems to depend upon whether the contents to be zipped are RegularArrays at the given depth, or NumpyArrays with ndim > 1
Date: 2021-07-19 16:08:05 From: Angus Hollands (@agoose77:matrix.org)
sample = ak.Array([
    [
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5]
    ],
    [
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
    ]
])
sample = ak.to_regular(sample, axis=-1)
data = ak.zip({"sample": sample, "two_sample": sample}, depth_limit=2)
The above data.sample isn't regular despite being regular under the hood
Date: 2021-07-19 16:08:13 From: Jim Pivarski (@jpivarski)
What do you mean "erased"?
Date: 2021-07-19 16:09:08 From: Angus Hollands (@agoose77:matrix.org)
the RegularArray layout becomes ListOffsetArray, and in doing so, the fact that the layout is regular is lost
Date: 2021-07-19 16:10:52 From: Jim Pivarski (@jpivarski)
>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=True)
>>> sample.type
2 * 2 * 5 * int64
>>> ak.zip({"a": sample, "b": sample}).type
2 * var * var * {"a": int64, "b": int64}
>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=False)
>>> sample.type
2 * 2 * 5 * int64
>>> ak.zip({"a": sample, "b": sample}).type
2 * var * var * {"a": int64, "b": int64}
I don't see a difference between RegularArray and NumpyArray with ndim > 1.
Date: 2021-07-19 16:11:48 From: Jim Pivarski (@jpivarski)
I hadn't thought about preserving regularness when implementing ak.zip.
Date: 2021-07-19 16:11:53 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I want to broadcast only to a depth of 2 - in my case, the data are waveforms that should all have 512 sample
Date: 2021-07-19 16:12:19 From: Angus Hollands (@agoose77:matrix.org)
Although your example is also interesting
Date: 2021-07-19 16:12:30 From: Jim Pivarski (@jpivarski)
I guess it would make sense if it did preserve regularness, but that's not something I'd considered.
Date: 2021-07-19 16:13:51 From: Jim Pivarski (@jpivarski)
>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=True)
>>> sample.type
2 * 2 * 5 * int64
>>> ak.zip({"a": sample, "b": sample}, depth_limit=2).type
2 * var * {"a": var * int64, "b": var * int64}
>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=False)
>>> sample.type
2 * 2 * 5 * int64
>>> ak.zip({"a": sample, "b": sample}, depth_limit=2).type
2 * var * {"a": var * int64, "b": var * int64}
Same thing happens here.
Date: 2021-07-19 16:13:51 From: Angus Hollands (@agoose77:matrix.org)
I've been approaching problems from the perspective of "keeping regular information is better than losing it", whether that's for memory / performance, or other reasons
Date: 2021-07-19 16:14:00 From: Jim Pivarski (@jpivarski)
That makes sense.
Date: 2021-07-19 16:14:49 From: Jim Pivarski (@jpivarski)
Actually, zipping is a special case of broadcasting.
Date: 2021-07-19 16:15:22 From: Jim Pivarski (@jpivarski)
I had thought that broadcasting preserves regularness if and only if all input arrays are regular.
Date: 2021-07-19 16:16:51 From: Angus Hollands (@agoose77:matrix.org)
Ha, yes I'm currently digging into the broadcasting logic
Date: 2021-07-19 16:17:59 From: Jim Pivarski (@jpivarski)
This is the block that handles the "all inputs are RegularArrays" case. It comes before the "all_same_offsets", so that's not relevant. It looks like the output of that block is a RegularArray.
Date: 2021-07-19 16:19:25 From: Jim Pivarski (@jpivarski)
The first question I'd ask is whether that block is triggering. Passing exactly the same array (sample) into it as all fields of the zip really should get you all(isinstance(x, ak.layout.RegularArray) for x in inputs).
Date: 2021-07-19 16:20:15 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: If you're happy that we should be preserving regularity here then I'll dig into it!
Date: 2021-07-19 16:22:26 From: Jim Pivarski (@jpivarski)
It should be preserving regularity. The fact that I didn't have that in mind while writing ak.zip shouldn't mean that it isn't covered. Preserving regularity was an explicit goal in broadcasting (as is evidenced from this code block), and ak.zip uses common broadcasting routines (so that I wouldn't have to think about it explicitly).
Date: 2021-07-19 16:22:52 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I understand, and I'll be looking at _util.broadcast_and_apply
Date: 2021-07-19 16:23:18 From: Angus Hollands (@agoose77:matrix.org)
It looks like its happening at the level above the regular array
Date: 2021-07-19 16:24:30 From: Jim Pivarski (@jpivarski)
Regularness is preserved in simple broadcasting without ak.zip, which makes it sound like something in ak.zip is preventing this from taking effect.
>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=True)
>>> ak.broadcast_arrays(sample, sample)
[<Array [[[1, 2, 3, 4, 5], ... [1, 2, 3, 4, 5]]] type='2 * 2 * 5 * int64'>, <Array [[[1, 2, 3, 4, 5], ... [1, 2, 3, 4, 5]]] type='2 * 2 * 5 * int64'>]
>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=False)
>>> ak.broadcast_arrays(sample, sample)
[<Array [[[1, 2, 3, 4, 5], ... [1, 2, 3, 4, 5]]] type='2 * 2 * 5 * int64'>, <Array [[[1, 2, 3, 4, 5], ... [1, 2, 3, 4, 5]]] type='2 * 2 * 5 * int64'>]
Date: 2021-07-19 16:27:00 From: Jim Pivarski (@jpivarski)
I don't see anything bad in ak.zip's "apply" function: https://github.com/scikit-hep/awkward-1.0/blob/32d54d4979ace913f80f9894d86b912c604a8460/src/awkward/operations/structure.py#L619-L636
Date: 2021-07-19 16:27:40 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I think it's in the broadcasting routines
Date: 2021-07-19 16:27:59 From: Angus Hollands (@agoose77:matrix.org)
I can see the regular arrays being converted to non regular arrays, just narrowing down the cause
Date: 2021-07-19 16:28:17 From: Jim Pivarski (@jpivarski)
Good. That's what I was going to suggest.
Date: 2021-07-19 16:41:14 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I think it's just the regular_to_jagged flag
Date: 2021-07-19 16:50:37 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: hahaha I looked at the git blame and guess who filed the issue that led to this change? 🐑 https://github.com/scikit-hep/uproot4/issues/244
Date: 2021-07-19 16:52:02 From: Jim Pivarski (@jpivarski)
Is "regular_to_jagged" on by default?
Date: 2021-07-19 16:52:09 From: Angus Hollands (@agoose77:matrix.org)
Only for zip
Date: 2021-07-19 16:52:16 From: Jim Pivarski (@jpivarski)
Why is that on at all?
Date: 2021-07-19 16:52:21 From: Angus Hollands (@agoose77:matrix.org)
https://github.com/scikit-hep/awkward-1.0/pull/656/files
Date: 2021-07-19 16:52:42 From: Angus Hollands (@agoose77:matrix.org)
I wonder whether what we really needed to change for the issue above (244) was to just disable right broadcasting?
Date: 2021-07-19 16:55:15 From: Jim Pivarski (@jpivarski)
Fortunately, Uproot and Awkward Array are decoupled; we don't need to pass a "regular_to_jagged" flag to ak.zip in Uproot only because Uproot is now using the ak.Array constructor.
Date: 2021-07-19 16:56:32 From: Jim Pivarski (@jpivarski)
I don't know why we thought that ak.zip should be turning regular arrays into jagged arrays. To avoid right-broadcasting, evidently, but (a) should ak.zip avoid right-broadcasting? and (b) why not just do that with a left_broadcasting=True, right_broadcasting=False?
Date: 2021-07-19 16:57:25 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I am inclined to agree; I don't think it's harmful that broadcast_and_apply has this flag, but zip shouldn't be making that level of structure change imo
Date: 2021-07-19 16:58:12 From: Jim Pivarski (@jpivarski)
So I guess ak.zip shouldn't be turning regular arrays into jagged arrays. That "fix" was an anti-fix. We thought at the time that right-broadcasting was non-intuitive for ak.zip, and that may well be. Right-broadcasting is always counter-intuitive in my opinion. So I guess ak.zip should use right_broadcasting=False instead of regular_to_jagged=True.
Date: 2021-07-19 16:58:33 From: Angus Hollands (@agoose77:matrix.org)
If we revert this change to zip, then users can always restore the old behaviour by manually applying from_regular themselves before zipping.
Date: 2021-07-19 16:58:50 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: did we think that right broadcasting was wrong, or was it just the solution at the time?
Date: 2021-07-19 17:00:41 From: Jim Pivarski (@jpivarski)
We only really need to do right-broadcasting when the function is generalizing NumPy (same name of function and part or all of the space of input arguments is regular).
Date: 2021-07-19 17:00:54 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I see three options here:
1. zip is updated to take flags to control left/right broadcasting
2. no flags, and just use array types (regular or non-regular) to control broadcasting (i.e. all flags set to true in broadcast_and_apply)
3. always left-broadcast
Date: 2021-07-19 17:01:30 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: OK, that weighs in favor of 2 or 3
Date: 2021-07-19 17:01:34 From: Angus Hollands (@agoose77:matrix.org)
Probably 3 more than 2
Date: 2021-07-19 17:02:22 From: Jim Pivarski (@jpivarski)
ak.zip was right-broadcasting to keep with the pattern, but apparently we thought right-broadcasting was nonintuitive for zipping, and since it involves the creation of records, I agree with my past self.
Date: 2021-07-19 17:02:32 From: Jim Pivarski (@jpivarski)
So I guess option 3 makes the most sense here.
Date: 2021-07-19 17:04:20 From: Jim Pivarski (@jpivarski)
Adding a "right_broadcasting=False" argument to the ak.zip function itself is also an option, to provide flexibility for power-users. The default should be False, and anybody who doesn't understand the blurb about it in the documentation can leave it as-is. That would also allow for backward compatibility, though it's opt-in backward compatibility, not automatic.
Date: 2021-07-19 17:04:39 From: Jim Pivarski (@jpivarski)
So that's half of option 1.
Date: 2021-07-19 17:04:58 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: equally, you could argue for a left_broadcasting flag for users that expect their arrays to match in shape
Date: 2021-07-19 17:05:12 From: Angus Hollands (@agoose77:matrix.org)
I'm caught between 1 and 3 :)
Date: 2021-07-19 17:05:24 From: Jim Pivarski (@jpivarski)
I'm advocating half of option 1.
Date: 2021-07-19 17:05:30 From: Angus Hollands (@agoose77:matrix.org)
OK
Date: 2021-07-19 17:06:04 From: Angus Hollands (@agoose77:matrix.org)
What's your take on the left broadcast flag?
Date: 2021-07-19 17:06:06 From: Jim Pivarski (@jpivarski)
Left-broadcasting is very natural for zipping, considering that depth_limit also counts from the left.
Date: 2021-07-19 17:06:49 From: Jim Pivarski (@jpivarski)
It's so natural that I think I wouldn't make it an option. There's an asymmetry here. If someone really wants neither left nor right broadcasting, they can just ensure that all arrays have the same depth (dimension).
Date: 2021-07-19 17:07:31 From: Angus Hollands (@agoose77:matrix.org)
Well, besides my contrived example I can't imagine it being used, so I vote for your proposal - if this becomes an issue someone can raise it again?
Date: 2021-07-19 17:08:12 From: Jim Pivarski (@jpivarski)
The internal broadcast_and_apply function has both flags for maximum flexibility, because it's used in a lot of functions, some of which might be right-broadcast only. I think having both flags on a public-facing function would cause confusion ("which one should I use?").
Date: 2021-07-19 17:08:28 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: agreed.
Date: 2021-07-19 17:09:03 From: Angus Hollands (@agoose77:matrix.org)
I'll open a ticket.
Date: 2021-07-19 17:10:48 From: Jim Pivarski (@jpivarski)
Thanks!
Date: 2021-07-19 17:11:51 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: for posterity, there are a lot of places where we broadcast and apply e.g. with_field. You've previously commented (in commits) that we nearly never want this, and we don't expose the flag for these functions. Should we perhaps just not expose any flag here, and expect users to pre-broadcast their arrays?
Date: 2021-07-19 17:12:07 From: Angus Hollands (@agoose77:matrix.org)
I.e. fully commit to left-broadcast in non-Numpy analogues
Date: 2021-07-19 17:13:48 From: Jim Pivarski (@jpivarski)
So, for all non-NumPy analogues that use broadcast_and_apply, add a right_broadcast=False keyword argument, passing that through? I could be onboard with that.
Date: 2021-07-19 17:15:10 From: Jim Pivarski (@jpivarski)
Some choice has to be made, and disabling right-broadcasting by default (with a flag to reverse that decision) is probably less error-prone than enabling it by default.
Date: 2021-07-19 17:16:44 From: Angus Hollands (@agoose77:matrix.org)
I think that's what I'm advocating for. I found it difficult previously to predict what would happen with any complex expression because I wasn't sure if the array was regular or not by the end. I think zip losing regularity was one of the reasons for this.
Date: 2021-07-19 17:16:49 From: Jim Pivarski (@jpivarski)
The problem is "guessability": users have to be able to guess what a function is going to do without looking it up all the time, and definitely without having to understand something subtle like right-broadcasting. If some functions have one default and other functions have another, that's not guessable. But the rule here would be: if it implements a NEP 18 override (@ak._connect._numpy.implements), then right-broadcasting is on by default; otherwise not.
Date: 2021-07-19 17:19:00 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: just to play devil's advocate, could there be a world in which the user could force one kind of broadcasting?
Date: 2021-07-19 17:19:29 From: Angus Hollands (@agoose77:matrix.org)
Never mind how it might be hard to move to from the current approach.
Date: 2021-07-19 17:21:18 From: Jim Pivarski (@jpivarski)
If a user wants to be that involved, they can just make their arrays the same number of dimensions. Length-1 dimensions always broadcast—the left vs right business is for creating a dimension to broadcast, whether it should be created on the left (topmost level) or right (deepest level).
Date: 2021-07-19 17:21:45 From: Angus Hollands (@agoose77:matrix.org)
Yes, forget it - I'm overthinking things
Date: 2021-07-19 17:22:09 From: Jim Pivarski (@jpivarski)
As such, it's a convenience. You can use np.newaxis to make a new axis at exactly the place where you want it (not entirely on the left or entirely on the right, in fact), and then you're completely controlling the broadcasting.
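A short illustration of that np.newaxis point (plain NumPy, nothing Awkward-specific): placing the new axis yourself fully controls where broadcasting happens.

```python
import numpy as np

a = np.arange(3)                          # shape (3,)
b = np.arange(4)                          # shape (4,)

# Explicit axis placement: (3, 1) against (1, 4) broadcasts to (3, 4),
# with no reliance on left- or right-broadcasting conventions.
outer = a[:, np.newaxis] + b[np.newaxis, :]
assert outer.shape == (3, 4)
```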
Date: 2021-07-19 17:22:43 From: Jim Pivarski (@jpivarski)
So yeah—all we're doing here is trying to minimize surprise, including the surprise of it not broadcasting.
Date: 2021-07-19 17:23:26 From: Angus Hollands (@agoose77:matrix.org)
So recapping: we intend to replace right_broadcast=False usage inside Awkward functions with right_broadcast=right_broadcast, plus a right_broadcast=False flag?
Date: 2021-07-19 17:24:02 From: Angus Hollands (@agoose77:matrix.org)
And separately, replace the use of regular_to_jagged with something that disables right broadcasting (by default, or as a flag)
Date: 2021-07-19 17:25:36 From: Jim Pivarski (@jpivarski)
1. ak.zip should not be using regular_to_jagged; that should be replaced by right_broadcast=False, and the default should be set in the ak.zip argument list, to allow for user control.
Date: 2021-07-19 17:26:45 From: Angus Hollands (@agoose77:matrix.org)
Fab
Date: 2021-07-19 17:26:53 From: Jim Pivarski (@jpivarski)
2. All operations that don't implement a NumPy function with NEP 18 (@ak._connect._numpy.implements decorator) should be right_broadcast=False by default, in the high-level function argument list (like ak.zip), for control.
Date: 2021-07-19 17:27:57 From: Jim Pivarski (@jpivarski)
Item 1 comes directly from the bug you encountered; item 2 is a matter of making it homogeneous and therefore guessable.
Date: 2021-07-19 17:29:19 From: Angus Hollands (@agoose77:matrix.org)
Yes that's my take
Date: 2021-07-20 13:52:48 From: Storm Lin (@stormsomething)
Hello! I have a question that I was told to ask here.
I am trying to evaluate a string using something like eval or numexpr.evaluate, similar to how you can specify an uproot cut with a string.
The string I am trying to evaluate includes both numpy functions such as sqrt and sin as well as slices of awkward arrays like a[:,0].
This doesn't work with eval because I can't pass in a reference to numpy if I have a NanoEventsArray as my locals, and it doesn't work with numexpr.evaluate because it says VariableNode is not a subscriptable type.
Is there a way to make this work or what I should try instead?
Date: 2021-07-20 13:57:01 From: Angus Hollands (@agoose77:matrix.org)
@stormsomething: could you elaborate on why you're trying to do this? You don't need to justify what you're doing, I just want to get a better understanding of the context.
Date: 2021-07-20 14:06:17 From: Storm Lin (@stormsomething)
As part of the cabinetry package, the goal is for the user to be able to specify cuts and variables using uproot-like strings, things like "Filter": "lep_charge>0".
Using eval works with basic expressions like the example in the previous sentence, but it fails when trying to do things like "Variable": "lep_pt[:,0]*cos(lep_phi[:,0])".
Date: 2021-07-20 14:12:01 From: Angus Hollands (@agoose77:matrix.org)
If you just want to evaluate Python expressions, eval should be fine here. I don't quite understand what
This doesn't work with eval because I can't pass in a reference to numpy if I have a NanoEventsArray as my locals
means. Could you explain this?
Date: 2021-07-20 14:26:21 From: Storm Lin (@stormsomething)
I am passing a NanoEventsArray (a coffea object where you get awkward arrays using string keys) into the locals argument of eval. I am also trying to have {"np": np} in the globals argument of eval, so I can use numpy functions, but when I do this, it throws a ValueError: key "np" does not exist (not in record).
Date: 2021-07-20 14:27:26 From: Jim Pivarski (@jpivarski)
I hope cabinetry is not developing in such a way as to require Uproot's cut and expression strings. Those were motivated by ROOT aliases and will therefore evolve toward TTree::Draw-style syntax (someday the default "language" will switch from Python to TTree::Draw, once we have a good parser).
If cabinetry wants to run Python expressions from strings, it can run eval or exec in the for loop, outside of Uproot. That would decouple Uproot's handling of these strings, which have to contend with branch names containing dots and TTree::Draw syntax, from cabinetry's handling of these strings, which wants more complex expressions, such as slices with [:, 0] and (in another message from @alexander-held) lambda expressions.
It's always better to decouple things that can be decoupled, especially if they have very different use-cases and would be pulling in different directions (Uproot's strings toward TTree::Draw cases and cabinetry toward general Python).
Date: 2021-07-20 14:29:34 From: Angus Hollands (@agoose77:matrix.org)
@stormsomething: without seeing your code, I can't tell what might be happening very easily. However, your error sounds as though something is evaluating array["np"] instead of bare "np". Does cabinetry try and lookup names as keys in the NanoEventsArray by default? That would explain it.
Date: 2021-07-20 14:31:23 From: Jim Pivarski (@jpivarski)
If you have a NanoEventsArray, then you definitely don't want to be doing Uproot expression strings because Uproot strings only know about TBranch names, not the field names that NanoEvents replaces them with. You don't want your users to have to use different names in different places (e.g. TBranch name Muon_phi in the Uproot string and the NanoEvents name lep_phi elsewhere). If you're in the NanoEvents abstraction, you want to use NanoEvents throughout, and therefore write your own eval/exec.
Date: 2021-07-20 14:35:25 From: Jim Pivarski (@jpivarski)
If you're seeing that ValueError, something is trying to extract "np" as an attribute or property of the NanoEventsArray, which is not what putting {"np": np} as the globals of an eval should do. It sounds like you're working against some system that does too much; you need things to be more decoupled.
Plain-Python's eval (and exec) don't try to extract fields from a NanoEventsArray. What eval are you using? It sounds like it's something built into the NanoEvents infrastructure.
Date: 2021-07-20 14:38:24 From: Storm Lin (@stormsomething)
Yeah that makes sense as something that could be happening where eval only tries to find "np" in the NanoEventsArray keys and never looks in the globals. Thanks, I think this may help me solve my problem.
Date: 2021-07-20 14:43:56 From: Storm Lin (@stormsomething)
The method uses the Plain-Python eval. A NanoEventsArray is passed into locals and is treated as essentially a dict from what I can tell.
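For contrast, a plain dict of arrays as locals and {"np": np} as globals works fine with the built-in eval (the field name here is made up for illustration); the lookup failure above comes from the NanoEventsArray's own key lookup, not from eval itself.

```python
import numpy as np

# An ordinary dict as the locals namespace: "lep_charge" resolves from
# locals and "np" from globals, with no custom mapping in between.
fields = {"lep_charge": np.array([1, -1, 1])}
mask = eval("lep_charge > 0", {"np": np}, fields)
assert mask.tolist() == [True, False, True]
```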
Date: 2021-07-20 14:48:46 From: Angus Hollands (@agoose77:matrix.org)
This will be the "problem". If you're writing your own eval-based routine, just create your own namespace, e.g.
records = dict(zip(ak.fields(array), ak.unzip(array)))
namespace = {**records, "np": np, ...}
This is assuming that your array behaves like an Awkward record array.
Date: 2021-07-20 14:55:36 From: Jim Pivarski (@jpivarski)
I think it's ak.fields(array). The NanoEvents probably defines properties that aren't technically fields, and retrieving them might cause them to be calculated, which may be too much calculation up-front. But given a string to be interpreted as Python, you can ast.parse it into an AST and walk that tree, looking for instances of ast.Name to either get from the NanoEventsArray as attributes (getattr, which works for both fields and properties) or to recognize as "np" or other global names.
Date: 2021-07-20 14:57:25 From: Jim Pivarski (@jpivarski)
The main thing is for cabinetry to "own" the string-evaluation, to do it independently of other libraries, relying on standards like the Python standard library more than other libraries. This is how you can keep things decoupled so that we don't get into these problems of library A needs string processing to do XYZ and library B needs it to not do XYZ.
Date: 2021-07-20 14:57:25 From: Angus Hollands (@agoose77:matrix.org)
whoops, yep.
Yes, it depends on the scope of what needs to be achieved here. It might be just as straightforward to declare a custom Mapping that tries a pre-registered namespace for numpy functions etc., and falls back upon the NanoEvents array
Date: 2021-07-20 15:03:05 From: Jim Pivarski (@jpivarski)
Starting from a string to be interpreted as Python gives you the most information and the most tools to work with. If you had been starting with a function (e.g. a Python lambda), then you'd be in a more difficult position.
See Python's ast module: https://docs.python.org/3/library/ast.html
The function for turning a string into an AST is ast.parse, and you could write a NodeVisitor to recursively walk over that tree, but I usually don't even bother with that. Python AST consists of ast.AST instances, the fields of each subclass are in a ._fields attribute (stable API), or lists, or numbers, or strings, or None.
So this is a generic node visitor:
Date: 2021-07-20 15:04:57 From: Angus Hollands (@agoose77:matrix.org)
I suppose I'm asking the question of do we need an AST for this or is the simple eval-method reasonable? In my case, I don't need too much persuading to add an AST (heck, I even wrote my own for Python's grammar ... don't ask)
Date: 2021-07-20 15:06:35 From: Storm Lin (@stormsomething)
The code for making a new namespace seems to have fixed my original problem and the method I'm working on now produces the intended result.
Date: 2021-07-20 15:07:11 From: Jim Pivarski (@jpivarski)
def recurse(node):
    if isinstance(node, ast.AST):
        for field in node._fields:
            recurse(getattr(node, field))
    elif isinstance(node, list):
        for x in node:
            recurse(x)
    else:
        pass
You can then put specialized checks at the beginning of this if-elif chain, such as
if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
    # this may be a field of the NanoEvents or "np", etc.
Or you could change the AST, instrumenting the code, or return something else.
Already-parsed ASTs don't have to be parsed again; you can pass an AST to Python's built-in compile and then pass that to eval or exec. That would allow you to execute modified ASTs (if you have any reason to modify them).
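The compile-an-AST step mentioned above can be sketched in a few lines (the expression and variable name are made up):

```python
import ast

# Parse once, optionally modify the tree, then compile and evaluate it
# directly; the string never needs to be re-parsed.
tree = ast.parse("x + 1", mode="eval")
code = compile(tree, filename="<expr>", mode="eval")
value = eval(code, {"x": 41})  # → 42
```

The same compiled code object can be evaluated many times against different environments, which is useful when the same cut string is applied to many chunks of data.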
Date: 2021-07-20 17:29:07 From: Alexander Held (@alexander-held)
Thanks a lot for all the useful discussion here! With the analysis Storm is working with, the limits of what can be nicely specified as a string are starting to become apparent. To address Jim's question of cabinetry requiring uproot style strings: at the moment those are used because they have worked nicely so far, but now that we use both uproot and coffea backends these differences become apparent.
It sounds to me from the comments above that the idea of handling the AST manually within cabinetry is maybe not that bad of an idea? I thought it might be reinventing the wheel and not a good design decision.
Date: 2021-07-20 17:33:24 From: Alexander Held (@alexander-held)
There are clearly limits with complexity where this string approach is not feasible anymore, and people just need to provide code directly (e.g. a full coffea processor). But then maybe for some intermediate cases we need an AST module in cabinetry. Is this something that would overlap a lot with the uproot parsing? I don't have a feeling for the complexity.
Date: 2021-07-20 17:40:20 From: Angus Hollands (@agoose77:matrix.org)
@alexander-held: the complexity required for AST operations is reasonably small — the Python AST module does most of the heavy lifting, as Jim mentioned earlier.
The following gives you the full AST:
node = ast.parse(string)
And you can then walk the tree to find names, etc. with a few more lines of code.
Date: 2021-07-20 19:34:51 From: Jim Pivarski (@jpivarski)
If these strings are interpreted as Python code, they're not limited in scope to a declarative language or anything, even though examples such as "lep_charge > 0" might make it look that way. As long as what's in the string is Python, it could be "complex_function_that_does_bitcoin_mining_and_controls_nuclear_power_plants() > 0".
If your motive is to make analysis easier to reason about or easier to preserve by limiting the kinds of code that can run as cuts and one-to-one computations (i.e. ufuncs/mappers), then you'll need to define a language or a language subset. That's not as hard as it may sound, particularly the latter: you can subset Python by walking an AST and rejecting the syntax structures you do not want to support. Interpreting free variables as NanoEvents field names is another modification of the language that would make it quicker and easier to use.
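One way to sketch the "subset Python by rejecting syntax structures" idea; the choice of what to forbid here is illustrative, not a recommendation:

```python
import ast

# Walk the AST of an expression and reject node types you don't want in
# a declarative cut language. The forbidden set is a made-up example.
FORBIDDEN = (ast.Call, ast.Attribute, ast.Import, ast.ImportFrom)

def check_subset(source):
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN):
            raise SyntaxError(type(node).__name__ + " is not allowed here")
    return tree

check_subset("lep_charge > 0")  # accepted

try:
    check_subset("dangerous_function() > 0")
except SyntaxError as err:
    print(err)  # Call is not allowed here
```

Because the rejection happens before compile/eval, the disallowed code never runs; growing the language back is just removing entries from the forbidden set.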
If I were to classify easy and hard problems, I'd classify language subsets or modifications when you're starting from unparsed strings as easy, but decoupling conflicting constraints from what cabinetry wants these strings to be and what Uproot wants these strings to be as a hard problem.
For Uproot, subsetting Python was considered a stepping stone toward supporting TTree::Draw, a query language like func_adl, or others (maybe an SQL subset?) with the understanding that the default would change when TTree::Draw is available (with an error warning that tells the user how to get Python interpretation back, if they need it). That's because aliases in TTrees are written in TTree::Draw syntax.
The good news is that subsetting Python is not hard (it was, after all, a quick-and-dirty solution). As long as you start with a string, the Python standard library gives you complete freedom with lots of good tools: ast.parse does the hard step of parsing and compile does the hard step of compiling, while any AST interpretation/modification and setting up the globals/locals environment (the easy parts) are up to you.
This is not a case where piggy-backing on what Uproot has done would make things easier, due to the conflicting intentions of what we want the strings to do, but learning from it as an example (here: https://github.com/scikit-hep/uproot4/blob/main/src/uproot/language/python.py) can be very effective. In that file, you'll find AST-walkers, things for setting up the environment, and the compile → eval/exec step, coming from an AST, rather than an unparsed string.
If you have no interest in subsetting the language to make it declarative—if you'd be happy with users writing large Python code blocks in the cabinetry YAML that import other libraries and do complex stuff (as long as the inputs and outputs are well defined), then you may just want to scan the AST and replace instances of ast.Name objects naming fields of the NanoEvents with nanoevents.field. Or, you might simplify the problem further by saying, "You can't refer to NanoEvents fields as free variables; they have to be attributes of a global object named g, such as g.this and g.that." Then you don't have to modify the AST at all, just provide g in the environment. (It also avoids conflicts and provides a way of dealing with bad identifiers: g("bad-identifier").
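The "global object g" variant can be sketched without touching the AST at all; the Fields class and field names here are hypothetical:

```python
# Fields wraps a dict so that expressions refer to fields as attributes
# of a single global object g (g.this, g.that) rather than free
# variables, so no AST modification is needed. Calling g("...") handles
# names that aren't valid Python identifiers.
class Fields:
    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):  # g.this, g.that
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name)

    def __call__(self, name):  # g("bad-identifier")
        return self._data[name]

g = Fields({"lep_charge": 7, "bad-identifier": 3})
print(eval("g.lep_charge > 0", {"g": g}))         # True
print(eval('g("bad-identifier") + 1', {"g": g}))  # 4
```

Providing g in the eval environment is the only integration point, which is what makes this the least invasive of the options discussed.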
Date: 2021-07-20 21:30:19 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: one bit of logic that I don't follow at the moment is the asslice behaviour. In particular, the RegularArray::asslice.
Date: 2021-07-20 21:30:55 From: Angus Hollands (@agoose77:matrix.org)
My current understanding is that asslice is only for Awkward advanced indexing (i.e. all the dimensions need to be var-sized), so it doesn't handle advanced numpy broadcasting
Date: 2021-07-20 21:31:36 From: Angus Hollands (@agoose77:matrix.org)
If this is the case, it surprises me that RegularArray supports asslice at all
Date: 2021-07-20 21:33:39 From: Angus Hollands (@agoose77:matrix.org)
Naively reading the code, it looks like RegularArrays with size==1 are supposed to introduce a new dimension into a jagged index result?
Date: 2021-07-20 21:39:51 From: Angus Hollands (@agoose77:matrix.org)
OK, looking at the PR that introduces it I think that is the behavior, and asslice makes sense.
Date: 2021-07-20 21:53:04 From: Angus Hollands (@agoose77:matrix.org)
I'll move back to the GitHub issue
Date: 2021-07-20 21:55:18 From: Jim Pivarski (@jpivarski)
Awkward v1 converts each type of object that can be a slice into an equivalent C++ object, since libawkward.so has no dependence on libpython.so. The asslice function performs that conversion (a bit more complicated than it needs to be because things were added on later).
Broadcasting is entirely in Python and has nothing to do with slicing or C++.
Date: 2021-07-20 21:56:56 From: Angus Hollands (@agoose77:matrix.org)
☝️ Edit: My current understanding is that asslice is only for Awkward advanced indexing (i.e. all the dimensions need to be var-sized), so it doesn't handle advanced numpy indexing
Date: 2021-07-20 21:57:03 From: Angus Hollands (@agoose77:matrix.org)
That's a typo on my part!
Date: 2021-07-20 22:16:13 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org asslice turns every kind of slicer into a C++ object, for basic indexing (i.e. integers and slice objects), NumPy advanced indexing (i.e. arrays of integers and booleans), and Awkward advanced indexing (i.e. arrays of missing and jagged data), as well as odds and ends like Ellipsis and np.newaxis.
Date: 2021-07-20 22:19:17 From: Angus Hollands (@agoose77:matrix.org)
Are we talking about the same Content::asslice? The fact that regulararray raises an exception unless the size == 1 made me think otherwise. And indeed, in toslice_part (https://github.com/scikit-hep/awkward-1.0/blob/aa5137ba28a71139fe6ae9f5fa0dd47cc7ea090d/src/python/content.cpp#L578) asslice is only called if we don't want to handle as numpy
Date: 2021-07-20 22:25:24 From: Angus Hollands (@agoose77:matrix.org)
I explained my current understanding on the issue 🙂 https://github.com/scikit-hep/awkward-1.0/issues/1022#issuecomment-883733719
Date: 2021-07-20 22:25:41 From: Angus Hollands (@agoose77:matrix.org)
I'm heading off now ... Good night!
Date: 2021-07-20 22:33:33 From: Jim Pivarski (@jpivarski)
Eventually, the toslice_part function turns NumPy arrays into SliceArray64 objects. The code path is convoluted, so it might do so through handle_as_numpy or it might do so further down in the else of that function (handle_as_numpy goes through Contents, including NumpyArray; the older code in the else uses pybind's conversion to buffers).
The code here determines whether the output is SliceArray64 (NumPy advanced indexing) or SliceJagged64/SliceMissing64 (Awkward advanced indexing), but one way or another, the output is a list of SliceItems in a Slice object or it raises an exception saying that the object can't be used as a slicer.
Date: 2021-07-20 22:33:39 From: Jim Pivarski (@jpivarski)
'Night!
Date: 2021-07-21 14:32:21 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org I've been investigating https://github.com/scikit-hep/awkward-1.0/issues/1022. Normally, the mixed variable and regular slicer that you made would be declared invalid: all-variable dimensions triggers Awkward advanced indexing and all-regular triggers NumPy advanced indexing; a mixture would be confusing, so it's not allowed.
However, it sneaks through because length-1 regular arrays in a slice are interpreted as SliceVarNewAxis: https://github.com/scikit-hep/awkward-1.0/pull/694/files#diff-a63810de74c2520ec41382cece2d156993c47ba9eb69772ce6b10a8262536e22
The use of the word "newaxis" is a little misleading; this is not representing a np.newaxis object in a slice, but a length-1 regular axis that was probably made by a np.newaxis. These tests demonstrate its use: https://github.com/scikit-hep/awkward-1.0/pull/694/files#diff-822ebabcc1ec64f9f91037a24a786bc869db7a9ebd334cf14189f5d8c0149988 (the np.newaxis modifies the slicer, which is then used to slice the array; SliceVarNewAxis objects are created when slicing array, not when slicing slicer).
This feature was added in a rush to prepare a tutorial that never happened (https://github.com/jpivarski-talks/2021-02-09-propose-scipy2021-tutorial/blob/main/prep/million-song.ipynb). It was the most minimal way I could see to add features that were necessary to do the analysis in that tutorial. The idea was that this is replicating a NumPy feature—broadcasting length-1 dimensions in slices—in Awkward advanced indexing. But Awkward advanced indexing fits a nested structure in the slicer to the nested structure that you're slicing, whereas NumPy advanced indexing slices each dimension by a different array, and all of those arrays in the tuple are broadcasted. NumPy advanced indexing is truly broadcasting because there are multiple arrays in the slicer; Awkward advanced indexing has only one array in the slicer, so it's not really broadcasting.
Treating a length-1 dimension differently from any other length makes this rule hard to predict. The idea was that you'd get the length-1 dimension from a np.newaxis, but in your case, you got it from a reducer with keepdims=True. I'm thinking this was a bad rule to have introduced: it has unforeseen consequences.
In the test suite, the rule is only triggered in the tests that were added to check it. The rule was never advertised (after all, that tutorial was never presented), and it is unlikely to have made its way into any analyses other than yours, since it's rather easy to trigger the FIXME (which is much older, and may yet be unreachable without the new SliceVarNewAxis rule). Slices that are enabled by the rule can still be performed without it—for instance, this test using the rule:
array = ak.Array(
    [
        [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14]],
        [[15, 16, 17, 18, 19], [20, 21, 22, 23, 24], [25, 26, 27, 28, 29]],
    ]
)
slicer = ak.Array([[3, 4], [0, 1, 2, 3]])
assert array[slicer[:, np.newaxis]].tolist() == [
    [[3, 4], [8, 9], [13, 14]],
    [[15, 16, 17, 18], [20, 21, 22, 23], [25, 26, 27, 28]],
]
can be replaced by the following, without the new rule:
assert array[[[[3, 4], [3, 4], [3, 4]], [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]]].tolist() == [
    [[3, 4], [8, 9], [13, 14]],
    [[15, 16, 17, 18], [20, 21, 22, 23], [25, 26, 27, 28]],
]
Your use-case is also possible, but only if the slicer is all-variable (in keeping with the rule to avoid confusion between Awkward advanced indexing and NumPy advanced indexing):
>>> y = ak.Array([[[1, 2, 3, 4], [5, 6, 7, 8]]])
>>> t = ak.argmax(y, axis=-1, keepdims=True)
>>> y[t]
<Array [[[4], [8]]] type='1 * var * var * ?int64'>
(y is allowed to have regular dimensions, but t can't mix regular with irregular. You could keep y irregular or make t regular.)
So I think I'm going to remove it, which reverts only this PR: https://github.com/scikit-hep/awkward-1.0/pull/694.
Date: 2021-07-21 14:38:11 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: I replied on the issue and came to a similar conclusion that as this isn't an advertised feature of jagged indexing yet, it's not something that people should be relying on — if we didn't support SliceVarNewAxis then my case was an error because I had a regular dim.
I think the goal is a really good one - I would like for advanced indexing to support this kind of broadcasting, I feel as though I have run into needing it before, and I believe @nsmith- did for his use case in one of the motivating issues.
Although it is spooky behaviour, I think that's really just a manifestation of the fact that reg and var dimensions need to be carefully discriminated between, as I mentioned. That's a tricky point that people need to know in any case to use Awkward.
Date: 2021-07-21 14:38:44 From: Angus Hollands (@agoose77:matrix.org)
However, if the current implementation isn't right (and I don't think it is quite), perhaps removing it for now is a good idea so that we don't constrain the future development with an old design?
Date: 2021-07-21 14:47:36 From: Jim Pivarski (@jpivarski)
@agoose77:matrix.org Since you're likely the only person to have run into it, I think it can be quietly removed, as long as you're okay with that. It sounds like you are.
Also, you're okay with doing your slice as y[t] (if y is variable) or y[ak.from_regular(t, axis=2)] (if the last axis of y is regular)? It may be that ak.from_regular and ak.to_regular will need to support axis=None (probably by default) to make these conversions easier.
Date: 2021-07-21 14:48:29 From: Jim Pivarski (@jpivarski)
With your approval, I'm going to revert https://github.com/scikit-hep/awkward-1.0/pull/694 (manually, keeping some things like the better diagnostics I added while doing it).
Date: 2021-07-21 14:48:41 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski: yes, I have been thinking that to/from_regular need an "all" equivalent for a while actually (but too lazy to do it!)
Date: 2021-07-21 14:48:52 From: Angus Hollands (@agoose77:matrix.org)
I have no issues here - the only place I've run into it knowingly is the case that raises the RuntimeError
Date: 2021-07-21 14:49:32 From: Jim Pivarski (@jpivarski)
If I have time after reverting that PR, I'll add axis=None to to/from_regular.
Date: 2021-07-21 14:50:04 From: Jim Pivarski (@jpivarski)
Okay, I'm reverting https://github.com/scikit-hep/awkward-1.0/pull/694.
Date: 2021-07-21 14:50:18 From: Angus Hollands (@agoose77:matrix.org)
Something that has been gnawing away at me is whether we need the regular vs var distinction w.r.t. high-level routines. At the content level, it makes sense because knowing something's regular means we can do things more efficiently. But having to ensure that something is regular/irregular to use a feature feels frustrating sometimes. It might well be that it's required, but I wonder...
Date: 2021-07-21 14:51:07 From: Jim Pivarski (@jpivarski)
I think most people use variable-length lists all the time.
Date: 2021-07-21 14:51:31 From: Angus Hollands (@agoose77:matrix.org)
The main cases I can think of are
- Detecting left vs right broadcasting
- Numpy indexing vs Jagged indexing
Date: 2021-07-21 14:52:16 From: Jim Pivarski (@jpivarski)
I'm pretty sure that's a complete list of things that depend on reg vs var.
Date: 2021-07-21 14:53:10 From: Angus Hollands (@agoose77:matrix.org)
Probably. I tend to gravitate to regular layouts when I know that my data are regular, e.g. a block of waveform samples. It feels a bit strange to require more memory and/or a copy (do we check whether the offsets in a ListOffsetArray are equally sized, to avoid copying when converting to regular?) in order to express storage-unrelated semantics
Date: 2021-07-21 14:55:32 From: Angus Hollands (@agoose77:matrix.org)
However, the more I try to think about "solving this", the ultimate answer is that it's very difficult to maintain transparent NumPy compatibility and everything else!
Date: 2021-07-21 14:56:23 From: Angus Hollands (@agoose77:matrix.org)
One thing I did think about was - do we need to require all var dimensions for jagged indexing? would it be reasonable to only require a single var dimension?
Date: 2021-07-21 15:00:34 From: Angus Hollands (@agoose77:matrix.org)
This might simplify the case where one mixes regular and irregular data. It doesn't really solve the case that we need to guarantee that at least one-dim is var for indexing, but it might loosen the restriction where it isn't useful?
Date: 2021-07-21 15:01:14 From: Jim Pivarski (@jpivarski)
One possibility would be to say, "If one dimension is var, the rest are converted to var before slicing." Then the slicing rules would usually be Awkward's, with NumPy rules only taking effect in the all-regular case. The downside of this is that if someone is relying on NumPy rules, it could accidentally slip into Awkward rules without an error message for small deviations. The "all var or all reg" rule was to put a demilitarized zone between the two cases.
Date: 2021-07-21 15:02:47 From: Angus Hollands (@agoose77:matrix.org)
A fair point. I'm approaching this from an "I want Awkward semantics" perspective, but both are reasonable perspectives. I'm going to put this one down here because I don't think there's an obvious "better" answer than what we have currently
Date: 2021-07-21 15:52:59 From: Jim Pivarski (@jpivarski)
I made a notification here: https://github.com/scikit-hep/awkward-1.0/discussions/1027
and will start removing it now.
Date: 2021-07-21 16:57:33 From: Angus Hollands (@agoose77:matrix.org)
Thanks Jim. I've thought about it some more, and I can see the case for never replacing it in the same form.
My example wasn't really a problem because reducers on the same array will behave identically. It's only if the user produces an index with a regular 1-length dimension from some other process that this is an issue, because without SliceVarNewAxis it would normally raise an error.
Although it is useful to broadcast like this, it adds another mode of indexing that people need to be aware of, and reduces the ways that we can warn people about mistakes in their code (e.g. reducing a dimension wrongly).
Date: 2021-07-21 16:59:21 From: Angus Hollands (@agoose77:matrix.org)
Off-hand, do you have a personal rule of thumb about when to move something into its own record array, vs keeping everything top level, e.g.
n * var * {"x": float64, "y": float64, "t": float64}
vs
n * var * {"pos": {"x": float64, "y": float64}, "t": float64}
I can't seem to settle on a "good" solution besides "if it needs its own behaviour, put it in a RecordArray"
Date: 2021-07-21 17:13:06 From: Jim Pivarski (@jpivarski)
Changing the record-nesting depth without any variable-length structures is both technically and semantically equivalent to renaming the fields. Instead of calling a field "x", it is now called "pos.x" (thinking of the dot as part of the name, though the implementation is two lookups, rather than one).
But the fact that you might want to isolate behavior is a good reason for doing it that I don't usually think of. With the "x" and "y" sectioned off, you can add behaviors to "pos" that have nothing to do with "t". Since the functions you write to define behaviors are duck-typed, they can be written for an "x", "y", "t" array and simply ignore the presence of "t" (and therefore work equally well on a record without "t"), so this, too, is a "soft" reason for it. (If you're collaborating, defining behaviors for a restricted "pos" prevents collaborators from including "t" in any functions, which helps to set boundaries and avoid dependency-creep.)
In the end, there's no "hard," technical rule for why you should want one structure rather than another, but it can change the mindset of how they're supposed to be thought of.
Date: 2021-07-21 18:04:31 From: Angus Hollands (@agoose77:matrix.org)
@jpivarski that's exactly it - they're both near-'equivalent' from a design-constraints perspective. I'm currently settled on "make everything top level unless behaviour or structure dictates otherwise"
Date: 2021-07-22 22:20:40 From: Angus Hollands (@agoose77:matrix.org)
So I did some more thinking on this, and concluded t
