The `library="pd"` option sometimes returns a Pandas DataFrame, sometimes a tuple of DataFrames #114

bdrum · 2020-09-24T14:11:12Z

Hi, guys!
Thanks again (I'll never tire of repeating it 👍) for the package!

I've ported my scripts from uproot to uproot4 and noticed a behavior that I haven't seen described in the docs:

When library='pd' is passed, uproot4 returns a tuple of DataFrames for me:

import uproot4
events2 = uproot4.open("http://scikit-hep.org/uproot/examples/HZZ.root")["events"]
df2 = events2.arrays(entry_stop=10, library='pd')
type(df2)
# tuple

but the previous version returned a single DataFrame:

import uproot
events2 = uproot.open("http://scikit-hep.org/uproot/examples/HZZ.root")["events"]
df = events2.pandas.df(["MET_p*", "Muon_P*"], entrystop=10)
type(df)
# pandas.core.frame.DataFrame

What I can't figure out is the reason for this behavior. Moreover, it looks like the data in each DataFrame of the tuple is not unique and has duplicates, so I can't merge them.

jpivarski · 2020-09-24T14:50:52Z

jpivarski
Sep 24, 2020
Maintainer

This is a change in behavior from Uproot 3, and if, after understanding the reason why, you have some suggested text for how it should be explained in the documentation, let me know.

Although Pandas DataFrames can describe jagged data by putting the jaggedness into a MultiIndex, a DataFrame can have only one index. Thus, data with different multiplicities can't go into the same DataFrame.
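To make the "only one index" point concrete, here is a minimal pandas-only sketch. The helper `jagged_to_frame` and the toy values are made up for illustration; this is not how Uproot builds its DataFrames internally, just a way to see why two collections with different multiplicities end up with different (entry, subentry) indexes.

```python
import pandas as pd


def jagged_to_frame(name, jagged):
    """Flatten a list of per-event lists into a DataFrame indexed by (entry, subentry).

    Hypothetical helper for illustration only.
    """
    index = pd.MultiIndex.from_tuples(
        [(entry, sub) for entry, values in enumerate(jagged) for sub in range(len(values))],
        names=["entry", "subentry"],
    )
    flat = [value for values in jagged for value in values]
    return pd.DataFrame({name: flat}, index=index)


# Toy data (made up): three events with different numbers of jets and muons.
jets = jagged_to_frame("Jet_Px", [[], [-38.9], [-71.7, 36.6, -28.9]])
muons = jagged_to_frame("Muon_Px", [[-52.9, 37.7], [-0.8], [49.0, 0.8]])

# The two collections have different lengths and different indexes,
# so neither index can serve as "the" index of a shared DataFrame.
print(len(jets), len(muons))
print(jets.index.equals(muons.index))
```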

Below, we ask for both jets and muons. The jet index (in each event) is unrelated to the muon index (in each event), so they really have to be in different DataFrames.

>>> events.arrays(filter_name=["Jet_P*", "Muon_P*"], library="pd")
(                   Jet_Px     Jet_Py      Jet_Pz
entry subentry                                  
1     0        -38.874714  19.863453   -0.894942
3     0        -71.695213  93.571579  196.296432
      1         36.606369  21.838793   91.666283
      2        -28.866419   9.320708   51.243221
4     0          3.880162 -75.234055 -359.601624
...                   ...        ...         ...
2417  0        -33.196457 -59.664749  -29.040150
      1        -26.086025 -19.068407   26.774284
2418  0         -3.714818 -37.202377   41.012222
2419  0        -36.361286  10.173571  226.429214
      1        -15.256871 -27.175364   12.119683

[2773 rows x 3 columns],
                   Muon_Px    Muon_Py     Muon_Pz
entry subentry                                  
0     0        -52.899456 -11.654672   -8.160793
      1         37.737782   0.693474  -11.307582
1     0         -0.816459 -24.404259   20.199968
2     0         48.987831 -21.723139   11.168285
      1          0.827567  29.800508   36.965191
...                   ...        ...         ...
2416  0        -39.285824 -14.607491   61.715790
2417  0         35.067146 -14.150043  160.817917
2418  0        -29.756786 -15.303859  -52.663750
2419  0          1.141870  63.609570  162.176315
2420  0         23.913206 -35.665077   54.719437

[3825 rows x 3 columns])

Uproot 3 merged these because that's similar to what ROOT's TTree::Scan does. But it's odd to say that the jet at index 0 is in the same row as the muon at index 0 (how are these two collections sorted? why associate e.g. the most energetic jet with the most energetic muon? they might not have anything to do with each other). If you want to do that, you can opt in (with pandas imported as pd):

>>> left, right = events.arrays(filter_name=["Jet_P*", "Muon_P*"], library="pd")
>>> pd.merge(left, right, left_index=True, right_index=True, how="inner")
                   Jet_Px     Jet_Py  ...    Muon_Py     Muon_Pz
entry subentry                        ...                       
1     0        -38.874714  19.863453  ... -24.404259   20.199968
3     0        -71.695213  93.571579  ... -85.835464  403.848450
      1         36.606369  21.838793  ... -13.956494  335.094208
4     0          3.880162 -75.234055  ...  67.248787  -89.695732
      1          4.979580 -39.231731  ...  25.403667   20.115053
...                   ...        ...  ...        ...         ...
2414  0         33.961163  58.900467  ... -42.204014  -64.264900
2416  0         37.071465  20.131996  ... -14.607491   61.715790
2417  0        -33.196457 -59.664749  ... -14.150043  160.817917
2418  0         -3.714818 -37.202377  ... -15.303859  -52.663750
2419  0        -36.361286  10.173571  ...  63.609570  162.176315

[2038 rows x 6 columns]
>>> pd.merge(left, right, left_index=True, right_index=True, how="outer")
                   Jet_Px     Jet_Py  ...    Muon_Py     Muon_Pz
entry subentry                        ...                       
0     0               NaN        NaN  ... -11.654672   -8.160793
      1               NaN        NaN  ...   0.693474  -11.307582
1     0        -38.874714  19.863453  ... -24.404259   20.199968
2     0               NaN        NaN  ... -21.723139   11.168285
      1               NaN        NaN  ...  29.800508   36.965191
...                   ...        ...  ...        ...         ...
2417  1        -26.086025 -19.068407  ...        NaN         NaN
2418  0         -3.714818 -37.202377  ... -15.303859  -52.663750
2419  0        -36.361286  10.173571  ...  63.609570  162.176315
      1        -15.256871 -27.175364  ...        NaN         NaN
2420  0               NaN        NaN  ... -35.665077   54.719437

[4560 rows x 6 columns]

The first of these, the how="inner" join, truncates the two collections in each event to the shorter of the two. The second, the how="outer" join, pads the shorter collection to have the same length as the longer with NaN values.
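The same truncate-versus-pad behavior can be checked on a toy example small enough to count by hand. The values below are made up; only `pd.merge` with `left_index`, `right_index`, and `how` is taken from the session above.

```python
import pandas as pd

names = ["entry", "subentry"]

# Toy data (made up): event 0 has no jets, event 2 has more jets than muons.
jets = pd.DataFrame(
    {"Jet_Px": [-38.9, -71.7, 36.6]},
    index=pd.MultiIndex.from_tuples([(1, 0), (2, 0), (2, 1)], names=names),
)
muons = pd.DataFrame(
    {"Muon_Px": [-52.9, 37.7, -0.8, 49.0]},
    index=pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (2, 0)], names=names),
)

# Inner join: only (entry, subentry) pairs present in both collections survive.
inner = pd.merge(jets, muons, left_index=True, right_index=True, how="inner")

# Outer join: every pair from either collection survives, padded with NaN.
outer = pd.merge(jets, muons, left_index=True, right_index=True, how="outer")

print(inner)
print(outer)
```

Here the inner join keeps 2 rows and the outer join keeps 5, with NaN in Jet_Px for the two muons of event 0 and NaN in Muon_Px for the second jet of event 2.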

Relationally (i.e. in SQL-land), the right thing to do is to keep these DataFrames separate and join on the first level of their MultiIndex. The fact that that's such a pain has a lot to do with why Awkward Array exists (see StackOverflow question from four years ago!). But I digress.
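As a rough sketch of what "join on the first level" could look like in pandas: one reading is to join on "entry" alone, which pairs every jet with every muon in the same event. The toy data and the `_jet`/`_muon` suffixes are assumptions for illustration, and other ways of expressing this join exist.

```python
import pandas as pd

names = ["entry", "subentry"]

# Toy data (made up): event 2 has two jets and two muons.
jets = pd.DataFrame(
    {"Jet_Px": [-38.9, -71.7, 36.6]},
    index=pd.MultiIndex.from_tuples([(1, 0), (2, 0), (2, 1)], names=names),
)
muons = pd.DataFrame(
    {"Muon_Px": [-52.9, -0.8, 49.0, 0.8]},
    index=pd.MultiIndex.from_tuples([(0, 0), (1, 0), (2, 0), (2, 1)], names=names),
)

# Move the index levels into columns, then join on "entry" only.
# Each jet is paired with each muon of the same event; the two "subentry"
# columns are kept apart by the suffixes.
pairs = pd.merge(
    jets.reset_index(),
    muons.reset_index(),
    on="entry",
    suffixes=("_jet", "_muon"),
)

print(pairs)
```

Event 1 contributes 1 pair and event 2 contributes 2 × 2 = 4 pairs, which hints at why this gets unwieldy for real collections.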

If you don't ask for data with different multiplicities, you get a single DataFrame, no tuple.

>>> events.arrays(filter_name=["Muon_P*"], library="pd")
                  Muon_Px    Muon_Py     Muon_Pz
entry subentry                                  
0     0        -52.899456 -11.654672   -8.160793
      1         37.737782   0.693474  -11.307582
1     0         -0.816459 -24.404259   20.199968
2     0         48.987831 -21.723139   11.168285
      1          0.827567  29.800508   36.965191
...                   ...        ...         ...
2416  0        -39.285824 -14.607491   61.715790
2417  0         35.067146 -14.150043  160.817917
2418  0        -29.756786 -15.303859  -52.663750
2419  0          1.141870  63.609570  162.176315
2420  0         23.913206 -35.665077   54.719437

[3825 rows x 3 columns]

As an interesting in-between case, if you ask for both jagged and non-jagged data, the non-jagged data can be broadcast to the jagged data, and that gives you a single DataFrame, too.

>>> events.arrays(filter_name=["MET_p*", "Muon_P*"], library="pd")
                  Muon_Px    Muon_Py     Muon_Pz     MET_px     MET_py
entry subentry                                                        
0     0        -52.899456 -11.654672   -8.160793   5.912771   2.563633
      1         37.737782   0.693474  -11.307582   5.912771   2.563633
1     0         -0.816459 -24.404259   20.199968  24.765203 -16.349110
2     0         48.987831 -21.723139   11.168285 -25.785088  16.237131
      1          0.827567  29.800508   36.965191 -25.785088  16.237131
...                   ...        ...         ...        ...        ...
2416  0        -39.285824 -14.607491   61.715790 -14.607650 -28.204895
2417  0         35.067146 -14.150043  160.817917  22.208313  59.774940
2418  0        -29.756786 -15.303859  -52.663750  18.101646  50.290718
2419  0          1.141870  63.609570  162.176315  79.875191 -52.351452
2420  0         23.913206 -35.665077   54.719437  19.713749  -3.595418

[3825 rows x 5 columns]

Arguably, maybe this should be two DataFrames as well, since the broadcasting means that MET values in events with no muons are dropped. But if that was an issue, you'd just call events.arrays multiple times, to make the DataFrames separate on purpose. The auto-broadcasting simplifies a common case (you don't have to explicitly pd.merge).
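The dropped-MET effect can be reproduced with plain pandas. This sketch only mimics the outcome of the broadcasting with an explicit merge on made-up values; it is not a description of what Uproot does internally.

```python
import pandas as pd

# Toy data (made up): event 1 has no muons.
muons = pd.DataFrame(
    {"Muon_Px": [-52.9, 37.7, 49.0]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (2, 0)], names=["entry", "subentry"]
    ),
)
# One MET value per event, including the event without muons.
met = pd.DataFrame(
    {"MET_px": [5.9, 24.8, -25.8]},
    index=pd.Index([0, 1, 2], name="entry"),
)

# Repeat each event's MET value once per muon in that event.
broadcast = pd.merge(
    muons.reset_index(), met.reset_index(), on="entry", how="inner"
).set_index(["entry", "subentry"])

print(broadcast)
```

The MET value of event 0 appears twice (once per muon), and the MET value of event 1 does not appear at all, because there is no muon row to attach it to.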

Arguably, maybe this function should always return a tuple, possibly a tuple of one item, so that types don't depend on values. The way it is now, whether you get a tuple of DataFrames or just a DataFrame depends on what kinds of branches exist in the ROOT file and whether you've asked for them. I'm on the fence about that: always returning the same type is nice for predictability, but it could be hard explaining to everyone why they're getting a tuple with only one item. It then becomes cumbersome to unpack (I like the variable_name, = thing_that_returns_singleton() syntax, but a lot of people don't like the meaningfulness of the trailing comma, including black.)
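If the value-dependent return type gets in the way, a caller can normalize it on their side. The helper below, `as_tuple`, is hypothetical (it is not part of Uproot), and the strings stand in for DataFrames so the sketch runs without any dependencies.

```python
def as_tuple(result):
    """Wrap a lone result in a 1-tuple; pass tuples through unchanged.

    Hypothetical caller-side helper, not an Uproot function.
    """
    return result if isinstance(result, tuple) else (result,)


# Stand-ins for what events.arrays(..., library="pd") might return.
single = "muons_frame"
several = ("jets_frame", "muons_frame")

# Downstream code can now always iterate, whichever case occurred.
for frame in as_tuple(single):
    print(frame)
for frame in as_tuple(several):
    print(frame)

# Unpacking a singleton; the parentheses make the trailing comma easier to spot.
(only,) = as_tuple(single)
print(only)
```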

So that's why. I'm going to close this because it's not a bug/issue and label it as a question. If you want to start a discussion about changing the behavior, I'll reopen it as a policy question. If you have a suggested edit to the documentation, I'll take a PR. Thanks!
