Hi, guys! I've ported my scripts from uproot to uproot4 and noticed a behavior that I haven't seen in the docs: when library='pd' is passed, uproot4 returns a tuple of DataFrames:
import uproot4
events2 = uproot4.open("http://scikit-hep.org/uproot/examples/HZZ.root")["events"]
df2 = events2.arrays(entry_stop=10, library='pd')
type(df2)
# tuple
The previous version returned a single DataFrame:
import uproot
events2 = uproot.open("http://scikit-hep.org/uproot/examples/HZZ.root")["events"]
df = events2.pandas.df(["MET_p*", "Muon_P*"], entrystop=10)
type(df)
# pandas.core.frame.DataFrame
What I can't figure out is the reason for this behavior. Moreover, it looks like the data in each DataFrame of the tuple is not unique and contains duplicates, so I can't merge them.
This is a change in behavior from Uproot 3, and if, after understanding the reason why, you have some suggested text for how it should be explained in the documentation, let me know.
Although Pandas DataFrames can describe jagged data by putting the jaggedness into a MultiIndex, a DataFrame can have only one index. Thus, data with different multiplicities can't go into the same DataFrame.
Below, we ask for both jets and muons. The jet index (in each event) is unrelated to the muon index (in each event), so they really have to be in different DataFrames.
>>> events.arrays(filter_name=["Jet_P*", "Muon_P*"], library="pd")
(                   Jet_Px     Jet_Py      Jet_Pz
entry subentry
1     0        -38.874714  19.863453   -0.894942
3     0        -71.695213  93.571579  196.296432
      1         36.606369  21.838793   91.666283
      2        -28.866419   9.320708   51.243221
4     0          3.880162 -75.234055 -359.601624
...                   ...        ...         ...
2417  0        -33.196457 -59.664749  -29.040150
      1        -26.086025 -19.068407   26.774284
2418  0         -3.714818 -37.202377   41.012222
2419  0        -36.361286  10.173571  226.429214
      1        -15.256871 -27.175364   12.119683
[2773 rows x 3 columns],
                  Muon_Px    Muon_Py    Muon_Pz
entry subentry
0     0        -52.899456 -11.654672  -8.160793
      1         37.737782   0.693474 -11.307582
1     0         -0.816459 -24.404259  20.199968
2     0         48.987831 -21.723139  11.168285
      1          0.827567  29.800508  36.965191
...                   ...        ...        ...
2416  0        -39.285824 -14.607491  61.715790
2417  0         35.067146 -14.150043 160.817917
2418  0        -29.756786 -15.303859 -52.663750
2419  0          1.141870  63.609570 162.176315
2420  0         23.913206 -35.665077  54.719437
[3825 rows x 3 columns])
Uproot 3 merged these because that's similar to what ROOT's
>>> left, right = events.arrays(filter_name=["Jet_P*", "Muon_P*"], library="pd")
>>> pd.merge(left, right, left_index=True, right_index=True, how="inner")
                   Jet_Px     Jet_Py  ...     Muon_Py    Muon_Pz
entry subentry                        ...
1     0        -38.874714  19.863453  ...  -24.404259  20.199968
3     0        -71.695213  93.571579  ...  -85.835464 403.848450
      1         36.606369  21.838793  ...  -13.956494 335.094208
4     0          3.880162 -75.234055  ...   67.248787 -89.695732
      1          4.979580 -39.231731  ...   25.403667  20.115053
...                   ...        ...  ...         ...        ...
2414  0         33.961163  58.900467  ...  -42.204014 -64.264900
2416  0         37.071465  20.131996  ...  -14.607491  61.715790
2417  0        -33.196457 -59.664749  ...  -14.150043 160.817917
2418  0         -3.714818 -37.202377  ...  -15.303859 -52.663750
2419  0        -36.361286  10.173571  ...   63.609570 162.176315
[2038 rows x 6 columns]
>>> pd.merge(left, right, left_index=True, right_index=True, how="outer")
                   Jet_Px     Jet_Py  ...     Muon_Py    Muon_Pz
entry subentry                        ...
0     0               NaN        NaN  ...  -11.654672  -8.160793
      1               NaN        NaN  ...    0.693474 -11.307582
1     0        -38.874714  19.863453  ...  -24.404259  20.199968
2     0               NaN        NaN  ...  -21.723139  11.168285
      1               NaN        NaN  ...   29.800508  36.965191
...                   ...        ...  ...         ...        ...
2417  1        -26.086025 -19.068407  ...         NaN        NaN
2418  0         -3.714818 -37.202377  ...  -15.303859 -52.663750
2419  0        -36.361286  10.173571  ...   63.609570 162.176315
      1        -15.256871 -27.175364  ...         NaN        NaN
2420  0               NaN        NaN  ...  -35.665077  54.719437
[4560 rows x 6 columns]
The first of these, the
Relationally (i.e. in SQL-land), the right thing to do is to keep these DataFrames separate and join on the first level of their MultiIndex. The fact that that's such a pain has a lot to do with why Awkward Array exists (see StackOverflow question from four years ago!). But I digress.
If you don't want data with different multiplicities, you get a single DataFrame, no tuple.
>>> events.arrays(filter_name=["Muon_P*"], library="pd")
                  Muon_Px    Muon_Py    Muon_Pz
entry subentry
0     0        -52.899456 -11.654672  -8.160793
      1         37.737782   0.693474 -11.307582
1     0         -0.816459 -24.404259  20.199968
2     0         48.987831 -21.723139  11.168285
      1          0.827567  29.800508  36.965191
...                   ...        ...        ...
2416  0        -39.285824 -14.607491  61.715790
2417  0         35.067146 -14.150043 160.817917
2418  0        -29.756786 -15.303859 -52.663750
2419  0          1.141870  63.609570 162.176315
2420  0         23.913206 -35.665077  54.719437
[3825 rows x 3 columns]
As an interesting in-between case, if you want jagged and non-jagged data, the non-jagged data can be broadcast to the jagged data, and that gives you a single DataFrame, too.
>>> events.arrays(filter_name=["MET_p*", "Muon_P*"], library="pd")
                  Muon_Px    Muon_Py    Muon_Pz     MET_px     MET_py
entry subentry
0     0        -52.899456 -11.654672  -8.160793   5.912771   2.563633
      1         37.737782   0.693474 -11.307582   5.912771   2.563633
1     0         -0.816459 -24.404259  20.199968  24.765203 -16.349110
2     0         48.987831 -21.723139  11.168285 -25.785088  16.237131
      1          0.827567  29.800508  36.965191 -25.785088  16.237131
...                   ...        ...        ...        ...        ...
2416  0        -39.285824 -14.607491  61.715790 -14.607650 -28.204895
2417  0         35.067146 -14.150043 160.817917  22.208313  59.774940
2418  0        -29.756786 -15.303859 -52.663750  18.101646  50.290718
2419  0          1.141870  63.609570 162.176315  79.875191 -52.351452
2420  0         23.913206 -35.665077  54.719437  19.713749  -3.595418
[3825 rows x 5 columns]
Arguably, maybe this should be two DataFrames as well, since the broadcasting means that MET values in events with no muons are dropped. But if that was an issue, you'd just call
Arguably, maybe this function should always return a tuple, possibly a tuple of one item, so that types don't depend on values. The way it is now, whether you get a tuple of DataFrames or just a DataFrame depends on what kinds of branches exist in the ROOT file and whether you've asked for them. I'm on the fence about that: always returning the same type is nice for predictability, but it could be hard explaining to everyone why they're getting a tuple with only one item. It then becomes cumbersome to unpack (I like the
So that's why. I'm going to close this because it's not a bug/issue and label it as a question. If you want to start a discussion about changing the behavior, I'll reopen it as a policy question. If you have a suggested edit to the documentation, I'll take a PR. Thanks!
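Until the return type is settled, the always-a-tuple idea discussed above can be emulated on the caller's side. This is only a minimal sketch: `as_tuple` is a hypothetical helper name, not an Uproot function, and the toy DataFrames (with made-up values) stand in for whatever `events.arrays(..., library="pd")` returns.

```python
import pandas as pd


def as_tuple(result):
    """Wrap a lone DataFrame in a 1-tuple; pass tuples through unchanged.

    Hypothetical caller-side helper, not part of Uproot.
    """
    return result if isinstance(result, tuple) else (result,)


# Toy stand-ins for the two possible return shapes (values are invented).
single = pd.DataFrame({"Muon_Px": [-52.9, 37.7]})
several = (single, pd.DataFrame({"Jet_Px": [-38.9]}))

for frames in (as_tuple(single), as_tuple(several)):
    # Downstream code can always iterate, whichever branches were requested.
    for frame in frames:
        print(list(frame.columns), len(frame))
```

With a wrapper like this, the calling code sees one type regardless of which branches were requested, at the cost of unpacking a one-item tuple in the simple case.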
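Joining the separate DataFrames on the first level of their MultiIndex, as suggested above, can be sketched with plain pandas. The example uses small made-up DataFrames (not values read from HZZ.root) with the same (entry, subentry) index layout; the variable names, suffixes, and numbers are assumptions for illustration only.

```python
import pandas as pd

# Toy data with the same (entry, subentry) MultiIndex layout as above;
# the numbers are invented for illustration.
jets = pd.DataFrame(
    {"Jet_Px": [-38.9, -71.7, 36.6]},
    index=pd.MultiIndex.from_tuples(
        [(1, 0), (3, 0), (3, 1)], names=["entry", "subentry"]
    ),
)
muons = pd.DataFrame(
    {"Muon_Px": [-52.9, 37.7, -0.8, 22.1]},
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (1, 0), (3, 0)], names=["entry", "subentry"]
    ),
)

# Join on the event number only. The jet and muon subentry levels are
# unrelated, so each is kept as its own suffixed column.
pairs = jets.reset_index().merge(
    muons.reset_index(), on="entry", how="inner", suffixes=("_jet", "_muon")
)
print(pairs)
```

Unlike an index-on-index merge, which matches jet 0 with muon 0 and so on, this keeps every jet-muon combination within an event, and events that lack either jets or muons drop out of the inner join.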