[df] Enable snapshotting RNTuple cardinality cols #20820
+46
−1
RNTuple's cardinality fields are projected read-only fields, and currently an exception is thrown when a user tries to snapshot fields of this type to a new RNTuple.
To prevent this, this PR converts such fields into non-projected fields of the inner type of the `ROOT::RNTupleCardinality<SizeT>` field (either `std::uint32_t` or `std::int64_t`) before they are added to the model of the new RNTuple. A warning is shown to the user when this happens.

A follow-up or alternative approach is to preserve the projection when creating the model for the output RNTuple. However, this comes with the caveat that the source fields must be included in the output RNTuple. This becomes an issue for cardinality fields of collections of anonymous records (as is the case for NanoAODs, see the paragraph below): the RNTuple data source only exposes the inner fields and not the collection field itself, because there is no straightforward way to represent the anonymous record in memory.
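To make the conversion above concrete, here is a minimal standalone sketch of what "materializing" a projected cardinality into a plain integer column amounts to. It deliberately does not use the ROOT API: `OffsetColumn`, `CardinalityAt`, and `MaterializeCardinality` are hypothetical names, and the assumption that a collection is described by cumulative end offsets is a simplified model chosen for illustration, not a description of the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a collection's index column:
// endOffsets[i] is the total number of items up to and including entry i.
struct OffsetColumn {
    std::vector<std::uint64_t> endOffsets;
};

// "Projected" view: the size of entry `entry` is derived on demand from
// the offsets and is never stored on its own.
template <typename SizeT>
SizeT CardinalityAt(const OffsetColumn &col, std::size_t entry)
{
    assert(entry < col.endOffsets.size());
    const std::uint64_t begin = (entry == 0) ? 0 : col.endOffsets[entry - 1];
    return static_cast<SizeT>(col.endOffsets[entry] - begin);
}

// "Non-projected" version: the sizes are copied into an independent column
// of plain integers that no longer depends on the source collection.
template <typename SizeT>
std::vector<SizeT> MaterializeCardinality(const OffsetColumn &col)
{
    std::vector<SizeT> sizes;
    sizes.reserve(col.endOffsets.size());
    for (std::size_t i = 0; i < col.endOffsets.size(); ++i)
        sizes.push_back(CardinalityAt<SizeT>(col, i));
    return sizes;
}
```

In this model, the materialized column can be written out without the source collection being present, which is the property that the projection-preserving alternative lacks.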
A notable scenario is the current implementation of CMS NanoAOD, which in the TTree format contains leaflist arrays. When these leaflist arrays, e.g. created via `tree.Branch("jet_pt", &jet_pt, "jet_pt[njets]")`, are converted to RNTuple, the RNTupleImporter creates an anonymous collection record, where `jet_pt` becomes a true collection field and `njets` is a projected field of type `RNTupleCardinality`. As a result, RDataFrame currently cannot write out RNTuple NanoAOD data via Snapshot in a way that preserves the column names for both the collection payload and the size of the collections. We want to be able to preserve the complete NanoAOD schema.