[ntuple] Change entry handling in `RNTupleProcessor` to properly handle missing values by enirolf · Pull Request #19111 · root-project/root

enirolf · 2025-06-20T09:57:22Z

This PR adds a significant change to the way field values can be accessed from an RNTupleProcessor, and, under the hood, how the RNTupleProcessor handles the entries that manage these values. The motivation behind this change is that the current processor interface has two main shortcomings:

1. It is too flexible: it allows users to call GetPtr on the entry inside the event loop, which is generally considered bad practice.
2. It cannot properly handle values that are missing from entries as the result of an incomplete join or missing fields in subsequent chains.

To address these shortcomings, we introduce two new classes, the RNTupleProcessorEntry and the RNTupleProcessorOptionalPtr, which are described in more detail below. With the introduction of these classes, the use of the RNTupleModel is removed from the public interface entirely.

The changes in this PR also prepare for the use of the RNTupleProcessor as an RDataFrame data source. A minimal working example has been developed to validate that this new RNTupleProcessor interface is compatible with the abstract interface of RDataSource.

The RNTupleProcessorEntry

The RNTupleProcessorEntry is an internal class that largely mirrors the REntry interface, but with additional functionality for the RNTupleProcessor. Most notably, fields and their values present in the entry have a notion of validity: a field's value can be read only when the field is valid.

In addition, it keeps track of which fields are auxiliary if a join operation is involved. This is needed in order to properly resolve from which underlying RNTuple the field data should be read. Since the RNTupleProcessor is composable, this is also relevant for the chain-based processor.
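To make the notion of validity and auxiliary fields concrete, here is a minimal, self-contained sketch. It is not ROOT code: `ToyFieldSlot`, `ToyProcessorEntry` and all of their members are hypothetical names invented for illustration, and the real RNTupleProcessorEntry stores typed field values rather than plain floats.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for one field slot in a processor entry (not ROOT's layout).
struct ToyFieldSlot {
   float fValue = 0.f;        // storage for the field's value
   bool fIsValid = false;     // can the value be read for the current entry?
   bool fIsAuxiliary = false; // does the field come from a joined (auxiliary) ntuple?
};

// Hypothetical entry that tracks validity and auxiliary-ness per field.
class ToyProcessorEntry {
   std::unordered_map<std::string, ToyFieldSlot> fSlots;

public:
   // Registers a field; it starts out invalid until a value is loaded.
   void AddField(const std::string &name, bool isAuxiliary) { fSlots[name].fIsAuxiliary = isAuxiliary; }

   // Called when a value was found for the current entry.
   void SetValue(const std::string &name, float value)
   {
      auto &slot = fSlots.at(name);
      slot.fValue = value;
      slot.fIsValid = true;
   }

   // Called when no value exists for the current entry (e.g. an incomplete join).
   void Invalidate(const std::string &name) { fSlots.at(name).fIsValid = false; }

   bool IsValid(const std::string &name) const { return fSlots.at(name).fIsValid; }
   bool IsAuxiliary(const std::string &name) const { return fSlots.at(name).fIsAuxiliary; }

   // Reading is only allowed while the field is valid.
   float Read(const std::string &name) const
   {
      const auto &slot = fSlots.at(name);
      if (!slot.fIsValid)
         throw std::runtime_error("field '" + name + "' has no valid value in this entry");
      return slot.fValue;
   }
};
```

In this toy model, a join that finds no match for an auxiliary field would call `Invalidate` for that field, and any later `Read` would fail loudly instead of returning a stale value.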

The RNTupleProcessorOptionalPtr

The RNTupleProcessorOptionalPtr is what the user or upstream application will actually interact with to read field values from a processor entry. It is obtained when requesting a field from the processor through RequestField(), which has to be done before any processor iteration starts (because afterwards the entry will be frozen).
The RNTupleProcessorOptionalPtr can be used to get a const reference or a pointer to the field's value. If the field is marked invalid in the entry at the time of access, the const reference access method will throw an exception, whereas the pointer access method returns a nullptr.
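The two access behaviours described above can be sketched with a small stand-alone class. This is an illustration only, not the ROOT implementation: `ToyOptionalPtr` and `PtrOrNull` are hypothetical names, and the raw `bool` flag stands in for whatever validity bookkeeping the entry really uses. Only `HasValue()` and dereferencing with `*` appear in the PR's own example.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, simplified model of an optional-pointer handle (not ROOT's class).
template <typename T>
class ToyOptionalPtr {
   const T *fValue;      // where the (toy) processor writes the field's value
   const bool *fIsValid; // validity flag owned by the (toy) entry

public:
   ToyOptionalPtr(const T *value, const bool *isValid) : fValue(value), fIsValid(isValid) {}

   // True if the field has a value for the current entry.
   bool HasValue() const { return *fIsValid; }

   // Pointer-style access: nullptr signals "no value for this entry".
   // The method name is an assumption made for this sketch.
   const T *PtrOrNull() const { return HasValue() ? fValue : nullptr; }

   // Reference-style access: throws when the field is invalid.
   const T &operator*() const
   {
      if (!HasValue())
         throw std::runtime_error("field has no valid value for the current entry");
      return *fValue;
   }
};
```

The handle never owns the value; it only observes the storage and the validity flag, so it stays cheap to copy and always reflects the entry that is currently loaded.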

Code example

Old interface

auto model = RNTupleModel::Create();
auto fldX = model->MakeField<float>("x");

auto proc =
  RNTupleProcessor::CreateChain({{"ntuple", "ntuple1.root"}, {"ntuple", "ntuple1.root"}}, std::move(model));

for (const auto &entry : *proc) {
  // Either (no way to guarantee the validity of `fldX` in this entry):
  std::cout << "x = " << *fldX << std::endl;
  // Or, same issue as above, plus expensive:
  std::cout << "x = " << *entry.GetPtr<float>("x") << std::endl;
}

New interface

auto proc =
  RNTupleProcessor::CreateChain({{"ntuple", "ntuple1.root"}, {"ntuple", "ntuple1.root"}});

// Returns RNTupleProcessorOptionalPtr
auto x = proc->RequestField<float>("x");

// Instead of the full entry, now only provide the entry number as iterator value
for (const auto &idx : *proc) {
  // Without this check, `*x` will throw for entries where it is invalid.
  if (x.HasValue())
    std::cout << "x = " << *x << std::endl;
}

tutorials/io/ntuple/ntpl012_processor_chain.C

tutorials/io/ntuple/ntpl015_processor_join.C

github-actions · 2025-06-20T12:02:59Z

Test Results

22 files 22 suites 3d 18h 48m 58s ⏱️
3 701 tests 3 700 ✅ 0 💤 1 ❌
79 468 runs 79 466 ✅ 0 💤 2 ❌

For more details on these failures, see this check.

Results for commit 0520c5f.

♻️ This comment has been updated with latest results.

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

hahnjo

First code review - I have two overall design questions that I will post separately so their inline comments are not mixed with these.

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

tree/ntuple/inc/ROOT/RNTupleProcessorEntry.hxx

tree/ntuple/src/RNTupleProcessor.cxx

hahnjo

My two questions:

Do we (want to) support conversion and schema evolution when reading fields via the RNTupleProcessor? (triggered by #19111 (comment))
At the moment, every RNTupleProcessor has its own RNTupleProcessorEntry that are "connected together" via shared pointers. However, the field validity is not shared which could lead to problems with nested joins where the outer one is valid, but the inner one does not find its value (see also https://github.com/root-project/root/pull/19111/files#r2194283077).
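The concern in the second question can be illustrated with a toy model in which the value storage is shared between an inner and an outer entry, but each entry keeps its own validity flag. This is a sketch under that assumption only; `ToyEntryField` and `OuterReadsStaleValue` are hypothetical names and do not reflect how the processor entries are actually laid out.

```cpp
#include <cassert>
#include <memory>

// Toy model: the value is shared between the inner and outer entry, the validity flag is not.
struct ToyEntryField {
   std::shared_ptr<float> fValue; // shared storage ("connected together")
   bool fIsValid = true;          // per-entry flag, NOT shared
};

// Returns true if the outer entry would hand out a value that the inner entry
// considers missing, i.e. the two flags have gone out of sync.
inline bool OuterReadsStaleValue(const ToyEntryField &inner, const ToyEntryField &outer)
{
   return outer.fValue == inner.fValue && outer.fIsValid && !inner.fIsValid;
}
```

If the inner join fails to find a match and clears only its own flag, `OuterReadsStaleValue` returns true in this model, which is the situation the question warns about.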

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

tree/ntuple/src/RNTupleProcessor.cxx

pcanal · 2025-07-15T16:03:41Z

My two questions:

1. Do we (want to) support conversion and schema evolution when reading fields via the `RNTupleProcessor`? (triggered by [#19111 (comment)](https://github.com/root-project/root/pull/19111#discussion_r2194261277))

In my opinion we have no choice but to (eventually) support schema evolution (otherwise the RNTupleProcessor would be unusable to read any non-current files).

enirolf

Thank you for your extensive review, @hahnjo! To answer your questions:

Do we (want to) support conversion and schema evolution when reading fields via the RNTupleProcessor? (triggered by #19111 (comment))

Yes, to echo what Philippe wrote, we have no choice (not just for schema evolution, but also to allow alternative field types through e.g., RDF). I admit that I didn't yet take this into account in this particular PR, but I also think that with the proposed interface change there is nothing blocking us from supporting this. I will address it and, depending on how much more it would blow up this PR, I'll either add it here or in a follow-up PR.

At the moment, every RNTupleProcessor has its own RNTupleProcessorEntry that are "connected together" via shared pointers. However, the field validity is not shared which could lead to problems with nested joins where the outer one is valid, but the inner one does not find its value (see also https://github.com/root-project/root/pull/19111/files#r2194283077).

That's a good point, and something that indeed needs to be addressed more properly as well. For this one too, I will investigate more and either add it to this PR or create a follow-up.

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

tree/ntuple/src/RNTupleProcessor.cxx

vepadulano

The work is already at a really advanced state and the direction is very positive, thanks! A few minor suggestions/considerations. Furthermore, I don't see the new API to check for missing values, e.g. `FieldIsValid`, being used in the tests; it would be good to have at least one simple test already in this PR. A more thorough testing campaign could be added later.

EDIT: Sorry, I was looking only for FieldIsValid, I see there are uses of HasValue in the tests 👍

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

tree/ntuple/test/ntuple_processor.cxx

jblomer

I think the tutorials look much cleaner with this change!

tree/ntuple/inc/ROOT/RNTupleModel.hxx

tree/ntuple/inc/ROOT/RNTupleProcessorEntry.hxx

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

tutorials/io/ntuple/ntpl015_processor_join.C

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

tree/ntuple/inc/ROOT/RNTupleProcessorEntry.hxx

vepadulano

Thank you! This is already very mature and close to being ready. A few comments from my side before the final push.

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

tree/ntuple/test/ntuple_processor.cxx

tree/ntuple/src/RNTupleProcessor.cxx

vepadulano

Great work, thank you! Please try to simplify the commit history where it makes sense before merging.

hahnjo

My last round of comments has been addressed 😃

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx

...instead of the full entry. This design was already a bit questionable, because it would allow users to call `GetPtr` or related functions *inside* of the loop, which we want to avoid. Other than that, there is (currently) no additional useful information one can obtain from the full entry. Therefore, instead we now only return the index of the current entry.
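A range whose iterator yields only the current entry index could look roughly like the following stand-alone sketch. `ToyProcessor` is a hypothetical name, and the choice of `std::input_iterator_tag` is an assumption made for this example; it does not claim to match the iterator tag used in the actual RNTupleProcessor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>

// Hypothetical processor-like range that exposes entry indices, not entries.
class ToyProcessor {
   std::uint64_t fNEntries;

public:
   explicit ToyProcessor(std::uint64_t nEntries) : fNEntries(nEntries) {}

   class Iterator {
      std::uint64_t fIndex;

   public:
      using iterator_category = std::input_iterator_tag; // assumption for this sketch
      using value_type = std::uint64_t;
      using difference_type = std::ptrdiff_t;
      using pointer = const std::uint64_t *;
      using reference = std::uint64_t;

      explicit Iterator(std::uint64_t index) : fIndex(index) {}

      // Dereferencing gives the index of the current entry by value.
      value_type operator*() const { return fIndex; }

      Iterator &operator++()
      {
         ++fIndex;
         return *this;
      }
      Iterator operator++(int)
      {
         Iterator previous = *this;
         ++fIndex;
         return previous;
      }

      bool operator==(const Iterator &other) const { return fIndex == other.fIndex; }
      bool operator!=(const Iterator &other) const { return !(*this == other); }
   };

   Iterator begin() const { return Iterator(0); }
   Iterator end() const { return Iterator(fNEntries); }
};
```

Because the loop variable is just a number, there is no entry object inside the loop on which a user could call `GetPtr` or similar functions.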

jblomer

Very nice!

tree/ntuple/inc/ROOT/RNTupleProcessorEntry.hxx

Internal class that largely mirrors the REntry interface, but with additional functionality specifically for the `RNTupleProcessor`. Most notably, fields and their values present in the entry have a notion of validity. Only when a field is valid, its value can be read.

After a chain or join operation, it is no longer guaranteed that fields always have a (valid) value in every entry. The existing `REntry` does not provide enough functionality to handle this, so instead we implement an (internal) version of it (the `RNTupleProcessorEntry`), which is specialized for the `RNTupleProcessor` to also keep track of the validity of its values, as well as a class (the `RNTupleProcessorOptionalPtr`) that can be used to read field values from a processor with proper validity checks.

A major effect of this change is that the `RNTupleModel` is now hidden from the user, i.e., it is no longer possible to provide one through the factory methods. The entry is instead filled on a field-by-field basis, through the user-facing `RequestField` method. This method adds the field to the entry (if not yet present), and returns an `RNTupleProcessorOptionalPtr`.

For advanced use cases, it is possible to provide a raw pointer created by the user for reading the field values. This introduces a risk of still reading wrong values when this pointer is accessed directly, so the prescribed way to read entry data during processing is always through the `RNTupleProcessorOptionalPtr`. This feature will therefore not be publicly advertised and should only be used by expert users.

enirolf requested review from hahnjo, pcanal, silverweed and vepadulano June 20, 2025 09:57

enirolf self-assigned this Jun 20, 2025

enirolf requested a review from jblomer as a code owner June 20, 2025 09:57

enirolf added the in:RNTuple label Jun 20, 2025

enirolf requested a review from couet as a code owner June 20, 2025 09:57

enirolf mentioned this pull request Jun 20, 2025

[ntuple] Handle missing values in RNTupleProcessor #18932

Closed

silverweed reviewed Jun 20, 2025

tutorials/io/ntuple/ntpl012_processor_chain.C (outdated, resolved)

tutorials/io/ntuple/ntpl015_processor_join.C (outdated, resolved)

enirolf force-pushed the ntuple-processor-value-handling branch from 36fb077 to fd395d3 Compare June 20, 2025 12:29

pcanal reviewed Jun 24, 2025

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx (outdated, resolved)

enirolf force-pushed the ntuple-processor-value-handling branch from fd395d3 to bc98934 Compare July 2, 2025 11:36

enirolf added the clean build label (Ask CI to do non-incremental build on PR) Jul 3, 2025

hahnjo reviewed Jul 9, 2025

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx (outdated, resolved)

tree/ntuple/src/RNTupleProcessor.cxx (outdated, resolved)

tree/ntuple/src/RNTupleProcessor.cxx (outdated, resolved)

enirolf commented Jul 16, 2025

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx (outdated, resolved)

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx (outdated, resolved)

tree/ntuple/src/RNTupleProcessor.cxx (outdated, resolved)

enirolf force-pushed the ntuple-processor-value-handling branch from bc98934 to 4ce35af Compare July 16, 2025 06:59

vepadulano requested changes Jul 28, 2025

jblomer reviewed Aug 7, 2025

tree/ntuple/inc/ROOT/RNTupleModel.hxx (outdated, resolved)

tree/ntuple/inc/ROOT/RNTupleProcessorEntry.hxx (outdated, resolved)

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx (outdated, resolved)

enirolf mentioned this pull request Aug 26, 2025

[ntuple] Use a view-based interface in RNTupleProcessor #19693

Closed

pcanal reviewed Aug 26, 2025

tutorials/io/ntuple/ntpl015_processor_join.C (outdated, resolved)

enirolf marked this pull request as draft September 23, 2025 13:04

enirolf force-pushed the ntuple-processor-value-handling branch 2 times, most recently from 4d01e7c to 9e5fd73 Compare September 23, 2025 15:17

enirolf marked this pull request as ready for review September 25, 2025 09:07

enirolf marked this pull request as draft September 25, 2025 09:42

enirolf force-pushed the ntuple-processor-value-handling branch from 9e5fd73 to 722cd27 Compare September 30, 2025 13:25

enirolf force-pushed the ntuple-processor-value-handling branch 2 times, most recently from 74918b9 to 28c5696 Compare October 13, 2025 14:06

silverweed reviewed Oct 14, 2025

tree/ntuple/inc/ROOT/RNTupleProcessorEntry.hxx (outdated, resolved)

tree/ntuple/inc/ROOT/RNTupleProcessorEntry.hxx (outdated, resolved)

tree/ntuple/inc/ROOT/RNTupleProcessorEntry.hxx (resolved)

enirolf force-pushed the ntuple-processor-value-handling branch from 28c5696 to c8f7f8a Compare October 14, 2025 13:26

enirolf requested a review from dpiparo as a code owner October 14, 2025 13:26

enirolf force-pushed the ntuple-processor-value-handling branch from c8f7f8a to 339a53a Compare October 15, 2025 07:27

vepadulano requested changes Oct 17, 2025

enirolf force-pushed the ntuple-processor-value-handling branch from 339a53a to 9e7877f Compare October 20, 2025 08:25

enirolf removed the request for review from dpiparo October 20, 2025 08:26

vepadulano approved these changes Oct 20, 2025

enirolf requested review from hahnjo and silverweed October 21, 2025 07:54

hahnjo approved these changes Oct 21, 2025

pcanal reviewed Oct 21, 2025

tree/ntuple/inc/ROOT/RNTupleProcessor.hxx (resolved)

pcanal closed this Oct 21, 2025

pcanal reopened this Oct 21, 2025

enirolf added 2 commits October 22, 2025 10:03

[ntuple] Make number of entries in fixture multiple of 5

fe5cef3

This will reduce the amount of entry number arithmetic in (future) tests

enirolf force-pushed the ntuple-processor-value-handling branch from 9e7877f to 035f552 Compare October 22, 2025 08:05

enirolf added the clean build label (Ask CI to do non-incremental build on PR) Oct 22, 2025

jblomer approved these changes Oct 24, 2025

tree/ntuple/inc/ROOT/RNTupleProcessorEntry.hxx (resolved)

enirolf added 6 commits October 28, 2025 08:49

[ntuple] Add RNTupleProcessorProvenance

072d88a

This tracks the provenance of a field through composed processors, so its data is correctly read from the actual on-disk field.

[ntuple] Add more tests for composed processors

1fb2d8b

[ntuple] Change processor iterator tag

e0708f1

See commit e5fba74 for the "why".

[ntuple] Update tutorials

0520c5f

enirolf force-pushed the ntuple-processor-value-handling branch from 035f552 to 0520c5f Compare October 28, 2025 07:53

dpiparo merged commit a1c112f into root-project:master Oct 29, 2025
41 of 50 checks passed
