
[ntuple] Change entry handling in RNTupleProcessor to properly handle missing values#19111

Merged
dpiparo merged 8 commits into root-project:master from enirolf:ntuple-processor-value-handling
Oct 29, 2025

Conversation

Contributor

@enirolf enirolf commented Jun 20, 2025

This PR adds a significant change to the way field values can be accessed from an RNTupleProcessor, and, under the hood, how the RNTupleProcessor handles the entries that manage these values. The motivation behind this change is that the current processor interface has two main shortcomings:

  1. It is too flexible, allowing users to call GetPtr on the entry inside the event loop which is generally considered a bad practice.
  2. It cannot properly handle values missing from entries as the result of an incomplete join or missing fields in subsequent chains.

To address these shortcomings, we introduce two new classes, the RNTupleProcessorEntry and the RNTupleProcessorOptionalPtr, which are described in more detail below. With the introduction of these classes, the use of the RNTupleModel is removed from the public interface entirely.

The changes in this PR are also in preparation for the use of the RNTupleProcessor as an RDataFrame data source. A minimal working example has been developed to validate that this new RNTupleProcessor interface is compatible with the abstract interface of RDataSource.

The RNTupleProcessorEntry

The RNTupleProcessorEntry is an internal class that largely mirrors the REntry interface, but with additional functionality for the RNTupleProcessor. Most notably, fields and their values present in the entry have a notion of validity. A field's value can be read only when the field is valid.

In addition, it keeps track of which fields are auxiliary if a join operation is involved. This is needed in order to properly resolve from which underlying RNTuple the field data should be read. Since the RNTupleProcessor is composable, this is also relevant for the chain-based processor.
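To make the two properties above concrete, here is a minimal sketch of an entry that stores a validity flag and an auxiliary flag per field. It is only an illustration: `ProcessorEntrySketch`, its method names, and the float-only storage are assumptions made for this example and do not reflect the actual RNTupleProcessorEntry implementation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of an entry with per-field validity; the names and the
// float-only storage are assumptions, not ROOT's RNTupleProcessorEntry.
class ProcessorEntrySketch {
   struct FieldRecord {
      std::string fName;
      float fValue;
      bool fIsValid;     // false until a value is loaded for the current entry
      bool fIsAuxiliary; // true if the field comes from a joined (auxiliary) source
   };
   std::vector<FieldRecord> fFields;

public:
   // Registers a field and returns its index; the field starts out invalid.
   std::size_t AddField(const std::string &name, bool isAuxiliary)
   {
      fFields.push_back(FieldRecord{name, 0.f, false, isAuxiliary});
      return fFields.size() - 1;
   }

   // Stores a value for the current entry and marks the field valid.
   void SetValue(std::size_t idx, float value)
   {
      fFields.at(idx).fValue = value;
      fFields.at(idx).fIsValid = true;
   }

   void SetInvalid(std::size_t idx) { fFields.at(idx).fIsValid = false; }
   bool IsValid(std::size_t idx) const { return fFields.at(idx).fIsValid; }
   bool IsAuxiliary(std::size_t idx) const { return fFields.at(idx).fIsAuxiliary; }

   // Reading is only allowed while the field is valid.
   float GetValue(std::size_t idx) const
   {
      const FieldRecord &rec = fFields.at(idx);
      if (!rec.fIsValid)
         throw std::runtime_error("field '" + rec.fName + "' is not valid in this entry");
      return rec.fValue;
   }
};
```

In this sketch the auxiliary flag is just a boolean; a real implementation would presumably use it to decide which underlying RNTuple to read from.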

The RNTupleProcessorOptionalPtr

The RNTupleProcessorOptionalPtr is what the user or upstream application will actually interact with to read field values from a processor entry. It is obtained when requesting a field from the processor through RequestField(), which has to be done before any processor iteration starts (because afterwards the entry will be frozen).
The RNTupleProcessorOptionalPtr can be used to get a const reference or a pointer to the field's value. If the field is marked invalid in the entry at the time of access, the const reference access method will throw an exception, whereas the pointer access method returns a nullptr.
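The two access modes described above can be sketched with a small stand-in class. This is not the ROOT class: `OptionalPtrSketch` and its `GetPtr` method name are assumptions for illustration (only `HasValue` and `operator*` appear in this PR's examples), and the validity flag is modelled as a plain `bool` that some owner updates per entry.

```cpp
#include <stdexcept>

// Hypothetical stand-in for RNTupleProcessorOptionalPtr; not the ROOT class.
template <typename T>
class OptionalPtrSketch {
   const T *fValue;      // storage owned elsewhere (e.g. by the entry)
   const bool *fIsValid; // validity flag, updated per entry by the owner

public:
   OptionalPtrSketch(const T *value, const bool *isValid) : fValue(value), fIsValid(isValid) {}

   bool HasValue() const { return *fIsValid; }

   // Pointer access: a missing value is signalled with nullptr.
   // (The method name GetPtr is an assumption for this sketch.)
   const T *GetPtr() const { return HasValue() ? fValue : nullptr; }

   // Reference access: a missing value is signalled with an exception.
   const T &operator*() const
   {
      if (!HasValue())
         throw std::runtime_error("field has no valid value in the current entry");
      return *fValue;
   }
};
```

The pointer form suits callers that want to branch on a missing value cheaply, while the throwing form turns an unchecked read of a missing value into a hard error instead of a silent stale read.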

Code example

Old interface

auto model = RNTupleModel::Create();
auto fldX = model->MakeField<float>("x");

auto proc =
  RNTupleProcessor::CreateChain({{"ntuple", "ntuple1.root"}, {"ntuple", "ntuple1.root"}}, std::move(model));

for (const auto &entry : *proc) {
  // Either (no way to guarantee the validity of `fldX` in this entry):
  std::cout << "x = " << *fldX << std::endl;
  // Or, same issue as above, plus expensive:
  std::cout << "x = " << *entry.GetPtr<float>("x") << std::endl;
}

New interface

auto proc =
  RNTupleProcessor::CreateChain({{"ntuple", "ntuple1.root"}, {"ntuple", "ntuple1.root"}});

// Returns RNTupleProcessorOptionalPtr
auto x = proc->RequestField<float>("x");

// Instead of the full entry, now only provide the entry number as iterator value
for (const auto &idx : *proc) {
  // Without this check, `*x` will throw for entries where it is invalid.
  if (x.HasValue())
    std::cout << "x = " << *x << std::endl;
}


github-actions bot commented Jun 20, 2025

Test Results

    22 files      22 suites   3d 18h 48m 58s ⏱️
 3 701 tests  3 700 ✅ 0 💤 1 ❌
79 468 runs  79 466 ✅ 0 💤 2 ❌

For more details on these failures, see this check.

Results for commit 0520c5f.

♻️ This comment has been updated with latest results.

@enirolf enirolf force-pushed the ntuple-processor-value-handling branch from 36fb077 to fd395d3 Compare June 20, 2025 12:29
@enirolf enirolf force-pushed the ntuple-processor-value-handling branch from fd395d3 to bc98934 Compare July 2, 2025 11:36
@enirolf enirolf added the clean build Ask CI to do non-incremental build on PR label Jul 3, 2025
Member

@hahnjo hahnjo left a comment

First code review - I have two overall design questions that I will post separately so their inline comments are not mixed with these.

Member

@hahnjo hahnjo left a comment

My two questions:

  1. Do we (want to) support conversion and schema evolution when reading fields via the RNTupleProcessor? (triggered by #19111 (comment))
  2. At the moment, every RNTupleProcessor has its own RNTupleProcessorEntry that are "connected together" via shared pointers. However, the field validity is not shared which could lead to problems with nested joins where the outer one is valid, but the inner one does not find its value (see also https://github.com/root-project/root/pull/19111/files#r2194283077).
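The concern in the second question can be illustrated with a toy model. The types below are hypothetical and do not mirror the PR's code: the first variant shares only the value between two entries while each keeps its own validity flag, and the second shows one possible remedy (an assumption, not necessarily what this PR adopts) in which value and validity live in the same shared slot.

```cpp
#include <memory>

// Variant 1 (toy model of the concern): the value is shared between entries,
// but each entry keeps its own validity flag.
struct EntryOwnValidity {
   std::shared_ptr<float> fValue;
   bool fIsValid = false;
};

// Variant 2 (one possible remedy, an assumption): value and validity travel
// together in a single shared slot, so all entries observe the same state.
struct SharedSlot {
   float fValue = 0.f;
   bool fIsValid = false;
};

struct EntrySharedValidity {
   std::shared_ptr<SharedSlot> fSlot;
   bool IsValid() const { return fSlot->fIsValid; }
};
```

With the first variant, an inner join that fails to find a match can clear its own flag while the outer entry still reports the field as valid and reads a stale value; with the second, clearing the flag through either entry is visible to both.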

Member

pcanal commented Jul 15, 2025

My two questions:

1. Do we (want to) support conversion and schema evolution when reading fields via the `RNTupleProcessor`? (triggered by [#19111 (comment)](https://github.com/root-project/root/pull/19111#discussion_r2194261277))

In my opinion we have no choice but to (eventually) support schema evolution (otherwise the RNTupleProcessor would be unusable to read any non-current files).

Contributor Author

@enirolf enirolf left a comment

Thank you for your extensive review, @hahnjo! To answer your questions:

  1. Do we (want to) support conversion and schema evolution when reading fields via the RNTupleProcessor? (triggered by #19111 (comment))

Yes, to echo what Philippe wrote, we have no choice (not just for schema evolution, but also to allow alternative field types through e.g., RDF). I admit that I didn't yet take this into account in this particular PR, but I also think that with the proposed interface change there's no blocker to having this. I will address it and, depending on how much it would blow up this PR even more, I'll either add it here or in a follow-up PR.

  2. At the moment, every RNTupleProcessor has its own RNTupleProcessorEntry that are "connected together" via shared pointers. However, the field validity is not shared which could lead to problems with nested joins where the outer one is valid, but the inner one does not find its value (see also https://github.com/root-project/root/pull/19111/files#r2194283077).

That's a good point, and something that needs to indeed also be addressed more properly. For this one as well, I will investigate more and either add it to this PR or create a follow-up.

@enirolf enirolf force-pushed the ntuple-processor-value-handling branch from bc98934 to 4ce35af Compare July 16, 2025 06:59
Member

@vepadulano vepadulano left a comment

The work is already at a really advanced state and the direction is very positive, thanks! A few minor suggestions/considerations. Furthermore, I don't see the new API to check for missing values, e.g. `FieldIsValid`, being used in the tests; it would be good to have at least one simple test already in this PR. A more thorough testing campaign could be added later.

EDIT: Sorry, I was looking only for FieldIsValid, I see there are uses of HasValue in the tests 👍

Contributor

@jblomer jblomer left a comment

I think the tutorials look much cleaner with this change!

@enirolf enirolf marked this pull request as draft September 23, 2025 13:04
@enirolf enirolf force-pushed the ntuple-processor-value-handling branch 2 times, most recently from 4d01e7c to 9e5fd73 Compare September 23, 2025 15:17
@enirolf enirolf marked this pull request as ready for review September 25, 2025 09:07
@enirolf enirolf marked this pull request as draft September 25, 2025 09:42
@enirolf enirolf force-pushed the ntuple-processor-value-handling branch from 9e5fd73 to 722cd27 Compare September 30, 2025 13:25
@enirolf enirolf force-pushed the ntuple-processor-value-handling branch 2 times, most recently from 74918b9 to 28c5696 Compare October 13, 2025 14:06
@enirolf enirolf force-pushed the ntuple-processor-value-handling branch from 28c5696 to c8f7f8a Compare October 14, 2025 13:26
@enirolf enirolf requested a review from dpiparo as a code owner October 14, 2025 13:26
@enirolf enirolf force-pushed the ntuple-processor-value-handling branch from c8f7f8a to 339a53a Compare October 15, 2025 07:27
Member

@vepadulano vepadulano left a comment

Thank you! This is already very mature and close to being ready. A few comments from my side before the final push.

@enirolf enirolf force-pushed the ntuple-processor-value-handling branch from 339a53a to 9e7877f Compare October 20, 2025 08:25
@enirolf enirolf removed the request for review from dpiparo October 20, 2025 08:26
Member

@vepadulano vepadulano left a comment

Great work, thank you! Please try to simplify the commit history where it makes sense before merging.

@enirolf enirolf requested review from hahnjo and silverweed October 21, 2025 07:54
Member

@hahnjo hahnjo left a comment

My last round of comments has been addressed 😃

@pcanal pcanal closed this Oct 21, 2025
@pcanal pcanal reopened this Oct 21, 2025
...instead of the full entry. This design was already a bit
questionable, because it would allow users to call `GetPtr` or related
functions *inside* of the loop, which we want to avoid. Other than
that, there is (currently) no additional useful information one can
obtain from the full entry. Therefore, instead we now only return the
index of the current entry.
This will reduce the amount of entry number arithmetic in (future)
tests
@enirolf enirolf force-pushed the ntuple-processor-value-handling branch from 9e7877f to 035f552 Compare October 22, 2025 08:05
@enirolf enirolf added the clean build Ask CI to do non-incremental build on PR label Oct 22, 2025
Contributor

@jblomer jblomer left a comment

Very nice!

Internal class that largely mirrors the REntry interface, but with
additional functionality specifically for the `RNTupleProcessor`. Most
notably, fields and their values present in the entry have a notion of
validity. Only when a field is valid, its value can be read.
After a chain or join operation, it is not guaranteed anymore that
fields always have a (valid) value in every entry. The existing `REntry`
does not provide enough functionality to handle this, so instead we
implement an (internal) version of it (the `RNTupleProcessorEntry`),
which is specialized for the `RNTupleProcessor` to also keep track of
the validity of its values, as well as a class (the
`RNTupleProcessorOptionalPtr`) that can be used to read field values
from a processor with proper validity checks.

A major effect of this change is that the RNTupleModel is now hidden
from the user, i.e., it is not possible anymore to provide one through
the factory methods. The entry is instead filled on a field-by-field
basis, through the user-facing `RequestField` method. This method adds
the field to the entry (if not yet present), and returns an
`RNTupleProcessorOptionalPtr`. For advanced use cases, it is possible to
provide a raw pointer created by the user for reading the field values.
This introduces a risk of still reading wrong values when this pointer
is accessed directly, so the prescribed way to read entry data during
processing is always through the `RNTupleProcessorOptionalPtr`. This
feature will therefore not be publicly advertised and should only be
used by expert users.
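The field-by-field filling described in this commit message can be sketched with a toy chain over in-memory data. Everything here is hypothetical (`ToyNTuple`, `ToySlot`, `ToyChainProcessor`, `LoadEntry`); it only illustrates the idea that requesting a field adds it once, and that loading an entry marks a field invalid when the current chain member does not provide it.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy in-memory "ntuple": a number of entries plus named float columns.
struct ToyNTuple {
   std::size_t fNEntries;
   std::map<std::string, std::vector<float>> fColumns;
};

// Value slot with a validity flag, shared between processor and caller.
struct ToySlot {
   float fValue = 0.f;
   bool fIsValid = false;
};

// Hypothetical chain processor; not ROOT's RNTupleProcessor.
class ToyChainProcessor {
   std::vector<ToyNTuple> fChain;
   std::map<std::string, std::shared_ptr<ToySlot>> fSlots;

public:
   explicit ToyChainProcessor(std::vector<ToyNTuple> chain) : fChain(std::move(chain)) {}

   // Adds the field if not yet present; repeated requests return the same slot.
   std::shared_ptr<const ToySlot> RequestField(const std::string &name)
   {
      auto &slot = fSlots[name];
      if (!slot)
         slot = std::make_shared<ToySlot>();
      return slot;
   }

   // Loads the entry with global index idx, marking absent fields invalid.
   void LoadEntry(std::size_t idx)
   {
      for (const auto &ntuple : fChain) {
         if (idx < ntuple.fNEntries) {
            for (auto &kv : fSlots) {
               auto col = ntuple.fColumns.find(kv.first);
               kv.second->fIsValid = (col != ntuple.fColumns.end());
               if (kv.second->fIsValid)
                  kv.second->fValue = col->second.at(idx);
            }
            return;
         }
         idx -= ntuple.fNEntries; // continue in the next chain member
      }
      throw std::out_of_range("entry index beyond the end of the chain");
   }
};
```

A caller would request its fields once before the loop, then call `LoadEntry(i)` for each index and check `fIsValid` before reading `fValue`, which mirrors the `HasValue()` check in the new interface.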
This tracks the provenance of a field through composed processors, so
its data is correctly read from the actual on-disk field.
@enirolf enirolf force-pushed the ntuple-processor-value-handling branch from 035f552 to 0520c5f Compare October 28, 2025 07:53
@dpiparo dpiparo merged commit a1c112f into root-project:master Oct 29, 2025
41 of 50 checks passed

Labels

clean build Ask CI to do non-incremental build on PR in:RNTuple

7 participants