[ntuple] Change entry handling in RNTupleProcessor to properly handle missing values#19111
Conversation
Test Results 22 files 22 suites 3d 18h 48m 58s ⏱️ For more details on these failures, see this check. Results for commit 0520c5f. ♻️ This comment has been updated with latest results. |
36fb077 to
fd395d3
Compare
fd395d3 to
bc98934
Compare
hahnjo
left a comment
There was a problem hiding this comment.
First code review - I have two overall design questions that I will post separately so their inline comments are not mixed with these.
hahnjo
left a comment
There was a problem hiding this comment.
My two questions:
- Do we (want to) support conversion and schema evolution when reading fields via the
RNTupleProcessor? (triggered by #19111 (comment)) - At the moment, every
RNTupleProcessorhas its ownRNTupleProcessorEntrythat are "connected together" via shared pointers. However, the field validity is not shared which could lead to problems with nested joins where the outer one is valid, but the inner one does not find its value (see also https://github.com/root-project/root/pull/19111/files#r2194283077).
In my opinion we have no choice but to (eventually) support schema evolution (otherwise the |
There was a problem hiding this comment.
Thank you for you extensive review, @hahnjo! To answer your questions:
- Do we (want to) support conversion and schema evolution when reading fields via the RNTupleProcessor? (triggered by #19111 (comment))
Yes, to echo what Philippe wrote, we have no choice (not just for schema evolution, but also to allow alternative field types through e.g., RDF). I admit that I didn't yet take this into account in this particular PR, but I also think what with the proposed interface change there's no blocker not to have this. I will address it and depending on how much it would blow up this PR even more, I'll either add it here or in a follow-up PR.
- At the moment, every RNTupleProcessor has its own RNTupleProcessorEntry that are "connected together" via shared pointers. However, the field validity is not shared which could lead to problems with nested joins where the outer one is valid, but the inner one does not find its value (see also https://github.com/root-project/root/pull/19111/files#r2194283077).
That's a good point, and something that needs to indeed also be addressed more properly. For this one as well, I will investigate more and either add it to this PR or create a follow-up.
bc98934 to
4ce35af
Compare
There was a problem hiding this comment.
The work is already at a really advanced state and the direction is very positive, thanks! A few minor suggestions/considerations. Furthermore, I don't see the new API to check for missing values e.g. FieldIsValid` being used in the tests, would be good to have at least one simple test already in this PR. A more thorough testing campaign could be added later.
EDIT: Sorry, I was looking only for FieldIsValid, I see there are uses of HasValue in the tests 👍
jblomer
left a comment
There was a problem hiding this comment.
I think the tutorials look much cleaner with this change!
4d01e7c to
9e5fd73
Compare
9e5fd73 to
722cd27
Compare
74918b9 to
28c5696
Compare
28c5696 to
c8f7f8a
Compare
c8f7f8a to
339a53a
Compare
vepadulano
left a comment
There was a problem hiding this comment.
Thank you! This is already very mature and close to being ready. A few comments from my side before the final push.
339a53a to
9e7877f
Compare
vepadulano
left a comment
There was a problem hiding this comment.
Great work, thank you! Please try to simplify the commit history where it makes sense before merging.
...instead of the full entry. This design was already a bit questionable, because it would allow users to call `GetPtr` or related functions *inside* of the loop, which we want to avoid. Other than that, there is (currently) no additional useful information one can obtain from the full entry. Therefore, instead we now only return the index of the current entry.
This will reduce the amount for entry number arithmetic in (future) tests
9e7877f to
035f552
Compare
Interal class that largely mirrors the REntry interface, but with additional functionality specifically for the `RNTupleProcessor`. Most notably, fields and their values present in the entry have a notion of validity. Only when a field is valid, its value can be read.
After a chain or join operation, is is not guaranteed anymore that fields always have a (valid) value in every entry. The existing `RENtry` does not provide enough functionality to handle this, so instead we implement an (internal) version of it (the `RNTupleProcessorEntry`), which is specialized for the `RNTupleProcessor` to also keep track of the validity of its values, as well as a class (the `RNTupleProcessorOptionalPtr`) that can be used to read field values from a processor with proper validity checks. A major effect of this change is that the RNTupleModel is now hidden from the user, i.e., it is not possible anymore to provide one through the factory methods. The entry is instead filled on a field-by-field basis, through the user-facing `RequestField` method. This method adds the field to the entry (if not yet present), and returns an `RNTupleProcessorOptionalPtr`. For advanced use cases, it is possible to provide a raw pointer created by the user for reading the field values. This introduces a risk of still reading wrong values when this pointer is accessed directly, so the prescribed way to read entry data during processing is always through the `RNTupleProcessorOptionalPtr`. This feature will therefore not be publicly advertised and should only be used by expert users.
This tracks the provenance of a field through composed processors, so its data is correctly read from the actual on-disk field.
See commit e5fba74 for the "why".
035f552 to
0520c5f
Compare
This PR adds a significant change to the way field values can be accessed from an
RNTupleProcessor, and, under the hood, how theRNTupleProcessorhandles the entries that manage these values. The motivation behind this change is that the current processor interface has two main shortcomings:GetPtron the entry inside the event loop which is generally considered a bad practice.To address these shortcomings, we introduce two new classes, the
RNTupleProcessorEntryand theRNTupleProcessorOptionalPtr, which are described in more detail below. With the introduction of these classes, the use of theRNTupleModelis removed from the public interface entirely.The changes in this PR are also in preparation of the use of the
RNTupleProcessoras an RDataFrame data source. A minimal working example has been developed to validate that this newRNTupleProcessorinterface is compatible with the abstract interface ofRDataSource.The RNTupleProcessorEntry
The
RNTupleProcessorEntryis an internal class that largely mirrors theREntryinterface, but with additional functionality for theRNTupleProcessor. Most notably, fields and their values present in the entry have a notion of validity. Only when a field is valid, its value can be read.In addition, it keeps track of which fields are auxiliary if a join operation is involved. This is needed in order to properly resolve from which underlying RNTuple the field data should be read. Since the RNTupleProcessor is composable, this is also relevant for the chain-based processor.
The RNTupleProcessorOptionalPtr
The
RNTupleProcessorOptionalPtris what the user or upstream application will actually iteract with to read field values from a processor entry. It is obtained when registering a field in the processor throughRegisterField(), which has to be done before and processor iteration starts (because, as stated before, afterwards the entry will be frozen).The
RNTupleProcessorOptionalPtrcan be used to get a const reference or a pointer to the field's value. If the field is marked invalid in the entry at the time of access, the const reference access method will throw an exception, whereas the pointer access method returns anullptr.Code example
Old interface
New interface