[ntuple] add streaming vector tutorial (v2) #19748
Conversation
Branch updated: acbebf5 → 311a905
// so that the entire vector never needs to stay in memory.
// Note that we don't need to implement loading chunks of data explicitly. Simply by asking for a single vector element
// at every iteration step, the RNTuple views will take care of keeping only the currently required data pages
// in memory.
It would be helpful here to say how one can tune the maximum amount of memory used by the StreamingVector
I think the important point is to turn off the cluster cache, otherwise we will load entire clusters anyway. Beyond that, the memory consumption should be that of a page, which is determined by the input file. The reading code doesn't have much control over it...
Test Results: 21 files, 21 suites, 3d 17h 35m 53s ⏱️. For more details on these failures, see this check. Results for commit 311a905.
Thanks for the other fixes to RNTupleLocalRange and RNTupleCollectionView; we may consider backporting these to the next release of v6.36. Some comments inline for the ntpl016_streaming_vector.C tutorial.
constexpr char const *kNTupleName = "ntpl";
constexpr char const *kFieldName = "LargeVector";
constexpr unsigned int kNEvents = 10;
constexpr unsigned int kVectorSize = 1000000;
The alma9 modules_off (runtime_cxxmodules=Off) build is slightly unhappy about this line, because TGeometry.h defines const Int_t kVectorSize = 3; I guess we have to rename the constant here...
#include <vector>
#include <utility>

constexpr char const *kFileName = "ntpl015_streaming_vector.root";
Suggested change:
- constexpr char const *kFileName = "ntpl015_streaming_vector.root";
+ constexpr char const *kFileName = "ntpl016_streaming_vector.root";
// A lightweight iterator used in StreamingVectorView::begin() and StreamingVectorView::end().
// Used to iterate over the elements of an RNTuple on-disk vector for a certain entry.
// Dereferencing the iterator returns the corresponding value of the item view.
class iterator {
Should probably have the usual iterator using definitions, for std::iterator_traits.
Replaces #17139, with the comments in that PR incorporated.