Commit 4d196ea

jblomer authored and dpiparo committed
[NFC][ntuple] add schema evolution docs
1 parent 3c7d388 commit 4d196ea

1 file changed

+243
-0
lines changed

tree/ntuple/doc/SchemaEvolution.md

@@ -0,0 +1,243 @@
# Schema Evolution

Schema evolution is the capability of the ROOT I/O to read data
into in-memory models that are different from but compatible with the on-disk schema.

Schema evolution allows for data models to evolve over time
such that old data can be read into current models ("backward compatibility")
and old software can read newer data models ("forward compatibility").
For instance, data model authors may over time add and reorder class members, change data types
(e.g. `std::vector<float>` --> `ROOT::RVec<double>`), rename classes, etc.

ROOT applies automatic schema evolution rules for common, safe and unambiguous cases.
Users can complement the automatic rules with manual schema evolution ("I/O customization rules")
where custom code snippets implement the transformation logic.
If neither the automatic rules nor any of the provided I/O customization rules suffice
to transform the on-disk schema into the in-memory model, ROOT will error out and refrain from reading data.

This document describes schema evolution support implemented in RNTuple.
For the most part, schema evolution works identically across the different ROOT I/O systems (TFile, TTree, RNTuple).
The exceptions are listed in the last section of this document.

## Automatic schema evolution

ROOT applies a number of rules to read data transparently into in-memory models
that are not an exact match to the on-disk schema.
The automatic rules apply recursively to compound types (classes, tuples, collections, etc.);
the outer types are evolved before the inner types.

Automatic schema evolution rules transform native _types_ as well as the _shape_ of user-defined classes
as listed in the following, exhaustive tables.

### Class shape transformations

User-defined classes can automatically evolve their layout in the following ways.
Note that users should increase the class version number when the layout changes.

| Layout Change                            | Comment                                              |
| ---------------------------------------- | ---------------------------------------------------- |
| Remove member                            | Match by member name                                 |
| Add member                               | Match by member name, new member default-initialized |
| Reorder members                          | Match by member name                                 |
| Remove all base classes                  |                                                      |
| Add base class(es) where there were none | New base class members default-initialized           |

Reordering of base classes and incremental addition or removal of base classes are currently unsupported
but may be supported in future RNTuple versions.

The class shape evolution also applies to untyped records.
Note that untyped records cannot have base classes.

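The match-by-name behaviour listed in the table can be illustrated with a small, self-contained C++ sketch. It does not use ROOT and is not how RNTuple is implemented; `OnDiskRecord`, `EventV2` and `ReadEventV2` are hypothetical names invented for this illustration, and the on-disk record is modelled as a plain name-to-value map.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for an on-disk record: member name -> stored value.
using OnDiskRecord = std::map<std::string, double>;

// Hypothetical in-memory class, version 2. Compared to an assumed version 1
// with members {fX, fY, fOld}, it dropped fOld, added fZ, and reordered the rest.
struct EventV2 {
   double fZ = 0.0; // added member: not on disk, keeps its default value
   double fY = 0.0;
   double fX = 0.0;
};

// Fill each in-memory member from the on-disk member of the same name.
// On-disk members without an in-memory counterpart (fOld) are never looked up,
// so they are simply skipped.
EventV2 ReadEventV2(const OnDiskRecord &onDisk)
{
   EventV2 event;
   auto assignIfPresent = [&onDisk](const std::string &name, double &member) {
      const auto it = onDisk.find(name);
      if (it != onDisk.end())
         member = it->second;
   };
   assignIfPresent("fX", event.fX);
   assignIfPresent("fY", event.fY);
   assignIfPresent("fZ", event.fZ);
   return event;
}
```

Reading a record written with the assumed version 1 layout fills `fX` and `fY` by name regardless of their position, leaves `fZ` at its default, and ignores `fOld`.
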
### Type transformations

ROOT transparently reads into in-memory types that are different from but compatible with the on-disk type.
In the following tables, `T'` denotes a type that is compatible with `T`.
This includes user-defined types that are related via a renaming rule.

#### Plain fields

| In-memory type              | Compatible on-disk types    | Comment                 |
| --------------------------- | --------------------------- | ------------------------|
| `bool`                      | `char`                      |                         |
|                             | `std::[u]int[8,16,32,64]_t` |                         |
|                             | enum                        |                         |
|-----------------------------|-----------------------------|-------------------------|
| `char`                      | `bool`                      |                         |
|                             | `std::[u]int[8,16,32,64]_t` | with bounds check       |
|                             | enum                        | with bounds check       |
|-----------------------------|-----------------------------|-------------------------|
| `std::[u]int[8,16,32,64]_t` | `bool`                      |                         |
|                             | `char`                      |                         |
|                             | `std::[u]int[8,16,32,64]_t` | with bounds check       |
|                             | enum                        | with bounds check       |
|-----------------------------|-----------------------------|-------------------------|
| enum                        | enum of different type      | with bounds check       |
|                             |                             | on underlying integer   |
|-----------------------------|-----------------------------|-------------------------|
| `float`                     | `double`                    | with fp class check[^1] |
|-----------------------------|-----------------------------|-------------------------|
| `double`                    | `float`                     |                         |
|-----------------------------|-----------------------------|-------------------------|
| `std::atomic<T>`            | `T'`                        |                         |

[^1]: The floating point class check ensures that the on-disk value and the in-memory value are of the same nature
(NaN, +/-inf, zero, underflow, or normal value).

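As an illustration of what a "bounds check" and a "floating point class check" can look like, here is a minimal C++17 sketch using only the standard library. It is an assumption about the general idea, not ROOT's actual implementation; the helper names `FitsInto`, `CheckedIntCast` and `CheckedFloatCast` are hypothetical, and how ROOT reports a failed check is not modelled here (the sketch simply returns an empty `std::optional`).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>
#include <type_traits>

// Returns true if the integer `value` is representable in the integer type To.
template <typename To, typename From>
bool FitsInto(From value)
{
   static_assert(std::is_integral_v<To> && std::is_integral_v<From>, "integer types only");
   if constexpr (std::is_signed_v<From> == std::is_signed_v<To>) {
      return value >= std::numeric_limits<To>::min() && value <= std::numeric_limits<To>::max();
   } else if constexpr (std::is_signed_v<From>) {
      // signed -> unsigned: negative values never fit
      return value >= 0 && static_cast<std::make_unsigned_t<From>>(value) <= std::numeric_limits<To>::max();
   } else {
      // unsigned -> signed: compare against the largest positive target value
      return value <= static_cast<std::make_unsigned_t<To>>(std::numeric_limits<To>::max());
   }
}

// Integer conversion with bounds check.
template <typename To, typename From>
std::optional<To> CheckedIntCast(From value)
{
   if (!FitsInto<To>(value))
      return std::nullopt;
   return static_cast<To>(value);
}

// double -> float conversion that succeeds only if the value keeps its
// floating point class (NaN, infinite, zero, subnormal, normal).
std::optional<float> CheckedFloatCast(double value)
{
   // A finite value beyond the float range would become +/-inf; reject it
   // before casting, because the out-of-range conversion is not portable.
   if (std::isfinite(value) && std::fabs(value) > std::numeric_limits<float>::max())
      return std::nullopt;
   const float narrowed = static_cast<float>(value);
   if (std::fpclassify(value) != std::fpclassify(narrowed))
      return std::nullopt;
   return narrowed;
}
```

With these helpers, an on-disk `std::int32_t` holding 200 does not convert to an in-memory `std::int8_t`, and an on-disk `double` such as `1e-300` does not convert to `float` because it would change from a normal value to zero.
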
#### Variable-length collections

The different variable-length collections have the same on-disk representation
and thus evolve naturally into one another.
However, only those transformations that are guaranteed to work at runtime will be performed.
For instance, a set can always be read as a vector but a vector does not necessarily fulfil the set property.

| In-memory type                   | Compatible on-disk types             | Comment                               |
| -------------------------------- | ------------------------------------ | ------------------------------------- |
| `std::vector<T>`                 | `ROOT::RVec<T'>`                     |                                       |
|                                  | `std::array<T', N>`                  |                                       |
|                                  | `std::[unordered_][multi]set<T'>`    |                                       |
|                                  | `std::[unordered_][multi]map<K',V'>` | only `T` = `std::[pair,tuple]<K,V>`   |
|                                  | `std::optional<T'>`                  |                                       |
|                                  | `std::unique_ptr<T'>`                |                                       |
|                                  | User-defined collection of `T'`      |                                       |
|                                  | Untyped collection of `T'`           |                                       |
|----------------------------------|--------------------------------------|---------------------------------------|
| `ROOT::RVec<T>`                  | `std::vector<T'>`                    | with size check                       |
|                                  | `std::array<T', N>`                  | with size check                       |
|                                  | `std::[unordered_][multi]set<T'>`    | with size check                       |
|                                  | `std::[unordered_][multi]map<K',V'>` | only `T` = `std::[pair,tuple]<K,V>`,  |
|                                  |                                      | with size check                       |
|                                  | `std::optional<T'>`                  |                                       |
|                                  | `std::unique_ptr<T'>`                |                                       |
|                                  | User-defined collection of `T'`      | with size check                       |
|                                  | Untyped collection of `T'`           | with size check                       |
|----------------------------------|--------------------------------------|---------------------------------------|
| `std::[unordered_]set<T>`        | `std::[unordered_]set<T'>`           |                                       |
|                                  | `std::[unordered_]map<K',V'>`        | only `T` = `std::[pair,tuple]<K,V>`   |
|----------------------------------|--------------------------------------|---------------------------------------|
| `std::[unordered_]multiset<T>`   | `ROOT::RVec<T'>`                     |                                       |
|                                  | `std::vector<T'>`                    |                                       |
|                                  | `std::array<T', N>`                  |                                       |
|                                  | `std::[unordered_][multi]set<T'>`    |                                       |
|                                  | `std::[unordered_][multi]map<K',V'>` | only `T` = `std::[pair,tuple]<K,V>`   |
|                                  | User-defined collection of `T'`      |                                       |
|                                  | Untyped collection of `T'`           |                                       |
|----------------------------------|--------------------------------------|---------------------------------------|
| `std::[unordered_]map<K,V>`      | `std::[unordered_]map<K',V'>`        |                                       |
|----------------------------------|--------------------------------------|---------------------------------------|
| `std::[unordered_]multimap<K,V>` | `ROOT::RVec<T>`                      | only `T` = `std::[pair,tuple]<K,V>`   |
|                                  | `std::vector<T>`                     | only `T` = `std::[pair,tuple]<K,V>`   |
|                                  | `std::array<T, N>`                   | only `T` = `std::[pair,tuple]<K,V>`   |
|                                  | `std::[unordered_][multi]set<T>`     | only `T` = `std::[pair,tuple]<K,V>`   |
|                                  | `std::[unordered_][multi]map<K',V'>` |                                       |
|                                  | User-defined collection of `T`       | only `T` = `std::[pair,tuple]<K,V>`   |
|                                  | Untyped collection of `T`            | only `T` = `std::[pair,tuple]<K,V>`   |

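The asymmetry between sets and vectors mentioned above can be seen in a few lines of standard C++. This is only a sketch of the reasoning behind the table, not ROOT code; `ReadSetAsVector` and `IsLosslessAsSet` are hypothetical helper names.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Reading a set as a vector always works: every element is kept.
std::vector<int> ReadSetAsVector(const std::set<int> &onDisk)
{
   return std::vector<int>(onDisk.begin(), onDisk.end());
}

// Reading a vector as a set may silently drop duplicates.
// Returns true only if no element would be lost.
bool IsLosslessAsSet(const std::vector<int> &onDisk)
{
   const std::set<int> asSet(onDisk.begin(), onDisk.end());
   return asSet.size() == onDisk.size();
}
```

Because the second direction depends on the stored values and not only on the types, it is not among the automatic transformations for `std::[unordered_]set<T>` in the table.
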
#### Fixed-size collections

There is no special automatic evolution for fixed-length collections (`std::array<...>`, `std::bitset<...>`).
The length of the array must not change and there is no automatic transformation from variable-length to
fixed-length collections.
C-style arrays and `std::array<...>` of the same type and length can be used interchangeably.

#### Nullable fields

| In-memory type       | Compatible on-disk types |
| -------------------- | ------------------------ |
| `std::optional<T>`   | `std::unique_ptr<T'>`    |
|                      | `T'`                     |
|----------------------|--------------------------|
| `std::unique_ptr<T>` | `std::optional<T'>`      |
|                      | `T'`                     |

#### Records

| In-memory type              | Compatible on-disk types               |
| --------------------------- | -------------------------------------- |
| `std::pair<T,U>`            | `std::tuple<T',U'>`                    |
|-----------------------------|----------------------------------------|
| `std::tuple<T,U>`           | `std::pair<T',U'>`                     |
|-----------------------------|----------------------------------------|
| Untyped record              | User-defined class of compatible shape |

Note that for emulated classes, the in-memory untyped record is constructed from on-disk information.

#### Additional rules

All on-disk types `std::atomic<T'>` can be read into a `T` in-memory model.

If a class property changes from using an RNTuple streamer field to using a regular RNTuple class field,
existing files with on-disk streamer fields will continue to read as streamer fields.
This can be seen as "schema evolution out of streamer fields".

## Manual schema evolution (I/O customization rules)

ROOT I/O customization rules allow for custom code handling the transformation
from the on-disk schema to the in-memory model.
Customization rules are part of the class dictionary.
For the exact syntax of customization rules, please refer to the [ROOT manual](https://root.cern/manual/io/#dealing-with-changes-in-class-layouts-schema-evolution).

Generally, customization rules consist of
- A target class.
- Target members of the target class, i.e. those class members whose value is set by the rule.
  Target members must be direct members, i.e. not part of a base class.
- A source class (possibly having a different class name than the target class)
  together with class versions or class checksums
  that describe all the possible on-disk class versions the rule applies to.
  Note that the class checksum can be retrieved, e.g., from `TClass::GetCheckSum()`.
- Source members of the source class; the given source members will be read as the given type.
  The source member will undergo schema evolution before being passed to the rule's function.
  Source members can also be from a base class.
  Note that there is no way to specify a base class member that has the same name as a member in the derived class.
- The custom code snippet; the code snippet has access to the (whole) target object and to the given source members.

For illustration purposes, here is a concrete example of a customization rule:
```
#pragma read \
targetClass = "Coordinates" \
target = "fPhi,fR" \
sourceClass = "Coordinates" \
version = "[3]" \
source = "float fX; float fY" \
include = "cmath" \
code = "{ fR = sqrt(onfile.fX * onfile.fX + onfile.fY * onfile.fY); fPhi = atan2(onfile.fY, onfile.fX); }"
```

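To see what the code snippet of this rule computes, the transformation can be written as ordinary C++ without any ROOT machinery. The sketch below is an illustration only: the struct `CoordinatesOnFile` and the function `ApplyRule` are hypothetical stand-ins for the `onfile` object and the generated rule function, and the member layout of the in-memory `Coordinates` class is assumed from the rule's `target` list.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical view of the on-disk layout (class version 3): Cartesian coordinates.
struct CoordinatesOnFile {
   float fX;
   float fY;
};

// Assumed current in-memory layout: polar coordinates.
struct Coordinates {
   float fR = 0.f;
   float fPhi = 0.f;
};

// Plays the role of the rule's code snippet: it reads the source members
// through `onfile` and sets the two target members of the target object.
void ApplyRule(const CoordinatesOnFile &onfile, Coordinates &target)
{
   target.fR = std::sqrt(onfile.fX * onfile.fX + onfile.fY * onfile.fY);
   target.fPhi = std::atan2(onfile.fY, onfile.fX);
}
```

For an on-disk point with `fX = 3` and `fY = 4`, the target object ends up with `fR = 5` and `fPhi = atan2(4, 3)`.
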
At runtime, for any given target member there must be at most one applicable rule.
A source member can be read into any type compatible with its on-disk type
but any given source member can only be read into one type for a given target class
(i.e. multiple rules for the same target/source class must not use different types for the same source member).

There are two special types of rules:
1. Pure class rename rules consisting only of source and target class
2. Whole-object rules that have no target members

Class rename rules (pure or not) are not transitive
(if in-memory `A` can read from on-disk `B` and in-memory `B` can read from on-disk `C`,
in-memory `A` cannot automatically read from on-disk `C`).

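The non-transitivity of rename rules amounts to a direct lookup without following chains. A minimal sketch, assuming a hypothetical registry type `RenameRules` and helper `CanReadFrom` that are not part of ROOT:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical registry: in-memory class name -> on-disk class names
// that a rename rule allows it to read from.
using RenameRules = std::map<std::string, std::set<std::string>>;

// Only direct rules count (plus the trivial same-name case);
// chains of rules are deliberately not followed.
bool CanReadFrom(const RenameRules &rules, const std::string &inMemory, const std::string &onDisk)
{
   if (inMemory == onDisk)
      return true;
   const auto it = rules.find(inMemory);
   return it != rules.end() && it->second.count(onDisk) > 0;
}
```

With rules `A` from `B` and `B` from `C` registered, `CanReadFrom(rules, "A", "C")` is false; reading on-disk `C` into in-memory `A` would need its own explicit rule.
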
Note that customization rules operate on partially read objects.
Customization rules are executed after all members not subject to customization rules have been read from disk.
Whole-object rules are executed after other rules.
Otherwise, the scheduling of rules is unspecified.

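The one ordering guarantee stated here, whole-object rules last, can be sketched as a partition of the rule list. The `Rule` struct and `ScheduleRules` function below are hypothetical and only illustrate the constraint; they do not describe how ROOT schedules rules internally.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical rule descriptor: a rule without target members is a whole-object rule.
struct Rule {
   std::string fName;
   std::vector<std::string> fTargetMembers;
   bool IsWholeObject() const { return fTargetMembers.empty(); }
};

// Moves whole-object rules behind all other rules. Keeping the relative order
// inside each group (stable partition) is an arbitrary choice of this sketch,
// since the order within a group is unspecified.
std::vector<Rule> ScheduleRules(std::vector<Rule> rules)
{
   const auto firstWholeObject = std::stable_partition(
      rules.begin(), rules.end(), [](const Rule &r) { return !r.IsWholeObject(); });
   // Sanity check: everything from firstWholeObject onwards is a whole-object rule.
   assert(std::all_of(firstWholeObject, rules.end(), [](const Rule &r) { return r.IsWholeObject(); }));
   return rules;
}
```

Code that relies on any ordering beyond "member rules first, whole-object rules afterwards" would be depending on unspecified behaviour.
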
## Interplay between automatic and manual schema evolution

The target members of I/O customization rules are exempt from automatic schema evolution
(this applies to the corresponding field of the target member and all its subfields).
Otherwise, automatic and manual schema evolution work side by side.
For instance, a renamed class is still subject to automatic schema evolution.

The source member of a customization rule is subject to the same automatic and manual schema evolution rules
as if it were read normally, e.g. in an `RNTupleView`.

## Schema evolution differences between RNTuple and Classic I/O

In contrast to RNTuple, TTree and TFile also apply the following automatic schema evolution rules:
- Conversion between floating point and integer types
- Conversion from `unique_ptr<T>` --> `T'`
- Complete conversion matrix of all collection types
- Insertion and removal of intermediate classes
- Move of a member between base class and derived class
- Reordering of base classes
