Backport CORE-290 to 2.29 branch #5676
This pull request resolves CORE-290 on the `release-2.29` branch for backport. It combines three pull requests to `main`, all of which were used in the resolution of CORE-290: #5670, #5675, #5655.
Merge conflicts were all trivial and/or in test code.
Report
In CORE-290 a customer reported corrupt arrays after running consolidation. The symptom was memory allocation errors when opening an array. The root cause turned out to be that consolidation was writing new fragments whose fragment domain did not match the number of tiles in the fragment.
Root cause analysis
Stepping through a minimal reproducer revealed that consolidation is not at fault at all; the culprit is the global order writer's `max_fragment_size_` field. This field, presumably meant for consolidation only but valid for other writes as well, instructs the writer to write multiple fragments, each under the requested size, as necessary. When a write splits its data into multiple fragments this way, nothing more needs to be done for sparse arrays. But dense fragment metadata relies on the domain to determine the number of tiles, and the global order writer did not update the fragment metadata domain when splitting its write into multiple fragments.
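To illustrate the mismatch, here is a minimal sketch (hypothetical types, not TileDB's actual code) of how a dense reader derives the tile count from the fragment domain:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical illustration: for a dense fragment, the number of tiles
// is derived from the fragment domain, one tile per extent-sized slab
// in each dimension.
struct Dim {
  int64_t lo, hi;  // inclusive domain bounds
  int64_t extent;  // tile extent
};

int64_t tiles_in_domain(const std::vector<Dim>& domain) {
  int64_t tiles = 1;
  for (const auto& d : domain) {
    const int64_t span = d.hi - d.lo + 1;
    tiles *= (span + d.extent - 1) / d.extent;  // ceiling division
  }
  return tiles;
}

// If a write into domain [1,100] with extent 10 is split so that one
// fragment physically contains only 4 tiles but still records the full
// domain, tiles_in_domain() reports 10 tiles: exactly the mismatch
// behind the allocation errors reported in CORE-290.
```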
This pull request fixes that issue by ensuring that the global order writer updates the fragment metadata domain to reflect what was actually written into that fragment.
This is easier said than done. In CORE-290 I offered four ways we could fix this; the chosen solution is described below.
Design
The fragment metadata domain is a bounding rectangle. This means that the global order writer must split the tiles of its input into fragments at tile boundaries which bisect the bounding rectangle into two smaller bounding rectangles.
To do so, we add a first pass, `identify_fragment_tile_boundaries`, which returns a list of tile offsets where new fragments will begin. Upon finishing a fragment, we use that tile offset to determine which rectangle within the target subarray the fragment actually represents, and update the fragment metadata accordingly. The new function `is_rectangular_domain` determines whether a `(start_tile, num_tiles)` pair identifies a rectangle, and `domain_tile_offset` computes that rectangle.
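As a sketch of the rectangle condition (the name matches the PR, but the signature and the 2-D restriction here are illustrative assumptions): in a row-major tile layout, a contiguous run of tiles forms a rectangle only if it stays within one row or covers whole, aligned rows.

```cpp
#include <cstdint>

// 2-D intuition sketch, not the PR's exact code: tiles are laid out
// row-major in a grid with `cols` tiles per row. A contiguous run of
// `num_tiles` tiles starting at flat offset `start_tile` forms a
// rectangle iff it either stays inside one row, or begins at a row
// boundary and covers a whole number of rows.
bool is_rectangular_domain_2d(
    int64_t cols, int64_t start_tile, int64_t num_tiles) {
  const int64_t col = start_tile % cols;
  if (col + num_tiles <= cols)
    return true;  // sub-row rectangle (1 x num_tiles)
  return col == 0 && num_tiles % cols == 0;  // whole aligned rows
}
```

With a predicate like this, the first pass can end a fragment only at offsets where the predicate holds, and `domain_tile_offset` can then map the `(start_tile, num_tiles)` pair back to coordinate bounds for the fragment metadata.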
Much of the complexity comes from a usage of the global order writer which does happen in consolidation: multi-part writes. A user (or a consolidation operation) can set a domain D which it intends to write into, and then fill in all of the cells over multiple submit calls which stream in the cells to write. These cells are not required to be tile-aligned. Because of that, and because of the need to write rectangular fragments, a single submit cannot always determine whether a tail of tiles belongs to its current fragment or must be deferred to the next. To get around this, we keep those tiles in memory in the `global_write_state_` and prepend them to the user input in the next submit.
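A hedged sketch of that carry-over mechanism, with illustrative names standing in for the PR's actual types:

```cpp
#include <deque>
#include <iterator>
#include <vector>

// Sketch only: tiles whose fragment assignment cannot be decided yet
// are kept in the write state and re-fed on the next submit, so that
// fragment boundaries can still land on rectangular splits.
struct Tile { /* serialized tile data */ };

struct GlobalWriteState {
  std::deque<Tile> deferred_tiles;  // tail tiles from the last submit
};

std::vector<Tile> gather_input(
    GlobalWriteState& state, std::vector<Tile> user_tiles) {
  // Prepend tiles deferred by the previous submit to this submit's input.
  std::vector<Tile> input(
      std::make_move_iterator(state.deferred_tiles.begin()),
      std::make_move_iterator(state.deferred_tiles.end()));
  state.deferred_tiles.clear();
  input.insert(
      input.end(),
      std::make_move_iterator(user_tiles.begin()),
      std::make_move_iterator(user_tiles.end()));
  return input;  // after processing, any undecidable tail goes back
                 // into state.deferred_tiles
}
```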
Testing
We add unit tests for `is_rectangular_domain` and `domain_tile_offset`, including both example tests and rapidcheck properties that assert general claims.
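For flavor, here is a property in the style those rapidcheck tests might take; this is an illustrative sketch against the 2-D predicate from the Design section, not the PR's actual test code.

```cpp
#include <cstdint>
#include <rapidcheck.h>

// 2-D sketch predicate repeated here so the example is self-contained.
static bool is_rectangular_domain_2d(
    int64_t cols, int64_t start_tile, int64_t num_tiles) {
  const int64_t col = start_tile % cols;
  if (col + num_tiles <= cols)
    return true;
  return col == 0 && num_tiles % cols == 0;
}

int main() {
  // Property: any run of whole rows starting at a row boundary is
  // rectangular, for every grid width, start row, and row count.
  rc::check(
      "whole aligned rows are rectangular",
      [](uint16_t cols_raw, uint16_t start_row, uint16_t num_rows_raw) {
        const int64_t cols = 1 + cols_raw % 64;
        const int64_t num_rows = 1 + num_rows_raw % 64;
        RC_ASSERT(is_rectangular_domain_2d(
            cols, int64_t{start_row % 64} * cols, num_rows * cols));
      });
  return 0;
}
```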
We add tests for the global order writer to make broad claims about what the writer is supposed to do with respect to the `max_fragment_size` parameter, with examples and a rapidcheck test to exercise these claims.
Since the original report originated from consolidation, we also add some consolidation tests mimicking the customer input.
TYPE: BUG
DESC: Backport CORE-290 to 2.29