-
Notifications
You must be signed in to change notification settings - Fork 30
[c++] Optimize memory management on write path #4311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4311 +/- ##
==========================================
- Coverage 86.37% 86.36% -0.02%
==========================================
Files 139 140 +1
Lines 21093 21111 +18
Branches 15 17 +2
==========================================
+ Hits 18219 18232 +13
- Misses 2874 2879 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
bkmartinjr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there any measurements of the time/space impact of this PR?
I have ingested a couple of h5ad files and the result was about 20% faster with this PR with lower memory usage as well |
|
|
||
| CArrayColumnBuffer() = delete; | ||
| CArrayColumnBuffer(const CArrayColumnBuffer&) = delete; | ||
| CArrayColumnBuffer(CArrayColumnBuffer&&) = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the following warning when compiling:
/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:401:5: warning: explicitly defaulted move constructor is implicitly deleted [-Wdefaulted-function-deleted]
401 | CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
| ^
/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:375:28: note: move constructor of 'CArrayColumnBuffer' is implicitly deleted because base class 'ReadColumnBuffer' has a deleted move constructor
14d71e2 to
f2da391
Compare
1ac1e25 to
1a00005
Compare
| self._handle.submit_write() | ||
|
|
||
| # clear stored data objects | ||
| self._ref_store.clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does ManagedQuery allow buffers to be re-used across multiple submit calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffers used for write operations are not reused. The managed query only gets a view of them from wherever they come from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the C/C++ API buffers can be re-used for multi-part global order writes. But IIRC those aren't supported by Python yet, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the buffers we supply to the C++ API are owned by numpy or Arrow so we do not do anything with them other than that.
Is this just for the write path? Or does this also include reads? It might be nice to see a breakdown. |
|
Yes this was just for writes. Reads were slower before merging the different memory modes PR but after they should be on par or faster. I haven't run the benchmarks yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't finished yet, I've made my way through column_buffer.{cc,h} so far.
I have left a slew of comments but nothing particularly regarding safety, most of them are cosmetic, for which I defer to y'all as I am not a SOMA maintainer. These few are the most important:
At a higher level I don't really understand why the CArrayColumnBuffer would have any different performance characteristics than the VectorColumnBuffer - I would expect these to be making very similar patterns of memory allocations as long as the move constructors and etc are used appropriately. Data overrides my intuition of course. But I have a feeling that you don't actually need to separate these - the separation of the read and write path would be enough.
| pydict["bar"] = [4.1, 5.2, 6.3, 7.4, 8.5] | ||
| pydict["baz"] = ["apple", "ball", "cat", "dog", "egg"] | ||
| rb = pa.Table.from_pydict(pydict) | ||
| rb = pa.Table.from_pydict(pydict, schema=obs_arrow_schema.insert(0, pa.field("soma_joinid", pa.int64()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come you are choosing to add the field this way instead of inline above?
|
|
||
| if (is_var()) { | ||
| std::unique_ptr<uint64_t[]> offsets_buffer = std::make_unique_for_overwrite<uint64_t[]>(num_cells_ + 1); | ||
| std::memcpy(offsets_buffer.get(), this->offsets().data(), (num_cells_ + 1) * sizeof(uint64_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how many times a developer has written this TileDB-to-arrow buffer conversion. It's definitely a good piece of technical debt that core doesn't produce arrow, when so many upstream users want it.
| } | ||
| } else { | ||
| if (type() != TILEDB_BOOL) { | ||
| data_buffer = std::make_unique_for_overwrite<std::byte[]>(data_size_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any evidence that this mode is ever better?
The above approach conceptually allocates more memory - twice the max data size. Whereas here you would end up with a potentially smaller number of variably-sized allocations. And keep in mind if this is a read query it's probably filling the buffer with as many cells as you have room for - whatever piece is left is going to be small. In practice the allocator might give you the same size blocks anyway.
Instead of varying by mode you may want to make the choice based on the ratio of data_size_ to max_data_size_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data_size_ to max_data_size_ is a much better approach that is planned for a follow up PR. By default the allocated memory per column is 1GB and for small reads it is inefficient as they have the same lifetime as the resulting Arrow table. These overallocations allocation are most of the time only virtual allocations there has been random memory errors that may be related.
| return std::span<uint8_t>(validity_.data(), num_cells_); | ||
| } | ||
|
|
||
| std::unique_ptr<IArrowBufferStorage> VectorColumnBuffer::export_buffers() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost entirely the same as the CBuffer one, I suggest using a helper function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VectorColumnBuffer will be removed entirelly once CArrayColumnBuffer has matured
Add test for dictionary casting from #4311 Co-authored-by: XanthosXanthopoulos <[email protected]>
…4359) * (WIP) Safe-cast pyarrow tables on write Still needs the following: * fix schema names for GeometryDataFrame * test unsafe casting * Fix casting for geometry dataframe outlines * Update history * Switch from deprecated `field_by_name` to `field` * Update error message and remove unneeded type declaration * Take tests from PR #4311 Add test for dictionary casting from #4311 Co-authored-by: XanthosXanthopoulos <[email protected]> * Add xfail to uncovered bug * Remove test that is checking for unsafe cast * Fix syntax for xfail --------- Co-authored-by: XanthosXanthopoulos <[email protected]>
…4359) * (WIP) Safe-cast pyarrow tables on write Still needs the following: * fix schema names for GeometryDataFrame * test unsafe casting * Fix casting for geometry dataframe outlines * Update history * Switch from deprecated `field_by_name` to `field` * Update error message and remove unneeded type declaration * Take tests from PR #4311 Add test for dictionary casting from #4311 Co-authored-by: XanthosXanthopoulos <[email protected]> * Add xfail to uncovered bug * Remove test that is checking for unsafe cast * Fix syntax for xfail --------- Co-authored-by: XanthosXanthopoulos <[email protected]> (cherry picked from commit a1f6a68)
…4359) * (WIP) Safe-cast pyarrow tables on write Still needs the following: * fix schema names for GeometryDataFrame * test unsafe casting * Fix casting for geometry dataframe outlines * Update history * Switch from deprecated `field_by_name` to `field` * Update error message and remove unneeded type declaration * Take tests from PR #4311 Add test for dictionary casting from #4311 Co-authored-by: XanthosXanthopoulos <[email protected]> * Add xfail to uncovered bug * Remove test that is checking for unsafe cast * Fix syntax for xfail --------- Co-authored-by: XanthosXanthopoulos <[email protected]> (cherry picked from commit a1f6a68)
…4359) (#4369) * (WIP) Safe-cast pyarrow tables on write Still needs the following: * fix schema names for GeometryDataFrame * test unsafe casting * Fix casting for geometry dataframe outlines * Update history * Switch from deprecated `field_by_name` to `field` * Update error message and remove unneeded type declaration * Take tests from PR #4311 Add test for dictionary casting from #4311 * Add xfail to uncovered bug * Remove test that is checking for unsafe cast * Fix syntax for xfail --------- (cherry picked from commit a1f6a68) Co-authored-by: XanthosXanthopoulos <[email protected]>
…rrow objects (#4334) * Add custom allocator for vector backed buffers, implement memory modes for TileDB to Arrow conversions * Lint fix * Migrate R to use multithreaded arrow conversion * Fix compiler warnings * Rebind buffers after each read operation * Read memory mode from config * Include Rcpp header before other headers * Store object references when setting column data until write is submitted (#4350)
44cfb0c to
ccc8b8f
Compare
| std::shared_ptr<T> casted_column = std::dynamic_pointer_cast<T>(buffers_[name]); | ||
|
|
||
| if (!casted_column) { | ||
| throw std::runtime_error("[ArrayBuffers][at] Dynamic cast failed for column '" + name + "'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you consider this an internal error or a user error?
This is not a very helpful message for the end user, it doesn't spell out clearly if there's something they can do about it. I'd suggest returning some kind of "type mismatch, expected T, found ".
If this is internal only then it's probably fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is internal only. The buffer type is chosen by us depending on query type so any failure there means we use the wrong buffer for the wrong query type
| std::launch::async, | ||
| ArrowAdapter::to_arrow, | ||
| buffer->at<ReadColumnBuffer>(name), | ||
| enumerations, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the async part of this is something you really want to lean into then you could probably make this faster by making the enumerations std::shared_future and passing them here. As written, a single enumeration in the schema creates the critical path for this whole conversion which most of the dimensions/attributes probably do not depend on.
| } | ||
| } break; | ||
| case TILEDB_BOOL: { | ||
| std::span<const bool> data_v(static_cast<const bool*>(data), data_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless the standard guarantees that bool is one byte, which a cursory google search indicates it does not, I think it should be safer to use uint8_t instead.
| // Represent the Boolean vector with, at most, the last two | ||
| // bits. In Arrow, Boolean values are LSB packed | ||
| uint8_t packed_data = 0; | ||
| for (size_t i = 0; i < count; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is count bounded to 8? I can hardly imagine that it is. This looks problematic, all of the values which don't fit in the single byte will be dropped on the floor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK TileDB Enumerations do not allow duplicate values, so for booleans the max enum size should be 2 right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well... core doesn't really do anything to validate that the data matches the logical type, so there's not actually any physical difference between BOOL and UINT8. There could definitely be values in here other than zero and one.
One could argue that we should probably never use arrow bool type, for that reason... I don't think we would need to go that far, but I do think it is important to preserve the truthiness which the keys correspond to, i.e. if the enumeration value i is nonzero then the arrow value i should be true. Even if i is greater than the cardinality of a boolean.
|
Please do not rebase in the middle of a review, it removes the old commits and I am unable to see what has changed. |
Issue and/or context: SOMA-528 SOMA-714 SOMA-688
Changes:
This PR changes the memory management for the read/write operations implemented by
ManagedQuery. Specifically:std::vectorbacked buffers with C++ arrays wrapped instd::unique_ptrSOMAArrayprovided schema to type casts in advanceNotes for Reviewer: