Skip to content

Conversation

@XanthosXanthopoulos
Copy link
Collaborator

@XanthosXanthopoulos XanthosXanthopoulos commented Nov 13, 2025

Issue and/or context: SOMA-528 SOMA-714 SOMA-688

Changes:
This PR changes the memory management for the read/write operations implemented by ManagedQuery. Specifically:

  • Replaces std::vector backed buffers with C++ arrays wrapped in std::unique_ptr
  • Optimizes null count calculation for nullable columns
  • Makes TileDB buffers to Arrow table conversion multithreaded
  • When possible the writes are now zero copy. Passing temporary object to set the data buffer for writes will crash the program because setting the buffers and writing them to TileDB is not an atomic operation
  • Removes implicit casting of numeric data when writing to TileDB. When passing data to read/write you should use the SOMAArray provided schema to type casts in advance
  • Fix index casting when writing dictionaries
  • Properly write validity buffers for nullable enumerated columns

Notes for Reviewer:

@codecov
Copy link

codecov bot commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 82.85714% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.36%. Comparing base (e5a14aa) to head (73577bc).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4311      +/-   ##
==========================================
- Coverage   86.37%   86.36%   -0.02%     
==========================================
  Files         139      140       +1     
  Lines       21093    21111      +18     
  Branches       15       17       +2     
==========================================
+ Hits        18219    18232      +13     
- Misses       2874     2879       +5     
Flag Coverage Δ
python 89.01% <81.81%> (-0.06%) ⬇️
r 84.99% <84.61%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 89.01% <81.81%> (-0.06%) ⬇️
libtiledbsoma 76.58% <71.42%> (-0.66%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@XanthosXanthopoulos XanthosXanthopoulos marked this pull request as ready for review November 14, 2025 17:03
Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any measurements of the time/space impact of this PR?

@XanthosXanthopoulos
Copy link
Collaborator Author

are there any measurements of the time/space impact of this PR?

I have ingested a couple of h5ad files and the result was about 20% faster with this PR with lower memory usage as well


CArrayColumnBuffer() = delete;
CArrayColumnBuffer(const CArrayColumnBuffer&) = delete;
CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the following warning when compiling:

/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:401:5: warning: explicitly defaulted move constructor is implicitly deleted [-Wdefaulted-function-deleted]
  401 |     CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
      |     ^
/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:375:28: note: move constructor of 'CArrayColumnBuffer' is implicitly deleted because base class 'ReadColumnBuffer' has a deleted move constructor

@XanthosXanthopoulos XanthosXanthopoulos changed the title [c++][WIP] Optimize memory management on write path [c++] Optimize memory management on write path Nov 25, 2025
@jp-dark jp-dark removed the request for review from bkmartinjr December 22, 2025 13:54
self._handle.submit_write()

# clear stored data objects
self._ref_store.clear()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does ManagedQuery allow buffers to be re-used across multiple submit calls?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffers used for write operations are not reused. The managed query only gets a view of them from wherever they come from.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the C/C++ API buffers can be re-used for multi-part global order writes. But IIRC those aren't supported by Python yet, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the buffers we supply to the C++ API are owned by numpy or Arrow so we do not do anything with them other than that.

@rroelke
Copy link

rroelke commented Dec 22, 2025

are there any measurements of the time/space impact of this PR?

I have ingested a couple of h5ad files and the result was about 20% faster with this PR with lower memory usage as well

Is this just for the write path? Or does this also include reads? It might be nice to see a breakdown.

@XanthosXanthopoulos
Copy link
Collaborator Author

Yes this was just for writes. Reads were slower before merging the different memory modes PR but after they should be on par or faster. I haven't run the benchmarks yet

Copy link

@rroelke rroelke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finished yet, I've made my way through column_buffer.{cc,h} so far.

I have left a slew of comments but nothing particularly regarding safety, most of them are cosmetic, for which I defer to y'all as I am not a SOMA maintainer. These few are the most important:

At a higher level I don't really understand why the CArrayColumnBuffer would have any different performance characteristics than the VectorColumnBuffer - I would expect these to be making very similar patterns of memory allocations as long as the move constructors and etc are used appropriately. Data overrides my intuition of course. But I have a feeling that you don't actually need to separate these - the separation of the read and write path would be enough.

pydict["bar"] = [4.1, 5.2, 6.3, 7.4, 8.5]
pydict["baz"] = ["apple", "ball", "cat", "dog", "egg"]
rb = pa.Table.from_pydict(pydict)
rb = pa.Table.from_pydict(pydict, schema=obs_arrow_schema.insert(0, pa.field("soma_joinid", pa.int64())))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come you are choosing to add the field this way instead of inline above?


if (is_var()) {
std::unique_ptr<uint64_t[]> offsets_buffer = std::make_unique_for_overwrite<uint64_t[]>(num_cells_ + 1);
std::memcpy(offsets_buffer.get(), this->offsets().data(), (num_cells_ + 1) * sizeof(uint64_t));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how many times a developer has written this TileDB-to-arrow buffer conversion. It's definitely a good piece of technical debt that core doesn't produce arrow, when so many upstream users want it.

}
} else {
if (type() != TILEDB_BOOL) {
data_buffer = std::make_unique_for_overwrite<std::byte[]>(data_size_);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any evidence that this mode is ever better?
The above approach conceptually allocates more memory - twice the max data size. Whereas here you would end up with a potentially smaller number of variably-sized allocations. And keep in mind if this is a read query it's probably filling the buffer with as many cells as you have room for - whatever piece is left is going to be small. In practice the allocator might give you the same size blocks anyway.
Instead of varying by mode you may want to make the choice based on the ratio of data_size_ to max_data_size_

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data_size_ to max_data_size_ is a much better approach that is planned for a follow up PR. By default the allocated memory per column is 1GB and for small reads it is inefficient as they have the same lifetime as the resulting Arrow table. These overallocations allocation are most of the time only virtual allocations there has been random memory errors that may be related.

return std::span<uint8_t>(validity_.data(), num_cells_);
}

std::unique_ptr<IArrowBufferStorage> VectorColumnBuffer::export_buffers() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost entirely the same as the CBuffer one, I suggest using a helper function

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VectorColumnBuffer will be removed entirelly once CArrayColumnBuffer has matured

jp-dark added a commit that referenced this pull request Dec 23, 2025
Add test for dictionary casting from
#4311

Co-authored-by:  XanthosXanthopoulos <[email protected]>
jp-dark added a commit that referenced this pull request Jan 8, 2026
…4359)

* (WIP) Safe-cast pyarrow tables on write

Still needs the following:
* fix schema names for GeometryDataFrame
* test unsafe casting

* Fix casting for geometry dataframe outlines

* Update history

* Switch from deprecated `field_by_name` to `field`

* Update error message and remove unneeded type declaration

* Take tests from PR #4311

Add test for dictionary casting from
#4311

Co-authored-by:  XanthosXanthopoulos <[email protected]>

* Add xfail to uncovered bug

* Remove test that is checking for unsafe cast

* Fix syntax for xfail

---------

Co-authored-by: XanthosXanthopoulos <[email protected]>
jp-dark added a commit that referenced this pull request Jan 12, 2026
…4359)

* (WIP) Safe-cast pyarrow tables on write

Still needs the following:
* fix schema names for GeometryDataFrame
* test unsafe casting

* Fix casting for geometry dataframe outlines

* Update history

* Switch from deprecated `field_by_name` to `field`

* Update error message and remove unneeded type declaration

* Take tests from PR #4311

Add test for dictionary casting from
#4311

Co-authored-by:  XanthosXanthopoulos <[email protected]>

* Add xfail to uncovered bug

* Remove test that is checking for unsafe cast

* Fix syntax for xfail

---------

Co-authored-by: XanthosXanthopoulos <[email protected]>
(cherry picked from commit a1f6a68)
jp-dark added a commit that referenced this pull request Jan 12, 2026
…4359)

* (WIP) Safe-cast pyarrow tables on write

Still needs the following:
* fix schema names for GeometryDataFrame
* test unsafe casting

* Fix casting for geometry dataframe outlines

* Update history

* Switch from deprecated `field_by_name` to `field`

* Update error message and remove unneeded type declaration

* Take tests from PR #4311

Add test for dictionary casting from
#4311

Co-authored-by:  XanthosXanthopoulos <[email protected]>

* Add xfail to uncovered bug

* Remove test that is checking for unsafe cast

* Fix syntax for xfail

---------

Co-authored-by: XanthosXanthopoulos <[email protected]>
(cherry picked from commit a1f6a68)
jp-dark added a commit that referenced this pull request Jan 13, 2026
…4359) (#4369)

* (WIP) Safe-cast pyarrow tables on write

Still needs the following:
* fix schema names for GeometryDataFrame
* test unsafe casting

* Fix casting for geometry dataframe outlines

* Update history

* Switch from deprecated `field_by_name` to `field`

* Update error message and remove unneeded type declaration

* Take tests from PR #4311

Add test for dictionary casting from
#4311



* Add xfail to uncovered bug

* Remove test that is checking for unsafe cast

* Fix syntax for xfail

---------


(cherry picked from commit a1f6a68)

Co-authored-by: XanthosXanthopoulos <[email protected]>
@jp-dark jp-dark requested a review from rroelke January 14, 2026 20:18
std::shared_ptr<T> casted_column = std::dynamic_pointer_cast<T>(buffers_[name]);

if (!casted_column) {
throw std::runtime_error("[ArrayBuffers][at] Dynamic cast failed for column '" + name + "'");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you consider this an internal error or a user error?
This is not a very helpful message for the end user, it doesn't spell out clearly if there's something they can do about it. I'd suggest returning some kind of "type mismatch, expected T, found ".

If this is internal only then it's probably fine

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is internal only. The buffer type is chosen by us depending on query type so any failure there means we use the wrong buffer for the wrong query type

std::launch::async,
ArrowAdapter::to_arrow,
buffer->at<ReadColumnBuffer>(name),
enumerations,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the async part of this is something you really want to lean into then you could probably make this faster by making the enumerations std::shared_future and passing them here. As written, a single enumeration in the schema creates the critical path for this whole conversion which most of the dimensions/attributes probably do not depend on.

}
} break;
case TILEDB_BOOL: {
std::span<const bool> data_v(static_cast<const bool*>(data), data_size);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless the standard guarantees that bool is one byte, which a cursory google search indicates it does not, I think it should be safer to use uint8_t instead.

// Represent the Boolean vector with, at most, the last two
// bits. In Arrow, Boolean values are LSB packed
uint8_t packed_data = 0;
for (size_t i = 0; i < count; ++i)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is count bounded to 8? I can hardly imagine that it is. This looks problematic, all of the values which don't fit in the single byte will be dropped on the floor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK TileDB Enumerations do not allow duplicate values, so for booleans the max enum size should be 2 right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... core doesn't really do anything to validate that the data matches the logical type, so there's not actually any physical difference between BOOL and UINT8. There could definitely be values in here other than zero and one.

One could argue that we should probably never use arrow bool type, for that reason... I don't think we would need to go that far, but I do think it is important to preserve the truthiness which the keys correspond to, i.e. if the enumeration value i is nonzero then the arrow value i should be true. Even if i is greater than the cardinality of a boolean.

@rroelke
Copy link

rroelke commented Jan 21, 2026

Please do not rebase in the middle of a review, it removes the old commits and I am unable to see what has changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants