[c++] Optimize memory management on write path #4311

XanthosXanthopoulos · 2025-11-13T17:55:46Z

Issue and/or context: SOMA-528 SOMA-714 SOMA-688

Changes:
This PR changes the memory management for the read/write operations implemented by ManagedQuery. Specifically:

Replaces std::vector backed buffers with C++ arrays wrapped in std::unique_ptr
Optimizes null count calculation for nullable columns
Makes the conversion of TileDB buffers to Arrow tables multithreaded
Makes writes zero-copy when possible. Passing a temporary object as the data buffer for a write will crash the program, because setting the buffers and writing them to TileDB are two separate steps rather than one atomic operation, so the buffer must stay alive until the write is submitted
Removes implicit casting of numeric data when writing to TileDB. When passing data to read/write, use the schema provided by the SOMAArray to cast types in advance
Fixes index casting when writing dictionaries
Properly writes validity buffers for nullable enumerated columns
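
The lifetime hazard described in the zero-copy item can be sketched as follows. This is a minimal illustration only: `BorrowingWriteQuery`, `set_data_buffer`, and `submit_write` are hypothetical stand-ins, not the actual ManagedQuery API, and summing the values stands in for the real write.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a query that borrows, rather than copies, the
// caller's data. This is not the real ManagedQuery interface.
class BorrowingWriteQuery {
   public:
    // Step 1: remember where the data lives. No copy is made.
    void set_data_buffer(const int64_t* data, std::size_t size) {
        data_ = data;
        size_ = size;
    }

    // Step 2: read through the stored pointer. The caller's buffer must
    // still be alive here. Summing stands in for the actual write.
    int64_t submit_write() const {
        return std::accumulate(data_, data_ + size_, int64_t{0});
    }

   private:
    const int64_t* data_ = nullptr;
    std::size_t size_ = 0;
};

// Safe pattern: the owning vector outlives both steps.
int64_t write_with_live_buffer() {
    std::vector<int64_t> owned = {1, 2, 3, 4};
    BorrowingWriteQuery query;
    query.set_data_buffer(owned.data(), owned.size());
    return query.submit_write();  // `owned` is still in scope here
}

// Unsafe pattern, left as a comment because it is undefined behavior:
//   query.set_data_buffer(std::vector<int64_t>{1, 2, 3, 4}.data(), 4);
//   query.submit_write();  // the temporary vector is already destroyed
```

Because the two steps are separate calls, nothing in the type system stops the buffer from being destroyed between them; the caller has to keep the owner alive.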

Notes for Reviewer:

codecov · 2025-11-14T10:39:30Z

Codecov Report

❌ Patch coverage is 82.85714% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.36%. Comparing base (e5a14aa) to head (73577bc).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4311      +/-   ##
==========================================
- Coverage   86.37%   86.36%   -0.02%     
==========================================
  Files         139      140       +1     
  Lines       21093    21111      +18     
  Branches       15       17       +2     
==========================================
+ Hits        18219    18232      +13     
- Misses       2874     2879       +5

Flag	Coverage Δ
python	`89.01% <81.81%> (-0.06%)`	⬇️
r	`84.99% <84.61%> (+<0.01%)`	⬆️


Components	Coverage Δ
python_api	`89.01% <81.81%> (-0.06%)`	⬇️
libtiledbsoma	`76.58% <71.42%> (-0.66%)`	⬇️


bkmartinjr

are there any measurements of the time/space impact of this PR?

XanthosXanthopoulos · 2025-11-16T18:44:16Z

> are there any measurements of the time/space impact of this PR?

I ingested a couple of h5ad files, and ingestion was about 20% faster with this PR, with lower memory usage as well.

libtiledbsoma/src/soma/managed_query.h

jp-dark · 2025-11-18T17:38:19Z

libtiledbsoma/src/soma/column_buffer.h

+
+    CArrayColumnBuffer() = delete;
+    CArrayColumnBuffer(const CArrayColumnBuffer&) = delete;
+    CArrayColumnBuffer(CArrayColumnBuffer&&) = default;


I get the following warning when compiling:

/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:401:5: warning: explicitly defaulted move constructor is implicitly deleted [-Wdefaulted-function-deleted]
  401 |     CArrayColumnBuffer(CArrayColumnBuffer&&) = default;
      |     ^
/home/jules/Software/TileDB-Inc/TileDB-SOMA/libtiledbsoma/src/soma/column_buffer.h:375:28: note: move constructor of 'CArrayColumnBuffer' is implicitly deleted because base class 'ReadColumnBuffer' has a deleted move constructor
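
For context, the situation the warning describes can be reproduced in isolation. The sketch below uses hypothetical stand-in types (`ReadBufferBase`, `DerivedBuffer`, `MovableBase`, `MovableDerived`), not the real `ReadColumnBuffer` or `CArrayColumnBuffer`, and shows that `= default` on the derived move constructor does not make the class movable when the base class deletes its own.

```cpp
#include <type_traits>

// Hypothetical base with a deleted move constructor, mirroring the
// situation named in the compiler note.
struct ReadBufferBase {
    ReadBufferBase() = default;
    ReadBufferBase(const ReadBufferBase&) = delete;
    ReadBufferBase(ReadBufferBase&&) = delete;
};

struct DerivedBuffer : ReadBufferBase {
    DerivedBuffer() = default;
    DerivedBuffer(const DerivedBuffer&) = delete;
    // This compiles, but the constructor is defined as deleted because the
    // base cannot be moved. Clang reports -Wdefaulted-function-deleted here.
    DerivedBuffer(DerivedBuffer&&) = default;
};

// One possible fix, assuming the base can safely be moved: default the
// move constructor in the base as well.
struct MovableBase {
    MovableBase() = default;
    MovableBase(const MovableBase&) = delete;
    MovableBase(MovableBase&&) = default;
};

struct MovableDerived : MovableBase {
    MovableDerived() = default;
    MovableDerived(const MovableDerived&) = delete;
    MovableDerived(MovableDerived&&) = default;
};

static_assert(!std::is_move_constructible<DerivedBuffer>::value,
              "defaulted move is deleted when the base move is deleted");
static_assert(std::is_move_constructible<MovableDerived>::value,
              "defaulted move works once the base is movable");
```

Whether the real base class can be made movable, or the derived declaration should instead be changed to `= delete`, depends on the actual class design, which this sketch does not model.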

libtiledbsoma/src/soma/column_buffer.h

rroelke · 2025-12-22T14:35:58Z

apis/python/src/tiledbsoma/_managed_query.py

+        self._handle.submit_write()
+
+        # clear stored data objects
+        self._ref_store.clear()


Does ManagedQuery allow buffers to be re-used across multiple submit calls?

Buffers used for write operations are not reused. The managed query only gets a view of them from wherever they come from.

In the C/C++ API buffers can be re-used for multi-part global order writes. But IIRC those aren't supported by Python yet, right?

Yes. The buffers we supply to the C++ API are owned by NumPy or Arrow, so we do nothing with them beyond passing a view to the query.
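
The keep-alive pattern discussed in this exchange (hold a reference to each externally owned buffer until the write is submitted, then drop it) can be sketched in C++ with `std::shared_ptr`. The class and member names here are hypothetical, chosen to echo the Python `_ref_store`; they are not part of the library.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical query that borrows data but also holds a reference to the
// owner so the data cannot disappear before the write is submitted.
class KeepAliveQuery {
   public:
    void set_data_buffer(std::shared_ptr<std::vector<int64_t>> owner) {
        view_ = owner->data();
        size_ = owner->size();
        ref_store_.push_back(std::move(owner));  // keep the owner alive
    }

    // Stand-in for the write: reports how many values were "written",
    // then releases every stored reference.
    std::size_t submit_write() {
        std::size_t written = size_;
        ref_store_.clear();
        view_ = nullptr;
        size_ = 0;
        return written;
    }

   private:
    const int64_t* view_ = nullptr;
    std::size_t size_ = 0;
    std::vector<std::shared_ptr<std::vector<int64_t>>> ref_store_;
};

// Returns the owner's reference count before set, after set, and after
// submit, to show when the query holds on to the data.
std::vector<long> reference_counts_around_submit() {
    auto data = std::make_shared<std::vector<int64_t>>(
        std::vector<int64_t>{1, 2, 3});
    KeepAliveQuery query;
    std::vector<long> counts;
    counts.push_back(data.use_count());
    query.set_data_buffer(data);
    counts.push_back(data.use_count());
    query.submit_write();
    counts.push_back(data.use_count());
    return counts;
}
```

Clearing the store right after submission means the buffers are not available for reuse by a later submit, which matches the answer given above for the Python path.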

rroelke · 2025-12-22T15:21:57Z

> are there any measurements of the time/space impact of this PR?

> I have ingested a couple of h5ad files and the result was about 20% faster with this PR with lower memory usage as well

Is this just for the write path? Or does this also include reads? It might be nice to see a breakdown.

XanthosXanthopoulos · 2025-12-22T15:25:11Z

Yes, this was just for writes. Reads were slower before the PR that added the different memory modes was merged, but they should now be on par or faster. I haven't run the benchmarks yet.

rroelke

I haven't finished yet, I've made my way through column_buffer.{cc,h} so far.

I have left a slew of comments, but nothing in particular regarding safety. Most of them are cosmetic, and for those I defer to y'all, as I am not a SOMA maintainer. These few are the most important:

At a higher level, I don't really understand why the CArrayColumnBuffer would have different performance characteristics from the VectorColumnBuffer. I would expect the two to make very similar patterns of memory allocations, as long as move constructors and the like are used appropriately. Data overrides my intuition, of course. But I have a feeling that you don't actually need to separate these, and that separating the read and write paths would be enough.
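
One language-level difference between the two buffer styles that may be relevant here, offered as an assumption rather than something measured in this thread: `std::vector<T>(n)` value-initializes (zeroes) every element on construction, while `new T[n]` for a trivial type only allocates. The helper names below are hypothetical and are not taken from column_buffer.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// std::vector<T>(n) value-initializes: all n elements are written (zeroed)
// before the caller stores anything in them.
std::vector<int64_t> make_vector_buffer(std::size_t n) {
    return std::vector<int64_t>(n);
}

// new T[n] default-initializes trivial types: memory is allocated but not
// written. The contents are indeterminate until the caller fills them.
std::unique_ptr<int64_t[]> make_array_buffer(std::size_t n) {
    return std::unique_ptr<int64_t[]>(new int64_t[n]);
}

// Fills a raw buffer with 0, 1, 2, ... as a stand-in for a read that
// overwrites every element anyway.
void fill_sequence(int64_t* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = static_cast<int64_t>(i);
    }
}
```

If a buffer is going to be fully overwritten, the initial zeroing in the vector case is extra work; whether that accounts for any of the difference reported in this PR would need a benchmark to confirm.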

apis/python/tests/test_dataframe.py

rroelke · 2025-12-22T14:40:40Z

apis/python/tests/test_experiment_basic.py

    pydict["bar"] = [4.1, 5.2, 6.3, 7.4, 8.5]
    pydict["baz"] = ["apple", "ball", "cat", "dog", "egg"]
-    rb = pa.Table.from_pydict(pydict)
+    rb = pa.Table.from_pydict(pydict, schema=obs_arrow_schema.insert(0, pa.field("soma_joinid", pa.int64())))


How come you are choosing to add the field this way instead of inline above?

apis/python/tests/test_geometry_dataframe.py

apis/python/tests/test_sparse_nd_array.py

libtiledbsoma/src/soma/array_buffers.h