Don't copy data to job

dor-forer · dor-forer · commit 24cef8ac05de · 2025-12-24T15:34:59.000+02:00
diff --git a/docs/disk_hnsw_multithreaded_architecture.md b/docs/disk_hnsw_multithreaded_architecture.md
@@ -2,28 +2,27 @@
 
 ## Overview
 
-This document describes the architectural changes introduced in the `dorer-disk-poc-add-delete-mt` branch compared to the original `disk-poc` branch. The focus is on multi-threading, synchronization, concurrency in writing to disk, and performance enhancements.
+This document describes the multi-threaded architecture of the HNSWDisk index, focusing on synchronization, concurrency in writing to disk, and performance enhancements.
 
-## Key Architectural Changes
+## Key Architectural Components
 
-### 1. Insertion Mode
+### 1. Lightweight Insert Jobs
 
-**Previous single threaded approach:** Vectors were accumulated in batches before being written to disk, requiring complex coordination between threads.
-
-
-**Current approach:** Each insert job is self-contained and can write directly to disk upon completion, optimized for workloads where disk writes are cheap but neighbor searching (reads from disk) is the bottleneck.
+Each insert job is lightweight and only stores metadata (vectorId, elementMaxLevel). Vector data is looked up from shared storage when the job executes, minimizing memory usage when many jobs are queued.
 
 ```
 ┌──────────────────────────────────────────────────────────────┐
 │                    HNSWDiskSingleInsertJob                   │
 ├──────────────────────────────────────────────────────────────┤
 │ - vectorId                                                   │
 │ - elementMaxLevel                                            │
-│ - rawVectorData (copied into job - no external references)   │
-│ - processedVectorData (quantized vector for distance calc)   │
 └──────────────────────────────────────────────────────────────┘
 ```
 
+At execution time, jobs access vector data via:
+- **Raw vectors**: `shared_ptr` from `rawVectorsInRAM` (refcount increment, no copy)
+- **Processed vectors**: Direct access from `this->vectors` container
+
 ### 2. Segmented Neighbor Cache
 
 To reduce lock contention in multi-threaded scenarios, the neighbor changes cache is partitioned into **64 independent segments**:
@@ -39,7 +38,7 @@ struct alignas(64) CacheSegment {
 };
 ```
 Note:
-NUM_CACHE_SEGMENTS can be changes which will be cause better separation of the cache,
+NUM_CACHE_SEGMENTS can be increased, which partitions the cache more finely and reduces contention,
 but will require more RAM usage.
 
 **Key benefits:**
@@ -146,10 +145,12 @@ addVector()
     │   └── executeGraphInsertionCore() [inline]
     │
     └── Multi-threaded path (with job queue):
-        ├── Create HNSWDiskSingleInsertJob (copies vector data)
+        ├── Create HNSWDiskSingleInsertJob (just vectorId + level, no vector copy)
         ├── Submit via SubmitJobsToQueue callback
         └── Worker thread executes:
             └── executeSingleInsertJob()
+                ├── Get shared_ptr to raw vector from rawVectorsInRAM
+                ├── Get processed vector from this->vectors
                 └── executeGraphInsertionCore()
 ```
 
@@ -159,12 +160,16 @@ addVector()
 struct HNSWDiskSingleInsertJob : public AsyncJob {
     idType vectorId;
     size_t elementMaxLevel;
-    std::string rawVectorData;       // Copied - no external references
-    std::string processedVectorData; // Quantized data for distance calc
+    // No vector data stored - looked up from index when job executes
+    // This saves memory: 100M pending jobs don't need 100M vector copies
 };
 ```
 
-Copying vector data into the job eliminates race conditions with the caller's buffer.
+Jobs look up vector data at execution time:
+- **Raw vectors**: Accessed via `shared_ptr` from `rawVectorsInRAM` (just increments refcount, no copy)
+- **Processed vectors**: Accessed from `this->vectors` container
+
+This eliminates memory duplication. The job's `shared_ptr` keeps the raw vector alive while the job runs, even if the entry is erased from the map concurrently.
 
 ## Data Flow During Insert
 
@@ -222,13 +227,33 @@ Nodes created in the current batch are tracked in `cacheSegment.newNodes`:
 - Avoids disk lookups for vectors that haven't been written yet
 - Cleared after successful flush to disk
 
-### 4. Raw Vectors in RAM
+### 4. Raw Vectors in RAM with shared_ptr
+
+Raw vectors are stored in `rawVectorsInRAM` using `std::shared_ptr<std::string>`:
+
+```cpp
+std::unordered_map<idType, std::shared_ptr<std::string>> rawVectorsInRAM;
+```
 
-Raw vectors are kept in `rawVectorsInRAM` until flushed to disk:
+**Benefits:**
 - Allows concurrent jobs to access vectors before disk write
 - Eliminates redundant disk reads during graph construction
+- **Zero-copy job execution**: Jobs increment a refcount instead of copying the entire vector
+- **Safe concurrent deletion**: If a vector is erased from the map while a job is executing, the job's `shared_ptr` keeps the data alive until the job completes
 - Protected by `rawVectorsGuard` (shared_mutex)
 
+**Execution flow:**
+```cpp
+// Job execution - no data copy, just refcount increment
+std::shared_ptr<std::string> localRawRef;
+{
+    std::shared_lock<std::shared_mutex> lock(rawVectorsGuard);
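+    if (auto it = rawVectorsInRAM.find(job->vectorId); it != rawVectorsInRAM.end()) localRawRef = it->second;  // refcount++; find() never inserts, unlike operator[]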
+}
+// Lock released, but data stays alive via localRawRef
+// Use localRawRef->data() for graph insertion and disk write
+```
+
 ## Thread Safety Summary
 
 | Operation | Thread Safety | Notes |
diff --git a/src/VecSim/algorithms/hnsw/hnsw_disk.h b/src/VecSim/algorithms/hnsw/hnsw_disk.h
@@ -13,6 +13,7 @@
 #include "VecSim/memory/vecsim_malloc.h"
 #include "VecSim/utils/vecsim_stl.h"
 #include "VecSim/utils/vec_utils.h"
+#include <optional>
 #include <vector>
 // #include "VecSim/containers/data_block.h"
 // #include "VecSim/containers/raw_data_container_interface.h"
@@ -143,17 +144,14 @@ class HNSWDiskIndex;
 struct HNSWDiskSingleInsertJob : public AsyncJob {
     idType vectorId;
     size_t elementMaxLevel;
-    // Store vector data directly in the job (no external references)
-    std::string rawVectorData;       // Original float32 vector
-    std::string processedVectorData; // Preprocessed/quantized vector for distance calculations
+    // No vector data stored - looked up from index when job executes
+    // This saves memory: 100M pending jobs don't need 100M vector copies
 
     HNSWDiskSingleInsertJob(std::shared_ptr<VecSimAllocator> allocator, idType vectorId_,
-                            size_t elementMaxLevel_, std::string &&rawVector,
-                            std::string &&processedVector, JobCallback insertCb,
+                            size_t elementMaxLevel_, JobCallback insertCb,
                             VecSimIndex *index_)
         : AsyncJob(allocator, HNSW_DISK_SINGLE_INSERT_JOB, insertCb, index_),
-          vectorId(vectorId_), elementMaxLevel(elementMaxLevel_),
-          rawVectorData(std::move(rawVector)), processedVectorData(std::move(processedVector)) {}
+          vectorId(vectorId_), elementMaxLevel(elementMaxLevel_) {}
 };
 
 //////////////////////////////////// HNSW index implementation ////////////////////////////////////
@@ -277,8 +275,9 @@ class HNSWDiskIndex : public VecSimIndexAbstract<DataType, DistType>
     vecsim_stl::vector<NeighborUpdate> stagedInsertNeighborUpdates;
 
     // Temporary storage for raw vectors in RAM (until flush batch)
-    // Maps idType -> raw vector data (stored as string for simplicity)
-    std::unordered_map<idType, std::string> rawVectorsInRAM;
+    // Maps idType -> raw vector data (using shared_ptr to avoid copying in job execution)
+    // When a job executes, it just increments refcount instead of copying the entire vector
+    std::unordered_map<idType, std::shared_ptr<std::string>> rawVectorsInRAM;
 
 
     /********************************** Multi-threading Support **********************************/
@@ -934,10 +933,12 @@ int HNSWDiskIndex<DataType, DistType>::addVector(
     // We need to store the original vector before preprocessing
     // NOTE: In batchless mode, we still use rawVectorsInRAM so other concurrent jobs can access
     // the raw vectors of vectors that haven't been written to disk yet
+    // Using shared_ptr so job execution can just increment refcount instead of copying
     const char* raw_data = reinterpret_cast<const char*>(vector);
+    auto rawVectorPtr = std::make_shared<std::string>(raw_data, this->inputBlobSize);
     {
         std::lock_guard<std::shared_mutex> lock(rawVectorsGuard);
-        rawVectorsInRAM[newElementId] = std::string(raw_data, this->inputBlobSize);
+        rawVectorsInRAM[newElementId] = rawVectorPtr;
     }
     // Preprocess the vector
     ProcessedBlobs processedBlobs = this->preprocess(vector);
@@ -1009,14 +1010,9 @@ int HNSWDiskIndex<DataType, DistType>::addVector(
     // Check if we have a job queue for async processing
     if (SubmitJobsToQueue != nullptr) {
         // Multi-threaded: submit job for async processing
-        std::string rawVectorCopy(raw_data, this->inputBlobSize);
-        std::string processedVectorCopy(
-            reinterpret_cast<const char *>(processedBlobs.getStorageBlob()),
-            this->dataSize);
-
+        // No vector copies in job - job will look up from rawVectorsInRAM and this->vectors
         auto *job = new (this->allocator) HNSWDiskSingleInsertJob(
-            this->allocator, newElementId, elementMaxLevel, std::move(rawVectorCopy),
-            std::move(processedVectorCopy),
+            this->allocator, newElementId, elementMaxLevel,
             HNSWDiskIndex<DataType, DistType>::executeSingleInsertJobWrapper, this);
 
         submitSingleJob(job);
@@ -1304,7 +1300,7 @@ bool HNSWDiskIndex<DataType, DistType>::getRawVector(idType id, void* output_buf
         std::shared_lock<std::shared_mutex> lock(rawVectorsGuard);
         auto it = rawVectorsInRAM.find(id);
         if (it != rawVectorsInRAM.end()) {
-            const char* data_ptr = it->second.data();
+            const char* data_ptr = it->second->data();
             std::memcpy(output_buffer, data_ptr, this->inputBlobSize);
             return true;
         }
@@ -1353,7 +1349,7 @@ bool HNSWDiskIndex<DataType, DistType>::getRawVectorInternal(idType id, void* ou
         std::shared_lock<std::shared_mutex> lock(rawVectorsGuard);
         auto it = rawVectorsInRAM.find(id);
         if (it != rawVectorsInRAM.end()) {
-            const char* data_ptr = it->second.data();
+            const char* data_ptr = it->second->data();
             std::memcpy(output_buffer, data_ptr, this->inputBlobSize);
             return true;
         }
@@ -3146,7 +3142,28 @@ void HNSWDiskIndex<DataType, DistType>::executeSingleInsertJob(HNSWDiskSingleIns
         return;
     }
 
-    // Get current entry point
+    // Get shared_ptr to raw vector from rawVectorsInRAM (just increments refcount, no copy)
+    // This keeps the data alive even if erased from map before job finishes
+    std::shared_ptr<std::string> localRawRef;
+    {
+        std::shared_lock<std::shared_mutex> lock(rawVectorsGuard);
+        auto it = rawVectorsInRAM.find(job->vectorId);
+        if (it == rawVectorsInRAM.end()) {
+            // Vector was already erased (e.g., deleted before job executed)
+            delete job;
+            return;
+        }
+        localRawRef = it->second;  // Just increments refcount, no data copy
+    }
+
+    // Get processed vector from vectors container
+    const void *processedVector;
+    {
+        std::shared_lock<std::shared_mutex> lock(vectorsGuard);
+        processedVector = this->vectors->getElement(job->vectorId);
+    }
+
+    // Get current entry point and max level
     idType currentEntryPoint;
     size_t currentMaxLevel;
     {
@@ -3157,8 +3174,7 @@ void HNSWDiskIndex<DataType, DistType>::executeSingleInsertJob(HNSWDiskSingleIns
 
     // Use unified core function (batching controlled by diskWriteBatchThreshold)
     executeGraphInsertionCore(job->vectorId, job->elementMaxLevel, currentEntryPoint,
-                              currentMaxLevel, job->rawVectorData.data(),
-                              job->processedVectorData.data());
+                              currentMaxLevel, localRawRef->data(), processedVector);
 
     delete job;
 }
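
The ownership pattern this change relies on can be sketched in isolation. The following is a minimal, hedged illustration rather than the index's actual code: `RawVectorStore`, `put`, `acquire`, `erase`, and `survivesErase` are hypothetical names introduced here, and `idType` is assumed to be a plain integer type. It shows a reader taking a `shared_ptr` under a shared lock with `find()` (so the map is never mutated on the read path) and the bytes staying valid after the entry is erased.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

using idType = std::uint32_t; // assumption: stand-in for the index's idType

// Hypothetical stand-in for the part of the index that owns raw vectors.
class RawVectorStore {
public:
    // Writer path: allocate outside the lock, publish under an exclusive lock.
    void put(idType id, const char *data, std::size_t size) {
        auto blob = std::make_shared<std::string>(data, size);
        std::unique_lock<std::shared_mutex> lock(guard_);
        vectors_[id] = std::move(blob);
    }

    // Reader path: shared lock plus find(), which never inserts.
    // Returns nullptr if the id is no longer in the map.
    std::shared_ptr<std::string> acquire(idType id) const {
        std::shared_lock<std::shared_mutex> lock(guard_);
        auto it = vectors_.find(id);
        if (it == vectors_.end()) {
            return nullptr;
        }
        return it->second; // copies the pointer (refcount++), not the bytes
    }

    void erase(idType id) {
        std::unique_lock<std::shared_mutex> lock(guard_);
        vectors_.erase(id);
    }

private:
    mutable std::shared_mutex guard_;
    std::unordered_map<idType, std::shared_ptr<std::string>> vectors_;
};

// Returns true if a reference taken before erase() still sees the bytes.
bool survivesErase() {
    RawVectorStore store;
    const char bytes[] = {1, 2, 3, 4};
    store.put(7, bytes, sizeof(bytes));

    std::shared_ptr<std::string> ref = store.acquire(7); // map + ref own the blob
    store.erase(7);                                      // only ref owns it now
    return ref && ref.use_count() == 1 && ref->size() == 4 &&
           (*ref)[3] == 4 && store.acquire(7) == nullptr;
}
```

A caller that receives `nullptr` from `acquire` corresponds to the early-return branch in `executeSingleInsertJob`, where the job is dropped because the vector is no longer in `rawVectorsInRAM`.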