Conversation

rmatif
Contributor

@rmatif rmatif commented Sep 19, 2025

Changes introduced in #790 could fail on some tensors: because we deduped by name using an unordered_map, whichever thread flushed last won, so some chunks pointed at the wrong offsets and corrupted the load. This fix keeps the most recent source tensor (highest index) per name, then re-sorts by the original order before reading, which makes loads deterministic.

This will fix #838
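
For reviewers, here is a rough standalone sketch of the dedupe-then-resort idea (not the exact PR code; the index member and the helper name are illustrative): keep only the highest-index entry per tensor name, then restore the original order before reading.

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct TensorStorage {
        std::string name;
        // offsets, sizes, etc. omitted
    };

    struct IndexedStorage {
        size_t index;      // position of the source tensor in the original order (assumed field)
        TensorStorage ts;
    };

    // Keep only the most recent (highest-index) entry per name, then re-sort by
    // the original index so the read order is deterministic.
    std::vector<IndexedStorage> dedupe_keep_latest(const std::vector<IndexedStorage>& entries) {
        std::unordered_map<std::string, IndexedStorage> latest;
        for (const auto& e : entries) {
            auto it = latest.find(e.ts.name);
            if (it == latest.end() || it->second.index < e.index) {
                latest[e.ts.name] = e;  // later duplicate wins
            }
        }
        std::vector<IndexedStorage> result;
        result.reserve(latest.size());
        for (auto& kv : latest) {
            result.push_back(kv.second);
        }
        std::sort(result.begin(), result.end(),
                  [](const IndexedStorage& a, const IndexedStorage& b) { return a.index < b.index; });
        return result;
    }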

@wbruna
Contributor

wbruna commented Sep 19, 2025

I can't say I'm following the code (this really needs a lot more comments IMHO), but going by your description: if this is happening because of an unstable ordering of the all_results vector, wouldn't it be much simpler to replace it with a vector-of-vectors, indexed by thread id?

Something like this?
diff --git a/model.cpp b/model.cpp
index 2e7127d..78669de 100644
--- a/model.cpp
+++ b/model.cpp
@@ -1968,17 +1968,18 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, int n_thread
         };
 
         std::mutex vec_mutex;
-        std::vector<IndexedStorage> all_results;
+        std::vector<std::vector<IndexedStorage> > all_results;
 
         int n_threads = std::min(num_threads_to_use, (int)tensor_storages.size());
         if (n_threads < 1) {
             n_threads = 1;
         }
         std::vector<std::thread> workers;
+        all_results.resize(n_threads);
 
         for (int i = 0; i < n_threads; ++i) {
             workers.emplace_back([&, thread_id = i]() {
-                std::vector<IndexedStorage> local_results;
+                std::vector<IndexedStorage> & local_results = all_results[thread_id];
                 std::vector<TensorStorage> temp_storages;
 
                 for (size_t j = thread_id; j < tensor_storages.size(); j += n_threads) {
@@ -1995,11 +1996,6 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, int n_thread
                     }
                 }
 
-                if (!local_results.empty()) {
-                    std::lock_guard<std::mutex> lock(vec_mutex);
-                    all_results.insert(all_results.end(),
-                                       local_results.begin(), local_results.end());
-                }
             });
         }
         for (auto& w : workers) {
@@ -2007,8 +2003,10 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, int n_thread
         }
 
         std::unordered_map<std::string, IndexedStorage> latest_map;
-        for (auto& entry : all_results) {
-            latest_map[entry.ts.name] = entry;
+        for (auto& thread_results: all_results) {
+            for (auto& entry : thread_results) {
+                latest_map[entry.ts.name] = entry;
+            }
         }
 
         processed_tensor_storages.reserve(latest_map.size());

Edit: ok, not as-is because each thread isn't processing the entries contiguously. But they could: thread 0 gets the entries 0..(tensor_storages.size()/n_threads), thread 1 the next block, and so forth:

        // how many tensors per thread
        size_t slice = tensor_storages.size() / n_threads;
        // each of the first threads will take one leftover
        size_t leftovers = tensor_storages.size() % n_threads;

        for (int i = 0; i < n_threads; ++i) {
            workers.emplace_back([&, thread_id = i]() {
                std::vector<IndexedStorage> & local_results = all_results[thread_id];
                std::vector<TensorStorage> temp_storages;
                size_t first, last;
                if (thread_id < leftovers) {
                    first = (slice + 1) * thread_id;
                    last = first + slice + 1;
                } else {
                    first = tensor_storages.size() - slice * (n_threads - thread_id);
                    last = first + slice;
                }
                for (size_t j = first; j < last; j++) {

(@rmatif , feel free to borrow this if it makes sense; I can't properly test it anyway)
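
A quick standalone check of that partition arithmetic, with made-up numbers (10 tensors over 3 threads should give the contiguous ranges [0,4), [4,7), [7,10)):

    #include <cstddef>
    #include <cstdio>

    int main() {
        size_t total = 10;   // stand-in for tensor_storages.size()
        int n_threads = 3;
        size_t slice = total / n_threads;      // 3 tensors per thread
        size_t leftovers = total % n_threads;  // first thread takes one extra

        for (int thread_id = 0; thread_id < n_threads; ++thread_id) {
            size_t first, last;
            if ((size_t)thread_id < leftovers) {
                first = (slice + 1) * thread_id;
                last  = first + slice + 1;
            } else {
                first = total - slice * (n_threads - thread_id);
                last  = first + slice;
            }
            printf("thread %d: [%zu, %zu)\n", thread_id, first, last);
        }
        return 0;
    }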

@leejet
Owner

leejet commented Sep 22, 2025

@rmatif Could you describe in detail the reason for the changes? Just looking at the code leaves me a bit confused.

@leejet
Owner

leejet commented Sep 24, 2025

I think I understand where the problem is. This PR should fix it. Thank you for your contribution.

@leejet leejet merged commit 1e0d282 into leejet:master Sep 24, 2025
9 checks passed