Conversation

rmatif
Contributor

@rmatif rmatif commented Sep 19, 2025

Changes introduced in #790 could fail on some tensors: because we deduped by name using an unordered_map, whichever thread flushed last won, so some chunks pointed at the wrong offsets and corrupted the load. This fix keeps the most recent source tensor (highest index) per name, then re-sorts by the original order before reading, which makes loads deterministic.

This will fix #838
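
For reviewers, here is a rough standalone sketch of the dedupe-then-resort idea (not the exact PR code; the index member and the helper name are illustrative): keep only the highest-index entry per tensor name, then restore the original order before reading.

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct TensorStorage {
        std::string name;
        // offsets, sizes, etc. omitted
    };

    struct IndexedStorage {
        size_t index;      // position of the source tensor in the original order (assumed field)
        TensorStorage ts;
    };

    // Keep only the most recent (highest-index) entry per name, then re-sort by
    // the original index so the read order is deterministic.
    std::vector<IndexedStorage> dedupe_keep_latest(const std::vector<IndexedStorage>& entries) {
        std::unordered_map<std::string, IndexedStorage> latest;
        for (const auto& e : entries) {
            auto it = latest.find(e.ts.name);
            if (it == latest.end() || it->second.index < e.index) {
                latest[e.ts.name] = e;  // later duplicate wins
            }
        }
        std::vector<IndexedStorage> result;
        result.reserve(latest.size());
        for (auto& kv : latest) {
            result.push_back(kv.second);
        }
        std::sort(result.begin(), result.end(),
                  [](const IndexedStorage& a, const IndexedStorage& b) { return a.index < b.index; });
        return result;
    }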

@wbruna
Contributor

wbruna commented Sep 19, 2025

I can't say I'm following the code (this really needs a lot more comments IMHO), but going by your description: if this is happening because of an unstable ordering of the all_results vector, wouldn't it be much simpler to replace it with a vector-of-vectors, indexed by thread id?

Something like this?
diff --git a/model.cpp b/model.cpp
index 2e7127d..78669de 100644
--- a/model.cpp
+++ b/model.cpp
@@ -1968,17 +1968,18 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, int n_thread
         };
 
         std::mutex vec_mutex;
-        std::vector<IndexedStorage> all_results;
+        std::vector<std::vector<IndexedStorage> > all_results;
 
         int n_threads = std::min(num_threads_to_use, (int)tensor_storages.size());
         if (n_threads < 1) {
             n_threads = 1;
         }
         std::vector<std::thread> workers;
+        all_results.resize(n_threads);
 
         for (int i = 0; i < n_threads; ++i) {
             workers.emplace_back([&, thread_id = i]() {
-                std::vector<IndexedStorage> local_results;
+                std::vector<IndexedStorage> & local_results = all_results[thread_id];
                 std::vector<TensorStorage> temp_storages;
 
                 for (size_t j = thread_id; j < tensor_storages.size(); j += n_threads) {
@@ -1995,11 +1996,6 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, int n_thread
                     }
                 }
 
-                if (!local_results.empty()) {
-                    std::lock_guard<std::mutex> lock(vec_mutex);
-                    all_results.insert(all_results.end(),
-                                       local_results.begin(), local_results.end());
-                }
             });
         }
         for (auto& w : workers) {
@@ -2007,8 +2003,10 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, int n_thread
         }
 
         std::unordered_map<std::string, IndexedStorage> latest_map;
-        for (auto& entry : all_results) {
-            latest_map[entry.ts.name] = entry;
+        for (auto& thread_results: all_results) {
+            for (auto& entry : thread_results) {
+                latest_map[entry.ts.name] = entry;
+            }
         }
 
         processed_tensor_storages.reserve(latest_map.size());

Edit: ok, not as-is because each thread isn't processing the entries contiguously. But they could: thread 0 gets the entries 0..(tensor_storages.size()/n_threads), thread 1 the next block, and so forth:

        // how many tensors per thread
        size_t slice = tensor_storages.size() / n_threads;
        // each of the first threads will take one leftover
        size_t leftovers = tensor_storages.size() % n_threads;

        for (int i = 0; i < n_threads; ++i) {
            workers.emplace_back([&, thread_id = i]() {
                std::vector<IndexedStorage> & local_results = all_results[thread_id];
                std::vector<TensorStorage> temp_storages;
                size_t first, last;
                if (thread_id < leftovers) {
                    first = (slice + 1) * thread_id;
                    last = first + slice + 1;
                } else {
                    first = tensor_storages.size() - slice * (n_threads - thread_id);
                    last = first + slice;
                }
                for (size_t j = first; j < last; j++) {

(@rmatif , feel free to borrow this if it makes sense; I can't properly test it anyway)
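
A quick standalone check of that partition arithmetic, with made-up numbers (10 tensors over 3 threads should give the contiguous ranges [0,4), [4,7), [7,10)):

    #include <cstddef>
    #include <cstdio>

    int main() {
        size_t total = 10;   // stand-in for tensor_storages.size()
        int n_threads = 3;
        size_t slice = total / n_threads;      // 3 tensors per thread
        size_t leftovers = total % n_threads;  // first thread takes one extra

        for (int thread_id = 0; thread_id < n_threads; ++thread_id) {
            size_t first, last;
            if ((size_t)thread_id < leftovers) {
                first = (slice + 1) * thread_id;
                last  = first + slice + 1;
            } else {
                first = total - slice * (n_threads - thread_id);
                last  = first + slice;
            }
            printf("thread %d: [%zu, %zu)\n", thread_id, first, last);
        }
        return 0;
    }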

@leejet
Owner

leejet commented Sep 22, 2025

@rmatif Could you describe in detail the reason for the changes? Just looking at the code leaves me a bit confused.

@leejet
Owner

leejet commented Sep 24, 2025

I think I understand where the problem is. This PR should fix it. Thank you for your contribution.

@leejet leejet merged commit 1e0d282 into leejet:master Sep 24, 2025
9 checks passed