1 file changed: +8 −1 lines changed

@@ -1360,7 +1360,14 @@ UseGgmlGemm2:;
     // If the chunking is poor for the number of threads on this setup, scrap the whole plan. Re-chunk it by thread.
     // Also, chunking by thread was measured to have perform better on NUMA systems. See https://github.com/ggml-org/llama.cpp/pull/6915
     // In theory, chunking should be just as useful on NUMA and non NUMA systems, but testing disagreed with that.
-    if (nchunk0 * nchunk1 < nth * 4 || ggml_is_numa()) {
+    // If the current chunking plan is inefficient for the available threads, re-chunk it by thread.
+    // - Original observation: For low-core NUMA machines, re-chunking improves performance
+    //   when there are too few chunks per thread (see https://github.com/ggml-org/llama.cpp/pull/6915).
+    // - Our observation on AWS Graviton4 (high-core, high-memory bandwidth) shows that
+    //   disabling this re-chunking for nth >= 128 can actually improve performance.
+    // - Therefore, we only apply re-chunking when nth <= 128 and the chunking is poor
+    //   or on NUMA machines.
+    if (nth <= 128 && (nchunk0 * nchunk1 < nth * 4 || ggml_is_numa())) {
         // distribute the thread work across the inner or outer loop based on which one is larger
         nchunk0 = nr0 > nr1 ? nth : 1; // parallelize by src0 rows
         nchunk1 = nr0 > nr1 ? 1 : nth; // parallelize by src1 rows