
Commit c584042

fmz, max-krasnyansky, slaren (authored and committed)
Threadpool: take 2 (llama/8672)
* Introduce ggml_compute_threadpool
  - OpenMP functional: check
  - Vanilla ggml functional: check
  - ggml w/ threadpool functional: check
  - OpenMP no regression: no glaring problems
  - Vanilla ggml no regression: no glaring problems
  - ggml w/ threadpool no regression: no glaring problems
* Minor fixes
* Fixed a use-after-release bug
* Fixed a harmless race condition
* Fix Android build issue
* Fix more race conditions
* Fix deadlock for cases where cgraph.n_nodes == 1, and fix the --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads. This way we avoid using E-cores and hyperthreaded siblings.
* bench: create a fresh threadpool for each test. For benchmarking it's better to start a fresh pool for each test with the exact number of threads needed for that test; larger pools are suboptimal (more load, etc.).
* atomics: always use stdatomic with clang, and use relaxed memory order when polling in ggml_barrier. This also removes the sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default to match OpenMP behavior. All command-line args now allow setting poll to 0 (false).
* threadpool: do not wake up threads in an already-paused threadpool
* Fix a potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases. We now start a threadpool in the paused state only if we have two. Resume is now implicit (i.e. triggered by new work), which reduces locking and context-switch overhead.
* threadpool: add support for hybrid polling. The poll params (--poll, ...) now specify a "polling level", i.e. how aggressively we poll before waiting on the condition variable: poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, and so on. The default value of 50 (i.e. 50 x 128K rounds) seems like a decent default across modern platforms; we can tune this further as things evolve.
* threadpool: reduce the number of barriers required. New work is now indicated with an atomic counter that is incremented for each new graph that needs to be computed. This removes the extra barrier for clearing "new_work" and removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools. With efficient hybrid polling there is no need to treat disposable pools differently; this simplifies the overall logic and reduces branching. Also include n_threads in the debug print for disposable threadpools, and declare the pause and stop flags as atomic_bool. The latter doesn't actually generate any memory barriers; it simply informs the thread sanitizer that these flags can be written and read by different threads without locking.
* threadpool: do not clear barrier counters between graph computes (fixes a race with small graphs). This fixes the race condition with very small graphs where the main thread happens to start a new graph while the workers are just about to exit from the barrier.
* threadpool: use relaxed order for chunk sync. A full memory barrier is overkill here since each thread works on a different chunk.
* threadpool: remove abort_callback from the threadpool state
* threadpool: better naming for thread/cpumask-related functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init. This also removes the need for an explicit mask_specified param; an all-zero cpumask means use the default (usually inherited) CPU affinity mask.
* threadpool: move the typedef into ggml.h
* threadpool: fix the apply_priority() function name
* threadpool: fix Swift wrapper errors due to the n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool-related options only if the threadpool is enabled
* threadpool: replace checks of the compute_thread return code with a proper status check
* threadpool: simplify the threadpool init logic and fix main-thread affinity application. Most of the init code is now exactly the same between threadpool and OpenMP.
* threadpool: update threadpool resume/pause function names
* threadpool: enable OpenMP by default for now
* threadpool: don't forget to free the workers' state when OpenMP is enabled
* threadpool: avoid updating process priority on platforms that do not require it. On Windows we need to change the overall process priority class in order to set thread priorities, but on Linux, Mac, etc. we do not need to touch the overall process settings.
* threadpool: update the calling thread's prio and affinity only at start/resume. This avoids extra syscalls for each graph_compute().
* llama-bench: turn threadpool params into vectors, add output headers, etc.
* llama-bench: add support for a cool-off between tests via --delay. This helps long-running tests on platforms that are thermally limited (phones, laptops, etc.). --delay (disabled by default) introduces a sleep of N seconds before starting each test.
* threadpool: move process-priority setting into the apps (bench and cli). This avoids changing the overall process priority on Windows for apps that use ggml/llama.cpp directly.
* threadpool: move all pause/resume logic into ggml
* threadpool: further API cleanup and prep for future refactoring. All threadpool-related functions and structs now use the ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve the setpriority error message
* Update examples/llama-bench/llama-bench.cpp (Co-authored-by: slaren <[email protected]>)
* threadpool: fix indent in the set_threadpool call
* Use int32_t for the n_threads type in the public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* Fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Android

---------

Co-authored-by: Max Krasnyansky <[email protected]>
Co-authored-by: fmz <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>
Co-authored-by: slaren <[email protected]>
1 parent 10e83a4 commit c584042

File tree

6 files changed: +740 −187 lines

include/ggml-alloc.h

Lines changed: 2 additions & 2 deletions

```diff
@@ -7,8 +7,8 @@ extern "C" {
 #endif
 
 typedef struct ggml_backend_buffer_type * ggml_backend_buffer_type_t;
-typedef struct ggml_backend_buffer * ggml_backend_buffer_t;
-typedef struct ggml_backend * ggml_backend_t;
+typedef struct ggml_backend_buffer      * ggml_backend_buffer_t;
+typedef struct ggml_backend             * ggml_backend_t;
 
 // Tensor allocator
 struct ggml_tallocr {
```

include/ggml-backend.h

Lines changed: 1 addition & 0 deletions

```diff
@@ -103,6 +103,7 @@ extern "C" {
 
     GGML_API GGML_CALL bool ggml_backend_is_cpu             (ggml_backend_t backend);
     GGML_API void ggml_backend_cpu_set_n_threads     (ggml_backend_t backend_cpu, int n_threads);
+    GGML_API void ggml_backend_cpu_set_threadpool    (ggml_backend_t backend_cpu, ggml_threadpool_t threadpool);
     GGML_API void ggml_backend_cpu_set_abort_callback(ggml_backend_t backend_cpu, ggml_abort_callback abort_callback, void * abort_callback_data);
 
     // Create a backend buffer from an existing pointer
```

include/ggml.h

Lines changed: 41 additions & 2 deletions

```diff
@@ -231,6 +231,8 @@
 #define GGML_MAX_SRC 10
 #ifndef GGML_MAX_NAME
 #define GGML_MAX_NAME 64
+#define GGML_MAX_N_THREADS 512
+
 #endif
 #define GGML_MAX_OP_PARAMS 64
 #define GGML_DEFAULT_N_THREADS 4
@@ -628,13 +630,37 @@ extern "C" {
     // If it returns true, the computation is aborted
     typedef bool (*ggml_abort_callback)(void * data);
 
+    // Scheduling priorities
+    enum ggml_sched_priority {
+        GGML_SCHED_PRIO_NORMAL,
+        GGML_SCHED_PRIO_MEDIUM,
+        GGML_SCHED_PRIO_HIGH,
+        GGML_SCHED_PRIO_REALTIME
+    };
+
+    // Threadpool params
+    // Use ggml_threadpool_params_default() or ggml_threadpool_params_init() to populate the defaults
+    struct ggml_threadpool_params {
+        bool cpumask[GGML_MAX_N_THREADS]; // mask of cpu cores (all-zeros means use default affinity settings)
+        int n_threads;                    // number of threads
+        enum ggml_sched_priority prio;    // thread priority
+        uint32_t poll;                    // polling level (0 - no polling, 100 - aggressive polling)
+        bool strict_cpu;                  // strict cpu placement
+        bool paused;                      // start in paused state
+    };
+
+    struct ggml_threadpool; // forward declaration, see ggml.c
+
+    typedef struct ggml_threadpool * ggml_threadpool_t;
+
     // the compute plan that needs to be prepared for ggml_graph_compute()
     // since https://github.com/ggerganov/ggml/issues/287
     struct ggml_cplan {
         size_t work_size;   // size of work buffer, calculated by `ggml_graph_plan()`
         uint8_t * work_data; // work buffer, to be allocated by caller before calling to `ggml_graph_compute()`
 
         int n_threads;
+        struct ggml_threadpool * threadpool;
 
         // abort ggml_graph_compute when true
         ggml_abort_callback abort_callback;
@@ -2057,10 +2083,23 @@ extern "C" {
     GGML_API size_t ggml_graph_overhead(void);
     GGML_API size_t ggml_graph_overhead_custom(size_t size, bool grads);
 
+    GGML_API struct ggml_threadpool_params ggml_threadpool_params_default(int n_threads);
+    GGML_API void                          ggml_threadpool_params_init   (struct ggml_threadpool_params *p, int n_threads);
+    GGML_API bool                          ggml_threadpool_params_match  (const struct ggml_threadpool_params *p0, const struct ggml_threadpool_params *p1);
+    GGML_API struct ggml_threadpool*       ggml_threadpool_new           (struct ggml_threadpool_params * params);
+    GGML_API void                          ggml_threadpool_free          (struct ggml_threadpool * threadpool);
+    GGML_API int                           ggml_threadpool_get_n_threads (struct ggml_threadpool * threadpool);
+    GGML_API void                          ggml_threadpool_pause         (struct ggml_threadpool * threadpool);
+    GGML_API void                          ggml_threadpool_resume        (struct ggml_threadpool * threadpool);
+
     // ggml_graph_plan() has to be called before ggml_graph_compute()
     // when plan.work_size > 0, caller must allocate memory for plan.work_data
-    GGML_API struct ggml_cplan ggml_graph_plan   (const struct ggml_cgraph * cgraph, int n_threads /*= GGML_DEFAULT_N_THREADS*/);
-    GGML_API enum ggml_status  ggml_graph_compute(      struct ggml_cgraph * cgraph, struct ggml_cplan * cplan);
+    GGML_API struct ggml_cplan ggml_graph_plan(
+            const struct ggml_cgraph * cgraph,
+            int n_threads, /* = GGML_DEFAULT_N_THREADS */
+            struct ggml_threadpool * threadpool /* = NULL */ );
+    GGML_API enum ggml_status ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan);
+
     // same as ggml_graph_compute() but the work data is allocated as a part of the context
     // note: the drawback of this API is that you must have ensured that the context has enough memory for the work data
     GGML_API enum ggml_status ggml_graph_compute_with_ctx(struct ggml_context * ctx, struct ggml_cgraph * cgraph, int n_threads);
```

src/CMakeLists.txt

Lines changed: 1 addition & 1 deletion

```diff
@@ -1247,7 +1247,7 @@ endif()
 
 # Data types, macros and functions related to controlling CPU affinity and
 # some memory allocation are available on Linux through GNU extensions in libc
-if (CMAKE_SYSTEM_NAME MATCHES "Linux")
+if (CMAKE_SYSTEM_NAME MATCHES "Linux" OR CMAKE_SYSTEM_NAME MATCHES "Android")
    add_compile_definitions(_GNU_SOURCE)
 endif()
```

src/ggml-backend.c

Lines changed: 20 additions & 5 deletions

```diff
@@ -722,9 +722,11 @@ ggml_backend_buffer_type_t ggml_backend_cpu_hbm_buffer_type(void) {
 #endif
 
 struct ggml_backend_cpu_context {
-    int n_threads;
-    void * work_data;
-    size_t work_size;
+    int n_threads;
+    ggml_threadpool_t threadpool;
+
+    void * work_data;
+    size_t work_size;
 
     ggml_abort_callback abort_callback;
     void * abort_callback_data;
@@ -759,7 +761,7 @@ GGML_CALL static ggml_backend_graph_plan_t ggml_backend_cpu_graph_plan_create(gg
 
     struct ggml_backend_plan_cpu * cpu_plan = malloc(sizeof(struct ggml_backend_plan_cpu));
 
-    cpu_plan->cplan = ggml_graph_plan(cgraph, cpu_ctx->n_threads);
+    cpu_plan->cplan = ggml_graph_plan(cgraph, cpu_ctx->n_threads, cpu_ctx->threadpool);
     cpu_plan->cgraph = *cgraph; // FIXME: deep copy
 
     if (cpu_plan->cplan.work_size > 0) {
@@ -796,7 +798,7 @@ GGML_CALL static enum ggml_status ggml_backend_cpu_graph_plan_compute(ggml_backe
 GGML_CALL static enum ggml_status ggml_backend_cpu_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
     struct ggml_backend_cpu_context * cpu_ctx = (struct ggml_backend_cpu_context *)backend->context;
 
-    struct ggml_cplan cplan = ggml_graph_plan(cgraph, cpu_ctx->n_threads);
+    struct ggml_cplan cplan = ggml_graph_plan(cgraph, cpu_ctx->n_threads, cpu_ctx->threadpool);
 
     if (cpu_ctx->work_size < cplan.work_size) {
         free(cpu_ctx->work_data);
@@ -877,6 +879,7 @@ ggml_backend_t ggml_backend_cpu_init(void) {
     }
 
     ctx->n_threads = GGML_DEFAULT_N_THREADS;
+    ctx->threadpool = NULL;
     ctx->work_data = NULL;
     ctx->work_size = 0;
     ctx->abort_callback = NULL;
@@ -907,6 +910,18 @@ void ggml_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads) {
     ctx->n_threads = n_threads;
 }
 
+void ggml_backend_cpu_set_threadpool(ggml_backend_t backend_cpu, ggml_threadpool_t threadpool) {
+    GGML_ASSERT(ggml_backend_is_cpu(backend_cpu));
+
+    struct ggml_backend_cpu_context * ctx = (struct ggml_backend_cpu_context *)backend_cpu->context;
+
+    if (ctx->threadpool && ctx->threadpool != threadpool) {
+        // already had a different threadpool, pause/suspend it before switching
+        ggml_threadpool_pause(ctx->threadpool);
+    }
+    ctx->threadpool = threadpool;
+}
+
 void ggml_backend_cpu_set_abort_callback(ggml_backend_t backend_cpu, ggml_abort_callback abort_callback, void * abort_callback_data) {
     GGML_ASSERT(ggml_backend_is_cpu(backend_cpu));
 
```