NDK build test #14

max-krasnyansky · 2024-08-29T05:26:31Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

- OpenMP functional: Check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems

and fix --poll case

…ool threads

This way we avoid using E-Cores and Hyperthreaded siblings.

For benchmarking it's better to start a fresh pool for each test with the exact number of threads needed for that test. Having larger pools is suboptimal (causes more load, etc).

…r when polling in ggml_barrier

This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.

All command line args now allow for setting poll to 0 (false).

We now start a threadpool in the paused state only if we have two threadpools. The resume is now implicit (i.e. triggered by new work), which reduces locking and context-switch overhead.

The poll params (--poll, ...) now specify a "polling level", i.e. how aggressively we poll before waiting on a condition variable. poll=0 means no polling, 1 means poll for 128K rounds and then wait, 2 means 256K rounds, and so on. The default value of 50 (i.e. 50x128K rounds) seems like a decent default across modern platforms. We can tune this further as things evolve.
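As a rough illustration of the polling levels described above, the sketch below maps a level to a spin budget and spins on an atomic counter before telling the caller to fall back to a blocking wait. The names `poll_rounds`, `poll_for_work` and `POLL_ROUNDS_PER_LEVEL` are hypothetical and not taken from ggml; only the 128K-rounds-per-level figure comes from the description.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Taken from the description: one polling level equals 128K polling rounds.
#define POLL_ROUNDS_PER_LEVEL (128u * 1024u)

// Hypothetical helper: total spin budget for a given polling level.
static uint64_t poll_rounds(uint32_t poll_level) {
    return (uint64_t) poll_level * POLL_ROUNDS_PER_LEVEL;
}

// Hypothetical helper: spin until *work_counter differs from last_seen or the
// budget runs out. Returns true if new work was seen while polling; false
// means the caller should block on its condition variable instead.
static bool poll_for_work(atomic_int * work_counter, int last_seen, uint32_t poll_level) {
    assert(work_counter != NULL);
    const uint64_t budget = poll_rounds(poll_level);
    for (uint64_t i = 0; i < budget; i++) {
        if (atomic_load_explicit(work_counter, memory_order_relaxed) != last_seen) {
            return true;
        }
    }
    return false;
}
```

With level 0 the loop body never runs, so the caller goes straight to the blocking wait. The relaxed load only detects the change; a real implementation would still need acquire semantics before reading the data that the new work refers to.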

New work is now indicated with an atomic counter that is incremented for each new graph that needs to be computed. This removes the need for an extra barrier to clear "new_work" and removes the special case for trivial graphs.
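The counter idea can be sketched as follows. This is a minimal illustration under assumed names (`pool_state`, `worker_state`, `pool_kick`, `worker_has_new_work` are all hypothetical), not the code from this PR: the main thread bumps a counter per graph, and each worker compares it against the last value it acted on, so nothing ever has to be cleared.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical pool state: one counter, bumped once per graph to compute.
struct pool_state {
    atomic_int n_graph;
};

// Hypothetical per-worker state: the last counter value this worker acted on.
struct worker_state {
    int last_graph;
};

// Main thread: publish a new graph. Release ordering makes data written
// before this call visible to a worker that observes the new counter value.
static void pool_kick(struct pool_state * pool) {
    assert(pool != NULL);
    atomic_fetch_add_explicit(&pool->n_graph, 1, memory_order_release);
}

// Worker: returns true if the counter moved since this worker last looked,
// and records the new value. There is no shared flag to reset afterwards.
static bool worker_has_new_work(struct pool_state * pool, struct worker_state * w) {
    assert(pool != NULL && w != NULL);
    const int cur = atomic_load_explicit(&pool->n_graph, memory_order_acquire);
    if (cur != w->last_graph) {
        w->last_graph = cur;
        return true;
    }
    return false;
}
```

Because each worker keeps its own `last_graph`, a trivial graph is handled the same way as a large one: the counter still moves by one and every worker still notices it once.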

With the efficient hybrid polling there is no need to make disposable pools any different. This simplifies the overall logic and reduces branching.

Include n_threads in debug print for disposable threadpool.

Declare pause and stop flags as atomic_bool

This doesn't actually generate any memory barriers and simply informs the thread sanitizer that these flags can be written & read by different threads without locking.

…xes race with small graphs)

This fixes the race condition with very small graphs where the main thread happens to start a new graph while the workers are just about to exit from barriers.
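One common way to build a barrier whose state never needs resetting from outside is to pair an arrival count with a completed-rounds counter that only grows. The sketch below shows that general technique; the struct and function names are hypothetical, and it is not claimed to match the ggml_barrier() code in this PR.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

// Hypothetical barrier state shared by all threads in a pool.
struct barrier {
    atomic_int n_arrived; // threads that reached the current barrier round
    atomic_int n_passed;  // completed rounds; only ever grows
};

static void barrier_wait(struct barrier * b, int n_threads) {
    assert(b != NULL && n_threads > 0);
    // Remember which round we are in before announcing arrival.
    const int round = atomic_load_explicit(&b->n_passed, memory_order_relaxed);
    if (atomic_fetch_add_explicit(&b->n_arrived, 1, memory_order_acq_rel) == n_threads - 1) {
        // Last thread in: rearm the arrival count, then release the others.
        atomic_store_explicit(&b->n_arrived, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->n_passed, 1, memory_order_release);
        return;
    }
    // Everyone else spins until the round number moves on.
    while (atomic_load_explicit(&b->n_passed, memory_order_acquire) == round) {
        // busy-wait; a real implementation would add a CPU relax hint here
    }
}
```

Since waiters compare against the round they entered with, a thread that is slow to leave one barrier cannot be confused by the main thread starting the next graph: nobody resets the counters underneath it.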

A full memory barrier is overkill for this since each thread works on a different chunk.
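To make the reasoning concrete, here is a small sketch of claiming work chunks with a relaxed atomic increment. `claim_chunk` and its parameters are hypothetical names for illustration. The read-modify-write is still atomic, so every caller gets a unique index; relaxed ordering only drops the ordering guarantees relative to other memory, which the chunks do not need because they are disjoint.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

// Hypothetical helper: hand out the next unclaimed chunk index, or -1 when
// all n_chunks chunks have been claimed. next_chunk is shared by all threads.
static int claim_chunk(atomic_int * next_chunk, int n_chunks) {
    assert(next_chunk != NULL);
    const int chunk = atomic_fetch_add_explicit(next_chunk, 1, memory_order_relaxed);
    return chunk < n_chunks ? chunk : -1;
}
```

A worker would typically call this in a loop and stop at the first -1; results become visible to other threads at the next barrier rather than through this counter.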

Also removes the need for an explicit mask_specified param. An all-zero cpumask means use the default (usually inherited) CPU affinity mask.
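The all-zero convention can be sketched like this, assuming the mask is an array with one bool per logical CPU. That layout and both helper names are assumptions for illustration, not the API added by this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper: true when no CPU is selected in the mask.
static bool cpumask_is_empty(const bool * mask, size_t n_cpus) {
    assert(mask != NULL);
    for (size_t i = 0; i < n_cpus; i++) {
        if (mask[i]) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: decide whether to apply an explicit affinity.
// An all-zero mask means "keep the default (usually inherited) affinity",
// so no separate mask_specified flag is needed.
static bool should_apply_affinity(const bool * mask, size_t n_cpus) {
    return !cpumask_is_empty(mask, n_cpus);
}
```

The trade-off of this convention is that "pin to no CPUs" cannot be expressed, which is harmless because such a mask would not be usable anyway.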

…ly if threadpool is enabled

…atus check

…ty application

Most of the init code is now exactly the same between threadpool and openmp.

…not require it

On Windows we need to change the overall process priority class in order to set thread priorities, but on Linux, Mac, etc. we do not need to touch the overall process settings.

This avoids extra syscalls for each graph_compute()

… etc

This helps with long-running tests on platforms that are thermally limited (phones, laptops, etc). --delay (disabled by default) introduces a sleep of N seconds before starting each test.

fmz and others added 30 commits August 27, 2024 06:37

Introduce ggml_compute_threadpool

130adf8

- OpenMP functional: Check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems

Minor fixes

a0aae52

fixed use after release bug

d5c9c14

fixed a harmless race condition

82224f8

Fix Android build issue

817eaf0

fix more race conditions

5763732

fix deadlock for cases where cgraph.n_nodes == 1

3008b31

and fix --poll case

threadpool: use cpu_get_num_math to set the default number of threadp…

96d6603

…ool threads

This way we avoid using E-Cores and Hyperthreaded siblings.

bench: create fresh threadpool for each test

2953441

For benchmarking it's better to start a fresh pool for each test with the exact number of threads needed for that test. Having larger pools is suboptimal (causes more load, etc).

atomics: always use stdatomics with clang and use relaxed memory orde…

6fcc780

…r when polling in ggml_barrier

This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.

threadpool: make polling the default to match openmp behavior

3b62f7c

All command line args now allow for setting poll to 0 (false).

threadpool: do not wakeup threads in already paused threadpool

dfa6377

fix potential race condition in check_for_work

2e18f0d

threadpool: do not create two threadpools if their params are identical

48aa8ee

threadpool: reduce pause/resume/wakeup overhead in common cases

494e27c

We now start a threadpool in the paused state only if we have two threadpools. The resume is now implicit (i.e. triggered by new work), which reduces locking and context-switch overhead.

threadpool: reduce the number of barriers required

9d3e78c

New work is now indicated with an atomic counter that is incremented for each new graph that needs to be computed. This removes the need for an extra barrier to clear "new_work" and removes the special case for trivial graphs.

threadpool: do not clear barrier counters between graphs computes (fi…

db45b6d

…xes race with small graphs)

This fixes the race condition with very small graphs where the main thread happens to start a new graph while the workers are just about to exit from barriers.

threadpool: use relaxed order for chunk sync

307fece

A full memory barrier is overkill for this since each thread works on a different chunk.

threadpool: remove abort_callback from threadpool state

63a0dad

threadpool: better naming for thread/cpumask related functions

2358bb3

threadpool: consistent use of int type for n_threads params

4a4d715

threadpool: add support for ggml_threadpool_params_default/init

c4452ed

Also removes the need for an explicit mask_specified param. An all-zero cpumask means use the default (usually inherited) CPU affinity mask.

threadpool: move typedef into ggml.h

31541d7

threadpool: fix apply_priority() function name

4064860

threadpool: fix swift wrapper errors due to n_threads int type cleanup

f64c975

threadpool: enable --cpu-mask and other threadpool related options on…

c506d7f

…ly if threadpool is enabled

threadpool: replace checks for compute_thread ret code with proper st…

8008463

…atus check

threadpool: simplify threadpool init logic and fix main thread affini…

49ac51f

…ty application

Most of the init code is now exactly the same between threadpool and openmp.

max-krasnyansky added 8 commits August 27, 2024 06:37

threadpool: update threadpool resume/pause function names

204377a

threadpool: enable openmp by default for now

93f170d

threadpool: don't forget to free workers state when omp is enabled

a7496bf

threadpool: avoid updating process priority on the platforms that do …

8186e96

…not require it

On Windows we need to change the overall process priority class in order to set thread priorities, but on Linux, Mac, etc. we do not need to touch the overall process settings.

threadpool: update calling thread prio and affinity only at start/resume

658f16c

This avoids extra syscalls for each graph_compute()

llama-bench: turn threadpool params into vectors, add output headers,…

8d5ab9a

… etc

llama-bench: add support for cool off between tests --delay

3bcc4de

This helps with long-running tests on platforms that are thermally limited (phones, laptops, etc). --delay (disabled by default) introduces a sleep of N seconds before starting each test.

threadpool: move all pause/resume logic into ggml

4c0ce47

github-actions bot added testing examples server ggml labels Aug 29, 2024

max-krasnyansky closed this Aug 29, 2024
