-
GPU and CPU both produce the same result. Init part:

```cpp
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = 99;

llama_model * model = llama_model_load_from_file(params.model.path.c_str(), model_params);

llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_ctx           = 2048 * 2;
ctx_params.n_batch         = 2048 + 100;
ctx_params.no_perf         = false;
ctx_params.n_threads       = cpu_get_num_physical_cores();
ctx_params.n_threads_batch = cpu_get_num_math();
ctx_params.embeddings      = true;
ctx_params.flash_attn      = false;
ctx_params.kv_unified      = false;

llama_context * ctx = llama_init_from_model(model, ctx_params);
```
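For what it's worth, here is a minimal sketch (not part of the original comment) of how the same prompt could be encoded twice against a context set up like this, to check that the embeddings come out identical. It assumes a recent llama.cpp API (`llama_model_get_vocab`, `llama_batch_get_one`, `llama_memory_clear`, `llama_model_n_embd`); names and signatures differ across versions, and `encode_once` and the prompt are illustrative only:

```cpp
// Sketch only: assumes `model` and `ctx` are initialized as in the snippet above
// and that a recent llama.cpp build exposes these calls.
#include "llama.h"

#include <cmath>
#include <cstdio>
#include <string>
#include <vector>

static std::vector<float> encode_once(llama_model * model, llama_context * ctx, const std::string & prompt) {
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // Tokenize the prompt (add_special = true, parse_special = false).
    std::vector<llama_token> tokens(prompt.size() + 8);
    const int n = llama_tokenize(vocab, prompt.c_str(), (int32_t) prompt.size(),
                                 tokens.data(), (int32_t) tokens.size(), true, false);
    tokens.resize(n);

    // Start from empty context memory so nothing from a previous run leaks in
    // (assumes the newer memory API; older builds used llama_kv_cache_clear(ctx)).
    llama_memory_clear(llama_get_memory(ctx), true);

    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
    if (llama_decode(ctx, batch) != 0) {
        return {};
    }

    // Embedding of the last token; ctx_params.embeddings was enabled above.
    // Depending on the model's pooling type, llama_get_embeddings_seq(ctx, 0)
    // may be the right call instead.
    const int     n_embd = llama_model_n_embd(model);
    const float * emb    = llama_get_embeddings_ith(ctx, -1);
    return std::vector<float>(emb, emb + n_embd);
}
```

Usage would then be to run it twice on the same input and compare element-wise:

```cpp
std::vector<float> a = encode_once(model, ctx, "hello world");
std::vector<float> b = encode_once(model, ctx, "hello world");
for (size_t i = 0; i < a.size(); ++i) {
    if (std::fabs(a[i] - b[i]) > 1e-6f) { std::printf("mismatch at %zu\n", i); break; }
}
```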
-
Clear the context memory after each run.
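The original reply is cut off before naming the call. As a guess at what was meant, one way to reset the context memory between runs looks like the sketch below; `llama_memory_clear` and `llama_get_memory` are assumptions about a recent llama.cpp build (older versions exposed `llama_kv_cache_clear(ctx)` instead):

```cpp
// Assumption: recent llama.cpp memory API; older builds used llama_kv_cache_clear(ctx).
#include "llama.h"

void reset_context(llama_context * ctx) {
    // Clears all sequences; `true` also wipes the data buffers, not just the metadata.
    llama_memory_clear(llama_get_memory(ctx), true);
}
```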
-
I have found that two inference runs on the same set of inputs produce inconsistent results. How can I ensure the results are consistent?