Why are the results of each inference inconsistent？ #16099

MPolaris · 2025-09-19T03:00:12Z

MPolaris
Sep 19, 2025

I have found that the two inference results for the same set of inputs are inconsistent. How can I ensure the consistency of the results?

void test_acc(llama_context* lctx){
    int total_tokens = 12;
    int embd_size = 3584;
    float* emb_data = (float*)malloc(sizeof(float) * total_tokens * embd_size);
    if (emb_data == NULL) {
        LOG_ERR("Unable to allocate memory for emb_data\n");
        return;
    }
    // ensure input consistent
    read_data_from_file("./randominputemb.bin", emb_data, sizeof(float) * total_tokens * embd_size);

    llama_batch batch_data = llama_batch_init(total_tokens, embd_size, 1);
    // ctx.batch = batch_data;
    free(batch_data.embd);
    batch_data.embd = emb_data;
    for (int i = 0; i < total_tokens; i++)
    {
        batch_data.pos[i] = i;
        batch_data.n_seq_id[i] = 1;
        batch_data.seq_id[i][0] = 0;
        batch_data.logits[i] = true;
    }
    batch_data.n_tokens = total_tokens;

    llama_decode(lctx, batch_data);

    //error
    float* t_embd = llama_get_embeddings(lctx); // inconsistent
    if (!t_embd)
    { 
        printf("llama_get_embeddings() failed\n");
    }
    // save to disk, and compare
    save_data_to_file("./output.bin", t_embd, sizeof(float) * total_tokens * embd_size);
 
}

MPolaris · 2025-09-19T03:02:16Z

MPolaris
Sep 19, 2025
Author

GPU and CPU both produce the same result. Init part：

    llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = 99;

    llama_model * model = llama_model_load_from_file(params.model.path.c_str(), model_params);
    llama_context_params ctx_params = llama_context_default_params();
    ctx_params.n_ctx = 2048 * 2;
    ctx_params.n_batch = 2048 + 100;
    ctx_params.no_perf = false;
    ctx_params.n_threads = cpu_get_num_physical_cores();
    ctx_params.n_threads_batch = cpu_get_num_math();
    ctx_params.embeddings = true;
    ctx_params.flash_attn = false;
    ctx_params.kv_unified = false;

    llama_context * ctx = llama_init_from_model(model, ctx_params);

0 replies

ggerganov · 2025-09-19T06:41:26Z

ggerganov
Sep 19, 2025
Maintainer

Clear the context memory after each run using llama_memory_clear().

3 replies

MPolaris Sep 19, 2025
Author

Clear the context memory after each run using llama_memory_clear().

Thank you for your reply, but I always run the entire program from scratch to ensure that the initial state is the same.

But, I just try add that code before llama_decode, and I got same result:

auto* kv = llama_get_memory(lctx);
    if (!kv) {
        printf("Unable to get kv\n");
        return;
    }
    llama_memory_clear(kv, true);

ggerganov Sep 19, 2025
Maintainer

I see. Not sure then - the results should be the same for each run.

You can try to modify the llama-embedding example for your use case, as I know it works correctly. Let us know if you trace down the issue in your code.

MPolaris Sep 19, 2025
Author

I see. Not sure then - the results should be the same for each run.

You can try to modify the llama-embedding example for your use case, as I know it works correctly. Let us know if you trace down the issue in your code.

Thank you again, but I have been tormented by this question for a long time.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Why are the results of each inference inconsistent？ #16099

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Why are the results of each inference inconsistent？ #16099

Uh oh!

Uh oh!

MPolaris Sep 19, 2025

Replies: 2 comments · 3 replies

Uh oh!

MPolaris Sep 19, 2025 Author

Uh oh!

ggerganov Sep 19, 2025 Maintainer

Uh oh!

MPolaris Sep 19, 2025 Author

Uh oh!

ggerganov Sep 19, 2025 Maintainer

Uh oh!

MPolaris Sep 19, 2025 Author

MPolaris
Sep 19, 2025

Replies: 2 comments 3 replies

MPolaris
Sep 19, 2025
Author

ggerganov
Sep 19, 2025
Maintainer

MPolaris Sep 19, 2025
Author

ggerganov Sep 19, 2025
Maintainer

MPolaris Sep 19, 2025
Author