Conversation

@jukofyork (Collaborator) commented Apr 15, 2025

Addresses #12801 (comment)

  1. Permutes q_pe and q_nope before the ggml_concat(), then permutes Qcur, which will then get permuted back inside build_attn() and thus should be contiguous by the time it reaches build_attn_mha() (see the sketch after this list).
  2. Performs Vcur = ggml_cont(ctx0, Vcur) to ensure that Vcur is contiguous.
  3. Adds back the MQA optimisation in build_attn_mha() (NOTE: this requires the change in (1) to ensure q is contiguous!).
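
For reference, here is a minimal sketch of what (1) and (2) look like in terms of the ggml API - the tensor names, shapes and axis order here are assumptions on my part and may not match the actual diff exactly:

    // assumed shapes: q_nope = {n_embd_head_qk_nope, n_head, n_tokens}
    //                 q_pe   = {n_embd_head_qk_rope, n_head, n_tokens}

    // (1) swap the head and token axes before concatenating along dim 0
    ggml_tensor * q_nope_perm = ggml_permute(ctx0, q_nope, 0, 2, 1, 3);
    ggml_tensor * q_pe_perm   = ggml_permute(ctx0, q_pe,   0, 2, 1, 3);

    ggml_tensor * Qcur = ggml_concat(ctx0, q_nope_perm, q_pe_perm, 0);

    // permute back to the layout build_attn() expects; build_attn() undoes
    // this permute internally, so q should arrive contiguous in build_attn_mha()
    Qcur = ggml_permute(ctx0, Qcur, 0, 2, 1, 3);

    // (2) force Vcur to be contiguous
    Vcur = ggml_cont(ctx0, Vcur);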

Apologies if it doesn't work 100%, as I put this together in the GitHub editor while away from home, but it seems to compile and run OK for me using CUDA (I haven't tested the performance yet, though).

Leaving this as a draft, as @ggerganov will likely be addressing some or all of these points in his subsequent MLA-related PRs - this is really just to try to track down the performance regression @fairydreaming is seeing...

@jukofyork (Collaborator, Author) commented:

@fairydreaming If this doesn't fully resolve the performance regression, then one more slightly dangerous / hacky change is to remove the ggml_mul_mat_set_prec(kq, GGML_PREC_F32) call from here:

        ggml_tensor * kq = nullptr;
        if (ggml_is_contiguous(k) && ggml_is_contiguous(q) && n_head_kv == 1) {
            k = ggml_reshape_2d(ctx0, k, n_embd, n_kv);
            q = ggml_reshape_2d(ctx0, q, n_embd, n_tokens*n_head);
            kq = ggml_mul_mat(ctx0, k, q);
            kq = ggml_reshape_3d(ctx0, kq, n_kv, n_tokens, n_head);
        }

as IIRC it isn't actually needed for DeepSeek-R1, and your earlier tests may have been run without it (sketched below).
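
For clarity, the change amounts to dropping a single line; a rough sketch, assuming the precision override sits directly after the ggml_mul_mat() call shown above:

    kq = ggml_mul_mat(ctx0, k, q);
    // the line in question - removing it drops the forced FP32 accumulation
    // for the K*Q product (reportedly fine for DeepSeek-R1, but not necessarily in general):
    //ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
    kq = ggml_reshape_3d(ctx0, kq, n_kv, n_tokens, n_head);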

@jukofyork (Collaborator, Author) commented:

My original tests (sorry, not very scientific lol):

I can generate over 11 tokens per second for refactoring tasks now on a machine with [...], and around 35-40 tokens per second prompt processing.


Master with the MLA PR merged (see here):

prompt eval time =  285665.33 ms /  9848 tokens (   29.01 ms per token,    34.47 tokens per second)
       eval time = 1101768.72 ms / 11721 tokens (   94.00 ms per token,    10.64 tokens per second)
      total time = 1387434.04 ms / 21569 tokens
slot print_timing: id  0 | task 0 | 
draft acceptance rate = 0.84840 ( 9805 accepted / 11557 generated)

So minimal loss of performance.


This PR:

prompt eval time =  284530.95 ms /  9846 tokens (   28.90 ms per token,    34.60 tokens per second)
       eval time = 1020527.03 ms / 11227 tokens (   90.90 ms per token,    11.00 tokens per second)
      total time = 1305057.98 ms / 21073 tokens
slot print_timing: id  0 | task 19 | 
draft acceptance rate = 0.83936 ( 9280 accepted / 11056 generated)

This PR without ggml_mul_mat_set_prec(kq, GGML_PREC_F32) (see this branch):

prompt eval time =  290181.36 ms /  9848 tokens (   29.47 ms per token,    33.94 tokens per second)
       eval time = 1076024.56 ms / 11784 tokens (   91.31 ms per token,    10.95 tokens per second)
      total time = 1366205.92 ms / 21632 tokens
slot print_timing: id  0 | task 0 | 
draft acceptance rate = 0.84582 ( 9853 accepted / 11649 generated)

So for me, using CUDA, the performance drop isn't all that much (at least with this specific refactoring task, using a draft model with a very high acceptance rate, 6 experts, Q4_K expert tensors, etc).

@jukofyork (Collaborator, Author) commented:

In case it's helpful, here is my compile script that:

  1. Works around the current CUDA problems with attn_k_b and attn_v_b by making them BF16.
  2. Changes < MMV_MAX_ROWS to <= MMV_MAX_ROWS, which together with (1) gains quite a bit for token generation.
  3. Increases min_batch_size, as I found that offloading over PCI-e 3.0 x16 was slower than just running on the CPU.
#!/bin/bash

function safe_sed() {
    local file=$1
    local pattern=$2
    local replacement=$3

    # Check if pattern exists
    if ! sed -n "s/${pattern}/${replacement}/p" "$file" | grep -q .; then
        echo "Error: Pattern not found in $file: $pattern"
        return 1
    fi

    # Create backup
    cp "$file" "$file.bak"

    # Perform the replacement
    sed -i "s/${pattern}/${replacement}/g" "$file"

    # Show diff
    echo "Changes in $file:"
    diff "$file.bak" "$file"

    # Clean up
    rm "$file.bak"

    echo "Successfully replaced in $file"
    echo "-------------------"
}

function safe_sed_function() {
    local file=$1
    local function_signature=$2   # sed address matching the function header, including the leading "/"
    local replacement=$3

    # Create backup
    cp "$file" "$file.bak"

    # Perform the replacement using address range and c command
    sed -i "${function_signature}/,/^}/c\\${replacement}" "$file"

    # Clean up
    rm "$file.bak"

    echo "Successfully replaced function in $file"
    echo "-------------------"
}

rm -rf ~/llama.cpp_MLA
mkdir ~/llama.cpp_MLA
cd ~/llama.cpp_MLA

git clone https://github.com/jukofyork/llama.cpp --branch master-mla-optimise-q
cd llama.cpp

# Allow attn_v_b to use the fast mmv call.
safe_sed "ggml/src/ggml-cuda/ggml-cuda.cu" "< MMV_MAX_ROWS" "<= MMV_MAX_ROWS"

# Don't offload these huge tensors to the GPU, as PCI-e transfer is slower than just using the CPU.
safe_sed "ggml/src/ggml-cuda/ggml-cuda.cu" "const int min_batch_size = 32" "const int min_batch_size = 9999999"

# Hack llama_tensor_get_type() to use our custom quant.
safe_sed_function "src/llama-quant.cpp" \
  "/^static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor \\* tensor, llama_ftype ftype) {" \
  "static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {\n\
    const std::string name = ggml_get_name(tensor);\n\
    if (name.find(\"_exps\") != std::string::npos) {\n\
        return GGML_TYPE_Q4_K;\n\
    } else if (name.find(\"attn_k_b\") != std::string::npos || name.find(\"attn_v_b\") != std::string::npos) {\n\
        return GGML_TYPE_BF16;\n\
    }\n\
    return GGML_TYPE_Q6_K;\n\
}"

cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON
cmake --build build --config Release -- -j 44
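
For readability, this is the function body that the safe_sed_function call splices into src/llama-quant.cpp (the same code as in the sed replacement above, minus the shell escaping; comments added here for explanation):

    static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
        const std::string name = ggml_get_name(tensor);
        if (name.find("_exps") != std::string::npos) {
            return GGML_TYPE_Q4_K;   // expert tensors
        } else if (name.find("attn_k_b") != std::string::npos || name.find("attn_v_b") != std::string::npos) {
            return GGML_TYPE_BF16;   // work around the current CUDA problems with these tensors
        }
        return GGML_TYPE_Q6_K;       // everything else
    }

Note that this replaces the original quantisation heuristics wholesale, so it only makes sense for this particular custom quant.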

And here is the script for running the server with 6 experts, a draft model, etc:

#!/bin/bash

host_address=192.168.1.2
port_number=8080

# Store the original directory
ORIGINAL_DIR=$(pwd)

# Change to the target directory
cd ~/llama.cpp_MLA/llama.cpp/build/bin

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo    # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Run the main command
./llama-server \
        --host "$host_address" \
        --port "$port_number" \
        --model ~/models/gguf/deepseek-v3-0324-mla-Q4_K_L+BF16.gguf \
        --alias "deepseek-v3-0324" \
        --chat-template deepseek3 \
        --n-gpu-layers 99 \
        --numa distribute \
        --override-tensor exps=CPU \
        --override-kv "deepseek2.expert_used_count=int:6" \
        --override-kv "deepseek2.expert_weights_scale=float:2.3" \
        --ctx_size 32768 \
        --batch-size 1024 \
        --ubatch-size 256 \
        --model-draft ~/models/gguf/draft_models/DeepSeek-V3-0324-DRAFT-0.5B-Q4_0.gguf \
        --top-k 1 \
        --samplers "top_k" \
        --gpu-layers-draft 99 \
        --draft-min 3 \
        --draft-max 32 \
        --draft-p-min 0.667

# Return to the original directory
cd "$ORIGINAL_DIR"

@jukofyork changed the title from "DeepSeek V2/V3 MLA optimisations" to "DeepSeek V2/V3 MLA optimisations (please ignore)" on Apr 15, 2025
@jukofyork (Collaborator, Author) commented:

Closing, as this didn't solve the regression.

The conversation continues here: #12801 (comment)

@jukofyork closed this on Apr 15, 2025
@jukofyork deleted the master-mla-optimise-q branch on July 10, 2025