Conversation

@jukofyork (Collaborator) commented Apr 15, 2025

Addresses #12801 (comment)

  1. Permutes q_pe and q_nope before the ggml_concat(), then permutes Qcur, which will then get permuted back inside build_attn() and thus should be contiguous by the time it reaches build_attn_mha() (see the sketch after this list).
  2. Performs Vcur = ggml_cont(ctx0, Vcur) to ensure that Vcur is contiguous.
  3. Adds back the MQA optimisation in build_attn_mha() (NOTE: this requires the change in (1) to ensure q is contiguous!).
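
For reference, here is a minimal sketch of what (1) and (2) look like in terms of the ggml API - the tensor names, shapes and axis order here are assumptions on my part and may not match the actual diff exactly:

    // assumed shapes: q_nope = {n_embd_head_qk_nope, n_head, n_tokens}
    //                 q_pe   = {n_embd_head_qk_rope, n_head, n_tokens}

    // (1) swap the head and token axes before concatenating along dim 0
    ggml_tensor * q_nope_perm = ggml_permute(ctx0, q_nope, 0, 2, 1, 3);
    ggml_tensor * q_pe_perm   = ggml_permute(ctx0, q_pe,   0, 2, 1, 3);

    ggml_tensor * Qcur = ggml_concat(ctx0, q_nope_perm, q_pe_perm, 0);

    // permute back to the layout build_attn() expects; build_attn() undoes
    // this permute internally, so q should arrive contiguous in build_attn_mha()
    Qcur = ggml_permute(ctx0, Qcur, 0, 2, 1, 3);

    // (2) force Vcur to be contiguous
    Vcur = ggml_cont(ctx0, Vcur);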

Apologies if it doesn't work 100%, as I put this together in the GitHub editor while away from home, but it seems to compile and run OK for me using CUDA (I haven't tested the performance yet, though).

Leaving this as a draft, as @ggerganov will likely be addressing some or all of these points in his subsequent MLA-related PRs - this is really just to try to track down the performance regression @fairydreaming is seeing...

@jukofyork (Collaborator, Author) commented:

@fairydreaming If this doesn't fully resolve the performance regression, then one more slightly dangerous / hacky change is to remove the ggml_mul_mat_set_prec(kq, GGML_PREC_F32) call from here:

        ggml_tensor * kq = nullptr;
        if (ggml_is_contiguous(k) && ggml_is_contiguous(q) && n_head_kv == 1) {
            k = ggml_reshape_2d(ctx0, k, n_embd, n_kv);
            q = ggml_reshape_2d(ctx0, q, n_embd, n_tokens*n_head);
            kq = ggml_mul_mat(ctx0, k, q);
            kq = ggml_reshape_3d(ctx0, kq, n_kv, n_tokens, n_head);
        }

as IIRC it isn't actually needed for DeepSeek-R1, and your earlier tests may have been run without it (sketched below).
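
For clarity, the change amounts to dropping a single line; a rough sketch, assuming the precision override sits directly after the ggml_mul_mat() call shown above:

    kq = ggml_mul_mat(ctx0, k, q);
    // the line in question - removing it drops the forced FP32 accumulation
    // for the K*Q product (reportedly fine for DeepSeek-R1, but not necessarily in general):
    //ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
    kq = ggml_reshape_3d(ctx0, kq, n_kv, n_tokens, n_head);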

@jukofyork (Collaborator, Author) commented:

My original tests (sorry, not very scientific lol):

I can generate over 11 tokens per second for refactoring tasks now on a machine with [...], and around 35-40 tokens per second prompt processing.


Master with the MLA PR merged (see here):

prompt eval time =  285665.33 ms /  9848 tokens (   29.01 ms per token,    34.47 tokens per second)
       eval time = 1101768.72 ms / 11721 tokens (   94.00 ms per token,    10.64 tokens per second)
      total time = 1387434.04 ms / 21569 tokens
slot print_timing: id  0 | task 0 | 
draft acceptance rate = 0.84840 ( 9805 accepted / 11557 generated)

So minimal loss of performance.


This PR:

prompt eval time =  284530.95 ms /  9846 tokens (   28.90 ms per token,    34.60 tokens per second)
       eval time = 1020527.03 ms / 11227 tokens (   90.90 ms per token,    11.00 tokens per second)
      total time = 1305057.98 ms / 21073 tokens
slot print_timing: id  0 | task 19 | 
draft acceptance rate = 0.83936 ( 9280 accepted / 11056 generated)

This PR without ggml_mul_mat_set_prec(kq, GGML_PREC_F32) (see this branch):

prompt eval time =  290181.36 ms /  9848 tokens (   29.47 ms per token,    33.94 tokens per second)
       eval time = 1076024.56 ms / 11784 tokens (   91.31 ms per token,    10.95 tokens per second)
      total time = 1366205.92 ms / 21632 tokens
slot print_timing: id  0 | task 0 | 
draft acceptance rate = 0.84582 ( 9853 accepted / 11649 generated)

So for me, using CUDA, the performance drop isn't all that much (at least with this specific refactoring task, using a draft model with a very high acceptance rate, 6 experts, Q4_K expert tensors, etc).

@jukofyork (Collaborator, Author) commented:

In case it's helpful, here is my compile script that:

  1. Works around the current CUDA problems with attn_k_b and attn_v_b by making them BF16.
  2. Changes < MMV_MAX_ROWS to <= MMV_MAX_ROWS, which together with (1) gains quite a bit for token generation.
  3. Increases min_batch_size, as I found that offloading over PCI-e 3.0 x16 was slower than just running on the CPU.
#!/bin/bash

function safe_sed() {
    local file=$1
    local pattern=$2
    local replacement=$3

    # Check if pattern exists
    if ! sed -n "s/${pattern}/${replacement}/p" "$file" | grep -q .; then
        echo "Error: Pattern not found in $file: $pattern"
        return 1
    fi

    # Create backup
    cp "$file" "$file.bak"

    # Perform the replacement
    sed -i "s/${pattern}/${replacement}/g" "$file"

    # Show diff
    echo "Changes in $file:"
    diff "$file.bak" "$file"

    # Clean up
    rm "$file.bak"

    echo "Successfully replaced in $file"
    echo "-------------------"
}

function safe_sed_function() {
    local file=$1
    local function_signature=$2   # sed address matching the function header, including the leading "/"
    local replacement=$3

    # Create backup
    cp "$file" "$file.bak"

    # Perform the replacement using address range and c command
    sed -i "${function_signature}/,/^}/c\\${replacement}" "$file"

    # Clean up
    rm "$file.bak"

    echo "Successfully replaced function in $file"
    echo "-------------------"
}

rm -rf ~/llama.cpp_MLA
mkdir ~/llama.cpp_MLA
cd ~/llama.cpp_MLA

git clone https://github.com/jukofyork/llama.cpp --branch master-mla-optimise-q
cd llama.cpp

# Allow attn_v_b to use the fast mmv call.
safe_sed "ggml/src/ggml-cuda/ggml-cuda.cu" "< MMV_MAX_ROWS" "<= MMV_MAX_ROWS"

# Don't offload these huge tensors to the GPU, as PCI-e transfer is slower than just using the CPU.
safe_sed "ggml/src/ggml-cuda/ggml-cuda.cu" "const int min_batch_size = 32" "const int min_batch_size = 9999999"

# Hack llama_tensor_get_type() to use our custom quant.
safe_sed_function "src/llama-quant.cpp" \
  "/^static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor \\* tensor, llama_ftype ftype) {" \
  "static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {\n\
    const std::string name = ggml_get_name(tensor);\n\
    if (name.find(\"_exps\") != std::string::npos) {\n\
        return GGML_TYPE_Q4_K;\n\
    } else if (name.find(\"attn_k_b\") != std::string::npos || name.find(\"attn_v_b\") != std::string::npos) {\n\
        return GGML_TYPE_BF16;\n\
    }\n\
    return GGML_TYPE_Q6_K;\n\
}"

cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON
cmake --build build --config Release -- -j 44
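
For readability, this is the function body that the safe_sed_function call splices into src/llama-quant.cpp (the same code as in the sed replacement above, minus the shell escaping; comments added here for explanation):

    static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
        const std::string name = ggml_get_name(tensor);
        if (name.find("_exps") != std::string::npos) {
            return GGML_TYPE_Q4_K;   // expert tensors
        } else if (name.find("attn_k_b") != std::string::npos || name.find("attn_v_b") != std::string::npos) {
            return GGML_TYPE_BF16;   // work around the current CUDA problems with these tensors
        }
        return GGML_TYPE_Q6_K;       // everything else
    }

Note that this replaces the original quantisation heuristics wholesale, so it only makes sense for this particular custom quant.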

And here is the script for running the server with 6 experts, a draft model, etc:

#!/bin/bash

host_address=192.168.1.2
port_number=8080

# Store the original directory
ORIGINAL_DIR=$(pwd)

# Change to the target directory
cd ~/llama.cpp_MLA/llama.cpp/build/bin

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo    # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Run the main command
./llama-server \
        --host "$host_address" \
        --port "$port_number" \
        --model ~/models/gguf/deepseek-v3-0324-mla-Q4_K_L+BF16.gguf \
        --alias "deepseek-v3-0324" \
        --chat-template deepseek3 \
        --n-gpu-layers 99 \
        --numa distribute \
        --override-tensor exps=CPU \
        --override-kv "deepseek2.expert_used_count=int:6" \
        --override-kv "deepseek2.expert_weights_scale=float:2.3" \
        --ctx_size 32768 \
        --batch-size 1024 \
        --ubatch-size 256 \
        --model-draft ~/models/gguf/draft_models/DeepSeek-V3-0324-DRAFT-0.5B-Q4_0.gguf \
        --top-k 1 \
        --samplers "top_k" \
        --gpu-layers-draft 99 \
        --draft-min 3 \
        --draft-max 32 \
        --draft-p-min 0.667

# Return to the original directory
cd "$ORIGINAL_DIR"

@jukofyork changed the title from "DeepSeek V2/V3 MLA optimisations" to "DeepSeek V2/V3 MLA optimisations (please ignore)" on Apr 15, 2025
@jukofyork (Collaborator, Author) commented:

Closing, as this didn't solve the regression.

The conversation continues here: #12801 (comment)

@jukofyork closed this on Apr 15, 2025
@jukofyork deleted the master-mla-optimise-q branch on July 10, 2025