DeepSeek V2/V3 MLA optimisations (please ignore) #12961
Conversation
@fairydreaming If this doesn't fully resolve the performance regression, then one more slightly dangerous / hacky change is to remove the `Vcur = ggml_cont(ctx0, Vcur)` call as, IIRC, it isn't actually needed.
My original tests (sorry, not very scientific lol):

Master with …: …
This PR: …
This PR without …: …

So for me using CUDA, the performance drop isn't all that much (at least with this specific refactoring task using a draft model and a very high acceptance rate, 6 experts and …).
In case it's helpful, here is the compile script I use; it clones my branch, applies the three patches described in the inline comments, and then builds with CUDA:

```bash
#!/bin/bash
function safe_sed() {
local file=$1
local pattern=$2
local replacement=$3
# Check if pattern exists
if ! sed -n "s/${pattern}/${replacement}/p" "$file" | grep -q .; then
echo "Error: Pattern not found in $file: $pattern"
return 1
fi
# Create backup
cp "$file" "$file.bak"
# Perform the replacement
sed -i "s/${pattern}/${replacement}/g" "$file"
# Show diff
echo "Changes in $file:"
diff "$file.bak" "$file"
# Clean up
rm "$file.bak"
echo "Successfully replaced in $file"
echo "-------------------"
}
function safe_sed_function() {
local file=$1
local function_signature=$2
local replacement=$3
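# Note: function_signature is expected to already include the leading "/" of the
# sed address (the closing "/" and the ",/^}/c\" range are appended below).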
# Create backup
cp "$file" "$file.bak"
# Perform the replacement using address range and c command
sed -i "${function_signature}/,/^}/c\\${replacement}" "$file"
# Clean up
rm "$file.bak"
echo "Successfully replaced function in $file"
echo "-------------------"
}
rm -rf ~/llama.cpp_MLA
mkdir ~/llama.cpp_MLA
cd ~/llama.cpp_MLA
git clone https://github.com/jukofyork/llama.cpp --branch master-mla-optimise-q
cd llama.cpp
# For attn_v_b to use fast mmv call.
safe_sed "ggml/src/ggml-cuda/ggml-cuda.cu" "< MMV_MAX_ROWS" "<= MMV_MAX_ROWS"
# Don't offload these huge tensors to GPU as PCI-E transfer is slower than just using the CPU.
safe_sed "ggml/src/ggml-cuda/ggml-cuda.cu" "const int min_batch_size = 32" "const int min_batch_size = 9999999"
# Hack llama_tensor_get_type() to use our custom quant.
safe_sed_function "src/llama-quant.cpp" \
"/^static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor \\* tensor, llama_ftype ftype) {" \
"static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {\n\
const std::string name = ggml_get_name(tensor);\n\
if (name.find(\"_exps\") != std::string::npos) {\n\
return GGML_TYPE_Q4_K;\n\
} else if (name.find(\"attn_k_b\") != std::string::npos || name.find(\"attn_v_b\") != std::string::npos) {\n\
return GGML_TYPE_BF16;\n\
}\n\
return GGML_TYPE_Q6_K;\n\
}"
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON
cmake --build build --config Release -- -j 44
```

And here is the script for running with 6 experts, a draft model, etc.:

```bash
#!/bin/bash
host_address=192.168.1.2
port_number=8080
# Store the original directory
ORIGINAL_DIR=$(pwd)
# Change to the target directory
cd ~/llama.cpp_MLA/llama.cpp/build/bin
# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null
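# (To re-enable it later: echo 1 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null)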
# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
echo "Dropping caches..."
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi
# Run the main command
./llama-server \
--host "$host_address" \
--port "$port_number" \
--model ~/models/gguf/deepseek-v3-0324-mla-Q4_K_L+BF16.gguf \
--alias "deepseek-v3-0324" \
--chat-template deepseek3 \
--n-gpu-layers 99 \
--numa distribute \
--override-tensor exps=CPU \
--override-kv "deepseek2.expert_used_count=int:6" \
--override-kv "deepseek2.expert_weights_scale=float:2.3" \
--ctx_size 32768 \
--batch-size 1024 \
--ubatch-size 256 \
--model-draft ~/models/gguf/draft_models/DeepSeek-V3-0324-DRAFT-0.5B-Q4_0.gguf \
--top-k 1 \
--samplers "top_k" \
--gpu-layers-draft 99 \
--draft-min 3 \
--draft-max 32 \
--draft-p-min 0.667
# Return to the original directory
cd "$ORIGINAL_DIR" |
Closing as this did not solve the regression. The conversation continues here: #12801 (comment)
Addresses #12801 (comment)
This does the following:

1. Permutes `q_pe` and `q_nope` before the `ggml_concat()`, then permutes `Qcur`, which will then get permuted back inside of `build_attn()` and thus should be contiguous by the time it gets to `build_attn_mha()`.
2. Adds `Vcur = ggml_cont(ctx0, Vcur)` to ensure that `Vcur` is contiguous.
3. Updates `build_attn_mha()` (NOTE: requires the change in (1) to ensure `q` is contiguous!).

Apologies if it doesn't work 100%, as I just did it using the GitHub editor while away from home, but it seems to compile and run OK for me using CUDA (I haven't tested the performance yet though).
Leaving as a draft as @ggerganov will likely be addressing some or all of these in his subsequent MLA-related PRs - this is really just to try to track down the performance regression @fairydreaming is finding...
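For anyone wanting to try this without my fork, the branch can also be fetched straight from the PR ref (a minimal sketch, assuming the upstream ggml-org/llama.cpp URL; build flags as in the compile script above):

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/12961/head:mla-optimise   # check out this PR's branch via GitHub's PR ref
git checkout mla-optimise
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON
cmake --build build --config Release -- -j 44
```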