
Conversation

@yeahdongcn
Collaborator

This is a manual merge of ggml-org/llama.cpp#13647.

Testing Done

❯ ollama run gemma3n:e4b
total duration:       7.828684474s
load duration:        58.973429ms
prompt eval count:    11 token(s)
prompt eval duration: 2.363928229s
prompt eval rate:     4.65 tokens/s
eval count:           53 token(s)
eval duration:        5.405308512s
eval rate:            9.81 tokens/s

The eval rate was previously around 7 tokens/s, so this change improves decode throughput by roughly 40%.

@yeahdongcn yeahdongcn requested review from Copilot and fishingfly July 4, 2025 06:44
@yeahdongcn yeahdongcn self-assigned this Jul 4, 2025

Copilot AI left a comment


Pull Request Overview

This PR integrates MUSA’s mudnn::Unary IDENTITY operation to accelerate device-to-device memory copies for FP16/FP32 tensors, yielding a ~40% performance improvement.

  • Adds mudnnMemcpyAsync API declaration and implementation using mudnn::Unary::IDENTITY
  • Updates cpy.cu to route contiguous F32/F16 copies through MUSA when enabled
  • Extends CMake configurations to include new sources and link against the mudnn library

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Summary per file:

  • ml/backend/ggml/ggml/src/ggml-musa/mudnn.cuh: declares mudnnMemcpyAsync
  • ml/backend/ggml/ggml/src/ggml-musa/mudnn.cu: implements mudnnMemcpyAsync with the MUSA DNN identity op
  • ml/backend/ggml/ggml/src/ggml-musa/CMakeLists.txt: adds mudnn headers/sources and links the mudnn library
  • ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu: routes contiguous FP16/FP32 copies through mudnnMemcpyAsync
  • CMakeLists.txt: includes mudnn in the runtime dependency regex and link steps

Comments suppressed due to low confidence (3)

ml/backend/ggml/ggml/src/ggml-musa/mudnn.cu:88

  • There are no unit tests covering mudnnMemcpyAsync; adding tests for both FLOAT and HALF copy paths would help ensure correctness and prevent regressions.
musaError_t mudnnMemcpyAsync(ggml_backend_cuda_context& ctx, const ggml_tensor* dst, const ggml_tensor* src) {

ml/backend/ggml/ggml/src/ggml-musa/mudnn.cu:1

  • The file uses std::vector and std::unordered_map but does not include <vector> or <unordered_map>, leading to compilation failures.
#include <mutex>

ml/backend/ggml/ggml/src/ggml-musa/CMakeLists.txt:101

  • Static builds are not linking against the mudnn library, which will cause unresolved symbol errors when mudnn code is compiled; consider linking mudnn for static configuration or disabling mudnn support in static mode.
        target_link_libraries(ggml-musa PRIVATE MUSA::musart_static MUSA::mublas_static)
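One possible direction for the static-build gap flagged above is sketched below. Note this is an assumption, not the PR's code: the target name `mudnn` mirrors the shared-library link described in this PR, and whether MUSA ships a static mudnn archive at all is unverified here; if it does not, gating mudnn support off for static builds would be the alternative.

```cmake
if (GGML_STATIC)
    # Hypothetical: also link mudnn in static builds so mudnnMemcpyAsync
    # resolves; falls back to the shared mudnn if no static archive exists.
    target_link_libraries(ggml-musa PRIVATE MUSA::musart_static MUSA::mublas_static mudnn)
endif()
```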

@yeahdongcn yeahdongcn merged commit 5b24e02 into main Jul 7, 2025
@yeahdongcn yeahdongcn deleted the xd/mudnn branch July 7, 2025 07:52
yeahdongcn added commits that referenced this pull request between Jul 10, 2025 and Oct 13, 2025.