Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17479

One row per workgroup, similar to sum_rows. Depending on how large real-world use cases are, it may be possible to make it faster.

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #312

Overview

PR #312 implements the GGML_OP_CUMSUM operation for the Vulkan backend, adding GPU-accelerated cumulative-sum functionality. The changes span 5 files with 125 additions and 24 deletions, primarily affecting the Vulkan shader infrastructure and backend integration.

Performance Impact Assessment

No Impact on Inference Performance

The implementation adds a new operation (GGML_OP_CUMSUM) to the Vulkan backend without modifying existing inference paths. Core inference functions (llama_decode, llama_encode, llama_tokenize) remain unchanged, so tokens-per-second throughput is unaffected.

Power Consumption Analysis

Binary-level analysis shows no changes to inference-related binaries:

  • build.bin.libggml-cpu.so: 128,302 nJ (unchanged)
  • build.bin.libggml-base.so: 71,255 nJ (unchanged)
  • build.bin.llama-bench: 49,381 nJ (unchanged)

The 56.96% overall power consumption reduction observed in the version comparison reflects architectural reorganization in other binaries (libllama.so, llama-run, llama-cvector-generator, and llama-tts each show a 100% reduction) and is unrelated to this PR's changes.

Key Findings

Functional Additions

The PR introduces cumulative-sum computation built on Vulkan subgroup arithmetic primitives. The implementation processes one row per workgroup of 128 threads, using subgroupInclusiveAdd for the parallel prefix sum within each subgroup. Subgroups are coordinated through shared memory, which aggregates partial sums across iterations, as sketched below.
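To make the mechanism concrete, here is a minimal GLSL sketch of that scheme. It is not the actual cumsum.comp source: the buffer bindings, push-constant layout, and names (data_a, data_d, p.n_cols, s_partial) are assumptions modeled on typical ggml Vulkan compute shaders, and tensor addressing is simplified to a single 2D case.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require

layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

layout(std430, binding = 0) readonly  buffer A { float data_a[]; }; // hypothetical binding
layout(std430, binding = 1) writeonly buffer D { float data_d[]; }; // hypothetical binding
layout(push_constant) uniform Params { uint n_cols; } p;            // assumed layout

// one partial sum per subgroup; 16 slots cover 128 threads at subgroup size >= 8
shared float s_partial[16];

void main() {
    const uint row  = gl_WorkGroupID.x;        // one row per workgroup
    const uint tid  = gl_LocalInvocationID.x;
    const uint base = row * p.n_cols;          // rows are contiguous (F32 only)

    float last_sum = 0.0;                      // carry between 128-wide chunks

    for (uint i = 0; i < p.n_cols; i += gl_WorkGroupSize.x) {
        const uint  col = i + tid;
        const float v   = (col < p.n_cols) ? data_a[base + col] : 0.0;

        // inclusive prefix sum within this thread's subgroup
        const float scan = subgroupInclusiveAdd(v);

        // the last lane of each subgroup publishes its subgroup total
        if (gl_SubgroupInvocationID == gl_SubgroupSize - 1) {
            s_partial[gl_SubgroupID] = scan;
        }
        barrier();  // barrier 1: make subgroup totals visible

        // offset = carry from earlier chunks + totals of earlier subgroups
        float offset = last_sum;
        for (uint s = 0; s < gl_SubgroupID; ++s) {
            offset += s_partial[s];
        }
        if (col < p.n_cols) {
            data_d[base + col] = scan + offset;
        }

        // every thread advances its private copy of the carry identically
        for (uint s = 0; s < gl_NumSubgroups; ++s) {
            last_sum += s_partial[s];
        }
        barrier();  // barrier 2: s_partial may be overwritten next chunk
    }
}
```

The two barrier() calls per loop iteration correspond to the synchronization count noted under Code Structure, and zero-padding the out-of-range lanes keeps the subgroup totals correct on a final partial chunk.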

Code Structure

The PR refactors sum_rows.comp by extracting common push constants and utility functions into sum_rows.glsl, reducing duplication. The new cumsum.comp shader (69 lines) implements the operation with two synchronization barriers per iteration for correctness.
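As an illustration of this extraction (the real contents of sum_rows.glsl are not reproduced in this summary, so the field and helper names below are placeholders):

```glsl
// sum_rows.glsl -- shared header included by sum_rows.comp and cumsum.comp
layout(push_constant) uniform Params {
    uint n_cols;        // row width (the dimension being reduced or scanned)
    uint ne01, ne02;    // placeholder higher-dimension extents for row lookup
} p;

// helper shared by both shaders: start of a row in a contiguous F32 tensor
uint row_offset(uint row) {
    return row * p.n_cols;
}
```

Each .comp file then pulls the shared pieces in with an `#include "sum_rows.glsl"` directive rather than duplicating them.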

Hardware Requirements

The operation requires the VK_KHR_shader_subgroup_arithmetic extension. Devices without subgroup arithmetic support fall back to CPU execution. Only F32 tensors with contiguous rows are supported.

Integration Points

The changes integrate into existing Vulkan backend infrastructure: pipeline registration, operation dispatch, the graph builder, and the test harness. The public API and core GGML operations are not modified.

Performance Characteristics

The implementation scales linearly with row width (n_cols), since the main loop processes 128 elements per iteration. The loop-carried dependency on last_sum forces iterations to execute sequentially. Memory access is coalesced for both reads and writes.
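As a worked example of the scaling: a row of 4,096 elements takes 4096 / 128 = 32 loop iterations, and at two barriers per iteration that is 64 workgroup-wide synchronizations per row. This is the kind of overhead the PR author notes might be reducible for large real-world use cases.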

@loci-dev force-pushed the main branch 13 times, most recently from 92ef8cd to 7dd50b8 on November 26, 2025 at 16:10.