Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17479

One row per workgroup, similar to sum_rows. Depending on how large real-world use cases are, it may be possible to make it faster.

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #312

Overview

PR #312 implements the GGML_OP_CUMSUM operation for the Vulkan backend, adding GPU-accelerated cumulative-sum functionality. The changes span 5 files with 125 additions and 24 deletions, primarily affecting the Vulkan shader infrastructure and backend integration.

Performance Impact Assessment

No Impact on Inference Performance

The implementation adds a new operation (GGML_OP_CUMSUM) to the Vulkan backend without modifying existing inference paths. Core inference functions (llama_decode, llama_encode, llama_tokenize) remain unchanged, so tokens-per-second throughput is unaffected.

Power Consumption Analysis

Binary-level analysis shows no changes to inference-related binaries:

  • build.bin.libggml-cpu.so: 128,302 nJ (unchanged)
  • build.bin.libggml-base.so: 71,255 nJ (unchanged)
  • build.bin.llama-bench: 49,381 nJ (unchanged)

The 56.96% overall power consumption reduction observed in the version comparison reflects architectural reorganization in other binaries (libllama.so, llama-run, llama-cvector-generator, and llama-tts each show a 100% reduction) and is unrelated to this PR's changes.

Key Findings

Functional Additions

The PR introduces cumulative-sum computation built on Vulkan subgroup arithmetic primitives. The implementation processes one row per workgroup of 128 threads, using subgroupInclusiveAdd for the parallel prefix sum within each subgroup. Subgroups are coordinated through shared memory, which aggregates partial sums across iterations, as sketched below.
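To make the mechanism concrete, here is a minimal GLSL sketch of that scheme. It is not the actual cumsum.comp source: the buffer bindings, push-constant layout, and names (data_a, data_d, p.n_cols, s_partial) are assumptions modeled on typical ggml Vulkan compute shaders, and tensor addressing is simplified to a single 2D case.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require

layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

layout(std430, binding = 0) readonly  buffer A { float data_a[]; }; // hypothetical binding
layout(std430, binding = 1) writeonly buffer D { float data_d[]; }; // hypothetical binding
layout(push_constant) uniform Params { uint n_cols; } p;            // assumed layout

// one partial sum per subgroup; 16 slots cover 128 threads at subgroup size >= 8
shared float s_partial[16];

void main() {
    const uint row  = gl_WorkGroupID.x;        // one row per workgroup
    const uint tid  = gl_LocalInvocationID.x;
    const uint base = row * p.n_cols;          // rows are contiguous (F32 only)

    float last_sum = 0.0;                      // carry between 128-wide chunks

    for (uint i = 0; i < p.n_cols; i += gl_WorkGroupSize.x) {
        const uint  col = i + tid;
        const float v   = (col < p.n_cols) ? data_a[base + col] : 0.0;

        // inclusive prefix sum within this thread's subgroup
        const float scan = subgroupInclusiveAdd(v);

        // the last lane of each subgroup publishes its subgroup total
        if (gl_SubgroupInvocationID == gl_SubgroupSize - 1) {
            s_partial[gl_SubgroupID] = scan;
        }
        barrier();  // barrier 1: make subgroup totals visible

        // offset = carry from earlier chunks + totals of earlier subgroups
        float offset = last_sum;
        for (uint s = 0; s < gl_SubgroupID; ++s) {
            offset += s_partial[s];
        }
        if (col < p.n_cols) {
            data_d[base + col] = scan + offset;
        }

        // every thread advances its private copy of the carry identically
        for (uint s = 0; s < gl_NumSubgroups; ++s) {
            last_sum += s_partial[s];
        }
        barrier();  // barrier 2: s_partial may be overwritten next chunk
    }
}
```

The two barrier() calls per loop iteration correspond to the synchronization count noted under Code Structure, and zero-padding the out-of-range lanes keeps the subgroup totals correct on a final partial chunk.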

Code Structure

The PR refactors sum_rows.comp by extracting common push constants and utility functions into sum_rows.glsl, reducing duplication. The new cumsum.comp shader (69 lines) implements the operation with two synchronization barriers per iteration for correctness.
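As an illustration of this extraction (the real contents of sum_rows.glsl are not reproduced in this summary, so the field and helper names below are placeholders):

```glsl
// sum_rows.glsl -- shared header included by sum_rows.comp and cumsum.comp
layout(push_constant) uniform Params {
    uint n_cols;        // row width (the dimension being reduced or scanned)
    uint ne01, ne02;    // placeholder higher-dimension extents for row lookup
} p;

// helper shared by both shaders: start of a row in a contiguous F32 tensor
uint row_offset(uint row) {
    return row * p.n_cols;
}
```

Each .comp file then pulls the shared pieces in with an `#include "sum_rows.glsl"` directive rather than duplicating them.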

Hardware Requirements

The operation requires the VK_KHR_shader_subgroup_arithmetic extension. Devices without subgroup arithmetic support fall back to CPU execution. Only F32 tensors with contiguous rows are supported.

Integration Points

The changes integrate into existing Vulkan backend infrastructure: pipeline registration, operation dispatch, the graph builder, and the test harness. The public API and core GGML operations are not modified.

Performance Characteristics

The implementation scales linearly with row width (n_cols), since the main loop processes 128 elements per iteration. The loop-carried dependency on last_sum forces iterations to execute sequentially. Memory access is coalesced for both reads and writes.
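As a worked example of the scaling: a row of 4,096 elements takes 4096 / 128 = 32 loop iterations, and at two barriers per iteration that is 64 workgroup-wide synchronizations per row. This is the kind of overhead the PR author notes might be reducible for large real-world use cases.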

@loci-dev force-pushed the main branch 13 times, most recently from 92ef8cd to 7dd50b8 on November 26, 2025 at 16:10.