
Conversation


@xuanson2025 xuanson2025 commented Sep 14, 2025

Make sure to read the contributing guidelines before submitting a PR

Summary by CodeRabbit

  • New Features

    • Added int32 support for tensor duplication with automatic conversions between FP16, BF16, FP32, and int32.
    • Added bidirectional paths between quantized and float types, enabling dequantization to float and conversion into quantized formats.
  • Refactor

    • Unified duplication into a generic, type-safe system for consistent cross-type behavior and easier extensibility.
    • Streamlined dispatch to reduce per-type special cases while keeping optimized same-type copies.


coderabbitai bot commented Sep 14, 2025

Walkthrough

Adds int32_t↔float conversion helpers and a specialization for type_conversion_table<int32_t> in common.h. Refactors tensor duplication in ops.cpp to templated, type-generic paths for non-quantized copies, float↔quantized conversions, and quantized→float dequantization; updates dispatch and leaves some combos unimplemented.

Changes

  • Type conversion integration — ggml/src/ggml-cpu/common.h
    Adds inline helpers i32_to_f32(int32_t) -> float and f32_to_i32(float) -> int32_t. Adds template <> struct type_conversion_table<int32_t> with to_f32 and from_f32 function pointers.
  • Duplication op refactor — ggml/src/ggml-cpu/ops.cpp
    Replaces per-type dup routines with templates: ggml_compute_forward_dup_flt<src_t,dst_t>, ggml_compute_forward_dup_to_q<src_t>, and ggml_compute_forward_dup_from_q. Updates ggml_compute_forward_dup(...) dispatch to call the new generic paths, includes quantized↔float handling and some not implemented branches.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Dispatcher as ggml_compute_forward_dup
  participant DupFlt as dup_flt<src,dst>
  participant DupToQ as dup_to_q<src>
  participant DupFromQ as dup_from_q
  participant TCT as type_conversion_table<>

  Caller->>Dispatcher: request dup(src_tensor, dst_tensor)
  alt src is non-quantized and dst is non-quantized
    Dispatcher->>DupFlt: call dup_flt<src,dst>
    DupFlt->>TCT: use to_f32(src) / from_f32(dst) per element
    TCT-->>DupFlt: converted values
    DupFlt-->>Dispatcher: done
  else src is non-quantized and dst is quantized
    Dispatcher->>DupToQ: call dup_to_q<src>
    DupToQ->>TCT: to_f32(src) then quantize
    DupToQ-->>Dispatcher: done
  else src is quantized and dst is float
    Dispatcher->>DupFromQ: call dup_from_q
    DupFromQ-->>Dispatcher: dequantized values
  else
    Dispatcher-->>Caller: abort / not implemented
  end
  Dispatcher-->>Caller: return result
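The branch structure in the diagram can be modeled as a small standalone sketch. This is a simplification for illustration only: the real ggml_compute_forward_dup switches on concrete ggml_type values, and the "kind" enum and dispatch_dup function here are hypothetical.

```cpp
// Simplified model of the dispatch shown in the sequence diagram.
enum class kind { f32, f16, bf16, i32, quantized };

static bool is_quantized(kind k) { return k == kind::quantized; }

// Returns true if the (src, dst) pair has an implemented path,
// mirroring the alt/else branches above.
static bool dispatch_dup(kind src, kind dst) {
    if (!is_quantized(src) && !is_quantized(dst)) {
        // dup_flt<src_t, dst_t>: element-wise cast via type_conversion_table
        return true;
    }
    if (!is_quantized(src) && is_quantized(dst)) {
        // dup_to_q<src_t>: cast rows to f32, then quantize
        return true;
    }
    if (is_quantized(src) && dst == kind::f32) {
        // dup_from_q: dequantize rows to f32
        return true;
    }
    return false; // unimplemented combination -> abort in the real code
}
```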

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

Hop-hop, I tweak bits in the burrow tonight,
ints dance to floats in the soft LED light.
Templates I weave with a twitch of my nose,
copying, quantizing—wherever it goes.
A carrot for types, a hop and a cheer! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check — ⚠️ Warning. The PR description contains only the contributing-guidelines reminder and does not document the actual changes, motivation, testing, or impact; the provided raw_summary and pr_objectives show a non-trivial CPU refactor and new type-conversion helpers, but none of that is described in the PR body. Resolution: expand the PR description to include a short summary of what was changed and why, the key files and APIs affected, testing performed and how reviewers can reproduce results, and any compatibility notes or known incomplete paths.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 33.33%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check — ✅ Passed. The title "ggml : refactor forward_dup for cpu backend" succinctly and accurately describes the primary change in the changeset — a refactor of the forward_dup implementation for the CPU backend, as seen in ggml/src/ggml-cpu/ops.cpp and related headers — and is concise and clear for repository history scanning.


@github-actions github-actions bot added the ggml label Sep 14, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
ggml/src/ggml-cpu/ops.cpp (3)

10-12: Missing header for std::is_same_v

This TU uses std::is_same_v but doesn’t include <type_traits>. Add it to avoid build breaks on some compilers.

Apply this diff:

 #include <float.h>
 #include <algorithm>
+#include <type_traits>
 

157-209: Bug: wrong wrap condition uses ne00 instead of ne0

In the same‑type, non‑contiguous branch you advance the dst logical X counter with ne00 (src row length) instead of ne0 (dst). This will mis-wrap when src/dst shapes differ but element‑wise copy is intended.

Apply this diff:

-                        if (++i10 == ne00) {
+                        if (++i10 == ne0) {

267-324: to‑Q path: only works for contiguous dst and contiguous dim0 src; add guards/fallback

  • Only the nb00 == sizeof(src_t) && ggml_is_contiguous(dst) path is implemented; other cases abort. At minimum, assert block alignment and document the limitation.
  • Consider adding a gather path similar to dup_flt’s non‑contig case to avoid the hard abort for common views.

Apply these minimal safety checks:

             if (ggml_get_type_traits_cpu(dst->type)->from_float) {
                 // casting non-quantized types --> intermediate f32 --> quantized
                 ggml_from_float_t const quantize_row_q = ggml_get_type_traits_cpu(dst->type)->from_float;
                 float * src0_f32 = (float *) params->wdata + (ne00 + CACHE_LINE_SIZE_F32) * ith;
+                GGML_ASSERT(ne00 % ggml_blck_size(dst->type) == 0);

If you want, I can draft a small gather loop for the nb00 != sizeof(src_t) case mirroring Lines 129–147 in dup_flt.

🧹 Nitpick comments (5)
ggml/src/ggml-cpu/ops.cpp (5)

86-149: Contiguous cast path looks good; minor guard missing

When quantizing row-wise later you assert block-alignment; add the same explicit guard here for clarity when you rely on rs = ne00*nb00 semantics (non‑Q only).

Please confirm upstream callers never pass ne00 = 0 (empty row). If that’s possible, early‑return would avoid touching id math.


322-324: Hard abort on non‑contiguous/unsupported to‑Q

Abort is fine for now, but please convert to a GGML_ASSERT with an explanatory message or call GGML_ABORT("dup_to_q: only contiguous dst and dim0‑contig src supported") to aid users.


476-525: from‑Q path: assumptions are fine; add explicit dtype assert

This function writes float rows. Dispatch guarantees dst F32, but adding an assert improves local safety.

Apply this diff:

     const ggml_type type = src0->type;
     ggml_to_float_t const dequantize_row_q = ggml_get_type_traits(type)->to_float;
+    GGML_ASSERT(dst->type == GGML_TYPE_F32);

86-106: Potential micro‑perf: hoist row-stride math and remove repeated id adjustments

The id += rs*ir0 / id += rs*(ne01 - ir1) pattern is correct but can be simplified by computing a base pointer per (i03, i02) and iterating rows linearly. This will slightly reduce ALU pressure in hot copies.

Also applies to: 107-126, 129-147


49-53: Tighten non‑Q precondition

You assert both src and dst are non‑quantized. Consider guarding dst element sizes too (nb0 == sizeof(dst_t), nb00 == sizeof(src_t)) and fallback early to bytes path when equal types/strides to avoid duplicate logic.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa0c461 and 06dda63.

📒 Files selected for processing (2)
  • ggml/src/ggml-cpu/common.h (2 hunks)
  • ggml/src/ggml-cpu/ops.cpp (14 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
ggml/src/ggml-cpu/ops.cpp (2)
ggml/src/ggml.c (3)
  • ggml_nelements (1187-1191)
  • ggml_is_quantized (1249-1251)
  • ggml_is_contiguous (1382-1384)
ggml/src/ggml-cpu/ggml-cpu.c (1)
  • ggml_get_type_traits_cpu (381-383)
🔇 Additional comments (5)
ggml/src/ggml-cpu/common.h (1)

65-70: Confirm: dup call sites use the conversion helpers — no further change required

  • common.h already specializes type_conversion_table<int32_t> to use i32_to_f32 / f32_to_i32.
  • ops.cpp dup paths instantiate ggml_compute_forward_dup_flt with float↔int32 (lines shown by your script), so conversions go through the table and will use the f32→i32 semantics (truncation + saturation) you mentioned.
  • No other uses of type_conversion_table<int32_t> were discovered by the search you ran; the dup paths are the relevant callers that would be sensitive to rounding vs truncation.

ggml/src/ggml-cpu/ops.cpp (4)

44-85: Generic dup: solid structure; asserts are right-sized

Nice consolidation of non‑quantized paths into a single templated kernel with fast memcpy when shapes/strides align. The quantized guard at Line 52 is correct.


210-263: Non‑contig cast branch is consistent

Element-wise cast through type_conversion_table is correct and matches stride semantics. Nice.


290-317: Work buffer sizing for src0_f32

This uses per‑thread scratch of (ne00 + CACHE_LINE_SIZE_F32) floats. Ensure ggml_compute_forward's planner accounts for this in params->wsize for the dup→Q path; otherwise this will stomp memory under load.

I can write a quick repo scan to confirm the worksize for OP_DUP is updated in the allocator.


541-565: Dispatch matrix: good coverage; I32 conversions partial

  • F32→I32 and I32→F32 covered; other I32 casts intentionally unimplemented (abort). OK if documented; otherwise gate via feature flag.
  • Keep the early “same‑type” fast path (Lines 533–536) — good.

Please confirm any previous BF16↔Q direct dup paths weren’t used by frontends; they now route via to_q/from_q.

Comment on lines +31 to +38
static inline float i32_to_f32(int32_t x) {
return x;
}

static inline int32_t f32_to_i32(float x) {
return x;
}


⚠️ Potential issue

Define f32↔i32 semantics; add saturation and explicit casts to prevent UB and warnings

  • Implicit int→float and float→int conversions can trigger -Wconversion and yield UB for NaN/out‑of‑range floats. Make casts explicit and clamp float→int32_t.

Apply this diff:

 static inline float i32_to_f32(int32_t x) {
-    return x;
+    return static_cast<float>(x);
 }
 
 static inline int32_t f32_to_i32(float x) {
-    return x;
+    // Saturate and truncate toward zero to match C/C++ cast semantics
+    if (std::isnan(x)) return 0;
+    constexpr float kMin = static_cast<float>(std::numeric_limits<int32_t>::min());
+    constexpr float kMax = static_cast<float>(std::numeric_limits<int32_t>::max());
+    if (x <= kMin) return std::numeric_limits<int32_t>::min();
+    if (x >= kMax) return std::numeric_limits<int32_t>::max();
+    return static_cast<int32_t>(x);
 }

Add the following includes near the existing C++ includes (outside this hunk):

#include <limits>
#include <cmath>
#include <cstdint> // if not already provided transitively
🤖 Prompt for AI Agents
In ggml/src/ggml-cpu/common.h around lines 31 to 38, the conversion helpers use
implicit conversions that cause -Wconversion warnings and undefined behavior for
NaN/out-of-range floats; update as follows: add includes <limits>, <cmath>, and
<cstdint> near the other C++ includes, change i32_to_f32 to use an explicit
static_cast<float>(x), and change f32_to_i32 to explicitly handle NaN and
saturate to int32 bounds using std::numeric_limits<int32_t>::min()/max() (e.g.,
if std::isnan(x) return 0; if x <= min return min; if x >= max return max;
otherwise return static_cast<int32_t>(x) or use a defined rounding strategy),
ensuring all casts are explicit to avoid UB and warnings.
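The saturating semantics proposed in the diff above can be exercised in isolation. The snippet below reproduces the suggested helper body under a hypothetical name (f32_to_i32_sat) so it doesn't masquerade as the repo's actual function; behavior is truncation toward zero with clamping at int32 bounds and NaN mapped to 0.

```cpp
#include <cstdint>
#include <cmath>
#include <limits>

// Standalone copy of the saturating conversion proposed in the review.
static inline int32_t f32_to_i32_sat(float x) {
    if (std::isnan(x)) return 0;
    constexpr float kMin = static_cast<float>(std::numeric_limits<int32_t>::min());
    constexpr float kMax = static_cast<float>(std::numeric_limits<int32_t>::max());
    if (x <= kMin) return std::numeric_limits<int32_t>::min();
    if (x >= kMax) return std::numeric_limits<int32_t>::max();
    return static_cast<int32_t>(x); // in-range: truncate toward zero
}
```

Note the boundary subtlety the clamping sidesteps: static_cast<float>(INT32_MAX) rounds up to 2^31, a value not representable in int32_t, so the x >= kMax guard must fire before the final cast to avoid UB.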

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ggml/src/ggml-cpu/ops.cpp (1)

157-209: Bug: wrong dimension counters in non‑contiguous same‑type branch

The carry logic uses src dims (ne00..ne03) to advance dst indices (i10..i13). This misindexes when src/dst shapes differ (but have equal nelements), causing incorrect writes.

Fix by advancing with dst dims:

-                i10 += ne00 * ir0;
+                i10 += ne0 * ir0;
...
                         memcpy(dst_ptr, src0_ptr, sizeof(dst_t));
-                        if (++i10 == ne00) {
+                        if (++i10 == ne0) {
                             i10 = 0;
-                            if (++i11 == ne01) {
+                            if (++i11 == ne1) {
                                 i11 = 0;
-                                if (++i12 == ne02) {
+                                if (++i12 == ne2) {
                                     i12 = 0;
-                                    if (++i13 == ne03) {
+                                    if (++i13 == ne3) {
                                         i13 = 0;
                                     }
                                 }
                             }
                         }
...
-                i10 += ne00 * (ne01 - ir1);
+                i10 += ne0 * (ne1 - ir1);
🧹 Nitpick comments (4)
ggml/src/ggml-cpu/ops.cpp (4)

86-149: Minor: prefer ggml_row_size for row byte count and drop leftover debug

  • Use ggml_row_size(dst->type, ne00) instead of nb0 * (ne00 / blck_size) patterns elsewhere for consistency and future-proofing.
  • Remove the commented printf.

Apply:

-            //printf("%s: this is not optimal - fix me\n", __func__);
+            // keep hot path clean; rely on logging infra if needed

267-321: Safer row sizing and preconditions for float→quantized path

  • Compute row size with ggml_row_size; assert block alignment up front.
  • Consider asserting params->wsize capacity for per‑thread buffers.
-        size_t rs = nb0 * (ne00 / ggml_blck_size(dst->type));
+        GGML_ASSERT(ne00 % ggml_blck_size(dst->type) == 0);
+        const size_t rs = ggml_row_size(dst->type, ne00);

Optional (guard scratch):

+        GGML_ASSERT(params->wsize >= (size_t)((ne00 + CACHE_LINE_SIZE_F32) * params->nth * sizeof(float)));

473-522: Add defensive check for dequantize func and clarify assumptions

If a non-quantized type ever reaches here by mistake, dequantize_row_q could be null; add an assert.

-    ggml_to_float_t const dequantize_row_q = ggml_get_type_traits(type)->to_float;
+    ggml_to_float_t const dequantize_row_q = ggml_get_type_traits(type)->to_float;
+    GGML_ASSERT(dequantize_row_q != NULL);

Also, consider documenting that src1 provides logical indices shaping nr = nelements(src1)/qk.


560-562: I32 duplication limited to I32→F32 only

This is fine if intentional; add a comment/TODO for other directions to prevent future confusion.

-                else GGML_ABORT("not implemented");
+                else GGML_ABORT("I32 dup: only to F32 is supported currently");
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06dda63 and a31ba15.

📒 Files selected for processing (1)
  • ggml/src/ggml-cpu/ops.cpp (12 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
ggml/src/ggml-cpu/ops.cpp (2)
ggml/src/ggml.c (4)
  • ggml_nelements (1187-1191)
  • ggml_is_quantized (1249-1251)
  • ggml_is_contiguous (1382-1384)
  • ggml_blck_size (1228-1230)
ggml/src/ggml-cpu/ggml-cpu.c (1)
  • ggml_get_type_traits_cpu (381-383)
🔇 Additional comments (4)
ggml/src/ggml-cpu/ops.cpp (4)

67-85: Row-copy fast path looks good

Same-type, row-sized memcpy path is straightforward and correct.


210-264: Casting branch indexing is correct

This branch advances with dst dims (ne0..ne3); no changes needed.


538-557: Dispatch coverage: confirm intended unsupported casts and add TODOs

  • F16/BF16/I32 sources only support the listed targets; others abort. If previous code supported more combinations (e.g., q→F16/BF16 or F16→I32), call sites must avoid them or we should add fallbacks.

Please verify downstream graphs never require:

  • quantized → F16/BF16
  • F16/BF16 → I32
  • I32 → non‑F32

If needed, I can sketch minimal bridges via F32.


566-567: Good: quantized→F32 routes through a single dequant path

Centralizing into ggml_compute_forward_dup_from_q reduces duplication.

@ngxson ngxson changed the base branch from master to xsn/refactor_cpu_dup_op September 18, 2025 01:45
@ngxson ngxson merged commit 59ca217 into ngxson:xsn/refactor_cpu_dup_op Sep 18, 2025
46 of 49 checks passed