ggml : refactor forward_dup for cpu backend #30
Conversation
Walkthrough

Adds int32_t↔float conversion helpers and a type_conversion_table<int32_t> specialization in common.h, and refactors the CPU forward_dup paths in ops.cpp into templated dup_flt / dup_to_q / dup_from_q kernels behind a single dispatcher.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant Dispatcher as ggml_compute_forward_dup
    participant DupFlt as dup_flt<src,dst>
    participant DupToQ as dup_to_q<src>
    participant DupFromQ as dup_from_q
    participant TCT as type_conversion_table<>
    Caller->>Dispatcher: request dup(src_tensor, dst_tensor)
    alt src is non-quantized and dst is non-quantized
        Dispatcher->>DupFlt: call dup_flt<src,dst>
        DupFlt->>TCT: use to_f32(src) / from_f32(dst) per element
        TCT-->>DupFlt: converted values
        DupFlt-->>Dispatcher: done
    else src is non-quantized and dst is quantized
        Dispatcher->>DupToQ: call dup_to_q<src>
        DupToQ->>TCT: to_f32(src) then quantize
        DupToQ-->>Dispatcher: done
    else src is quantized and dst is float
        Dispatcher->>DupFromQ: call dup_from_q
        DupFromQ-->>Dispatcher: dequantized values
    else
        Dispatcher-->>Caller: abort / not implemented
    end
    Dispatcher-->>Caller: return result
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
ggml/src/ggml-cpu/ops.cpp (3)
10-12: Missing header for std::is_same_v

This TU uses std::is_same_v but doesn't include <type_traits>. Add it to avoid build breaks on some compilers.

Apply this diff:

```diff
 #include <float.h>
 #include <algorithm>
+#include <type_traits>
```
157-209: Bug: wrong wrap condition uses ne00 instead of ne0

In the same‑type, non‑contiguous branch you advance the dst logical X counter with ne00 (src row length) instead of ne0 (dst). This will mis-wrap when src/dst shapes differ but element‑wise copy is intended.

Apply this diff:

```diff
-                if (++i10 == ne00) {
+                if (++i10 == ne0) {
```
267-324: to‑Q path: only works for contiguous dst and contiguous dim0 src; add guards/fallback

- Only the nb00 == sizeof(src_t) + ggml_is_contiguous(dst) path is implemented; other cases abort. At minimum, assert block alignment and document the limitation.
- Consider adding a gather path similar to dup_flt's non‑contig case to avoid the hard abort for common views.

Apply these minimal safety checks:

```diff
 if (ggml_get_type_traits_cpu(dst->type)->from_float) {
     // casting non-quantized types --> intermediate f32 --> quantized
     ggml_from_float_t const quantize_row_q = ggml_get_type_traits_cpu(dst->type)->from_float;
     float * src0_f32 = (float *) params->wdata + (ne00 + CACHE_LINE_SIZE_F32) * ith;
+    GGML_ASSERT(ne00 % ggml_blck_size(dst->type) == 0);
```
If you want, I can draft a small gather loop for the nb00 != sizeof(src_t) case mirroring Lines 129–147 in dup_flt.
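For illustration, a minimal sketch of that gather loop, assuming dup_to_q's existing locals (ne00, the nbXX strides, the per-thread src0_f32 scratch, quantize_row_q, and an enclosing i01/i02/i03 row loop); dst_row and the type_conversion_table accessor syntax are assumptions, not the actual code:

```cpp
// Sketch: when src dim0 is strided (nb00 != sizeof(src_t)), gather one
// logical row into the contiguous f32 scratch buffer, then quantize it
// exactly as the contiguous path does.
for (int64_t i00 = 0; i00 < ne00; i00++) {
    const char * p = (const char *) src0->data
                   + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03;
    src0_f32[i00] = type_conversion_table<src_t>::to_f32(*(const src_t *) p);
}
quantize_row_q(src0_f32, dst_row, ne00); // dst_row: hypothetical output-row pointer
```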
🧹 Nitpick comments (5)
ggml/src/ggml-cpu/ops.cpp (5)
86-149: Contiguous cast path looks good; minor guard missing

When quantizing row-wise later you assert block-alignment; add the same explicit guard here for clarity when you rely on rs = ne00*nb00 semantics (non‑Q only).

Please confirm upstream callers never pass ne00 = 0 (empty row). If that's possible, an early return would avoid touching the id math.
322-324: Hard abort on non‑contiguous/unsupported to‑Q

Abort is fine for now, but please convert it to a GGML_ASSERT with an explanatory message, or GGML_ABORT("dup_to_q: only contiguous dst and dim0‑contig src supported"), to aid users.
476-525: from‑Q path: assumptions are fine; add explicit dtype assert

This function writes float rows. Dispatch guarantees dst F32, but adding an assert improves local safety.

Apply this diff:

```diff
 const ggml_type type = src0->type;
 ggml_to_float_t const dequantize_row_q = ggml_get_type_traits(type)->to_float;
+GGML_ASSERT(dst->type == GGML_TYPE_F32);
```
86-106: Potential micro‑perf: hoist row-stride math and remove repeated id adjustments

The id += rs*ir0 / id += rs*(ne01 - ir1) pattern is correct but can be simplified by computing a base pointer per (i03, i02) and iterating rows linearly (see the sketch below). This will slightly reduce ALU pressure in hot copies.

Also applies to: 107-126, 129-147
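A hedged sketch of that simplification, assuming the usual ggml locals (neXX/nbXX tensor locals, the ir0/ir1 row range) and the conversion-table accessors used elsewhere in this file; dst is contiguous in this branch, so a running row pointer replaces the id bookkeeping:

```cpp
// Sketch: hoist a src base pointer per (i03, i02) plane and walk dst rows
// linearly, removing the repeated id += ... adjustments from the hot loop.
for (int64_t i03 = 0; i03 < ne03; i03++) {
    for (int64_t i02 = 0; i02 < ne02; i02++) {
        const char * src_plane = (const char *) src0->data + i02*nb02 + i03*nb03;
        dst_t * dst_row = (dst_t *) dst->data
                        + ((i03*ne02 + i02)*ne01 + ir0)*ne00; // flat, contiguous dst
        for (int64_t i01 = ir0; i01 < ir1; i01++) {
            const src_t * src_row = (const src_t *)(src_plane + i01*nb01);
            for (int64_t i00 = 0; i00 < ne00; i00++) {
                dst_row[i00] = type_conversion_table<dst_t>::from_f32(
                                   type_conversion_table<src_t>::to_f32(src_row[i00]));
            }
            dst_row += ne00;
        }
    }
}
```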
49-53: Tighten non‑Q precondition

You assert both src and dst are non‑quantized. Consider guarding element sizes too (nb0 == sizeof(dst_t), nb00 == sizeof(src_t)) and falling back early to the bytes path when types/strides are equal, to avoid duplicate logic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- ggml/src/ggml-cpu/common.h (2 hunks)
- ggml/src/ggml-cpu/ops.cpp (14 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
ggml/src/ggml-cpu/ops.cpp (2)
- ggml/src/ggml.c (3): ggml_nelements (1187-1191), ggml_is_quantized (1249-1251), ggml_is_contiguous (1382-1384)
- ggml/src/ggml-cpu/ggml-cpu.c (1): ggml_get_type_traits_cpu (381-383)
🔇 Additional comments (5)
ggml/src/ggml-cpu/common.h (1)
65-70: Confirm: dup call sites use the conversion helpers — no further change required

- common.h already specializes type_conversion_table<int32_t> to use i32_to_f32 / f32_to_i32.
- ops.cpp dup paths instantiate ggml_compute_forward_dup_flt with float↔int32 (lines shown by your script), so conversions go through the table and will use the f32→i32 semantics (truncation + saturation) you mentioned.
- No other uses of type_conversion_table<int32_t> were discovered by the search you ran; the dup paths are the relevant callers that would be sensitive to rounding vs truncation.
ggml/src/ggml-cpu/ops.cpp (4)
44-85: Generic dup: solid structure; asserts are right-sized

Nice consolidation of non‑quantized paths into a single templated kernel with fast memcpy when shapes/strides align. The quantized guard at Line 52 is correct.
210-263: Non‑contig cast branch is consistent

Element-wise cast through type_conversion_table is correct and matches stride semantics. Nice.
290-317: Work buffer sizing for src0_f32

This uses per‑thread scratch of (ne00 + CACHE_LINE_SIZE_F32) floats. Ensure ggml_compute_forward's planner accounts for this in params->wsize for dup→Q; otherwise this will stomp memory under load.

I can write a quick repo scan to confirm the worksize for OP_DUP is updated in the allocator; the expected sizing is sketched below.
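For reference, the planner arm would need to reserve roughly the following (a sketch only; the exact switch location and variable names in the graph-plan code are assumptions):

```cpp
// Sketch: per-thread f32 scratch for dup/cpy into a quantized dst type.
// Must cover the (ne00 + CACHE_LINE_SIZE_F32) * ith offsets used in ops.cpp.
case GGML_OP_DUP:
case GGML_OP_CPY:
    {
        if (ggml_is_quantized(node->type)) {
            cur = sizeof(float) * (node->src[0]->ne[0] + CACHE_LINE_SIZE_F32) * n_tasks;
        }
    } break;
```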
541-565: Dispatch matrix: good coverage; I32 conversions partial

- F32→I32 and I32→F32 covered; other I32 casts intentionally unimplemented (abort). OK if documented; otherwise gate via feature flag.
- Keep the early "same‑type" fast path (Lines 533–536) — good.

Please confirm any previous BF16↔Q direct dup paths weren't used by frontends; they now route via to_q/from_q.
```cpp
static inline float i32_to_f32(int32_t x) {
    return x;
}

static inline int32_t f32_to_i32(float x) {
    return x;
}
```
Define f32↔i32 semantics; add saturation and explicit casts to prevent UB and warnings

- Implicit int→float and float→int conversions can trigger -Wconversion and yield UB for NaN/out‑of‑range floats. Make casts explicit and clamp float→int32_t.

Apply this diff:
```diff
 static inline float i32_to_f32(int32_t x) {
-    return x;
+    return static_cast<float>(x);
 }

 static inline int32_t f32_to_i32(float x) {
-    return x;
+    // Saturate and truncate toward zero to match C/C++ cast semantics
+    if (std::isnan(x)) return 0;
+    constexpr float kMin = static_cast<float>(std::numeric_limits<int32_t>::min());
+    constexpr float kMax = static_cast<float>(std::numeric_limits<int32_t>::max());
+    if (x <= kMin) return std::numeric_limits<int32_t>::min();
+    if (x >= kMax) return std::numeric_limits<int32_t>::max();
+    return static_cast<int32_t>(x);
 }
```
Add the following includes near the existing C++ includes (outside this hunk):

```cpp
#include <limits>
#include <cmath>
#include <cstdint> // if not already provided transitively
```
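A few expected values under the proposed truncate-plus-saturate semantics (these follow directly from the code above):

```cpp
// f32_to_i32( 3.9f) ==  3           // truncates toward zero
// f32_to_i32(-3.9f) == -3
// f32_to_i32( 3e9f) ==  INT32_MAX   // saturates instead of UB
// f32_to_i32(-3e9f) ==  INT32_MIN
// f32_to_i32(NAN)   ==  0           // defined result for NaN
```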
🤖 Prompt for AI Agents
In ggml/src/ggml-cpu/common.h around lines 31 to 38, the conversion helpers use
implicit conversions that cause -Wconversion warnings and undefined behavior for
NaN/out-of-range floats; update as follows: add includes <limits>, <cmath>, and
<cstdint> near the other C++ includes, change i32_to_f32 to use an explicit
static_cast<float>(x), and change f32_to_i32 to explicitly handle NaN and
saturate to int32 bounds using std::numeric_limits<int32_t>::min()/max() (e.g.,
if std::isnan(x) return 0; if x <= min return min; if x >= max return max;
otherwise return static_cast<int32_t>(x) or use a defined rounding strategy),
ensuring all casts are explicit to avoid UB and warnings.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ggml/src/ggml-cpu/ops.cpp (1)
157-209: Bug: wrong dimension counters in non‑contiguous same‑type branch

The carry logic uses src dims (ne00..ne03) to advance dst indices (i10..i13). This misindexes when src/dst shapes differ (but have equal nelements), causing incorrect writes.

Fix by advancing with dst dims:

```diff
-            i10 += ne00 * ir0;
+            i10 += ne0 * ir0;
 ...
             memcpy(dst_ptr, src0_ptr, sizeof(dst_t));
-            if (++i10 == ne00) {
+            if (++i10 == ne0) {
                 i10 = 0;
-                if (++i11 == ne01) {
+                if (++i11 == ne1) {
                     i11 = 0;
-                    if (++i12 == ne02) {
+                    if (++i12 == ne2) {
                         i12 = 0;
-                        if (++i13 == ne03) {
+                        if (++i13 == ne3) {
                             i13 = 0;
                         }
                     }
                 }
             }
 ...
-            i10 += ne00 * (ne01 - ir1);
+            i10 += ne0 * (ne1 - ir1);
```
🧹 Nitpick comments (4)
ggml/src/ggml-cpu/ops.cpp (4)
86-149: Minor: prefer ggml_row_size for row byte count and drop leftover debug

- Use ggml_row_size(dst->type, ne00) instead of nb0 * (ne00 / blck_size) patterns elsewhere for consistency and future-proofing.
- Remove the commented printf.

Apply:

```diff
-    //printf("%s: this is not optimal - fix me\n", __func__);
+    // keep hot path clean; rely on logging infra if needed
```
267-321: Safer row sizing and preconditions for float→quantized path

- Compute row size with ggml_row_size; assert block alignment up front.
- Consider asserting params->wsize capacity for per‑thread buffers.

```diff
-    size_t rs = nb0 * (ne00 / ggml_blck_size(dst->type));
+    GGML_ASSERT(ne00 % ggml_blck_size(dst->type) == 0);
+    const size_t rs = ggml_row_size(dst->type, ne00);
```

Optional (guard scratch):

```diff
+    GGML_ASSERT(params->wsize >= (size_t)((ne00 + CACHE_LINE_SIZE_F32) * params->nth * sizeof(float)));
```
473-522: Add defensive check for dequantize func and clarify assumptions

If a non-quantized type ever reaches here by mistake, dequantize_row_q could be null; add an assert.

```diff
 ggml_to_float_t const dequantize_row_q = ggml_get_type_traits(type)->to_float;
+GGML_ASSERT(dequantize_row_q != NULL);
```

Also, consider documenting that src1 provides logical indices shaping nr = nelements(src1)/qk.
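That documentation could be as small as a header comment; a sketch restating the assumptions above (adjust to the actual parameter roles):

```cpp
// ggml_compute_forward_dup_from_q: dequantize src0 row-by-row into an F32 dst.
//  - rows must be a multiple of qk = ggml_blck_size(src0->type)
//  - src1 provides the logical indices; nr = ggml_nelements(src1) / qk
//  - dst->type must be GGML_TYPE_F32 (guaranteed by dispatch)
```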
560-562: I32 duplication limited to I32→F32 only

This is fine if intentional; add a comment/TODO for other directions to prevent future confusion.

```diff
-    else GGML_ABORT("not implemented");
+    else GGML_ABORT("I32 dup: only to F32 is supported currently");
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- ggml/src/ggml-cpu/ops.cpp (12 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
ggml/src/ggml-cpu/ops.cpp (2)
- ggml/src/ggml.c (4): ggml_nelements (1187-1191), ggml_is_quantized (1249-1251), ggml_is_contiguous (1382-1384), ggml_blck_size (1228-1230)
- ggml/src/ggml-cpu/ggml-cpu.c (1): ggml_get_type_traits_cpu (381-383)
🔇 Additional comments (4)
ggml/src/ggml-cpu/ops.cpp (4)
67-85: Row-copy fast path looks good

Same-type, row-sized memcpy path is straightforward and correct.
210-264: Casting branch indexing is correct

This branch advances with dst dims (ne0..ne3); no changes needed.
538-557: Dispatch coverage: confirm intended unsupported casts and add TODOs

F16/BF16/I32 sources only support the listed targets; others abort. If previous code supported more combinations (e.g., q→F16/BF16 or F16→I32), call sites must avoid them, or we should add fallbacks.

Please verify downstream graphs never require:

- quantized → F16/BF16
- F16/BF16 → I32
- I32 → non‑F32

If needed, I can sketch minimal bridges via F32 (see the sketch below).
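One such bridge, as a hedged sketch: since dup_flt already converts element-wise through type_conversion_table, an F16→I32 arm could reuse it directly. The template arguments mirror the existing float↔int32 instantiations, and the surrounding switch shape is an assumption:

```cpp
// Sketch: bridge F16 -> I32 via the generic float kernel instead of aborting;
// values pass through f32 internally, so semantics match F32 -> I32.
case GGML_TYPE_F16:
    switch (dst->type) {
        // ... existing targets ...
        case GGML_TYPE_I32:
            ggml_compute_forward_dup_flt<ggml_fp16_t, int32_t>(params, dst);
            break;
        default:
            GGML_ABORT("not implemented");
    }
    break;
```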
566-567: Good: quantized→F32 routes through a single dequant path

Centralizing into ggml_compute_forward_dup_from_q reduces duplication.
Merged commit 59ca217 into ngxson:xsn/refactor_cpu_dup_op.
Summary by CodeRabbit

New Features

- Tensor duplication on the CPU backend now supports int32↔float conversions.

Refactor

- The forward_dup implementation is consolidated into templated dup_flt / dup_to_q / dup_from_q kernels behind a single dispatcher.