Skip to content

Conversation

@Ka-zam
Copy link
Contributor

@Ka-zam Ka-zam commented Dec 26, 2025

New saturated sum kernels

Copy link
Member

@marcusmueller marcusmueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be named saturated_add (or add_saturated) to keep in style with the existing add-named kernels (we use sum only for "complicated" polynomial sums, not element-wise stuff)

@Ka-zam Ka-zam force-pushed the saturated_sum_kernels branch 3 times, most recently from 73647f0 to 0d71ec1 Compare December 26, 2025 20:05
@Ka-zam
Copy link
Contributor Author

Ka-zam commented Dec 26, 2025

#807 handles this log2 failure

Copy link
Contributor

@jdemel jdemel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I'm impressed by how many kernels you implement. Thanks a lot.

@jdemel
Copy link
Contributor

jdemel commented Jan 15, 2026

Can you rebase this PR on latest main? The error should be fixed but I'd like to make sure before I merge anything. Thanks!

Signed-off-by: Magnus Lundmark <[email protected]>
@Ka-zam Ka-zam force-pushed the saturated_sum_kernels branch from 0d71ec1 to 8487ce2 Compare January 16, 2026 08:37
@Ka-zam
Copy link
Contributor Author

Ka-zam commented Jan 16, 2026

LGTM

I'm impressed by how many kernels you implement. Thanks a lot.

We should be in a better place now! 😄

@jdemel
Copy link
Contributor

jdemel commented Jan 16, 2026

There's still @marcusmueller comment on naming.
To stick with our naming convention and thus go the way of least surprise, the kernels should be named add_saturated. We use the term sum like e.g. NumPy does when all elements in a vector get summed up.

Signed-off-by: Magnus Lundmark <[email protected]>
@Ka-zam
Copy link
Contributor Author

Ka-zam commented Jan 17, 2026

There's still @marcusmueller comment on naming. To stick with our naming convention and thus go the way of least surprise, the kernels should be named add_saturated. We use the term sum like e.g. NumPy does when all elements in a vector get summed up.

Oops, forgot to remove the unused kernels. Should be good now.

$ volk_profile -R satu

RUN_VOLK_TESTS: volk_8i_x2_add_saturated_8i(vlen=131071, iter=1987, tol=1e-06)
arch                       |         time |     throughput |    max_abs |
---------------------------+--------------+----------------+------------+
generic                    |    7.6207 ms |  102529.3 MB/s |          - |
u_sse2                     |    8.3234 ms |   93873.1 MB/s |    0.0e+00 |
u_avx2                     |    6.2392 ms |  125231.5 MB/s |    0.0e+00 | *
a_sse2                     |    7.7166 ms |  101255.1 MB/s |    0.0e+00 |
a_avx2                     |    6.2236 ms |  125545.4 MB/s |    0.0e+00 | *
Best aligned arch          | a_avx2 (1.22x)
Best unaligned arch        | u_avx2 (1.22x)
--------------------------------------------------------------------------------

RUN_VOLK_TESTS: volk_8u_x2_add_saturated_8u(vlen=131071, iter=1987, tol=1e-06)
arch                       |         time |     throughput |    max_abs |
---------------------------+--------------+----------------+------------+
generic                    |    6.2801 ms |  124415.0 MB/s |          - |
u_sse2                     |    6.0746 ms |  128624.6 MB/s |    0.0e+00 | *
u_avx2                     |    6.1945 ms |  126134.9 MB/s |    0.0e+00 |
a_sse2                     |    5.9526 ms |  131260.8 MB/s |    0.0e+00 |
a_avx2                     |    5.9446 ms |  131437.5 MB/s |    0.0e+00 | *
Best aligned arch          | a_avx2 (1.06x)
Best unaligned arch        | u_sse2 (1.03x)
--------------------------------------------------------------------------------

RUN_VOLK_TESTS: volk_16i_x2_add_saturated_16i(vlen=131071, iter=1987, tol=1e-06)
arch                       |         time |     throughput |    max_abs |
---------------------------+--------------+----------------+------------+
generic                    |   13.6151 ms |  114775.7 MB/s |          - | *
u_sse2                     |   16.0701 ms |   97241.9 MB/s |    0.0e+00 |
u_avx2                     |   13.9325 ms |  112161.2 MB/s |    0.0e+00 |
a_sse2                     |   16.0001 ms |   97667.6 MB/s |    0.0e+00 |
a_avx2                     |   14.4234 ms |  108344.3 MB/s |    0.0e+00 |
Best aligned arch          | generic
Best unaligned arch        | generic
--------------------------------------------------------------------------------

RUN_VOLK_TESTS: volk_16u_x2_add_saturated_16u(vlen=131071, iter=1987, tol=1e-06)
arch                       |         time |     throughput |    max_abs |
---------------------------+--------------+----------------+------------+
generic                    |   14.0463 ms |  111253.0 MB/s |          - |
u_sse2                     |   16.9538 ms |   92173.2 MB/s |    0.0e+00 |
u_avx2                     |   13.7991 ms |  113246.0 MB/s |    0.0e+00 | *
a_sse2                     |   13.6123 ms |  114800.1 MB/s |    0.0e+00 | *
a_avx2                     |   13.9583 ms |  111954.2 MB/s |    0.0e+00 |
Best aligned arch          | a_sse2 (1.03x)
Best unaligned arch        | u_avx2 (1.02x)
--------------------------------------------------------------------------------

Session summary (4 kernels):
  Average speedup vs generic: 1.08x
  Max speedup vs generic: 1.22x (volk_8i_x2_add_saturated_8i)

@marcusmueller marcusmueller self-requested a review January 17, 2026 12:59
@jdemel jdemel merged commit f921334 into gnuradio:main Jan 18, 2026
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants