Skip to content

Conversation

@f14XuanLv
Copy link

@f14XuanLv f14XuanLv commented Dec 3, 2025

Checklist

  • The title and commit message(s) are descriptive.
  • Small commits made to fix your PR have been squashed to avoid history pollution.
  • Tests have been added for new features or bug fixes.
  • API of new functions and classes are documented.

Description

This PR extends the xt::roll function to support multiple axes simultaneously, completing the NumPy-compatible API as originally requested.

Related: #1766, #1823

The original implementation in #1823 only supported single-axis roll operations. According to NumPy's roll signature, both shift and axis parameters can be tuples of ints:

numpy.roll(a, shift, axis=None)
# shift : int or tuple of ints
# axis : int or tuple of ints, optional

This PR adds the missing multi-axis support:

xt::xarray<int> a = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
auto result = xt::roll(a, {1, 2}, {0, 1});  // roll 1 on axis 0, roll 2 on axis 1

Commits

1. feat: add multi-axis support for xt::roll with optimized pointer arithmetic

  • Implemented detail::roll_multi() using recursive pointer arithmetic algorithm extended from the existing single-axis implementation
  • Added two new xt::roll() overloads: one for containers (std::vector, std::array) and one for C-style arrays
  • Supports accumulated shifts when the same axis appears multiple times (NumPy-compatible behavior)
  • Supports negative axis indices
  • Added comprehensive unit tests covering various scenarios (2D/3D, negative shifts, different container types, column-major layout, etc.)

2. docs: add multi-axis roll examples to documentation

  • Updated docs/source/numpy.rst with NumPy-to-xtensor comparison for multi-axis roll
  • Updated docs/source/quickref/manipulation.rst with usage examples

3. bench: add benchmark for multi-axis roll

  • Added benchmark/benchmark_roll.cpp with comprehensive benchmarks comparing sequential single-axis rolls vs. the new multi-axis implementation
  • Benchmarks cover 2D to 5D tensors, various sizes, and realistic image processing scenarios (RGB 1080p to 8K)

Build & Test

# Run unit tests
mkdir build && cd build
cmake .. -DBUILD_TESTS=ON -DDOWNLOAD_GTEST=ON
make -j8 test_xmanipulation
./test/test_xmanipulation --test-case="*roll*" -s
# Run benchmarks
cd benchmark
mkdir build && cd build
cmake .. -DDOWNLOAD_GBENCHMARK=ON -Dxtensor_DIR=$(realpath ../../build)
make -j8 benchmark_xtensor
./benchmark_xtensor --benchmark_filter=roll
📊 Benchmark Results (click to expand)

Test Environment: Intel Core i7-12700H, 16GB RAM, Ubuntu 24.04.3 LTS

Run on (20 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)

---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
detail::roll_3d_3axes_sequential/128/r0.01       11735715 ns     11731715 ns           51 bytes_per_second=1.33Gi/s
detail::roll_3d_3axes_multi/128/r0.01             1184996 ns      1184922 ns          574 bytes_per_second=13.19Gi/s
detail::roll_3d_3axes_sequential/128/r0.3        12288778 ns     12288030 ns           56 bytes_per_second=1.27Gi/s
detail::roll_3d_3axes_multi/128/r0.3              1267342 ns      1267286 ns          548 bytes_per_second=12.33Gi/s
detail::roll_2d_sequential/64x64                     1841 ns         1841 ns       382473 bytes_per_second=16.57Gi/s
detail::roll_2d_multi/64x64                           854 ns          854 ns       821232 bytes_per_second=35.74Gi/s
detail::roll_2d_sequential/256x256                  30787 ns        30784 ns        22481 bytes_per_second=15.86Gi/s
detail::roll_2d_multi/256x256                       12725 ns        12724 ns        53445 bytes_per_second=38.37Gi/s
detail::roll_2d_sequential/1024x1024               963154 ns       963069 ns          736 bytes_per_second=8.11Gi/s
detail::roll_2d_multi/1024x1024                    419272 ns       419185 ns         1857 bytes_per_second=18.64Gi/s
detail::roll_3d_2axes_sequential/128x128x128     10321186 ns     10321178 ns           66 bytes_per_second=1.51Gi/s
detail::roll_3d_2axes_multi/128x128x128           1200420 ns      1200325 ns          561 bytes_per_second=13.02Gi/s
detail::roll_4d_4axes_sequential/16                 95978 ns        95957 ns         7380 bytes_per_second=5.09Gi/s
detail::roll_4d_4axes_multi/16                      26161 ns        26162 ns        26964 bytes_per_second=18.66Gi/s
detail::roll_4d_4axes_sequential/32              10419965 ns     10419994 ns           67 bytes_per_second=767.76Mi/s
detail::roll_4d_4axes_multi/32                     422750 ns       422731 ns         1665 bytes_per_second=18.48Gi/s
detail::roll_5d_5axes_sequential/16              11791085 ns     11790205 ns           61 bytes_per_second=678.53Mi/s
detail::roll_5d_5axes_multi/16                     508597 ns       508588 ns         1388 bytes_per_second=15.36Gi/s
detail::roll_5d_5axes_sequential/32             482957914 ns    482920008 ns            2 bytes_per_second=530.11Mi/s
detail::roll_5d_5axes_multi/32                  100002192 ns     99992129 ns            7 bytes_per_second=2.50Gi/s
detail::roll_3d_2axes_sequential/rgb_1080p       36946640 ns     36943762 ns           19 bytes_per_second=1.25Gi/s
detail::roll_3d_2axes_multi/rgb_1080p            17183395 ns     17182811 ns           39 bytes_per_second=2.70Gi/s
detail::roll_3d_2axes_sequential/rgb_4K         158555547 ns    158541395 ns            4 bytes_per_second=1.17Gi/s
detail::roll_3d_2axes_multi/rgb_4K               75991080 ns     75982624 ns            9 bytes_per_second=2.44Gi/s
detail::roll_3d_2axes_sequential/rgb_8K         606234407 ns    606221933 ns            1 bytes_per_second=1.22Gi/s
detail::roll_3d_2axes_multi/rgb_8K              294193380 ns    294178950 ns            2 bytes_per_second=2.52Gi/s

Performance Summary

The multi-axis implementation achieves significant speedups over sequential single-axis roll calls:

Scenario Speedup
2D tensors ~2-3x
3D tensors (2-3 axes) ~2-10x
4D tensors (4 axes) ~4-25x
5D tensors (5 axes) ~5-20x
RGB images (1080p-8K) ~2x

The performance gain increases with the number of axes being rolled, as the multi-axis version avoids creating intermediate temporary arrays and performs the roll operation in a single pass through the data.

@f14XuanLv f14XuanLv changed the title Feature/roll multi axis feat: add multi-axis support for xt::roll Dec 3, 2025
@f14XuanLv f14XuanLv changed the title feat: add multi-axis support for xt::roll feat: add multi-axis support for xt::roll (numpy.roll parity) Dec 3, 2025
@f14XuanLv f14XuanLv force-pushed the feature/roll-multi-axis branch from 43edec3 to 286aae0 Compare December 3, 2025 13:38
@f14XuanLv f14XuanLv marked this pull request as draft December 5, 2025 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant