@copybara-service copybara-service bot commented Jan 7, 2026

[XLA:GPU] Support Numpy-order argsort through CUB via key packing

FR: #35587

Adds support for CUB-accelerated argsort with NumPy order (NaNs sort last) for F16, BF16, and F32 keys with S16 or S32 indices.

This is implemented by packing the key (converted to an order-preserving unsigned integer) and the index into a single U32 or U64 payload. This allows us to use the standard fast CUB radix sort on the packed pairs.
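To illustrate the packing idea, here is a minimal pure-Python sketch for the F32 key / S32 index case packed into a 64-bit integer. It is not the XLA implementation: the function names are hypothetical, Python's built-in sort stands in for the CUB radix sort, and the NaN and `-0.0` canonicalization steps are assumptions about how NumPy order (NaNs last, ties broken by index) could be obtained.

```python
import math
import struct


def f32_to_ordered_u32(x: float) -> int:
    """Map a float32 value to a uint32 whose unsigned order matches float order.

    Hypothetical helper. Assumptions (not taken from the PR): every NaN is
    canonicalized to one positive quiet NaN so it sorts after +inf, and -0.0
    is canonicalized to +0.0 so that equal zeros tie-break on the index.
    """
    if math.isnan(x):
        bits = 0x7FC00000  # canonical positive quiet NaN
    else:
        if x == 0.0:
            x = 0.0  # collapses -0.0 into +0.0
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if bits & 0x80000000:
        # Negative value: flip all bits so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFF
    # Non-negative value: set the sign bit so it sorts above all negatives.
    return bits | 0x80000000


def argsort_numpy_order(values):
    """Argsort by sorting packed (key << 32 | index) 64-bit integers."""
    packed = [(f32_to_ordered_u32(v) << 32) | i for i, v in enumerate(values)]
    packed.sort()  # stand-in for a key-only radix sort over the packed words
    return [p & 0xFFFFFFFF for p in packed]  # low 32 bits hold the index
```

Because the index sits in the low bits, equal keys are ordered by their original position, so a key-only sort of the packed words behaves like a stable argsort without carrying a separate value array.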

Microbenchmark:

```
Device: NVIDIA_H100_80GB_HBM3
                                         Speedups           Clean           Dirty
name
argsort_numpy_order_1024_f32                1.00x          9.8 us          9.8 us
argsort_numpy_order_1048576_f64             1.00x        564.7 us        565.1 us
argsort_numpy_order_25690112_f64            1.00x      22826.0 us      22835.3 us
argsort_numpy_order_1024_f64                1.00x         13.4 us         13.4 us
argsort_numpy_order_1024_bf16               1.29x          9.6 us          7.5 us
argsort_numpy_order_1024_f16                1.44x         11.1 us          7.7 us
argsort_numpy_order_1048576_f32             3.32x        388.9 us        117.2 us
argsort_numpy_order_1048576_bf16            5.24x        337.9 us         64.5 us
argsort_numpy_order_1048576_f16             5.79x        378.0 us         65.3 us
argsort_numpy_order_25690112_f32            8.63x      15299.5 us       1772.2 us
argsort_numpy_order_25690112_bf16          12.74x      12155.9 us        954.1 us
argsort_numpy_order_25690112_f16           13.67x      13248.3 us        969.0 us
```

@copybara-service copybara-service bot force-pushed the test_853108960 branch 2 times, most recently from c0b0dea to 9b937a6 Compare January 7, 2026 13:35
@copybara-service copybara-service bot changed the title from "[XLA:GPU] cub sort floating point argsort" to "[XLA:GPU] Support Numpy-order argsort through CUB via key packing" Jan 7, 2026
@copybara-service copybara-service bot force-pushed the test_853108960 branch 3 times, most recently from d560b4e to b2fd8d0 Compare January 12, 2026 07:11
PiperOrigin-RevId: 855076701
@copybara-service copybara-service bot merged commit 5925e1e into main Jan 12, 2026
@copybara-service copybara-service bot deleted the test_853108960 branch January 12, 2026 07:52