[XLA:GPU] Support Numpy-order argsort through CUB via key packing #36043

copybara-service · 2026-01-07T11:01:59Z

Adds support for CUB-accelerated argsort with Numpy order (NaNs last) for F16, BF16, and F32 keys with S16 or S32 indices.
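"Numpy order" differs from sorting raw bit patterns in two places. NumPy places every NaN after +inf regardless of its sign bit, and it treats -0.0 and +0.0 as equal. The following minimal Python sketch builds an order-preserving 16-bit key for BF16 under those assumed semantics. `bf16_bits`, `to_ordered_u16`, and `bf16_numpy_order_key` are hypothetical helpers, not XLA code. Canonicalizing NaNs and zeros before the bit transform is one plausible way to get this order, not necessarily what the PR does.

```python
import math
import struct

def bf16_bits(x: float) -> int:
    # BF16 is the upper 16 bits of an IEEE-754 binary32 encoding.
    # Assumes x is exactly representable in BF16 (otherwise this truncates).
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def to_ordered_u16(bits: int) -> int:
    # Plain order-preserving map: flip every bit of negatives, set the sign bit
    # of non-negatives. Unsigned order then matches float order for non-NaN
    # values, except that -0.0 sorts strictly before +0.0.
    return (~bits & 0xFFFF) if bits & 0x8000 else (bits | 0x8000)

def bf16_numpy_order_key(x: float) -> int:
    # Assumed NumPy order on top of the plain map: all NaNs last, -0.0 ties +0.0.
    if math.isnan(x):
        return to_ordered_u16(0x7FC0)   # one canonical positive quiet NaN
    if x == 0.0:
        return to_ordered_u16(0x0000)   # canonicalize -0.0 to +0.0
    return to_ordered_u16(bf16_bits(x))

NEG_NAN, NEG_INF, NEG_ZERO, POS_ZERO = 0xFFC0, 0xFF80, 0x8000, 0x0000
# Without canonicalization, a NaN with the sign bit set sorts before -inf,
# and -0.0 sorts strictly before +0.0:
print(to_ordered_u16(NEG_NAN) < to_ordered_u16(NEG_INF))    # True
print(to_ordered_u16(NEG_ZERO) < to_ordered_u16(POS_ZERO))  # True

vals = [1.0, -0.0, 0.0, -math.inf, math.nan, -2.0]
print(sorted(range(len(vals)), key=lambda i: bf16_numpy_order_key(vals[i])))
# [3, 5, 1, 2, 0, 4]
```

The same construction applies to F16 and F32 with 16- and 32-bit masks. Only the canonical NaN pattern and the sign-bit position change.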

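This is implemented by packing the key, converted to an order-preserving unsigned integer, together with the index into a single U32 or U64 payload. A 16-bit key (F16 or BF16) with an S16 index fits in 32 bits; every other combination needs 64. This allows us to use the standard, fast CUB radix sort on the packed values.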
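Here is a minimal Python sketch of the packing idea. It assumes the key occupies the high bits and the index the low bits. Sorting the packed integers then orders by key first and breaks ties by index, which yields a stable argsort. Python's built-in sort stands in for CUB's radix sort. `numpy_order_key` and `packed_argsort` are hypothetical names, not XLA functions, and inputs are assumed to be exactly representable in the key type. BF16 is left out only because Python's `struct` module has no BF16 format.

```python
import math
import struct

CANONICAL_NAN = {"e": 0x7E00, "f": 0x7FC0_0000}  # positive quiet NaN bit patterns
UINT_FMT = {"e": "<H", "f": "<I"}                # same-width unsigned formats

def numpy_order_key(x: float, fmt: str) -> int:
    """Hypothetical: order-preserving unsigned key for F16 ('e') or F32 ('f'),
    assuming NumPy order (all NaNs last, -0.0 equal to +0.0)."""
    width = struct.calcsize(fmt) * 8             # 16 for 'e', 32 for 'f'
    sign = 1 << (width - 1)
    mask = (1 << width) - 1
    if math.isnan(x):
        bits = CANONICAL_NAN[fmt]                # every NaN maps to one key, above +inf
    else:
        bits = struct.unpack(UINT_FMT[fmt], struct.pack("<" + fmt, x))[0]
        if (bits & (mask ^ sign)) == 0:
            bits = 0                             # -0.0 -> +0.0
    # Negative: flip every bit. Non-negative: set the sign bit.
    return (~bits & mask) if bits & sign else (bits | sign)

def packed_argsort(values, fmt="f", index_bits=32):
    """Hypothetical: argsort by sorting (key << index_bits) | index integers."""
    key_bits = struct.calcsize(fmt) * 8
    # A 16-bit key plus a 16-bit index fits a U32 payload; anything wider needs U64.
    payload_bits = 32 if key_bits + index_bits <= 32 else 64
    if len(values) > 1 << (index_bits - 1):
        raise ValueError("indices do not fit the signed index type")
    packed = [(numpy_order_key(v, fmt) << index_bits) | i
              for i, v in enumerate(values)]
    assert all(p < 1 << payload_bits for p in packed)
    packed.sort()                                # stand-in for a CUB radix sort
    index_mask = (1 << index_bits) - 1
    return [p & index_mask for p in packed]      # recover indices from low bits

vals = [3.0, -1.5, math.nan, -0.0, 0.0, -math.inf, 2.0, -math.nan, 0.5]
print(packed_argsort(vals, "f", 32))   # F32 key + S32 index -> U64 payload
print(packed_argsort(vals, "e", 16))   # F16 key + S16 index -> U32 payload
# Both print [5, 1, 3, 4, 8, 6, 0, 2, 7]
```

Both NaNs land at the end in index order, and -0.0 and +0.0 keep their original relative order. This matches what a stable NumPy-style argsort of these values gives.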

Microbenchmark (Clean = baseline, Dirty = with this change; Speedup = Clean / Dirty):

Device: NVIDIA_H100_80GB_HBM3
name                               Speedup             Clean             Dirty
argsort_numpy_order_1024_f32         1.00x            9.8 us            9.8 us
argsort_numpy_order_1048576_f64      1.00x          564.7 us          565.1 us
argsort_numpy_order_25690112_f64     1.00x        22826.0 us        22835.3 us
argsort_numpy_order_1024_f64         1.00x           13.4 us           13.4 us
argsort_numpy_order_1024_bf16        1.29x            9.6 us            7.5 us
argsort_numpy_order_1024_f16         1.44x           11.1 us            7.7 us
argsort_numpy_order_1048576_f32      3.32x          388.9 us          117.2 us
argsort_numpy_order_1048576_bf16     5.24x          337.9 us           64.5 us
argsort_numpy_order_1048576_f16      5.79x          378.0 us           65.3 us
argsort_numpy_order_25690112_f32     8.63x        15299.5 us         1772.2 us
argsort_numpy_order_25690112_bf16   12.74x        12155.9 us          954.1 us
argsort_numpy_order_25690112_f16    13.67x        13248.3 us          969.0 us

F64 keys are not covered by this change and show no speedup.

FR: #35587

PiperOrigin-RevId: 855076701

copybara-service bot assigned thcmbs Jan 7, 2026

copybara-service bot force-pushed the test_853108960 branch 2 times, most recently from c0b0dea to 9b937a6 on January 7, 2026 at 13:35

copybara-service bot changed the title from "[XLA:GPU] cub sort floating point argsort" to "[XLA:GPU] Support Numpy-order argsort through CUB via key packing" on Jan 7, 2026

copybara-service bot force-pushed the test_853108960 branch 3 times, most recently from d560b4e to b2fd8d0 on January 12, 2026 at 07:11

copybara-service bot force-pushed the test_853108960 branch from b2fd8d0 to 5925e1e on January 12, 2026 at 07:52

copybara-service bot merged commit 5925e1e into main Jan 12, 2026

copybara-service bot deleted the test_853108960 branch January 12, 2026 07:52
