diff --git a/CHANGELOG.md b/CHANGELOG.md
index 9e0d9feac9..97c06affac 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 * Improved performance of copy-and-cast operations from `numpy.ndarray` to `tensor.usm_ndarray` for contiguous inputs [gh-1829](https://github.com/IntelPython/dpctl/pull/1829)
 * Improved performance of copying operation to C-/F-contig array, with optimization for batch of square matrices [gh-1850](https://github.com/IntelPython/dpctl/pull/1850)
 * Improved performance of `tensor.argsort` function for all types [gh-1859](https://github.com/IntelPython/dpctl/pull/1859)
+* Improved performance of `tensor.sort` and `tensor.argsort` for short arrays in the range [16, 64] elements [gh-1866](https://github.com/IntelPython/dpctl/pull/1866)

 ### Fixed

diff --git a/dpctl/tensor/libtensor/include/kernels/sorting/sort.hpp b/dpctl/tensor/libtensor/include/kernels/sorting/sort.hpp
index 28db00facd..1432139bf3 100644
--- a/dpctl/tensor/libtensor/include/kernels/sorting/sort.hpp
+++ b/dpctl/tensor/libtensor/include/kernels/sorting/sort.hpp
@@ -734,7 +734,9 @@ sycl::event stable_sort_axis1_contig_impl(

     auto comp = Comp{};

-    constexpr size_t sequential_sorting_threshold = 64;
+    // constant chosen experimentally to ensure monotonicity of
+    // sorting performance, as measured on GPU Max, and Iris Xe
+    constexpr size_t sequential_sorting_threshold = 16;

     if (sort_nelems < sequential_sorting_threshold) {
         // equal work-item sorts entire row
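
For context, the change lowers `sequential_sorting_threshold` from 64 to 16, so only rows shorter than 16 elements are sorted entirely by a single work-item, while rows of 16 elements or more fall through to the kernel's heavier sorting path. The standalone C++ sketch below illustrates the kind of threshold-based dispatch this constant controls; it uses hypothetical names, plain C++ instead of SYCL, and `std::stable_sort` standing in for the parallel merge sort, so it is not dpctl's actual kernel code.

```cpp
// Hypothetical sketch of threshold-based dispatch when sorting each row of a
// contiguous (n_rows x row_len) batch; not dpctl's actual kernel code.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// Mirrors the constant changed in the diff: rows shorter than this are
// sorted sequentially (by a single work-item in the real kernel).
constexpr std::size_t sequential_sorting_threshold = 16;

// Simple stable insertion sort, cheap for very short rows.
template <typename T, typename Comp>
void insertion_sort(T *first, T *last, Comp comp)
{
    for (T *it = first + 1; it < last; ++it) {
        T key = *it;
        T *pos = it;
        // shift larger elements right, then place the key
        while (pos > first && comp(key, *(pos - 1))) {
            *pos = *(pos - 1);
            --pos;
        }
        *pos = key;
    }
}

// Sort every row of a row-major batch independently.
template <typename T, typename Comp>
void sort_rows(std::vector<T> &data,
               std::size_t n_rows,
               std::size_t row_len,
               Comp comp)
{
    for (std::size_t r = 0; r < n_rows; ++r) {
        T *row = data.data() + r * row_len;
        if (row_len < sequential_sorting_threshold) {
            // short row: sort it sequentially in one go
            insertion_sort(row, row + row_len, comp);
        }
        else {
            // longer row: hand off to a heavier stable sort
            // (a parallel merge sort in the real kernel)
            std::stable_sort(row, row + row_len, comp);
        }
    }
}

int main()
{
    std::vector<int> batch = {5, 3, 9, 1, 7, 2, 8, 4}; // 2 rows of 4
    sort_rows(batch, 2, 4, std::less<int>{});
    for (int v : batch)
        std::cout << v << ' ';                         // 1 3 5 9 2 4 7 8
    std::cout << '\n';
}
```

The trade-off the threshold encodes: for very short rows, the per-row sequential sort avoids the synchronization and setup overhead of the parallel path, but past a certain length the parallel merge sort wins; per the added comment, 16 was chosen experimentally on GPU Max and Iris Xe to keep performance monotone across row lengths.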