NVIDIA
diff --git a/README.md b/README.md
Lines changed: 8 additions & 8 deletions
diff --git a/nvpl_blas/f77/cdotc.f b/nvpl_blas/f77/cdotc.f
Lines changed: 1 addition & 1 deletion
diff --git a/nvpl_blas/f77/cdotu.f b/nvpl_blas/f77/cdotu.f
Lines changed: 1 addition & 1 deletion
diff --git a/nvpl_blas/f77/ctpsv.f b/nvpl_blas/f77/ctpsv.f
Lines changed: 1 addition & 1 deletion
diff --git a/nvpl_blas/f77/dtpsv.f b/nvpl_blas/f77/dtpsv.f
Lines changed: 1 addition & 1 deletion
diff --git a/nvpl_blas/f77/stpsv.f b/nvpl_blas/f77/stpsv.f
Lines changed: 1 addition & 1 deletion
diff --git a/nvpl_blas/f77/zdotc.f b/nvpl_blas/f77/zdotc.f
Lines changed: 1 addition & 1 deletion
diff --git a/nvpl_blas/f77/zdotu.f b/nvpl_blas/f77/zdotu.f
Lines changed: 1 addition & 1 deletion
diff --git a/nvpl_blas/f77/ztpsv.f b/nvpl_blas/f77/ztpsv.f
Lines changed: 1 addition & 1 deletion
diff --git a/nvpl_fft/README.md b/nvpl_fft/README.md
Lines changed: 22 additions & 15 deletions
@@ -4,14 +4,14 @@ The NVIDIA Performance Libraries (NVPL) are a collection of high performance mat
 
 These CPU-only libraries have no dependencies on CUDA or CTK, and are drop in replacements for standard C and Fortran mathematical APIs allowing HPC applications to achieve maximum performance on the Grace platform.
 
-The provided sample codes show how to call and link to NVPL Libraries in Fortran, C, and C++ applications and libraries.  Most examples use [CMake](#cmake-usage), but are easily modified for use in custom build environments.
+The provided sample codes show how to call and link to NVPL Libraries in Fortran, C, and C++ applications and libraries.  Examples use [CMake](#cmake-usage), but are easily modified for use in custom build environments.
 
 * [NVPL Documentation](https://docs.nvidia.com/nvpl/)
 
 ## Installation
 
 * [NVPL Downloads](https://developer.nvidia.com/nvpl-downloads/)
-* Latest release: **NVPL-25.1**
+* Latest release: **NVPL-25.5**
 
 ## Library Samples
 
@@ -33,27 +33,27 @@ Samples are compatible with the latest nvpl release.  Compatibility with older r
 * Platform: Arm SBSA
 * CPUs Supported
    * [NVIDIA Grace](https://www.nvidia.com/en-us/data-center/grace-cpu/) (Armv9.0-A Neoverse-V2)
-   * AWS Graviton 4 (Armv9.00-A Neoverse-V2)
+   * AWS Graviton 4 (Armv9.0-A Neoverse-V2)
    * AWS Graviton 3/3e (Armv8.4-A Neoverse-V1)
    * AWS Graviton 2 (Arm-8.2-A Neoverse-N1)
    * Ampere Altra (Armv8.2-A Neoverse-N1)
    * Any CPU with Armv8.1-A or later micro Architecture
 * OS (Linux)
-   * Ubuntu: 20.04, 22.04, 24.04, 24.10
+   * Ubuntu: 20.04, 22.04, 24.04, 25.04
    * Debian: 12
    * RHEL: RHEL8, RHEL9
-   * Fedora: 39, 40, 41
+   * Fedora: 40, 41, 42
    * SLES: SLES15 (15.6)
    * OpenSUSE/leap: 15.6
    * AmazonLinux: 2, 2023
    * Generally any Linux OS with support for aarch64
 
 ### Compilers
 
-* GCC-8 - GCC-14+
-* Clang-14 - Clang-19+
+* GCC-8 - GCC-15+
+* Clang-14 - Clang-21+
 * [Clang for NVIDIA Grace](https://developer.nvidia.com/grace/clang/downloads): 16.x, 17.x, 18.x, 19.x
-* [NVIDA HPC Compilers](https://developer.nvidia.com/hpc-compilers): 23.9 - 24.11
+* [NVIDA HPC Compilers](https://developer.nvidia.com/hpc-compilers): 23.9 - 25.3+
 
 ### Languages
 
 
@@ -46,5 +46,5 @@ program   CDOTC_MAIN
  99   format('Example: CDOTC for computing the dot product '
      &       'of vectors X and Y')
  100  format('#### args: n=',i1,', incx=',i1,', incy=',i1)
- 101  format('The dot product of vectors X and Y:(',f5.2,', ',f5.2')')
+ 101  format('The dot product of vectors X and Y:(',f5.2,', ',f5.2,')')
       end
@@ -46,5 +46,5 @@ program   CDOTU_MAIN
  99   format('Example: CDOTU for computing the dot product '
      &       'of vectors X and Y')
  100  format('#### args: n=',i1,', incx=',i1,', incy=',i1)
- 101  format('The dot product of vectors X and Y:(',f5.2,', ',f5.2')')
+ 101  format('The dot product of vectors X and Y:(',f5.2,', ',f5.2,')')
       end
@@ -45,7 +45,7 @@ program   CTPSV_MAIN
       stop
  99   format('Example: CTPSV for solving a system of linear equations'
      &       ' whose coefficients are in a triangular packed matrix.')
- 100  format('#### args: n=',i1,', incx=',i1
+ 100  format('#### args: n=',i1,', incx=',i1,
      &       ', uplo=',a1,', transa=',a1,', diag=',a1)
  102  format(A)
       end
@@ -45,7 +45,7 @@ program   DTPSV_MAIN
       stop
  99   format('Example: DTPSV for solving a system of linear equations'
      &       ' whose coefficients are in a triangular packed matrix.')
- 100  format('#### args: n=',i1,', incx=',i1
+ 100  format('#### args: n=',i1,', incx=',i1,
      &       ', uplo=',a1,', transa=',a1,', diag=',a1)
  102  format(A)
       end
@@ -45,7 +45,7 @@ program   STPSV_MAIN
       stop
  99   format('Example: STPSV for solving a system of linear equations'
      &       ' whose coefficients are in a triangular packed matrix.')
- 100  format('#### args: n=',i1,', incx=',i1
+ 100  format('#### args: n=',i1,', incx=',i1,
      &       ', uplo=',a1,', transa=',a1,', diag=',a1)
  102  format(A)
       end
@@ -46,5 +46,5 @@ program   ZDOTC_MAIN
  99   format('Example: ZDOTC for computing the dot product '
      &       'of vectors X and Y')
  100  format('#### args: n=',i1,', incx=',i1,', incy=',i1)
- 101  format('The dot product of vectors X and Y:(',f5.2,', ',f5.2')')
+ 101  format('The dot product of vectors X and Y:(',f5.2,', ',f5.2,')')
       end
@@ -46,5 +46,5 @@ program   ZDOTU_MAIN
  99   format('Example: ZDOTU for computing the dot product '
      &       'of vectors X and Y')
  100  format('#### args: n=',i1,', incx=',i1,', incy=',i1)
- 101  format('The dot product of vectors X and Y:(',f5.2,', ',f5.2')')
+ 101  format('The dot product of vectors X and Y:(',f5.2,', ',f5.2,')')
       end
@@ -45,7 +45,7 @@ program   ZTPSV_MAIN
       stop
  99   format('Example: ZTPSV for solving a system of linear equations'
      &       ' whose coefficients are in a triangular packed matrix.')
- 100  format('#### args: n=',i1,', incx=',i1
+ 100  format('#### args: n=',i1,', incx=',i1,
      &       ', uplo=',a1,', transa=',a1,', diag=',a1)
  102  format(A)
       end
@@ -6,7 +6,7 @@
 * `cxx/c2c_<single/double>_many_example` computes batched forward and backward 1D/2D/3D C2C FFTs in single/double precision with strided data.
 * `cxx/r2c_c2r_<single/double>_many_example` computes batched forward R2C and backward C2R 1D/2D/3D FFTs in single/double precision with strided data.
 * `cxx/<r2c_c2r/c2c>_single_withomp_example` computes batched forward and backward 1D FFTs in single precision with contiguous data. Plan creation and execution are called inside an omp parallel region.
-* `cxx/c2c_c2r_r2c_many_bench_example` measures performance of  1D/2D/3D C2C/C2R/R2C FFTs with single and double precision.
+* `cxx/plan_many_dft_benchmark_example` measures performance of 1D/2D/3D C2C/C2R/R2C FFTs with single and double precision and arbitrary strides for batched calls.
 * `cxx/c2c_c2r_r2c_single_apis_example` demonstrates usage of simple and advanced FFTW APIs for computing C2C / R2C / C2R FFTs with inplace / out-of-place data.
 * `cxx/auxiliary_apis_example` demonstrates usage of few auxiliary APIs.
 * `cxx/include_header_example` demonstrates inclusion of `fftw3.h` (as opposed to using `nvpl_fftw3.h`).
@@ -43,20 +43,27 @@ make
 ./fortran/c2c_c2r_r2c_single_apis_example.f90
 
 ```
-### c2c_c2r_r2c_many_bench_example
+### plan_many_dft_benchmark_example
+It's recommended to use the `./scripts/nvpl/nvplbench_generic.py` script to run the `plan_many_dft_benchmark_example`
+allowing to test multiple configurations grouped into one use-case.
 ```
-Usage: ./c2c_r2c_c2r_many_bench_example
+Usage: ./plan_many_dft_benchmark_example
 Arguments:
-	--prec precision:          The precision of the transform fp32 or fp64.
-	--fft_type fft_type:       The type of the transform c2c, r2c or c2r.
-	--mode mode:               (optional) The mode of the transform ip or oop (default: ip).
-	--config config_name:      (optional) Name of the config to be logged (default: no_config).
-	--cat bench_category:      (optional) The case to benchmark p_2357, f_2357_l_512_r_1, f_2357_l_512_r_2, varargs_r_1 (default: p_2357).
-	--size data_size:          (optional) Transform data size. Supported options:
-	                           * 0 - default, total data size is 256 MB.
-	                           * <number> - number of batches to process for each FFT size.
-	                           * <number>k or <number>m - for example 64m - the size of data in KB or MB to process.
-	--cycles cycles:           (optional) The number of cycles (default: 100).
-	--warmup warmup_runs:      (optional) The number of warm-up runs (default: 10).
-	--fft_sizes *fft_sizes:    (optional) If `varargs_r_1` is selected, fft sizes can be listed manually (for rank 1). This must be the last argument!
+        --prec precision:          The precision of the transform fp32 or fp64.
+        --fft_type fft_type:       (optional) The type of the transform c2c, r2c or c2r (default: c2c).
+        --mode mode:               (optional) The mode of the transform ip or oop (default: ip).
+        --rank rank:               (optional) Rank of the transform (default: 1).
+        --size data_size:          (optional) Transform data size. Supported options:
+                                   * 0 - default, total data size is 256 MB.
+                                   * <number> - number of batches to process for each FFT size.
+                                   * <number>k or <number>m - for example 64m - the size of data in KB or MB to process.
+        --cycles cycles:           (optional) The number of cycles (default: 100).
+        --warmup warmup_runs:      (optional) The number of warm-up runs (default: 10).
+        --fft_sizes *fft_sizes:    (optional) Size of the fft transform. If the rank != 1, it must be specified earlier!
+        --istride istride:         (optional) Input stride - distance between elements of the sample (default: 1)
+        --idist idist:             (optional) Distance between start of each input sample (~ number of transformed elements)
+        --ostride ostride:         (optional) Output stride - distance between elements of the sample (default: 1)
+        --odist odist:             (optional) Distance between start of each output sample (~ number of transformed elements)
 ```
+
+
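
Reviewer note on the new `--istride`, `--idist`, `--ostride` and `--odist` options: they appear to mirror the parameters of the same names in the FFTW advanced interface (`fftw_plan_many_dft`), where element `j` of transform `k` lives at index `j * stride + k * dist`. The sketch below illustrates that addressing for rank-1 transforms in plain Python. It is only an illustration under that assumption: `strided_batched_dft` is a hypothetical helper, the naive O(n²) DFT stands in for the library call, and the contiguous default for the distances (`n * stride`) is a guess at what the usage text means by "~ number of transformed elements".

```python
import cmath


def strided_batched_dft(data, n, howmany, istride=1, idist=None,
                        ostride=1, odist=None):
    """Naive rank-1 forward DFT over a strided, batched layout.

    Hypothetical helper, not part of NVPL or FFTW. It assumes the FFTW
    advanced-interface convention: element j of transform k is read from
    data[j * istride + k * idist] and written to out[j * ostride + k * odist].
    """
    # Assumed defaults: batches stored back to back, so dist = n * stride.
    if idist is None:
        idist = n * istride
    if odist is None:
        odist = n * ostride

    # Smallest output buffer that can hold the last element of the last batch.
    out_len = (howmany - 1) * odist + (n - 1) * ostride + 1
    out = [0j] * out_len

    for k in range(howmany):
        # Gather one input sample through its stride and distance.
        samples = [data[j * istride + k * idist] for j in range(n)]
        for m in range(n):
            acc = sum(x * cmath.exp(-2j * cmath.pi * m * j / n)
                      for j, x in enumerate(samples))
            # Scatter the result through the output stride and distance.
            out[m * ostride + k * odist] = acc
    return out


# Two length-4 signals interleaved element by element:
# istride=2 skips over the other signal, idist=1 moves to the next signal.
interleaved = [1, 10, 2, 20, 3, 30, 4, 40]
result = strided_batched_dft(interleaved, n=4, howmany=2, istride=2, idist=1)
```

Here the output uses the assumed contiguous layout, so `result[0:4]` holds the spectrum of `[1, 2, 3, 4]` and `result[4:8]` holds the spectrum of `[10, 20, 30, 40]`. Whether the benchmark applies the same defaults is worth confirming against its source.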