Commit d1ec4c7

Merge pull request #4 from NVIDIA/release/nvpl-25.5
Release/nvpl 25.5
2 parents 43c2ab7 + 15047e4 commit d1ec4c7

56 files changed: +1154 -1100 lines changed

README.md

Lines changed: 8 additions & 8 deletions
@@ -4,14 +4,14 @@ The NVIDIA Performance Libraries (NVPL) are a collection of high performance mat

 These CPU-only libraries have no dependencies on CUDA or CTK, and are drop in replacements for standard C and Fortran mathematical APIs allowing HPC applications to achieve maximum performance on the Grace platform.

-The provided sample codes show how to call and link to NVPL Libraries in Fortran, C, and C++ applications and libraries. Most examples use [CMake](#cmake-usage), but are easily modified for use in custom build environments.
+The provided sample codes show how to call and link to NVPL Libraries in Fortran, C, and C++ applications and libraries. Examples use [CMake](#cmake-usage), but are easily modified for use in custom build environments.

 * [NVPL Documentation](https://docs.nvidia.com/nvpl/)

 ## Installation

 * [NVPL Downloads](https://developer.nvidia.com/nvpl-downloads/)
-* Latest release: **NVPL-25.1**
+* Latest release: **NVPL-25.5**

 ## Library Samples

@@ -33,27 +33,27 @@ Samples are compatible with the latest nvpl release. Compatibility with older r
 * Platform: Arm SBSA
 * CPUs Supported
 * [NVIDIA Grace](https://www.nvidia.com/en-us/data-center/grace-cpu/) (Armv9.0-A Neoverse-V2)
-* AWS Graviton 4 (Armv9.00-A Neoverse-V2)
+* AWS Graviton 4 (Armv9.0-A Neoverse-V2)
 * AWS Graviton 3/3e (Armv8.4-A Neoverse-V1)
 * AWS Graviton 2 (Arm-8.2-A Neoverse-N1)
 * Ampere Altra (Armv8.2-A Neoverse-N1)
 * Any CPU with Armv8.1-A or later micro Architecture
 * OS (Linux)
-* Ubuntu: 20.04, 22.04, 24.04, 24.10
+* Ubuntu: 20.04, 22.04, 24.04, 25.04
 * Debian: 12
 * RHEL: RHEL8, RHEL9
-* Fedora: 39, 40, 41
+* Fedora: 40, 41, 42
 * SLES: SLES15 (15.6)
 * OpenSUSE/leap: 15.6
 * AmazonLinux: 2, 2023
 * Generally any Linux OS with support for aarch64

 ### Compilers

-* GCC-8 - GCC-14+
-* Clang-14 - Clang-19+
+* GCC-8 - GCC-15+
+* Clang-14 - Clang-21+
 * [Clang for NVIDIA Grace](https://developer.nvidia.com/grace/clang/downloads): 16.x, 17.x, 18.x, 19.x
-* [NVIDA HPC Compilers](https://developer.nvidia.com/hpc-compilers): 23.9 - 24.11
+* [NVIDA HPC Compilers](https://developer.nvidia.com/hpc-compilers): 23.9 - 25.3+

 ### Languages

nvpl_blas/f77/cdotc.f

Lines changed: 1 addition & 1 deletion
@@ -46,5 +46,5 @@ program CDOTC_MAIN
 99 format('Example: CDOTC for computing the dot product '
 & 'of vectors X and Y')
 100 format('#### args: n=',i1,', incx=',i1,', incy=',i1)
-101 format('The dot product of vectors X and Y:(',f5.2,', ',f5.2')')
+101 format('The dot product of vectors X and Y:(',f5.2,', ',f5.2,')')
 end

nvpl_blas/f77/cdotu.f

Lines changed: 1 addition & 1 deletion
@@ -46,5 +46,5 @@ program CDOTU_MAIN
 99 format('Example: CDOTU for computing the dot product '
 & 'of vectors X and Y')
 100 format('#### args: n=',i1,', incx=',i1,', incy=',i1)
-101 format('The dot product of vectors X and Y:(',f5.2,', ',f5.2')')
+101 format('The dot product of vectors X and Y:(',f5.2,', ',f5.2,')')
 end

nvpl_blas/f77/ctpsv.f

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ program CTPSV_MAIN
 stop
 99 format('Example: CTPSV for solving a system of linear equations'
 & ' whose coefficients are in a triangular packed matrix.')
-100 format('#### args: n=',i1,', incx=',i1
+100 format('#### args: n=',i1,', incx=',i1,
 & ', uplo=',a1,', transa=',a1,', diag=',a1)
 102 format(A)
 end

nvpl_blas/f77/dtpsv.f

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ program DTPSV_MAIN
 stop
 99 format('Example: DTPSV for solving a system of linear equations'
 & ' whose coefficients are in a triangular packed matrix.')
-100 format('#### args: n=',i1,', incx=',i1
+100 format('#### args: n=',i1,', incx=',i1,
 & ', uplo=',a1,', transa=',a1,', diag=',a1)
 102 format(A)
 end

nvpl_blas/f77/stpsv.f

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ program STPSV_MAIN
 stop
 99 format('Example: STPSV for solving a system of linear equations'
 & ' whose coefficients are in a triangular packed matrix.')
-100 format('#### args: n=',i1,', incx=',i1
+100 format('#### args: n=',i1,', incx=',i1,
 & ', uplo=',a1,', transa=',a1,', diag=',a1)
 102 format(A)
 end

nvpl_blas/f77/zdotc.f

Lines changed: 1 addition & 1 deletion
@@ -46,5 +46,5 @@ program ZDOTC_MAIN
 99 format('Example: ZDOTC for computing the dot product '
 & 'of vectors X and Y')
 100 format('#### args: n=',i1,', incx=',i1,', incy=',i1)
-101 format('The dot product of vectors X and Y:(',f5.2,', ',f5.2')')
+101 format('The dot product of vectors X and Y:(',f5.2,', ',f5.2,')')
 end

nvpl_blas/f77/zdotu.f

Lines changed: 1 addition & 1 deletion
@@ -46,5 +46,5 @@ program ZDOTU_MAIN
 99 format('Example: ZDOTU for computing the dot product '
 & 'of vectors X and Y')
 100 format('#### args: n=',i1,', incx=',i1,', incy=',i1)
-101 format('The dot product of vectors X and Y:(',f5.2,', ',f5.2')')
+101 format('The dot product of vectors X and Y:(',f5.2,', ',f5.2,')')
 end

nvpl_blas/f77/ztpsv.f

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ program ZTPSV_MAIN
 stop
 99 format('Example: ZTPSV for solving a system of linear equations'
 & ' whose coefficients are in a triangular packed matrix.')
-100 format('#### args: n=',i1,', incx=',i1
+100 format('#### args: n=',i1,', incx=',i1,
 & ', uplo=',a1,', transa=',a1,', diag=',a1)
 102 format(A)
 end

nvpl_fft/README.md

Lines changed: 22 additions & 15 deletions
@@ -6,7 +6,7 @@
 * `cxx/c2c_<single/double>_many_example` computes batched forward and backward 1D/2D/3D C2C FFTs in single/double precision with strided data.
 * `cxx/r2c_c2r_<single/double>_many_example` computes batched forward R2C and backward C2R 1D/2D/3D FFTs in single/double precision with strided data.
 * `cxx/<r2c_c2r/c2c>_single_withomp_example` computes batched forward and backward 1D FFTs in single precision with contiguous data. Plan creation and execution are called inside an omp parallel region.
-* `cxx/c2c_c2r_r2c_many_bench_example` measures performance of 1D/2D/3D C2C/C2R/R2C FFTs with single and double precision.
+* `cxx/plan_many_dft_benchmark_example` measures performance of 1D/2D/3D C2C/C2R/R2C FFTs with single and double precision and arbitrary strides for batched calls.
 * `cxx/c2c_c2r_r2c_single_apis_example` demonstrates usage of simple and advanced FFTW APIs for computing C2C / R2C / C2R FFTs with inplace / out-of-place data.
 * `cxx/auxiliary_apis_example` demonstrates usage of few auxiliary APIs.
 * `cxx/include_header_example` demonstrates inclusion of `fftw3.h` (as opposed to using `nvpl_fftw3.h`).
@@ -43,20 +43,27 @@ make
 ./fortran/c2c_c2r_r2c_single_apis_example.f90

 ```
-### c2c_c2r_r2c_many_bench_example
+### plan_many_dft_benchmark_example
+It's recommended to use the `./scripts/nvpl/nvplbench_generic.py` script to run the `plan_many_dft_benchmark_example`
+allowing to test multiple configurations grouped into one use-case.
 ```
-Usage: ./c2c_r2c_c2r_many_bench_example
+Usage: ./plan_many_dft_benchmark_example
 Arguments:
---prec precision: The precision of the transform fp32 or fp64.
---fft_type fft_type: The type of the transform c2c, r2c or c2r.
---mode mode: (optional) The mode of the transform ip or oop (default: ip).
---config config_name: (optional) Name of the config to be logged (default: no_config).
---cat bench_category: (optional) The case to benchmark p_2357, f_2357_l_512_r_1, f_2357_l_512_r_2, varargs_r_1 (default: p_2357).
---size data_size: (optional) Transform data size. Supported options:
-* 0 - default, total data size is 256 MB.
-* <number> - number of batches to process for each FFT size.
-* <number>k or <number>m - for example 64m - the size of data in KB or MB to process.
---cycles cycles: (optional) The number of cycles (default: 100).
---warmup warmup_runs: (optional) The number of warm-up runs (default: 10).
---fft_sizes *fft_sizes: (optional) If `varargs_r_1` is selected, fft sizes can be listed manually (for rank 1). This must be the last argument!
+--prec precision: The precision of the transform fp32 or fp64.
+--fft_type fft_type: (optional) The type of the transform c2c, r2c or c2r (default: c2c).
+--mode mode: (optional) The mode of the transform ip or oop (default: ip).
+--rank rank: (optional) Rank of the transform (default: 1).
+--size data_size: (optional) Transform data size. Supported options:
+* 0 - default, total data size is 256 MB.
+* <number> - number of batches to process for each FFT size.
+* <number>k or <number>m - for example 64m - the size of data in KB or MB to process.
+--cycles cycles: (optional) The number of cycles (default: 100).
+--warmup warmup_runs: (optional) The number of warm-up runs (default: 10).
+--fft_sizes *fft_sizes: (optional) Size of the fft transform. If the rank != 1, it must be specified earlier!
+--istride istride: (optional) Input stride - distance between elements of the sample (default: 1)
+--idist idist: (optional) Distance between start of each input sample (~ number of transformed elements)
+--ostride ostride: (optional) Output stride - distance between elements of the sample (default: 1)
+--odist odist: (optional) Distance between start of each output sample (~ number of transformed elements)
 ```
+
+
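Reviewer note on the new `--istride`/`--idist` (and `--ostride`/`--odist`) options: the sketch below illustrates how a stride and a distance address a batched rank-1 layout, following the convention of the FFTW advanced interface, where element `j` of sample `k` sits at flat offset `k * dist + j * stride`. The helper `element_offsets` is a hypothetical name used for illustration only and is not part of the benchmark. Defaulting `dist` to `n * stride` is an assumption, loosely based on the "~ number of transformed elements" hint in the usage text.

```python
def element_offsets(n, howmany, stride=1, dist=None):
    """Return, for each of `howmany` samples, the flat offsets of its `n` elements.

    Hypothetical helper: mirrors the FFTW advanced-interface addressing rule
    offset = k * dist + j * stride for a rank-1 batched transform.
    """
    if dist is None:
        # Assumption: samples are packed back to back when no distance is given.
        dist = n * stride
    return [[k * dist + j * stride for j in range(n)] for k in range(howmany)]


# Two length-4 samples stored one after the other (stride 1, dist 4).
contiguous = element_offsets(n=4, howmany=2)

# The same two samples interleaved element by element (stride 2, dist 1).
interleaved = element_offsets(n=4, howmany=2, stride=2, dist=1)

print(contiguous)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(interleaved)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

With stride 1 and a distance equal to the transform length, each sample is contiguous; with stride equal to the batch count and distance 1, the samples are interleaved. Both layouts cover the same eight elements without overlap.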
