oneapi-src
diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/board_test/README.md b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/board_test/README.md
Lines changed: 41 additions & 51 deletions
diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/cholesky/README.md b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/cholesky/README.md
Lines changed: 4 additions & 4 deletions
diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/cholesky_inversion/README.md b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/cholesky_inversion/README.md
Lines changed: 4 additions & 5 deletions
diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/crr/README.md b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/crr/README.md
Lines changed: 5 additions & 5 deletions
diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/fft2d/README.md b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/fft2d/README.md
Lines changed: 13 additions & 13 deletions
diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/gzip/README.md b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/gzip/README.md
Lines changed: 13 additions & 13 deletions
diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/matmul/README.md b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/matmul/README.md
Lines changed: 6 additions & 6 deletions
@@ -264,7 +264,7 @@ The tests listed above check the following interfaces in a platform:
 
 ## Example Output
 
-Running on FPGA device (Terasic’s DE10-Agilex Development Board). Performance results are based on testing as of August 30, 2023.
+Running on FPGA device (Intel® FPGA SmartNIC N6001-PL). Performance results are based on testing as of May 10, 2024.
 
 > **Note**: Refer to the [Performance Disclaimers](/DirectProgramming/C++SYCL_FPGA/README.md#performance-disclaimers) section for important performance information.
 
@@ -287,21 +287,21 @@ The tests are:
 Note: Kernel Clock Frequency is run along with all tests except 1 (Host Speed and Host Read Write test)
 
 Running all tests 
-Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
+Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
 
-clGetDeviceInfo CL_DEVICE_GLOBAL_MEM_SIZE = 34359737344
-clGetDeviceInfo CL_DEVICE_MAX_MEM_ALLOC_SIZE = 34359737344
-Device buffer size available for allocation = 34359737344 bytes
+clGetDeviceInfo CL_DEVICE_GLOBAL_MEM_SIZE = 17179869184
+clGetDeviceInfo CL_DEVICE_MAX_MEM_ALLOC_SIZE = 17179868160
+Device buffer size available for allocation = 17179868160 bytes
 
 *****************************************************************
 *********************** Host Speed Test *************************
 *****************************************************************
 
-Size of buffer created = 34359737344 bytes
-Writing 32767 MB to device global memory ... 8776.42 MB/s
-Reading 32767 MB from device global memory ... 9743.93 MB/s
+Size of buffer created = 17179868160 bytes
+Writing 16383 MiB to device global memory ... 7592.7 MB/s
+Reading 16383 MiB from device global memory ... 7628.8 MB/s
 Verifying data ...
-Successfully wrote and readback 32767 MB buffer
+Successfully wrote and readback 16383 MB buffer
 
 Transferring 8192 KBs in 256 32 KB blocks ...
 Transferring 8192 KBs in 128 64 KB blocks ...
@@ -316,57 +316,51 @@ Transferring 8192 KBs in 1 8192 KB blocks ...
 Writing 8192 KBs with block size (in bytes) below:
 
 Block_Size Avg Max Min End-End (MB/s)
-   32768 428.07 435.44 382.82 395.97 
-   65536 783.64 793.81 665.47 730.92 
-  131072 1325.34 1343.33 1165.79 1250.47 
-  262144 1984.22 2016.88 1776.27 1903.43 
-  524288 3507.84 3588.65 3165.04 3385.12 
- 1048576 4845.12 4982.59 4533.71 4730.02 
- 2097152 5741.51 5758.96 5719.79 5656.95 
- 4194304 6695.96 6869.04 6531.39 6652.06 
- 8388608 7585.54 7585.54 7585.54 7585.54 
+   32768 381.23 426.21 249.94 4775.26 
+   65536 510.22 546.47 406.00 7332.94 
+  131072 757.51 1073.87 701.72 13826.89 
+  262144 977.82 1954.74 869.33 16369.93 
+  524288 1272.50 3282.34 1037.37 14452.23 
+ 1048576 1746.77 4678.85 1083.52 8202.39 
+ 2097152 5797.93 5983.74 5546.83 20416.05 
+ 4194304 6436.09 6557.66 6318.95 12325.60 
+ 8388608 6919.11 6919.11 6919.11 6919.11 
 
 Reading 8192 KBs with block size (in bytes) below:
 
 Block_Size Avg Max Min End-End (MB/s)
-   32768 477.89 492.37 430.08 436.84 
-   65536 869.03 896.61 814.20 799.03 
-  131072 1464.26 1504.16 1384.23 1363.57 
-  262144 2201.40 2237.72 2136.42 2090.72 
-  524288 3869.46 3966.81 3728.03 3697.54 
- 1048576 5318.21 5457.59 5197.05 5171.01 
- 2097152 6325.13 6432.15 6175.55 6217.27 
- 4194304 7577.67 7609.52 7546.07 7526.88 
- 8388608 8441.18 8441.18 8441.18 8441.18 
+   32768 416.23 463.80 149.57 4051.34 
+   65536 588.92 634.71 261.48 6861.12 
+  131072 769.07 1089.64 397.59 12046.73 
+  262144 1017.03 2195.03 654.95 16790.08 
+  524288 1204.36 3581.23 815.31 13943.92 
+ 1048576 1512.05 4862.33 953.71 8371.75 
+ 2097152 2775.19 6196.34 1046.06 4133.37 
+ 4194304 2893.07 6699.52 1844.87 3673.20 
+ 8388608 2977.26 2977.26 2977.26 2977.26 
 
-Host write top speed = 7585.54 MB/s
-Host read top speed = 8441.18 MB/s
+Host write top speed = 20416.05 MB/s
+Host read top speed = 16790.08 MB/s
 
 
-HOST-TO-MEMORY BANDWIDTH = 8013 MB/s
+HOST-TO-MEMORY BANDWIDTH = 18603 MB/s
 
 
 *****************************************************************
 ********************* Host Read Write Test **********************
 *****************************************************************
 
 --- Running host read write test with device offset 0
-** WARNING: [aclde10_agilex0] NOT using DMA to transfer 1024 bytes from host to device because of lack of alignment
-**                 host ptr (0x1688a3f5) and/or dev offset (0x400) is not aligned to 4 bytes
 --- Running host read write test with device offset 3
-** WARNING: [aclde10_agilex0] NOT using DMA to transfer 1024 bytes from host to device because of lack of alignment
-**                 host ptr (0x1688a3f5) and/or dev offset (0x403) is not aligned to 4 bytes
-** WARNING: [aclde10_agilex0] NOT using DMA to transfer 1024 bytes from device to host because of lack of alignment
-**                 host ptr (0x16893cb8) and/or dev offset (0x403) is not aligned to 4 bytes
 
 HOST READ-WRITE TEST PASSED!
 
 *****************************************************************
 *******************  Kernel Clock Frequency Test  ***************
 *****************************************************************
 
-Measured Frequency    =   598.905 MHz 
-Quartus Compiled Frequency  =   600 MHz 
+Measured Frequency    =   511.062 MHz 
+Quartus Compiled Frequency  =   512 MHz 
 
 Measured Clock frequency is within 2 percent of Quartus compiled frequency. 
 
@@ -386,16 +380,16 @@ KERNEL_LAUNCH_TEST PASSED
 ********************  Kernel Latency  **************************
 *****************************************************************
 
-Processed 10000 kernels in 217.0312 ms
-Single kernel round trip time = 21.7031 us
-Throughput = 46.0763 kernels/ms
+Processed 10000 kernels in 118.6319 ms
+Single kernel round trip time = 11.8632 us
+Throughput = 84.2943 kernels/ms
 Kernel execution is complete
 
 *****************************************************************
 *************  Kernel-to-Memory Read Write Test  ***************
 *****************************************************************
 
-Maximum device global memory allocation size is 34359737344 bytes 
+Maximum device global memory allocation size is 17179868160 bytes 
 Finished host memory allocation for input and output data
 Creating device buffer
 Finished writing to device buffers 
@@ -404,10 +398,6 @@ Launching kernel with global offset : 0
 Launching kernel with global offset : 1073741824
 Launching kernel with global offset : 2147483648
 Launching kernel with global offset : 3221225472
-Launching kernel with global offset : 4294967296
-Launching kernel with global offset : 5368709120
-Launching kernel with global offset : 6442450944
-Launching kernel with global offset : 7516192768
 ... kernel finished execution. 
 Finished Verification
 KERNEL TO MEMORY READ WRITE TEST PASSED 
@@ -419,17 +409,17 @@ KERNEL TO MEMORY READ WRITE TEST PASSED
 Note: This test assumes that design was compiled with -Xsno-interleaving option
 
 
-Performing kernel transfers of 4096 MBs on the default global memory (address starting at 0)
+Performing kernel transfers of 4096 MiBs on the default global memory (address starting at 0)
 Launching kernel MemWriteStream ... 
 Launching kernel MemReadStream ... 
 Launching kernel MemReadWriteStream ... 
 
 Summarizing bandwidth in MB/s/bank for banks 1 to 8
- 19307.6  19312.5  19309.3  19309.4  19309.2  19309.4  19311.3  19309.2  MemWriteStream
- 19337.7  19339.8  19337.7  19341.4  19340.3  19338.3  19337.7  19339.3  MemReadStream
- 17657.3  17657.1  17657.5  17657.4  17656.7  17657.7  17657.6  17657.5  MemReadWriteStream
+ 8765.24  8765.28  8765.26  8765.29  8765.27  8765.24  8765.3  8765.28  MemWriteStream
+ 8786.28  8786.28  8786.27  8786.26  8786.27  8786.26  8786.3  8786.26  MemReadStream
+ 8059.25  8062.61  8061.29  8054.25  8058.78  8061.35  8062.6  8058.39  MemReadWriteStream
 
-KERNEL-TO-MEMORY BANDWIDTH = 18768.7 MB/s/bank
+KERNEL-TO-MEMORY BANDWIDTH = 8537.12 MB/s/bank
 
 *****************************************************************
 ***********************  USM Bandwidth  *************************
 
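Review note: the summary figures in the updated board_test output can be cross-checked against the per-test numbers. The sketch below assumes, based only on the printed values and not on the board_test source, that `HOST-TO-MEMORY BANDWIDTH` is the mean of the two host top speeds and that `KERNEL-TO-MEMORY BANDWIDTH` is the mean over all three kernels and eight banks. The variable names are illustrative.

```python
# Values copied from the updated board_test example output above.
host_write_top = 20416.05  # MB/s, "Host write top speed"
host_read_top = 16790.08   # MB/s, "Host read top speed"

# Assumption: the summary line is the mean of the two top speeds.
host_to_memory_bw = (host_write_top + host_read_top) / 2

per_bank = {
    "MemWriteStream": [8765.24, 8765.28, 8765.26, 8765.29,
                       8765.27, 8765.24, 8765.3, 8765.28],
    "MemReadStream": [8786.28, 8786.28, 8786.27, 8786.26,
                      8786.27, 8786.26, 8786.3, 8786.26],
    "MemReadWriteStream": [8059.25, 8062.61, 8061.29, 8054.25,
                           8058.78, 8061.35, 8062.6, 8058.39],
}

# Assumption: the summary line is the mean over every kernel and bank.
all_values = [v for row in per_bank.values() for v in row]
kernel_to_memory_bw = sum(all_values) / len(all_values)

# Kernel latency figures follow from the kernel count and the total time.
kernels = 10000
total_ms = 118.6319
round_trip_us = total_ms * 1000 / kernels
throughput_per_ms = kernels / total_ms

print(f"host-to-memory   ~ {host_to_memory_bw:.0f} MB/s")
print(f"kernel-to-memory ~ {kernel_to_memory_bw:.2f} MB/s/bank")
print(f"round trip       ~ {round_trip_us:.4f} us")
print(f"throughput       ~ {throughput_per_ms:.4f} kernels/ms")
```

Under these assumptions the computed values agree closely with the reported summary lines; small differences in the last digit are expected because the log prints rounded inputs.
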
@@ -66,7 +66,7 @@ Performance results are based on testing as of August 30, 2023.
 
 | Device                                            | Throughput
 |:---                                               |:---
-| Terasic’s DE10-Agilex Development Board           | 378k matrices/s for real matrices of size 32x32
+| Intel® FPGA SmartNIC N6001-PL                     | 338k matrices/s for real matrices of size 32x32
 
 ## Key Implementation Details
 
@@ -294,11 +294,11 @@ You can apply the Cholesky decomposition to a number of matrices, as shown below
 ## Example Output
 
 ```
-Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
+Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
 Generating 8 random real matrices of size 32x32 
 Computing the Cholesky decomposition of 8 matrices 819200 times
-   Total duration:   17.3307 s
-Throughput: 378.15k matrices/s
+   Total duration:   19.366 s
+Throughput: 338.407k matrices/s
 Verifying results...
 
 PASSED
 
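Review note: the new Cholesky throughput is consistent with the matrix count, repetition count, and total duration printed in the same output. A minimal sketch, assuming throughput is simply the number of decompositions divided by the elapsed time (the function name is hypothetical):

```python
def throughput_k_matrices_per_s(matrices: int, repetitions: int,
                                duration_s: float) -> float:
    """Thousands of matrices processed per second."""
    return matrices * repetitions / duration_s / 1e3


# Values from the updated Cholesky example output above.
cholesky_k = throughput_k_matrices_per_s(8, 819200, 19.366)
print(f"Cholesky: ~{cholesky_k:.1f}k matrices/s")
```

The result matches the reported 338.407k matrices/s up to rounding of the printed duration.
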
@@ -79,11 +79,10 @@ Performance results are based on testing as of April 26, 2022.
 
 | Device                                            | Throughput
 |:---                                               |:---
-| Terasic’s DE10-Agilex Development Board           | 415k matrices/s for real matrices of size 32x32
+| Intel® FPGA SmartNIC N6001-PL                     | 389k matrices/s for real matrices of size 32x32
 
 ## Key Implementation Details
 
-In this reference design, the Cholesky decomposition algorithm is used to factor a real _n_ × _n_ matrix. The algorithm computes the vector dot product of two rows of the matrix. In our FPGA implementation, the dot product is computed in a loop over the row's _n_ elements. The loop is fully unrolled to maximize throughput. As a result, *n* real multiplication operations are performed in parallel on the FPGA, followed by sequential additions to compute the dot product result.
 
 With this optimization, our FPGA implementation requires _n_ DSPs to compute the real floating point dot product. The input matrix is also replicated two times in order to be able to read two full rows per cycle. The matrix size is constrained by the total FPGA DSP and RAM resources available.
 
@@ -320,11 +319,11 @@ You can apply the Cholesky-based inversion to 8 matrices repeated a number of ti
 ## Example Output
 
 ```
-Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
+Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
 Generating 8 random real matrices of size 32x32 
 Computing the Cholesky-based inversion of 8 matrices 819200 times
-   Total duration:   15.7619 s
-Throughput: 415.789k matrices/s
+   Total duration:   16.8337 s
+Throughput: 389.315k matrices/s
 Verifying results...
 
 PASSED
 
@@ -61,9 +61,9 @@ Performance results are based on testing as of August 30, 2023.
 
 > **Note**: Refer to the [Performance Disclaimers](/DirectProgramming/C++SYCL_FPGA/README.md#performance-disclaimers) section for important performance information.
 
-| Device                                              | Throughput
-|:---                                                 |:---
-| Terasic’s DE10-Agilex Development Board             | 653 assets/s
+| Device                                          | Configuration                         | Throughput
+|:---                                             |:---                                   |:---
+| Intel® FPGA SmartNIC N6001-PL                   | Outer unroll: 1; Inner unroll: 64     | 329 assets/s
 
 
 ## Key Implementation Details
@@ -296,14 +296,14 @@ This design measures the FPGA performance to determine how many assets can be pr
 ## Example Output
 
 ```
-Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
+Running on device: ofs_n6001 : Intel OFS Platform (ofs_ec00000)
 
 ============= Correctness Test ============= 
 Running analytical correctness checks... 
 CPU-FPGA Equivalence: PASS
 
 ============= Throughput Test =============
-   Avg throughput:   653.9 assets/s
+   Avg throughput:   329.5 assets/s
 ```
 
 ## License
 
@@ -265,36 +265,36 @@ Additionally, the `cmake` build system can be configured using the following par
 ## Example Output
 
 
-Example Output when running on the **Terasic DE10-Agilex Development Board**.
+Example Output when running on the **Intel® FPGA SmartNIC N6001-PL**.
 
 ```
 No program argument was passed, running all fft2d variants
-Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
+Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
 Using USM device allocations
 Launching a 1048576 points 8-parallel FFT transform (ordered data layout)
-Processing time = 0.00296994s
-Throughput = 0.353063 Gpoints / sec (35.3063 Gflops)
+Processing time = 0.00187981s
+Throughput = 0.55781 Gpoints / sec (55.781 Gflops)
 Signal to noise ratio on output sample: 137.231
  --> PASSED
-Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
+Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
 Using USM device allocations
 Launching a 1048576 points 8-parallel inverse FFT transform (ordered data layout)
-Processing time = 0.00277858s
-Throughput = 0.377378 Gpoints / sec (37.7378 Gflops)
+Processing time = 0.00184986s
+Throughput = 0.566842 Gpoints / sec (56.6842 Gflops)
 Signal to noise ratio on output sample: 136.861
  --> PASSED
-Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
+Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
 Using USM device allocations
 Launching a 1048576 points 8-parallel FFT transform (alternative data layout)
-Processing time = 0.0027715s
-Throughput = 0.378343 Gpoints / sec (37.8343 Gflops)
+Processing time = 0.00185805s
+Throughput = 0.564343 Gpoints / sec (56.4343 Gflops)
 Signal to noise ratio on output sample: 137.436
  --> PASSED
-Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
+Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
 Using USM device allocations
 Launching a 1048576 points 8-parallel inverse FFT transform (alternative data layout)
-Processing time = 0.00277509s
-Throughput = 0.377852 Gpoints / sec (37.7852 Gflops)
+Processing time = 0.00185293s
+Throughput = 0.565902 Gpoints / sec (56.5902 Gflops)
 Signal to noise ratio on output sample: 136.689
  --> PASSED
 ```
 
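Review note: the fft2d figures can be sanity-checked from the point count and processing time. The sketch below assumes the conventional FFT operation count of 5·N·log2(N) floating-point operations for N points; the README output only shows that the Gflops figure is 100 times the Gpoints/s figure, which is what that convention gives for N = 1048576.

```python
import math

# Values from the first run in the updated fft2d example output above.
points = 1048576
processing_time_s = 0.00187981

gpoints_per_s = points / processing_time_s / 1e9

# Assumption: 5 * N * log2(N) flops in total, i.e. 5 * log2(N) per point.
flops_per_point = 5 * math.log2(points)
gflops = gpoints_per_s * flops_per_point

print(f"~{gpoints_per_s:.5f} Gpoints/s, ~{gflops:.3f} Gflops")
```
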
@@ -118,7 +118,7 @@ Performance results are based on testing as of August 30, 2023.
 
 | Device                                                                              | Throughput
 |:---                                                                                 |:---
-| Terasic’s DE10-Agilex Development Board                                             | 2 engines @ 4.6 GB/s
+| Intel® FPGA SmartNIC N6001-PL                                                       | 2 engines @ 7 GB/s
 
 ## Build the `GZIP` Design
 
@@ -307,7 +307,7 @@ Performance results are based on testing as of August 30, 2023.
 ## Example Output
 
 ```
-Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
+Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
 Launching High-Bandwidth DMA GZIP application with 2 engines
 outputSize: 145706366 Prepin: 0
 kMinBufferSize: 16384 isz: 145706110 kInOutPadding: 256
@@ -325,21 +325,21 @@ outputSize: 145706366 Prepin: 0
 kMinBufferSize: 16384 isz: 145706110 kInOutPadding: 256
 outputSize: 145706366 Prepin: 0
 kMinBufferSize: 16384 isz: 145706110 kInOutPadding: 256
-Throughput: 4.62197 GB/s
+Throughput: 6.99 GB/s
 
 TP breakdown for engine #0 (GB/s)
-CRC = 9.58499
-LZ77 = 9.22334
-Huffman Encoding = 4.51518
-DMA host-to-device = 8.92087
-DMA device-to-host = 9.85465
+CRC = 5.75029
+LZ77 = 3.51912
+Huffman Encoding = 3.5107
+DMA host-to-device = 9.26423
+DMA device-to-host = 7.4516
 
 TP breakdown for engine #1 (GB/s)
-CRC = 9.58543
-LZ77 = 9.23241
-Huffman Encoding = 4.50995
-DMA host-to-device = 8.93201
-DMA device-to-host = 9.86107
+CRC = 5.75794
+LZ77 = 3.52021
+Huffman Encoding = 3.50743
+DMA host-to-device = 9.36199
+DMA device-to-host = 8.74803
 
 Compression Ratio 43.9262%
 PASSED
 
@@ -60,7 +60,7 @@ Performance results are based on testing as of March 6, 2023.
 
 | Device                                            | Throughput
 |:---                                               |:---
-| Terasic’s DE10-Agilex Development Board           | 144k matrices/s for single-precision floating-point matrices of size 64 * 64, computed using a systolic array of 8 * 8 PEs (64 DSPs)
+| Intel® FPGA SmartNIC N6001-PL                     | 142k matrices/s for single-precision floating-point matrices of size 64 * 64, computed using a systolic array of 8 * 8 PEs (64 DSPs)
 
 ## Key Implementation Details
 
@@ -354,16 +354,16 @@ You can perform the multiplication of the set of matrices repeatedly. This step
 
 ## Example Output
 
-Example output when running on **Terasic’s DE10-Agilex Development Board** for the multiplication of 8 matrices 819200 times (each matrix consisting of 64x64 single-precision floating point numbers, computed using a systolic array of 8x8 PEs).
+Example output when running on **Intel® FPGA SmartNIC N6001-PL** for the multiplication of 8 matrices 819200 times (each matrix consisting of 64x64 single-precision floating point numbers, computed using a systolic array of 8x8 PEs).
 
 ```
-Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
+Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
  Matrix A size: 64 x 64 (tile: 8 x 64)
  Matrix B size: 64 x 64 (tile: 64 x 8)
  Systolic array size: 8 x 8 PEs
 Running matrix multiplication of 2 matrices 819200 times
-   Total duration:   11.3746 s
-Throughput: 144.04k matrices/s
+   Total duration:   11.4577 s
+Throughput: 142.995k matrices/s
 
 PASSED
 ```
@@ -372,4 +372,4 @@ PASSED
 
 Code samples are licensed under the MIT license. See [License.txt](/License.txt) for details.
 
-Third party program Licenses can be found here: [third-party-programs.txt](/third-party-programs.txt).
+Third party program Licenses can be found here: [third-party-programs.txt](/third-party-programs.txt).