Commit ea936e0

FPGA: Move all samples from the DE10 to the N6001 board (#2334)
This PR moves all hardware references in the code samples from the Terasic DE10 board to the N6001 board.
1 parent 6a8d7a9 commit ea936e0

27 files changed (+269, -281 lines)

DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/board_test/README.md

Lines changed: 41 additions & 51 deletions
@@ -264,7 +264,7 @@ The tests listed above check the following interfaces in a platform:
264264
265265
## Example Output
266266
267-
Running on FPGA device (Terasic’s DE10-Agilex Development Board). Performance results are based on testing as of August 30, 2023.
267+
Running on FPGA device (Intel® FPGA SmartNIC N6001-PL). Performance results are based on testing as of May 10, 2024.
268268
269269
> **Note**: Refer to the [Performance Disclaimers](/DirectProgramming/C++SYCL_FPGA/README.md#performance-disclaimers) section for important performance information.
270270
@@ -287,21 +287,21 @@ The tests are:
287287
Note: Kernel Clock Frequency is run along with all tests except 1 (Host Speed and Host Read Write test)
288288

289289
Running all tests
290-
Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
290+
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
291291

292-
clGetDeviceInfo CL_DEVICE_GLOBAL_MEM_SIZE = 34359737344
293-
clGetDeviceInfo CL_DEVICE_MAX_MEM_ALLOC_SIZE = 34359737344
294-
Device buffer size available for allocation = 34359737344 bytes
292+
clGetDeviceInfo CL_DEVICE_GLOBAL_MEM_SIZE = 17179869184
293+
clGetDeviceInfo CL_DEVICE_MAX_MEM_ALLOC_SIZE = 17179868160
294+
Device buffer size available for allocation = 17179868160 bytes
295295

296296
*****************************************************************
297297
*********************** Host Speed Test *************************
298298
*****************************************************************
299299

300-
Size of buffer created = 34359737344 bytes
301-
Writing 32767 MB to device global memory ... 8776.42 MB/s
302-
Reading 32767 MB from device global memory ... 9743.93 MB/s
300+
Size of buffer created = 17179868160 bytes
301+
Writing 16383 MiB to device global memory ... 7592.7 MB/s
302+
Reading 16383 MiB from device global memory ... 7628.8 MB/s
303303
Verifying data ...
304-
Successfully wrote and readback 32767 MB buffer
304+
Successfully wrote and readback 16383 MB buffer
305305

306306
Transferring 8192 KBs in 256 32 KB blocks ...
307307
Transferring 8192 KBs in 128 64 KB blocks ...
@@ -316,57 +316,51 @@ Transferring 8192 KBs in 1 8192 KB blocks ...
316316
Writing 8192 KBs with block size (in bytes) below:
317317

318318
Block_Size Avg Max Min End-End (MB/s)
319-
32768 428.07 435.44 382.82 395.97
320-
65536 783.64 793.81 665.47 730.92
321-
131072 1325.34 1343.33 1165.79 1250.47
322-
262144 1984.22 2016.88 1776.27 1903.43
323-
524288 3507.84 3588.65 3165.04 3385.12
324-
1048576 4845.12 4982.59 4533.71 4730.02
325-
2097152 5741.51 5758.96 5719.79 5656.95
326-
4194304 6695.96 6869.04 6531.39 6652.06
327-
8388608 7585.54 7585.54 7585.54 7585.54
319+
32768 381.23 426.21 249.94 4775.26
320+
65536 510.22 546.47 406.00 7332.94
321+
131072 757.51 1073.87 701.72 13826.89
322+
262144 977.82 1954.74 869.33 16369.93
323+
524288 1272.50 3282.34 1037.37 14452.23
324+
1048576 1746.77 4678.85 1083.52 8202.39
325+
2097152 5797.93 5983.74 5546.83 20416.05
326+
4194304 6436.09 6557.66 6318.95 12325.60
327+
8388608 6919.11 6919.11 6919.11 6919.11
328328

329329
Reading 8192 KBs with block size (in bytes) below:
330330

331331
Block_Size Avg Max Min End-End (MB/s)
332-
32768 477.89 492.37 430.08 436.84
333-
65536 869.03 896.61 814.20 799.03
334-
131072 1464.26 1504.16 1384.23 1363.57
335-
262144 2201.40 2237.72 2136.42 2090.72
336-
524288 3869.46 3966.81 3728.03 3697.54
337-
1048576 5318.21 5457.59 5197.05 5171.01
338-
2097152 6325.13 6432.15 6175.55 6217.27
339-
4194304 7577.67 7609.52 7546.07 7526.88
340-
8388608 8441.18 8441.18 8441.18 8441.18
332+
32768 416.23 463.80 149.57 4051.34
333+
65536 588.92 634.71 261.48 6861.12
334+
131072 769.07 1089.64 397.59 12046.73
335+
262144 1017.03 2195.03 654.95 16790.08
336+
524288 1204.36 3581.23 815.31 13943.92
337+
1048576 1512.05 4862.33 953.71 8371.75
338+
2097152 2775.19 6196.34 1046.06 4133.37
339+
4194304 2893.07 6699.52 1844.87 3673.20
340+
8388608 2977.26 2977.26 2977.26 2977.26
341341

342-
Host write top speed = 7585.54 MB/s
343-
Host read top speed = 8441.18 MB/s
342+
Host write top speed = 20416.05 MB/s
343+
Host read top speed = 16790.08 MB/s
344344

345345

346-
HOST-TO-MEMORY BANDWIDTH = 8013 MB/s
346+
HOST-TO-MEMORY BANDWIDTH = 18603 MB/s
347347

348348

349349
*****************************************************************
350350
********************* Host Read Write Test **********************
351351
*****************************************************************
352352

353353
--- Running host read write test with device offset 0
354-
** WARNING: [aclde10_agilex0] NOT using DMA to transfer 1024 bytes from host to device because of lack of alignment
355-
** host ptr (0x1688a3f5) and/or dev offset (0x400) is not aligned to 4 bytes
356354
--- Running host read write test with device offset 3
357-
** WARNING: [aclde10_agilex0] NOT using DMA to transfer 1024 bytes from host to device because of lack of alignment
358-
** host ptr (0x1688a3f5) and/or dev offset (0x403) is not aligned to 4 bytes
359-
** WARNING: [aclde10_agilex0] NOT using DMA to transfer 1024 bytes from device to host because of lack of alignment
360-
** host ptr (0x16893cb8) and/or dev offset (0x403) is not aligned to 4 bytes
361355

362356
HOST READ-WRITE TEST PASSED!
363357

364358
*****************************************************************
365359
******************* Kernel Clock Frequency Test ***************
366360
*****************************************************************
367361

368-
Measured Frequency = 598.905 MHz
369-
Quartus Compiled Frequency = 600 MHz
362+
Measured Frequency = 511.062 MHz
363+
Quartus Compiled Frequency = 512 MHz
370364

371365
Measured Clock frequency is within 2 percent of Quartus compiled frequency.
372366

@@ -386,16 +380,16 @@ KERNEL_LAUNCH_TEST PASSED
386380
******************** Kernel Latency **************************
387381
*****************************************************************
388382

389-
Processed 10000 kernels in 217.0312 ms
390-
Single kernel round trip time = 21.7031 us
391-
Throughput = 46.0763 kernels/ms
383+
Processed 10000 kernels in 118.6319 ms
384+
Single kernel round trip time = 11.8632 us
385+
Throughput = 84.2943 kernels/ms
392386
Kernel execution is complete
393387

394388
*****************************************************************
395389
************* Kernel-to-Memory Read Write Test ***************
396390
*****************************************************************
397391

398-
Maximum device global memory allocation size is 34359737344 bytes
392+
Maximum device global memory allocation size is 17179868160 bytes
399393
Finished host memory allocation for input and output data
400394
Creating device buffer
401395
Finished writing to device buffers
@@ -404,10 +398,6 @@ Launching kernel with global offset : 0
404398
Launching kernel with global offset : 1073741824
405399
Launching kernel with global offset : 2147483648
406400
Launching kernel with global offset : 3221225472
407-
Launching kernel with global offset : 4294967296
408-
Launching kernel with global offset : 5368709120
409-
Launching kernel with global offset : 6442450944
410-
Launching kernel with global offset : 7516192768
411401
... kernel finished execution.
412402
Finished Verification
413403
KERNEL TO MEMORY READ WRITE TEST PASSED
@@ -419,17 +409,17 @@ KERNEL TO MEMORY READ WRITE TEST PASSED
419409
Note: This test assumes that design was compiled with -Xsno-interleaving option
420410

421411

422-
Performing kernel transfers of 4096 MBs on the default global memory (address starting at 0)
412+
Performing kernel transfers of 4096 MiBs on the default global memory (address starting at 0)
423413
Launching kernel MemWriteStream ...
424414
Launching kernel MemReadStream ...
425415
Launching kernel MemReadWriteStream ...
426416

427417
Summarizing bandwidth in MB/s/bank for banks 1 to 8
428-
19307.6 19312.5 19309.3 19309.4 19309.2 19309.4 19311.3 19309.2 MemWriteStream
429-
19337.7 19339.8 19337.7 19341.4 19340.3 19338.3 19337.7 19339.3 MemReadStream
430-
17657.3 17657.1 17657.5 17657.4 17656.7 17657.7 17657.6 17657.5 MemReadWriteStream
418+
8765.24 8765.28 8765.26 8765.29 8765.27 8765.24 8765.3 8765.28 MemWriteStream
419+
8786.28 8786.28 8786.27 8786.26 8786.27 8786.26 8786.3 8786.26 MemReadStream
420+
8059.25 8062.61 8061.29 8054.25 8058.78 8061.35 8062.6 8058.39 MemReadWriteStream
431421

432-
KERNEL-TO-MEMORY BANDWIDTH = 18768.7 MB/s/bank
422+
KERNEL-TO-MEMORY BANDWIDTH = 8537.12 MB/s/bank
433423

434424
*****************************************************************
435425
*********************** USM Bandwidth *************************
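
The summary figures in the N6001 board_test output above can be cross-checked from the raw numbers in the same log. The sketch below is only a sanity check written for this page: the variable names are hypothetical, and the assumption that the host-to-memory figure is the mean of the two top speeds is inferred from the printed values, not taken from the sample's source.

```python
# Values copied from the N6001 board_test output in this diff.
host_write_top = 20416.05   # MB/s, "Host write top speed"
host_read_top = 16790.08    # MB/s, "Host read top speed"

# Assumption: the summary line is the mean of the two top speeds.
host_to_memory = (host_write_top + host_read_top) / 2

kernels = 10000             # "Processed 10000 kernels in 118.6319 ms"
elapsed_ms = 118.6319

round_trip_us = elapsed_ms * 1000 / kernels   # microseconds per kernel
throughput_per_ms = kernels / elapsed_ms      # kernels per millisecond

alloc_bytes = 17179868160   # CL_DEVICE_MAX_MEM_ALLOC_SIZE
alloc_mib = alloc_bytes // (1024 * 1024)      # whole MiB, rounded down

print(f"host-to-memory bandwidth: {host_to_memory:.0f} MB/s")
print(f"single kernel round trip: {round_trip_us:.4f} us")
print(f"kernel throughput:        {throughput_per_ms:.4f} kernels/ms")
print(f"buffer size:              {alloc_mib} MiB")
```

Under these assumptions the results agree with the logged values (18603 MB/s, 11.8632 us, 16383 MiB) to the printed precision; the kernel throughput can differ in the last digit because the log rounds the elapsed time.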

DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/cholesky/README.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -66,7 +66,7 @@ Performance results are based on testing as of August 30, 2023.
6666
6767
| Device | Throughput
6868
|:--- |:---
69-
| Terasic’s DE10-Agilex Development Board | 378k matrices/s for real matrices of size 32x32
69+
| Intel® FPGA SmartNIC N6001-PL | 338k matrices/s for real matrices of size 32x32
7070

7171
## Key Implementation Details
7272

@@ -294,11 +294,11 @@ You can apply the Cholesky decomposition to a number of matrices, as shown below
294294
## Example Output
295295
296296
```
297-
Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
297+
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
298298
Generating 8 random real matrices of size 32x32
299299
Computing the Cholesky decomposition of 8 matrices 819200 times
300-
Total duration: 17.3307 s
301-
Throughput: 378.15k matrices/s
300+
Total duration: 19.366 s
301+
Throughput: 338.407k matrices/s
302302
Verifying results...
303303

304304
PASSED

DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/cholesky_inversion/README.md

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -79,11 +79,10 @@ Performance results are based on testing as of April 26, 2022.
7979
8080
| Device | Throughput
8181
|:--- |:---
82-
| Terasic’s DE10-Agilex Development Board | 415k matrices/s for real matrices of size 32x32
82+
| Intel® FPGA SmartNIC N6001-PL | 389k matrices/s for real matrices of size 32x32
8383

8484
## Key Implementation Details
8585

86-
In this reference design, the Cholesky decomposition algorithm is used to factor a real _n_ × _n_ matrix. The algorithm computes the vector dot product of two rows of the matrix. In our FPGA implementation, the dot product is computed in a loop over the row's _n_ elements. The loop is fully unrolled to maximize throughput. As a result, *n* real multiplication operations are performed in parallel on the FPGA, followed by sequential additions to compute the dot product result.
8786

8887
With this optimization, our FPGA implementation requires _n_ DSPs to compute the real floating point dot product. The input matrix is also replicated two times in order to be able to read two full rows per cycle. The matrix size is constrained by the total FPGA DSP and RAM resources available.
8988

@@ -320,11 +319,11 @@ You can apply the Cholesky-based inversion to 8 matrices repeated a number of ti
320319
## Example Output
321320
322321
```
323-
Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
322+
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
324323
Generating 8 random real matrices of size 32x32
325324
Computing the Cholesky-based inversion of 8 matrices 819200 times
326-
Total duration: 15.7619 s
327-
Throughput: 415.789k matrices/s
325+
Total duration: 16.8337 s
326+
Throughput: 389.315k matrices/s
328327
Verifying results...
329328

330329
PASSED

DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/crr/README.md

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -61,9 +61,9 @@ Performance results are based on testing as of August 30, 2023.
6161

6262
> **Note**: Refer to the [Performance Disclaimers](/DirectProgramming/C++SYCL_FPGA/README.md#performance-disclaimers) section for important performance information.
6363
64-
| Device | Throughput
65-
|:--- |:---
66-
| Terasic’s DE10-Agilex Development Board | 653 assets/s
64+
| Device | Configuration | Throughput
65+
|:--- |:--- |:---
66+
| Intel® FPGA SmartNIC N6001-PL | Outer unroll: 1; Inner unroll: 64 | 329 assets/s
6767

6868

6969
## Key Implementation Details
@@ -296,14 +296,14 @@ This design measures the FPGA performance to determine how many assets can be pr
296296
## Example Output
297297
298298
```
299-
Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
299+
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ec00000)
300300

301301
============= Correctness Test =============
302302
Running analytical correctness checks...
303303
CPU-FPGA Equivalence: PASS
304304

305305
============= Throughput Test =============
306-
Avg throughput: 653.9 assets/s
306+
Avg throughput: 329.5 assets/s
307307
```
308308
309309
## License

DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/fft2d/README.md

Lines changed: 13 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -265,36 +265,36 @@ Additionally, the `cmake` build system can be configured using the following par
265265
## Example Output
266266
267267
268-
Example Output when running on the **Terasic DE10-Agilex Development Board**.
268+
Example Output when running on the **Intel® FPGA SmartNIC N6001-PL**.
269269
270270
```
271271
No program argument was passed, running all fft2d variants
272-
Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
272+
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
273273
Using USM device allocations
274274
Launching a 1048576 points 8-parallel FFT transform (ordered data layout)
275-
Processing time = 0.00296994s
276-
Throughput = 0.353063 Gpoints / sec (35.3063 Gflops)
275+
Processing time = 0.00187981s
276+
Throughput = 0.55781 Gpoints / sec (55.781 Gflops)
277277
Signal to noise ratio on output sample: 137.231
278278
--> PASSED
279-
Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
279+
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
280280
Using USM device allocations
281281
Launching a 1048576 points 8-parallel inverse FFT transform (ordered data layout)
282-
Processing time = 0.00277858s
283-
Throughput = 0.377378 Gpoints / sec (37.7378 Gflops)
282+
Processing time = 0.00184986s
283+
Throughput = 0.566842 Gpoints / sec (56.6842 Gflops)
284284
Signal to noise ratio on output sample: 136.861
285285
--> PASSED
286-
Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
286+
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
287287
Using USM device allocations
288288
Launching a 1048576 points 8-parallel FFT transform (alternative data layout)
289-
Processing time = 0.0027715s
290-
Throughput = 0.378343 Gpoints / sec (37.8343 Gflops)
289+
Processing time = 0.00185805s
290+
Throughput = 0.564343 Gpoints / sec (56.4343 Gflops)
291291
Signal to noise ratio on output sample: 137.436
292292
--> PASSED
293-
Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
293+
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
294294
Using USM device allocations
295295
Launching a 1048576 points 8-parallel inverse FFT transform (alternative data layout)
296-
Processing time = 0.00277509s
297-
Throughput = 0.377852 Gpoints / sec (37.7852 Gflops)
296+
Processing time = 0.00185293s
297+
Throughput = 0.565902 Gpoints / sec (56.5902 Gflops)
298298
Signal to noise ratio on output sample: 136.689
299299
--> PASSED
300300
```
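
The throughput lines in the fft2d output follow from the point count and the processing time. The sketch below reproduces the first forward transform on the N6001; it assumes the conventional 5 * N * log2(N) floating-point operation count for an N-point FFT, which matches the printed Gflops figure but is an inference from the log, not something confirmed from the sample's source. Variable names are hypothetical.

```python
import math

points = 1048576                # "Launching a 1048576 points 8-parallel FFT transform"
processing_time_s = 0.00187981  # "Processing time" of the first N6001 run

# Points processed per second, in units of 1e9 points.
gpoints_per_s = points / processing_time_s / 1e9

# Assumption: 5 * N * log2(N) flops per transform, i.e. 5 * log2(N) per point.
flops_per_point = 5 * math.log2(points)
gflops = gpoints_per_s * flops_per_point

print(f"Throughput = {gpoints_per_s:.5f} Gpoints / sec ({gflops:.3f} Gflops)")
```

With N = 1048576, log2(N) is 20, so each point accounts for 100 flops, which is why the Gflops figure in the log is 100 times the Gpoints/sec figure.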

DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/gzip/README.md

Lines changed: 13 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -118,7 +118,7 @@ Performance results are based on testing as of August 30, 2023.
118118
119119
| Device | Throughput
120120
|:--- |:---
121-
| Terasic’s DE10-Agilex Development Board | 2 engines @ 4.6 GB/s
121+
| Intel® FPGA SmartNIC N6001-PL | 2 engines @ 7 GB/s
122122

123123
## Build the `GZIP` Design
124124

@@ -307,7 +307,7 @@ Performance results are based on testing as of August 30, 2023.
307307
## Example Output
308308
309309
```
310-
Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
310+
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
311311
Launching High-Bandwidth DMA GZIP application with 2 engines
312312
outputSize: 145706366 Prepin: 0
313313
kMinBufferSize: 16384 isz: 145706110 kInOutPadding: 256
@@ -325,21 +325,21 @@ outputSize: 145706366 Prepin: 0
325325
kMinBufferSize: 16384 isz: 145706110 kInOutPadding: 256
326326
outputSize: 145706366 Prepin: 0
327327
kMinBufferSize: 16384 isz: 145706110 kInOutPadding: 256
328-
Throughput: 4.62197 GB/s
328+
Throughput: 6.99 GB/s
329329

330330
TP breakdown for engine #0 (GB/s)
331-
CRC = 9.58499
332-
LZ77 = 9.22334
333-
Huffman Encoding = 4.51518
334-
DMA host-to-device = 8.92087
335-
DMA device-to-host = 9.85465
331+
CRC = 5.75029
332+
LZ77 = 3.51912
333+
Huffman Encoding = 3.5107
334+
DMA host-to-device = 9.26423
335+
DMA device-to-host = 7.4516
336336

337337
TP breakdown for engine #1 (GB/s)
338-
CRC = 9.58543
339-
LZ77 = 9.23241
340-
Huffman Encoding = 4.50995
341-
DMA host-to-device = 8.93201
342-
DMA device-to-host = 9.86107
338+
CRC = 5.75794
339+
LZ77 = 3.52021
340+
Huffman Encoding = 3.50743
341+
DMA host-to-device = 9.36199
342+
DMA device-to-host = 8.74803
343343

344344
Compression Ratio 43.9262%
345345
PASSED

DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/matmul/README.md

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -60,7 +60,7 @@ Performance results are based on testing as of March 6, 2023.
6060
6161
| Device | Throughput
6262
|:--- |:---
63-
| Terasic’s DE10-Agilex Development Board | 144k matrices/s for single-precision floating-point matrices of size 64 * 64, computed using a systolic array of 8 * 8 PEs (64 DSPs)
63+
| Intel® FPGA SmartNIC N6001-PL | 142k matrices/s for single-precision floating-point matrices of size 64 * 64, computed using a systolic array of 8 * 8 PEs (64 DSPs)
6464

6565
## Key Implementation Details
6666

@@ -354,16 +354,16 @@ You can perform the multiplication of the set of matrices repeatedly. This step
354354
355355
## Example Output
356356
357-
Example output when running on **Terasic’s DE10-Agilex Development Board** for the multiplication of 8 matrices 819200 times (each matrix consisting of 64x64 single-precision floating point numbers, computed using a systolic array of 8x8 PEs).
357+
Example output when running on **Intel® FPGA SmartNIC N6001-PL** for the multiplication of 8 matrices 819200 times (each matrix consisting of 64x64 single-precision floating point numbers, computed using a systolic array of 8x8 PEs).
358358
359359
```
360-
Running on device: de10_agilex : Agilex Reference Platform (aclde10_agilex0)
360+
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
361361
Matrix A size: 64 x 64 (tile: 8 x 64)
362362
Matrix B size: 64 x 64 (tile: 64 x 8)
363363
Systolic array size: 8 x 8 PEs
364364
Running matrix multiplication of 2 matrices 819200 times
365-
Total duration: 11.3746 s
366-
Throughput: 144.04k matrices/s
365+
Total duration: 11.4577 s
366+
Throughput: 142.995k matrices/s
367367

368368
PASSED
369369
```
@@ -372,4 +372,4 @@ PASSED
372372
373373
Code samples are licensed under the MIT license. See [License.txt](/License.txt) for details.
374374
375-
Third party program Licenses can be found here: [third-party-programs.txt](/third-party-programs.txt).
375+
Third party program Licenses can be found here: [third-party-programs.txt](/third-party-programs.txt).
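
The matmul throughput in the example output can be recomputed from the run parameters printed in the log. This is a rough check with hypothetical variable names; it uses the "2 matrices 819200 times" line from the log, since the reported throughput is consistent with that count and not with the 8 matrices mentioned in the surrounding sentence.

```python
matrices = 2              # "Running matrix multiplication of 2 matrices 819200 times"
repetitions = 819200
total_duration_s = 11.4577  # "Total duration" on the N6001

# Matrices multiplied per second over the whole run.
throughput = matrices * repetitions / total_duration_s

print(f"Throughput: {throughput / 1e3:.2f}k matrices/s")
```

The result lands within rounding of the logged 142.995k matrices/s; the last digit can differ because the log prints the duration with limited precision.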
