|
2 | 2 |
|
3 | 3 | This repository contains John D. McCalpin's [[https://www.cs.virginia.edu/stream/][STREAM benchmark]] and a |
4 | 4 | port of that benchmark to ISPC. The port to ISPC also uses dynamic |
5 | | -memory allocation. |
| 5 | +memory allocation, and it provides a straightforward way to |
| 6 | +ensure that streaming/"non-temporal" stores are used. |
6 | 7 |
|
7 | 8 | ** Building |
8 | 9 |
|
9 | | - There is a =Makefile= provided to build the two benchmark codes. |
10 | | - You may need to edit the =Makefile= to adjust compilers and |
11 | | - compiler flags for your system. To build just run =make= and the |
12 | | - executable files =build/stream= and =build/stream_ispc= should be |
13 | | - built. |
| 10 | +A =Makefile= is provided to build the benchmark codes. |
| 11 | +You may need to edit the =Makefile= to adjust compilers and |
| 12 | +compiler flags for your system. To build, run =make=; this should |
| 13 | +produce the executables =build/stream=, =build/stream_ispc=, and |
| 14 | +=build/stream_ispc_loopy=. Only the last of these |
| 15 | +uses streaming stores. |
14 | 16 |
|
15 | | -** Performance Results (aka =icc= vs =ispc=) |
| 17 | +** Performance Results |
16 | 18 |
|
17 | | - I compared =build/stream= and =build/stream_ispc= on a dual socket |
18 | | - Xeon CPU E5-2698 v3 system. I used the following build flags |
| 19 | +These results were obtained on a Raptor Lake (i7-1365U) laptop. |
19 | 20 |
|
20 | | - #+BEGIN_SRC |
21 | | - CC=icc |
22 | | - CFLAGS=-std=gnu99 -g -xHOST -O3 -ffreestanding -openmp |
| 21 | +Here is the version information for the compilers we are using: |
23 | 22 |
|
24 | | - CXX=icpc |
25 | | - CXXFLAGS=-g -xHOST -O3 -ffreestanding -openmp -DISPC_USE_OMP |
26 | | - |
27 | | - ISPC = ispc |
28 | | - ISPCFLAGS = --target=avx2-i32x8 --pic --opt=force-aligned-memory --werror |
29 | | - #+END_SRC |
30 | | - |
31 | | - Here is the version information for the compilers we are using: |
32 | | - |
33 | | - #+BEGIN_SRC sh :exports both |
34 | | - icc --version |
| 23 | +#+BEGIN_SRC sh :exports both |
| 24 | +gcc --version |
35 | 25 | #+END_SRC |
36 | | - #+results: |
37 | | - : icc (ICC) 16.0.1 20151021 |
38 | | - : Copyright (C) 1985-2015 Intel Corporation. All rights reserved. |
| 26 | +#+results: |
| 27 | +: gcc (Debian 14.2.0-16) 14.2.0 |
| 28 | +: Copyright (C) 2024 Free Software Foundation, Inc. |
| 29 | +: This is free software; see the source for copying conditions. There is NO |
| 30 | +: warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. |
39 | 31 |
|
40 | | - #+BEGIN_SRC sh :exports both |
41 | | - ispc --version |
42 | | - #+END_SRC |
43 | | - #+results: |
44 | | - : Intel(r) SPMD Program Compiler (ispc), 1.9.0 (build commit 89dfbf2125fc2cba @ 20160212, LLVM 3.8) |
45 | | - |
46 | | - Here is the output from the STREAM benchmark compiled with =icc=: |
47 | | - #+BEGIN_SRC sh :exports both |
48 | | - OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream |
49 | | - #+END_SRC |
50 | | - #+results: |
51 | | - : ------------------------------------------------------------- |
52 | | - : STREAM version $Revision: 5.10 $ |
53 | | - : ------------------------------------------------------------- |
54 | | - : This system uses 4 bytes per array element. |
55 | | - : ------------------------------------------------------------- |
56 | | - : Array size = 80000000 (elements), Offset = 0 (elements) |
57 | | - : Memory per array = 305.2 MiB (= 0.3 GiB). |
58 | | - : Total memory required = 915.5 MiB (= 0.9 GiB). |
59 | | - : Each kernel will be executed 10 times. |
60 | | - : The *best* time for each kernel (excluding the first iteration) |
61 | | - : will be used to compute the reported bandwidth. |
62 | | - : ------------------------------------------------------------- |
63 | | - : |
64 | | - : OPENMP DISPLAY ENVIRONMENT BEGIN |
65 | | - : _OPENMP='201307' |
66 | | - : [host] OMP_CANCELLATION='FALSE' |
67 | | - : [host] OMP_DISPLAY_ENV='TRUE' |
68 | | - : [host] OMP_DYNAMIC='FALSE' |
69 | | - : [host] OMP_MAX_ACTIVE_LEVELS='2147483647' |
70 | | - : [host] OMP_NESTED='FALSE' |
71 | | - : [host] OMP_NUM_THREADS: value is not defined |
72 | | - : [host] OMP_PLACES='cores' |
73 | | - : [host] OMP_PROC_BIND='spread' |
74 | | - : [host] OMP_SCHEDULE='static' |
75 | | - : [host] OMP_STACKSIZE='4M' |
76 | | - : [host] OMP_THREAD_LIMIT='2147483647' |
77 | | - : [host] OMP_WAIT_POLICY='PASSIVE' |
78 | | - : OPENMP DISPLAY ENVIRONMENT END |
79 | | - : |
80 | | - : |
81 | | - : Number of Threads requested = 32 |
82 | | - : Number of Threads counted = 32 |
83 | | - : ------------------------------------------------------------- |
84 | | - : Your clock granularity/precision appears to be 1 microseconds. |
85 | | - : Each test below will take on the order of 5419 microseconds. |
86 | | - : (= 5419 clock ticks) |
87 | | - : Increase the size of the arrays if this shows that |
88 | | - : you are not getting at least 20 clock ticks per test. |
89 | | - : ------------------------------------------------------------- |
90 | | - : WARNING -- The above is only a rough guideline. |
91 | | - : For best results, please be sure you know the |
92 | | - : precision of your system timer. |
93 | | - : ------------------------------------------------------------- |
94 | | - : Function Best Rate MB/s Avg time Min time Max time |
95 | | - : Copy: 104902.7 0.006141 0.006101 0.006308 |
96 | | - : Scale: 106522.0 0.006039 0.006008 0.006146 |
97 | | - : Add: 112215.9 0.008605 0.008555 0.008762 |
98 | | - : Triad: 112097.2 0.008595 0.008564 0.008710 |
99 | | - : ------------------------------------------------------------- |
100 | | - : Solution Validates: avg error less than 1.000000e-06 on all three arrays |
101 | | - : Results Validation Verbose Results: |
102 | | - : Expected a(1), b(1), c(1): 1153300692992.000000 230660145152.000000 307546849280.000000 |
103 | | - : Observed a(1), b(1), c(1): 1153300824064.000000 230660161536.000000 307546882048.000000 |
104 | | - : Rel Errors on a, b, c: 1.136495e-07 7.103091e-08 1.065464e-07 |
105 | | - : ------------------------------------------------------------- |
| 32 | +#+BEGIN_SRC sh :exports both |
| 33 | +ispc --version |
| 34 | +#+END_SRC |
| 35 | +#+results: |
| 36 | +: Intel(r) Implicit SPMD Program Compiler (Intel(r) ISPC), 1.25.3 (build @ 20241223, LLVM 19.1.6) |
106 | 37 |
|
| 38 | +Here is the output from the STREAM benchmark compiled with =gcc=: |
| 39 | +#+BEGIN_SRC sh :exports both |
| 40 | +OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream |
| 41 | +#+END_SRC |
| 42 | +#+results: |
| 43 | +: OPENMP DISPLAY ENVIRONMENT BEGIN |
| 44 | +: _OPENMP = '201511' |
| 45 | +: [host] OMP_DYNAMIC = 'FALSE' |
| 46 | +: [host] OMP_NESTED = 'FALSE' |
| 47 | +: [host] OMP_NUM_THREADS = '1' |
| 48 | +: [host] OMP_SCHEDULE = 'DYNAMIC' |
| 49 | +: [host] OMP_PROC_BIND = 'FALSE' |
| 50 | +: [host] OMP_PLACES = '{0:2},{2:2},{4},{5},{6},{7},{8},{9},{10},{11}' |
| 51 | +: [host] OMP_STACKSIZE = '0' |
| 52 | +: [host] OMP_WAIT_POLICY = 'PASSIVE' |
| 53 | +: [host] OMP_THREAD_LIMIT = '4294967295' |
| 54 | +: [host] OMP_MAX_ACTIVE_LEVELS = '1' |
| 55 | +: [host] OMP_NUM_TEAMS = '0' |
| 56 | +: [host] OMP_TEAMS_THREAD_LIMIT = '0' |
| 57 | +: [all] OMP_CANCELLATION = 'FALSE' |
| 58 | +: [all] OMP_DEFAULT_DEVICE = '0' |
| 59 | +: [all] OMP_MAX_TASK_PRIORITY = '0' |
| 60 | +: [all] OMP_DISPLAY_AFFINITY = 'FALSE' |
| 61 | +: [host] OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A' |
| 62 | +: [host] OMP_ALLOCATOR = 'omp_default_mem_alloc' |
| 63 | +: [all] OMP_TARGET_OFFLOAD = 'DEFAULT' |
| 64 | +: OPENMP DISPLAY ENVIRONMENT END |
| 65 | +: ------------------------------------------------------------- |
| 66 | +: STREAM version $Revision: 5.10 $ |
| 67 | +: ------------------------------------------------------------- |
| 68 | +: This system uses 4 bytes per array element. |
| 69 | +: ------------------------------------------------------------- |
| 70 | +: Array size = 80000000 (elements), Offset = 0 (elements) |
| 71 | +: Memory per array = 305.2 MiB (= 0.3 GiB). |
| 72 | +: Total memory required = 915.5 MiB (= 0.9 GiB). |
| 73 | +: Each kernel will be executed 10 times. |
| 74 | +: The *best* time for each kernel (excluding the first iteration) |
| 75 | +: will be used to compute the reported bandwidth. |
| 76 | +: ------------------------------------------------------------- |
| 77 | +: Number of Threads requested = 12 |
| 78 | +: Number of Threads counted = 12 |
| 79 | +: ------------------------------------------------------------- |
| 80 | +: Your clock granularity/precision appears to be 1 microseconds. |
| 81 | +: Each test below will take on the order of 11368 microseconds. |
| 82 | +: (= 11368 clock ticks) |
| 83 | +: Increase the size of the arrays if this shows that |
| 84 | +: you are not getting at least 20 clock ticks per test. |
| 85 | +: ------------------------------------------------------------- |
| 86 | +: WARNING -- The above is only a rough guideline. |
| 87 | +: For best results, please be sure you know the |
| 88 | +: precision of your system timer. |
| 89 | +: ------------------------------------------------------------- |
| 90 | +: Function Best Rate MB/s Avg time Min time Max time |
| 91 | +: Copy: 45432.1 0.029872 0.014087 0.049874 |
| 92 | +: Scale: 44736.3 0.035883 0.014306 0.051508 |
| 93 | +: Add: 48758.6 0.035604 0.019689 0.070311 |
| 94 | +: Triad: 49037.1 0.032059 0.019577 0.060323 |
| 95 | +: ------------------------------------------------------------- |
| 96 | +: Solution Validates: avg error less than 1.000000e-06 on all three arrays |
| 97 | +: Results Validation Verbose Results: |
| 98 | +: Expected a(1), b(1), c(1): 1153300692992.000000 230660145152.000000 307546849280.000000 |
| 99 | +: Observed a(1), b(1), c(1): 1153300824064.000000 230660161536.000000 307546882048.000000 |
| 100 | +: Rel Errors on a, b, c: 2.383402e-08 1.489626e-08 2.234439e-08 |
| 101 | +: ------------------------------------------------------------- |
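As a cross-check on the =gcc= table above, the Best Rate column can be reproduced from the Min time column. The sketch below assumes the byte-counting convention of the reference =stream.c= (two arrays touched per element for Copy and Scale, three for Add and Triad) and 10^6 bytes per MB. The name =stream_rate_mb_s= is ours for illustration and is not part of the benchmark.

```python
# Reproduce STREAM's "Best Rate MB/s" from the reported minimum times.
ARRAY_SIZE = 80_000_000  # elements, from the benchmark banner
BYTES_PER_ELEMENT = 4    # "This system uses 4 bytes per array element."

# Arrays read or written per element (assumed stream.c convention).
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel, min_time_s):
    """Bandwidth in MB/s (1 MB = 10^6 bytes) for one kernel."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * ARRAY_SIZE
    return 1.0e-6 * bytes_moved / min_time_s


# (min time in seconds, reported best rate in MB/s) from the gcc run
GCC_RUN = {
    "Copy": (0.014087, 45432.1),
    "Scale": (0.014306, 44736.3),
    "Add": (0.019689, 48758.6),
    "Triad": (0.019577, 49037.1),
}

if __name__ == "__main__":
    for kernel, (min_time, reported) in GCC_RUN.items():
        computed = stream_rate_mb_s(kernel, min_time)
        print(f"{kernel:6s} computed {computed:9.1f}  reported {reported:9.1f}")
```

The computed and reported values differ only in the last digits, because the printed minimum times are rounded to six decimal places.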
107 | 102 |
|
108 | | - Here is the output from the modified STREAM benchmark with kernels |
109 | | - compiled with =ispc=: |
110 | | - #+BEGIN_SRC sh :exports both |
111 | | - OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream_ispc |
112 | | - #+END_SRC |
113 | | - #+results: |
114 | | - : Array size = 80000000 (elements) |
115 | | - : Memory per array = 305.2 MiB (= 0.3 GiB). |
116 | | - : Total memory required = 915.5 MiB (= 0.9 GiB). |
117 | | - : Chunk size: 16384 |
118 | | - : Page size: 4096 |
119 | | - : Cache line size: 64 |
120 | | - : sizeof(STREAM_TYPE): 4 |
121 | | - : Each kernel will be executed 10 times. |
122 | | - : The *best* time for each kernel (excluding the first iteration) |
123 | | - : will be used to compute the reported bandwidth. |
124 | | - : ------------------------------------------------------------- |
125 | | - : |
126 | | - : OPENMP DISPLAY ENVIRONMENT BEGIN |
127 | | - : _OPENMP='201307' |
128 | | - : [host] OMP_CANCELLATION='FALSE' |
129 | | - : [host] OMP_DISPLAY_ENV='TRUE' |
130 | | - : [host] OMP_DYNAMIC='FALSE' |
131 | | - : [host] OMP_MAX_ACTIVE_LEVELS='2147483647' |
132 | | - : [host] OMP_NESTED='FALSE' |
133 | | - : [host] OMP_NUM_THREADS: value is not defined |
134 | | - : [host] OMP_PLACES='cores' |
135 | | - : [host] OMP_PROC_BIND='spread' |
136 | | - : [host] OMP_SCHEDULE='static' |
137 | | - : [host] OMP_STACKSIZE='4M' |
138 | | - : [host] OMP_THREAD_LIMIT='2147483647' |
139 | | - : [host] OMP_WAIT_POLICY='PASSIVE' |
140 | | - : OPENMP DISPLAY ENVIRONMENT END |
141 | | - : |
142 | | - : |
143 | | - : ------------------------------------------------------------- |
144 | | - : Each test below will take on the order of 6482 microseconds. |
145 | | - : ------------------------------------------------------------- |
146 | | - : ------------------------------------------------------------- |
147 | | - : Function Best Rate MB/s Avg time Min time Max time |
148 | | - : Copy: 75179.7 0.008546 0.008513 0.008603 |
149 | | - : Scale: 73558.4 0.008729 0.008701 0.008792 |
150 | | - : Add: 83152.5 0.011573 0.011545 0.011613 |
151 | | - : Triad: 83805.1 0.011485 0.011455 0.011520 |
152 | | - : ------------------------------------------------------------- |
153 | | - : Solution Validates: avg error less than 1.000000e-06 on all three arrays |
154 | | - : Results Validation Verbose Results: |
155 | | - : Expected a(1), b(1), c(1): 1153300692992.000000 230660145152.000000 307546849280.000000 |
156 | | - : Observed a(1), b(1), c(1): 1153300824064.000000 230660161536.000000 307546882048.000000 |
157 | | - : Rel Errors on a, b, c: 1.136495e-07 7.103091e-08 1.065464e-07 |
158 | | - : ------------------------------------------------------------- |
| 103 | +Here is the output from the modified STREAM benchmark with kernels |
| 104 | +compiled with =ispc= using =ispc='s high-level loop constructs and |
| 105 | +without streaming stores: |
| 106 | +#+BEGIN_SRC sh :exports both |
| 107 | +OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream_ispc |
| 108 | +#+END_SRC |
| 109 | +#+results: |
| 110 | +: ------------------------------------------------------------- |
| 111 | +: Function Best Rate MB/s Avg time Min time Max time |
| 112 | +: Copy: 49986.2 0.013017 0.012804 0.013866 |
| 113 | +: Scale: 49472.4 0.013018 0.012937 0.013115 |
| 114 | +: Add: 52357.1 0.018545 0.018336 0.019054 |
| 115 | +: Triad: 52421.3 0.019951 0.018313 0.030611 |
| 116 | +: ------------------------------------------------------------- |
159 | 117 |
|
160 | | - As we can see the =icc= version is about 1.3x faster than the |
161 | | - =ispc= version. I tried various memory alignment procedures but |
162 | | - could not improve the performance of the =ispc= version of the |
163 | | - benchmark. |
| 118 | +And here is the output from the modified STREAM benchmark with =ispc= |
| 119 | +kernels generated by [[https://github.com/inducer/loopy][loopy]], which |
| 120 | +use streaming stores: |
| 121 | +#+BEGIN_SRC sh :exports both |
| 122 | +OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream_ispc_loopy |
| 123 | +#+END_SRC |
| 124 | +#+results: |
| 125 | +: ------------------------------------------------------------- |
| 126 | +: Function Best Rate MB/s Avg time Min time Max time |
| 127 | +: Copy: 65615.9 0.015429 0.009754 0.030991 |
| 128 | +: Scale: 66276.3 0.015031 0.009657 0.031078 |
| 129 | +: Add: 63480.7 0.017027 0.015123 0.029783 |
| 130 | +: Triad: 63078.4 0.019036 0.015219 0.028874 |
| 131 | +: ------------------------------------------------------------- |
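To compare the three builds at a glance, the Best Rate columns above can be turned into ratios. This is a minimal sketch: the rates are copied from the three result tables, and the labels =gcc=, =ispc=, and =ispc_loopy= as well as the helper =speedup_over= are names chosen here for illustration. The numbers come from single runs on a laptop with a wide spread between Min and Max time, so the ratios are indicative only.

```python
# Best Rate (MB/s) per kernel, copied from the three result tables.
RATES = {
    "gcc": {"Copy": 45432.1, "Scale": 44736.3, "Add": 48758.6, "Triad": 49037.1},
    "ispc": {"Copy": 49986.2, "Scale": 49472.4, "Add": 52357.1, "Triad": 52421.3},
    "ispc_loopy": {"Copy": 65615.9, "Scale": 66276.3, "Add": 63480.7, "Triad": 63078.4},
}


def speedup_over(baseline, candidate):
    """Per-kernel ratio of candidate bandwidth to baseline bandwidth."""
    return {
        kernel: RATES[candidate][kernel] / RATES[baseline][kernel]
        for kernel in RATES[baseline]
    }


if __name__ == "__main__":
    for candidate in ("ispc", "ispc_loopy"):
        ratios = speedup_over("gcc", candidate)
        line = "  ".join(f"{k}: {r:.2f}x" for k, r in ratios.items())
        print(f"{candidate:11s} vs gcc  {line}")
```

On these numbers, the streaming-store build reaches roughly 1.3x to 1.5x the bandwidth of the =gcc= build, with the larger gains on Copy and Scale.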