Commit b0cbb59

Update README for new results
1 parent 3b6edab commit b0cbb59

1 file changed: +128 -145 lines changed

README.org

Lines changed: 128 additions & 145 deletions
@@ -2,162 +2,145 @@

 This repository contains John D. McCalpin's [[https://www.cs.virginia.edu/stream/][STREAM benchmark]] and a
 port of that benchmark to ISPC. The port to ISPC also uses dynamic
-memory allocation.
+memory allocation, and it provides a straightforward way to
+ensure that streaming/"non-temporal" stores are used.

 ** Building

-There is a =Makefile= provided to build the two benchmark codes.
-You may need to edit the =Makefile= to adjust compilers and
-compiler flags for your system. To build just run =make= and the
-executable files =build/stream= and =build/stream_ispc= should be
-built.
+There is a =Makefile= provided to build the two benchmark codes.
+You may need to edit the =Makefile= to adjust compilers and
+compiler flags for your system. To build, just run =make=, and the
+executable files =build/stream=, =build/stream_ispc=, and
+=build/stream_ispc_loopy= should be built. Only the last of these
+uses streaming stores.

-** Performance Results (aka =icc= vs =ispc=)
+** Performance Results

-I compared =build/stream= and =build/stream_ispc= on a dual socket
-Xeon CPU E5-2698 v3 system. I used the following build flags
+These results were obtained on a Raptor Lake (i7-1365U) laptop.

-#+BEGIN_SRC
-CC=icc
-CFLAGS=-std=gnu99 -g -xHOST -O3 -ffreestanding -openmp
+Here is the version information for the compilers we are using:

-CXX=icpc
-CXXFLAGS=-g -xHOST -O3 -ffreestanding -openmp -DISPC_USE_OMP
-
-ISPC = ispc
-ISPCFLAGS = --target=avx2-i32x8 --pic --opt=force-aligned-memory --werror
+#+BEGIN_SRC sh :exports both
+gcc --version
 #+END_SRC
+#+results:
+: gcc (Debian 14.2.0-16) 14.2.0
+: Copyright (C) 2024 Free Software Foundation, Inc.
+: This is free software; see the source for copying conditions. There is NO
+: warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

-Here is the version information for the compilers we are using:
-
-#+BEGIN_SRC sh :exports both
-icc --version
-#+END_SRC
-#+results:
-: icc (ICC) 16.0.1 20151021
-: Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
+#+BEGIN_SRC sh :exports both
+ispc --version
+#+END_SRC
+#+results:
+: Intel(r) Implicit SPMD Program Compiler (Intel(r) ISPC), 1.25.3 (build @ 20241223, LLVM 19.1.6)

-#+BEGIN_SRC sh :exports both
-ispc --version
-#+END_SRC
-#+results:
-: Intel(r) SPMD Program Compiler (ispc), 1.9.0 (build commit 89dfbf2125fc2cba @ 20160212, LLVM 3.8)
+Here is the benchmark result from the STREAM benchmark compiled with =gcc=:
+#+BEGIN_SRC sh :exports both
+OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream
+#+END_SRC
+#+results:
+: -------------------------------------------------------------
+: Function    Best Rate MB/s  Avg time     Min time     Max time
+: Copy:           45432.1     0.029872     0.014087     0.049874
+: Scale:          44736.3     0.035883     0.014306     0.051508
+: Add:            48758.6     0.035604     0.019689     0.070311
+: Triad:          49037.1     0.032059     0.019577     0.060323
+: -------------------------------------------------------------

-Here is the output from the STREAM benchmark compiled with =icc=:
-#+BEGIN_SRC sh :exports both
-OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream
-#+END_SRC
-#+results:
-: -------------------------------------------------------------
-: STREAM version $Revision: 5.10 $
-: -------------------------------------------------------------
-: This system uses 4 bytes per array element.
-: -------------------------------------------------------------
-: Array size = 80000000 (elements), Offset = 0 (elements)
-: Memory per array = 305.2 MiB (= 0.3 GiB).
-: Total memory required = 915.5 MiB (= 0.9 GiB).
-: Each kernel will be executed 10 times.
-: The *best* time for each kernel (excluding the first iteration)
-: will be used to compute the reported bandwidth.
-: -------------------------------------------------------------
-:
-: OPENMP DISPLAY ENVIRONMENT BEGIN
-: _OPENMP='201307'
-: [host] OMP_CANCELLATION='FALSE'
-: [host] OMP_DISPLAY_ENV='TRUE'
-: [host] OMP_DYNAMIC='FALSE'
-: [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
-: [host] OMP_NESTED='FALSE'
-: [host] OMP_NUM_THREADS: value is not defined
-: [host] OMP_PLACES='cores'
-: [host] OMP_PROC_BIND='spread'
-: [host] OMP_SCHEDULE='static'
-: [host] OMP_STACKSIZE='4M'
-: [host] OMP_THREAD_LIMIT='2147483647'
-: [host] OMP_WAIT_POLICY='PASSIVE'
-: OPENMP DISPLAY ENVIRONMENT END
-:
-:
-: Number of Threads requested = 32
-: Number of Threads counted = 32
-: -------------------------------------------------------------
-: Your clock granularity/precision appears to be 1 microseconds.
-: Each test below will take on the order of 5419 microseconds.
-: (= 5419 clock ticks)
-: Increase the size of the arrays if this shows that
-: you are not getting at least 20 clock ticks per test.
-: -------------------------------------------------------------
-: WARNING -- The above is only a rough guideline.
-: For best results, please be sure you know the
-: precision of your system timer.
-: -------------------------------------------------------------
-: Function    Best Rate MB/s  Avg time     Min time     Max time
-: Copy:          104902.7     0.006141     0.006101     0.006308
-: Scale:         106522.0     0.006039     0.006008     0.006146
-: Add:           112215.9     0.008605     0.008555     0.008762
-: Triad:         112097.2     0.008595     0.008564     0.008710
-: -------------------------------------------------------------
-: Solution Validates: avg error less than 1.000000e-06 on all three arrays
-: Results Validation Verbose Results:
-: Expected a(1), b(1), c(1): 1153300692992.000000 230660145152.000000 307546849280.000000
-: Observed a(1), b(1), c(1): 1153300824064.000000 230660161536.000000 307546882048.000000
-: Rel Errors on a, b, c: 1.136495e-07 7.103091e-08 1.065464e-07
-: -------------------------------------------------------------
+Here is the result from the modified STREAM benchmark with kernels
+compiled with =ispc= using =ispc='s high-level loop constructs and
+without streaming stores:
+#+BEGIN_SRC sh :exports both
+OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream_ispc
+#+END_SRC
+#+results:
+: -------------------------------------------------------------
+: Function    Best Rate MB/s  Avg time     Min time     Max time
+: Copy:           49986.2     0.013017     0.012804     0.013866
+: Scale:          49472.4     0.013018     0.012937     0.013115
+: Add:            52357.1     0.018545     0.018336     0.019054
+: Triad:          52421.3     0.019951     0.018313     0.030611
+: -------------------------------------------------------------

+And here are the results from the modified STREAM benchmark with kernels
+compiled with =ispc= using kernels generated by [[https://github.com/inducer/loopy][loopy]], including the
+use of streaming stores:
+#+BEGIN_SRC sh :exports both
+OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream_ispc_loopy
+#+END_SRC
+#+results:
+: -------------------------------------------------------------
+: Function    Best Rate MB/s  Avg time     Min time     Max time
+: Copy:           65615.9     0.015429     0.009754     0.030991
+: Scale:          66276.3     0.015031     0.009657     0.031078
+: Add:            63480.7     0.017027     0.015123     0.029783
+: Triad:          63078.4     0.019036     0.015219     0.028874
+: -------------------------------------------------------------
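
To put the three new result tables side by side, the following minimal Python sketch computes each ISPC build's best rate relative to the =gcc= build. The numbers are copied from the tables in this diff; the =rates= dictionary and the =speedup= helper are names made up for this illustration and are not part of the repository.

```python
# Best-rate figures (MB/s) copied from the three new result tables.
rates = {
    "gcc":        {"Copy": 45432.1, "Scale": 44736.3, "Add": 48758.6, "Triad": 49037.1},
    "ispc":       {"Copy": 49986.2, "Scale": 49472.4, "Add": 52357.1, "Triad": 52421.3},
    "ispc_loopy": {"Copy": 65615.9, "Scale": 66276.3, "Add": 63480.7, "Triad": 63078.4},
}


def speedup(build, baseline="gcc"):
    """Return the per-kernel ratio of `build`'s best rate to `baseline`'s."""
    return {
        kernel: rates[build][kernel] / rates[baseline][kernel]
        for kernel in rates[baseline]
    }


if __name__ == "__main__":
    for build in ("ispc", "ispc_loopy"):
        for kernel, ratio in speedup(build).items():
            print(f"{build:>10} {kernel:<6} {ratio:.2f}x vs gcc")
```

On these figures the plain =ispc= build is roughly 1.07x to 1.11x ahead of =gcc=, and the loopy build with streaming stores roughly 1.29x to 1.48x ahead. These are single runs on a laptop, so the ratios are indicative only.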

-Here is the output from the modified STREAM benchmark with kernels
-compiled with =ispc=:
-#+BEGIN_SRC sh :exports both
-OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream_ispc
-#+END_SRC
-#+results:
-: Array size = 80000000 (elements)
-: Memory per array = 305.2 MiB (= 0.3 GiB).
-: Total memory required = 915.5 MiB (= 0.9 GiB).
-: Chunk size: 16384
-: Page size: 4096
-: Cache line size: 64
-: sizeof(STREAM_TYPE): 4
-: Each kernel will be executed 10 times.
-: The *best* time for each kernel (excluding the first iteration)
-: will be used to compute the reported bandwidth.
-: -------------------------------------------------------------
-:
-: OPENMP DISPLAY ENVIRONMENT BEGIN
-: _OPENMP='201307'
-: [host] OMP_CANCELLATION='FALSE'
-: [host] OMP_DISPLAY_ENV='TRUE'
-: [host] OMP_DYNAMIC='FALSE'
-: [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
-: [host] OMP_NESTED='FALSE'
-: [host] OMP_NUM_THREADS: value is not defined
-: [host] OMP_PLACES='cores'
-: [host] OMP_PROC_BIND='spread'
-: [host] OMP_SCHEDULE='static'
-: [host] OMP_STACKSIZE='4M'
-: [host] OMP_THREAD_LIMIT='2147483647'
-: [host] OMP_WAIT_POLICY='PASSIVE'
-: OPENMP DISPLAY ENVIRONMENT END
-:
-:
-: -------------------------------------------------------------
-: Each test below will take on the order of 6482 microseconds.
-: -------------------------------------------------------------
-: -------------------------------------------------------------
-: Function    Best Rate MB/s  Avg time     Min time     Max time
-: Copy:           75179.7     0.008546     0.008513     0.008603
-: Scale:          73558.4     0.008729     0.008701     0.008792
-: Add:            83152.5     0.011573     0.011545     0.011613
-: Triad:          83805.1     0.011485     0.011455     0.011520
-: -------------------------------------------------------------
-: Solution Validates: avg error less than 1.000000e-06 on all three arrays
-: Results Validation Verbose Results:
-: Expected a(1), b(1), c(1): 1153300692992.000000 230660145152.000000 307546849280.000000
-: Observed a(1), b(1), c(1): 1153300824064.000000 230660161536.000000 307546882048.000000
-: Rel Errors on a, b, c: 1.136495e-07 7.103091e-08 1.065464e-07
-: -------------------------------------------------------------
+** Full Sample Output

-As we can see the =icc= version is about 1.3x faster than the
-=ispc= version. I tried various memory alignment procedures but
-could not improve the performance of the =ispc= version of the
-benchmark.
+For completeness, here is the full output from the STREAM benchmark compiled with =gcc=:
+#+BEGIN_SRC sh :exports both
+OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream
+#+END_SRC
+#+results:
+: OPENMP DISPLAY ENVIRONMENT BEGIN
+: _OPENMP = '201511'
+: [host] OMP_DYNAMIC = 'FALSE'
+: [host] OMP_NESTED = 'FALSE'
+: [host] OMP_NUM_THREADS = '1'
+: [host] OMP_SCHEDULE = 'DYNAMIC'
+: [host] OMP_PROC_BIND = 'FALSE'
+: [host] OMP_PLACES = '{0:2},{2:2},{4},{5},{6},{7},{8},{9},{10},{11}'
+: [host] OMP_STACKSIZE = '0'
+: [host] OMP_WAIT_POLICY = 'PASSIVE'
+: [host] OMP_THREAD_LIMIT = '4294967295'
+: [host] OMP_MAX_ACTIVE_LEVELS = '1'
+: [host] OMP_NUM_TEAMS = '0'
+: [host] OMP_TEAMS_THREAD_LIMIT = '0'
+: [all] OMP_CANCELLATION = 'FALSE'
+: [all] OMP_DEFAULT_DEVICE = '0'
+: [all] OMP_MAX_TASK_PRIORITY = '0'
+: [all] OMP_DISPLAY_AFFINITY = 'FALSE'
+: [host] OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
+: [host] OMP_ALLOCATOR = 'omp_default_mem_alloc'
+: [all] OMP_TARGET_OFFLOAD = 'DEFAULT'
+: OPENMP DISPLAY ENVIRONMENT END
+: -------------------------------------------------------------
+: STREAM version $Revision: 5.10 $
+: -------------------------------------------------------------
+: This system uses 4 bytes per array element.
+: -------------------------------------------------------------
+: Array size = 80000000 (elements), Offset = 0 (elements)
+: Memory per array = 305.2 MiB (= 0.3 GiB).
+: Total memory required = 915.5 MiB (= 0.9 GiB).
+: Each kernel will be executed 10 times.
+: The *best* time for each kernel (excluding the first iteration)
+: will be used to compute the reported bandwidth.
+: -------------------------------------------------------------
+: Number of Threads requested = 12
+: Number of Threads counted = 12
+: -------------------------------------------------------------
+: Your clock granularity/precision appears to be 1 microseconds.
+: Each test below will take on the order of 11368 microseconds.
+: (= 11368 clock ticks)
+: Increase the size of the arrays if this shows that
+: you are not getting at least 20 clock ticks per test.
+: -------------------------------------------------------------
+: WARNING -- The above is only a rough guideline.
+: For best results, please be sure you know the
+: precision of your system timer.
+: -------------------------------------------------------------
+: Function    Best Rate MB/s  Avg time     Min time     Max time
+: Copy:           45432.1     0.029872     0.014087     0.049874
+: Scale:          44736.3     0.035883     0.014306     0.051508
+: Add:            48758.6     0.035604     0.019689     0.070311
+: Triad:          49037.1     0.032059     0.019577     0.060323
+: -------------------------------------------------------------
+: Solution Validates: avg error less than 1.000000e-06 on all three arrays
+: Results Validation Verbose Results:
+: Expected a(1), b(1), c(1): 1153300692992.000000 230660145152.000000 307546849280.000000
+: Observed a(1), b(1), c(1): 1153300824064.000000 230660161536.000000 307546882048.000000
+: Rel Errors on a, b, c: 2.383402e-08 1.489626e-08 2.234439e-08
+: -------------------------------------------------------------
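
The "Best Rate" column in the full output can be cross-checked against the "Min time" column, the array size, and the element size printed in the same log. The sketch below assumes the standard STREAM accounting, in which Copy and Scale are charged for two arrays, Add and Triad for three, and 1 MB means 10^6 bytes. The function and constant names are hypothetical and chosen only for this illustration.

```python
BYTES_PER_ELEMENT = 4      # "This system uses 4 bytes per array element."
ARRAY_SIZE = 80_000_000    # "Array size = 80000000 (elements)"

# Arrays read plus arrays written per kernel, as STREAM counts them.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# (reported best rate in MB/s, reported min time in s) from the gcc run.
REPORTED = {
    "Copy":  (45432.1, 0.014087),
    "Scale": (44736.3, 0.014306),
    "Add":   (48758.6, 0.019689),
    "Triad": (49037.1, 0.019577),
}


def best_rate_mb_s(kernel, min_time_s):
    """Recompute a kernel's best rate in MB/s from its minimum time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * ARRAY_SIZE
    return 1.0e-6 * bytes_moved / min_time_s


if __name__ == "__main__":
    for kernel, (rate, min_time) in REPORTED.items():
        recomputed = best_rate_mb_s(kernel, min_time)
        print(f"{kernel:<6} reported {rate:9.1f}  recomputed {recomputed:9.1f}")
```

Because the log prints the minimum time to only six decimal places, the recomputed rates match the reported ones to within about 1 MB/s rather than exactly.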
