STREAM Bandwidth Example for ISPC

This repository contains John D. McCalpin’s STREAM benchmark and a port of that benchmark to ISPC. The ISPC port additionally uses dynamic memory allocation, and it provides a straightforward way to ensure that streaming ("non-temporal") stores are used.
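To make the term concrete, here is a minimal C sketch of a copy kernel that writes its output with streaming stores. It is an illustration of the concept only, not the repository's ISPC code: the helper names alloc_floats and copy_streaming are made up for this example, and the x86 SSE intrinsics it uses are an assumption about the target machine (a plain loop is used when SSE is unavailable).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* _mm_load_ps, _mm_stream_ps, _mm_sfence */
#endif

/* Hypothetical helper: allocate n floats aligned to a 64-byte cache line.
   The byte count is rounded up because C11 aligned_alloc expects the size
   to be a multiple of the alignment. */
static float *alloc_floats(size_t n)
{
    size_t bytes = (n * sizeof(float) + 63) / 64 * 64;
    return aligned_alloc(64, bytes);
}

/* Hypothetical helper: copy n floats from src to dst using non-temporal
   stores, which write to memory without first pulling the destination
   cache lines into the cache. Requires n to be a multiple of 4 and both
   pointers to be 16-byte aligned. */
static void copy_streaming(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
#if defined(__SSE__)
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);  /* ordinary aligned load */
        _mm_stream_ps(dst + i, v);        /* non-temporal store */
    }
    _mm_sfence();  /* order the streaming stores before later stores */
#else
    /* Fallback without SSE: ordinary stores, same result. */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
#endif
}
```

A caller would allocate both arrays with alloc_floats, fill the source, and call copy_streaming(dst, src, n). The result is the same as an ordinary copy; only the way the stores reach memory differs.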

Building

A Makefile is provided to build the benchmark codes. You may need to edit it to adjust the compilers and compiler flags for your system. To build, run make; this should produce the executables build/stream, build/stream_ispc, and build/stream_ispc_loopy. Only the last of these uses streaming stores.

Performance Results

These results were obtained on a Raptor Lake (i7-1365U) laptop.

Here is the version information for the compilers we are using:

gcc --version

gcc (Debian 14.2.0-16) 14.2.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ispc --version

Intel(r) Implicit SPMD Program Compiler (Intel(r) ISPC), 1.25.3 (build  @ 20241223, LLVM 19.1.6)

Here is the benchmark result from the STREAM benchmark compiled with gcc:

OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           45432.1     0.029872     0.014087     0.049874
Scale:          44736.3     0.035883     0.014306     0.051508
Add:            48758.6     0.035604     0.019689     0.070311
Triad:          49037.1     0.032059     0.019577     0.060323
-------------------------------------------------------------
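The Best Rate column follows from the Min time column. STREAM counts the bytes each kernel nominally moves (two arrays for Copy and Scale, three for Add and Triad) and divides by the best time, in units of 10^6 bytes per second. The sketch below reproduces that arithmetic with the array size (80000000 elements) and element size (4 bytes) reported in the full sample output. The function name stream_rate_mbs is made up for this illustration and does not come from the repository.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the repository): nominal STREAM bandwidth
   in MB/s (10^6 bytes per second) for a kernel that touches `arrays`
   arrays of `n` elements of `elem_bytes` bytes each, given its best
   (minimum) time in seconds. */
static double stream_rate_mbs(int arrays, size_t elem_bytes, size_t n,
                              double min_time_s)
{
    assert(arrays > 0 && min_time_s > 0.0);
    double bytes = (double)arrays * (double)elem_bytes * (double)n;
    return 1.0e-6 * bytes / min_time_s;
}
```

For the Copy row above, stream_rate_mbs(2, 4, 80000000, 0.014087) comes out within about 1 MB/s of the reported 45432.1; the small gap arises because the printed Min time is rounded to six decimals. This nominal count does not include any extra reads a cache may perform before an ordinary store, which may be part of why a build that uses streaming stores can report higher rates.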

Here is the result from the modified STREAM benchmark with kernels compiled with ispc using ispc’s high-level loop constructs and without streaming stores:

OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream_ispc

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           49986.2     0.013017     0.012804     0.013866
Scale:          49472.4     0.013018     0.012937     0.013115
Add:            52357.1     0.018545     0.018336     0.019054
Triad:          52421.3     0.019951     0.018313     0.030611
-------------------------------------------------------------

And here are the results from the modified STREAM benchmark with kernels generated by loopy and compiled with ispc, including the use of streaming stores:

OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream_ispc_loopy

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           65615.9     0.015429     0.009754     0.030991
Scale:          66276.3     0.015031     0.009657     0.031078
Add:            63480.7     0.017027     0.015123     0.029783
Triad:          63078.4     0.019036     0.015219     0.028874
-------------------------------------------------------------

Full Sample Output

For completeness, here is the full output from the STREAM benchmark compiled with gcc:

OMP_PLACES=cores OMP_DISPLAY_ENV=true ./build/stream

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  [host] OMP_DYNAMIC = 'FALSE'
  [host] OMP_NESTED = 'FALSE'
  [host] OMP_NUM_THREADS = '1'
  [host] OMP_SCHEDULE = 'DYNAMIC'
  [host] OMP_PROC_BIND = 'FALSE'
  [host] OMP_PLACES = '{0:2},{2:2},{4},{5},{6},{7},{8},{9},{10},{11}'
  [host] OMP_STACKSIZE = '0'
  [host] OMP_WAIT_POLICY = 'PASSIVE'
  [host] OMP_THREAD_LIMIT = '4294967295'
  [host] OMP_MAX_ACTIVE_LEVELS = '1'
  [host] OMP_NUM_TEAMS = '0'
  [host] OMP_TEAMS_THREAD_LIMIT = '0'
  [all] OMP_CANCELLATION = 'FALSE'
  [all] OMP_DEFAULT_DEVICE = '0'
  [all] OMP_MAX_TASK_PRIORITY = '0'
  [all] OMP_DISPLAY_AFFINITY = 'FALSE'
  [host] OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
  [host] OMP_ALLOCATOR = 'omp_default_mem_alloc'
  [all] OMP_TARGET_OFFLOAD = 'DEFAULT'
OPENMP DISPLAY ENVIRONMENT END
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 4 bytes per array element.
-------------------------------------------------------------
Array size = 80000000 (elements), Offset = 0 (elements)
Memory per array = 305.2 MiB (= 0.3 GiB).
Total memory required = 915.5 MiB (= 0.9 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 12
Number of Threads counted = 12
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 11368 microseconds.
   (= 11368 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           45432.1     0.029872     0.014087     0.049874
Scale:          44736.3     0.035883     0.014306     0.051508
Add:            48758.6     0.035604     0.019689     0.070311
Triad:          49037.1     0.032059     0.019577     0.060323
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-06 on all three arrays
Results Validation Verbose Results:
    Expected a(1), b(1), c(1): 1153300692992.000000 230660145152.000000 307546849280.000000
    Observed a(1), b(1), c(1): 1153300824064.000000 230660161536.000000 307546882048.000000
    Rel Errors on a, b, c:     2.383402e-08 1.489626e-08 2.234439e-08
-------------------------------------------------------------
