Commit b09f87b

More precise instructions, corrected TFlops/s estimate for Intel Data Center GPU Max series
1 parent c6a154e commit b09f87b

2 files changed

+79
-9
lines changed

README.md

Lines changed: 77 additions & 8 deletions
@@ -16,16 +16,19 @@ Works with any GPU in Windows, Linux, macOS and Android.
1616
## How to use?
1717

1818
### Windows
19-
- Compile+Run
20-
- open `OpenCL-Benchmark.vcxproj` in Visual Studio Community
21-
- click compile+run
22-
- Run
19+
- Download and install [Visual Studio Community](https://visualstudio.microsoft.com/de/vs/community/). In Visual Studio Installer, add:
20+
- Desktop development with C++
21+
- MSVC v142
22+
- Windows 10 SDK
23+
- Open [`OpenCL-Benchmark.sln`](OpenCL-Benchmark.sln) in [Visual Studio Community](https://visualstudio.microsoft.com/de/vs/community/).
24+
- Compile and run by clicking the <kbd>► Local Windows Debugger</kbd> button.
25+
- To run outside of [Visual Studio Community](https://visualstudio.microsoft.com/de/vs/community/), open Windows CMD in the `OpenCL-Benchmark` folder (type `cmd` in File Explorer in the directory field and press <kbd>Enter</kbd>), then run
2326
```
2427
OpenCL-Benchmark.exe
2528
```
2629

2730
### Linux
28-
- Compile+Run
31+
- Download, compile and run:
2932
```
3033
git clone https://github.com/ProjectPhysX/OpenCL-Benchmark.git
3134
cd OpenCL-Benchmark
@@ -38,7 +41,7 @@ Works with any GPU in Windows, Linux, macOS and Android.
3841
```
3942

4043
### macOS
41-
- Compile+Run
44+
- Download, compile and run:
4245
```
4346
git clone https://github.com/ProjectPhysX/OpenCL-Benchmark.git
4447
cd OpenCL-Benchmark
@@ -135,11 +138,37 @@ Works with any GPU in Windows, Linux, macOS and Android.
135138
| Device ID 3 | gfx906:sramecc+:xnack- |
136139
| Device ID 4 | gfx906:sramecc+:xnack- |
137140
| Device ID 5 | gfx906:sramecc+:xnack- |
138-
| Device ID 6 | gfx906:sramecc+:xnack- |
141+
| Device ID 6 | gfx90a:sramecc+:xnack- |
139142
| Device ID 7 | gfx906:sramecc+:xnack- |
140143
|----------------'------------------------------------------------------------|
141144
|----------------.------------------------------------------------------------|
142-
| Device ID | 0 |
145+
| Device ID | 6 |
146+
| Device Name | gfx90a:sramecc+:xnack- |
147+
| Device Vendor | Advanced Micro Devices, Inc. |
148+
| Device Driver | 3513.0 (HSA1.1,LC) |
149+
| OpenCL Version | OpenCL C 2.0 |
150+
| Compute Units | 104 at 1700 MHz (6656 cores, 22.630 TFLOPs/s) |
151+
| Memory, Cache | 65520 MB, 16 KB global / 64 KB local |
152+
| Buffer Limits | 55692 MB global, 57028608 KB constant |
153+
|----------------'------------------------------------------------------------|
154+
| Info: OpenCL C code successfully compiled. |
155+
| FP64 compute 17.356 TFLOPs/s (2/3 ) |
156+
| FP32 compute 19.995 TFLOPs/s ( 1x ) |
157+
| FP16 compute 39.680 TFLOPs/s ( 2x ) |
158+
| INT64 compute 1.511 TIOPs/s (1/16) |
159+
| INT32 compute 18.441 TIOPs/s (2/3 ) |
160+
| INT16 compute 19.679 TIOPs/s ( 1x ) |
161+
| INT8 compute 13.294 TIOPs/s (2/3 ) |
162+
| Memory Bandwidth ( coalesced read ) 967.34 GB/s |
163+
| Memory Bandwidth ( coalesced write) 979.27 GB/s |
164+
| Memory Bandwidth (misaligned read ) 1302.28 GB/s |
165+
| Memory Bandwidth (misaligned write) 634.80 GB/s |
166+
| PCIe Bandwidth (send ) 17.66 GB/s |
167+
| PCIe Bandwidth ( receive ) 17.60 GB/s |
168+
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 17.30 GB/s |
169+
|-----------------------------------------------------------------------------|
170+
|----------------.------------------------------------------------------------|
171+
| Device ID | 7 |
143172
| Device Name | gfx906:sramecc+:xnack- |
144173
| Device Vendor | Advanced Micro Devices, Inc. |
145174
| Device Driver | 3513.0 (HSA1.1,LC) |
@@ -171,6 +200,46 @@ Works with any GPU in Windows, Linux, macOS and Android.
171200
```
172201
.-----------------------------------------------------------------------------.
173202
|----------------.------------------------------------------------------------|
203+
| Device ID 0 | Intel(R) FPGA Emulation Device |
204+
| Device ID 1 | Intel(R) Xeon(R) Platinum 8480+ |
205+
| Device ID 2 | Intel(R) Data Center GPU Max 1100 |
206+
| Device ID 3 | Intel(R) Data Center GPU Max 1100 |
207+
| Device ID 4 | Intel(R) Data Center GPU Max 1100 |
208+
| Device ID 5 | Intel(R) Data Center GPU Max 1100 |
209+
|----------------'------------------------------------------------------------|
210+
|----------------.------------------------------------------------------------|
211+
| Device ID | 2 |
212+
| Device Name | Intel(R) Data Center GPU Max 1100 |
213+
| Device Vendor | Intel(R) Corporation |
214+
| Device Driver | 23.17.26241.33 |
215+
| OpenCL Version | OpenCL C 1.2 |
216+
| Compute Units | 448 at 1550 MHz (7168 cores, 22.221 TFLOPs/s) |
217+
| Memory, Cache | 46679 MB, 196608 KB global / 128 KB local |
218+
| Buffer Limits | 46679 MB global, 47799500 KB constant |
219+
|----------------'------------------------------------------------------------|
220+
| Info: OpenCL C code successfully compiled. |
221+
| FP64 compute 16.314 TFLOPs/s ( 1x ) |
222+
| FP32 compute 21.932 TFLOPs/s ( 2x ) |
223+
| FP16 compute 41.413 TFLOPs/s ( 4x ) |
224+
| INT64 compute 1.082 TIOPs/s (1/12) |
225+
| INT32 compute 7.028 TIOPs/s (2/3 ) |
226+
| INT16 compute 60.320 TIOPs/s ( 4x ) |
227+
| INT8 compute 38.822 TIOPs/s ( 4x ) |
228+
| Memory Bandwidth ( coalesced read ) 739.91 GB/s |
229+
| Memory Bandwidth ( coalesced write) 888.49 GB/s |
230+
| Memory Bandwidth (misaligned read ) 718.96 GB/s |
231+
| Memory Bandwidth (misaligned write) 319.63 GB/s |
232+
| PCIe Bandwidth (send ) 25.42 GB/s |
233+
| PCIe Bandwidth ( receive ) 23.10 GB/s |
234+
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 32.31 GB/s |
235+
|-----------------------------------------------------------------------------|
236+
|-----------------------------------------------------------------------------|
237+
| Done. Press Enter to exit. |
238+
'-----------------------------------------------------------------------------'
239+
```
240+
```
241+
.-----------------------------------------------------------------------------.
242+
|----------------.------------------------------------------------------------|
174243
| Device ID 0 | NVIDIA TITAN Xp |
175244
| Device ID 1 | Tesla K40m |
176245
| Device ID 2 | Tesla K40m |

src/opencl.hpp

Lines changed: 2 additions & 1 deletion
@@ -57,9 +57,10 @@ struct Device_Info {
5757
const bool nvidia_64_cores_per_cu = contains_any(to_lower(name), {"p100", "v100", "a100", "a30", " 16", " 20", "titan v", "titan rtx", "quadro t", "tesla t", "quadro rtx"}) && !contains(to_lower(name), "rtx a"); // identify P100, Volta, Turing, A100, A30
5858
const bool amd_128_cores_per_dualcu = contains(to_lower(name), "gfx10"); // identify RDNA/RDNA2 GPUs where dual CUs are reported
5959
const bool amd_256_cores_per_dualcu = contains(to_lower(name), "gfx11"); // identify RDNA3 GPUs where dual CUs are reported
60+
const bool intel_16_cores_per_cu = contains(to_lower(name), "gpu max"); // identify PVC GPUs
6061
const float nvidia = (float)(contains(to_lower(vendor), "nvidia"))*(nvidia_64_cores_per_cu?64.0f:nvidia_192_cores_per_cu?192.0f:128.0f); // Nvidia GPUs have 192 cores/CU (Kepler), 128 cores/CU (Maxwell, Pascal, Ampere, Hopper, Ada) or 64 cores/CU (P100, Volta, Turing, A100, A30)
6162
const float amd = (float)(contains_any(to_lower(vendor), {"amd", "advanced"}))*(is_gpu?(amd_256_cores_per_dualcu?256.0f:amd_128_cores_per_dualcu?128.0f:64.0f):0.5f); // AMD GPUs have 64 cores/CU (GCN, CDNA), 128 cores/dualCU (RDNA, RDNA2) or 256 cores/dualCU (RDNA3), AMD CPUs (with SMT) have 1/2 core/CU
62-
const float intel = (float)(contains(to_lower(vendor), "intel"))*(is_gpu?8.0f:0.5f); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs (with HT) have 1/2 core/CU
63+
const float intel = (float)(contains(to_lower(vendor), "intel"))*(is_gpu?(intel_16_cores_per_cu?16.0f:8.0f):0.5f); // Intel GPUs have 16 cores/CU (PVC) or 8 cores/CU (integrated/Arc), Intel CPUs (with HT) have 1/2 core/CU
6364
const float apple = (float)(contains(to_lower(vendor), "apple"))*(128.0f); // Apple ARM GPUs usually have 128 cores/CU
6465
const float arm = (float)(contains(to_lower(vendor), "arm"))*(is_gpu?8.0f:1.0f); // ARM GPUs usually have 8 cores/CU, ARM CPUs have 1 core/CU
6566
cores = to_uint((float)compute_units*(nvidia+amd+intel+apple+arm)); // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)
