Commit e950dfd

Merge branch 'sycl' into refactorcache
2 parents 1040963 + 7f634b9 commit e950dfd

12 files changed

+400
-11
lines changed

.github/workflows/sycl-linux-precommit.yml

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ jobs:
8181

8282
compat_read_exclude:
8383
name: Read compatibility testing exclude list
84-
runs-on: [Linux, build]
84+
runs-on: [Linux, aux-tasks]
8585
outputs:
8686
FILTER_6_2: ${{ steps.result.outputs.FILTER_6_2 }}
8787
FILTER_6_3: ${{ steps.result.outputs.FILTER_6_3 }}
Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
1+
# System Performance Tuning Guide
2+
3+
This guide provides recommendations for optimizing system performance when running SYCL and Unified Runtime benchmarks.
4+
For framework-specific information, see [README.md](README.md) and [CONTRIB.md](CONTRIB.md).
5+
6+
## Table of Contents
7+
8+
- [Overview](#overview)
9+
- [System Configuration](#system-configuration)
10+
- [CPU Tuning](#cpu-tuning)
11+
- [GPU Configuration](#gpu-configuration)
12+
- [Perf Configuration](#perf-configuration)
13+
- [Environment Variables](#environment-variables)
14+
15+
## Overview
16+
17+
Performance benchmarking requires a stable and optimized system environment to produce reliable and reproducible results. This guide covers essential system tuning steps for reducing run-to-run variance in benchmark results.
18+
19+
## System Configuration
20+
21+
### Kernel Parameters
22+
23+
Add the following to `/etc/default/grub` in `GRUB_CMDLINE_LINUX`:
24+
```
25+
# Disable CPU frequency scaling
26+
# intel_pstate=disable
27+
28+
# Isolate CPUs for benchmark workloads (example: reserve cores 2-7), preventing other processes
29+
# from using them.
30+
# isolcpus=2-7
31+
32+
GRUB_CMDLINE_LINUX="intel_pstate=disable isolcpus=2-7 <other_options>"
33+
```
34+
35+
Update GRUB and reboot:
36+
```bash
37+
sudo update-grub
38+
sudo reboot
39+
```
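After the reboot, it can be worth confirming that the parameters actually reached the running kernel. A minimal sketch, assuming a hypothetical helper named `has_kernel_param` (not part of the benchmark scripts) that looks for an exact parameter in `/proc/cmdline`:

```shell
# has_kernel_param PARAM [CMDLINE_FILE]
# Hypothetical helper: succeeds if PARAM (e.g. "isolcpus=2-7") appears as a
# complete entry on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline.
has_kernel_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    [ -r "$cmdline_file" ] || return 1
    # Split the command line into one parameter per line and match exactly.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}

# The parameter values below are the examples used in this guide.
for p in intel_pstate=disable isolcpus=2-7; do
    if has_kernel_param "$p"; then
        echo "$p: active"
    else
        echo "$p: not found on the kernel command line"
    fi
done
```

The kernel also reports the cores it isolated in `/sys/devices/system/cpu/isolated`, which is a useful cross-check for the `isolcpus` range.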
40+
41+
## CPU Tuning
42+
43+
### CPU Frequency Scaling
44+
45+
The performance governor ensures that the CPU runs at maximum frequency.
46+
```bash
47+
# Set performance governor for all CPUs
48+
sudo cpupower frequency-set --governor performance
49+
# Reload sysctl settings from all system configuration files
50+
sudo sysctl --system
51+
52+
# Check current governor
53+
sudo cpupower frequency-info
54+
```
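The governor can also be checked per core through sysfs, which helps when only some cores were switched. A minimal sketch, assuming a hypothetical helper named `non_performance_cpus`; it prints nothing when every core with cpufreq support uses the `performance` governor:

```shell
# non_performance_cpus [CPU_ROOT]
# Hypothetical helper: prints "<cpu>: <governor>" for every CPU whose cpufreq
# governor is not "performance". CPU_ROOT defaults to /sys/devices/system/cpu.
non_performance_cpus() {
    cpu_root=${1:-/sys/devices/system/cpu}
    for gov_file in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip when the glob did not match or the CPU has no cpufreq support.
        [ -r "$gov_file" ] || continue
        gov=$(cat "$gov_file")
        if [ "$gov" != "performance" ]; then
            cpu_dir=${gov_file%/cpufreq/scaling_governor}
            echo "${cpu_dir##*/}: $gov"
        fi
    done
}

non_performance_cpus
```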
55+
56+
To preserve these settings after reboot, create a systemd service which runs the above commands at startup:
57+
```bash
58+
# Create a systemd service file
59+
sudo vim /etc/systemd/system/cpupower_governor.service
60+
```
61+
Add the following content:
62+
```
63+
[Unit]
64+
Description=Set CPU governor to Performance
65+
After=multi-user.target
66+
67+
[Service]
68+
Type=oneshot
69+
ExecStart=/bin/sh -c '/usr/bin/cpupower frequency-set --governor performance && sysctl --system'
70+
71+
[Install]
72+
WantedBy=multi-user.target
73+
```
74+
Enable and start the service:
75+
```bash
76+
sudo systemctl enable cpupower_governor.service
77+
sudo systemctl start cpupower_governor.service
78+
```
79+
80+
### CPU Affinity
81+
82+
Bind benchmark processes to specific CPU cores to reduce context switching and improve cache locality.
83+
Make sure that isolated CPUs are located on the same NUMA node as the GPU being used.
84+
```bash
85+
# Run benchmark on specific CPU cores
86+
taskset -c 2-7 ./main.py ~/benchmarks_workdir/ --sycl ~/llvm/build/
87+
```
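To pick cores on the same NUMA node as the GPU, the node can be read from the GPU's PCI entry in sysfs. A minimal sketch, assuming a hypothetical helper named `gpu_local_cpus` and using `card1` purely as an example card name:

```shell
# gpu_local_cpus CARD [SYSFS_ROOT]
# Hypothetical helper: prints the CPU list of the NUMA node that CARD
# (e.g. "card1") is attached to. SYSFS_ROOT defaults to /sys.
gpu_local_cpus() {
    card=$1
    sysfs=${2:-/sys}
    node_file=$sysfs/class/drm/$card/device/numa_node
    [ -r "$node_file" ] || { echo "no numa_node entry for $card" >&2; return 1; }
    node=$(cat "$node_file")
    # The kernel reports -1 when the device has no NUMA affinity information.
    if [ "$node" -lt 0 ]; then
        echo "$card has no NUMA affinity information" >&2
        return 1
    fi
    cat "$sysfs/devices/system/node/node$node/cpulist"
}

# Prints a diagnostic on stderr instead of a CPU list if the lookup fails.
gpu_local_cpus card1 || true
```

The printed range (for example `0-15`) is the set from which the `isolcpus` and `taskset -c` cores should be chosen.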
88+
89+
## GPU Configuration
90+
91+
### GPU Frequency Control
92+
Setting the GPU to run at maximum frequency can significantly improve benchmark performance and stability.
93+
94+
First, find which card corresponds to the GPU you want to tune (e.g., card1). A list of known device IDs for
95+
Intel GPU cards can be found at https://dgpu-docs.intel.com/devices/hardware-table.html#gpus-with-supported-drivers.
96+
```bash
97+
# Print card1 Device ID
98+
cat /sys/class/drm/card1/device/vendor # Should be 0x8086 for Intel
99+
cat /sys/class/drm/card1/device/device # Device ID
100+
```
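On machines with several cards, the same check can be looped over every DRM entry. A minimal sketch, assuming a hypothetical helper named `list_intel_gpus`; it only reads the `vendor` and `device` files shown above:

```shell
# list_intel_gpus [DRM_ROOT]
# Hypothetical helper: prints "<card> <device id>" for every DRM card whose
# PCI vendor ID is 0x8086 (Intel). DRM_ROOT defaults to /sys/class/drm.
list_intel_gpus() {
    drm_root=${1:-/sys/class/drm}
    for card_dir in "$drm_root"/card[0-9]*; do
        # Skip connector entries such as card1-DP-1.
        case ${card_dir##*/} in *-*) continue ;; esac
        [ -r "$card_dir/device/vendor" ] || continue
        if [ "$(cat "$card_dir/device/vendor")" = "0x8086" ]; then
            echo "${card_dir##*/} $(cat "$card_dir/device/device")"
        fi
    done
}

list_intel_gpus
```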
101+
102+
Verify that the configured maximum frequency matches the hardware maximum. For the Arc B580, the maximum frequency is 2850 MHz. To see the current value, run `cat /sys/class/drm/card1/device/tile0/gt0/freq0/max_freq`. If it does not equal the hardware maximum, set it:
103+
```bash
104+
# Arc B580 (Battlemage)
105+
echo 2850 | sudo tee /sys/class/drm/card1/device/tile0/gt0/freq0/max_freq
106+
107+
# Set the min frequency to the max frequency, so it is fixed
108+
echo 2850 | sudo tee /sys/class/drm/card1/device/tile0/gt0/freq0/min_freq
109+
```
110+
111+
```bash
112+
# Check GPU frequencies for GPU Max 1100 (Ponte Vecchio)
113+
cat /sys/class/drm/card1/gt_max_freq_mhz
114+
cat /sys/class/drm/card1/gt_min_freq_mhz
115+
116+
# Pin the minimum GPU frequency to the maximum so the frequency is fixed
117+
max_freq=$(cat /sys/class/drm/card1/gt_max_freq_mhz)
118+
echo $max_freq | sudo tee /sys/class/drm/card1/gt_min_freq_mhz
119+
```
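Before running benchmarks, a quick read-back can confirm that the minimum and maximum frequencies are equal. A minimal sketch, assuming a hypothetical helper named `freq_is_pinned` that takes the two sysfs files as arguments, so it works with either layout shown above:

```shell
# freq_is_pinned MIN_FILE MAX_FILE
# Hypothetical helper: prints both values and succeeds only when they are
# equal. Returns 2 if either file cannot be read.
freq_is_pinned() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    min=$(cat "$1")
    max=$(cat "$2")
    echo "min=$min max=$max"
    [ "$min" = "$max" ]
}

# Paths from the Arc B580 example; adjust the card and layout as needed.
freq_dir=/sys/class/drm/card1/device/tile0/gt0/freq0
freq_is_pinned "$freq_dir/min_freq" "$freq_dir/max_freq" ||
    echo "GPU frequency is not pinned (or the sysfs files are missing)"
```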
120+
121+
The result can be verified using tools such as oneprof or unitrace to track frequency over time for an arbitrary benchmark (many iterations of a small problem size are recommended). The frequency should remain fixed, assuming thermal throttling does not occur.
122+
123+
## Driver version
124+
Make sure you are using the latest driver. On Ubuntu:
125+
```bash
126+
sudo apt update && sudo apt upgrade
127+
```
128+
129+
## Perf Configuration
130+
131+
Some benchmarks (e.g., ur_submit_kernel) require `perf` to have proper permissions to sample hardware counters. This section covers the necessary setup steps.
132+
133+
### Installing Perf
134+
135+
First, ensure `perf` is installed for your kernel version. On Ubuntu/Debian:
136+
```bash
137+
sudo apt install linux-tools-$(uname -r) linux-tools-common
138+
```
139+
140+
### Configuring Perf Permissions
141+
142+
By default, `perf` requires root privileges to access hardware performance counters. Setting `kernel.perf_event_paranoid` to `-1` lifts this restriction for unprivileged users.
143+
Make the setting persistent across reboots by adding it to sysctl configuration:
144+
```bash
145+
echo 'kernel.perf_event_paranoid = -1' | sudo tee -a /etc/sysctl.d/99-perf.conf
146+
147+
# Apply immediately
148+
sudo sysctl --system
149+
```
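The value currently in effect can be read back from procfs. A minimal sketch, assuming a hypothetical helper named `perf_paranoid_at_most` that compares the live setting against a limit:

```shell
# perf_paranoid_at_most LIMIT [FILE]
# Hypothetical helper: prints the current setting and succeeds when it is
# less than or equal to LIMIT. Returns 2 if the file cannot be read.
# FILE defaults to /proc/sys/kernel/perf_event_paranoid.
perf_paranoid_at_most() {
    limit=$1
    file=${2:-/proc/sys/kernel/perf_event_paranoid}
    [ -r "$file" ] || return 2
    value=$(cat "$file")
    echo "perf_event_paranoid=$value"
    [ "$value" -le "$limit" ]
}

perf_paranoid_at_most -1 ||
    echo "perf_event_paranoid is above -1 or could not be read"
```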
150+
151+
### Verifying Perf Setup
152+
Test that `perf` is working correctly:
153+
```bash
154+
perf record -e cycles sleep 1
155+
perf report
156+
157+
# Test hardware counter access
158+
perf stat -e cycles,instructions ls
159+
```
160+
161+
**Note:** You may see warnings about kernel address maps being restricted or symbol resolution issues. These warnings don't prevent `perf` from collecting performance data - they only affect the quality of symbol resolution in kernel space. For user-space profiling (which covers most benchmark scenarios), `perf` will work correctly despite these warnings.
162+
163+
## Environment Variables
164+
165+
### Level Zero Environment Variables
166+
Use GPU affinity to bind benchmarks to a specific GPU. Use CPUs from the same NUMA node as the GPU to reduce latency.
167+
```bash
168+
export ZE_AFFINITY_MASK=0
169+
```
170+
171+
### SYCL Runtime Variables
172+
For consistency, limit available devices to a specific GPU runtime. For Level Zero, it is recommended to use the v2 version of the runtime library.
173+
```bash
174+
export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
175+
export SYCL_UR_USE_LEVEL_ZERO_V2=1
176+
```
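The environment variables and the CPU binding from this guide can be combined into a single invocation. A minimal sketch, assuming a hypothetical helper named `bench_cmd` that only prints the command line for review and does not run anything; the core range and GPU index are the example values used earlier:

```shell
# bench_cmd CPUS GPU_INDEX
# Hypothetical helper: prints (does not execute) the benchmark invocation
# with the Level Zero and SYCL settings from this guide applied.
bench_cmd() {
    cpus=$1
    gpu=$2
    printf '%s\n' "ZE_AFFINITY_MASK=$gpu ONEAPI_DEVICE_SELECTOR=level_zero:gpu SYCL_UR_USE_LEVEL_ZERO_V2=1 taskset -c $cpus ./main.py ~/benchmarks_workdir/ --sycl ~/llvm/build/"
}

bench_cmd 2-7 0
```

Printing the command first keeps the settings visible in logs, which makes it easier to reproduce a run later.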

devops/scripts/benchmarks/README.md

Lines changed: 5 additions & 0 deletions
@@ -143,6 +143,11 @@ IGC (Ubuntu):
143143
`$ sudo apt-get install flex bison libz-dev cmake libc6 libstdc++6 python3-pip`
144144

145145

146+
## Performance Tuning
147+
148+
For stable benchmark results and system configuration recommendations, see the
149+
[Performance Tuning Guide](PERFORMANCE_TUNING.md).
150+
146151
## Contribution
147152

148153
The requirements and instructions above are for building the project from source

llvm/lib/SYCLLowerIR/LowerWGLocalMemory.cpp

Lines changed: 29 additions & 0 deletions
@@ -184,11 +184,40 @@ lowerDynamicLocalMemCallDirect(CallInst *CI, Triple TT,
184184

185185
static void lowerLocalMemCall(Function *LocalMemAllocFunc,
186186
std::function<void(CallInst *CI)> TransformCall) {
187+
static SmallPtrSet<Function *, 16> FuncsCache;
187188
SmallVector<CallInst *, 4> DelCalls;
188189
for (User *U : LocalMemAllocFunc->users()) {
189190
auto *CI = cast<CallInst>(U);
190191
TransformCall(CI);
191192
DelCalls.push_back(CI);
193+
// Now, take each kernel that calls the builtins that allocate local memory,
194+
// either directly or through a series of function calls that eventually end
195+
// up in a direct call to the builtin, and attach the
196+
// work-group-memory-static attribute to the kernel if not already attached.
197+
// This is needed because free function kernels do not have the attribute
198+
// added by the library as is the case with other types of kernels.
199+
if (!FuncsCache.insert(CI->getFunction()).second)
200+
continue; // We have already traversed call graph from this function.
201+
202+
SmallVector<Function *, 8> WorkList;
203+
WorkList.push_back(CI->getFunction());
204+
while (!WorkList.empty()) {
205+
Function *F = WorkList.back();
206+
WorkList.pop_back();
207+
208+
// Mark kernel as using scratch memory if it isn't marked already.
209+
if (F->getCallingConv() == CallingConv::SPIR_KERNEL &&
210+
!F->hasFnAttribute(WORK_GROUP_STATIC_ATTR))
211+
F->addFnAttr(WORK_GROUP_STATIC_ATTR);
212+
213+
for (auto *FU : F->users()) {
214+
if (auto *UCI = dyn_cast<CallInst>(FU)) {
215+
if (FuncsCache.insert(UCI->getFunction()).second)
216+
WorkList.push_back(UCI->getFunction());
217+
} // Even though there could be other uses of a Function, we don't
218+
// care about them because we are only concerned about call graph.
219+
}
220+
}
192221
}
193222

194223
for (auto *CI : DelCalls) {

llvm/test/SYCLLowerIR/work_group_static.ll

Lines changed: 20 additions & 0 deletions
@@ -22,9 +22,29 @@ entry:
2222
ret void
2323
}
2424

25+
; Function Attrs: convergent norecurse
26+
; CHECK: @__sycl_kernel_B{{.*}} #[[ATTRS:[0-9]+]]
27+
define weak_odr dso_local spir_kernel void @__sycl_kernel_B(ptr addrspace(1) %0) local_unnamed_addr #1 !kernel_arg_addr_space !5 {
28+
entry:
29+
%1 = tail call spir_func ptr addrspace(3) @__sycl_dynamicLocalMemoryPlaceholder(i64 128) #1
30+
ret void
31+
}
32+
33+
; Function Attrs: convergent norecurse
34+
; CHECK: @__sycl_kernel_C{{.*}} #[[ATTRS]]
35+
define weak_odr dso_local spir_kernel void @__sycl_kernel_C(ptr addrspace(1) %0) local_unnamed_addr #1 !kernel_arg_addr_space !5 {
36+
entry:
37+
%1 = tail call spir_func ptr addrspace(3) @__sycl_allocateLocalMemory(i64 128, i64 4) #1
38+
ret void
39+
}
40+
41+
; Function Attrs: convergent
42+
declare dso_local spir_func ptr addrspace(3) @__sycl_allocateLocalMemory(i64, i64) local_unnamed_addr #1
43+
2544
; Function Attrs: convergent
2645
declare dso_local spir_func ptr addrspace(3) @__sycl_dynamicLocalMemoryPlaceholder(i64) local_unnamed_addr #1
2746

47+
; CHECK: #[[ATTRS]] = {{.*}} "sycl-work-group-static"
2848
attributes #0 = { convergent norecurse "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" "unsafe-fp-math"="false" "use-soft-float"="false" "sycl-work-group-static"="1" }
2949
attributes #1 = { convergent norecurse }
3050

sycl/include/sycl/properties/accessor_properties.hpp

Lines changed: 5 additions & 5 deletions
@@ -188,8 +188,8 @@ struct is_property<ext::intel::property::buffer_location> : std::true_type {};
188188

189189
template <typename T>
190190
struct is_property_of<property::noinit, T>
191-
: std::bool_constant<detail::acc_properties::is_accessor<T>::value ||
192-
detail::acc_properties::is_host_accessor<T>::value> {};
191+
: std::bool_constant<detail::acc_properties::is_accessor_v<T> ||
192+
detail::acc_properties::is_host_accessor_v<T>> {};
193193

194194
template <typename T>
195195
struct is_property_of<property::no_init, T>
@@ -201,15 +201,15 @@ struct is_property_of<property::no_init, T>
201201

202202
template <typename T>
203203
struct is_property_of<ext::oneapi::property::no_offset, T>
204-
: std::bool_constant<detail::acc_properties::is_accessor<T>::value> {};
204+
: std::bool_constant<detail::acc_properties::is_accessor_v<T>> {};
205205

206206
template <typename T>
207207
struct is_property_of<ext::oneapi::property::no_alias, T>
208-
: std::bool_constant<detail::acc_properties::is_accessor<T>::value> {};
208+
: std::bool_constant<detail::acc_properties::is_accessor_v<T>> {};
209209

210210
template <typename T>
211211
struct is_property_of<ext::intel::property::buffer_location, T>
212-
: std::bool_constant<detail::acc_properties::is_accessor<T>::value> {};
212+
: std::bool_constant<detail::acc_properties::is_accessor_v<T>> {};
213213

214214
namespace detail {
215215
template <int I>
