Commit afc23e3

Set temporary single CPU affinity before cgroup cpuset transition.
This handles a corner case when joining a container whose processes all run exclusively on isolated CPU cores: it forces the kernel to schedule the runc process on the first CPU core within the cgroup cpuset.

Kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 affected this deterministic scheduling behavior by distributing tasks across the CPU cores within the cgroup cpuset. Some intensive real-time applications rely on this deterministic behavior and use the first CPU core to run a slow thread while the other CPU cores are fully used by real-time threads with the SCHED_FIFO policy. Such applications prevent the runc process from joining a container when the runc process is randomly scheduled on a CPU core owned by a real-time thread.

Introduces the isolated CPU affinity transition OCI runtime annotation org.opencontainers.runc.exec.isolated-cpu-affinity-transition to restore the behavior during runc exec.

Fix issue with kernel >= 6.2 not resetting CPU affinity for container processes.

Signed-off-by: Cédric Clerget <[email protected]>
1 parent d0f803e commit afc23e3

14 files changed, +954 -2 lines changed

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
## Isolated CPU affinity transition

Kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76, introduced in 5.7,
changed a previously deterministic scheduling behavior by distributing tasks
across the CPU cores of a cgroup cpuset. As a result, `runc exec` might be
impacted under some circumstances, for example when a container has been
created within a cgroup cpuset composed entirely of isolated CPU cores,
usually set with the `nohz_full` and/or `isolcpus` kernel boot parameters.

Some containerized real-time applications rely on this deterministic
behavior and use the first CPU core to run a slow thread while the other CPU
cores are fully used by real-time threads with the SCHED_FIFO policy.
Such applications can prevent the runc process from joining a container when
the runc process is randomly scheduled on a CPU core owned by a real-time thread.

runc provides a way to restore this behavior: add the following
annotation to the container runtime spec (`config.json`):

`org.opencontainers.runc.exec.isolated-cpu-affinity-transition`

This annotation can take one of the following values:

* `temporary`: temporarily set the runc process CPU affinity to the first
isolated CPU core of the container cgroup cpuset.
* `definitive`: permanently set the runc process CPU affinity to the first
isolated CPU core of the container cgroup cpuset.

For example:

```json
"annotations": {
    "org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary"
}
```

__WARNING:__ `definitive` requires a kernel >= 6.2; it also works with RHEL 9 and
above.

### How does it work?

When the annotation is set, `runc exec` looks for the `nohz_full` kernel
boot parameter and considers the CPUs in its list as isolated. It does not
look at the `isolcpus` boot parameter; it simply assumes that the `isolcpus`
value, when specified, is identical to `nohz_full`. If the `nohz_full`
parameter is not found, runc also attempts to read the list from
`/sys/devices/system/cpu/nohz_full`.

Once it has the isolated CPU list, runc selects an eligible CPU core within the
container cgroup cpuset based on these heuristics:

* when the cpuset has no cores: no eligible CPU
* when there are no isolated cores: no eligible CPU
* when none of the cpuset cores are in the isolated core list: no eligible CPU
* when all cpuset cores are isolated cores: return the first CPU of the cpuset
* when the cpuset mixes housekeeping and isolated cores: return the
  first housekeeping CPU, that is, the first CPU not in the isolated list.

The returned CPU core is then used to set the `runc init` CPU affinity before
the container cgroup cpuset transition.
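
The selection heuristics listed above can be sketched in Go as follows. This is a minimal illustration, not runc's implementation: the function name `eligibleCPU` and the `-1` sentinel are hypothetical, and the sketch assumes that "first" means the lowest CPU number.

```go
package main

import (
	"fmt"
	"sort"
)

// noEligibleCPU is a hypothetical sentinel meaning "do not pin to any CPU".
const noEligibleCPU = -1

// eligibleCPU applies the documented heuristics to a cgroup cpuset and a
// list of isolated CPUs, returning the CPU to pin to or noEligibleCPU.
func eligibleCPU(cpuset, isolated []int) int {
	// No cpuset cores or no isolated cores: nothing to do.
	if len(cpuset) == 0 || len(isolated) == 0 {
		return noEligibleCPU
	}

	isIsolated := make(map[int]bool, len(isolated))
	for _, cpu := range isolated {
		isIsolated[cpu] = true
	}

	// Work on a sorted copy so that "first" is the lowest CPU number
	// (an assumption of this sketch).
	sorted := append([]int(nil), cpuset...)
	sort.Ints(sorted)

	isolatedCount := 0
	firstHousekeeping := noEligibleCPU
	for _, cpu := range sorted {
		switch {
		case isIsolated[cpu]:
			isolatedCount++
		case firstHousekeeping == noEligibleCPU:
			firstHousekeeping = cpu
		}
	}

	switch isolatedCount {
	case 0:
		// The cpuset contains no isolated core.
		return noEligibleCPU
	case len(sorted):
		// Every cpuset core is isolated: use the first one.
		return sorted[0]
	default:
		// Mixed cpuset: use the first housekeeping core.
		return firstHousekeeping
	}
}

func main() {
	isolated := []int{4, 5, 6, 7}
	fmt.Println(eligibleCPU([]int{4, 5, 6, 7}, isolated)) // all isolated: 4
	fmt.Println(eligibleCPU([]int{2, 3, 4, 5}, isolated)) // mixed: 2
	fmt.Println(eligibleCPU([]int{0, 1}, isolated))       // none isolated: -1
}
```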

#### Transition example

`nohz_full` lists the isolated cores `4-7`. A container has been created with
the cgroup cpuset `4-7` so that it only runs on the isolated CPU cores 4 to 7.
`runc exec` is called by a process with its CPU affinity set to `0-3`.

* with the `temporary` transition:

  runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4-7)

* with the `definitive` transition:

  runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4)

The difference between `temporary` and `definitive` is the container process
affinity: `definitive` constrains the container process to run on the
first isolated CPU core of the cgroup cpuset, while `temporary` restores the
CPU affinity to match the container cgroup cpuset.

The `definitive` transition might be helpful when `nohz_full` is used without
`isolcpus`, to prevent runc and the container process from being noisy
neighbours for real-time applications.

### How to use it with Kubernetes?

Kubernetes doesn't manage containers directly. Instead, it uses the Container Runtime
Interface (CRI) to communicate with software that implements this interface and is
responsible for managing the lifecycle of containers. Popular CRI implementations include
containerd and CRI-O. These implementations allow pod annotations to be passed to the
container runtime via the container runtime spec. Currently, runc is the default runtime for both.

#### Containerd configuration

The containerd CRI plugin uses runc by default but requires an extra step to pass the annotation to runc.
You have to list `org.opencontainers.runc.exec.isolated-cpu-affinity-transition` as a pod
annotation allowed to be passed to the container runtime in `/etc/containerd/config.toml`:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      base_runtime_spec = "/etc/containerd/cri-base.json"
      pod_annotations = ["org.opencontainers.runc.exec.isolated-cpu-affinity-transition"]
```

#### CRI-O configuration

CRI-O doesn't require any extra step; however, some annotations could be excluded by
configuration.

#### Pod deployment example

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  annotations:
    org.opencontainers.runc.exec.isolated-cpu-affinity-transition: "temporary"
spec:
  containers:
  - name: demo
    image: registry.com/demo:latest
```

features.go

Lines changed: 1 addition & 0 deletions
@@ -68,6 +68,7 @@ var featuresCommand = cli.Command{
 				"bundle",
 				"org.systemd.property.", // prefix form
 				"org.criu.config",
+				"org.opencontainers.runc.exec.isolated-cpu-affinity-transition",
 			},
 		}
 

libcontainer/cgroups/cgroups.go

Lines changed: 4 additions & 0 deletions
@@ -71,4 +71,8 @@ type Manager interface {
 
 	// OOMKillCount reports OOM kill count for the cgroup.
 	OOMKillCount() (uint64, error)
+
+	// GetEffectiveCPUs returns the effective CPUs of the cgroup, an empty
+	// value means that the cgroups cpuset subsystem/controller is not enabled.
+	GetEffectiveCPUs() string
 }

libcontainer/cgroups/fs/fs.go

Lines changed: 27 additions & 0 deletions
@@ -4,6 +4,8 @@ import (
 	"errors"
 	"fmt"
 	"os"
+	"path/filepath"
+	"strings"
 	"sync"
 
 	"golang.org/x/sys/unix"
@@ -263,3 +265,28 @@ func (m *Manager) OOMKillCount() (uint64, error) {
 
 	return c, err
 }
+
+func (m *Manager) GetEffectiveCPUs() string {
+	return GetEffectiveCPUs(m.Path("cpuset"), m.cgroups)
+}
+
+func GetEffectiveCPUs(cpusetPath string, cgroups *configs.Cgroup) string {
+	// Fast path.
+	if cgroups.CpusetCpus != "" {
+		return cgroups.CpusetCpus
+	} else if !strings.HasPrefix(cpusetPath, defaultCgroupRoot) {
+		return ""
+	}
+
+	// Iterates until it goes to the cgroup root path.
+	// It's required for containers in which cpuset controller
+	// is not enabled, in this case a parent cgroup is used.
+	for path := cpusetPath; path != defaultCgroupRoot; path = filepath.Dir(path) {
+		cpus, err := fscommon.GetCgroupParamString(path, "cpuset.effective_cpus")
+		if err == nil {
+			return cpus
+		}
+	}
+
+	return ""
+}

libcontainer/cgroups/fs2/fs2.go

Lines changed: 28 additions & 0 deletions
@@ -4,11 +4,13 @@ import (
 	"errors"
 	"fmt"
 	"os"
+	"path/filepath"
 	"strings"
 
 	"github.com/opencontainers/runc/libcontainer/cgroups"
 	"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
 	"github.com/opencontainers/runc/libcontainer/configs"
+	"github.com/opencontainers/runc/libcontainer/utils"
 )
 
 type parseError = fscommon.ParseError
@@ -32,6 +34,9 @@ func NewManager(config *configs.Cgroup, dirPath string) (*Manager, error) {
 		if err != nil {
 			return nil, err
 		}
+	} else {
+		// Clean path for safety.
+		dirPath = utils.CleanPath(dirPath)
 	}
 
 	m := &Manager{
@@ -316,3 +321,26 @@ func CheckMemoryUsage(dirPath string, r *configs.Resources) error {
 
 	return nil
 }
+
+func (m *Manager) GetEffectiveCPUs() string {
+	// Fast path.
+	if m.config.CpusetCpus != "" {
+		return m.config.CpusetCpus
+	} else if !strings.HasPrefix(m.dirPath, UnifiedMountpoint) {
+		return ""
+	}
+
+	// Iterates until it goes outside of the cgroup root path.
+	// It's required for containers in which cpuset controller
+	// is not enabled, in this case a parent cgroup is used.
+	outsidePath := filepath.Dir(UnifiedMountpoint)
+
+	for path := m.dirPath; path != outsidePath; path = filepath.Dir(path) {
+		cpus, err := fscommon.GetCgroupParamString(path, "cpuset.cpus.effective")
+		if err == nil {
+			return cpus
+		}
+	}
+
+	return ""
+}

libcontainer/cgroups/systemd/v1.go

Lines changed: 4 additions & 0 deletions
@@ -411,3 +411,7 @@ func (m *LegacyManager) Exists() bool {
 func (m *LegacyManager) OOMKillCount() (uint64, error) {
 	return fs.OOMKillCount(m.Path("memory"))
 }
+
+func (m *LegacyManager) GetEffectiveCPUs() string {
+	return fs.GetEffectiveCPUs(m.Path("cpuset"), m.cgroups)
+}

libcontainer/cgroups/systemd/v2.go

Lines changed: 4 additions & 0 deletions
@@ -514,3 +514,7 @@ func (m *UnifiedManager) Exists() bool {
 func (m *UnifiedManager) OOMKillCount() (uint64, error) {
 	return m.fsMgr.OOMKillCount()
 }
+
+func (m *UnifiedManager) GetEffectiveCPUs() string {
+	return m.fsMgr.GetEffectiveCPUs()
+}

libcontainer/container_linux_test.go

Lines changed: 4 additions & 0 deletions
@@ -69,6 +69,10 @@ func (m *mockCgroupManager) GetFreezerState() (configs.FreezerState, error) {
 	return configs.Thawed, nil
 }
 
+func (m *mockCgroupManager) GetEffectiveCPUs() string {
+	return ""
+}
+
 type mockProcess struct {
 	_pid    int
 	started uint64
