Commit e76d28b

Waiman-Long authored and htejun committed
cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
When cgroup_rstat_updated() isn't being called concurrently with cgroup_rstat_flush_locked(), its run time is pretty short. When both are called concurrently, the cgroup_rstat_updated() run time can spike to a pretty high value due to high cpu_lock hold time in cgroup_rstat_flush_locked(). This can be problematic if the task calling cgroup_rstat_updated() is a realtime task running on an isolated CPU with a strict latency requirement. The cgroup_rstat_updated() call can happen when there is a page fault, even though the task is running in user space most of the time.

The percpu cpu_lock is used to protect the update tree - updated_next and updated_children. This protection is only needed when cgroup_rstat_cpu_pop_updated() is being called. The subsequent flushing operation, which can take a much longer time, does not need that protection as it is already protected by cgroup_rstat_lock.

To reduce the cpu_lock hold time, we need to perform all the cgroup_rstat_cpu_pop_updated() calls up front, with the lock released afterward, before doing any flushing. This patch adds a new cgroup_rstat_updated_list() function to return a singly linked list of cgroups to be flushed.

Some instrumentation code is added to measure the cpu_lock hold time from right after lock acquisition to right after the lock is released. A parallel kernel build on a 2-socket x86-64 server is used as the benchmarking tool for measuring the lock hold time.

The maximum cpu_lock hold times before and after the patch are 100us and 29us respectively, so the worst-case time is reduced to about 30% of the original. However, there may be OS or hardware noise, like NMIs or SMIs, in the test system that can worsen the worst-case value. Such noise is usually tuned out in a real production environment to get a better result.

OTOH, the lock hold time frequency distribution should give a better idea of the performance benefit of the patch. Below are the frequency distributions before and after the patch:

     Hold time        Before patch       After patch
     ---------        ------------       -----------
       0-01 us           804,139         13,738,708
      01-05 us         9,772,767          1,177,194
      05-10 us         4,595,028              4,984
      10-15 us           303,481              3,562
      15-20 us            78,971              1,314
      20-25 us            24,583                 18
      25-30 us             6,908                 12
      30-40 us             8,015
      40-50 us             2,192
      50-60 us               316
      60-70 us                43
      70-80 us                 7
      80-90 us                 2
       >90 us                  3

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
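For context, the hold-time figures quoted above came from instrumentation that is not part of the diff below. A minimal sketch of how such sampling could look; the ktime_get_ns() timestamps and the max_hold_ns tracker are illustrative assumptions, not code from the patch:

    #include <linux/ktime.h>
    #include <linux/spinlock.h>

    /* Illustrative worst-case tracker; a real setup would likely be per-CPU. */
    static u64 max_hold_ns;

    static void pop_updated_timed(raw_spinlock_t *cpu_lock)
    {
            unsigned long flags;
            u64 start, delta;

            raw_spin_lock_irqsave(cpu_lock, flags);
            start = ktime_get_ns();                 /* sample right after acquisition */

            /* ... the cgroup_rstat_cpu_pop_updated() loop would run here ... */

            raw_spin_unlock_irqrestore(cpu_lock, flags);
            delta = ktime_get_ns() - start;         /* sample right after release */
            if (delta > max_hold_ns)
                    max_hold_ns = delta;
    }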
1 parent 72c6303 commit e76d28b

File tree

2 files changed: +35 / -15 lines


include/linux/cgroup-defs.h

Lines changed: 7 additions & 0 deletions
@@ -496,6 +496,13 @@ struct cgroup {
 	struct cgroup_rstat_cpu __percpu *rstat_cpu;
 	struct list_head rstat_css_list;
 
+	/*
+	 * A singly-linked list of cgroup structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 */
+	struct cgroup *rstat_flush_next;
+
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
 	struct cgroup_base_stat bstat;

kernel/cgroup/rstat.c

Lines changed: 28 additions & 15 deletions
@@ -145,6 +145,32 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	return pos;
 }
 
+/* Return a list of updated cgroups to be flushed */
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+{
+	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+	struct cgroup *head, *tail, *next;
+	unsigned long flags;
+
+	/*
+	 * The _irqsave() is needed because cgroup_rstat_lock is
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+	 * this lock with the _irq() suffix only disables interrupts on
+	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+	 * interrupts on both configurations. The _irqsave() ensures
+	 * that interrupts are always disabled and later restored.
+	 */
+	raw_spin_lock_irqsave(cpu_lock, flags);
+	head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
+	while (tail) {
+		next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
+		tail->rstat_flush_next = next;
+		tail = next;
+	}
+	raw_spin_unlock_irqrestore(cpu_lock, flags);
+	return head;
+}
+
 /*
  * A hook for bpf stat collectors to attach to and flush their stats.
  * Together with providing bpf kfuncs for cgroup_rstat_updated() and
@@ -179,21 +205,9 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	lockdep_assert_held(&cgroup_rstat_lock);
 
 	for_each_possible_cpu(cpu) {
-		raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
-						       cpu);
-		struct cgroup *pos = NULL;
-		unsigned long flags;
+		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
-		/*
-		 * The _irqsave() is needed because cgroup_rstat_lock is
-		 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-		 * this lock with the _irq() suffix only disables interrupts on
-		 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-		 * interrupts on both configurations. The _irqsave() ensures
-		 * that interrupts are always disabled and later restored.
-		 */
-		raw_spin_lock_irqsave(cpu_lock, flags);
-		while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
+		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css;
 
 			cgroup_base_stat_flush(pos, cpu);
@@ -205,7 +219,6 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			css->ss->css_rstat_flush(css, cpu);
 			rcu_read_unlock();
 		}
-		raw_spin_unlock_irqrestore(cpu_lock, flags);
 
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
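Stepping back from the kernel specifics, the structural change follows a common pattern: drain the shared per-CPU update list into a private singly linked list while holding the fine-grained lock, then walk that private list with the lock already released. Below is a small, self-contained userspace sketch of the same idea, using a pthread spinlock and a hypothetical node type; none of these names come from the kernel code:

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical node standing in for a cgroup with pending stats. */
    struct node {
            struct node *next;      /* link in the updated / flush list */
            long pending;           /* value accumulated since the last flush */
            long total;             /* flushed aggregate */
    };

    static pthread_spinlock_t update_lock;  /* plays the role of cpu_lock */
    static struct node *updated_head;       /* shared list of updated nodes */

    /* Short critical section: detach the whole list and nothing else. */
    static struct node *pop_updated_list(void)
    {
            struct node *head;

            pthread_spin_lock(&update_lock);
            head = updated_head;
            updated_head = NULL;
            pthread_spin_unlock(&update_lock);
            return head;
    }

    /* The expensive walk runs with update_lock already released. */
    static void flush_all(void)
    {
            struct node *n;

            for (n = pop_updated_list(); n; n = n->next) {
                    n->total += n->pending; /* stand-in for the real flush work */
                    n->pending = 0;
            }
    }

    int main(void)
    {
            struct node b = { .pending = 5 };
            struct node a = { .next = &b, .pending = 3 };

            pthread_spin_init(&update_lock, PTHREAD_PROCESS_PRIVATE);
            updated_head = &a;
            flush_all();
            printf("a.total=%ld b.total=%ld\n", a.total, b.total);
            return 0;
    }

The kernel version differs in that it pops one cgroup at a time with cgroup_rstat_cpu_pop_updated() and threads the results onto rstat_flush_next, but the locking structure is the same: the raw spinlock covers only list construction, while the flush itself is serialized by cgroup_rstat_lock.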
