
Commit 7e019dc

compudj authored and Peter Zijlstra committed
sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
commit 223baf9 ("sched: Fix performance regression introduced by mm_cid") introduced a per-mm/cpu current concurrency id (mm_cid), which keeps a reference to the concurrency id allocated for each CPU. This reference expires shortly after a 100ms delay. These per-CPU references keep the per-mm-cid data cache-local in situations where threads are running at least once on each CPU within each 100ms window, thus keeping the per-cpu reference alive. However, intermittent workloads behaving in bursts spaced by more than 100ms on each CPU exhibit bad cache locality and degraded performance compared to purely per-cpu data indexing, because concurrency IDs are allocated over various CPUs and cores, therefore losing cache locality of the associated data.

Introduce the following changes to improve per-mm-cid cache locality:

- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep track of which mm_cid value was last used, and use it as a hint to attempt re-allocating the same concurrency ID the next time this mm/cpu needs to allocate a concurrency ID,

- Add a per-mm CPUs allowed mask, which keeps track of the union of CPUs allowed for all threads belonging to this mm. This cpumask is only set during the lifetime of the mm, never cleared, so it represents the union of all the CPUs allowed since the beginning of the mm lifetime (note that the mm_cpumask() is really arch-specific and tailored to the TLB flush needs, and is thus _not_ a viable approach for this),

- Add a per-mm nr_cpus_allowed to keep track of the weight of the per-mm CPUs allowed mask (for fast access),

- Add a per-mm max_nr_cid to keep track of the highest number of concurrency IDs allocated for the mm. This is used for expanding the concurrency ID allocation within the upper bound defined by:

      min(mm->nr_cpus_allowed, mm->mm_users)

  When the next unused CID value reaches this threshold, stop trying to expand the cid allocation and use the first available cid value instead. Spreading allocation to use all the cid values within the range

      [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]

  improves cache locality while preserving mm_cid compactness within the expected user limits,

- In __mm_cid_try_get, only return cid values within the range [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This prevents allocating cids above the number of allowed cpus in rare scenarios where cid allocation races with a concurrent remote-clear of the per-mm/cpu cid. This improvement is made possible by the addition of the per-mm CPUs allowed mask,

- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than t->nr_cpus_allowed. This criterion was really meant to compare the number of mm->mm_users to the number of CPUs allowed for the entire mm. Therefore, the prior comparison worked fine when all threads shared the same CPUs allowed mask, but not so much in scenarios where those threads have different masks (e.g. each thread pinned to a single CPU). This improvement is made possible by the addition of the per-mm CPUs allowed mask.

* Benchmarks

Each thread increments 16kB worth of 8-bit integers in bursts, with a configurable delay between each thread's execution. Each thread runs one after the other (no threads run concurrently). The order of thread execution in the sequence is random. The thread execution sequence begins again after all threads have executed. The 16kB areas are allocated with rseq_mempool and indexed by either cpu_id, mm_cid (not cache-local), or cache-local mm_cid. Each thread is pinned to its own core.
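To make the indexing concrete, here is a minimal userspace sketch of one benchmark burst. This is not the actual rseq_mempool test program: the function and variable names (burst, areas, rs, AREA_SIZE) and the flat buffer layout are illustrative, and reading the mm_cid field from the thread's registered struct rseq area assumes a kernel with rseq mm_cid support (Linux >= 6.3).

#include <stdint.h>
#include <stddef.h>
#include <linux/rseq.h>		/* UAPI struct rseq, which carries the mm_cid field */

#define AREA_SIZE	(16 * 1024)	/* 16kB of 8-bit counters per concurrency ID */

/*
 * One burst: increment every byte of the area selected by this thread's
 * current concurrency ID. "rs" is the thread's registered rseq area
 * (e.g. located through glibc's __rseq_offset); "areas" is a flat buffer
 * of nr_ids * AREA_SIZE bytes. Both stand in for the benchmark's real
 * rseq_mempool-backed allocation.
 */
static void burst(const volatile struct rseq *rs, uint8_t *areas)
{
	uint8_t *area = areas + (size_t)rs->mm_cid * AREA_SIZE;
	size_t i;

	for (i = 0; i < AREA_SIZE; i++)
		area[i]++;
}

The "cache-local mm_cid" results below correspond to running this kind of loop on a kernel with this patch applied, where a thread tends to get the same mm_cid, and therefore the same cache-warm 16kB area, back on its next burst.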
Testing configurations:

  8-core/1-L3:       Use 8 cores within a single L3
  24-core/24-L3:     Use 24 cores, 1 core per L3
  192-core/24-L3:    Use 192 cores (all cores in the system)
  384-thread/24-L3:  Use 384 HW threads (all HW threads in the system)

Intermittent workload delays between threads: 200ms, 10ms.

Hardware:

  CPU(s):                384
  On-line CPU(s) list:   0-383
  Vendor ID:             AuthenticAMD
  Model name:            AMD EPYC 9654 96-Core Processor
  Thread(s) per core:    2
  Core(s) per socket:    96
  Socket(s):             2
  Caches (sum of all):
    L1d:                 6 MiB (192 instances)
    L1i:                 6 MiB (192 instances)
    L2:                  192 MiB (192 instances)
    L3:                  768 MiB (24 instances)

Each result is an average of 5 test runs. The cache-local speedup is calculated as: (mm_cid time) / (cache-local mm_cid time).

Intermittent workload delay: 200ms

                      per-cpu    mm_cid    cache-local mm_cid    cache-local speedup
                        (ns)      (ns)            (ns)
  8-core/1-L3           1374     19289           1336                  14.4x
  24-core/24-L3         2423     26721           1594                  16.7x
  192-core/24-L3        2291     15826           2153                   7.3x
  384-thread/24-L3      1874     13234           1907                   6.9x

Intermittent workload delay: 10ms

                      per-cpu    mm_cid    cache-local mm_cid    cache-local speedup
                        (ns)      (ns)            (ns)
  8-core/1-L3            662       756            686                   1.1x
  24-core/24-L3         1378      3648           1035                   3.5x
  192-core/24-L3        1439     10833           1482                   7.3x
  384-thread/24-L3      1503     10570           1556                   6.8x

[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs" patch series with a simpler and more general approach. ]

[ This patch applies on top of v6.12-rc1. ]

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
1 parent 8e113df commit 7e019dc

File tree

5 files changed: +112 additions, -34 deletions

fs/exec.c

Lines changed: 1 addition & 1 deletion
@@ -990,7 +990,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
-	mm_init_cid(mm);
+	mm_init_cid(mm, tsk);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for

include/linux/mm_types.h

Lines changed: 63 additions & 9 deletions
@@ -782,6 +782,7 @@ struct vm_area_struct {
 struct mm_cid {
 	u64 time;
 	int cid;
+	int recent_cid;
 };
 #endif

@@ -852,6 +853,27 @@ struct mm_struct {
 		 * When the next mm_cid scan is due (in jiffies).
 		 */
 		unsigned long mm_cid_next_scan;
+		/**
+		 * @nr_cpus_allowed: Number of CPUs allowed for mm.
+		 *
+		 * Number of CPUs allowed in the union of all mm's
+		 * threads allowed CPUs.
+		 */
+		unsigned int nr_cpus_allowed;
+		/**
+		 * @max_nr_cid: Maximum number of concurrency IDs allocated.
+		 *
+		 * Track the highest number of concurrency IDs allocated for the
+		 * mm.
+		 */
+		atomic_t max_nr_cid;
+		/**
+		 * @cpus_allowed_lock: Lock protecting mm cpus_allowed.
+		 *
+		 * Provide mutual exclusion for mm cpus_allowed and
+		 * mm nr_cpus_allowed updates.
+		 */
+		raw_spinlock_t cpus_allowed_lock;
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes; /* size of all page tables */

@@ -1170,36 +1192,53 @@ static inline int mm_cid_clear_lazy_put(int cid)
 	return cid & ~MM_CID_LAZY_PUT;
 }

+/*
+ * mm_cpus_allowed: Union of all mm's threads allowed CPUs.
+ */
+static inline cpumask_t *mm_cpus_allowed(struct mm_struct *mm)
+{
+	unsigned long bitmap = (unsigned long)mm;
+
+	bitmap += offsetof(struct mm_struct, cpu_bitmap);
+	/* Skip cpu_bitmap */
+	bitmap += cpumask_size();
+	return (struct cpumask *)bitmap;
+}
+
 /* Accessor for struct mm_struct's cidmask. */
 static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 {
-	unsigned long cid_bitmap = (unsigned long)mm;
+	unsigned long cid_bitmap = (unsigned long)mm_cpus_allowed(mm);

-	cid_bitmap += offsetof(struct mm_struct, cpu_bitmap);
-	/* Skip cpu_bitmap */
+	/* Skip mm_cpus_allowed */
 	cid_bitmap += cpumask_size();
 	return (struct cpumask *)cid_bitmap;
 }

-static inline void mm_init_cid(struct mm_struct *mm)
+static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
 {
 	int i;

 	for_each_possible_cpu(i) {
 		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);

 		pcpu_cid->cid = MM_CID_UNSET;
+		pcpu_cid->recent_cid = MM_CID_UNSET;
 		pcpu_cid->time = 0;
 	}
+	mm->nr_cpus_allowed = p->nr_cpus_allowed;
+	atomic_set(&mm->max_nr_cid, 0);
+	raw_spin_lock_init(&mm->cpus_allowed_lock);
+	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
 	cpumask_clear(mm_cidmask(mm));
 }

-static inline int mm_alloc_cid_noprof(struct mm_struct *mm)
+static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
 {
 	mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid);
 	if (!mm->pcpu_cid)
 		return -ENOMEM;
-	mm_init_cid(mm);
+	mm_init_cid(mm, p);
 	return 0;
 }
 #define mm_alloc_cid(...)	alloc_hooks(mm_alloc_cid_noprof(__VA_ARGS__))

@@ -1212,16 +1251,31 @@ static inline void mm_destroy_cid(struct mm_struct *mm)

 static inline unsigned int mm_cid_size(void)
 {
-	return cpumask_size();
+	return 2 * cpumask_size();	/* mm_cpus_allowed(), mm_cidmask(). */
+}
+
+static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
+{
+	struct cpumask *mm_allowed = mm_cpus_allowed(mm);
+
+	if (!mm)
+		return;
+	/* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
+	raw_spin_lock(&mm->cpus_allowed_lock);
+	cpumask_or(mm_allowed, mm_allowed, cpumask);
+	WRITE_ONCE(mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
+	raw_spin_unlock(&mm->cpus_allowed_lock);
 }
 #else /* CONFIG_SCHED_MM_CID */
-static inline void mm_init_cid(struct mm_struct *mm) { }
-static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
+static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
+static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
 static inline void mm_destroy_cid(struct mm_struct *mm) { }
+
 static inline unsigned int mm_cid_size(void)
 {
 	return 0;
 }
+static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
 #endif /* CONFIG_SCHED_MM_CID */

 struct mmu_gather;

kernel/fork.c

Lines changed: 1 addition & 1 deletion
@@ -1298,7 +1298,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (init_new_context(p, mm))
 		goto fail_nocontext;

-	if (mm_alloc_cid(mm))
+	if (mm_alloc_cid(mm, p))
 		goto fail_cid;

 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,

kernel/sched/core.c

Lines changed: 13 additions & 9 deletions
@@ -2696,6 +2696,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
 		put_prev_task(rq, p);

 	p->sched_class->set_cpus_allowed(p, ctx);
+	mm_set_cpus_allowed(p->mm, ctx->new_mask);

 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);

@@ -10243,6 +10244,7 @@ int __sched_mm_cid_migrate_from_try_steal_cid(struct rq *src_rq,
 	 */
 	if (!try_cmpxchg(&src_pcpu_cid->cid, &lazy_cid, MM_CID_UNSET))
 		return -1;
+	WRITE_ONCE(src_pcpu_cid->recent_cid, MM_CID_UNSET);
 	return src_cid;
 }

@@ -10255,7 +10257,8 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
 {
 	struct mm_cid *src_pcpu_cid, *dst_pcpu_cid;
 	struct mm_struct *mm = t->mm;
-	int src_cid, dst_cid, src_cpu;
+	int src_cid, src_cpu;
+	bool dst_cid_is_set;
 	struct rq *src_rq;

 	lockdep_assert_rq_held(dst_rq);

@@ -10272,19 +10275,19 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
 	 * allocation closest to 0 in cases where few threads migrate around
 	 * many CPUs.
 	 *
-	 * If destination cid is already set, we may have to just clear
-	 * the src cid to ensure compactness in frequent migrations
-	 * scenarios.
+	 * If destination cid or recent cid is already set, we may have
+	 * to just clear the src cid to ensure compactness in frequent
+	 * migrations scenarios.
 	 *
 	 * It is not useful to clear the src cid when the number of threads is
 	 * greater or equal to the number of allowed CPUs, because user-space
 	 * can expect that the number of allowed cids can reach the number of
 	 * allowed CPUs.
 	 */
 	dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
-	dst_cid = READ_ONCE(dst_pcpu_cid->cid);
-	if (!mm_cid_is_unset(dst_cid) &&
-	    atomic_read(&mm->mm_users) >= t->nr_cpus_allowed)
+	dst_cid_is_set = !mm_cid_is_unset(READ_ONCE(dst_pcpu_cid->cid)) ||
+			 !mm_cid_is_unset(READ_ONCE(dst_pcpu_cid->recent_cid));
+	if (dst_cid_is_set && atomic_read(&mm->mm_users) >= READ_ONCE(mm->nr_cpus_allowed))
 		return;
 	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, src_cpu);
 	src_rq = cpu_rq(src_cpu);

@@ -10295,13 +10298,14 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
							     src_cid);
 	if (src_cid == -1)
 		return;
-	if (!mm_cid_is_unset(dst_cid)) {
+	if (dst_cid_is_set) {
 		__mm_cid_put(mm, src_cid);
 		return;
 	}
 	/* Move src_cid to dst cpu. */
 	mm_cid_snapshot_time(dst_rq, mm);
 	WRITE_ONCE(dst_pcpu_cid->cid, src_cid);
+	WRITE_ONCE(dst_pcpu_cid->recent_cid, src_cid);
 }

 static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_cid,

@@ -10540,7 +10544,7 @@ void sched_mm_cid_after_execve(struct task_struct *t)
		 * Matches barrier in sched_mm_cid_remote_clear_old().
		 */
		smp_mb();
-		t->last_mm_cid = t->mm_cid = mm_cid_get(rq, mm);
+		t->last_mm_cid = t->mm_cid = mm_cid_get(rq, t, mm);
 	}
 	rseq_set_notify_resume(t);
 }

kernel/sched/sched.h

Lines changed: 34 additions & 14 deletions
@@ -3596,24 +3596,41 @@ static inline void mm_cid_put(struct mm_struct *mm)
 	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
 }

-static inline int __mm_cid_try_get(struct mm_struct *mm)
+static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 {
-	struct cpumask *cpumask;
-	int cid;
+	struct cpumask *cidmask = mm_cidmask(mm);
+	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
+	int cid = __this_cpu_read(pcpu_cid->recent_cid);

-	cpumask = mm_cidmask(mm);
+	/* Try to re-use recent cid. This improves cache locality. */
+	if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+		return cid;
+	/*
+	 * Expand cid allocation if the maximum number of concurrency
+	 * IDs allocated (max_nr_cid) is below the number cpus allowed
+	 * and number of threads. Expanding cid allocation as much as
+	 * possible improves cache locality.
+	 */
+	cid = atomic_read(&mm->max_nr_cid);
+	while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
+		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
+			continue;
+		if (!cpumask_test_and_set_cpu(cid, cidmask))
+			return cid;
+	}
 	/*
+	 * Find the first available concurrency id.
 	 * Retry finding first zero bit if the mask is temporarily
 	 * filled. This only happens during concurrent remote-clear
 	 * which owns a cid without holding a rq lock.
 	 */
 	for (;;) {
-		cid = cpumask_first_zero(cpumask);
-		if (cid < nr_cpu_ids)
+		cid = cpumask_first_zero(cidmask);
+		if (cid < READ_ONCE(mm->nr_cpus_allowed))
 			break;
 		cpu_relax();
 	}
-	if (cpumask_test_and_set_cpu(cid, cpumask))
+	if (cpumask_test_and_set_cpu(cid, cidmask))
 		return -1;

 	return cid;

@@ -3631,7 +3648,8 @@ static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
 	WRITE_ONCE(pcpu_cid->time, rq->clock);
 }

-static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
+static inline int __mm_cid_get(struct rq *rq, struct task_struct *t,
+			       struct mm_struct *mm)
 {
 	int cid;

@@ -3641,13 +3659,13 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	 * guarantee forward progress.
 	 */
 	if (!READ_ONCE(use_cid_lock)) {
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		if (cid >= 0)
 			goto end;
 		raw_spin_lock(&cid_lock);
 	} else {
 		raw_spin_lock(&cid_lock);
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		if (cid >= 0)
 			goto unlock;
 	}

@@ -3667,7 +3685,7 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	 * all newcoming allocations observe the use_cid_lock flag set.
 	 */
 	do {
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		cpu_relax();
 	} while (cid < 0);
 	/*

@@ -3684,7 +3702,8 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	return cid;
 }

-static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
+static inline int mm_cid_get(struct rq *rq, struct task_struct *t,
+			     struct mm_struct *mm)
 {
 	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
 	struct cpumask *cpumask;

@@ -3701,8 +3720,9 @@ static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
 		if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
 			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
 	}
-	cid = __mm_cid_get(rq, mm);
+	cid = __mm_cid_get(rq, t, mm);
 	__this_cpu_write(pcpu_cid->cid, cid);
+	__this_cpu_write(pcpu_cid->recent_cid, cid);

 	return cid;
 }

@@ -3755,7 +3775,7 @@ static inline void switch_mm_cid(struct rq *rq,
 		prev->mm_cid = -1;
 	}
 	if (next->mm_cid_active)
-		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next->mm);
+		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
 }

 #else /* !CONFIG_SCHED_MM_CID: */
