
Commit 1f82822

shakeelb authored and torvalds committed
memcg: flush lruvec stats in the refault
Prior to commit 7e1c0d6 ("memcg: switch lruvec stats to rstat") and commit aa48e47 ("memcg: infrastructure to flush memcg stats"), each lruvec memcg stat could be off by (nr_cgroups * nr_cpus * 32) at worst, and for an unbounded amount of time. Commit aa48e47 moved the lruvec stats to the rstat infrastructure, and commit 7e1c0d6 bounded the error for all the lruvec stats to (nr_cpus * 32) at worst, for at most 2 seconds. More specifically, it decoupled the number of stats and the number of cgroups from the error rate.

However, this reduction in error comes at the cost of triggering the slowpath of stats updates more frequently. Previously, in the slowpath the kernel added the stats up the memcg tree. After aa48e47, the kernel triggers an async lruvec stats flush through queue_work(). This caused regression reports from the 0day kernel bot [1] as well as from the Phoronix Test Suite [2].

We tried two options to fix the regression:

1) Increase the threshold that triggers the slowpath in the lruvec stats update codepath from 32 to 512.

2) Remove the slowpath from the lruvec stats update codepath and instead flush the stats in the page refault codepath. The assumption is that the kernel flushes the stats in a timely manner, so the update tree stays small enough in the refault codepath to avoid a performance impact.

Following are the results of the will-it-scale/page_fault[1|2|3] benchmark on four settings: (1) 5.15-rc1 as baseline, (2) 5.15-rc1 with aa48e47 and 7e1c0d6 reverted, (3) 5.15-rc1 with option-1, (4) 5.15-rc1 with option-2.

test     (1)      (2)               (3)               (4)
pg_f1    368563   406277 (10.23%)   399693 (8.44%)    416398 (12.97%)
pg_f2    338399   372133 (9.96%)    369180 (9.09%)    381024 (12.59%)
pg_f3    500853   575399 (14.88%)   570388 (13.88%)   576083 (15.02%)

From the above results, option-2 not only resolves the regression but also improves performance for at least these benchmarks.

Feng Tang (Intel) ran the aim7 benchmark with these two options and confirmed that option-1 reduces the regression but option-2 removes it. Michael Larabel (Phoronix) ran multiple benchmarks with these options and reported the results at [3]; for most benchmarks, option-2 removes the regression introduced by commit aa48e47 ("memcg: infrastructure to flush memcg stats").

Based on the experiment results, this patch proposes option-2 as the solution to resolve the regression.

Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1]
Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2]
Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3]
Fixes: aa48e47 ("memcg: infrastructure to flush memcg stats")
Signed-off-by: Shakeel Butt <[email protected]>
Tested-by: Michael Larabel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
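To make the error bounds described in the commit message concrete, here is an illustrative worked example; the CPU and cgroup counts are hypothetical and not taken from the regression reports. On a machine with 64 CPUs running 100 memory cgroups, the pre-rstat worst case allows a single lruvec stat to drift by up to

    nr_cgroups * nr_cpus * 32 = 100 * 64 * 32 = 204,800 updates

with no bound on how long that drift can persist, while the rstat-based scheme bounds the drift of the same stat to

    nr_cpus * 32 = 64 * 32 = 2,048 updates

for at most 2 seconds, independent of the number of cgroups.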
1 parent 58e2cf5 commit 1f82822

2 files changed: +1, -10 lines


mm/memcontrol.c: 0 additions, 10 deletions

@@ -106,9 +106,6 @@ static bool do_memsw_account(void)
 /* memcg and lruvec stats flushing */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static void flush_memcg_stats_work(struct work_struct *w);
-static DECLARE_WORK(stats_flush_work, flush_memcg_stats_work);
-static DEFINE_PER_CPU(unsigned int, stats_flush_threshold);
 static DEFINE_SPINLOCK(stats_flush_lock);
 
 #define THRESHOLDS_EVENTS_TARGET 128
@@ -682,8 +679,6 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
-	if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
-		queue_work(system_unbound_wq, &stats_flush_work);
 }
 
 /**
@@ -5361,11 +5356,6 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
 }
 
-static void flush_memcg_stats_work(struct work_struct *w)
-{
-	mem_cgroup_flush_stats();
-}
-
 static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);

mm/workingset.c: 1 addition, 0 deletions

@@ -352,6 +352,7 @@ void workingset_refault(struct page *page, void *shadow)
 
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
 
+	mem_cgroup_flush_stats();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if
