
Commit 8aa4206

Yu Zhao authored and akpm00 committed
mm/mglru: respect min_ttl_ms with memcgs
While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru:
try to stop at high watermarks"), it was discovered that the memcg LRU can
breach the thrashing protection imposed by min_ttl_ms.

Before the memcg LRU:
  kswapd()
    shrink_node_memcgs()
      mem_cgroup_iter()
        inc_max_seq()  // always hit a different memcg
    lru_gen_age_node()
      mem_cgroup_iter()
        check the timestamp of the oldest generation

After the memcg LRU:
  kswapd()
    shrink_many()
        restart:
          iterate the memcg LRU:
            inc_max_seq()  // occasionally hit the same memcg
            if raced with lru_gen_rotate_memcg():
              goto restart
    lru_gen_age_node()
      mem_cgroup_iter()
        check the timestamp of the oldest generation

Specifically, when the restart happens in shrink_many(), it needs to stick
with the (memcg LRU) generation it began with. In other words, it should
neither re-read memcg_lru->seq nor age an lruvec of a different generation.
Otherwise it can hit the same memcg multiple times without giving
lru_gen_age_node() a chance to check the timestamp of that memcg's oldest
generation (against min_ttl_ms).

[1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: e4dde56 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <[email protected]>
Tested-by: T.J. Mercier <[email protected]>
Cc: Charan Teja Kalla <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jaroslav Pulchart <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
1 parent 5095a2b commit 8aa4206

File tree: 2 files changed (+33 −27 lines)


include/linux/mmzone.h

Lines changed: 17 additions & 13 deletions
@@ -505,33 +505,37 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
  * the old generation, is incremented when all its bins become empty.
  *
  * There are four operations:
- * 1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in its
+ * 1. MEMCG_LRU_HEAD, which moves a memcg to the head of a random bin in its
  *    current generation (old or young) and updates its "seg" to "head";
- * 2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in its
+ * 2. MEMCG_LRU_TAIL, which moves a memcg to the tail of a random bin in its
  *    current generation (old or young) and updates its "seg" to "tail";
- * 3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in the old
+ * 3. MEMCG_LRU_OLD, which moves a memcg to the head of a random bin in the old
  *    generation, updates its "gen" to "old" and resets its "seg" to "default";
- * 4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin in the
+ * 4. MEMCG_LRU_YOUNG, which moves a memcg to the tail of a random bin in the
  *    young generation, updates its "gen" to "young" and resets its "seg" to
  *    "default".
  *
  * The events that trigger the above operations are:
  * 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
- * 2. The first attempt to reclaim an memcg below low, which triggers
+ * 2. The first attempt to reclaim a memcg below low, which triggers
  *    MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim an memcg below reclaimable size threshold,
+ * 3. The first attempt to reclaim a memcg below reclaimable size threshold,
  *    which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim an memcg below reclaimable size threshold,
+ * 4. The second attempt to reclaim a memcg below reclaimable size threshold,
  *    which triggers MEMCG_LRU_YOUNG;
- * 5. Attempting to reclaim an memcg below min, which triggers MEMCG_LRU_YOUNG;
+ * 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
  * 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
- * 7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
+ * 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
  *
- * Note that memcg LRU only applies to global reclaim, and the round-robin
- * incrementing of their max_seq counters ensures the eventual fairness to all
- * eligible memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * Notes:
+ * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
+ *    of their max_seq counters ensures the eventual fairness to all eligible
+ *    memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * 2. There are only two valid generations: old (seq) and young (seq+1).
+ *    MEMCG_NR_GENS is set to three so that when reading the generation counter
+ *    locklessly, a stale value (seq-1) does not wraparound to young.
  */
-#define MEMCG_NR_GENS	2
+#define MEMCG_NR_GENS	3
 #define MEMCG_NR_BINS	8
 
 struct lru_gen_memcg {

mm/vmscan.c

Lines changed: 16 additions & 14 deletions
@@ -4089,6 +4089,9 @@ static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
 	else
 		VM_WARN_ON_ONCE(true);
 
+	WRITE_ONCE(lruvec->lrugen.seg, seg);
+	WRITE_ONCE(lruvec->lrugen.gen, new);
+
 	hlist_nulls_del_rcu(&lruvec->lrugen.list);
 
 	if (op == MEMCG_LRU_HEAD || op == MEMCG_LRU_OLD)
@@ -4099,9 +4102,6 @@ static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
 	pgdat->memcg_lru.nr_memcgs[old]--;
 	pgdat->memcg_lru.nr_memcgs[new]++;
 
-	lruvec->lrugen.gen = new;
-	WRITE_ONCE(lruvec->lrugen.seg, seg);
-
 	if (!pgdat->memcg_lru.nr_memcgs[old] && old == get_memcg_gen(pgdat->memcg_lru.seq))
 		WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
 
@@ -4124,11 +4124,11 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg)
 
 		gen = get_memcg_gen(pgdat->memcg_lru.seq);
 
+		lruvec->lrugen.gen = gen;
+
 		hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[gen][bin]);
 		pgdat->memcg_lru.nr_memcgs[gen]++;
 
-		lruvec->lrugen.gen = gen;
-
 		spin_unlock_irq(&pgdat->memcg_lru.lock);
 	}
 }
@@ -4635,7 +4635,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 	DEFINE_MAX_SEQ(lruvec);
 
 	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
-		return 0;
+		return -1;
 
 	if (!should_run_aging(lruvec, max_seq, sc, can_swap, &nr_to_scan))
 		return nr_to_scan;
@@ -4710,7 +4710,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		cond_resched();
 	}
 
-	/* whether try_to_inc_max_seq() was successful */
+	/* whether this lruvec should be rotated */
 	return nr_to_scan < 0;
 }
 
@@ -4764,13 +4764,13 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 	struct lruvec *lruvec;
 	struct lru_gen_folio *lrugen;
 	struct mem_cgroup *memcg;
-	const struct hlist_nulls_node *pos;
+	struct hlist_nulls_node *pos;
 
+	gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
 	bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
 restart:
 	op = 0;
 	memcg = NULL;
-	gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
 
 	rcu_read_lock();
 
@@ -4781,6 +4781,10 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 		}
 
 		mem_cgroup_put(memcg);
+		memcg = NULL;
+
+		if (gen != READ_ONCE(lrugen->gen))
+			continue;
 
 		lruvec = container_of(lrugen, struct lruvec, lrugen);
 		memcg = lruvec_memcg(lruvec);
@@ -4865,16 +4869,14 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
 	if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH)
 		return;
 	/*
-	 * Determine the initial priority based on ((total / MEMCG_NR_GENS) >>
-	 * priority) * reclaimed_to_scanned_ratio = nr_to_reclaim, where the
-	 * estimated reclaimed_to_scanned_ratio = inactive / total.
+	 * Determine the initial priority based on
+	 * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim,
+	 * where reclaimed_to_scanned_ratio = inactive / total.
	 */
 	reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE);
 	if (get_swappiness(lruvec, sc))
 		reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON);
 
-	reclaimable /= MEMCG_NR_GENS;
-
 	/* round down reclaimable and round up sc->nr_to_reclaim */
 	priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1);
 