Commit f183324

rgushchin authored and dennisszhou committed
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):

In our [Facebook] production experience the percpu memory allocator
sometimes struggles to return memory to the system. A typical example is
the creation of several thousand memory cgroups (each has several chunks
of percpu data used for vmstats, vmevents, ref counters etc). Deleting
and completely releasing these cgroups doesn't always shrink the percpu
memory, so sometimes several GB of memory are wasted.

The underlying problem is fragmentation: to release an underlying chunk,
all percpu allocations in it must be released first. The percpu allocator
tends to top up chunks to improve utilization. This means new small-ish
allocations (e.g. percpu ref counters) are placed onto almost-filled
old-ish chunks, effectively pinning them in memory.

This patchset solves this problem by implementing partial depopulation of
percpu chunks: chunks with many empty pages are asynchronously
depopulated and the pages are returned to the system.

To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    mkdir percpu_test/cg_"${i}"
    for j in `seq 1 10`; do
        mkdir percpu_test/cg_"${i}"_"${j}"
    done
done

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    for j in `seq 1 10`; do
        rmdir percpu_test/cg_"${i}"_"${j}"
    done
done

sleep 10

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes 10 out of every 11. It prints
the initial size of the percpu memory, the size after creating all
cgroups and the size after deleting most of them.

Results:
  vanilla:
    ./percpu_test.sh
    Percpu:             7488 kB
    Percpu:           481152 kB
    Percpu:           481152 kB

  with this patchset applied:
    ./percpu_test.sh
    Percpu:             7488 kB
    Percpu:           481408 kB
    Percpu:           135552 kB

The total size of the percpu memory was reduced by more than 3.5 times.

This patch:

This patch implements partial depopulation of percpu chunks.

As of now, a chunk can be depopulated only as part of its final
destruction, when there are no more outstanding allocations. However, to
minimize memory waste it can be useful to depopulate a partially filled
chunk when a small number of outstanding allocations prevents the chunk
from being fully reclaimed.

This patch implements the following depopulation process: it scans over
the chunk pages, looks for a range of empty and populated pages and
performs the depopulation. To avoid races with new allocations, the chunk
is isolated first. After the depopulation the chunk is sidelined to a
special list or freed. New allocations prefer active chunks to sidelined
chunks. If a sidelined chunk is used, it is reintegrated into the active
lists.

The depopulation is scheduled on the free path if the chunk meets all of
the following:
  1) has more than 1/4 of total pages free and populated
  2) the system has enough free percpu pages aside of this chunk
  3) isn't the reserved chunk
  4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from
pcpu_nr_empty_pop_pages. The chunk is placed back onto the
to_depopulate_slot whenever it meets these qualifications.

pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can satisfy a new allocation, sidelined chunks are checked
first before creating a new chunk.

Signed-off-by: Roman Gushchin <[email protected]>
Co-developed-by: Dennis Zhou <[email protected]>
Signed-off-by: Dennis Zhou <[email protected]>
Tested-by: Pratik Sampat <[email protected]>
Signed-off-by: Dennis Zhou <[email protected]>
1 parent 1c29a3c commit f183324

5 files changed, +211 -20 lines changed

mm/percpu-internal.h

Lines changed: 4 additions & 0 deletions
@@ -67,6 +67,8 @@ struct pcpu_chunk {
 
 	void			*data;		/* chunk data */
 	bool			immutable;	/* no [de]population allowed */
+	bool			isolated;	/* isolated from active chunk
+						   slots */
 	int			start_offset;	/* the overlap with the previous
 						   region to have a page aligned
 						   base_addr */
@@ -87,6 +89,8 @@ extern spinlock_t pcpu_lock;
 
 extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
+extern int pcpu_sidelined_slot;
+extern int pcpu_to_depopulate_slot;
 extern int pcpu_nr_empty_pop_pages[];
 
 extern struct pcpu_chunk *pcpu_first_chunk;

mm/percpu-km.c

Lines changed: 5 additions & 0 deletions
@@ -118,3 +118,8 @@ static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai)
 
 	return 0;
 }
+
+static bool pcpu_should_reclaim_chunk(struct pcpu_chunk *chunk)
+{
+	return false;
+}

mm/percpu-stats.c

Lines changed: 7 additions & 5 deletions
@@ -219,13 +219,15 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 		for (slot = 0; slot < pcpu_nr_slots; slot++) {
 			list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
 					    list) {
-				if (chunk == pcpu_first_chunk) {
+				if (chunk == pcpu_first_chunk)
 					seq_puts(m, "Chunk: <- First Chunk\n");
-					chunk_map_stats(m, chunk, buffer);
-				} else {
+				else if (slot == pcpu_to_depopulate_slot)
+					seq_puts(m, "Chunk (to_depopulate)\n");
+				else if (slot == pcpu_sidelined_slot)
+					seq_puts(m, "Chunk (sidelined):\n");
+				else
 					seq_puts(m, "Chunk:\n");
-					chunk_map_stats(m, chunk, buffer);
-				}
+				chunk_map_stats(m, chunk, buffer);
 			}
 		}
 	}

mm/percpu-vm.c

Lines changed: 30 additions & 0 deletions
@@ -377,3 +377,33 @@ static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai)
 	/* no extra restriction */
 	return 0;
 }
+
+/**
+ * pcpu_should_reclaim_chunk - determine if a chunk should go into reclaim
+ * @chunk: chunk of interest
+ *
+ * This is the entry point for percpu reclaim. If a chunk qualifies, it is then
+ * isolated and managed in separate lists at the back of pcpu_slot: sidelined
+ * and to_depopulate respectively. The to_depopulate list holds chunks slated
+ * for depopulation. They no longer contribute to pcpu_nr_empty_pop_pages once
+ * they are on this list. Once depopulated, they are moved onto the sidelined
+ * list which enables them to be pulled back in for allocation if no other chunk
+ * can suffice the allocation.
+ */
+static bool pcpu_should_reclaim_chunk(struct pcpu_chunk *chunk)
+{
+	/* do not reclaim either the first chunk or reserved chunk */
+	if (chunk == pcpu_first_chunk || chunk == pcpu_reserved_chunk)
+		return false;
+
+	/*
+	 * If it is isolated, it may be on the sidelined list so move it back to
+	 * the to_depopulate list. If we hit at least 1/4 pages empty pages AND
+	 * there is no system-wide shortage of empty pages aside from this
+	 * chunk, move it to the to_depopulate list.
+	 */
+	return ((chunk->isolated && chunk->nr_empty_pop_pages) ||
+		(pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
+		 PCPU_EMPTY_POP_PAGES_HIGH + chunk->nr_empty_pop_pages &&
+		 chunk->nr_empty_pop_pages >= chunk->nr_pages / 4));
+}
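
The reclaim condition in pcpu_should_reclaim_chunk() is easier to check with concrete numbers. The following is a minimal userspace sketch, not kernel code: `struct chunk_model`, `should_reclaim()` and the threshold value `EMPTY_POP_PAGES_HIGH` are hypothetical stand-ins chosen for illustration, and the first/reserved chunk checks are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed threshold for illustration only; the kernel defines its own value. */
#define EMPTY_POP_PAGES_HIGH 4

/* Hypothetical, stripped-down stand-in for struct pcpu_chunk. */
struct chunk_model {
	bool isolated;
	int nr_pages;
	int nr_empty_pop_pages;
};

/*
 * Mirrors the boolean expression from the patch. global_empty_pop stands in
 * for pcpu_nr_empty_pop_pages[type].
 */
bool should_reclaim(const struct chunk_model *c, int global_empty_pop)
{
	assert(c->nr_pages > 0 && c->nr_empty_pop_pages >= 0);

	/* An isolated chunk that regained empty populated pages always qualifies. */
	if (c->isolated && c->nr_empty_pop_pages)
		return true;

	/* Otherwise: no global shortage AND at least a quarter of the pages empty. */
	return global_empty_pop > EMPTY_POP_PAGES_HIGH + c->nr_empty_pop_pages &&
	       c->nr_empty_pop_pages >= c->nr_pages / 4;
}
```

With these assumed values, an 8-page chunk holding 2 empty populated pages qualifies once the global count exceeds 6. Note that the comparison is `>=`, so a chunk with exactly a quarter of its pages empty and populated already qualifies.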

mm/percpu.c

Lines changed: 165 additions & 15 deletions
@@ -136,6 +136,8 @@ static int pcpu_nr_units __ro_after_init;
 static int pcpu_atom_size __ro_after_init;
 int pcpu_nr_slots __ro_after_init;
 int pcpu_free_slot __ro_after_init;
+int pcpu_sidelined_slot __ro_after_init;
+int pcpu_to_depopulate_slot __ro_after_init;
 static size_t pcpu_chunk_struct_size __ro_after_init;
 
 /* cpus with the lowest and highest unit addresses */
@@ -562,10 +564,41 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
 {
 	int nslot = pcpu_chunk_slot(chunk);
 
+	/* leave isolated chunks in-place */
+	if (chunk->isolated)
+		return;
+
 	if (oslot != nslot)
 		__pcpu_chunk_move(chunk, nslot, oslot < nslot);
 }
 
+static void pcpu_isolate_chunk(struct pcpu_chunk *chunk)
+{
+	enum pcpu_chunk_type type = pcpu_chunk_type(chunk);
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+
+	lockdep_assert_held(&pcpu_lock);
+
+	if (!chunk->isolated) {
+		chunk->isolated = true;
+		pcpu_nr_empty_pop_pages[type] -= chunk->nr_empty_pop_pages;
+	}
+	list_move(&chunk->list, &pcpu_slot[pcpu_to_depopulate_slot]);
+}
+
+static void pcpu_reintegrate_chunk(struct pcpu_chunk *chunk)
+{
+	enum pcpu_chunk_type type = pcpu_chunk_type(chunk);
+
+	lockdep_assert_held(&pcpu_lock);
+
+	if (chunk->isolated) {
+		chunk->isolated = false;
+		pcpu_nr_empty_pop_pages[type] += chunk->nr_empty_pop_pages;
+		pcpu_chunk_relocate(chunk, -1);
+	}
+}
+
 /*
  * pcpu_update_empty_pages - update empty page counters
  * @chunk: chunk of interest
@@ -578,7 +611,7 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
 static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr)
 {
 	chunk->nr_empty_pop_pages += nr;
-	if (chunk != pcpu_reserved_chunk)
+	if (chunk != pcpu_reserved_chunk && !chunk->isolated)
 		pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr;
 }
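
Taken together, pcpu_isolate_chunk(), pcpu_reintegrate_chunk() and the changed pcpu_update_empty_pages() keep the global counter equal to the sum of empty populated pages over non-isolated chunks. The sketch below models only that accounting in userspace. All names (`chunk_model`, `global_empty_pop`, and the three functions) are hypothetical, and list handling, locking and the reserved chunk are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-chunk fields involved in the accounting. */
struct chunk_model {
	bool isolated;
	int nr_empty_pop_pages;
};

/* Stand-in for pcpu_nr_empty_pop_pages[type]. */
int global_empty_pop;

/* Per-chunk count always changes; the global count only for active chunks. */
void update_empty_pages(struct chunk_model *c, int nr)
{
	c->nr_empty_pop_pages += nr;
	assert(c->nr_empty_pop_pages >= 0);
	if (!c->isolated)
		global_empty_pop += nr;
}

/* Isolation removes the chunk's contribution exactly once. */
void isolate(struct chunk_model *c)
{
	if (!c->isolated) {
		c->isolated = true;
		global_empty_pop -= c->nr_empty_pop_pages;
	}
}

/* Reintegration adds back whatever the chunk holds at that point. */
void reintegrate(struct chunk_model *c)
{
	if (c->isolated) {
		c->isolated = false;
		global_empty_pop += c->nr_empty_pop_pages;
	}
}
```

Because depopulation happens while the chunk is isolated, the pages it gives up never show up as a drop in the global count at that moment; the chunk simply returns with a smaller contribution when it is reintegrated.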

@@ -1778,7 +1811,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 
 restart:
 	/* search through normal chunks */
-	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
+	for (slot = pcpu_size_to_slot(size); slot <= pcpu_free_slot; slot++) {
 		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
 			off = pcpu_find_block_fit(chunk, bits, bit_align,
 						  is_atomic);
@@ -1789,9 +1822,10 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 			}
 
 			off = pcpu_alloc_area(chunk, bits, bit_align, off);
-			if (off >= 0)
+			if (off >= 0) {
+				pcpu_reintegrate_chunk(chunk);
 				goto area_found;
-
+			}
 		}
 	}
 
@@ -1952,10 +1986,13 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
 /**
  * pcpu_balance_free - manage the amount of free chunks
  * @type: chunk type
+ * @empty_only: free chunks only if there are no populated pages
  *
- * Reclaim all fully free chunks except for the first one.
+ * If empty_only is %false, reclaim all fully free chunks regardless of the
+ * number of populated pages. Otherwise, only reclaim chunks that have no
+ * populated pages.
  */
-static void pcpu_balance_free(enum pcpu_chunk_type type)
+static void pcpu_balance_free(enum pcpu_chunk_type type, bool empty_only)
 {
 	LIST_HEAD(to_free);
 	struct list_head *pcpu_slot = pcpu_chunk_list(type);
@@ -1975,7 +2012,8 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
 		if (chunk == list_first_entry(free_head, struct pcpu_chunk, list))
 			continue;
 
-		list_move(&chunk->list, &to_free);
+		if (!empty_only || chunk->nr_empty_pop_pages == 0)
+			list_move(&chunk->list, &to_free);
 	}
 
 	spin_unlock_irq(&pcpu_lock);
@@ -2083,20 +2121,121 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
 	}
 }
 
+/**
+ * pcpu_reclaim_populated - scan over to_depopulate chunks and free empty pages
+ * @type: chunk type
+ *
+ * Scan over chunks in the depopulate list and try to release unused populated
+ * pages back to the system. Depopulated chunks are sidelined to prevent
+ * repopulating these pages unless required. Fully free chunks are reintegrated
+ * and freed accordingly (1 is kept around). If we drop below the empty
+ * populated pages threshold, reintegrate the chunk if it has empty free pages.
+ * Each chunk is scanned in the reverse order to keep populated pages close to
+ * the beginning of the chunk.
+ */
+static void pcpu_reclaim_populated(enum pcpu_chunk_type type)
+{
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+	struct pcpu_chunk *chunk;
+	struct pcpu_block_md *block;
+	int i, end;
+
+	spin_lock_irq(&pcpu_lock);
+
+restart:
+	/*
+	 * Once a chunk is isolated to the to_depopulate list, the chunk is no
+	 * longer discoverable to allocations whom may populate pages. The only
+	 * other accessor is the free path which only returns area back to the
+	 * allocator not touching the populated bitmap.
+	 */
+	while (!list_empty(&pcpu_slot[pcpu_to_depopulate_slot])) {
+		chunk = list_first_entry(&pcpu_slot[pcpu_to_depopulate_slot],
+					 struct pcpu_chunk, list);
+		WARN_ON(chunk->immutable);
+
+		/*
+		 * Scan chunk's pages in the reverse order to keep populated
+		 * pages close to the beginning of the chunk.
+		 */
+		for (i = chunk->nr_pages - 1, end = -1; i >= 0; i--) {
+			/* no more work to do */
+			if (chunk->nr_empty_pop_pages == 0)
+				break;
+
+			/* reintegrate chunk to prevent atomic alloc failures */
+			if (pcpu_nr_empty_pop_pages[type] <
+			    PCPU_EMPTY_POP_PAGES_HIGH) {
+				pcpu_reintegrate_chunk(chunk);
+				goto restart;
+			}
+
+			/*
+			 * If the page is empty and populated, start or
+			 * extend the (i, end) range. If i == 0, decrease
+			 * i and perform the depopulation to cover the last
+			 * (first) page in the chunk.
+			 */
+			block = chunk->md_blocks + i;
+			if (block->contig_hint == PCPU_BITMAP_BLOCK_BITS &&
+			    test_bit(i, chunk->populated)) {
+				if (end == -1)
+					end = i;
+				if (i > 0)
+					continue;
+				i--;
+			}
+
+			/* depopulate if there is an active range */
+			if (end == -1)
+				continue;
+
+			spin_unlock_irq(&pcpu_lock);
+			pcpu_depopulate_chunk(chunk, i + 1, end + 1);
+			cond_resched();
+			spin_lock_irq(&pcpu_lock);
+
+			pcpu_chunk_depopulated(chunk, i + 1, end + 1);
+
+			/* reset the range and continue */
+			end = -1;
+		}
+
+		if (chunk->free_bytes == pcpu_unit_size)
+			pcpu_reintegrate_chunk(chunk);
+		else
+			list_move(&chunk->list,
+				  &pcpu_slot[pcpu_sidelined_slot]);
+	}
+
+	spin_unlock_irq(&pcpu_lock);
+}
+
 /**
  * pcpu_balance_workfn - manage the amount of free chunks and populated pages
  * @work: unused
  *
- * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type.
+ * For each chunk type, manage the number of fully free chunks and the number of
+ * populated pages. An important thing to consider is when pages are freed and
+ * how they contribute to the global counts.
  */
 static void pcpu_balance_workfn(struct work_struct *work)
 {
 	enum pcpu_chunk_type type;
 
+	/*
+	 * pcpu_balance_free() is called twice because the first time we may
+	 * trim pages in the active pcpu_nr_empty_pop_pages which may cause us
+	 * to grow other chunks. This then gives pcpu_reclaim_populated() time
+	 * to move fully free chunks to the active list to be freed if
+	 * appropriate.
+	 */
 	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
 		mutex_lock(&pcpu_alloc_mutex);
-		pcpu_balance_free(type);
+		pcpu_balance_free(type, false);
+		pcpu_reclaim_populated(type);
 		pcpu_balance_populated(type);
+		pcpu_balance_free(type, true);
 		mutex_unlock(&pcpu_alloc_mutex);
 	}
 }
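
The reverse scan in pcpu_reclaim_populated() is compact, in particular the `i--` trick that flushes a range reaching page 0. The sketch below reproduces only the range-finding loop in userspace so it can be traced by hand. It is a simplified model under stated assumptions: `scan_reverse()`, `struct page_range` and the two boolean arrays are hypothetical stand-ins for the block metadata and the populated bitmap, and the early exits (no empty pages left, global count below the threshold) and all locking are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Half-open page range [start, end), as passed to the depopulate calls. */
struct page_range {
	int start;
	int end;
};

/*
 * Walk pages from last to first, collecting maximal runs of pages that are
 * both empty and populated. Each run is recorded in out[] and its pages are
 * marked unpopulated. Returns the number of ranges found.
 */
int scan_reverse(bool *populated, const bool *empty, int nr_pages,
		 struct page_range *out, int max_out)
{
	int i, end, n = 0;

	for (i = nr_pages - 1, end = -1; i >= 0; i--) {
		if (empty[i] && populated[i]) {
			if (end == -1)
				end = i;	/* start a new range */
			if (i > 0)
				continue;	/* keep extending downwards */
			i--;			/* page 0: make i + 1 == 0 below */
		}

		if (end == -1)
			continue;		/* no active range to flush */

		assert(n < max_out);
		out[n].start = i + 1;
		out[n].end = end + 1;
		n++;

		for (int p = i + 1; p <= end; p++)
			populated[p] = false;

		end = -1;			/* reset the range and continue */
	}

	return n;
}
```

For an 8-page chunk where pages 0, 1, 4, 5 and 7 are empty and populated, this model yields the ranges [7, 8), [4, 6) and [0, 2), in that order, which matches the "reverse direction" described in the commit message.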
@@ -2137,15 +2276,22 @@ void free_percpu(void __percpu *ptr)
 
 	pcpu_memcg_free_hook(chunk, off, size);
 
-	/* if there are more than one fully free chunks, wake up grim reaper */
-	if (chunk->free_bytes == pcpu_unit_size) {
+	/*
+	 * If there are more than one fully free chunks, wake up grim reaper.
+	 * If the chunk is isolated, it may be in the process of being
+	 * reclaimed. Let reclaim manage cleaning up of that chunk.
+	 */
+	if (!chunk->isolated && chunk->free_bytes == pcpu_unit_size) {
 		struct pcpu_chunk *pos;
 
 		list_for_each_entry(pos, &pcpu_slot[pcpu_free_slot], list)
 			if (pos != chunk) {
 				need_balance = true;
 				break;
 			}
+	} else if (pcpu_should_reclaim_chunk(chunk)) {
+		pcpu_isolate_chunk(chunk);
+		need_balance = true;
 	}
 
 	trace_percpu_free_percpu(chunk->base_addr, off, ptr);
@@ -2560,11 +2706,15 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	pcpu_stats_save_ai(ai);
 
 	/*
-	 * Allocate chunk slots. The additional last slot is for
-	 * empty chunks.
+	 * Allocate chunk slots. The slots after the active slots are:
+	 *   sidelined_slot - isolated, depopulated chunks
+	 *   free_slot - fully free chunks
+	 *   to_depopulate_slot - isolated, chunks to depopulate
 	 */
-	pcpu_free_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
-	pcpu_nr_slots = pcpu_free_slot + 1;
+	pcpu_sidelined_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
+	pcpu_free_slot = pcpu_sidelined_slot + 1;
+	pcpu_to_depopulate_slot = pcpu_free_slot + 1;
+	pcpu_nr_slots = pcpu_to_depopulate_slot + 1;
 	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
 					  sizeof(pcpu_chunk_lists[0]) *
 					  PCPU_NR_CHUNK_TYPES,
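
The slot ordering set up here interacts with the changed loop bound in pcpu_alloc() (`slot <= pcpu_free_slot`): the allocation search reaches the sidelined and free slots but never the to_depopulate slot. A small userspace sketch of that relationship follows. `struct slot_layout`, `make_layout()` and `alloc_search_visits()` are hypothetical names, and `last_active_slot` stands in for the value of `__pcpu_size_to_slot(pcpu_unit_size)`, which is not computed here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the four slot indices assigned in the patch. */
struct slot_layout {
	int sidelined;
	int free_slot;
	int to_depopulate;
	int nr_slots;
};

/* Same arithmetic as the patch, starting from an assumed last active slot. */
struct slot_layout make_layout(int last_active_slot)
{
	struct slot_layout l;

	assert(last_active_slot >= 0);
	l.sidelined = last_active_slot + 1;
	l.free_slot = l.sidelined + 1;
	l.to_depopulate = l.free_slot + 1;
	l.nr_slots = l.to_depopulate + 1;
	return l;
}

/* Would a search of the form "for (slot = first; slot <= free_slot; ...)" visit slot? */
bool alloc_search_visits(const struct slot_layout *l, int first_slot, int slot)
{
	return slot >= first_slot && slot <= l->free_slot;
}
```

Placing to_depopulate_slot last is what lets a single upper bound on the search loop hide chunks that are queued for depopulation, while sidelined chunks remain reachable after the active slots.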
