Commit 9471f1f

Merge branch 'expand-stack'
This modifies our user mode stack expansion code to always take the mmap_lock for writing before modifying the VM layout.

It's actually something we always technically should have done, but because we didn't strictly need it, we were being lazy ("opportunistic" sounds so much better, doesn't it?) about things, and had this hack in place where we would extend the stack vma in-place without doing the proper locking.

And it worked fine.  We just needed to change vm_start (or, in the case of grow-up stacks, vm_end) and together with some special ad-hoc locking using the anon_vma lock and the mm->page_table_lock, it all was fairly straightforward.

That is, it was all fine until Ruihan Li pointed out that now that the vma layout uses the maple tree code, we *really* don't just change vm_start and vm_end any more, and the locking really is broken.  Oops.

It's not actually all _that_ horrible to fix this once and for all, and do proper locking, but it's a bit painful.  We have basically three different cases of stack expansion, and they all work just a bit differently:

 - the common and obvious case is the page fault handling.  It's actually fairly simple and straightforward, except for the fact that we have something like 24 different versions of it, and you end up in a maze of twisty little passages, all alike.

 - the simplest case is the execve() code that creates a new stack.  There are no real locking concerns because it's all in a private new VM that hasn't been exposed to anybody, but lockdep still can end up unhappy if you get it wrong.

 - and finally, we have GUP and page pinning, which shouldn't really be expanding the stack in the first place, but in addition to execve() we also use it for ptrace().  And debuggers do want to possibly access memory under the stack pointer and thus need to be able to expand the stack as a special case.

None of these cases are exactly complicated, but the page fault case in particular is just repeated slightly differently many many times.
And ia64 in particular has a fairly complicated situation where you can have both a regular grow-down stack _and_ a special grow-up stack for the register backing store.

So to make this slightly more manageable, the bulk of this series is to first create a helper function for the most common page fault case, and convert all the straightforward architectures to it.

Thus the new 'lock_mm_and_find_vma()' helper function, which ends up being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon, loongarch, nios2, sh, sparc32, and xtensa.  So we not only convert more than half the architectures, we now have more shared code and avoid some of those twisty little passages.

And largely due to this common helper function, the full diffstat of this series ends up deleting more lines than it adds.

That still leaves eight architectures (ia64, m68k, microblaze, openrisc, parisc, s390, sparc64 and um) that end up doing 'expand_stack()' manually because they are doing something slightly different from the normal pattern.  Along with the couple of special cases in execve() and GUP.

So there's a couple of patches that first create 'locked' helper versions of the stack expansion functions, so that there's an obvious path forward in the conversion.

The execve() case is then actually pretty simple, and is a nice cleanup from our old "grow-up stacks are special, because at execve time even they grow down".  The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because it's just more straightforward to write out the stack expansion there manually, instead of having get_user_pages_remote() do it for us in some situations but not others and have to worry about locking rules for GUP.
And the final step is then to just convert the remaining odd cases to a new world order where 'expand_stack()' is called with the mmap_lock held for reading, but where it might drop it and upgrade it to a write, only to return with it held for reading (in the success case) or with it completely dropped (in the failure case).

In the process, we remove all the stack expansion from GUP (where dropping the lock wouldn't be ok without special rules anyway), and add it in manually to __access_remote_vm() for ptrace().

Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases.  Everything else here felt pretty straightforward, but the ia64 rules for stack expansion are really quite odd and very different from everything else.  Also thanks to Vegard Nossum who caught me getting one of those odd conditions entirely the wrong way around.

Anyway, I think I want to actually move all the stack expansion code to a whole new file of its own, rather than have it split up between mm/mmap.c and mm/memory.c, but since this will have to be backported to the initial maple tree vma introduction anyway, I tried to keep the patches _fairly_ minimal.

Also, while I don't think it's valid to expand the stack from GUP, the final patch in here is a "warn if some crazy GUP user wants to try to expand the stack" patch.  That one will be reverted before the final release, but it's left to catch any odd cases during the merge window and release candidates.
Reported-by: Ruihan Li <[email protected]>

* branch 'expand-stack':
  gup: add warning if some caller would seem to want stack expansion
  mm: always expand the stack with the mmap write lock held
  execve: expand new process stack manually ahead of time
  mm: make find_extend_vma() fail if write lock not held
  powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
  mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
  arm/mm: Convert to using lock_mm_and_find_vma()
  riscv/mm: Convert to using lock_mm_and_find_vma()
  mips/mm: Convert to using lock_mm_and_find_vma()
  powerpc/mm: Convert to using lock_mm_and_find_vma()
  arm64/mm: Convert to using lock_mm_and_find_vma()
  mm: make the page fault mmap locking killable
  mm: introduce new 'lock_mm_and_find_vma()' page fault helper
2 parents 3a8a670 + a425ac5 commit 9471f1f

49 files changed: +439, −468 lines changed


arch/alpha/Kconfig

Lines changed: 1 addition & 0 deletions
```diff
@@ -30,6 +30,7 @@ config ALPHA
 	select HAS_IOPORT
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_MOD_ARCH_SPECIFIC
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_RELA
 	select ODD_RT_SIGACTION
 	select OLD_SIGSUSPEND
```

arch/alpha/mm/fault.c

Lines changed: 3 additions & 10 deletions
```diff
@@ -119,20 +119,12 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	flags |= FAULT_FLAG_USER;
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 retry:
-	mmap_read_lock(mm);
-	vma = find_vma(mm, address);
+	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (!vma)
-		goto bad_area;
-	if (vma->vm_start <= address)
-		goto good_area;
-	if (!(vma->vm_flags & VM_GROWSDOWN))
-		goto bad_area;
-	if (expand_stack(vma, address))
-		goto bad_area;
+		goto bad_area_nosemaphore;
 
 	/* Ok, we have a good vm_area for this memory access, so
 	   we can handle it.  */
- good_area:
 	si_code = SEGV_ACCERR;
 	if (cause < 0) {
 		if (!(vma->vm_flags & VM_EXEC))
@@ -192,6 +184,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 bad_area:
 	mmap_read_unlock(mm);
 
+ bad_area_nosemaphore:
 	if (user_mode(regs))
 		goto do_sigsegv;
```

arch/arc/Kconfig

Lines changed: 1 addition & 0 deletions
```diff
@@ -41,6 +41,7 @@ config ARC
 	select HAVE_PERF_EVENTS
 	select HAVE_SYSCALL_TRACEPOINTS
 	select IRQ_DOMAIN
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_RELA
 	select OF
 	select OF_EARLY_FLATTREE
```

arch/arc/mm/fault.c

Lines changed: 3 additions & 8 deletions
```diff
@@ -113,15 +113,9 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 retry:
-	mmap_read_lock(mm);
-
-	vma = find_vma(mm, address);
+	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (!vma)
-		goto bad_area;
-	if (unlikely(address < vma->vm_start)) {
-		if (!(vma->vm_flags & VM_GROWSDOWN) || expand_stack(vma, address))
-			goto bad_area;
-	}
+		goto bad_area_nosemaphore;
 
 	/*
 	 * vm_area is good, now check permissions for this memory access
@@ -161,6 +155,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 bad_area:
 	mmap_read_unlock(mm);
 
+bad_area_nosemaphore:
 	/*
 	 * Major/minor page fault accounting
 	 * (in case of retry we only land here once)
```

arch/arm/Kconfig

Lines changed: 1 addition & 0 deletions
```diff
@@ -127,6 +127,7 @@ config ARM
 	select HAVE_VIRT_CPU_ACCOUNTING_GEN
 	select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
 	select IRQ_FORCED_THREADING
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_REL
 	select NEED_DMA_MAP_STATE
 	select OF_EARLY_FLATTREE if OF
```

arch/arm/mm/fault.c

Lines changed: 14 additions & 49 deletions
```diff
@@ -235,37 +235,11 @@ static inline bool is_permission_fault(unsigned int fsr)
 	return false;
 }
 
-static vm_fault_t __kprobes
-__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int flags,
-		unsigned long vma_flags, struct pt_regs *regs)
-{
-	struct vm_area_struct *vma = find_vma(mm, addr);
-	if (unlikely(!vma))
-		return VM_FAULT_BADMAP;
-
-	if (unlikely(vma->vm_start > addr)) {
-		if (!(vma->vm_flags & VM_GROWSDOWN))
-			return VM_FAULT_BADMAP;
-		if (addr < FIRST_USER_ADDRESS)
-			return VM_FAULT_BADMAP;
-		if (expand_stack(vma, addr))
-			return VM_FAULT_BADMAP;
-	}
-
-	/*
-	 * ok, we have a good vm_area for this memory access, check the
-	 * permissions on the VMA allow for the fault which occurred.
-	 */
-	if (!(vma->vm_flags & vma_flags))
-		return VM_FAULT_BADACCESS;
-
-	return handle_mm_fault(vma, addr & PAGE_MASK, flags, regs);
-}
-
 static int __kprobes
 do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 {
 	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
 	int sig, code;
 	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
@@ -304,31 +278,21 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
-	/*
-	 * As per x86, we may deadlock here.  However, since the kernel only
-	 * validly references user space from well defined areas of the code,
-	 * we can bug out early if this is from code which shouldn't.
-	 */
-	if (!mmap_read_trylock(mm)) {
-		if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
-			goto no_context;
 retry:
-		mmap_read_lock(mm);
-	} else {
-		/*
-		 * The above down_read_trylock() might have succeeded in
-		 * which case, we'll have missed the might_sleep() from
-		 * down_read()
-		 */
-		might_sleep();
-#ifdef CONFIG_DEBUG_VM
-		if (!user_mode(regs) &&
-		    !search_exception_tables(regs->ARM_pc))
-			goto no_context;
-#endif
+	vma = lock_mm_and_find_vma(mm, addr, regs);
+	if (unlikely(!vma)) {
+		fault = VM_FAULT_BADMAP;
+		goto bad_area;
 	}
 
-	fault = __do_page_fault(mm, addr, flags, vm_flags, regs);
+	/*
+	 * ok, we have a good vm_area for this memory access, check the
+	 * permissions on the VMA allow for the fault which occurred.
+	 */
+	if (!(vma->vm_flags & vm_flags))
+		fault = VM_FAULT_BADACCESS;
+	else
+		fault = handle_mm_fault(vma, addr & PAGE_MASK, flags, regs);
 
 	/* If we need to retry but a fatal signal is pending, handle the
 	 * signal first.  We do not need to release the mmap_lock because
@@ -359,6 +323,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
 		return 0;
 
+bad_area:
 	/*
 	 * If we are in kernel mode at this point, we
 	 * have no context to handle this fault with.
```

arch/arm64/Kconfig

Lines changed: 1 addition & 0 deletions
```diff
@@ -231,6 +231,7 @@ config ARM64
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
 	select KASAN_VMALLOC if KASAN
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE
 	select NEED_SG_DMA_LENGTH
```

arch/arm64/mm/fault.c

Lines changed: 8 additions & 39 deletions
```diff
@@ -497,27 +497,14 @@ static void do_bad_area(unsigned long far, unsigned long esr,
 #define VM_FAULT_BADMAP		((__force vm_fault_t)0x010000)
 #define VM_FAULT_BADACCESS	((__force vm_fault_t)0x020000)
 
-static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
+static vm_fault_t __do_page_fault(struct mm_struct *mm,
+				  struct vm_area_struct *vma, unsigned long addr,
 				  unsigned int mm_flags, unsigned long vm_flags,
 				  struct pt_regs *regs)
 {
-	struct vm_area_struct *vma = find_vma(mm, addr);
-
-	if (unlikely(!vma))
-		return VM_FAULT_BADMAP;
-
 	/*
 	 * Ok, we have a good vm_area for this memory access, so we can handle
 	 * it.
-	 */
-	if (unlikely(vma->vm_start > addr)) {
-		if (!(vma->vm_flags & VM_GROWSDOWN))
-			return VM_FAULT_BADMAP;
-		if (expand_stack(vma, addr))
-			return VM_FAULT_BADMAP;
-	}
-
-	/*
 	 * Check that the permissions on the VMA allow for the fault which
 	 * occurred.
 	 */
@@ -631,31 +618,15 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	}
 lock_mmap:
 #endif /* CONFIG_PER_VMA_LOCK */
-	/*
-	 * As per x86, we may deadlock here.  However, since the kernel only
-	 * validly references user space from well defined areas of the code,
-	 * we can bug out early if this is from code which shouldn't.
-	 */
-	if (!mmap_read_trylock(mm)) {
-		if (!user_mode(regs) && !search_exception_tables(regs->pc))
-			goto no_context;
+
 retry:
-		mmap_read_lock(mm);
-	} else {
-		/*
-		 * The above mmap_read_trylock() might have succeeded in which
-		 * case, we'll have missed the might_sleep() from down_read().
-		 */
-		might_sleep();
-#ifdef CONFIG_DEBUG_VM
-		if (!user_mode(regs) && !search_exception_tables(regs->pc)) {
-			mmap_read_unlock(mm);
-			goto no_context;
-		}
-#endif
+	vma = lock_mm_and_find_vma(mm, addr, regs);
+	if (unlikely(!vma)) {
+		fault = VM_FAULT_BADMAP;
+		goto done;
 	}
 
-	fault = __do_page_fault(mm, addr, mm_flags, vm_flags, regs);
+	fault = __do_page_fault(mm, vma, addr, mm_flags, vm_flags, regs);
 
 	/* Quick path to respond to signals */
 	if (fault_signal_pending(fault, regs)) {
@@ -674,9 +645,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	}
 	mmap_read_unlock(mm);
 
-#ifdef CONFIG_PER_VMA_LOCK
 done:
-#endif
 	/*
 	 * Handle the "normal" (no error) case first.
 	 */
```

arch/csky/Kconfig

Lines changed: 1 addition & 0 deletions
```diff
@@ -97,6 +97,7 @@ config CSKY
 	select HAVE_STACKPROTECTOR
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
+	select LOCK_MM_AND_FIND_VMA
 	select MAY_HAVE_SPARSE_IRQ
 	select MODULES_USE_ELF_RELA if MODULES
 	select OF
```

arch/csky/mm/fault.c

Lines changed: 5 additions & 17 deletions
```diff
@@ -97,13 +97,12 @@ static inline void mm_fault_error(struct pt_regs *regs, unsigned long addr, vm_f
 	BUG();
 }
 
-static inline void bad_area(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr)
+static inline void bad_area_nosemaphore(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr)
 {
 	/*
 	 * Something tried to access memory that isn't in our memory map.
 	 * Fix it, but check if it's kernel or user first.
 	 */
-	mmap_read_unlock(mm);
 	/* User mode accesses just cause a SIGSEGV */
 	if (user_mode(regs)) {
 		do_trap(regs, SIGSEGV, code, addr);
@@ -238,32 +237,21 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	if (is_write(regs))
 		flags |= FAULT_FLAG_WRITE;
 retry:
-	mmap_read_lock(mm);
-	vma = find_vma(mm, addr);
+	vma = lock_mm_and_find_vma(mm, addr, regs);
 	if (unlikely(!vma)) {
-		bad_area(regs, mm, code, addr);
-		return;
-	}
-	if (likely(vma->vm_start <= addr))
-		goto good_area;
-	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, mm, code, addr);
-		return;
-	}
-	if (unlikely(expand_stack(vma, addr))) {
-		bad_area(regs, mm, code, addr);
+		bad_area_nosemaphore(regs, mm, code, addr);
 		return;
 	}
 
 	/*
 	 * Ok, we have a good vm_area for this memory access, so
 	 * we can handle it.
 	 */
-good_area:
 	code = SEGV_ACCERR;
 
 	if (unlikely(access_error(regs, vma))) {
-		bad_area(regs, mm, code, addr);
+		mmap_read_unlock(mm);
+		bad_area_nosemaphore(regs, mm, code, addr);
 		return;
 	}
 
```