Commit 8cb1ae1

Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:

 - Cleanup of extable fixup handling to be more robust, which in turn
   allows making the FPU exception fixups more robust as well.

 - Change the return code for signal frame related failures from
   explicit error codes to a boolean fail/success, as that is all the
   calling code evaluates.

 - A large refactoring of the FPU code to prepare for adding AMX
   support:

    - Disentangle the public header maze and remove especially the
      misnamed kitchen sink internal.h which is, despite its name,
      included all over the place.

    - Add a proper abstraction for the register buffer storage (struct
      fpstate) which allows to dynamically size the buffer at runtime
      by flipping the pointer to the buffer container from the default
      container which is embedded in task_struct::thread::fpu to a
      dynamically allocated container with a larger register buffer.

    - Convert the code over to the new fpstate mechanism.

    - Consolidate the KVM FPU handling by moving the FPU related code
      into the FPU core, which reduces the number of exports and avoids
      adding even more exports when AMX has to be supported in KVM.
      This also removes duplicated code which was of course
      unnecessarily different and incomplete in the KVM copy.

    - Simplify the KVM FPU buffer handling by utilizing the new fpstate
      container and just switching the buffer pointer from the user
      space buffer to the KVM guest buffer when entering vcpu_run() and
      flipping it back when leaving the function. This cuts the memory
      requirements of a vCPU for FPU buffers in half and avoids
      pointless memory copy operations.

      This also solves the so far unresolved problem of adding AMX
      support because the current FPU buffer handling of KVM inflicted
      a circular dependency between adding AMX support to the core and
      to KVM. With the new scheme of switching fpstate, AMX support can
      be added to the core code without affecting KVM.

    - Replace various variables with proper data structures so the
      extra information required for adding dynamically enabled FPU
      features (AMX) can be added in one place.

 - Add AMX (Advanced Matrix eXtensions) support (finally):

   AMX is a large XSTATE component which is going to be available with
   Sapphire Rapids Xeon CPUs. The feature comes with an extra MSR
   (MSR_XFD) which allows to trap the (first) use of an AMX related
   instruction, which has two benefits:

    1) It allows the kernel to control access to the feature

    2) It allows the kernel to dynamically allocate the large register
       state buffer instead of burdening every task with the extra 8K
       or larger state storage.

   It would have been great to gain this kind of control already with
   AVX512.

   The support comes with the following infrastructure components:

    1) arch_prctl() to

        - read the supported features (equivalent to XGETBV(0))

        - read the permitted features for a task

        - request permission for a dynamically enabled feature

       Permission is granted per process, inherited on fork() and
       cleared on exec(). The permission policy of the kernel is
       restricted to sigaltstack size validation, but the syscall
       obviously allows further restrictions via seccomp etc.

    2) A stronger sigaltstack size validation for sys_sigaltstack(2)
       which takes granted permissions and the potentially resulting
       larger signal frame into account. This mechanism can also be
       used to enforce factual sigaltstack validation independent of
       dynamic features to help with finding potential victims of the
       2K sigaltstack size constant which is broken since AVX512
       support was added.

    3) Exception handling for #NM traps to catch first use of an
       extended feature via a new cause MSR. If the exception was
       caused by the use of such a feature, the handler checks
       permission for that feature. If permission has not been granted,
       the handler sends a SIGILL like the #UD handler would do if the
       feature had been disabled in XCR0. If permission has been
       granted, then a new fpstate which fits the larger buffer
       requirement is allocated.

       In the unlikely case that this allocation fails, the handler
       sends SIGSEGV to the task. That's not elegant, but unavoidable
       as the other discussed options of preallocation or full per task
       permissions come with their own set of horrors for kernel and/or
       userspace. So this is the lesser of the evils and SIGSEGV caused
       by unexpected memory allocation failures is not a fundamentally
       new concept either.

       When allocation succeeds, the fpstate properties are filled in
       to reflect the extended feature set and the resulting sizes, the
       fpu::fpstate pointer is updated accordingly and the trap is
       disarmed for this task permanently.

    4) Enumeration and size calculations

    5) Trap switching via MSR_XFD

       The XFD (eXtended Feature Disable) MSR is context switched with
       the same lifetime rules as the FPU register state itself. The
       mechanism is keyed off with a static key which is default
       disabled so !AMX equipped CPUs have zero overhead. On AMX
       enabled CPUs the overhead is limited by comparing the task's XFD
       value with a per CPU shadow variable to avoid redundant MSR
       writes. In case of switching from an AMX-using task to a non
       AMX-using task or vice versa, the extra MSR write is obviously
       inevitable.

       All other places which need to be aware of the variable feature
       sets and resulting variable sizes are not affected at all
       because they retrieve the information (feature set, sizes)
       unconditionally from the fpstate properties.

    6) Enable the new AMX states

   Note, this is relatively new code despite the fact that AMX support
   has been in the works for more than a year now.

   The big refactoring of the FPU code, which allowed a proper
   integration, was started exactly 3 weeks ago. Refactoring of the
   existing FPU code and of the original AMX patches took a week and
   has been subject to extensive review and testing. The only fallout
   which has not been caught in review and testing right away was
   restricted to AMX enabled systems, which is completely irrelevant
   for anyone outside Intel and their early access program. There might
   be dragons lurking as usual, but so far the fine grained refactoring
   has held up and any as yet undetected fallout is bisectable and
   should be easily addressable before the 5.16 release. Famous last
   words...

   Many thanks to Chang Bae and Dave Hansen for working hard on this
   and also to the various test teams at Intel who reserved extra
   capacity to follow the rapid development of this closely, which
   provides the confidence level required to offer this rather large
   update for inclusion into 5.16-rc1.

* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  Documentation/x86: Add documentation for using dynamic XSTATE features
  x86/fpu: Include vmalloc.h for vzalloc()
  selftests/x86/amx: Add context switch test
  selftests/x86/amx: Add test cases for AMX state management
  x86/fpu/amx: Enable the AMX feature in 64-bit mode
  x86/fpu: Add XFD handling for dynamic states
  x86/fpu: Calculate the default sizes independently
  x86/fpu/amx: Define AMX state components and have it used for boot-time checks
  x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
  x86/fpu/xstate: Add fpstate_realloc()/free()
  x86/fpu/xstate: Add XFD #NM handler
  x86/fpu: Update XFD state where required
  x86/fpu: Add sanity checks for XFD
  x86/fpu: Add XFD state to fpstate
  x86/msr-index: Add MSRs for XFD
  x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
  x86/fpu: Reset permission and fpstate on exec()
  x86/fpu: Prepare fpu_clone() for dynamically enabled features
  x86/fpu/signal: Prepare for variable sigframe length
  x86/signal: Use fpu::__state_user_size for sigalt stack validation
  ...

2 parents 7d20dd3 + d7a9590 commit 8cb1ae1
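
The first-use trap (#NM) and the XFD shadow comparison described in the commit message can be modeled in user space. The following is a minimal sketch under stated assumptions, not kernel code: `struct sim_task`, `struct sim_cpu`, `nm_trap()` and `switch_to()` are hypothetical names, signal delivery is reduced to an enum result, the MSR write is replaced by a counter, and copying the old register contents into the new buffer is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* XSTATE component 18 (AMX TILE_DATA), as named in the xstate documentation. */
#define XTILEDATA_BIT (1ULL << 18)

enum nm_result { NM_OK, NM_SIGILL, NM_SIGSEGV };

/* Hypothetical per-task model: permission mask, armed traps, state buffer. */
struct sim_task {
	uint64_t perm;        /* features the process was granted */
	uint64_t xfd;         /* features whose first use still traps */
	void *fpstate;        /* stand-in for the register buffer */
	size_t fpstate_size;
};

/* Hypothetical per-CPU model: shadow of the last value "written" to MSR_XFD. */
struct sim_cpu {
	uint64_t xfd_shadow;
	unsigned int msr_writes;
};

/* Model of the first-use trap: check permission, grow the buffer, disarm. */
static enum nm_result nm_trap(struct sim_task *t, uint64_t feature, size_t new_size)
{
	void *buf;

	assert(new_size > 0);
	if (!(t->perm & feature))
		return NM_SIGILL;      /* no permission */

	buf = calloc(1, new_size);
	if (!buf)
		return NM_SIGSEGV;     /* allocation failure */

	free(t->fpstate);              /* flip the pointer to the larger buffer */
	t->fpstate = buf;
	t->fpstate_size = new_size;
	t->xfd &= ~feature;            /* trap stays disarmed for this task */
	return NM_OK;
}

/* Model of the context switch: skip the write when the shadow already matches. */
static void switch_to(struct sim_cpu *cpu, const struct sim_task *next)
{
	if (cpu->xfd_shadow == next->xfd)
		return;
	cpu->xfd_shadow = next->xfd;
	cpu->msr_writes++;
}
```

In this model, switching between two tasks that have both not used AMX costs no write, while crossing between an AMX-using task and a non-using one costs exactly one, which mirrors the "inevitable" extra write the message mentions.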

67 files changed, 3458 insertions(+), 1575 deletions(-)

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 9 additions & 0 deletions
@@ -5497,6 +5497,15 @@
 	stifb=		[HW]
 			Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]]
 
+	strict_sas_size=
+			[X86]
+			Format: <bool>
+			Enable or disable strict sigaltstack size checks
+			against the required signal frame size which
+			depends on the supported FPU features. This can
+			be used to filter out binaries which have
+			not yet been made aware of AT_MINSIGSTKSZ.
+
 	sunrpc.min_resvport=
 	sunrpc.max_resvport=
 			[NFS,SUNRPC]
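
A binary that is "aware of AT_MINSIGSTKSZ" sizes its alternate signal stack from the auxiliary vector instead of a compile-time constant. The following is a minimal user-space sketch, assuming glibc's `getauxval()` and the AT_MINSIGSTKSZ value 51 from the kernel's auxvec header when libc does not define it; `altstack_size()` and `install_altstack()` are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/auxv.h>
#include <unistd.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51	/* assumption: value from the kernel's uapi auxvec.h */
#endif

/* Minimum frame size reported by the kernel, never below the legacy constant. */
static size_t altstack_size(size_t headroom)
{
	size_t min = (size_t)getauxval(AT_MINSIGSTKSZ);	/* 0 if not provided */
	size_t legacy = (size_t)MINSIGSTKSZ;

	if (min < legacy)
		min = legacy;
	return min + headroom;
}

/* Install an alternate stack; the buffer intentionally stays allocated. */
static int install_altstack(size_t headroom)
{
	stack_t ss;

	ss.ss_size = altstack_size(headroom);
	ss.ss_flags = 0;
	ss.ss_sp = malloc(ss.ss_size);
	if (!ss.ss_sp)
		return -1;
	assert(ss.ss_size >= (size_t)MINSIGSTKSZ);
	if (sigaltstack(&ss, NULL) != 0) {
		free(ss.ss_sp);
		return -1;
	}
	return 0;
}
```

The headroom argument is what the handler itself needs on top of the frame; a call such as `install_altstack(16384)` early in `main()` would be typical, though the right amount depends on the application.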

Documentation/x86/index.rst

Lines changed: 1 addition & 0 deletions
@@ -37,3 +37,4 @@ x86-specific Documentation
    sgx
    features
    elf_auxvec
+   xstate

Documentation/x86/xstate.rst

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+Using XSTATE features in user space applications
+================================================
+
+The x86 architecture supports floating-point extensions which are
+enumerated via CPUID. Applications consult CPUID and use XGETBV to
+evaluate which features have been enabled by the kernel XCR0.
+
+Up to AVX-512 and PKRU states, these features are automatically enabled by
+the kernel if available. Features like AMX TILE_DATA (XSTATE component 18)
+are enabled by XCR0 as well, but the first use of related instruction is
+trapped by the kernel because by default the required large XSTATE buffers
+are not allocated automatically.
+
+Using dynamically enabled XSTATE features in user space applications
+--------------------------------------------------------------------
+
+The kernel provides an arch_prctl(2) based mechanism for applications to
+request the usage of such features. The arch_prctl(2) options related to
+this are:
+
+-ARCH_GET_XCOMP_SUPP
+
+ arch_prctl(ARCH_GET_XCOMP_SUPP, &features);
+
+ ARCH_GET_XCOMP_SUPP stores the supported features in userspace storage of
+ type uint64_t. The second argument is a pointer to that storage.
+
+-ARCH_GET_XCOMP_PERM
+
+ arch_prctl(ARCH_GET_XCOMP_PERM, &features);
+
+ ARCH_GET_XCOMP_PERM stores the features for which the userspace process
+ has permission in userspace storage of type uint64_t. The second argument
+ is a pointer to that storage.
+
+-ARCH_REQ_XCOMP_PERM
+
+ arch_prctl(ARCH_REQ_XCOMP_PERM, feature_nr);
+
+ ARCH_REQ_XCOMP_PERM allows to request permission for a dynamically enabled
+ feature or a feature set. A feature set can be mapped to a facility, e.g.
+ AMX, and can require one or more XSTATE components to be enabled.
+
+ The feature argument is the number of the highest XSTATE component which
+ is required for a facility to work.
+
+When requesting permission for a feature, the kernel checks the
+availability. The kernel ensures that sigaltstacks in the process's tasks
+are large enough to accommodate the resulting large signal frame. It
+enforces this both during ARCH_REQ_XCOMP_SUPP and during any subsequent
+sigaltstack(2) calls. If an installed sigaltstack is smaller than the
+resulting sigframe size, ARCH_REQ_XCOMP_SUPP results in -ENOSUPP. Also,
+sigaltstack(2) results in -ENOMEM if the requested altstack is too small
+for the permitted features.
+
+Permission, when granted, is valid per process. Permissions are inherited
+on fork(2) and cleared on exec(3).
+
+The first use of an instruction related to a dynamically enabled feature is
+trapped by the kernel. The trap handler checks whether the process has
+permission to use the feature. If the process has no permission then the
+kernel sends SIGILL to the application. If the process has permission then
+the handler allocates a larger xstate buffer for the task so the large
+state can be context switched. In the unlikely cases that the allocation
+fails, the kernel sends SIGSEGV.
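
The three arch_prctl(2) options documented in xstate.rst can be exercised from user space roughly as follows. This is a hedged sketch rather than the reference usage: the numeric option values are taken from the kernel's uapi `asm/prctl.h`, which is not part of this excerpt, the raw `syscall()` wrapper is an assumption (libc may not expose `arch_prctl()` directly), and `xcomp_call()` and `request_amx_tiledata()` are hypothetical names. The sigaltstack paragraph in the documentation names ARCH_REQ_XCOMP_SUPP; the sketch assumes ARCH_REQ_XCOMP_PERM is meant, since that is the only request option the document lists.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: values from the kernel's uapi asm/prctl.h (not shown here). */
#ifndef ARCH_GET_XCOMP_SUPP
#define ARCH_GET_XCOMP_SUPP 0x1021
#endif
#ifndef ARCH_GET_XCOMP_PERM
#define ARCH_GET_XCOMP_PERM 0x1022
#endif
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif

#define XFEATURE_XTILEDATA 18	/* AMX TILE_DATA, per the documentation */

static uint64_t xfeature_mask(unsigned int nr)
{
	assert(nr < 64);
	return 1ULL << nr;
}

/* Thin wrapper: 0 on success, negative errno on failure or wrong platform. */
static int xcomp_call(int option, unsigned long arg)
{
#if defined(__linux__) && defined(__x86_64__) && defined(SYS_arch_prctl)
	return syscall(SYS_arch_prctl, option, arg) == 0 ? 0 : -errno;
#else
	(void)option;
	(void)arg;
	return -ENOSYS;
#endif
}

/* Query support, request permission, then confirm it was granted. */
static int request_amx_tiledata(void)
{
	uint64_t supported = 0, permitted = 0;
	int rc;

	rc = xcomp_call(ARCH_GET_XCOMP_SUPP, (unsigned long)&supported);
	if (rc)
		return rc;
	if (!(supported & xfeature_mask(XFEATURE_XTILEDATA)))
		return -EOPNOTSUPP;

	rc = xcomp_call(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
	if (rc)
		return rc;

	rc = xcomp_call(ARCH_GET_XCOMP_PERM, (unsigned long)&permitted);
	if (rc)
		return rc;
	return (permitted & xfeature_mask(XFEATURE_XTILEDATA)) ? 0 : -EACCES;
}
```

A program would call `request_amx_tiledata()` once before its first AMX instruction and fall back to a non-AMX code path on any non-zero result. On kernels without these options the first call is expected to fail with an error rather than succeed silently.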

arch/Kconfig

Lines changed: 3 additions & 0 deletions
@@ -1288,6 +1288,9 @@ config ARCH_HAS_ELFCORE_COMPAT
 config ARCH_HAS_PARANOID_L1D_FLUSH
 	bool
 
+config DYNAMIC_SIGFRAME
+	bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"

arch/x86/Kconfig

Lines changed: 17 additions & 0 deletions
@@ -125,6 +125,7 @@ config X86
 	select CLOCKSOURCE_VALIDATE_LAST_CYCLE
 	select CLOCKSOURCE_WATCHDOG
 	select DCACHE_WORD_ACCESS
+	select DYNAMIC_SIGFRAME
 	select EDAC_ATOMIC_SCRUB
 	select EDAC_SUPPORT
 	select GENERIC_CLOCKEVENTS_BROADCAST	if X86_64 || (X86_32 && X86_LOCAL_APIC)
@@ -2399,6 +2400,22 @@ config MODIFY_LDT_SYSCALL
 
 	  Saying 'N' here may make sense for embedded or server kernels.
 
+config STRICT_SIGALTSTACK_SIZE
+	bool "Enforce strict size checking for sigaltstack"
+	depends on DYNAMIC_SIGFRAME
+	help
+	  For historical reasons MINSIGSTKSZ is a constant which became
+	  already too small with AVX512 support. Add a mechanism to
+	  enforce strict checking of the sigaltstack size against the
+	  real size of the FPU frame. This option enables the check
+	  by default. It can also be controlled via the kernel command
+	  line option 'strict_sas_size' independent of this config
+	  switch. Enabling it might break existing applications which
+	  allocate a too small sigaltstack but 'work' because they
+	  never get a signal delivered.
+
+	  Say 'N' unless you want to really enforce this check.
+
 source "kernel/livepatch/Kconfig"
 
 endmenu

arch/x86/events/perf_event.h

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@
 
 #include <linux/perf_event.h>
 
+#include <asm/fpu/xstate.h>
 #include <asm/intel_ds.h>
 #include <asm/cpu.h>
 

arch/x86/ia32/ia32_signal.c

Lines changed: 7 additions & 8 deletions
@@ -24,7 +24,6 @@
 #include <linux/syscalls.h>
 #include <asm/ucontext.h>
 #include <linux/uaccess.h>
-#include <asm/fpu/internal.h>
 #include <asm/fpu/signal.h>
 #include <asm/ptrace.h>
 #include <asm/ia32_unistd.h>
@@ -57,16 +56,16 @@ static inline void reload_segments(struct sigcontext_32 *sc)
 /*
  * Do a signal return; undo the signal stack.
  */
-static int ia32_restore_sigcontext(struct pt_regs *regs,
-				   struct sigcontext_32 __user *usc)
+static bool ia32_restore_sigcontext(struct pt_regs *regs,
+				    struct sigcontext_32 __user *usc)
 {
 	struct sigcontext_32 sc;
 
 	/* Always make any pending restarted system calls return -EINTR */
 	current->restart_block.fn = do_no_restart_syscall;
 
 	if (unlikely(copy_from_user(&sc, usc, sizeof(sc))))
-		return -EFAULT;
+		return false;
 
 	/* Get only the ia32 registers. */
 	regs->bx = sc.bx;
@@ -111,7 +110,7 @@ COMPAT_SYSCALL_DEFINE0(sigreturn)
 
 	set_current_blocked(&set);
 
-	if (ia32_restore_sigcontext(regs, &frame->sc))
+	if (!ia32_restore_sigcontext(regs, &frame->sc))
 		goto badframe;
 	return regs->ax;
 
@@ -135,7 +134,7 @@ COMPAT_SYSCALL_DEFINE0(rt_sigreturn)
 
 	set_current_blocked(&set);
 
-	if (ia32_restore_sigcontext(regs, &frame->uc.uc_mcontext))
+	if (!ia32_restore_sigcontext(regs, &frame->uc.uc_mcontext))
 		goto badframe;
 
 	if (compat_restore_altstack(&frame->uc.uc_stack))
@@ -220,8 +219,8 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
 
 	sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size);
 	*fpstate = (struct _fpstate_32 __user *) sp;
-	if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned,
-				     math_size) < 0)
+	if (!copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned,
+				      math_size))
 		return (void __user *) -1L;
 
 	sp -= frame_size;

arch/x86/include/asm/asm.h

Lines changed: 20 additions & 30 deletions
@@ -122,28 +122,19 @@
 
 #ifdef __KERNEL__
 
+# include <asm/extable_fixup_types.h>
+
 /* Exception table entry */
 #ifdef __ASSEMBLY__
-# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
+
+# define _ASM_EXTABLE_TYPE(from, to, type)			\
 	.pushsection "__ex_table","a" ;				\
 	.balign 4 ;						\
 	.long (from) - . ;					\
 	.long (to) - . ;					\
-	.long (handler) - . ;					\
+	.long type ;						\
 	.popsection
 
-# define _ASM_EXTABLE(from, to)					\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
-
-# define _ASM_EXTABLE_UA(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_uaccess)
-
-# define _ASM_EXTABLE_CPY(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_copy)
-
-# define _ASM_EXTABLE_FAULT(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
-
 # ifdef CONFIG_KPROBES
 #  define _ASM_NOKPROBE(entry)					\
 	.pushsection "_kprobe_blacklist","aw" ;			\
@@ -155,27 +146,15 @@
 # endif
 
 #else /* ! __ASSEMBLY__ */
-# define _EXPAND_EXTABLE_HANDLE(x) #x
-# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
+
+# define _ASM_EXTABLE_TYPE(from, to, type)			\
 	" .pushsection \"__ex_table\",\"a\"\n"			\
 	" .balign 4\n"						\
 	" .long (" #from ") - .\n"				\
 	" .long (" #to ") - .\n"				\
-	" .long (" _EXPAND_EXTABLE_HANDLE(handler) ") - .\n"	\
+	" .long " __stringify(type) " \n"			\
 	" .popsection\n"
 
-# define _ASM_EXTABLE(from, to)					\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
-
-# define _ASM_EXTABLE_UA(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_uaccess)
-
-# define _ASM_EXTABLE_CPY(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_copy)
-
-# define _ASM_EXTABLE_FAULT(from, to)				\
-	_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
-
 /* For C file, we already have NOKPROBE_SYMBOL macro */
 
 /*
@@ -188,6 +167,17 @@ register unsigned long current_stack_pointer asm(_ASM_SP);
 #define ASM_CALL_CONSTRAINT "+r" (current_stack_pointer)
 #endif /* __ASSEMBLY__ */
 
-#endif /* __KERNEL__ */
+#define _ASM_EXTABLE(from, to)					\
+	_ASM_EXTABLE_TYPE(from, to, EX_TYPE_DEFAULT)
 
+#define _ASM_EXTABLE_UA(from, to)				\
+	_ASM_EXTABLE_TYPE(from, to, EX_TYPE_UACCESS)
+
+#define _ASM_EXTABLE_CPY(from, to)				\
+	_ASM_EXTABLE_TYPE(from, to, EX_TYPE_COPY)
+
+#define _ASM_EXTABLE_FAULT(from, to)				\
+	_ASM_EXTABLE_TYPE(from, to, EX_TYPE_FAULT)
+
+#endif /* __KERNEL__ */
 #endif /* _ASM_X86_ASM_H */

arch/x86/include/asm/cpufeatures.h

Lines changed: 2 additions & 0 deletions
@@ -277,6 +277,7 @@
 #define X86_FEATURE_XSAVEC		(10*32+ 1) /* XSAVEC instruction */
 #define X86_FEATURE_XGETBV1		(10*32+ 2) /* XGETBV with ECX = 1 instruction */
 #define X86_FEATURE_XSAVES		(10*32+ 3) /* XSAVES/XRSTORS instructions */
+#define X86_FEATURE_XFD			(10*32+ 4) /* "" eXtended Feature Disabling */
 
 /*
  * Extended auxiliary flags: Linux defined - for features scattered in various
@@ -298,6 +299,7 @@
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
 #define X86_FEATURE_AVX512_BF16		(12*32+ 5) /* AVX512 BFLOAT16 instructions */
+#define X86_FEATURE_AMX_TILE		(18*32+24) /* AMX tile Support */
 
 /* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 13 */
 #define X86_FEATURE_CLZERO		(13*32+ 0) /* CLZERO instruction */
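
The feature constants above encode a capability word and a bit position as `word*32 + bit`. A minimal sketch of decoding that encoding follows; the two constants are copied from the diff, while `feature_word()`, `feature_bit()`, `test_feature()` and the flat `caps` array are hypothetical stand-ins, not the kernel's own accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the hunk above. */
#define X86_FEATURE_XFD		(10*32+ 4)
#define X86_FEATURE_AMX_TILE	(18*32+24)

static_assert(X86_FEATURE_XFD / 32 == 10, "XFD is encoded in word 10");
static_assert(X86_FEATURE_AMX_TILE / 32 == 18, "AMX_TILE is encoded in word 18");

/* Index of the 32-bit capability word a feature lives in. */
static unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

/* Bit position inside that word. */
static unsigned int feature_bit(unsigned int feature)
{
	return feature % 32;
}

/* Test a feature in a flat array of capability words (hypothetical layout). */
static int test_feature(const uint32_t *caps, unsigned int feature)
{
	return (caps[feature_word(feature)] >> feature_bit(feature)) & 1;
}
```

Note that X86_FEATURE_AMX_TILE is added next to the word 12 definitions in the file while its value encodes word 18, so the encoded word, not the position in the header, is what a lookup relies on.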

arch/x86/include/asm/extable.h

Lines changed: 28 additions & 16 deletions
@@ -1,12 +1,18 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _ASM_X86_EXTABLE_H
 #define _ASM_X86_EXTABLE_H
+
+#include <asm/extable_fixup_types.h>
+
 /*
- * The exception table consists of triples of addresses relative to the
- * exception table entry itself. The first address is of an instruction
- * that is allowed to fault, the second is the target at which the program
- * should continue. The third is a handler function to deal with the fault
- * caused by the instruction in the first field.
+ * The exception table consists of two addresses relative to the
+ * exception table entry itself and a type selector field.
+ *
+ * The first address is of an instruction that is allowed to fault, the
+ * second is the target at which the program should continue.
+ *
+ * The type entry is used by fixup_exception() to select the handler to
+ * deal with the fault caused by the instruction in the first field.
  *
  * All the routines below use bits of fixup code that are out of line
  * with the main instruction path. This means when everything is well,
@@ -15,7 +21,7 @@
  */
 
 struct exception_table_entry {
-	int insn, fixup, handler;
+	int insn, fixup, type;
 };
 struct pt_regs;
 
@@ -25,21 +31,27 @@ struct pt_regs;
 	do {							\
 		(a)->fixup = (b)->fixup + (delta);		\
 		(b)->fixup = (tmp).fixup - (delta);		\
-		(a)->handler = (b)->handler + (delta);		\
-		(b)->handler = (tmp).handler - (delta);		\
+		(a)->type = (b)->type;				\
+		(b)->type = (tmp).type;				\
 	} while (0)
 
-enum handler_type {
-	EX_HANDLER_NONE,
-	EX_HANDLER_FAULT,
-	EX_HANDLER_UACCESS,
-	EX_HANDLER_OTHER
-};
-
 extern int fixup_exception(struct pt_regs *regs, int trapnr,
 			   unsigned long error_code, unsigned long fault_addr);
 extern int fixup_bug(struct pt_regs *regs, int trapnr);
-extern enum handler_type ex_get_fault_handler_type(unsigned long ip);
+extern int ex_get_fixup_type(unsigned long ip);
 extern void early_fixup_exception(struct pt_regs *regs, int trapnr);
 
+#ifdef CONFIG_X86_MCE
+extern void ex_handler_msr_mce(struct pt_regs *regs, bool wrmsr);
+#else
+static inline void ex_handler_msr_mce(struct pt_regs *regs, bool wrmsr) { }
+#endif
+
+#if defined(CONFIG_BPF_JIT) && defined(CONFIG_X86_64)
+bool ex_handler_bpf(const struct exception_table_entry *x, struct pt_regs *regs);
+#else
+static inline bool ex_handler_bpf(const struct exception_table_entry *x,
+				  struct pt_regs *regs) { return false; }
+#endif
+
 #endif
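
The swap macro change follows from the new field's meaning: `insn` and `fixup` are offsets relative to the entry itself and need a delta correction when two entries trade places, whereas `type` is a plain value that is copied unchanged. The user-space sketch below models this under stated assumptions: `struct ex_entry` and the `ex_*()` helpers are hypothetical, the type values are made up, and the `insn` field is swapped here as well even though the hunk above only shows `fixup` and `type`.

```c
#include <assert.h>
#include <stdint.h>

/* Same field layout as the new struct exception_table_entry. */
struct ex_entry {
	int insn, fixup, type;
};

/* Store absolute targets as offsets relative to each field's own address. */
static void ex_set(struct ex_entry *e, uintptr_t insn, uintptr_t fixup, int type)
{
	e->insn = (int)((intptr_t)insn - (intptr_t)&e->insn);
	e->fixup = (int)((intptr_t)fixup - (intptr_t)&e->fixup);
	e->type = type;
}

static uintptr_t ex_insn_addr(const struct ex_entry *e)
{
	return (uintptr_t)((intptr_t)&e->insn + e->insn);
}

static uintptr_t ex_fixup_addr(const struct ex_entry *e)
{
	return (uintptr_t)((intptr_t)&e->fixup + e->fixup);
}

/* Swap two entries: relative fields get +/- delta, the type is copied as-is. */
static void ex_swap(struct ex_entry *a, struct ex_entry *b)
{
	struct ex_entry tmp = *a;
	intptr_t dist = (intptr_t)b - (intptr_t)a;
	int delta = (int)dist;

	assert(dist == (intptr_t)delta);	/* entries must be close together */
	a->insn = b->insn + delta;
	b->insn = tmp.insn - delta;
	a->fixup = b->fixup + delta;
	b->fixup = tmp.fixup - delta;
	a->type = b->type;
	b->type = tmp.type;
}
```

After `ex_swap()`, resolving either entry yields the targets the other one had before, which is the property a sort over the table depends on. With the old relative `handler` field the same delta arithmetic applied to the third field too.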
