Commit 720c857

Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FRED support from Thomas Gleixner:

 "Support for x86 Flexible Return and Event Delivery (FRED).

  FRED is a replacement for IDT event delivery on x86 and addresses most
  of the technical nightmares which IDT exposes:

   1) Exception cause registers like CR2 need to be manually preserved
      in nested exception scenarios.

   2) Hardware interrupt stack switching is suboptimal for nested
      exceptions as the interrupt stack mechanism rewinds the stack on
      each entry which requires a massive effort in the low level entry
      of #NMI code to handle this.

   3) No hardware distinction between entry from kernel or from user
      which makes establishing kernel context more complex than it needs
      to be especially for unconditionally nestable exceptions like NMI.

   4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
      is a problem when the perf NMI takes a fault when collecting a
      stack trace.

   5) Partial restore of ESP when returning to a 16-bit segment.

   6) Limitation of the vector space which can cause vector exhaustion
      on large systems.

   7) Inability to differentiate NMI sources.

  FRED addresses these shortcomings by:

   1) An extended exception stack frame which the CPU uses to save
      exception cause registers. This ensures that the meta information
      for each exception is preserved on stack and avoids the extra
      complexity of preserving it in software.

   2) Hardware interrupt stack switching is non-rewinding if a nested
      exception uses the current interrupt stack.

   3) The entry points for kernel and user context are separate and GS
      BASE handling which is required to establish kernel context for
      per CPU variable access is done in hardware.

   4) NMIs are now nesting protected. They are only reenabled on the
      return from NMI.

   5) FRED guarantees full restore of ESP.

   6) FRED does not put a limitation on the vector space by design
      because it uses central entry points for kernel and user space
      and the CPU stores the entry type (exception, trap, interrupt,
      syscall) on the entry stack along with the vector number. The
      entry code has to demultiplex this information, but this removes
      the vector space restriction.

      The first hardware implementations will still have the current
      restricted vector space because lifting this limitation requires
      further changes to the local APIC.

   7) FRED stores the vector number and meta information on stack which
      allows having more than one NMI vector in future hardware when the
      required local APIC changes are in place.

  The series implements the initial FRED support by:

   - Reworking the existing entry and IDT handling infrastructure to
     accommodate the alternative entry mechanism.

   - Expanding the stack frame to accommodate the extra 16 bytes FRED
     requires to store context and meta information.

   - Providing FRED specific C entry points for events which have
     information pushed to the extended stack frame, e.g. #PF and #DB.

   - Providing FRED specific C entry points for #NMI and #MCE.

   - Implementing the FRED specific ASM entry points and the C code to
     demultiplex the events.

   - Providing detection and initialization mechanisms and the necessary
     tweaks in context switching, GS BASE handling etc.

  The FRED integration aims for maximum code reuse vs the existing IDT
  implementation to the extent possible and the deviation in hot paths
  like context switching are handled with alternatives to minimize the
  impact. The low level entry and exit paths are separate due to the
  extended stack frame and the hardware based GS BASE switching and
  therefore have no impact on IDT based systems.

  It has been extensively tested on existing systems and on the FRED
  simulation and as of now there are no outstanding problems"

* tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/fred: Fix init_task thread stack pointer initialization
  MAINTAINERS: Add a maintainer entry for FRED
  x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
  x86/fred: Invoke FRED initialization code to enable FRED
  x86/fred: Add FRED initialization functions
  x86/syscall: Split IDT syscall setup code into idt_syscall_init()
  KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
  x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
  x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
  x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
  x86/traps: Add sysvec_install() to install a system interrupt handler
  x86/fred: FRED entry/exit and dispatch code
  x86/fred: Add a machine check entry stub for FRED
  x86/fred: Add a NMI entry stub for FRED
  x86/fred: Add a debug fault entry stub for FRED
  x86/idtentry: Incorporate definitions/declarations of the FRED entries
  x86/fred: Make exc_page_fault() work for FRED
  x86/fred: Allow single-step trap and NMI when starting a new task
  x86/fred: No ESPFIX needed when FRED is enabled
  ...
2 parents ca7e917 + c416b5b commit 720c857
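
Item 6 of the pull message says the entry code has to demultiplex the event type and vector that the CPU stores on the entry stack. The following is a minimal user-space sketch of such a two-level dispatch, not the kernel's implementation: the enum names, their values, and `dispatch_event()` are hypothetical and chosen only for illustration. The vector numbers 1 (#DB) and 14 (#PF) are the architectural x86 exception vectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event types; names and values are NOT the FRED encoding. */
enum event_type { EVT_EXTINT, EVT_NMI, EVT_HWEXC, EVT_TYPE_COUNT };

/* Hypothetical handler identifiers, returned instead of running real handlers. */
enum handler_id { H_NONE, H_IRQ, H_NMI, H_DEBUG, H_PAGE_FAULT, H_OTHER_EXC };

typedef enum handler_id (*type_handler)(uint8_t vector);

static enum handler_id on_extint(uint8_t vector) { (void)vector; return H_IRQ; }
static enum handler_id on_nmi(uint8_t vector)    { (void)vector; return H_NMI; }

/* Second level: pick the handler by vector within the exception type. */
static enum handler_id on_hwexc(uint8_t vector)
{
	switch (vector) {
	case 1:  return H_DEBUG;      /* #DB */
	case 14: return H_PAGE_FAULT; /* #PF */
	default: return H_OTHER_EXC;
	}
}

/* First level: one table slot per event type. */
static const type_handler type_table[EVT_TYPE_COUNT] = {
	[EVT_EXTINT] = on_extint,
	[EVT_NMI]    = on_nmi,
	[EVT_HWEXC]  = on_hwexc,
};

enum handler_id dispatch_event(unsigned int type, uint8_t vector)
{
	if (type >= EVT_TYPE_COUNT)
		return H_NONE; /* unknown type: nothing to dispatch to */
	assert(type_table[type] != NULL);
	return type_table[type](vector);
}
```

Because the type selects the table slot first, two events with the same vector number but different types end up in different handlers, which is what lifts the single shared vector space restriction in principle.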

56 files changed: 1372 additions and 121 deletions

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 6 additions & 0 deletions
@@ -1525,6 +1525,12 @@
 			Warning: use of this parameter will taint the kernel
 			and may cause unknown problems.
 
+	fred= [X86-64]
+			Enable/disable Flexible Return and Event Delivery.
+			Format: { on | off }
+			on: enable FRED when it's present.
+			off: disable FRED, the default setting.
+
 	ftrace=[tracer]
 			[FTRACE] will set and start the specified tracer
 			as early as possible in order to facilitate early
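
The `fred=` entry above documents an on/off switch whose default is off. A minimal user-space sketch of that policy is shown here; `fred_requested()` is a hypothetical helper written for illustration and is not the kernel's command-line parser. It assumes space-separated tokens and lets the first `fred=` token decide.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true only when the command line carries
 * the exact token "fred=on". A missing token or any other value means
 * off, which matches the documented default.
 */
bool fred_requested(const char *cmdline)
{
	static const char key[] = "fred=";
	const size_t klen = sizeof(key) - 1;
	const char *p = cmdline;

	assert(cmdline != NULL);
	while (*p) {
		/* Skip separators, then measure the current token. */
		p += strspn(p, " ");
		size_t len = strcspn(p, " ");

		if (len > klen && strncmp(p, key, klen) == 0) {
			/* Token starts with "fred=": the value must be exactly "on". */
			return len - klen == 2 && strncmp(p + klen, "on", 2) == 0;
		}
		p += len;
	}
	return false;
}
```

With this sketch, `fred_requested("root=/dev/sda1 fred=on quiet")` is true, while an absent parameter or `fred=off` yields false.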

Documentation/arch/x86/x86_64/fred.rst

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Flexible Return and Event Delivery (FRED)
+=========================================
+
+Overview
+========
+
+The FRED architecture defines simple new transitions that change
+privilege level (ring transitions). The FRED architecture was
+designed with the following goals:
+
+1) Improve overall performance and response time by replacing event
+   delivery through the interrupt descriptor table (IDT event
+   delivery) and event return by the IRET instruction with lower
+   latency transitions.
+
+2) Improve software robustness by ensuring that event delivery
+   establishes the full supervisor context and that event return
+   establishes the full user context.
+
+The new transitions defined by the FRED architecture are FRED event
+delivery and, for returning from events, two FRED return instructions.
+FRED event delivery can effect a transition from ring 3 to ring 0, but
+it is used also to deliver events incident to ring 0. One FRED
+instruction (ERETU) effects a return from ring 0 to ring 3, while the
+other (ERETS) returns while remaining in ring 0. Collectively, FRED
+event delivery and the FRED return instructions are FRED transitions.
+
+In addition to these transitions, the FRED architecture defines a new
+instruction (LKGS) for managing the state of the GS segment register.
+The LKGS instruction can be used by 64-bit operating systems that do
+not use the new FRED transitions.
+
+Furthermore, the FRED architecture is easy to extend for future CPU
+architectures.
+
+Software based event dispatching
+================================
+
+FRED operates differently from IDT in terms of event handling. Instead
+of directly dispatching an event to its handler based on the event
+vector, FRED requires the software to dispatch an event to its handler
+based on both the event's type and vector. Therefore, an event dispatch
+framework must be implemented to facilitate the event-to-handler
+dispatch process. The FRED event dispatch framework takes control
+once an event is delivered, and employs a two-level dispatch.
+
+The first level dispatching is event type based, and the second level
+dispatching is event vector based.
+
+Full supervisor/user context
+============================
+
+FRED event delivery atomically save and restore full supervisor/user
+context upon event delivery and return. Thus it avoids the problem of
+transient states due to %cr2 and/or %dr6, and it is no longer needed
+to handle all the ugly corner cases caused by half baked entry states.
+
+FRED allows explicit unblock of NMI with new event return instructions
+ERETS/ERETU, avoiding the mess caused by IRET which unconditionally
+unblocks NMI, e.g., when an exception happens during NMI handling.
+
+FRED always restores the full value of %rsp, thus ESPFIX is no longer
+needed when FRED is enabled.
+
+LKGS
+====
+
+LKGS behaves like the MOV to GS instruction except that it loads the
+base address into the IA32_KERNEL_GS_BASE MSR instead of the GS
+segment’s descriptor cache. With LKGS, it ends up with avoiding
+mucking with kernel GS, i.e., an operating system can always operate
+with its own GS base address.
+
+Because FRED event delivery from ring 3 and ERETU both swap the value
+of the GS base address and that of the IA32_KERNEL_GS_BASE MSR, plus
+the introduction of LKGS instruction, the SWAPGS instruction is no
+longer needed when FRED is enabled, thus is disallowed (#UD).
+
+Stack levels
+============
+
+4 stack levels 0~3 are introduced to replace the nonreentrant IST for
+event handling, and each stack level should be configured to use a
+dedicated stack.
+
+The current stack level could be unchanged or go higher upon FRED
+event delivery. If unchanged, the CPU keeps using the current event
+stack. If higher, the CPU switches to a new event stack specified by
+the MSR of the new stack level, i.e., MSR_IA32_FRED_RSP[123].
+
+Only execution of a FRED return instruction ERET[US], could lower the
+current stack level, causing the CPU to switch back to the stack it was
+on before a previous event delivery that promoted the stack level.
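
The "Stack levels" section of the new document states that event delivery leaves the stack level unchanged or raises it, and that only a FRED return instruction lowers it. A minimal simulation of that rule follows. It is a sketch under stated assumptions: `struct cpu_state`, `deliver_event()` and `event_return()` are hypothetical names, the `saved[]` array stands in for the level that real hardware records in the stack frame, and taking the higher of the current and the event's configured level is an assumption consistent with the text above.

```c
#include <assert.h>

#define MAX_NESTING 8 /* arbitrary bound for this simulation */

/* Hypothetical model of the per-CPU stack level state. */
struct cpu_state {
	unsigned int csl;                /* current stack level, 0..3 */
	unsigned int saved[MAX_NESTING]; /* levels saved by nested deliveries */
	unsigned int depth;              /* number of deliveries not yet returned */
};

/*
 * Event delivery: the level stays the same or rises, never falls.
 * Returns 1 when the CPU would switch to a new event stack.
 */
int deliver_event(struct cpu_state *cpu, unsigned int event_level)
{
	assert(event_level <= 3 && cpu->depth < MAX_NESTING);
	cpu->saved[cpu->depth++] = cpu->csl;
	if (event_level > cpu->csl) {
		cpu->csl = event_level; /* switch to that level's dedicated stack */
		return 1;
	}
	return 0; /* keep using the current event stack */
}

/* Event return: the only operation that may lower the stack level. */
void event_return(struct cpu_state *cpu)
{
	assert(cpu->depth > 0);
	cpu->csl = cpu->saved[--cpu->depth];
}
```

Starting at level 0, a level-2 event switches stacks, a nested level-1 event stays on the level-2 stack, and two returns bring the model back to level 0.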

Documentation/arch/x86/x86_64/index.rst

Lines changed: 1 addition & 0 deletions
@@ -15,3 +15,4 @@ x86_64 Support
    cpu-hotplug-spec
    machinecheck
    fsgs
+   fred

MAINTAINERS

Lines changed: 10 additions & 0 deletions
@@ -11157,6 +11157,16 @@ L: [email protected]
 S: Maintained
 F: drivers/net/wwan/iosm/
 
+INTEL(R) FLEXIBLE RETURN AND EVENT DELIVERY
+M: Xin Li <[email protected]>
+M: "H. Peter Anvin" <[email protected]>
+S: Supported
+F: Documentation/arch/x86/x86_64/fred.rst
+F: arch/x86/entry/entry_64_fred.S
+F: arch/x86/entry/entry_fred.c
+F: arch/x86/include/asm/fred.h
+F: arch/x86/kernel/fred.c
+
 INTEL(R) TRACE HUB
 M: Alexander Shishkin <[email protected]>
 S: Supported

arch/x86/Kconfig

Lines changed: 9 additions & 0 deletions
@@ -496,6 +496,15 @@ config X86_CPU_RESCTRL
 
 	  Say N if unsure.
 
+config X86_FRED
+	bool "Flexible Return and Event Delivery"
+	depends on X86_64
+	help
+	  When enabled, try to use Flexible Return and Event Delivery
+	  instead of the legacy SYSCALL/SYSENTER/IDT architecture for
+	  ring transitions and exception/interrupt handling if the
+	  system supports.
+
 if X86_32
 config X86_BIGSMP
 	bool "Support for big SMP systems with more than 8 CPUs"

arch/x86/entry/Makefile

Lines changed: 4 additions & 1 deletion
@@ -18,6 +18,9 @@ obj-y += vdso/
 obj-y += vsyscall/
 
 obj-$(CONFIG_PREEMPTION) += thunk_$(BITS).o
+CFLAGS_entry_fred.o += -fno-stack-protector
+CFLAGS_REMOVE_entry_fred.o += -pg $(CC_FLAGS_FTRACE)
+obj-$(CONFIG_X86_FRED) += entry_64_fred.o entry_fred.o
+
 obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o
 obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o
-

arch/x86/entry/calling.h

Lines changed: 10 additions & 5 deletions
@@ -65,7 +65,7 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
-.macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0
+.macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 unwind_hint=1
 	.if \save_ret
 	pushq %rsi /* pt_regs->si */
 	movq 8(%rsp), %rsi /* temporarily store the return address in %rsi */
@@ -87,14 +87,17 @@ For 32-bit we have the following conventions - kernel is built with
 	pushq %r13 /* pt_regs->r13 */
 	pushq %r14 /* pt_regs->r14 */
 	pushq %r15 /* pt_regs->r15 */
+
+	.if \unwind_hint
 	UNWIND_HINT_REGS
+	.endif
 
 	.if \save_ret
 	pushq %rsi /* return address on top of stack */
 	.endif
 .endm
 
-.macro CLEAR_REGS
+.macro CLEAR_REGS clear_bp=1
 	/*
 	 * Sanitize registers of values that a speculation attack might
 	 * otherwise want to exploit. The lower registers are likely clobbered
@@ -109,17 +112,19 @@ For 32-bit we have the following conventions - kernel is built with
 	xorl %r10d, %r10d /* nospec r10 */
 	xorl %r11d, %r11d /* nospec r11 */
 	xorl %ebx, %ebx /* nospec rbx */
+	.if \clear_bp
 	xorl %ebp, %ebp /* nospec rbp */
+	.endif
 	xorl %r12d, %r12d /* nospec r12 */
 	xorl %r13d, %r13d /* nospec r13 */
 	xorl %r14d, %r14d /* nospec r14 */
 	xorl %r15d, %r15d /* nospec r15 */
 
 .endm
 
-.macro PUSH_AND_CLEAR_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0
-	PUSH_REGS rdx=\rdx, rcx=\rcx, rax=\rax, save_ret=\save_ret
-	CLEAR_REGS
+.macro PUSH_AND_CLEAR_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 clear_bp=1 unwind_hint=1
+	PUSH_REGS rdx=\rdx, rcx=\rcx, rax=\rax, save_ret=\save_ret unwind_hint=\unwind_hint
+	CLEAR_REGS clear_bp=\clear_bp
 .endm
 
 .macro POP_REGS pop_rdi=1

arch/x86/entry/entry_32.S

Lines changed: 0 additions & 4 deletions
@@ -649,10 +649,6 @@ SYM_CODE_START_LOCAL(asm_\cfunc)
 SYM_CODE_END(asm_\cfunc)
 .endm
 
-.macro idtentry_sysvec vector cfunc
-	idtentry \vector asm_\cfunc \cfunc has_error_code=0
-.endm
-
 /*
  * Include the defines which emit the idt entries which are shared
  * shared between 32 and 64 bit and emit the __irqentry_text_* markers

arch/x86/entry/entry_64.S

Lines changed: 6 additions & 8 deletions
@@ -248,7 +248,13 @@ SYM_CODE_START(ret_from_fork_asm)
 	 * and unwind should work normally.
 	 */
 	UNWIND_HINT_REGS
+
+#ifdef CONFIG_X86_FRED
+	ALTERNATIVE "jmp swapgs_restore_regs_and_return_to_usermode", \
+		    "jmp asm_fred_exit_user", X86_FEATURE_FRED
+#else
 	jmp swapgs_restore_regs_and_return_to_usermode
+#endif
 SYM_CODE_END(ret_from_fork_asm)
 .popsection
 
@@ -371,14 +377,6 @@ SYM_CODE_END(\asmsym)
 	idtentry \vector asm_\cfunc \cfunc has_error_code=1
 .endm
 
-/*
- * System vectors which invoke their handlers directly and are not
- * going through the regular common device interrupt handling code.
- */
-.macro idtentry_sysvec vector cfunc
-	idtentry \vector asm_\cfunc \cfunc has_error_code=0
-.endm
-
 /**
  * idtentry_mce_db - Macro to generate entry stubs for #MC and #DB
  * @vector: Vector number

arch/x86/entry/entry_64_fred.S

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * The actual FRED entry points.
+ */
+
+#include <linux/export.h>
+
+#include <asm/asm.h>
+#include <asm/fred.h>
+#include <asm/segment.h>
+
+#include "calling.h"
+
+	.code64
+	.section .noinstr.text, "ax"
+
+.macro FRED_ENTER
+	UNWIND_HINT_END_OF_STACK
+	ENDBR
+	PUSH_AND_CLEAR_REGS
+	movq %rsp, %rdi /* %rdi -> pt_regs */
+.endm
+
+.macro FRED_EXIT
+	UNWIND_HINT_REGS
+	POP_REGS
+.endm
+
+/*
+ * The new RIP value that FRED event delivery establishes is
+ * IA32_FRED_CONFIG & ~FFFH for events that occur in ring 3.
+ * Thus the FRED ring 3 entry point must be 4K page aligned.
+ */
+	.align 4096
+
+SYM_CODE_START_NOALIGN(asm_fred_entrypoint_user)
+	FRED_ENTER
+	call fred_entry_from_user
+SYM_INNER_LABEL(asm_fred_exit_user, SYM_L_GLOBAL)
+	FRED_EXIT
+1:	ERETU
+
+	_ASM_EXTABLE_TYPE(1b, asm_fred_entrypoint_user, EX_TYPE_ERETU)
+SYM_CODE_END(asm_fred_entrypoint_user)
+
+/*
+ * The new RIP value that FRED event delivery establishes is
+ * (IA32_FRED_CONFIG & ~FFFH) + 256 for events that occur in
+ * ring 0, i.e., asm_fred_entrypoint_user + 256.
+ */
+	.org asm_fred_entrypoint_user + 256, 0xcc
+SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel)
+	FRED_ENTER
+	call fred_entry_from_kernel
+	FRED_EXIT
+	ERETS
+SYM_CODE_END(asm_fred_entrypoint_kernel)
+
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+SYM_FUNC_START(asm_fred_entry_from_kvm)
+	push %rbp
+	mov %rsp, %rbp
+
+	UNWIND_HINT_SAVE
+
+	/*
+	 * Both IRQ and NMI from VMX can be handled on current task stack
+	 * because there is no need to protect from reentrancy and the call
+	 * stack leading to this helper is effectively constant and shallow
+	 * (relatively speaking). Do the same when FRED is active, i.e., no
+	 * need to check current stack level for a stack switch.
+	 *
+	 * Emulate the FRED-defined redzone and stack alignment.
+	 */
+	sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
+	and $FRED_STACK_FRAME_RSP_MASK, %rsp
+
+	/*
+	 * Start to push a FRED stack frame, which is always 64 bytes:
+	 *
+	 * +--------+-----------------+
+	 * | Bytes  | Usage           |
+	 * +--------+-----------------+
+	 * | 63:56  | Reserved        |
+	 * | 55:48  | Event Data      |
+	 * | 47:40  | SS + Event Info |
+	 * | 39:32  | RSP             |
+	 * | 31:24  | RFLAGS          |
+	 * | 23:16  | CS + Aux Info   |
+	 * |  15:8  | RIP             |
+	 * |   7:0  | Error Code      |
+	 * +--------+-----------------+
+	 */
+	push $0 /* Reserved, must be 0 */
+	push $0 /* Event data, 0 for IRQ/NMI */
+	push %rdi /* fred_ss handed in by the caller */
+	push %rbp
+	pushf
+	mov $__KERNEL_CS, %rax
+	push %rax
+
+	/*
+	 * Unlike the IDT event delivery, FRED _always_ pushes an error code
+	 * after pushing the return RIP, thus the CALL instruction CANNOT be
+	 * used here to push the return RIP, otherwise there is no chance to
+	 * push an error code before invoking the IRQ/NMI handler.
+	 *
+	 * Use LEA to get the return RIP and push it, then push an error code.
+	 */
+	lea 1f(%rip), %rax
+	push %rax /* Return RIP */
+	push $0 /* Error code, 0 for IRQ/NMI */
+
+	PUSH_AND_CLEAR_REGS clear_bp=0 unwind_hint=0
+	movq %rsp, %rdi /* %rdi -> pt_regs */
+	call __fred_entry_from_kvm /* Call the C entry point */
+	POP_REGS
+	ERETS
+1:
+	/*
+	 * Objtool doesn't understand what ERETS does, this hint tells it that
+	 * yes, we'll reach here and with what stack state. A save/restore pair
+	 * isn't strictly needed, but it's the simplest form.
+	 */
+	UNWIND_HINT_RESTORE
+	pop %rbp
+	RET
+
+SYM_FUNC_END(asm_fred_entry_from_kvm)
+EXPORT_SYMBOL_GPL(asm_fred_entry_from_kvm);
+#endif
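
The comment table in asm_fred_entry_from_kvm describes a 64-byte frame, and the sub/and pair before it emulates the redzone and alignment. The C sketch below restates both so the offsets can be checked at compile time. It is illustrative only: `struct fred_frame_sketch` and its field names are hypothetical, and the values used for `REDZONE_AMOUNT` (1) and `FRAME_RSP_MASK` (~0x3f) are assumptions, because the header that defines FRED_CONFIG_REDZONE_AMOUNT and FRED_STACK_FRAME_RSP_MASK is not part of this excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct mirroring the comment table, lowest address first. */
struct fred_frame_sketch {
	uint64_t error_code;    /* bytes  7:0,  pushed last  */
	uint64_t rip;           /* bytes 15:8                */
	uint64_t cs_aux;        /* bytes 23:16               */
	uint64_t rflags;        /* bytes 31:24               */
	uint64_t rsp;           /* bytes 39:32               */
	uint64_t ss_event_info; /* bytes 47:40               */
	uint64_t event_data;    /* bytes 55:48               */
	uint64_t reserved;      /* bytes 63:56, pushed first */
};

static_assert(sizeof(struct fred_frame_sketch) == 64, "frame is 64 bytes");
static_assert(offsetof(struct fred_frame_sketch, rip) == 8, "RIP at 15:8");
static_assert(offsetof(struct fred_frame_sketch, event_data) == 48, "event data at 55:48");

/* Assumed values; the real definitions are not shown in this excerpt. */
#define REDZONE_AMOUNT 1u
#define FRAME_RSP_MASK (~(uint64_t)0x3f)

/* Mirrors the sub/and pair: skip the redzone (64-byte units), align down to 64. */
uint64_t frame_top(uint64_t rsp)
{
	return (rsp - ((uint64_t)REDZONE_AMOUNT << 6)) & FRAME_RSP_MASK;
}

/* Address of the error code slot once all eight 8-byte pushes are done. */
uint64_t frame_base(uint64_t rsp)
{
	return frame_top(rsp) - sizeof(struct fred_frame_sketch);
}
```

Under these assumed constants, a stack pointer of 0x1000 gives a frame top of 0xfc0 and a frame base of 0xf80, so the frame itself stays 64-byte aligned.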
