Commit b376144

Merge tag 'kvm-x86-fixes-6.2-1' of https://github.com/kvm-x86/linux into HEAD
Misc KVM x86 fixes and cleanups for 6.2:

 - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

 - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
   years back when eliminating unnecessary barriers when switching between
   vmcs01 and vmcs02.

 - Clean up the MSR filter docs.

 - Clean up vmread_error_trampoline() to make it more obvious that params
   must be passed on the stack, even for x86-64.

 - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
   of the current guest CPUID.

 - Fudge around a race with TSC refinement that results in KVM incorrectly
   thinking a guest needs TSC scaling when running on a CPU with a
   constant TSC, but no hardware-enumerated TSC frequency.
2 parents 44bc611 + 3ebcbd2 commit b376144

13 files changed, +269 -109 lines changed

Documentation/virt/kvm/api.rst

Lines changed: 59 additions & 58 deletions
@@ -4079,80 +4079,71 @@ flags values for ``struct kvm_msr_filter_range``:
 ``KVM_MSR_FILTER_READ``

 Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
-indicates that a read should immediately fail, while a 1 indicates that
-a read for a particular MSR should be handled regardless of the default
+indicates that read accesses should be denied, while a 1 indicates that
+a read for a particular MSR should be allowed regardless of the default
 filter action.

 ``KVM_MSR_FILTER_WRITE``

 Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
-indicates that a write should immediately fail, while a 1 indicates that
-a write for a particular MSR should be handled regardless of the default
+indicates that write accesses should be denied, while a 1 indicates that
+a write for a particular MSR should be allowed regardless of the default
 filter action.

-``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``
-
-Filter both read and write accesses to MSRs using the given bitmap. A 0
-in the bitmap indicates that both reads and writes should immediately fail,
-while a 1 indicates that reads and writes for a particular MSR are not
-filtered by this range.
-
 flags values for ``struct kvm_msr_filter``:

 ``KVM_MSR_FILTER_DEFAULT_ALLOW``

 If no filter range matches an MSR index that is getting accessed, KVM will
-fall back to allowing access to the MSR.
+allow accesses to all MSRs by default.

 ``KVM_MSR_FILTER_DEFAULT_DENY``

 If no filter range matches an MSR index that is getting accessed, KVM will
-fall back to rejecting access to the MSR. In this mode, all MSRs that should
-be processed by KVM need to explicitly be marked as allowed in the bitmaps.
+deny accesses to all MSRs by default.
+
+This ioctl allows userspace to define up to 16 bitmaps of MSR ranges to deny
+guest MSR accesses that would normally be allowed by KVM. If an MSR is not
+covered by a specific range, the "default" filtering behavior applies. Each
+bitmap range covers MSRs from [base .. base+nmsrs).

-This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
-specify whether a certain MSR access should be explicitly filtered for or not.
+If an MSR access is denied by userspace, the resulting KVM behavior depends on
+whether or not KVM_CAP_X86_USER_SPACE_MSR's KVM_MSR_EXIT_REASON_FILTER is
+enabled. If KVM_MSR_EXIT_REASON_FILTER is enabled, KVM will exit to userspace
+on denied accesses, i.e. userspace effectively intercepts the MSR access. If
+KVM_MSR_EXIT_REASON_FILTER is not enabled, KVM will inject a #GP into the guest
+on denied accesses.

-If this ioctl has never been invoked, MSR accesses are not guarded and the
-default KVM in-kernel emulation behavior is fully preserved.
+If an MSR access is allowed by userspace, KVM will emulate and/or virtualize
+the access in accordance with the vCPU model. Note, KVM may still ultimately
+inject a #GP if an access is allowed by userspace, e.g. if KVM doesn't support
+the MSR, or to follow architectural behavior for the MSR.
+
+By default, KVM operates in KVM_MSR_FILTER_DEFAULT_ALLOW mode with no MSR range
+filters.

 Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
 filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
 an error.

-As soon as the filtering is in place, every MSR access is processed through
-the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff);
-x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
-and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
-register.
-
 .. warning::
-MSR accesses coming from nested vmentry/vmexit are not filtered.
+MSR accesses as part of nested VM-Enter/VM-Exit are not filtered.
 This includes both writes to individual VMCS fields and reads/writes
 through the MSR lists pointed to by the VMCS.

-If a bit is within one of the defined ranges, read and write accesses are
-guarded by the bitmap's value for the MSR index if the kind of access
-is included in the ``struct kvm_msr_filter_range`` flags. If no range
-cover this particular access, the behavior is determined by the flags
-field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW``
-and ``KVM_MSR_FILTER_DEFAULT_DENY``.
-
-Each bitmap range specifies a range of MSRs to potentially allow access on.
-The range goes from MSR index [base .. base+nmsrs]. The flags field
-indicates whether reads, writes or both reads and writes are filtered
-by setting a 1 bit in the bitmap for the corresponding MSR index.
-
-If an MSR access is not permitted through the filtering, it generates a
-#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
-allows user space to deflect and potentially handle various MSR accesses
-into user space.
+x2APIC MSR accesses cannot be filtered (KVM silently ignores filters that
+cover any x2APIC MSRs).

 Note, invoking this ioctl while a vCPU is running is inherently racy. However,
 KVM does guarantee that vCPUs will see either the previous filter or the new
 filter, e.g. MSRs with identical settings in both the old and new filter will
 have deterministic behavior.

+Similarly, if userspace wishes to intercept on denied accesses,
+KVM_MSR_EXIT_REASON_FILTER must be enabled before activating any filters, and
+left enabled until after all filters are deactivated. Failure to do so may
+result in KVM injecting a #GP instead of exiting to userspace.
+
 4.98 KVM_CREATE_SPAPR_TCE_64
 ----------------------------
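
Reviewer's sketch (not part of the commit): the filter semantics described in the hunk above can be modelled in a few lines of C. This is a minimal illustration under stated assumptions, not KVM's implementation. The type and function names (`struct filter`, `msr_allowed`, `msr_access`), the local `FILTER_READ`/`FILTER_WRITE` constants, and the byte-wise, least-significant-bit-first bitmap layout are all hypothetical stand-ins. The x2APIC range 0x800..0x8ff is taken from the text removed by the hunk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the uapi flag values; illustration only. */
#define FILTER_READ  (1u << 0)
#define FILTER_WRITE (1u << 1)

enum msr_outcome { MSR_EMULATE, MSR_EXIT_TO_USERSPACE, MSR_INJECT_GP };

struct filter_range {
	uint32_t flags;        /* FILTER_READ and/or FILTER_WRITE */
	uint32_t base;         /* first MSR index covered */
	uint32_t nmsrs;        /* range is [base .. base+nmsrs) */
	const uint8_t *bitmap; /* per MSR: 1 = allow, 0 = deny (assumed layout) */
};

struct filter {
	bool default_allow;    /* models DEFAULT_ALLOW vs. DEFAULT_DENY */
	size_t nranges;        /* at most 16 ranges */
	struct filter_range ranges[16];
};

/* Decide whether the filter allows an access of the given type. */
static bool msr_allowed(const struct filter *f, uint32_t index, uint32_t type)
{
	/* x2APIC MSRs cannot be filtered; ranges covering them are ignored. */
	if (index >= 0x800 && index <= 0x8ff)
		return true;

	for (size_t i = 0; i < f->nranges; i++) {
		const struct filter_range *r = &f->ranges[i];
		uint32_t bit;

		/* A range only applies to the access types named in its flags. */
		if (!(r->flags & type))
			continue;
		/* Half-open range check: [base .. base+nmsrs). */
		if (index < r->base || index - r->base >= r->nmsrs)
			continue;

		bit = index - r->base;
		return (r->bitmap[bit / 8] & (1u << (bit % 8))) != 0;
	}

	/* No range covers this access: the default action applies. */
	return f->default_allow;
}

/* Map the filter decision to the behavior the documentation describes. */
static enum msr_outcome msr_access(const struct filter *f, uint32_t index,
				   uint32_t type, bool exit_reason_filter)
{
	if (msr_allowed(f, index, type))
		return MSR_EMULATE; /* KVM may still inject #GP on its own */
	return exit_reason_filter ? MSR_EXIT_TO_USERSPACE : MSR_INJECT_GP;
}
```

With a single read-only range covering two MSRs whose bitmap allows only the first, a read of the second MSR is denied, while a write to it falls through to the default action because the range's flags do not include writes.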

@@ -6457,31 +6448,33 @@ if it decides to decode and emulate the instruction.

 Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
 enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
-will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
+may instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
 exit for writes.

-The "reason" field specifies why the MSR trap occurred. User space will only
-receive MSR exit traps when a particular reason was requested during through
+The "reason" field specifies why the MSR interception occurred. Userspace will
+only receive MSR exits when a particular reason was requested during through
 ENABLE_CAP. Currently valid exit reasons are:

 KVM_MSR_EXIT_REASON_UNKNOWN - access to MSR that is unknown to KVM
 KVM_MSR_EXIT_REASON_INVAL - access to invalid MSRs or reserved bits
 KVM_MSR_EXIT_REASON_FILTER - access blocked by KVM_X86_SET_MSR_FILTER

-For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
-wants to read. To respond to this request with a successful read, user space
+For KVM_EXIT_X86_RDMSR, the "index" field tells userspace which MSR the guest
+wants to read. To respond to this request with a successful read, userspace
 writes the respective data into the "data" field and must continue guest
 execution to ensure the read data is transferred into guest register state.

-If the RDMSR request was unsuccessful, user space indicates that with a "1" in
+If the RDMSR request was unsuccessful, userspace indicates that with a "1" in
 the "error" field. This will inject a #GP into the guest when the VCPU is
 executed again.

-For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
-wants to write. Once finished processing the event, user space must continue
-vCPU execution. If the MSR write was unsuccessful, user space also sets the
+For KVM_EXIT_X86_WRMSR, the "index" field tells userspace which MSR the guest
+wants to write. Once finished processing the event, userspace must continue
+vCPU execution. If the MSR write was unsuccessful, userspace also sets the
 "error" field to "1".

+See KVM_X86_SET_MSR_FILTER for details on the interaction with MSR filtering.
+
 ::

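
Reviewer's sketch (not part of the commit): the RDMSR/WRMSR exit protocol described in the hunk above amounts to filling in "data" or "error" before resuming the vCPU. The C below illustrates that handshake under stated assumptions. `struct msr_exit` is a simplified, hypothetical mirror of the exit fields named in the text (error, reason, index, data); `EXAMPLE_MSR_INDEX` and both helper functions are invented for the example. A real VMM would operate on the run structure shared with KVM and then re-enter the vCPU.

```c
#include <stdint.h>

/* Simplified, hypothetical mirror of the MSR exit fields named in the docs. */
struct msr_exit {
	uint8_t  error;  /* set to 1 to have KVM inject #GP on resume */
	uint32_t reason; /* why the exit occurred (one KVM_MSR_EXIT_REASON_* bit) */
	uint32_t index;  /* MSR the guest tried to access */
	uint64_t data;   /* RDMSR: value to return; WRMSR: value written */
};

/* Arbitrary index chosen for the example; not a real MSR assignment. */
#define EXAMPLE_MSR_INDEX 0x40000100u

/* Complete a KVM_EXIT_X86_RDMSR-style exit for the one MSR we emulate. */
static void complete_rdmsr(struct msr_exit *msr, uint64_t emulated_value)
{
	if (msr->index == EXAMPLE_MSR_INDEX) {
		msr->data = emulated_value; /* copied to guest registers on resume */
		msr->error = 0;
	} else {
		msr->error = 1;             /* unsuccessful read: guest gets #GP */
	}
}

/* Complete a KVM_EXIT_X86_WRMSR-style exit, storing into our backing value. */
static void complete_wrmsr(struct msr_exit *msr, uint64_t *backing)
{
	if (msr->index == EXAMPLE_MSR_INDEX) {
		*backing = msr->data;
		msr->error = 0;
	} else {
		msr->error = 1;             /* unsuccessful write: guest gets #GP */
	}
}
```

In both cases the caller must continue vCPU execution afterwards; the "error" field only takes effect when the vCPU runs again.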

@@ -7247,19 +7240,27 @@ the module parameter for the target VM.
 :Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report
 :Returns: 0 on success; -1 on error

-This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
-into user space.
+This capability allows userspace to intercept RDMSR and WRMSR instructions if
+access to an MSR is denied. By default, KVM injects #GP on denied accesses.

 When a guest requests to read or write an MSR, KVM may not implement all MSRs
 that are relevant to a respective system. It also does not differentiate by
 CPU type.

-To allow more fine grained control over MSR handling, user space may enable
+To allow more fine grained control over MSR handling, userspace may enable
 this capability. With it enabled, MSR accesses that match the mask specified in
-args[0] and trigger a #GP event inside the guest by KVM will instead trigger
-KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space
-can then handle to implement model specific MSR handling and/or user notifications
-to inform a user that an MSR was not handled.
+args[0] and would trigger a #GP inside the guest will instead trigger
+KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications. Userspace
+can then implement model specific MSR handling and/or user notifications
+to inform a user that an MSR was not emulated/virtualized by KVM.
+
+The valid mask flags are:
+
+KVM_MSR_EXIT_REASON_UNKNOWN - intercept accesses to unknown (to KVM) MSRs
+KVM_MSR_EXIT_REASON_INVAL - intercept accesses that are architecturally
+invalid according to the vCPU model and/or mode
+KVM_MSR_EXIT_REASON_FILTER - intercept accesses that are denied by userspace
+via KVM_X86_SET_MSR_FILTER

 7.22 KVM_CAP_X86_BUS_LOCK_EXIT
 -------------------------------
@@ -7919,7 +7920,7 @@ KVM_EXIT_X86_WRMSR exit notifications.
 This capability indicates that KVM supports that accesses to user defined MSRs
 may be rejected. With this capability exposed, KVM exports new VM ioctl
 KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
-ranges that KVM should reject access to.
+ranges that KVM should deny access to.

 In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
 trap and emulate MSRs that are outside of the scope of KVM as well as

arch/x86/kvm/svm/sev.c

Lines changed: 2 additions & 2 deletions
@@ -465,9 +465,9 @@ static void sev_clflush_pages(struct page *pages[], unsigned long npages)
 		return;

 	for (i = 0; i < npages; i++) {
-		page_virtual = kmap_atomic(pages[i]);
+		page_virtual = kmap_local_page(pages[i]);
 		clflush_cache_range(page_virtual, PAGE_SIZE);
-		kunmap_atomic(page_virtual);
+		kunmap_local(page_virtual);
 		cond_resched();
 	}
 }

arch/x86/kvm/svm/svm.c

Lines changed: 8 additions & 2 deletions
@@ -3895,8 +3895,14 @@ static int svm_vcpu_pre_run(struct kvm_vcpu *vcpu)

 static fastpath_t svm_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
 {
-	if (to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_MSR &&
-	    to_svm(vcpu)->vmcb->control.exit_info_1)
+	struct vmcb_control_area *control = &to_svm(vcpu)->vmcb->control;
+
+	/*
+	 * Note, the next RIP must be provided as SRCU isn't held, i.e. KVM
+	 * can't read guest memory (dereference memslots) to decode the WRMSR.
+	 */
+	if (control->exit_code == SVM_EXIT_MSR && control->exit_info_1 &&
+	    nrips && control->next_rip)
 		return handle_fastpath_set_msr_irqoff(vcpu);

 	return EXIT_FASTPATH_NONE;

arch/x86/kvm/vmx/nested.c

Lines changed: 47 additions & 17 deletions
@@ -2588,12 +2588,9 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 		nested_ept_init_mmu_context(vcpu);

 	/*
-	 * This sets GUEST_CR0 to vmcs12->guest_cr0, possibly modifying those
-	 * bits which we consider mandatory enabled.
-	 * The CR0_READ_SHADOW is what L2 should have expected to read given
-	 * the specifications by L1; It's not enough to take
-	 * vmcs12->cr0_read_shadow because on our cr0_guest_host_mask we
-	 * have more bits than L1 expected.
+	 * Override the CR0/CR4 read shadows after setting the effective guest
+	 * CR0/CR4. The common helpers also set the shadows, but they don't
+	 * account for vmcs12's cr0/4_guest_host_mask.
 	 */
 	vmx_set_cr0(vcpu, vmcs12->guest_cr0);
 	vmcs_writel(CR0_READ_SHADOW, nested_read_cr0(vmcs12));
@@ -4798,6 +4795,17 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,

 	vmx_switch_vmcs(vcpu, &vmx->vmcs01);

+	/*
+	 * If IBRS is advertised to the vCPU, KVM must flush the indirect
+	 * branch predictors when transitioning from L2 to L1, as L1 expects
+	 * hardware (KVM in this case) to provide separate predictor modes.
+	 * Bare metal isolates VMX root (host) from VMX non-root (guest), but
+	 * doesn't isolate different VMCSs, i.e. in this case, doesn't provide
+	 * separate modes for L2 vs L1.
+	 */
+	if (guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
+		indirect_branch_prediction_barrier();
+
 	/* Update any VMCS fields that might have changed while L2 ran */
 	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
 	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
@@ -5131,24 +5139,35 @@ static int handle_vmxon(struct kvm_vcpu *vcpu)
 		| FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX;

 	/*
-	 * Note, KVM cannot rely on hardware to perform the CR0/CR4 #UD checks
-	 * that have higher priority than VM-Exit (see Intel SDM's pseudocode
-	 * for VMXON), as KVM must load valid CR0/CR4 values into hardware while
-	 * running the guest, i.e. KVM needs to check the _guest_ values.
+	 * Manually check CR4.VMXE checks, KVM must force CR4.VMXE=1 to enter
+	 * the guest and so cannot rely on hardware to perform the check,
+	 * which has higher priority than VM-Exit (see Intel SDM's pseudocode
+	 * for VMXON).
 	 *
-	 * Rely on hardware for the other two pre-VM-Exit checks, !VM86 and
-	 * !COMPATIBILITY modes. KVM may run the guest in VM86 to emulate Real
-	 * Mode, but KVM will never take the guest out of those modes.
+	 * Rely on hardware for the other pre-VM-Exit checks, CR0.PE=1, !VM86
+	 * and !COMPATIBILITY modes. For an unrestricted guest, KVM doesn't
+	 * force any of the relevant guest state. For a restricted guest, KVM
+	 * does force CR0.PE=1, but only to also force VM86 in order to emulate
+	 * Real Mode, and so there's no need to check CR0.PE manually.
 	 */
-	if (!nested_host_cr0_valid(vcpu, kvm_read_cr0(vcpu)) ||
-	    !nested_host_cr4_valid(vcpu, kvm_read_cr4(vcpu))) {
+	if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE)) {
 		kvm_queue_exception(vcpu, UD_VECTOR);
 		return 1;
 	}

 	/*
-	 * CPL=0 and all other checks that are lower priority than VM-Exit must
-	 * be checked manually.
+	 * The CPL is checked for "not in VMX operation" and for "in VMX root",
+	 * and has higher priority than the VM-Fail due to being post-VMXON,
+	 * i.e. VMXON #GPs outside of VMX non-root if CPL!=0. In VMX non-root,
+	 * VMXON causes VM-Exit and KVM unconditionally forwards VMXON VM-Exits
+	 * from L2 to L1, i.e. there's no need to check for the vCPU being in
+	 * VMX non-root.
+	 *
+	 * Forwarding the VM-Exit unconditionally, i.e. without performing the
+	 * #UD checks (see above), is functionally ok because KVM doesn't allow
+	 * L1 to run L2 without CR4.VMXE=0, and because KVM never modifies L2's
+	 * CR0 or CR4, i.e. it's L2's responsibility to emulate #UDs that are
+	 * missed by hardware due to shadowing CR0 and/or CR4.
 	 */
 	if (vmx_get_cpl(vcpu)) {
 		kvm_inject_gp(vcpu, 0);
@@ -5158,6 +5177,17 @@ static int handle_vmxon(struct kvm_vcpu *vcpu)
 	if (vmx->nested.vmxon)
 		return nested_vmx_fail(vcpu, VMXERR_VMXON_IN_VMX_ROOT_OPERATION);

+	/*
+	 * Invalid CR0/CR4 generates #GP. These checks are performed if and
+	 * only if the vCPU isn't already in VMX operation, i.e. effectively
+	 * have lower priority than the VM-Fail above.
+	 */
+	if (!nested_host_cr0_valid(vcpu, kvm_read_cr0(vcpu)) ||
+	    !nested_host_cr4_valid(vcpu, kvm_read_cr4(vcpu))) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
 	if ((vmx->msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
 			!= VMXON_NEEDED_FEATURES) {
 		kvm_inject_gp(vcpu, 0);
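
Reviewer's sketch (not part of the commit): the two handle_vmxon() hunks above reorder the emulated VMXON checks. The C below models only the ordering visible in those hunks, so that the relative priorities are easy to see. It is a simplified illustration, not the kernel function: `struct vmxon_state`, `enum vmxon_result`, and `vmxon_checks` are hypothetical names, each condition is reduced to a boolean, and any checks that follow in the real function (outside the hunks shown) are omitted.

```c
#include <stdbool.h>

enum vmxon_result { VMXON_UD, VMXON_GP, VMXON_VMFAIL, VMXON_PASSED };

/* Hypothetical, flattened view of the state the checks consult. */
struct vmxon_state {
	bool cr4_vmxe;      /* guest CR4.VMXE is set */
	unsigned int cpl;   /* current privilege level */
	bool already_vmxon; /* vCPU is already in VMX operation */
	bool cr0_cr4_valid; /* CR0/CR4 satisfy the VMX fixed-bit rules */
	bool feat_ctl_ok;   /* feature control MSR has the needed bits */
};

/* Apply the checks in the order the patched code performs them. */
static enum vmxon_result vmxon_checks(const struct vmxon_state *s)
{
	/* 1. CR4.VMXE=0 is a #UD and is checked first. */
	if (!s->cr4_vmxe)
		return VMXON_UD;

	/* 2. CPL != 0 is a #GP, with priority over the VM-Fail below. */
	if (s->cpl != 0)
		return VMXON_GP;

	/* 3. VMXON while already in VMX root operation is a VM-Fail. */
	if (s->already_vmxon)
		return VMXON_VMFAIL;

	/* 4. Invalid CR0/CR4 is a #GP (previously reported as #UD). */
	if (!s->cr0_cr4_valid)
		return VMXON_GP;

	/* 5. Missing feature control bits is a #GP. */
	if (!s->feat_ctl_ok)
		return VMXON_GP;

	return VMXON_PASSED;
}
```

The behavioral change this highlights: a vCPU that is already post-VMXON and has an invalid CR0/CR4 now gets VM-Fail rather than a fault, because the CR0/CR4 validity check moved below the VM-Fail check.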

arch/x86/kvm/vmx/nested.h

Lines changed: 4 additions & 3 deletions
@@ -79,9 +79,10 @@ static inline bool nested_ept_ad_enabled(struct kvm_vcpu *vcpu)
 }

 /*
- * Return the cr0 value that a nested guest would read. This is a combination
- * of the real cr0 used to run the guest (guest_cr0), and the bits shadowed by
- * its hypervisor (cr0_read_shadow).
+ * Return the cr0/4 value that a nested guest would read. This is a combination
+ * of L1's "real" cr0 used to run the guest (guest_cr0), and the bits shadowed
+ * by the L1 hypervisor (cr0_read_shadow). KVM must emulate CPU behavior as
+ * the value+mask loaded into vmcs02 may not match the vmcs12 fields.
  */
 static inline unsigned long nested_read_cr0(struct vmcs12 *fields)
 {
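
Reviewer's sketch (not part of the commit): the comment above describes the value a nested guest reads as a combination of the real register and the read shadow. The architectural rule is that bits set in the guest/host mask are read from the shadow and all other bits are read from the real register. The helper below is a standalone illustration of that rule with a hypothetical name and plain integer arguments; it does not use the kernel's `struct vmcs12`.

```c
/*
 * Value a guest observes when reading a shadowed control register:
 * bits owned by the hypervisor (set in guest_host_mask) come from the
 * read shadow, the remaining bits come from the real guest value.
 */
static unsigned long read_shadowed_cr(unsigned long guest_val,
				      unsigned long read_shadow,
				      unsigned long guest_host_mask)
{
	return (guest_val & ~guest_host_mask) |
	       (read_shadow & guest_host_mask);
}
```

For example, with a real value of 0x80000031, a read shadow of 0x10, and a mask of 0x80000001, the guest observes 0x30: the two masked bits are taken from the shadow (where they are clear) and the unmasked bits pass through unchanged.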

arch/x86/kvm/vmx/sgx.c

Lines changed: 3 additions & 1 deletion
@@ -182,8 +182,10 @@ static int __handle_encls_ecreate(struct kvm_vcpu *vcpu,
 	/* Enforce CPUID restriction on max enclave size. */
 	max_size_log2 = (attributes & SGX_ATTR_MODE64BIT) ? sgx_12_0->edx >> 8 :
 							    sgx_12_0->edx;
-	if (size >= BIT_ULL(max_size_log2))
+	if (size >= BIT_ULL(max_size_log2)) {
 		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}

 	/*
 	 * sgx_virt_ecreate() returns:
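
Reviewer's sketch (not part of the commit): the SGX hunk above fixes a fall-through, where the handler injected #GP for an oversized enclave but then carried on to attempt ECREATE anyway. The C below models the size check and the early return. It is an illustration under stated assumptions: the names are hypothetical, the log2 limit is assumed to be an 8-bit field (low byte of EDX for non-64-bit enclaves, next byte for 64-bit ones, as the hunk's expression suggests), and a guard for shift counts of 64 or more is added because this standalone version cannot rely on the surrounding kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Records what the modelled handler did, for inspection by the caller. */
struct ecreate_trace {
	bool gp_injected;
	bool ecreate_attempted;
};

/* Enforce the CPUID-enumerated maximum enclave size (log2 encoded). */
static bool enclave_size_ok(uint64_t size, uint32_t edx, bool mode64)
{
	/* Assumed 8-bit field: bits 15:8 for 64-bit mode, bits 7:0 otherwise. */
	uint8_t max_size_log2 = mode64 ? (uint8_t)(edx >> 8) : (uint8_t)edx;

	if (max_size_log2 >= 64)
		return true; /* avoid an undefined shift in this sketch */
	return size < (UINT64_C(1) << max_size_log2);
}

/* Modelled handler: returns 1 ("resume the guest") in both cases. */
static int handle_ecreate(uint64_t size, uint32_t edx, bool mode64,
			  struct ecreate_trace *t)
{
	if (!enclave_size_ok(size, edx, mode64)) {
		t->gp_injected = true;
		return 1; /* the fix: stop here instead of falling through */
	}

	t->ecreate_attempted = true;
	return 1;
}
```

Before the fix, the equivalent of the early `return 1` was missing, so an oversized request would both inject #GP and go on to attempt the ECREATE.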

arch/x86/kvm/vmx/vmenter.S

Lines changed: 2 additions & 0 deletions
@@ -269,6 +269,7 @@ SYM_FUNC_END(__vmx_vcpu_run)

 .section .text, "ax"

+#ifndef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 /**
  * vmread_error_trampoline - Trampoline from inline asm to vmread_error()
  * @field: VMCS field encoding that failed
@@ -317,6 +318,7 @@ SYM_FUNC_START(vmread_error_trampoline)

 	RET
 SYM_FUNC_END(vmread_error_trampoline)
+#endif

 SYM_FUNC_START(vmx_do_interrupt_nmi_irqoff)
 	/*
