
Commit b64d740

Junaid Shahid authored and bonzini committed
kvm: x86: mmu: Always flush TLBs when enabling dirty logging
When A/D bits are not available, KVM uses a software access-tracking mechanism, which involves making the SPTEs inaccessible. However, the clear_young() MMU notifier does not flush TLBs, so stale, potentially writable TLB entries may remain.

This is usually fine, but it is problematic when enabling dirty logging, because that path currently flushes TLBs only if some SPTEs were modified. If all SPTEs are in the access-tracked state, there is no TLB flush, which means the guest could still write to memory without those writes being reflected in the dirty bitmap.

So just unconditionally flush the TLBs when enabling dirty logging. As an alternative, KVM could explicitly check the MMU-Writable bit when write-protecting SPTEs to decide whether a flush is needed (instead of checking the Writable bit), but given that a flush almost always happens anyway, just making it unconditional is simpler.

Signed-off-by: Junaid Shahid <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
1 parent 1441ca1 commit b64d740

File tree

3 files changed (+61, -42 lines)

arch/x86/kvm/mmu/mmu.c

Lines changed: 7 additions & 38 deletions
@@ -6072,47 +6072,18 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      const struct kvm_memory_slot *memslot,
 				      int start_level)
 {
-	bool flush = false;
-
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
-					  start_level, KVM_MAX_HUGEPAGE_LEVEL,
-					  false);
+		slot_handle_level(kvm, memslot, slot_rmap_write_protect,
+				  start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
 		write_unlock(&kvm->mmu_lock);
 	}
 
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
-		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
+		kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
 		read_unlock(&kvm->mmu_lock);
 	}
-
-	/*
-	 * Flush TLBs if any SPTEs had to be write-protected to ensure that
-	 * guest writes are reflected in the dirty bitmap before the memslot
-	 * update completes, i.e. before enabling dirty logging is visible to
-	 * userspace.
-	 *
-	 * Perform the TLB flush outside the mmu_lock to reduce the amount of
-	 * time the lock is held. However, this does mean that another CPU can
-	 * now grab mmu_lock and encounter a write-protected SPTE while CPUs
-	 * still have a writable mapping for the associated GFN in their TLB.
-	 *
-	 * This is safe but requires KVM to be careful when making decisions
-	 * based on the write-protection status of an SPTE. Specifically, KVM
-	 * also write-protects SPTEs to monitor changes to guest page tables
-	 * during shadow paging, and must guarantee no CPUs can write to those
-	 * page before the lock is dropped. As mentioned in the previous
-	 * paragraph, a write-protected SPTE is no guarantee that CPU cannot
-	 * perform writes. So to determine if a TLB flush is truly required, KVM
-	 * will clear a separate software-only bit (MMU-writable) and skip the
-	 * flush if-and-only-if this bit was already clear.
-	 *
-	 * See is_writable_pte() for more details.
-	 */
-	if (flush)
-		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
 static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)

@@ -6480,32 +6451,30 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot)
 {
-	bool flush = false;
-
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		/*
 		 * Clear dirty bits only on 4k SPTEs since the legacy MMU only
 		 * support dirty logging at a 4k granularity.
 		 */
-		flush = slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false);
+		slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false);
 		write_unlock(&kvm->mmu_lock);
 	}
 
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
-		flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
+		kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
 		read_unlock(&kvm->mmu_lock);
 	}
 
 	/*
+	 * The caller will flush the TLBs after this function returns.
+	 *
 	 * It's also safe to flush TLBs out of mmu lock here as currently this
 	 * function is only used for dirty logging, in which case flushing TLB
 	 * out of mmu lock also guarantees no dirty pages will be lost in
 	 * dirty_bitmap.
 	 */
-	if (flush)
-		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
 void kvm_mmu_zap_all(struct kvm *kvm)

arch/x86/kvm/mmu/spte.h

Lines changed: 10 additions & 4 deletions
@@ -343,7 +343,7 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
 }
 
 /*
- * An shadow-present leaf SPTE may be non-writable for 3 possible reasons:
+ * A shadow-present leaf SPTE may be non-writable for 4 possible reasons:
  *
  * 1. To intercept writes for dirty logging. KVM write-protects huge pages
  *    so that they can be split be split down into the dirty logging

@@ -361,8 +361,13 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
  *    read-only memslot or guest memory backed by a read-only VMA. Writes to
  *    such pages are disallowed entirely.
  *
- * To keep track of why a given SPTE is write-protected, KVM uses 2
- * software-only bits in the SPTE:
+ * 4. To emulate the Accessed bit for SPTEs without A/D bits. Note, in this
+ *    case, the SPTE is access-protected, not just write-protected!
+ *
+ * For cases #1 and #4, KVM can safely make such SPTEs writable without taking
+ * mmu_lock as capturing the Accessed/Dirty state doesn't require taking it.
+ * To differentiate #1 and #4 from #2 and #3, KVM uses two software-only bits
+ * in the SPTE:
  *
  * shadow_mmu_writable_mask, aka MMU-writable -
  *   Cleared on SPTEs that KVM is currently write-protecting for shadow paging

@@ -391,7 +396,8 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
  * shadow page tables between vCPUs. Write-protecting an SPTE for dirty logging
  * (which does not clear the MMU-writable bit), does not flush TLBs before
  * dropping the lock, as it only needs to synchronize guest writes with the
- * dirty bitmap.
+ * dirty bitmap. Similarly, making the SPTE inaccessible (and non-writable) for
+ * access-tracking via the clear_young() MMU notifier also does not flush TLBs.
 *
 * So, there is the problem: clearing the MMU-writable bit can encounter a
 * write-protected SPTE while CPUs still have writable mappings for that SPTE

arch/x86/kvm/x86.c

Lines changed: 44 additions & 0 deletions
@@ -12473,6 +12473,50 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		} else {
 			kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
 		}
+
+		/*
+		 * Unconditionally flush the TLBs after enabling dirty logging.
+		 * A flush is almost always going to be necessary (see below),
+		 * and unconditionally flushing allows the helpers to omit
+		 * the subtly complex checks when removing write access.
+		 *
+		 * Do the flush outside of mmu_lock to reduce the amount of
+		 * time mmu_lock is held.  Flushing after dropping mmu_lock is
+		 * safe as KVM only needs to guarantee the slot is fully
+		 * write-protected before returning to userspace, i.e. before
+		 * userspace can consume the dirty status.
+		 *
+		 * Flushing outside of mmu_lock requires KVM to be careful when
+		 * making decisions based on writable status of an SPTE, e.g. a
+		 * !writable SPTE doesn't guarantee a CPU can't perform writes.
+		 *
+		 * Specifically, KVM also write-protects guest page tables to
+		 * monitor changes when using shadow paging, and must guarantee
+		 * no CPUs can write to those page before mmu_lock is dropped.
+		 * Because CPUs may have stale TLB entries at this point, a
+		 * !writable SPTE doesn't guarantee CPUs can't perform writes.
+		 *
+		 * KVM also allows making SPTES writable outside of mmu_lock,
+		 * e.g. to allow dirty logging without taking mmu_lock.
+		 *
+		 * To handle these scenarios, KVM uses a separate software-only
+		 * bit (MMU-writable) to track if a SPTE is !writable due to
+		 * a guest page table being write-protected (KVM clears the
+		 * MMU-writable flag when write-protecting for shadow paging).
+		 *
+		 * The use of MMU-writable is also the primary motivation for
+		 * the unconditional flush.  Because KVM must guarantee that a
+		 * CPU doesn't contain stale, writable TLB entries for a
+		 * !MMU-writable SPTE, KVM must flush if it encounters any
+		 * MMU-writable SPTE regardless of whether the actual hardware
+		 * writable bit was set.  I.e. KVM is almost guaranteed to need
+		 * to flush, while unconditionally flushing allows the "remove
+		 * write access" helpers to ignore MMU-writable entirely.
+		 *
+		 * See is_writable_pte() for more details (the case involving
+		 * access-tracked SPTEs is particularly relevant).
+		 */
+		kvm_arch_flush_remote_tlbs_memslot(kvm, new);
 	}
 }
