
Commit 3a0f64d

sean-jc authored and bonzini committed
KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during tdp_iter_next() and do not advance the iterator. Advancing the iterator results in skipping the top-level SPTE and all its children, which is fatal if any of the skipped SPTEs were not visited before yielding.

When zapping all SPTEs, i.e. when min_level == root_level, restarting the iter and then invoking tdp_iter_next() is always fatal if the current gfn has a valid SPTE, as advancing the iterator results in try_step_side() skipping the current gfn, which wasn't visited before yielding.

Sprinkle WARNs on iter->yielded being true in various helpers that are often used in conjunction with yielding, and tag the helper with __must_check to reduce the probability of improper usage.

Failing to zap a top-level SPTE manifests in one of two ways. If a valid SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(), the shadow page will be leaked and KVM will WARN accordingly.

  WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
  RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
  Call Trace:
   <TASK>
   kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
   kvm_destroy_vm+0x162/0x2a0 [kvm]
   kvm_vcpu_release+0x34/0x60 [kvm]
   __fput+0x82/0x240
   task_work_run+0x5c/0x90
   do_exit+0x364/0xa10
   ? futex_unqueue+0x38/0x60
   do_group_exit+0x33/0xa0
   get_signal+0x155/0x850
   arch_do_signal_or_restart+0xed/0x750
   exit_to_user_mode_prepare+0xc5/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x48/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of marking a struct page as dirty/accessed after it has been put back on the free list. This directly triggers a WARN due to encountering a page with page_count() == 0, but it can also lead to data corruption and additional errors in the kernel.

  WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
  RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
  Call Trace:
   <TASK>
   kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
   __handle_changed_spte+0x92e/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   zap_gfn_range+0x549/0x620 [kvm]
   kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
   mmu_free_root_page+0x219/0x2c0 [kvm]
   kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
   kvm_mmu_unload+0x1c/0xa0 [kvm]
   kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
   kvm_put_kvm+0x3b1/0x8b0 [kvm]
   kvm_vcpu_release+0x4e/0x70 [kvm]
   __fput+0x1f7/0x8c0
   task_work_run+0xf8/0x1a0
   do_exit+0x97b/0x2230
   do_group_exit+0xda/0x2a0
   get_signal+0x3be/0x1e50
   arch_do_signal_or_restart+0x244/0x17f0
   exit_to_user_mode_prepare+0xcb/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x4d/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Note, the underlying bug existed even before commit 1af4a96 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still incorrectly advance past a top-level entry when yielding on a lower-level entry. But with respect to leaking shadow pages, the bug was introduced by yielding before processing the current gfn.

Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or callers could jump to their "retry" label. The downside of that approach is that tdp_mmu_iter_cond_resched() _must_ be called before anything else in the loop, and there's no easy way to enforce that requirement.

Ideally, KVM would handle the cond_resched() fully within the iterator macro (the code is actually quite clean) and avoid this entire class of bugs, but that is extremely difficult to do while also supporting yielding after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE may block operations on the SPTE for a significant amount of time.

Fixes: faaf05b ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <[email protected]>
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
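The skip-vs-restart failure mode described above can be illustrated with a small standalone model. This is deliberately NOT kernel code: the multi-level paging walk is collapsed to a flat array of "SPTEs", locking is elided, and all names (toy_iter, toy_walk, etc.) are invented for illustration only; they merely echo the kernel's tdp_iter fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy iterator: 'pos' stands in for the gfn, 'yielded' for the new flag. */
struct toy_iter {
	size_t pos;	/* current entry; a real restart re-derives this */
	bool yielded;	/* set when "mmu_lock" was dropped mid-walk */
};

static void toy_iter_restart(struct toy_iter *it)
{
	it->yielded = false;
	/* the real restart re-walks from the root back to the same gfn */
}

/* Fixed behavior: after a yield, next() restarts instead of advancing. */
static void toy_iter_next(struct toy_iter *it)
{
	if (it->yielded) {
		toy_iter_restart(it);
		return;		/* revisit the current entry */
	}
	it->pos++;
}

/*
 * Walk nents entries, yielding exactly once before processing entry 1.
 * With the old ordering (fixed == false) the yield path restarts the
 * iterator itself, and the loop's next() then advances past entry 1
 * without it ever being processed -- the leaked-SPTE bug in miniature.
 */
static void toy_walk(struct toy_iter *it, size_t nents, bool fixed,
		     bool *visited)
{
	bool did_yield = false;

	for (; it->pos < nents; toy_iter_next(it)) {
		if (it->pos == 1 && !did_yield) {
			did_yield = true;
			if (fixed)
				it->yielded = true;	/* defer to next() */
			else
				toy_iter_restart(it);	/* old, buggy order */
			continue;
		}
		visited[it->pos] = true;	/* "zap" the entry */
	}
}
```

In the buggy variant, entry 1 is never visited; in the fixed variant, tdp_iter_next()-style deferral revisits and processes it.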
1 parent 9fb12fe commit 3a0f64d

File tree: 3 files changed, 28 additions, 13 deletions

arch/x86/kvm/mmu/tdp_iter.c (6 additions, 0 deletions)

@@ -26,6 +26,7 @@ static gfn_t round_gfn_for_level(gfn_t gfn, int level)
  */
 void tdp_iter_restart(struct tdp_iter *iter)
 {
+	iter->yielded = false;
 	iter->yielded_gfn = iter->next_last_level_gfn;
 	iter->level = iter->root_level;

@@ -160,6 +161,11 @@ static bool try_step_up(struct tdp_iter *iter)
  */
 void tdp_iter_next(struct tdp_iter *iter)
 {
+	if (iter->yielded) {
+		tdp_iter_restart(iter);
+		return;
+	}
+
 	if (try_step_down(iter))
 		return;

arch/x86/kvm/mmu/tdp_iter.h (6 additions, 0 deletions)

@@ -45,6 +45,12 @@ struct tdp_iter {
 	 * iterator walks off the end of the paging structure.
 	 */
 	bool valid;
+	/*
+	 * True if KVM dropped mmu_lock and yielded in the middle of a walk, in
+	 * which case tdp_iter_next() needs to restart the walk at the root
+	 * level instead of advancing to the next entry.
+	 */
+	bool yielded;
 };

 /*

arch/x86/kvm/mmu/tdp_mmu.c (16 additions, 13 deletions)

@@ -502,6 +502,8 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 					   struct tdp_iter *iter,
 					   u64 new_spte)
 {
+	WARN_ON_ONCE(iter->yielded);
+
 	lockdep_assert_held_read(&kvm->mmu_lock);

 	/*
@@ -575,6 +577,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 				      u64 new_spte, bool record_acc_track,
 				      bool record_dirty_log)
 {
+	WARN_ON_ONCE(iter->yielded);
+
 	lockdep_assert_held_write(&kvm->mmu_lock);

 	/*
@@ -640,18 +644,19 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
  * If this function should yield and flush is set, it will perform a remote
  * TLB flush before yielding.
  *
- * If this function yields, it will also reset the tdp_iter's walk over the
- * paging structure and the calling function should skip to the next
- * iteration to allow the iterator to continue its traversal from the
- * paging structure root.
+ * If this function yields, iter->yielded is set and the caller must skip to
+ * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
+ * over the paging structures to allow the iterator to continue its traversal
+ * from the paging structure root.
  *
- * Return true if this function yielded and the iterator's traversal was reset.
- * Return false if a yield was not needed.
+ * Returns true if this function yielded.
  */
-static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
-					     struct tdp_iter *iter, bool flush,
-					     bool shared)
+static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
+							  struct tdp_iter *iter,
+							  bool flush, bool shared)
 {
+	WARN_ON(iter->yielded);
+
 	/* Ensure forward progress has been made before yielding. */
 	if (iter->next_last_level_gfn == iter->yielded_gfn)
 		return false;
@@ -671,12 +676,10 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,

 		WARN_ON(iter->gfn > iter->next_last_level_gfn);

-		tdp_iter_restart(iter);
-
-		return true;
+		iter->yielded = true;
 	}

-	return false;
+	return iter->yielded;
 }

 /*
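The forward-progress check retained at the top of tdp_mmu_iter_cond_resched() is what keeps the restart-from-root scheme from livelocking: a yield is refused until the walk has advanced past the gfn recorded at the previous yield. A minimal sketch of that check, as a standalone toy model rather than kernel code (toy_resched_iter and toy_cond_resched are invented names; the field names echo the kernel's tdp_iter, and the actual lock drop/reacquire and cond_resched() are elided):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy iterator holding only the fields relevant to the progress check. */
struct toy_resched_iter {
	unsigned long next_last_level_gfn;	/* current walk position */
	unsigned long yielded_gfn;		/* position at last yield */
	bool yielded;
};

static bool toy_cond_resched(struct toy_resched_iter *iter, bool need_resched)
{
	/* Ensure forward progress has been made before yielding. */
	if (iter->next_last_level_gfn == iter->yielded_gfn)
		return false;

	if (need_resched) {
		/* the kernel drops and reacquires mmu_lock here */
		iter->yielded_gfn = iter->next_last_level_gfn;
		iter->yielded = true;
	}

	return iter->yielded;
}
```

Because the walk restarts from the root after every yield, refusing a second yield at the same gfn guarantees that each restarted walk zaps at least one more entry before the lock can be dropped again.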
