Skip to content

Commit fef8f2b

Browse files
maluka-dmytrobonzini
authored andcommitted
KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking
KVM irqfd based emulation of level-triggered interrupts doesn't work quite correctly in some cases, particularly in the case of interrupts that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device in its threaded irq handler, i.e. later than it is acked to the interrupt controller (EOI at the end of hardirq), not earlier. Linux keeps such interrupt masked until its threaded handler finishes, to prevent the EOI from re-asserting an unacknowledged interrupt. However, with KVM + vfio (or whatever is listening on the resamplefd) we always notify resamplefd at the EOI, so vfio prematurely unmasks the host physical IRQ, thus a new physical interrupt is fired in the host. This extra interrupt in the host is not a problem per se. The problem is that it is unconditionally queued for injection into the guest, so the guest sees an extra bogus interrupt. [*] There are observed at least 2 user-visible issues caused by those extra erroneous interrupts for a oneshot irq in the guest: 1. System suspend aborted due to a pending wakeup interrupt from ChromeOS EC (drivers/platform/chrome/cros_ec.c). 2. Annoying "invalid report id data" errors from ELAN0000 touchpad (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg every time the touchpad is touched. The core issue here is that by the time when the guest unmasks the IRQ, the physical IRQ line is no longer asserted (since the guest has acked the interrupt to the device in the meantime), yet we unconditionally inject the interrupt queued into the guest by the previous resampling. So to fix the issue, we need a way to detect that the IRQ is no longer pending, and cancel the queued interrupt in this case. With IOAPIC we are not able to probe the physical IRQ line state directly (at least not if the underlying physical interrupt controller is an IOAPIC too), so in this patch we use irqfd resampler for that. Namely, instead of injecting the queued interrupt, we just notify the resampler that this interrupt is done. If the IRQ line is actually already deasserted, we are done. If it is still asserted, a new interrupt will be shortly triggered through irqfd and injected into the guest. In the case if there is no irqfd resampler registered for this IRQ, we cannot fix the issue, so we keep the existing behavior: immediately unconditionally inject the queued interrupt. This patch fixes the issue for x86 IOAPIC only. In the long run, we can fix it for other irqchips and other architectures too, possibly taking advantage of reading the physical state of the IRQ line, which is possible with some other irqchips (e.g. with arm64 GIC, maybe even with the legacy x86 PIC). [*] In this description we assume that the interrupt is a physical host interrupt forwarded to the guest e.g. by vfio. Potentially the same issue may occur also with a purely virtual interrupt from an emulated device, e.g. if the guest handles this interrupt, again, as a oneshot interrupt. Signed-off-by: Dmytro Maluka <[email protected]> Link: https://lore.kernel.org/kvm/[email protected]/ Link: https://lore.kernel.org/lkml/[email protected]/ Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
1 parent d583fbd commit fef8f2b

File tree

3 files changed

+78
-9
lines changed

3 files changed

+78
-9
lines changed

arch/x86/kvm/ioapic.c

Lines changed: 33 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -368,9 +368,39 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
368368
mask_after = e->fields.mask;
369369
if (mask_before != mask_after)
370370
kvm_fire_mask_notifiers(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index, mask_after);
371-
if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG
372-
&& ioapic->irr & (1 << index))
373-
ioapic_service(ioapic, index, false);
371+
if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG &&
372+
ioapic->irr & (1 << index) && !e->fields.mask && !e->fields.remote_irr) {
373+
/*
374+
* Pending status in irr may be outdated: the IRQ line may have
375+
* already been deasserted by a device while the IRQ was masked.
376+
* This occurs, for instance, if the interrupt is handled in a
377+
* Linux guest as a oneshot interrupt (IRQF_ONESHOT). In this
378+
* case the guest acknowledges the interrupt to the device in
379+
* its threaded irq handler, i.e. after the EOI but before
380+
* unmasking, so at the time of unmasking the IRQ line is
381+
* already down but our pending irr bit is still set. In such
382+
* cases, injecting this pending interrupt to the guest is
383+
* buggy: the guest will receive an extra unwanted interrupt.
384+
*
385+
* So we need to check here if the IRQ is actually still pending.
386+
* As we are generally not able to probe the IRQ line status
387+
* directly, we do it through irqfd resampler. Namely, we clear
388+
* the pending status and notify the resampler that this interrupt
389+
* is done, without actually injecting it into the guest. If the
390+
* IRQ line is actually already deasserted, we are done. If it is
391+
* still asserted, a new interrupt will be shortly triggered
392+
* through irqfd and injected into the guest.
393+
*
394+
* If, however, it's not possible to resample (no irqfd resampler
395+
* registered for this irq), then unconditionally inject this
396+
* pending interrupt into the guest, so the guest will not miss
397+
* an interrupt, although may get an extra unwanted interrupt.
398+
*/
399+
if (kvm_notify_irqfd_resampler(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index))
400+
ioapic->irr &= ~(1 << index);
401+
else
402+
ioapic_service(ioapic, index, false);
403+
}
374404
if (e->fields.delivery_mode == APIC_DM_FIXED) {
375405
struct kvm_lapic_irq irq;
376406

include/linux/kvm_host.h

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1987,6 +1987,9 @@ int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
19871987
#ifdef CONFIG_HAVE_KVM_IRQFD
19881988
int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
19891989
void kvm_irqfd_release(struct kvm *kvm);
1990+
bool kvm_notify_irqfd_resampler(struct kvm *kvm,
1991+
unsigned int irqchip,
1992+
unsigned int pin);
19901993
void kvm_irq_routing_update(struct kvm *);
19911994
#else
19921995
static inline int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
@@ -1995,6 +1998,13 @@ static inline int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
19951998
}
19961999

19972000
static inline void kvm_irqfd_release(struct kvm *kvm) {}
2001+
2002+
static inline bool kvm_notify_irqfd_resampler(struct kvm *kvm,
2003+
unsigned int irqchip,
2004+
unsigned int pin)
2005+
{
2006+
return false;
2007+
}
19982008
#endif
19992009

20002010
#else

virt/kvm/eventfd.c

Lines changed: 35 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -55,6 +55,15 @@ irqfd_inject(struct work_struct *work)
5555
irqfd->gsi, 1, false);
5656
}
5757

58+
static void irqfd_resampler_notify(struct kvm_kernel_irqfd_resampler *resampler)
59+
{
60+
struct kvm_kernel_irqfd *irqfd;
61+
62+
list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
63+
srcu_read_lock_held(&resampler->kvm->irq_srcu))
64+
eventfd_signal(irqfd->resamplefd, 1);
65+
}
66+
5867
/*
5968
* Since resampler irqfds share an IRQ source ID, we de-assert once
6069
* then notify all of the resampler irqfds using this GSI. We can't
@@ -65,7 +74,6 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
6574
{
6675
struct kvm_kernel_irqfd_resampler *resampler;
6776
struct kvm *kvm;
68-
struct kvm_kernel_irqfd *irqfd;
6977
int idx;
7078

7179
resampler = container_of(kian,
@@ -76,11 +84,7 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
7684
resampler->notifier.gsi, 0, false);
7785

7886
idx = srcu_read_lock(&kvm->irq_srcu);
79-
80-
list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
81-
srcu_read_lock_held(&kvm->irq_srcu))
82-
eventfd_signal(irqfd->resamplefd, 1);
83-
87+
irqfd_resampler_notify(resampler);
8488
srcu_read_unlock(&kvm->irq_srcu, idx);
8589
}
8690

@@ -648,6 +652,31 @@ void kvm_irq_routing_update(struct kvm *kvm)
648652
spin_unlock_irq(&kvm->irqfds.lock);
649653
}
650654

655+
bool kvm_notify_irqfd_resampler(struct kvm *kvm,
656+
unsigned int irqchip,
657+
unsigned int pin)
658+
{
659+
struct kvm_kernel_irqfd_resampler *resampler;
660+
int gsi, idx;
661+
662+
idx = srcu_read_lock(&kvm->irq_srcu);
663+
gsi = kvm_irq_map_chip_pin(kvm, irqchip, pin);
664+
if (gsi != -1) {
665+
list_for_each_entry_srcu(resampler,
666+
&kvm->irqfds.resampler_list, link,
667+
srcu_read_lock_held(&kvm->irq_srcu)) {
668+
if (resampler->notifier.gsi == gsi) {
669+
irqfd_resampler_notify(resampler);
670+
srcu_read_unlock(&kvm->irq_srcu, idx);
671+
return true;
672+
}
673+
}
674+
}
675+
srcu_read_unlock(&kvm->irq_srcu, idx);
676+
677+
return false;
678+
}
679+
651680
/*
652681
* create a host-wide workqueue for issuing deferred shutdown requests
653682
* aggregated from all vm* instances. We need our own isolated

0 commit comments

Comments
 (0)