
Commit 89c4ff5

Balazs Gibizer authored and committed
Add a WA flag waiting for vif-plugged event during reboot
The libvirt driver's power on and hard reboot destroy the domain first and unplug the vifs, then recreate the domain and replug the vifs. However, nova does not wait for the network-vif-plugged event before unpausing the domain. This can cause the domain to start running and request an IP via DHCP before the networking backend has finished plugging the vifs.

This patch therefore adds a workaround config option to nova to wait for network-vif-plugged events during hard reboot, the same way nova waits for this event during new instance spawn.

This logic cannot be enabled unconditionally, as not all neutron networking backends send plug time events to wait for. The logic also needs to be vnic_type dependent, as ml2/ovs and the in-tree sriov backend are often deployed together on the same compute. While ml2/ovs sends a plug time event, the sriov backend does not send it reliably. So the configuration is not just a boolean flag but a list of vnic_types. This way, waiting for the plug time event of a vif handled by ml2/ovs is possible even while the instance has other vifs handled by the sriov backend, where no event can be expected.

Conflicts:
    nova/conf/workarounds.py
because I2da867f2734b590a884b1fe1200c402cbf7e9e1c is not in stable/wallaby

Change-Id: Ie904d1513b5cf76d6d5f6877545e8eb378dd5499
Closes-Bug: #1946729
(cherry picked from commit 68c970e)
(cherry picked from commit 0c41bfb)
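In nova.conf terms, the setting that the OVS CI job below enables would look like this (a minimal sketch; the option lives in the ``[workarounds]`` group and takes a comma-separated list of vnic types):

    [workarounds]
    wait_for_vif_plugged_event_during_hard_reboot = normal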
Parent: 400d25f

5 files changed (+148, -3 lines)

.zuul.yaml

Lines changed: 6 additions & 0 deletions
@@ -222,6 +222,12 @@
           # reduce the number of placement calls in steady state. Added in
           # Stein.
           resource_provider_association_refresh: 0
+        workarounds:
+          # This wa is an improvement on hard reboot that cannot be turned
+          # on unconditionally. But we know that ml2/ovs sends plug time
+          # events so we can enable this in this ovs job for vnic_type
+          # normal
+          wait_for_vif_plugged_event_during_hard_reboot: normal
       $NOVA_CONF:
         quota:
           # Added in Train.

nova/conf/workarounds.py

Lines changed: 59 additions & 0 deletions
@@ -353,6 +353,65 @@
 * :oslo.config:option:`DEFAULT.instances_path`
 * :oslo.config:option:`image_cache.subdirectory_name`
 * :oslo.config:option:`update_resources_interval`
+"""),
+    cfg.ListOpt('wait_for_vif_plugged_event_during_hard_reboot',
+        item_type=cfg.types.String(
+            choices=[
+                "normal",
+                "direct",
+                "macvtap",
+                "baremetal",
+                "direct-physical",
+                "virtio-forwarder",
+                "smart-nic",
+                "vdpa",
+                "accelerator-direct",
+                "accelerator-direct-physical",
+            ]),
+        default=[],
+        help="""
+The libvirt virt driver implements power on and hard reboot by tearing down
+every vif of the instance being rebooted, then plugging them again. By default
+nova does not wait for the network-vif-plugged event from neutron before it
+lets the instance run. This can cause the instance to request an IP via DHCP
+before the neutron backend has a chance to set up the networking backend after
+the vif plug.
+
+This flag defines which vifs nova expects network-vif-plugged events from
+during hard reboot. The possible values are neutron port vnic types:
+
+* normal
+* direct
+* macvtap
+* baremetal
+* direct-physical
+* virtio-forwarder
+* smart-nic
+* vdpa
+* accelerator-direct
+* accelerator-direct-physical
+
+Adding a ``vnic_type`` to this configuration makes Nova wait for a
+network-vif-plugged event for each of the instance's vifs having the specific
+``vnic_type`` before unpausing the instance, similarly to how new instance
+creation works.
+
+Please note that not all neutron networking backends send plug time events for
+certain ``vnic_type``; therefore this config is empty by default.
+
+The ml2/ovs and the networking-odl backends are known to send plug time events
+for ports with ``normal`` ``vnic_type``, so it is safe to add ``normal`` to
+this config if you are using only those backends in the compute host.
+
+The neutron in-tree SRIOV backend does not reliably send the
+network-vif-plugged event during plug time for ports with ``direct``
+vnic_type and never sends that event for ports with ``direct-physical``
+vnic_type during plug time. For other ``vnic_type`` and backend pairs, please
+consult the developers of the backend.
+
+Related options:
+
+* :oslo.config:option:`DEFAULT.vif_plugging_timeout`
 """),
 ]
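For readers less familiar with oslo.config: a ``ListOpt`` whose ``item_type`` is ``cfg.types.String(choices=...)`` validates every list element against the allowed choices. A minimal standalone sketch of the same pattern (not nova code; the choices list is trimmed for brevity):

    from oslo_config import cfg

    opts = [
        cfg.ListOpt(
            'wait_for_vif_plugged_event_during_hard_reboot',
            item_type=cfg.types.String(choices=["normal", "direct"]),
            default=[],
            help='vnic_types whose vifs emit plug time events.'),
    ]

    conf = cfg.ConfigOpts()
    conf.register_opts(opts, group='workarounds')
    conf(args=[])  # no CLI args or config files: defaults apply

    # Unset means the empty-list default, i.e. the workaround stays off.
    assert conf.workarounds.wait_for_vif_plugged_event_during_hard_reboot == []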

nova/tests/unit/virt/libvirt/test_driver.py

Lines changed: 42 additions & 1 deletion
@@ -16088,7 +16088,48 @@ def test_hard_reboot(self, mock_get_mdev, mock_destroy, mock_get_disk_info,
             accel_info=accel_info)
         mock_create_guest_with_network.assert_called_once_with(self.context,
             dummyxml, instance, network_info, block_device_info,
-            vifs_already_plugged=True)
+            vifs_already_plugged=True, external_events=[])
+
+    @mock.patch('oslo_utils.fileutils.ensure_tree', new=mock.Mock())
+    @mock.patch('nova.virt.libvirt.LibvirtDriver.get_info')
+    @mock.patch('nova.virt.libvirt.LibvirtDriver._create_guest_with_network')
+    @mock.patch('nova.virt.libvirt.LibvirtDriver._get_guest_xml')
+    @mock.patch('nova.virt.libvirt.LibvirtDriver.destroy', new=mock.Mock())
+    @mock.patch(
+        'nova.virt.libvirt.LibvirtDriver._get_all_assigned_mediated_devices',
+        new=mock.Mock(return_value={}))
+    def test_hard_reboot_wait_for_plug(
+        self, mock_get_guest_xml, mock_create_guest_with_network, mock_get_info
+    ):
+        self.flags(
+            group="workarounds",
+            wait_for_vif_plugged_event_during_hard_reboot=["normal"])
+        self.context.auth_token = None
+        instance = objects.Instance(**self.test_instance)
+        network_info = _fake_network_info(self, num_networks=4)
+        network_info[0]["vnic_type"] = "normal"
+        network_info[1]["vnic_type"] = "direct"
+        network_info[2]["vnic_type"] = "normal"
+        network_info[3]["vnic_type"] = "direct-physical"
+        block_device_info = None
+        return_values = [hardware.InstanceInfo(state=power_state.SHUTDOWN),
+                         hardware.InstanceInfo(state=power_state.RUNNING)]
+        mock_get_info.side_effect = return_values
+        mock_get_guest_xml.return_value = mock.sentinel.xml
+
+        drvr = libvirt_driver.LibvirtDriver(fake.FakeVirtAPI(), False)
+        drvr._hard_reboot(
+            self.context, instance, network_info, block_device_info)
+
+        mock_create_guest_with_network.assert_called_once_with(
+            self.context, mock.sentinel.xml, instance, network_info,
+            block_device_info,
+            vifs_already_plugged=False,
+            external_events=[
+                ('network-vif-plugged', uuids.vif1),
+                ('network-vif-plugged', uuids.vif3),
+            ]
+        )

     @mock.patch('oslo_utils.fileutils.ensure_tree')
     @mock.patch('oslo_service.loopingcall.FixedIntervalLoopingCall')
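A note on the ``uuids`` sentinel the new assertion uses: nova's tests import ``oslo_utils.fixture.uuidsentinel`` as ``uuids``, which returns a stable generated UUID per attribute name. Assuming ``_fake_network_info`` tags the vifs with ``uuids.vif1`` through ``uuids.vif4`` (as the assertion implies), ``uuids.vif1`` and ``uuids.vif3`` match the first and third vifs, the two with ``normal`` vnic_type, without hard-coding UUID strings. A minimal sketch of that behavior:

    from oslo_utils.fixture import uuidsentinel as uuids

    # The same attribute name always yields the same UUID within a run...
    assert uuids.vif1 == uuids.vif1
    # ...while different names yield different UUIDs.
    assert uuids.vif1 != uuids.vif3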

nova/virt/libvirt/driver.py

Lines changed: 23 additions & 2 deletions
@@ -3790,11 +3790,32 @@ def _hard_reboot(self, context, instance, network_info,
         # on which vif type we're using and we are working with a stale network
         # info cache here, so won't rely on waiting for neutron plug events.
         # vifs_already_plugged=True means "do not wait for neutron plug events"
+        external_events = []
+        vifs_already_plugged = True
+        event_expected_for_vnic_types = (
+            CONF.workarounds.wait_for_vif_plugged_event_during_hard_reboot)
+        if event_expected_for_vnic_types:
+            # NOTE(gibi): We unplugged every vif during destroy above and we
+            # will replug them with _create_guest_with_network. As the
+            # workaround config has some vnic_types configured we expect
+            # vif-plugged events for every vif with those vnic_types.
+            # TODO(gibi): only wait for events if we know that the networking
+            # backend sends plug time events. For that we need to finish
+            # https://bugs.launchpad.net/neutron/+bug/1821058 first in Neutron
+            # then create a driver -> plug-time event mapping in nova.
+            external_events = [
+                ('network-vif-plugged', vif['id'])
+                for vif in network_info
+                if vif['vnic_type'] in event_expected_for_vnic_types
+            ]
+            vifs_already_plugged = False
+
         # NOTE(efried): The instance should already have a vtpm_secret_uuid
         # registered if appropriate.
         self._create_guest_with_network(
             context, xml, instance, network_info, block_device_info,
-            vifs_already_plugged=True)
+            vifs_already_plugged=vifs_already_plugged,
+            external_events=external_events)

         def _wait_for_reboot():
             """Called at an interval until the VM is running again."""
@@ -7180,7 +7201,7 @@ def _create_guest_with_network(
         power_on: bool = True,
         vifs_already_plugged: bool = False,
         post_xml_callback: ty.Callable = None,
-        external_events: ty.Optional[ty.List[str]] = None,
+        external_events: ty.Optional[ty.List[ty.Tuple[str, str]]] = None,
         cleanup_instance_dir: bool = False,
         cleanup_instance_disks: bool = False,
     ) -> libvirt_guest.Guest:
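The second hunk widens the type annotation to match what the first hunk builds: ``external_events`` now carries ``(event_name, tag)`` pairs rather than bare event names, where the tag is the neutron port id of the vif, so the waiter can match each arriving event to a specific port. If an expected event never arrives, the usual plug-timeout handling governed by ``DEFAULT.vif_plugging_timeout`` applies. A short sketch of the event shape, with a hypothetical port UUID:

    import typing as ty

    # Shape accepted by _create_guest_with_network after this change.
    ExternalEvents = ty.Optional[ty.List[ty.Tuple[str, str]]]

    events: ExternalEvents = [
        # (event name, tag); the tag is the neutron port id of the vif.
        ('network-vif-plugged', 'dd95e278-1c4e-4260-9962-21e7296840d8'),
    ]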
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+---
+issues:
+  - |
+    The libvirt virt driver in Nova implements power on and hard reboot by
+    destroying the domain first and unplugging the vifs, then recreating the
+    domain and replugging the vifs. However, nova does not wait for the
+    network-vif-plugged event before unpausing the domain. This can cause
+    the domain to start running and request an IP via DHCP before the
+    networking backend has finished plugging the vifs. The config option
+    ``[workarounds]wait_for_vif_plugged_event_during_hard_reboot`` has been
+    added, defaulting to an empty list, that can be used to ensure that the
+    libvirt driver waits for the network-vif-plugged event for vifs with a
+    specific ``vnic_type`` before it unpauses the domain during hard reboot.
+    This should only be used if the deployment uses a networking backend
+    that sends such events for the given ``vnic_type`` at vif plug time. The
+    ml2/ovs and the networking-odl Neutron backends are known to send plug
+    time events for ports with ``normal`` ``vnic_type``. For more
+    information see https://bugs.launchpad.net/nova/+bug/1946729
