Commit f167891

Author: Balazs Gibizer
Repro gen conflict in COMPUTE_STATUS_DISABLED handling
The COMPUTE_STATUS_DISABLED trait is supposed to be added to the compute RP when the compute service is disabled, and removed again when the service is re-enabled. However, adding and removing traits is subject to generation conflicts in placement. The original implementation of blueprint pre-filter-disabled-computes noticed this and prints a detailed warning message while the API operation succeeds. We can ignore the conflict this way because the periodic update_available_resource() call will re-sync the traits later. Still, this leaves a human-noticeable time window in which the trait and the service state are not in sync.

Disabling the compute service is the smaller problem, as the scheduler still uses the ComputeFilter, which filters the computes based on the service API. So during the enable -> disable race window we only lose scheduling performance, because the placement pre-filter does not yet exclude the compute.

When the compute service is set to enabled, the race is more visible, as the placement pre_filter will filter out the compute that the admin has enabled until the re-sync happens. If the de-sync only happened due to high load on the given compute, then such a delay could be explained by the load itself. However, the de-sync can happen simply due to a new instance boot on the compute.

This patch adds a functional test that reproduces the original problem.

Related-Bug: #1886418
Change-Id: Ib980b1ba68ffcfe51a15dce10eb9f42ef12d7260
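The generation conflict described in the commit message can be illustrated with a toy model. This is only a sketch: `FakePlacement`, `bump()` and `sync_disabled_trait()` are hypothetical names invented for illustration, not nova or placement APIs, and the assumption that a server boot bumps the provider generation is taken from the commit message rather than verified here.

```python
TRAIT = "COMPUTE_STATUS_DISABLED"


class GenerationConflict(Exception):
    """Raised when a write carries a stale provider generation."""


class FakePlacement:
    """Hypothetical in-memory stand-in for one resource provider."""

    def __init__(self):
        self.generation = 0
        self.traits = set()

    def put_traits(self, traits, generation):
        # A write is accepted only if the caller saw the latest generation.
        if generation != self.generation:
            raise GenerationConflict(
                "expected %d, got %d" % (self.generation, generation))
        self.traits = set(traits)
        self.generation += 1

    def bump(self):
        # Stands in for any other write to the provider, e.g. the
        # allocation created by a server boot (assumption).
        self.generation += 1


def sync_disabled_trait(placement, cached_generation, disabled):
    """Return True if the trait was synced, False on a conflict."""
    traits = set(placement.traits)
    if disabled:
        traits.add(TRAIT)
    else:
        traits.discard(TRAIT)
    try:
        placement.put_traits(traits, cached_generation)
    except GenerationConflict:
        # Mirrors the "warn but let the API call succeed" behavior.
        return False
    return True


placement = FakePlacement()

# Disable and enable with a fresh cache: both writes go through.
cached = placement.generation
first_sync = sync_disabled_trait(placement, cached, disabled=True)
cached = placement.generation
sync_disabled_trait(placement, cached, disabled=False)
cached = placement.generation

# A server boot changes the provider behind the cache's back.
placement.bump()

# The next disable uses the stale generation and the trait is not set.
stale_sync = sync_disabled_trait(placement, cached, disabled=True)
```

In this model the service would be reported as disabled while the trait is still absent, which is the de-sync window the commit message describes.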
1 parent 24223ce commit f167891

1 file changed: +84 -0 lines changed

@@ -0,0 +1,84 @@
+# Licensed under the Apache License, Version 2.0 (the "License"); you may
+# not use this file except in compliance with the License. You may obtain
+# a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+# License for the specific language governing permissions and limitations
+# under the License.
+
+from nova.tests.functional import integrated_helpers
+
+
+class TestServices(integrated_helpers._IntegratedTestBase):
+    api_major_version = 'v2.1'
+    microversion = 'latest'
+
+    def setUp(self):
+        super(TestServices, self).setUp()
+        self.compute_rp_uuid = self.admin_api.api_get(
+            'os-hypervisors?hypervisor_hostname_pattern=fake-mini'
+        ).body['hypervisors'][0]['id']
+        self.compute_service_id = self.admin_api.get_services(
+            host='compute', binary='nova-compute')[0]['id']
+
+    def _get_traits_on_compute(self):
+        return self.placement_api.get(
+            '/resource_providers/%s/traits' % self.compute_rp_uuid,
+            version='1.6'
+        ).body['traits']
+
+    def _disable_compute(self):
+        self.api.put_service(
+            self.compute_service_id, {'status': 'disabled'})
+
+    def _enable_compute(self):
+        self.api.put_service(
+            self.compute_service_id, {'status': 'enabled'})
+
+    def _has_disabled_trait(self):
+        return "COMPUTE_STATUS_DISABLED" in self._get_traits_on_compute()
+
+    def test_compute_disable_after_server_create(self):
+        # Check that COMPUTE_STATUS_DISABLED is not on the compute
+        self.assertFalse(self._has_disabled_trait())
+
+        self._disable_compute()
+        # Check that COMPUTE_STATUS_DISABLED is now on the compute
+        self.assertTrue(self._has_disabled_trait())
+
+        self._enable_compute()
+        # Check that COMPUTE_STATUS_DISABLED is not on the compute
+        self.assertFalse(self._has_disabled_trait())
+
+        # Create a server.
+        self._create_server(networks=[])
+
+        self._disable_compute()
+        # FIXME(gibi): Check that COMPUTE_STATUS_DISABLED is now on the
+        # compute. Unfortunately it is not, as the compute manager failed
+        # to update the traits in placement due to a stale provider tree
+        # cache. The cache is stale because a server was booted on the
+        # compute since the last update_available_resource periodic ran.
+        self.assertIn(
+            'An error occurred while updating COMPUTE_STATUS_DISABLED trait '
+            'on compute node resource provider',
+            self.stdlog.logger.output)
+        self.assertFalse(self._has_disabled_trait())
+
+        # This would be the expected behavior
+        #
+        # self.assertTrue(self._has_disabled_trait())
+        #
+        # Alternatively the test could wait for the periodic to run or trigger
+        # it manually.
+
+        # This passes now, but not because enabling works: the fault above
+        # means COMPUTE_STATUS_DISABLED was never on the compute RP in the
+        # first place.
+        self._enable_compute()
+        # Check that COMPUTE_STATUS_DISABLED is removed from the compute
+        self.assertFalse(self._has_disabled_trait())
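
The test comments mention waiting for the periodic task as an alternative. A rough sketch of why the periodic re-sync heals the de-sync, again in a toy model: `ToyProvider`, `ToyCompute` and `periodic()` are hypothetical names made up for illustration and do not correspond to nova's actual classes or to the update_available_resource() implementation.

```python
TRAIT = "COMPUTE_STATUS_DISABLED"


class ToyProvider:
    """Hypothetical provider that rejects writes with a stale generation."""

    def __init__(self):
        self.generation = 0
        self.traits = set()

    def put_traits(self, traits, generation):
        if generation != self.generation:
            return False  # generation conflict
        self.traits = set(traits)
        self.generation += 1
        return True


class ToyCompute:
    """Hypothetical compute that caches the provider generation."""

    def __init__(self, provider):
        self.provider = provider
        self.cached_generation = provider.generation
        self.service_disabled = False

    def _sync_trait(self):
        traits = set(self.provider.traits)
        if self.service_disabled:
            traits.add(TRAIT)
        else:
            traits.discard(TRAIT)
        ok = self.provider.put_traits(traits, self.cached_generation)
        if ok:
            self.cached_generation = self.provider.generation
        return ok

    def set_disabled(self, disabled):
        # The service state always changes; the trait sync may fail.
        self.service_disabled = disabled
        return self._sync_trait()

    def periodic(self):
        # Refresh the cached generation, then re-sync the trait from
        # the current service state.
        self.cached_generation = self.provider.generation
        return self._sync_trait()


provider = ToyProvider()
compute = ToyCompute(provider)

# A server boot changes the provider without refreshing the cache.
provider.generation += 1

synced_on_disable = compute.set_disabled(True)
trait_before_periodic = TRAIT in provider.traits

synced_by_periodic = compute.periodic()
trait_after_periodic = TRAIT in provider.traits
```

Between `set_disabled(True)` and `periodic()` the service is disabled but the trait is missing; after the periodic run the two agree again.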
