
Commit d984a6d

JohnGarbutt authored and melwitt committed
Tell oslo.limit how to count nova resources
A follow-on patch will use this code to enforce the limits; this patch provides the integration with oslo.limit and a new internal nova API that is able to enforce those limits.

The first part is providing a callback for oslo.limit to be able to count the resources being used. We only count resources grouped by project_id. For counting servers, we make use of the instance mappings list in the API database, just as the existing quota code does. While we do check to ensure the queued_for_delete migration has been completed, we simply error out if that is not the case, rather than attempting to fall back to any other counting system. We hope one day we can count this in placement using consumer records, or similar.

For counting all other resource usage, the limits must refer to some usage of a resource class being consumed in placement. This is similar to how the count-with-placement variant of the existing quota code works today. This is not restricted to RAM and VCPU; it is open to any resource class that is known to placement.

The second part is the enforcement method, which keeps a similar signature to the existing enforce_num_instances call that is used to check quotas using the legacy quota system. From the flavor we extract the current resource usage. This is considered the simplest first step that helps us deliver Ironic limits alongside all the existing RAM and VCPU limits. At a later date, we would ideally get passed a more complete view of what resources are being requested from placement.

NOTE: given the instance object doesn't exist when enforce is called, we can't just pass the instance in here.

A [workarounds] option is also available for operators who need the legacy quota usage behavior where VCPU = VCPU + PCPU.

blueprint unified-limits-nova

Change-Id: I272b59b7bc8975bfd602640789f80d2d5f7ee698
1 parent c384824 commit d984a6d
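As context for the callback/enforce split described in the commit message, here is a minimal, self-contained sketch of the data contract between nova and oslo.limit: a usage callback that returns current counts per resource name, and a deltas dict describing what the new request would add. Nothing below is part of this commit; all names and values are illustrative.

# Illustrative only: the shape of the usage/deltas data exchanged with
# oslo.limit. In the real patch the callback is
# nova.limit.placement._get_usage(), which counts 'servers' from instance
# mappings and 'class:<RESOURCE_CLASS>' resources from placement.
def fake_usage_callback(project_id, resource_names):
    usage = {"servers": 2, "class:VCPU": 4, "class:MEMORY_MB": 4096}
    return {name: usage.get(name, 0) for name in resource_names}


# Deltas for booting one instance of a 2 VCPU / 2048 MB flavor, using the
# same resource naming scheme as the patch.
deltas = {"servers": 1, "class:VCPU": 2, "class:MEMORY_MB": 2048}

current = fake_usage_callback("example-project", list(deltas))
print(current)  # {'servers': 2, 'class:VCPU': 4, 'class:MEMORY_MB': 4096}
# oslo.limit's Enforcer(callback).enforce(project_id, deltas) raises
# ProjectOverLimit when usage + delta exceeds a registered limit.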

File tree: 7 files changed (+570 −8 lines changed)


nova/conf/workarounds.py

Lines changed: 18 additions & 0 deletions
@@ -383,6 +383,24 @@
 before compute nodes have been able to update their service record. In an FFU,
 the service records in the database will be more than one version old until
 the compute nodes start up, but control services need to be online first.
+"""),
+    cfg.BoolOpt('unified_limits_count_pcpu_as_vcpu',
+                default=False,
+                help="""
+When using unified limits, use VCPU + PCPU for VCPU quota usage.
+
+If the deployment is configured to use unified limits via
+``[quota]driver=nova.quota.UnifiedLimitsDriver``, by default VCPU resources are
+counted independently from PCPU resources, consistent with how they are
+represented in the placement service.
+
+Legacy quota behavior counts PCPU as VCPU and returns the sum of VCPU + PCPU
+usage as the usage count for VCPU. Operators relying on the aggregation of
+VCPU and PCPU resource usage counts should set this option to True.
+
+Related options:
+
+* :oslo.config:option:`quota.driver`
 """),
 ]
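For reference, a minimal nova.conf fragment that enables unified limits and opts into this workaround might look like the following; this is a sketch assembled from the option names in the diff above, not taken from the commit itself.

[quota]
driver = nova.quota.UnifiedLimitsDriver

[workarounds]
# Fold PCPU usage into the VCPU count, matching legacy quota behavior.
unified_limits_count_pcpu_as_vcpu = True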

nova/limit/placement.py

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
+# Copyright 2022 StackHPC
+#
+# Licensed under the Apache License, Version 2.0 (the "License"); you may
+# not use this file except in compliance with the License. You may obtain
+# a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+# License for the specific language governing permissions and limitations
+# under the License.
+
+
+import os_resource_classes as orc
+from oslo_limit import exception as limit_exceptions
+from oslo_limit import limit
+from oslo_log import log as logging
+
+import nova.conf
+from nova import exception
+from nova.limit import utils as limit_utils
+from nova import objects
+from nova import quota
+from nova.scheduler.client import report
+from nova.scheduler import utils
+
+LOG = logging.getLogger(__name__)
+CONF = nova.conf.CONF
+
+# Cache to avoid repopulating ksa state
+PLACEMENT_CLIENT = None
+
+
+def _get_placement_usages(context, project_id):
+    global PLACEMENT_CLIENT
+    if not PLACEMENT_CLIENT:
+        PLACEMENT_CLIENT = report.SchedulerReportClient()
+    return PLACEMENT_CLIENT.get_usages_counts_for_limits(context, project_id)
+
+
+def _get_usage(context, project_id, resource_names):
+    """Called by oslo_limit's enforcer"""
+    if not limit_utils.use_unified_limits():
+        raise NotImplementedError("unified limits is disabled")
+
+    count_servers = False
+    resource_classes = []
+
+    for resource in resource_names:
+        if resource == "servers":
+            count_servers = True
+            continue
+
+        if not resource.startswith("class:"):
+            raise ValueError("Unknown resource type: %s" % resource)
+
+        # Temporarily strip resource class prefix as placement does not use it.
+        # Example: limit resource 'class:VCPU' will be returned as 'VCPU' from
+        # placement.
+        r_class = resource.lstrip("class:")
+        if r_class in orc.STANDARDS or orc.is_custom(r_class):
+            resource_classes.append(r_class)
+        else:
+            raise ValueError("Unknown resource class: %s" % r_class)
+
+    if not count_servers and len(resource_classes) == 0:
+        raise ValueError("no resources to check")
+
+    resource_counts = {}
+    if count_servers:
+        # TODO(melwitt): Change this to count servers from placement once nova
+        # is using placement consumer types and is able to differentiate
+        # between "instance" allocations vs "migration" allocations.
+        if not quota.is_qfd_populated(context):
+            LOG.error('Must migrate all instance mappings before using '
+                      'unified limits')
+            raise ValueError("must first migrate instance mappings")
+        mappings = objects.InstanceMappingList.get_counts(context, project_id)
+        resource_counts['servers'] = mappings['project']['instances']
+
+    try:
+        usages = _get_placement_usages(context, project_id)
+    except exception.UsagesRetrievalFailed as e:
+        msg = ("Failed to retrieve usages from placement while enforcing "
+               "%s quota limits." % ", ".join(resource_names))
+        LOG.error(msg + " Error: " + str(e))
+        raise exception.UsagesRetrievalFailed(msg)
+
+    # Use legacy behavior VCPU = VCPU + PCPU if configured.
+    if CONF.workarounds.unified_limits_count_pcpu_as_vcpu:
+        # If PCPU is in resource_classes, that means it was specified in the
+        # flavor explicitly. In that case, we expect it to have its own limit
+        # registered and we should not fold it into VCPU.
+        if orc.PCPU in usages and orc.PCPU not in resource_classes:
+            usages[orc.VCPU] = (usages.get(orc.VCPU, 0) +
+                                usages.get(orc.PCPU, 0))
+
+    for resource_class in resource_classes:
+        # Need to add back resource class prefix that was stripped earlier
+        resource_name = 'class:' + resource_class
+        # Placement doesn't know about classes with zero usage
+        # so default to zero to tell oslo.limit usage is zero
+        resource_counts[resource_name] = usages.get(resource_class, 0)
+
+    return resource_counts
+
+
+def _get_deltas_by_flavor(flavor, is_bfv, count):
+    if flavor is None:
+        raise ValueError("flavor")
+    if count < 0:
+        raise ValueError("count")
+
+    # NOTE(johngarbutt): this skips bfv, port, and cyborg resources
+    # but it still gives us better checks than before unified limits
+    # We need an instance in the DB to use the current is_bfv logic
+    # which doesn't work well for instances that don't yet have a uuid
+    deltas_from_flavor = utils.resources_for_limits(flavor, is_bfv)
+
+    deltas = {"servers": count}
+    for resource, amount in deltas_from_flavor.items():
+        if amount != 0:
+            deltas["class:%s" % resource] = amount * count
+    return deltas
+
+
+def _get_enforcer(context, project_id):
+    # NOTE(johngarbutt) should we move context arg into oslo.limit?
+    def callback(project_id, resource_names):
+        return _get_usage(context, project_id, resource_names)
+
+    return limit.Enforcer(callback)
+
+
+def enforce_num_instances_and_flavor(context, project_id, flavor, is_bfvm,
+                                     min_count, max_count, enforcer=None):
+    """Return max instances possible, else raise TooManyInstances exception."""
+    if not limit_utils.use_unified_limits():
+        return max_count
+
+    # Ensure the recursion will always complete
+    if min_count < 0 or min_count > max_count:
+        raise ValueError("invalid min_count")
+    if max_count < 0:
+        raise ValueError("invalid max_count")
+
+    deltas = _get_deltas_by_flavor(flavor, is_bfvm, max_count)
+    enforcer = _get_enforcer(context, project_id)
+    try:
+        enforcer.enforce(project_id, deltas)
+    except limit_exceptions.ProjectOverLimit as e:
+        # NOTE(johngarbutt) we can do better, but this is very simple
+        LOG.debug("Limit check failed with count %s retrying with count %s",
+                  max_count, max_count - 1)
+        try:
+            return enforce_num_instances_and_flavor(context, project_id,
+                                                    flavor, is_bfvm, min_count,
+                                                    max_count - 1,
+                                                    enforcer=enforcer)
+        except ValueError:
+            # Copy the *original* exception message to a OverQuota to
+            # propagate to the API layer
+            raise exception.TooManyInstances(str(e))
+
+    # no problems with max_count, so we return max count
+    return max_count
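To make the retry behaviour of enforce_num_instances_and_flavor() above easier to follow, here is a self-contained sketch of the same strategy with simple stand-ins (nothing below is nova code): try max_count, and on an over-limit failure retry with max_count - 1 until a count fits or the recursion drops below min_count.

# Minimal sketch, assuming simplified stand-ins for oslo.limit and nova.
class ProjectOverLimit(Exception):
    """Stand-in for oslo_limit.exception.ProjectOverLimit."""


def check_limits(count):
    # Pretend only 2 more servers fit under the project's limits.
    if count > 2:
        raise ProjectOverLimit("over limit for %d servers" % count)


def enforce(min_count, max_count):
    if min_count < 0 or min_count > max_count:
        raise ValueError("invalid min_count")
    if max_count < 0:
        raise ValueError("invalid max_count")
    try:
        check_limits(max_count)
    except ProjectOverLimit as e:
        try:
            return enforce(min_count, max_count - 1)
        except ValueError:
            # Recursion bottomed out below min_count: report the original
            # failure, as the real code does via TooManyInstances.
            raise ProjectOverLimit(str(e))
    return max_count


print(enforce(1, 5))  # -> 2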

nova/quota.py

Lines changed: 12 additions & 8 deletions
@@ -1223,6 +1223,17 @@ def _server_group_count_members_by_user_legacy(context, group, user_id):
     return {'user': {'server_group_members': count}}
 
 
+def is_qfd_populated(context):
+    global UID_QFD_POPULATED_CACHE_ALL
+    if not UID_QFD_POPULATED_CACHE_ALL:
+        LOG.debug('Checking whether user_id and queued_for_delete are '
+                  'populated for all projects')
+        UID_QFD_POPULATED_CACHE_ALL = _user_id_queued_for_delete_populated(
+            context)
+
+    return UID_QFD_POPULATED_CACHE_ALL
+
+
 def _server_group_count_members_by_user(context, group, user_id):
     """Get the count of server group members for a group by user.
@@ -1240,14 +1251,7 @@ def _server_group_count_members_by_user(context, group, user_id):
     # So, we check whether user_id/queued_for_delete is populated for all
     # records and cache the result to prevent unnecessary checking once the
     # data migration has been completed.
-    global UID_QFD_POPULATED_CACHE_ALL
-    if not UID_QFD_POPULATED_CACHE_ALL:
-        LOG.debug('Checking whether user_id and queued_for_delete are '
-                  'populated for all projects')
-        UID_QFD_POPULATED_CACHE_ALL = _user_id_queued_for_delete_populated(
-            context)
-
-    if UID_QFD_POPULATED_CACHE_ALL:
+    if is_qfd_populated(context):
         count = objects.InstanceMappingList.get_count_by_uuids_and_user(
             context, group.members, user_id)
         return {'user': {'server_group_members': count}}

nova/scheduler/client/report.py

Lines changed: 24 additions & 0 deletions
@@ -2486,6 +2486,30 @@ def _get_usages(self, context, project_id, user_id=None):
         return self.get(url, version=GET_USAGES_VERSION,
                         global_request_id=context.global_id)
 
+    def get_usages_counts_for_limits(self, context, project_id):
+        """Get the usages counts for the purpose of enforcing unified limits
+
+        The response from placement will not contain a resource class if
+        there is no usage. i.e. if there is no usage, you get an empty dict.
+
+        Note resources are counted as placement sees them, as such note
+        that VCPUs and PCPUs will be counted independently.
+
+        :param context: The request context
+        :param project_id: The project_id to count across
+        :return: A dict containing the project-scoped counts, for example:
+                 {'VCPU': 2, 'MEMORY_MB': 1024}
+        :raises: `exception.UsagesRetrievalFailed` if a placement API call
+                 fails
+        """
+        LOG.debug('Getting usages for project_id %s from placement',
+                  project_id)
+        resp = self._get_usages(context, project_id)
+        if resp:
+            data = resp.json()
+            return data['usages']
+        self._handle_usages_error_from_placement(resp, project_id)
+
     def get_usages_counts_for_quota(self, context, project_id, user_id=None):
         """Get the usages counts for the purpose of counting quota usage.
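For illustration, the placement usages body that get_usages_counts_for_limits() slices the 'usages' key out of looks roughly like the dict below; the values are invented, and resource classes with zero usage are simply absent, which is why the limit code defaults missing classes to zero.

# Hypothetical example of a placement GET /usages?project_id=<id> body; the
# new helper returns only the inner 'usages' mapping (an empty dict when the
# project has no allocations).
placement_body = {
    "usages": {
        "VCPU": 4,
        "MEMORY_MB": 8192,
        "DISK_GB": 40,
    }
}

usages = placement_body["usages"]
print(usages.get("VCPU", 0))  # 4
print(usages.get("PCPU", 0))  # 0 -> classes with no usage are not reported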

nova/scheduler/utils.py

Lines changed: 9 additions & 0 deletions
@@ -615,6 +615,10 @@ def resources_from_flavor(instance, flavor):
     """
     is_bfv = compute_utils.is_volume_backed_instance(instance._context,
                                                      instance)
+    return _get_resources(flavor, is_bfv)
+
+
+def _get_resources(flavor, is_bfv):
     # create a fake RequestSpec as a wrapper to the caller
     req_spec = objects.RequestSpec(flavor=flavor, is_bfv=is_bfv)
@@ -628,6 +632,11 @@ def resources_from_flavor(instance, flavor):
     return res_req.merged_resources()
 
 
+def resources_for_limits(flavor, is_bfv):
+    """Work out what unified limits may be exceeded."""
+    return _get_resources(flavor, is_bfv)
+
+
 def resources_from_request_spec(ctxt, spec_obj, host_manager,
                                 enable_pinning_translate=True):
     """Given a RequestSpec object, returns a ResourceRequest of the resources,

0 commit comments