Conversation
joker-at-work
left a comment
We're not currently using the returned tags and metadata to group the VMs, are we?
I am not sure what you mean by grouping. Can you give a bit more context?
We want to group by cluster as defined by Gardener/Kubernikus. Currently, if I read it correctly, the code only groups by project-id - which we also need, but only for instances without k8s info. If we know the cluster via metadata/tags, we need to group by that.
We use the query
You're right. I didn't get that part, it seems. Thanks for explaining. Looking good then 👍
Force-pushed from 2c3abab to 74012b6 (Compare)
Force-pushed from 74012b6 to 7afd66b (Compare)
@joker-at-work is that an approval, or just a lgtm for this particular part?
Since this is a WIP-named PR with a WIP commit-message, I didn't want to review the whole thing deeply, because it didn't look finished. So no, this was not an approval for the whole PR, but just the part Marius and I talked about.
Force-pushed from 2571454 to 9d433ea (Compare)
Force-pushed from cd19633 to 5b8d8be (Compare)
HANA_PREFIX = "hana_"

class ShardFilter(filters.BaseHostFilter):
The comment for the class needs to be updated. In the cpu_info_migration_filter we also explicitly documented that we're implementing filter_all() instead of host_passes().
Can be overridden in a subclass, if you need to base filtering
decisions on all objects. Otherwise, one can just override
_filter_one() to filter a single object.
This looks like copy-and-paste from the parent class. Maybe rather document why we overrode it?
for obj in filter_obj_list:
    yield obj
This could optionally be yield from filter_obj_list
>>> def a():
... yield from [1, 2, 3]
...
>>> for x in a():
... print(x)
...
1
2
3

tested on Python 3.8
k8s_instances = self._get_k8s_cluster_instances(spec_obj)

for obj in filter_obj_list:
In the cpu_info_migration_filter we called obj host_state and I would prefer to have that name here, too, because it better describes what we're handling here.
# We allow any shard in this AZ for the first instance.
return True

k8s_hosts = set([i.host for i in siblings])
There's no need to build the list, sets can be instantiated from iterators, too, i.e. set(i.host for i in siblings) works.
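As a quick standalone illustration of the suggestion (the sibling objects below are made-up stand-ins, only their .host attribute matters):

```python
from collections import namedtuple

# Hypothetical stand-in for the instance objects in the patch.
Sibling = namedtuple("Sibling", ["host"])

siblings = [Sibling("node1"), Sibling("node2"), Sibling("node1")]

# set() accepts any iterable, including a generator expression,
# so no intermediate list is needed:
k8s_hosts = set(i.host for i in siblings)
assert k8s_hosts == {"node1", "node2"}

# A set comprehension is equivalent and arguably clearer still:
assert {i.host for i in siblings} == k8s_hosts
```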
def _is_hana_flavor(flavor):
    return flavor.name.startswith(HANA_PREFIX)

def _is_same_category(instance, flavor):
    """Check whether instance is from the flavor's family."""
    if _is_hana_flavor(flavor):
        return _is_hana_flavor(instance.flavor)
    return True

siblings = [i for i in k8s_instances
            if _is_same_category(i, spec_obj.flavor)]

if not siblings:
    # This is the first instance of this particular type (HANA).
    # We allow any shard in this AZ for the first instance.
    return True

k8s_hosts = set([i.host for i in siblings])
We compute this for every host even though it's not host-dependent. Could we pass k8s_hosts instead of k8s_instances into the method and compute k8s_hosts in filter_all()?
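A hedged sketch of that refactoring, with simplified stand-ins for the filter class and host states (the real aggregate handling is omitted; names only mirror the patch under review):

```python
from types import SimpleNamespace

class ShardFilterSketch:
    """Simplified stand-in for the filter under review, not the real code."""

    def filter_all(self, filter_obj_list, spec_obj):
        k8s_instances = self._get_k8s_cluster_instances(spec_obj)
        # Host-independent work is done exactly once here ...
        k8s_hosts = {i.host for i in k8s_instances}
        for host_state in filter_obj_list:
            # ... and the per-host check only consumes the result.
            if self._host_passes_k8s(host_state, k8s_hosts):
                yield host_state

    def _host_passes_k8s(self, host_state, k8s_hosts):
        # Simplified check; the real code compares shard aggregates.
        return host_state.host in k8s_hosts

    def _get_k8s_cluster_instances(self, spec_obj):
        # Dummy lookup for the sketch.
        return spec_obj["instances"]

hosts = [SimpleNamespace(host="h1"), SimpleNamespace(host="h2")]
spec = {"instances": [SimpleNamespace(host="h1")]}
picked = [h.host for h in ShardFilterSketch().filter_all(hosts, spec)]
assert picked == ["h1"]
```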
return any(agg.name in host_shard_names and
           set(agg.hosts) & k8s_hosts
           for agg in host_state.aggregates)
You could pass host_shard_aggrs instead of host_shard_names into this method and then write this as
return any(set(aggr.hosts) & k8s_hosts for aggr in host_shard_aggrs)

# BigVMs are scheduled based on their own rules.
if nova_utils.is_big_vm(spec_obj.memory_mb, spec_obj.flavor):
    return []
hm ... if we ignore big VMs here, but later explicitly handle hana_* flavors ... I mean big VMs are always hana_* flavors
I think we should ignore hana_* flavors here already and handle them explicitly in a follow-up, as we also have a ticket open on handling them differently.
Right now hana_* are forced to their own shard if any is already part of the k8s cluster (see the usage of _is_same_category comparator).
Do you mean to remove that check as well, and simply ignore hana_* for now?
I mean we do not go into the _host_passes_k8s() for hana_* flavors and just handle non-hana VMs here for now, as we will implement something afterwards for hanas. IIRC, we will implement "ignore the shard filter for hana flavors unless the project explicitly wants to be bound to a shard for hanas".
if instances and spec_obj.availability_zone:
    return [i for i in instances
            if i.availability_zone == spec_obj.availability_zone]
This means we cannot filter by availability_zone in the query?
I think we can filter in the query as well, it will do a regex lookup with LIKE %az% (since I don't see availability_zone in exact_match_filter_names)
hm ... doesn't sound too nice. filtering in Python might then be better
build_request = BuildRequest.get_by_instance_uuid(
    elevated, spec_obj.instance_uuid)
If this is a resize, do we have a BuildRequest? We call this function before we "skip for resize", as we pre-compute things.
TODO: We need to support instances being already spread across shards and need to take the shard where most of the instances live.
Force-pushed from cb9fbdc to a256459 (Compare)
fwiesel
left a comment
Hi,
I would also move this more into the database here, as the instance list can be quite expensive to fetch even after filtering for AZ and tags, because the rows are quite wide and contain a lot of data we actually do not care about (key-data, user-data, etc.).
Since this is spread over multiple database instances, we need to split the query up.
- Query the hosts for the instances matching the tag/metadata:
In SQL, the query could look like this:
select i.`host`, count(*) as count
from tags as t join instances as i on t.resource_id=i.uuid
where i.deleted=0 and i.availability_zone=$availability_zone and t.tag=$tag
group by i.`host`
order by count desc
limit 1

or for the metadata:
select i.`host`, count(*) as count
from instances as i join instance_metadata as im on im.instance_uuid=i.uuid
where i.deleted=0 and im.deleted=0 and im.`key`=$key and i.availability_zone=$availability_zone
group by i.`host`
order by count desc
limit 1

You can then build the query with sqlalchemy depending on the given metadata/tags, similar to what you are doing now.
But since the metadata lives in the cell databases, the queries need to be scattered over all cells.
Like _get_instance_group_hosts_all_cells
- You can then match the hosts to an aggregate list by simply calling AggregateList.get_by_host.
- Then you can get the dominant shard from this list.
Still open: skipping the filter when it isn't a vSphere VM, and ignoring hosts which are not VMware hosts.
CONF = nova.conf.CONF

_SERVICE_AUTH = None
GARDENER_PREFIX = "kubernetes.io-cluster-shoot--garden--"
I checked in eu-de-2, and I see some entries there starting just with kubernetes.io-cluster-shoot--hcm-eng.
I would shorten the prefix check to kubernetes.io-cluster-, as I don't think we gain much by getting more specific.
Extending the queries from (1) by the following should do the trick to only get the VMware hosts:

select
...
join compute_nodes as cn on i.node=cn.hypervisor_hostname
where ...
  cn.deleted=0 and cn.hypervisor_type='VMware vCenter Server'
Force-pushed from 5dd9c7b to 867a490 (Compare)
self.assertEqual([host1, host2], result)
mock_is_non_vmware_spec.assert_called_once_with(spec_obj)

def _assert_passes(self, host, spec_obj, passes):
Optional: I think self._assert_passes(False, host, spec) is easier to read than self._assert_passes(host, spec, False), because the False basically belongs to the passes. One could even argue that it should be part of the method name.
self.assertEqual(3, len(result))
self.assertEqual(result[0].host, 'host1')
self.assertEqual(result[1].host, 'host2')
self.assertEqual(result[2].host, 'host3')
Is there a reason not to write self.assertEqual(result, ['host1', 'host2', 'host3']) here?
Ah. How did I not see that? :D
So it would have to be self.assertEqual([r.host for r in result], ['host1', 'host2', 'host3']). Yeah, not sure if that's any better.
instance_uuid=self.fake_build_req.instance_uuid,
flavor=fake_flavor.fake_flavor_obj(
    mock.sentinel.ctx, expected_attrs=['extra_specs']))

self.filt_cls._PROJECT_SHARD_CACHE['foo'] = []
self.assertFalse(self.filt_cls.host_passes(host, spec_obj))
self._assert_passes(host, spec_obj, False)
Do you think we need to check if the k8s part of the ShardFilter is called, i.e. what makes sure that _get_k8s_shard() returns None?
Ah, it seems to be ensured by the self.fake_build_req returned in BuildRequest.get_by_instance_uuid. That's ... not right, I guess, because BuildRequest objects should get deleted after the instance is spawned. Therefore, we only test spawns here and not live-migrations or offline migrations.
if utils.is_non_vmware_spec(spec_obj):
    return True
return filter_obj_list
Would we want to log a debug message here that the filter is not applicable for the request, because it's non-VMware?
return any(host_shard == k8s_shard
           for host_shard in host_shard_names)
We do not have a log line here, so we'd only see "project enabled for all shard" or "… found in project" messages from above. In the end, we do not see that the host was rejected due to not matching the k8s shard. I think we should add a log line with the result here.
nova/db/main/api.py
Outdated
@require_context
@pick_context_manager_reader_allow_async
def instance_get_host_by_metadata(context, meta_key, meta_value,
Same here: Please rename the function.
nova/db/main/api.py
Outdated
def instance_get_host_by_tag(context, tag, filters=None):
    count_label = func.count('*').label('count')
    query = context.session.query(models.Instance, count_label). \
        join(models.Tag, models.Tag.resource_id == models.Instance.uuid)
I think you should be able to use the tags backref from the Tags model directly instead of having to define the join-condition again. Upstream seems to do it like this.
nova/db/main/api.py
Outdated
result = query.all()
if result:
    return result[0]
else:
    return None
Optional: You could use query.first() instead. We could also use query.one(), but would have to catch an exception if there is no result, so first() should be better.
return query.first()
As discussed in the last meeting, we will probably need all hosts returned, because we need to find the shard with most VMs, not the host.
nova/db/main/api.py
Outdated
result = query.all()
if result:
    return result[0]
else:
    return None
Optional: Could be return query.first(), too.
nova/db/main/api.py
Outdated
@require_context
@pick_context_manager_reader_allow_async
def instance_get_host_by_tag(context, tag, filters=None):
Please add comments stating the purpose of the 3 functions in this file.
Force-pushed from 867a490 to 3f8eb62 (Compare)
Force-pushed from 3f8eb62 to be090e5 (Compare)
nova/db/main/api.py
Outdated
| """Get the list of K8S hosts and the number of instances associated to | ||
| that K8S running on that host, querying by instances tags. |
typo: "associated to that K8S running on that host" should imho be "associated to the K8S cluster running on that host", right? Could also be fine to just remove "that".
nova/db/main/api.py
Outdated
| """Get the list of K8S hosts and the number of instances associated to | ||
| that K8S running on that host, querying by instances metadata. |
nova/db/main/api.py
Outdated
query = context.session.query(models.Instance, count_label). \
    join(models.Instance.tags)
query = _handle_k8s_hosts_query_filters(query, filters)
query = query.filter(models.Instance.deleted == 0,
                     models.Tag.tag == tag)

query = query.group_by(models.Instance.host). \
    order_by(sql.desc(count_label))

return query.all()
We're not querying for models.Instance.host explicitly. How does it work, that we only get models.Instance.host in the end? The other query in the function down below explicitly queries for models.Instance.host.
nova/db/main/api.py
Outdated
if hv_type:
    query = query.filter(models.ComputeNode.deleted == 0,
                         models.ComputeNode.hypervisor_type == hv_type)
Q: Why are these down here (I was confused, because I thought the hv_type was fully handled above)? Do all joins need to happen first before we can start filtering? Have you thought about adding these filters to the join condition?
Returns None if the request is not for an instance that's part of
a K8S cluster, or if this is the first instance of a new cluster.
... or if it's a HANA flavor or if it's a resize
return build_request.instance.metadata if build_request \
    else instance.metadata

check_type = spec_obj.get_scheduler_hint('_nova_check_type')
Is this safe, i.e. do we have a build request every time we do not set _nova_check_type? I see resize, live-migrate and rebuild setting it, but e.g. not unshelve.
Would it make sense to try and get the BuildRequest if we don't have a check_type, but fall back to getting the instance instead?
k8s_hosts = {}

for cell_result in results.values():
    if not nova_context.is_cell_failure_sentinel(cell_result):
        cell_hosts = dict(cell_result)
        k8s_hosts = {
            h: k8s_hosts.get(h, 0) + cell_hosts.get(h, 0)
            for h in set(cell_hosts) | set(k8s_hosts)
        }
Suggestion:
With a defaultdict, this could look like:

k8s_hosts = defaultdict(lambda: 0)
for cell_result in results.values():
    if nova_context.is_cell_failure_sentinel(cell_result):
        continue
    cell_hosts = dict(cell_result)
    for h, c in cell_hosts.items():
        k8s_hosts[h] += c

I think this would be easier to read.
Additionally: Is there a way for a host to turn up in multiple cells? If not, shouldn't we be able to use k8s_hosts.update(cell_hosts) after checking that it's not a failure?
Additionally: Should we really schedule if we have a failure? The majority of the cluster might reside in the failed shard.
matches = any(host_shard == k8s_shard
              for host_shard in host_shard_names)
if not matches:
Couldn't we use if k8s_shard in host_shard_names here? Would be much shorter.
We'd need to explicitly return False and return True then, but I don't see that as a downside.
As an alternative, we could do matches = k8s_shard in host_shard_names, too.
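The equivalence is easy to check (the shard names below are made up for the example):

```python
host_shard_names = {"vc-a-0", "vc-b-0"}
k8s_shard = "vc-a-0"

# The explicit any(...) spelling ...
matches = any(host_shard == k8s_shard for host_shard in host_shard_names)
# ... is equivalent to a plain membership test:
assert matches == (k8s_shard in host_shard_names)
assert matches is True
assert "vc-z-0" not in host_shard_names
```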
LOG.debug("%(host_state)s is not part of the requested "
          "K8S cluster shard '%(k8s_shard)s'",
typo: since the user doesn't request a shard but a k8s cluster, imho this needs to be "requested K8S cluster's shard"
I'm also wondering if we shouldn't just ditch "requested" here and make it "is not part of the K8S cluster's shard"
Force-pushed from 112e9f6 to ad10664 (Compare)
fwiesel
left a comment
Maybe I am missing something, but how do we now migrate a cluster out of a shard?
In my memory, we set the sharding on the project level, which ensures that new instances are on the new shard, and then we migrate the vms over.
But I don't think that is possible with the proposed code.
My memory is a bit vague here, though. Am I mistaken about how we wanted to migrate things, or about how it is supposed to work with the code?
filters = {
    'hv_type': 'The hypervisor_type',
    'availability_zone': 'The availability zone'
}
Missing skip_instance_uuid in the documentation
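For illustration, the completed dict could look like this (the description for skip_instance_uuid is a guess based only on the parameter name):

```python
filters = {
    'hv_type': 'The hypervisor_type',
    'availability_zone': 'The availability zone',
    # Hypothetical wording; adjust to the actual semantics:
    'skip_instance_uuid': 'UUID of an instance to exclude from the result',
}
```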
if not instance and not build_request:
    LOG.warning("There were no build_request and no instance "
                "for the uuid %s", spec_obj.instance_uuid)
    return
return None, so that all paths return something explicitly.
Instances that are part of the same K8S cluster will get scheduled to the same shard (vCenter). It identifies the K8S cluster by looking at the tags or metadata set by the K8S cluster orchestrators when creating the instances. Kubernikus and Gardener are supported for now.

It queries the database to determine the dominant shard, by looking at which shard contains the most instances of a given K8S cluster.

BigVMs are "out of the picture" and should not adhere to shards. They are only scheduled on their allocated hosts.

The K8S logic is skipped for offline migrations (and thus for resizes too) since offline migration is a non-usecase for K8S.

Change-Id: I73d04ba295d23db1d4728e9db124fc2a27c2d4bc
In VMWare we need to handle race conditions when spawning k8s instances in parallel while building a new cluster. Similar to how server groups are validated prior to spawning the VM on the compute host, we add a new method on the driver `validate_instance_group_policy` that checks driver-specific grouping policy (in this case, the K8S shard for the instance).

Change-Id: I04151875fae44b72be52127e3b160f7f95abfb9e
Force-pushed from ad10664 to f1ae593 (Compare)
joker-at-work
left a comment
In addition to Fabian's comments, I found these
for aggr in k8s_shard_aggrs)
if not matches:
    msg = ("Host %(host)s rejected K8S instance %(instance_uuid)s "
           "because the K8S cluster is not part to this shard."
What does "this shard" mean in this context? Shouldn't that be "any already used shards"?
Keep this check in mind for the live-migration with target shard, because this check would probably prevent it.
def validate_instance_group_policy(self, context, instance):
    """Validates that the instance meets driver-specific grouping policy

    The driver can raise exception.RescheduledException to reject and
    trigger rescheduling of the instance to a different host.
    """
    pass
instance_group being the name for server groups inside Nova, this function name implies a relation to server groups while it doesn't have any.
I see that you tried to keep it generic, but maybe we need to name it "validate_k8s_shard" after all :/
results = nova_context.scatter_gather_skip_cell0(
    context, ComputeNodeList.get_k8s_hosts_by_instances_tag,
    kks_tag, filters=q_filters)
Why do we have to do that in all cells? Is that for non-AZ cells?
def _validate_driver_instance_group_policy(self, context, instance):
    lock_id = "driver-instance-group-validation-%s" % instance.uuid

    @utils.synchronized(lock_id)
    def _do_validation(context, instance):
        self.driver.validate_instance_group_policy(context, instance)

    _do_validation(context, instance)
A little docstring would be nice, e.g. why we need a lock here.
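A sketch of what such a docstring could say; the rationale wording is a guess, and utils.synchronized is replaced by a plain threading.Lock so the example is self-contained:

```python
import threading

_validation_lock = threading.Lock()  # stand-in for utils.synchronized

def validate_with_lock(driver, context, instance):
    """Validate driver-specific grouping policy under a lock.

    Reschedules and concurrent spawns may validate the same instance at
    the same time; serializing the check keeps the check-then-act on the
    driver's grouping state atomic on this host. (Rationale is a guess;
    the real code locks per instance UUID via utils.synchronized.)
    """
    with _validation_lock:
        driver.validate_instance_group_policy(context, instance)
```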
query.filter(
    models.Instance.uuid != skip_instance_uuid)
Instances created by K8S orchestrators (Kubernikus or Gardener) which are part of the same K8S cluster should be scheduled in the same shard, because Kubernetes is heavily using volumes and we want to avoid slow attachments due to cross-shard migration of volumes.
(check also the commit message)