Add tag pool for links + fix potential consistency and deadlock issues #281
Conversation
@Ktmi, here's the feedback you asked for about lock usage and the patterns we'll look to maintain: certainly we need to improve thread safety in certain resource deletions, and I'm glad you identified these cases. But let's try to make it as dead simple as possible by following the guidelines below; see if you have other suggestions to simplify even further while still solving the thread safety problem:
Currently you have:
By following guidelines G1 and G2, I think in theory we could simplify to (not too far away from what we currently have):
That'd be much simpler to reason about and maintain, while in theory still not decreasing or much impacting IO ops throughput. Places where an interface lock is being used would use its respective switch.lock instead. Also, a resource lower in the list wouldn't need to try to get a higher lock; for instance, when making an interface tag available it wouldn't need either a
Can you check if this solves the problem? See if you can come up with a simpler solution or a counter-example, and let's keep refining.
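A minimal sketch of the guideline being proposed (class and function names here are illustrative, not kytos core APIs): an interface operation reuses its switch's lock, so no separate interface lock and no higher global lock is needed.

```python
import threading

class Switch:
    """Illustrative stand-in for a kytos switch: one lock covers the switch
    and all of its interfaces (the common-denominator lock)."""
    def __init__(self, dpid):
        self.dpid = dpid
        self.lock = threading.Lock()
        self.interfaces = {}

class Interface:
    def __init__(self, switch, port):
        self.switch = switch
        self.port = port
        self.available_tags = set(range(1, 4096))
        switch.interfaces[port] = self

def make_tag_available(interface, tag):
    # The interface reuses its switch's lock: no separate interface lock,
    # and no higher lock is needed, so lock ordering can't invert.
    with interface.switch.lock:
        if tag in interface.available_tags:
            return False  # already available
        interface.available_tags.add(tag)
        return True
```

With this shape, any thread touching a switch or any of its interfaces contends on exactly one lock, which keeps the acquisition order trivially consistent.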
I would have liked to do G1, but the problem is that iterating a

As for G2, I like that idea. A lot of the time when I'm acquiring multiple interface locks I'm also acquiring the switch.lock as well. Though it could be possible to specifically target the interface subdocument, in which case you could have multiple simultaneous transactions targeting the same switch document.

For G3, we could cut down on the usage of global locks around db operations if we cut down on the usage of
Yes @Ktmi, +1 to that idea. Let's start leveraging 1) a sorted dict (ordered tree implementation), and 2) also encapsulate some access for iteration, so we can take the opportunity to mitigate that issue you and I were discussing in some quick fixes some time ago (kytos-ng/kytos#534). If we still end up with a mutable sorted dict, then let's also encapsulate the shallow copy in a method for iteration, so that it doesn't get repeated at every call site. I'll leave it up to you to refine and propose the libs. +1 on the idea; I think we should finally pursue and prioritize it. It's a small-ish effort that can really make things more ergonomic for us, code-maintenance-wise too.
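A stdlib-only sketch of that encapsulation idea (a real implementation would likely pick a library such as sortedcontainers; all names here are illustrative): keys are kept ordered, and the shallow copy needed for safe iteration is taken in exactly one place.

```python
import threading
from bisect import insort

class SortedDictView:
    """Sketch: keys kept sorted, and the shallow copy needed for safe
    iteration is taken only in snapshot(), never ad hoc at call sites."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keys = []    # kept sorted via insort
        self._items = {}

    def set(self, key, value):
        with self._lock:
            if key not in self._items:
                insort(self._keys, key)
            self._items[key] = value

    def snapshot(self):
        # Safe to iterate even while other threads call set() concurrently,
        # because callers only ever see this copied list.
        with self._lock:
            return [(k, self._items[k]) for k in self._keys]
```

Call sites then iterate `for key, value in d.snapshot():` and the copy-for-iteration pattern is no longer repeated everywhere.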
On the Mongo side, it provides atomicity at the document level; let's let it handle that, and in our runtime objects we'll follow suit.
Yes, but DB-IO-wise we don't gain much from it, so let's steer away from it for now, since ultimately MongoDB will be r/w bound on that document. If one day we need more throughput or that starts to become a bottleneck (which it isn't at the moment), then we'll reassess our data models too.
Yes, let's only go for G3 in cases where it's not doing DB IO, or where it's expected to be very minimal, for instance keeping track of liveness stuff in the runtime objs (and even in that case it could be G1 - per sorted object - except with no DB r/w). For operations involving typical moderate+ DB r/w ops, let's keep following G2, and then in this case acquire the

Appreciated your thoughts and feedback. Let's stick with G1, G2 and G3 in that order then.
I'm thinking some of the E2E tests failed because I forgot to pull the latest version; I'll be re-running them.
So, key changes need to be made for the E2E tests.
Tested against the race condition E2E test, and it is passing:
On the matter of removing global locks, I don't think it's possible. With only ordered locks, some relations become literally impossible to traverse safely. To traverse a relation safely, one end of the relation needs to be locked. To hold multiple locks safely, without another lock to prevent holding more than one, you must acquire them in a predetermined order. That means that with only ordered locks, the only relations we can traverse safely are from higher in the lock hierarchy to lower in the lock hierarchy, for example Switch to Interface.

With global locks, we add an additional set of locks higher in the lock hierarchy that have no prerequisite to being acquired other than being in order, and that can lock certain sets of relationships. This then makes it possible to traverse through sets of objects in any direction, before acquiring any of their locks.

As for maybe simplifying the code for acquiring object locks, it might be possible, but it would be a lot of work to get done. I compiled a list of most of the multi-lock scenarios in topology, and got the following:
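The ordering argument above can be sketched in a few lines (illustrative, not a kytos API): assign each lock kind a rank and always acquire overlapping lock sets in (rank, id) order, so no two threads can wait on each other in a cycle.

```python
import threading
from contextlib import ExitStack

# Hypothetical ranks: lower rank = higher in the hierarchy, acquired first.
RANKS = {"global": 0, "switch": 1, "interface": 2, "link": 3}

def acquire_ordered(locks_with_kind, stack: ExitStack) -> None:
    """Acquire (kind, obj_id, lock) triples in a fixed global order.

    Any two threads needing overlapping sets of locks take them in the same
    (rank, obj_id) sequence, so a wait-for cycle (deadlock) can't form.
    """
    for _kind, _obj_id, lock in sorted(
        locks_with_kind, key=lambda t: (RANKS[t[0]], t[1])
    ):
        stack.enter_context(lock)  # released in reverse order on stack close
```

The limitation described above follows directly: a thread already holding a rank-3 (link) lock can never take a rank-1 (switch) lock without releasing first, which is why downward-only traversal is the safe direction.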
I was originally thinking that we could add some specific functions to create context managers for these scenarios. For example:

```python
with interface_manager.get_interface_and_connected_link_and_interface(interface_id) as (endpoint_a, link, endpoint_b):
    # do work here
```

However, the scenarios where objects are being created or deleted are harder, as that depends on having not just the object, but the collection it's in, as well as modifying the relations between objects. I just don't see a good way to encapsulate that in a single context manager outside of maybe a builder pattern, but that still has its own issues. There just isn't a good way to incorporate updating the database for the related objects that were modified in addition to the target object. It would require more integration into topology, when I would want to make such an API more general than that.
@Ktmi, here's short answer feedback (I'll reply with a longer version in the next reply too):
Yes, G1.
Let's follow G2 as much as possible, but let's find a balance in kytos core by leveraging a common resource denominator that unifies the lock: a switch. A switch and its interfaces should use the same switch lock. Kind of a balance of not using a global lock, but still mostly aligned with G2 - only

See if this suffices (and then let's also measure in a development prototype phase, trying to scale the operations) and/or if you have any other ideas we should consider that aren't too far away from what we currently have.
@Ktmi, long answer feedback: hierarchical locks with more than two different types/kinds of resources (switch/link/interface) will only potentially be accepted in kytos core if such an approach is strongly justified to a) not decrease related DB IO ops while ensuring concurrency safety and b) still be easy to reason about in terms of code maintainability. b) will usually be the deal breaker, depending on the resource relationships too; certain hierarchical locks with different types of resources are very hard to maintain, as you pointed out, even to the point where the list above you compiled almost needs a wiki to remember, so let's steer away from that.

@Ktmi, suggestion on kytos core: 1a) Let's stick with a lock with a common denominator: a switch
1b) Most of current object runtime operations aren't a bottleneck and tend to be fast compared to DB IO operations.
1c) This approach is still expected to work with your PR; kytos-ng/kytos-end-to-end-tests#399 should still be passing. Switch/interface/link deletion/disabling shouldn't result in race conditions or deadlocks.

See if point 1 above suffices (and then let's also measure in a development prototype phase, trying to scale the operations) and/or if you have any other ideas that aren't too far away from what we currently have that we should consider. We can also expand the discussion for non-core resources on NApps (
Force-pushed 6d5d04c to ce7e867
Alright, I got a branch which removes
Also, here's a potential utility for acquiring locks:

```python
from contextlib import ExitStack
from threading import Lock

from kytos.core.controller import Controller


class SwitchGetter:
    """Acquire a single switch and its lock as a context manager."""

    def __init__(self, controller: Controller, switch_id: str):
        self.controller = controller
        self.target = switch_id
        self.switch = None

    def acquire(self):
        switch = self.controller.switches[self.target]
        switch.lock.acquire()
        self.switch = switch

    def release(self):
        self.switch.lock.release()
        self.switch = None

    def __enter__(self):
        self.acquire()
        return self.switch

    def __exit__(self, exc_type, exc_val, tb):
        self.release()


class StrongSwitchGetter(SwitchGetter):
    """Same as SwitchGetter, but guards the lookup with the global lock."""

    def acquire(self):
        with self.controller.switches_lock:
            return super().acquire()


class InterfaceGetter:
    """Acquire an interface via its parent switch's lock."""

    def __init__(self, controller: Controller, interface_id: str):
        self.controller = controller
        switch_id, _, port = interface_id.rpartition(":")
        self.target = switch_id
        self.port = int(port)
        self.switch = None
        self.interface = None

    def acquire(self):
        switch = self.controller.switches[self.target]
        switch.lock.acquire()
        interface = switch.interfaces[self.port]
        self.switch = switch
        self.interface = interface

    def release(self):
        self.switch.lock.release()
        self.switch = None
        self.interface = None

    def __enter__(self):
        self.acquire()
        return self.interface

    def __exit__(self, exc_type, exc_val, tb):
        self.release()


class StrongInterfaceGetter(InterfaceGetter):
    def acquire(self):
        with self.controller.switches_lock:
            return super().acquire()


class LinkGetter:
    """Acquire a single link and its lock as a context manager."""

    def __init__(self, controller: Controller, link_id: str):
        self.controller = controller
        self.target = link_id
        self.link = None

    def acquire(self):
        link = self.controller.links[self.target]
        link.lock.acquire()
        self.link = link

    def release(self):
        self.link.lock.release()
        self.link = None

    def __enter__(self):
        self.acquire()
        return self.link

    def __exit__(self, exc_type, exc_val, tb):
        self.release()


class StrongLinkGetter(LinkGetter):
    def acquire(self):
        with self.controller.switches_lock:
            return super().acquire()


class TopologyEditorGetter:
    """Hold the global lock for the lifetime of a TopologyEditor."""

    def __init__(self, controller: Controller):
        self.controller = controller
        self.editor = None

    def acquire(self):
        self.controller.switches_lock.acquire()
        self.editor = TopologyEditor(self.controller)

    def release(self):
        self.editor.release()
        self.controller.switches_lock.release()


class TopologyEditor:
    """Acquire object locks on demand; release them all at once."""

    def __init__(self, controller: Controller):
        self.controller = controller
        self.lock_stack = ExitStack()
        self.switches = {}
        self.interfaces = {}
        self.links = {}

    def acquire_lock(self, lock: Lock):
        self.lock_stack.enter_context(lock)

    def release(self):
        self.lock_stack.close()

    def get_switch(self, switch_id: str):
        if switch_id in self.switches:
            return self.switches[switch_id]
        switch = self.controller.switches[switch_id]
        self.acquire_lock(switch.lock)
        self.switches[switch_id] = switch
        return switch

    def get_interface(self, interface_id: str):
        if interface_id in self.interfaces:
            return self.interfaces[interface_id]
        switch_id, _, port = interface_id.rpartition(":")
        switch = self.get_switch(switch_id)
        interface = switch.interfaces[int(port)]
        self.interfaces[interface_id] = interface
        return interface

    def get_link(self, link_id: str):
        if link_id in self.links:
            return self.links[link_id]
        link = self.controller.links[link_id]
        self.acquire_lock(link.lock)
        self.links[link_id] = link
        return link
```

I think this is starting to move in the right direction. Most of the cases where we need to work on a single object are covered, with the ability to guarantee the lifetime of an object if necessary. The editor, I think, needs more tools such as creation and deletion, but for now it serves as a good proof of concept.
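To sanity-check the editor idea, here's a condensed, runnable variant driven by stubs (SimpleNamespace stands in for kytos runtime objects; all names are illustrative). It shows why the per-editor memoization matters: threading.Lock is non-reentrant, so re-acquiring a cached switch's lock would deadlock the calling thread.

```python
import threading
from contextlib import ExitStack
from types import SimpleNamespace

# Minimal stubs standing in for kytos runtime objects.
switch = SimpleNamespace(lock=threading.Lock(), interfaces={1: "eth1"})
controller = SimpleNamespace(
    switches={"00:01": switch},
    switches_lock=threading.Lock(),
)

class TopologyEditor:
    """Condensed editor: each lock acquired at most once, all released
    together via one ExitStack."""

    def __init__(self, controller):
        self.controller = controller
        self.lock_stack = ExitStack()
        self.switches = {}

    def get_switch(self, switch_id):
        # Memoized: re-acquiring a non-reentrant Lock here would deadlock.
        if switch_id not in self.switches:
            sw = self.controller.switches[switch_id]
            self.lock_stack.enter_context(sw.lock)
            self.switches[switch_id] = sw
        return self.switches[switch_id]

    def release(self):
        # All held locks released together, in reverse acquisition order.
        self.lock_stack.close()

editor = TopologyEditor(controller)
sw = editor.get_switch("00:01")
same = editor.get_switch("00:01")  # cached: the lock is not re-acquired
editor.release()
```

The same memoize-then-release-in-bulk shape extends naturally to interfaces and links, as in the fuller sketch above.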
viniarck
left a comment
Partial review:
- Scrutinizer is failing
- Changelog needs to be updated
- I'd like to see APM DB charts for major DB stress test OPs
- These scripts https://github.com/kytos-ng/topology/tree/master/scripts/console/2023.2.0 need to be updated too; if tags become unexpectedly inconsistent in prod, this is what fixes them
- We also need a DB migration, otherwise in an upgrade it'll crash out of the gate when loading links:
(MainThread) Loading link: 20070c87457fc5197d74eb6354365560dd7b2b19ac414ad48c8dc2597236a6c4
2026-01-05 19:32:39,456 - INFO [kytos.napps.kytos/topology] [main.py:192:_load_links] (MainThread) Loading link: 3bdc34e8e0ca38d7c24724d07c8282cc2c5f123cfed602f5b2eb3594c9606476
2026-01-05 19:32:39,456 - INFO [kytos.napps.kytos/topology] [main.py:192:_load_links] (MainThread) Loading link: 4d42dc0852278accac7d9df15418f6d921db160b13d674029a87cef1b5f67f30
2026-01-05 19:32:39,457 - INFO [kytos.napps.kytos/topology] [main.py:192:_load_links] (MainThread) Loading link: 78282c4d5b579265f04ebadc4405ca1b49628eb1d684bb45e5d0607fa8b713d0
2026-01-05 19:32:39,457 - INFO [kytos.napps.kytos/topology] [main.py:192:_load_links] (MainThread) Loading link: c0fe8cb39c5d91a28891d8c81403391fda144e77ac812c22043ce3bb373f6cd2
2026-01-05 19:32:39,458 - INFO [kytos.napps.kytos/topology] [main.py:192:_load_links] (MainThread) Loading link: c8b55359990f89a5849813dc348d30e9e1f991bad1dcb7f82112bd35429d9b07
Kytos couldn't start because of KytosNAppSetupException: NApp kytos/topology exception 'default_tag_ranges' Traceback (most recent call last):
File "/home/viniarck/repos/kytos/kytos/core/controller.py", line 893, in load_napp
napp = napp_module.Main(controller=self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/viniarck/repos/kytos/kytos/core/napps/base.py", line 196, in __init__
self.setup()
File "/home/viniarck/repos/napps/napps/kytos/topology/main.py", line 80, in setup
self.load_topology()
File "/home/viniarck/repos/napps/napps/kytos/topology/main.py", line 351, in load_topology
self._load_interface_details(
File "/home/viniarck/repos/napps/napps/kytos/topology/main.py", line 313, in _load_interface_details
self.load_details(
File "/home/viniarck/repos/napps/napps/kytos/topology/main.py", line 1829, in load_details
tag_details["default_tag_ranges"],
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'default_tag_ranges'
Nice @Ktmi, good to have this information and to know about the collected measurements; the comparison also looks good as far as the percentiles go. Sometimes we know there are certain data point outliers, but that's fine as long as they only show up in IO stress cases and temporarily.
Latest E2E results: it has been consistently failing that one of_lldp test. I'll have to do some investigation into what's meant to be happening here to understand why the error is occurring, and how my changes caused it.
@viniarck I can't figure out the E2E test issue. I did some further experimenting with it, and wherever the packet is coming from doesn't appear to be of_lldp, as shutting down the controller before testing whether any packets are being sent across those links still results in packets being detected. Other than that, this is feature complete, except maybe for some web-ui changes for updating tag ranges for both interfaces and links.
viniarck
left a comment
Just did another partial review.
I'll do another later soon enough.
Did you also have the chance to try out all the scripts?
[UNRELEASED] - Under development
********************************

Changed
Changelog should include the new endpoints
```python
if (
    link.id in self._link_tags_updated_at
    and self._link_tags_updated_at[link.id] > event.timestamp
):
```
Early return is missing here, as it is it'd be able to process an old preempted event.
Can you review this again?
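A minimal sketch of the fix being asked for (function and parameter names are illustrative, not the PR's actual handler): the timestamp guard must return early, otherwise the stale preempted event falls through and is applied anyway.

```python
from types import SimpleNamespace

def handle_link_tag_event(link, event, link_tags_updated_at, apply_update):
    """If a newer update for this link was already processed, drop the event."""
    if (
        link.id in link_tags_updated_at
        and link_tags_updated_at[link.id] > event.timestamp
    ):
        return False  # early return: stale preempted event is ignored
    link_tags_updated_at[link.id] = event.timestamp
    apply_update(link, event)
    return True
```

Without the `return`, both branches would reach `apply_update`, which is exactly the reprocessing of an old event the review points out.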
This will also lead to other issues, since it won't write correctly to db.link_details; after a restart, it would leak vlans. Example with a link with one EVC:
kytos $> link_id = '78282c4d5b579265f04ebadc4405ca1b49628eb1d684bb45e5d0607fa8b713d0'
kytos $>
kytos $> controller.links[link_id].available_tags
Out[2]: {'vlan': [[2, 3798], [3800, 4094]]}
After restart:
kytos $> link_id = '78282c4d5b579265f04ebadc4405ca1b49628eb1d684bb45e5d0607fa8b713d0'
kytos $> controller.links[link_id].available_tags
Out[2]: {'vlan': [[1, 3798], [3800, 4094]]}
```python
port_num = int(port_num)
return self.db.switches.find_one_and_update(
    {"_id": switch_id},
    {"$unset": {"interfaces.$[iface]": 1}},
```
Won't this leave a null interface? Did you mean to use $pull instead of $unset?
Can you review this again?
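For illustration of the reviewer's point (shown as plain update documents rather than a live Mongo call; the `port_number` field name is an assumption, not topology's actual schema): `$unset` on an array position nulls the slot, while `$pull` removes matching elements.

```python
# What the diff above does: null the matched array slot (leaves a hole).
unset_update = {"$unset": {"interfaces.$[iface]": 1}}

# A $pull removes elements instead of nulling them. Either pull the nulls
# left behind by the $unset, or pull the interface directly by a field match
# ("port_number" is a hypothetical field name for illustration):
pull_nulls = {"$pull": {"interfaces": None}}
pull_interface = {"$pull": {"interfaces": {"port_number": 7}}}
```

With `$unset` alone, a later read of `interfaces` would see a `null` entry where the deleted interface used to be, which is the inconsistency the review flags.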
```python
class TopoController:
    """TopoController."""

    def __init__(self, get_mongo=lambda: Mongo()) -> None:
```
Trivial comment, but we've been using it this way in other NApps too; do we really need this change and to start deviating from the rest?
viniarck
left a comment
There's a major issue after restart that's leaking vlans.
Make sure to fully explore your change.








Part of closing kytos-ng/kytos#554
Summary
Originally this was meant to be just a forward port of the changes for the tag pool separation from the feat/tag_capable branch. It adds in support for the separate tag pool for links, and implements automatic transfer of tags from interfaces to links upon link creation, as well as returning the tags from links to interfaces upon link deletion.

However, I discovered that there were potential inconsistency issues with topology in regards to the way in which locks were being used (or not used) in some places, in addition to some potential scenarios where deleted objects could be revived. I handled quite a few of these scenarios by adding in locks to kytos core, but I think there are still a few scenarios which haven't been handled.

Here is the hierarchy of locks, or what locks protect what groups of locks:
And here is the order in which locks should be acquired:
To run this, the following branches need to be installed:
Local Tests
Link creation and deletion both appear to be working as expected. Tags are successfully taken and returned to the constituent endpoints.
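The tag transfer described above can be sketched in a few lines (an illustrative model of the behavior, not the PR's implementation; tag pools are shown as plain sets):

```python
def take_tags_for_link(endpoint_a_tags, endpoint_b_tags):
    """On link creation: the link's pool is the set of tags available on
    BOTH endpoint interfaces; those tags are removed from each interface."""
    link_tags = endpoint_a_tags & endpoint_b_tags
    endpoint_a_tags -= link_tags
    endpoint_b_tags -= link_tags
    return link_tags

def return_tags_to_interfaces(link_tags, endpoint_a_tags, endpoint_b_tags):
    """On link deletion: tags flow back to the constituent endpoints."""
    endpoint_a_tags |= link_tags
    endpoint_b_tags |= link_tags
    link_tags.clear()
```

Round-tripping creation and deletion should leave each interface's pool unchanged, which is the consistency property the local tests above exercise.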
End-to-End Tests
Redid tests with the latest E2E version. Tests are now passing with the modified tests kytos-ng/kytos-end-to-end-tests#401.