# Worker pods for KMM

Authors: @yevgeny-shnaidman, @ybettan

## Introduction

This enhancement redefines the areas of responsibility of the Module-NMC and NMC controllers.
This makes the code more clear-cut and eliminates the various race conditions that we are seeing (or will see) in the current situation.

### Current situation

Currently, both the Module-NMC and NMC controllers make decisions about kernel module deployment based on the node's status:
- The Module-NMC controller checks the schedulability of the node to decide whether the kernel module should be deployed to or removed
  from the node (adding/updating the NMC spec or removing it).
- The NMC controller checks the node's schedulability to decide whether to start creating a loading/unloading pod on the node. In addition, it
  checks whether the node has been rebooted recently, so that it can create a loading pod even if the NMC's spec and status are equal.

This creates a situation where two entities decide, based on the node's status, whether kernel modules should be loaded or not.

## Goals

1. Create a clear-cut distinction between the responsibilities of the two controllers
2. Eliminate the race conditions that result from the current situation

## Non-Goals

Do not change any other functionality of the two controllers besides the decision making described above.

## Design

### Module-NMC controller decision-making flow

The flow covers both a Module with the Version field defined (ordered upgrade) and one without it (unordered upgrade).
The Module-NMC controller does not take the current state of the node (Ready/NotReady/Schedulable/etc.) into account. It only determines whether the kernel module should
be loaded on the node, based on whether there is a KernelMapping for the node's current kernel and on the node's labels. All other decisions
are made by the NMC reconciler, which has a much better view of the node's current state and of the kernel module's current state.

1. Find all the nodes targeted by the Module, based on the Module's node selector field, regardless of the nodes' status
2. If there is no suitable KernelMapping for the node's kernel - do nothing
3. If there is a suitable KernelMapping and the Version field is missing in the Module (not an ordered upgrade) - update the NMC spec
4. If there is a suitable KernelMapping, the Version field is present in the Module, the module loader version label is on the node and
   its value equals the Version - update the NMC spec
5. If there is a suitable KernelMapping, the Version field is present in the Module, the module loader version label is on the node and
   its value does not equal the Module's Version (meaning it is an old version) - do nothing
6. If there is a suitable KernelMapping, the Version field is present in the Module and the module loader version label is missing on the node
   (meaning the kernel module should not be running on the node) - delete the NMC spec

In this implementation, the Module-NMC controller deletes the NMC spec only in the following two cases:
1. During an ordered upgrade (see point 6 above)
2. When the Module is deleted, and so the kernel module should be unloaded
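
Step 1 selects nodes purely by their labels. The Go sketch below illustrates this under simplifying assumptions. The `Node` type, the `matchesSelector` and `targetedNodes` helpers, and the equality-only, map-style selector are hypothetical and are not KMM's actual types. The `Ready` field is included only to show that it is never consulted.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, trimmed-down node object. Ready is included only to
// show that step 1 never looks at it.
type Node struct {
	Name   string
	Labels map[string]string
	Ready  bool
}

// matchesSelector reports whether every key/value pair of the selector is
// present in labels (equality-based matching only; an assumption).
func matchesSelector(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// targetedNodes returns the sorted names of all nodes matched by the selector.
// Node readiness and schedulability are deliberately ignored: acting on them
// is the NMC controller's job.
func targetedNodes(nodes []Node, selector map[string]string) []string {
	var names []string
	for _, n := range nodes {
		if matchesSelector(n.Labels, selector) {
			names = append(names, n.Name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	nodes := []Node{
		{Name: "worker-1", Labels: map[string]string{"role": "gpu"}, Ready: true},
		{Name: "worker-2", Labels: map[string]string{"role": "gpu"}, Ready: false},
		{Name: "worker-3", Labels: map[string]string{"role": "cpu"}, Ready: true},
	}
	// worker-2 is NotReady but is still targeted.
	fmt.Println(targetedNodes(nodes, map[string]string{"role": "gpu"})) // [worker-1 worker-2]
}
```

In this design `worker-2` keeps its NMC spec while it is NotReady. It is the NMC controller that holds off creating worker pods until the node is Ready/Schedulable.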

```mermaid
flowchart TD
Module[KMM Module]-->|Reconcile| MNC[Module-NMC controller]
MNC-->|get nodes based on node selector| J1((.))
J1-->|no KernelMapping for node's kernel| Done[Done]
J1-->|found KernelMapping for node's kernel| J2((.))
 J2-->|Version missing in Module| US[Update NMC Spec]
 J2-->|Version present in Module| J3((.))
 J3-->|module loader version label equals Version| US
 J3-->|module loader version label does not equal Version| Done
 J3-->|module loader version label missing| DS[Delete NMC Spec]
```
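
The per-node branches of the flowchart (steps 2-6) can be condensed into one pure function. This Go sketch is illustrative only. `moduleNMCAction`, the `Action` values and the `versionLabelKey` parameter are hypothetical names, and the real module loader version label key is not specified here. Node readiness is deliberately not an input.

```go
package main

import "fmt"

// Action is what the Module-NMC controller does to one node's NMC spec.
type Action int

const (
	DoNothing Action = iota
	UpdateSpec
	DeleteSpec
)

func (a Action) String() string {
	switch a {
	case UpdateSpec:
		return "update NMC spec"
	case DeleteSpec:
		return "delete NMC spec"
	default:
		return "do nothing"
	}
}

// moduleNMCAction maps steps 2-6 to an action for a single targeted node.
// hasKernelMapping: a KernelMapping exists for the node's kernel.
// moduleVersion: the Module's Version field ("" when not set).
// versionLabelKey: hypothetical key of the module loader version label.
func moduleNMCAction(hasKernelMapping bool, moduleVersion string, nodeLabels map[string]string, versionLabelKey string) Action {
	if !hasKernelMapping {
		return DoNothing // step 2
	}
	if moduleVersion == "" {
		return UpdateSpec // step 3: unordered upgrade
	}
	labelVersion, ok := nodeLabels[versionLabelKey]
	switch {
	case !ok:
		return DeleteSpec // step 6: label removed during ordered upgrade
	case labelVersion == moduleVersion:
		return UpdateSpec // step 4
	default:
		return DoNothing // step 5: node still labeled with an old version
	}
}

func main() {
	const key = "example.io/module-version" // hypothetical label key
	labels := map[string]string{key: "v2"}
	fmt.Println(moduleNMCAction(true, "v2", labels, key))              // update NMC spec
	fmt.Println(moduleNMCAction(true, "v3", labels, key))              // do nothing
	fmt.Println(moduleNMCAction(true, "v3", map[string]string{}, key)) // delete NMC spec
}
```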

### NMC controller decision-making flow

The NMC controller takes into account the NMC spec, the NMC status, the node's status and the node's Ready timestamp to decide whether to run worker pods,
and whether to run a load or an unload worker pod.

1. If the node is not Ready/Schedulable - do nothing
2. If the NMC's status is missing and the node's kernel version equals the NMC spec's kernel version - run a worker load pod
3. If the NMC's spec is missing, the NMC's status is present and the NMC status's kernel version equals the node's kernel version - run a worker unload pod
4. If the NMC's spec and status are both present and the spec differs from the status:
   - if the status's kernel version equals the node's kernel version - run a worker unload pod
   - otherwise, if the spec's kernel version equals the node's kernel version - run a worker load pod
5. If the NMC's spec and status are both present and equal, the status timestamp is older than the node's Ready timestamp, and the spec's kernel version equals the node's kernel version - run a worker load pod

```mermaid
flowchart TD
NMC[NodeModuleConfig]-->|Reconcile| NMCC[NMC controller]
NMCC-->|get all completed Pods and update NMC statuses| J1((.))
J1-->|get NMC's node| J2((.))
J2-->|node is not Ready/Schedulable| Done[Done]
J2-->|node is Ready/Schedulable| J3((.))
J3-->|status missing| J4((.))
 J4-->|node's kernel equals spec's kernel| WLP[Create Worker Load Pod]-->Done
 J4-->|node's kernel differs from spec's kernel| Done
J3-->|spec missing| J5((.))
 J5-->|node's kernel equals status's kernel| WUP[Create Worker Unload Pod]-->Done
 J5-->|node's kernel differs from status's kernel| Done
J3-->|spec and status differ| J6((.))
 J6-->|status's kernel equals node's kernel| WUP
 J6-->|status's kernel differs from node's kernel| J7((.))
 J7-->|spec's kernel equals node's kernel| WLP
 J7-->|spec's kernel differs from node's kernel| Done
J3-->|spec and status equal| J8((.))
 J8-->|node's Ready timestamp older than status's timestamp| Done
 J8-->|node's Ready timestamp newer than status's timestamp| J9((.))
 J9-->|spec's kernel equals node's kernel| WLP
 J9-->|spec's kernel differs from node's kernel| Done
```

## Addressing the goals

* **Clear-cut distinction between the responsibilities of the two controllers**
  The Module-NMC controller now specifies what it wants to run on the node, and the NMC controller takes care of when and how to run it.

* **Eliminating race conditions**
  The race conditions were caused by both controllers looking at the same data (the node's status and kernel) and making decisions based on it.
  Now the Module-NMC controller does not look at the node's status at all; only the NMC controller uses the node's status and current kernel to decide when to run worker pods.