- [Dual-Stack Support](#dual-stack-support)
- [Startup Options](#startup-options)
- [Startup](#startup)
- - [Reconciliation Loop](#reconciliation-loop)
+ - [Processing Queue](#processing-queue)
- [Event Watching Loops](#event-watching-loops)
- [Node Added](#node-added)
- [Node Updated](#node-updated)
@@ -77,15 +77,15 @@ checklist items _must_ be updated for the enhancement to be released.
Items marked with (R) are required *prior to targeting to a milestone /
release*.

- - [ ] (R) Enhancement issue in release milestone, which links to KEP dir in
+ - [X] (R) Enhancement issue in release milestone, which links to KEP dir in
[kubernetes/enhancements] (not the initial KEP PR)
- - [ ] (R) KEP approvers have approved the KEP status as `implementable`
- - [ ] (R) Design details are appropriately documented
- - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and
+ - [X] (R) KEP approvers have approved the KEP status as `implementable`
+ - [X] (R) Design details are appropriately documented
+ - [X] (R) Test plan is in place, giving consideration to SIG Architecture and
SIG Testing input (including test refactors)
- - [ ] (R) Graduation criteria is in place
- - [ ] (R) Production readiness review completed
- - [ ] (R) Production readiness review approved
+ - [X] (R) Graduation criteria is in place
+ - [X] (R) Production readiness review completed
+ - [X] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for
publication to [kubernetes.io]
@@ -225,6 +225,8 @@ do not assume Kubernetes has a single contiguous Pod CIDR.

### New Resource

+ This KEP proposes adding a new built-in API called `ClusterCIDRConfig`.
+
```go
type ClusterCIDRConfig struct {
  metav1.TypeMeta
@@ -238,7 +240,7 @@ type ClusterCIDRConfigSpec struct {
  // This defines which nodes the config is applicable to. A nil selector can
  // be applied to any node.
  // +optional
- NodeSelector *v1.LabelSelector
+ NodeSelector *v1.NodeSelector

  // This defines the IPv4 CIDR assignable to nodes selected by this config.
  // +optional
@@ -275,9 +277,10 @@ type ClusterCIDRConfigStatus struct {

`32 - IPv4.PerNodeMaskSize == 128 - IPv6.PerNodeMaskSize`

- - Each node will be assigned all Pod CIDRs from a matching config.
- Consider the following example:
-
+ - Each node will be assigned all Pod CIDRs from a matching config. That is to
+ say, you cannot assign only IPv4 addresses from a `ClusterCIDRConfig` which
+ specifies both IPv4 and IPv6. Consider the following example:
+
```go
{
  IPv4: {
@@ -294,12 +297,22 @@ type ClusterCIDRConfigStatus struct {
Pod CIDRs can be partitioned from the IPv4 CIDR. The remaining IPv6 Pod
CIDRs may be used if referenced in another `ClusterCIDRConfig`.

- - In case of multiple matching ranges, attempt to break ties with the
+ - When there are multiple `ClusterCIDRConfig` resources in the cluster, first
+ collect the list of applicable `ClusterCIDRConfig`. A `ClusterCIDRConfig` is
+ applicable if its `NodeSelector` matches the `Node` being allocated, and if
+ it has free CIDRs to allocate.
+
+ A nil `NodeSelector` functions as a default that applies to all nodes. This
+ should be the fall-back and not take precedence if any other range matches.
+ If there are multiple default ranges, ties are broken using the scheme
+ outlined below.
+
+ In the case of multiple matching ranges, attempt to break ties with the
following rules:
1. Pick the `ClusterCIDRConfig` whose `NodeSelector` matches the most
- labels on the `Node`. For example, `{'node.kubernetes.io/instance-type':
- 'medium', 'rack': 'rack1'}` before `{'node.kubernetes.io/instance-type':
- 'medium'}`.
+ labels/fields on the `Node`. For example,
+ `{'node.kubernetes.io/instance-type': 'medium', 'rack': 'rack1'}` before
+ `{'node.kubernetes.io/instance-type': 'medium'}`.
1. Pick the `ClusterCIDRConfig` with the fewest Pod CIDRs allocatable. For
example, `{CIDR: "10.0.0.0/16", PerNodeMaskSize: "16"}` (1 possible Pod
CIDR) is picked before `{CIDR: "192.168.0.0/20", PerNodeMaskSize: "22"}`
@@ -308,11 +321,6 @@ type ClusterCIDRConfigStatus struct {
For example, `27` (32 IPs) picked before `25` (128 IPs).
1. Break ties arbitrarily.
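
For illustration only, here is a minimal Go sketch of the ordering these
tie-break rules describe; the `candidate` struct and its fields are hypothetical
helpers, not part of the proposed API:

```go
package allocator

import "sort"

// candidate summarizes one matching ClusterCIDRConfig for a given Node.
// The field names are illustrative assumptions.
type candidate struct {
    matchedTerms     int // selector labels/fields matched on the Node
    allocatableCIDRs int // per-node Pod CIDRs this config can provide
    perNodeMaskSize  int // larger mask => fewer IPs per node
}

// orderCandidates sorts candidates so index 0 is tried first; remaining ties
// keep their original (arbitrary) order.
func orderCandidates(cs []candidate) {
    sort.SliceStable(cs, func(i, j int) bool {
        if cs[i].matchedTerms != cs[j].matchedTerms {
            return cs[i].matchedTerms > cs[j].matchedTerms // most specific selector wins
        }
        if cs[i].allocatableCIDRs != cs[j].allocatableCIDRs {
            return cs[i].allocatableCIDRs < cs[j].allocatableCIDRs // fewest Pod CIDRs wins
        }
        return cs[i].perNodeMaskSize > cs[j].perNodeMaskSize // fewest IPs per node wins
    })
}
```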
- - An empty `NodeSelector` functions as a default that applies to all nodes.
- This should be the fall-back and not take precedence if any other range
- matches. If there are multiple default ranges, ties are broken using the
- scheme outlined above.
-
- When breaking ties between matching `ClusterCIDRConfig`, if the most
applicable (as defined by the tie-break rules) has no more free allocations,
attempt to allocate from the next highest matching `ClusterCIDRConfig`. For
@@ -329,21 +337,21 @@ type ClusterCIDRConfigStatus struct {
to the tie-break rules.
```go
{
- NodeSelector: { MatchLabels: { "node": "n1", "rack": "rack1" } },
+ NodeSelector: { MatchExpressions: { "node": "n1", "rack": "rack1" } },
  IPv4: {
    CIDR: "10.5.0.0/16",
    PerNodeMaskSize: 26,
  }
},
{
- NodeSelector: { MatchLabels: { "node": "n1" } },
+ NodeSelector: { MatchExpressions: { "node": "n1" } },
  IPv4: {
    CIDR: "192.168.128.0/17",
    PerNodeMaskSize: 28,
  }
},
{
- NodeSelector: { MatchLabels: { "node": "n1" } },
+ NodeSelector: { MatchExpressions: { "node": "n1" } },
  IPv4: {
    CIDR: "192.168.64.0/20",
    PerNodeMaskSize: 28,
@@ -363,7 +371,7 @@ type ClusterCIDRConfigStatus struct {

- On deletion of the `ClusterCIDRConfig`, the controller checks to see if any
Nodes are using `PodCIDRs` from this range -- if so it keeps the finalizer
- in place and periodically polls Nodes. When all Nodes using this
+ in place and waits for the Nodes to be deleted. When all Nodes using this
`ClusterCIDRConfig` are deleted, the finalizer is removed.
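
As a rough illustration of this deletion flow (the types and helpers below are
hypothetical stand-ins, not part of the proposed API or controller):

```go
package allocator

// clusterCIDRConfig is a placeholder for the proposed API object.
type clusterCIDRConfig struct{ name string }

type controller struct {
    nodesUsing      func(cfg *clusterCIDRConfig) int // counts Nodes holding PodCIDRs from cfg
    removeFinalizer func(cfg *clusterCIDRConfig)     // clears the finalizer via the API server
}

// handleDelete keeps the finalizer while any Node still uses the config's
// ranges; once the last such Node is gone, the finalizer is cleared and the
// API server completes the deletion.
func (c *controller) handleDelete(cfg *clusterCIDRConfig) {
    if c.nodesUsing(cfg) > 0 {
        return // re-checked when a Node-deleted event re-queues cfg
    }
    c.removeFinalizer(cfg)
}
```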

#### Example: Allocations
@@ -451,11 +459,11 @@ nodes we expect.

#### Dual-Stack Support

- To assign both IPv4 and IPv6 Pod CIDRs to a Node, the `IPv4` and `IPv6` fields
- must be both set on the object. The controller does not have an in-built notion
- of single-stack or dual-stack clusters. It uses the tie-break rules specified
- [above](#expected-behavior) to pick a `ClusterCIDRConfig` from which to allocate
- Pod CIDRs for each Node.
+ The decision of whether to assign only IPv4, only IPv6, or both depends on the
+ CIDRs configured in a `ClusterCIDRConfig` object. As described
+ [above](#expected-behavior), the controller creates an ordered list of
+ `ClusterCIDRConfig` resources which apply to a given `Node` and allocates from
+ the first matching `ClusterCIDRConfig` with CIDRs available.

The controller makes no guarantees that all Nodes are single-stack or that all
Nodes are dual-stack. This is to specifically allow users to upgrade existing
@@ -497,12 +505,20 @@ from the existing NodeIPAM controller:
necessary.
- The "created-from-flags-\<hash\>" object will always be created as long
as the flags are set. The hash is arbitrarily assigned.
- - If an object with the name "created-from-flags-\<hash>" already exists,
- but it does not match the flag values, the controller will delete it and
- create a new object. The controller will ensure (on startup) that there
- is only one non-deleted `ClusterCIDRConfig` with the name
- "create-from-flags\<hash>". This will allow users to change the flag
- values and stop using the old values.
+ - If a non-deleted object with the name "created-from-flags-*" already
+ exists, but it does not match the flag values, the controller will
+ delete it and create a new object. The controller will ensure (on
+ startup) that there is only one non-deleted `ClusterCIDRConfig` with the
+ name "created-from-flags-\<hash>". The "\<hash>" at the end of the name
+ allows the controller to have multiple "created-from-flags" objects
+ present (e.g. blocked on deletion because of the finalizer), without
+ blocking startup.
+ - If some `Node`s were allocated Pod CIDRs from the old
+ "created-from-flags-\<hash>" object, they will follow the standard
+ lifecycle for deleting a `ClusterCIDRConfig` object. The
+ "created-from-flags-\<hash>" object the `Nodes` are allocated from will
+ remain pending deletion (waiting for its finalizer to be cleared) until
+ all `Nodes` using those ranges are re-created.
- Fetch list of `Node`s. Check each node for `PodCIDRs`
- If `PodCIDR` is set, mark the allocation in the internal data structure
and store this association with the node.
@@ -512,13 +528,12 @@ from the existing NodeIPAM controller:
After processing all nodes, allocate ranges to any nodes without Pod
CIDR(s) [Same logic as Node Added event]
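
To make the startup pass concrete, here is a rough Go sketch (assuming
`k8s.io/api/core/v1` for the `Node` type; `allocatorState`, `markAllocated`, and
the queue are hypothetical stand-ins for the controller's internals, not part of
this proposal):

```go
package allocator

import v1 "k8s.io/api/core/v1"

// allocatorState records which Pod CIDRs are already in use.
type allocatorState interface {
    markAllocated(nodeName string, podCIDRs []string)
}

// workQueue accepts node names that still need an allocation.
type workQueue interface {
    Add(item string)
}

type bootstrapper struct {
    ipamState allocatorState
    queue     workQueue
}

// bootstrap replays existing allocations, then queues unallocated Nodes so they
// take the same path as a Node Added event.
func (b *bootstrapper) bootstrap(nodes []*v1.Node) {
    for _, node := range nodes {
        if len(node.Spec.PodCIDRs) > 0 {
            // Record the existing allocation so these ranges are never
            // handed out to another Node.
            b.ipamState.markAllocated(node.Name, node.Spec.PodCIDRs)
            continue
        }
        b.queue.Add(node.Name)
    }
}
```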

- #### Reconciliation Loop
-
- This go-routine will watch for cleanup operations and failed allocations and
- continue to try them in the background.
+ #### Processing Queue

- For example if a Node can't be allocated a PodCIDR, it will be periodically
- retried until it can be allocated a range or it is deleted.
+ The controller will maintain a queue of events that it is processing. `Node`
+ additions and `ClusterCIDRConfig` additions will be appended to the queue.
+ Similarly, Node allocations that failed due to insufficient CIDRs can be
+ retried by adding them back onto the queue (with exponential backoff).
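
A minimal sketch of such a queue, assuming client-go's rate-limited workqueue is
used (the `sync` callback and string keys are illustrative assumptions, not
decisions made by this KEP):

```go
package allocator

import "k8s.io/client-go/util/workqueue"

// newQueue builds a work queue whose default rate limiter applies per-item
// exponential backoff plus an overall rate limit.
func newQueue() workqueue.RateLimitingInterface {
    return workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
}

// runWorker drains the queue; failed items (e.g. no free CIDRs yet) are
// re-added with backoff so they are retried later.
func runWorker(queue workqueue.RateLimitingInterface, sync func(key string) error) {
    for {
        item, shutdown := queue.Get()
        if shutdown {
            return
        }
        if err := sync(item.(string)); err != nil {
            queue.AddRateLimited(item)
        } else {
            queue.Forget(item)
        }
        queue.Done(item)
    }
}
```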

#### Event Watching Loops
@@ -729,6 +744,12 @@ This section must be completed when targeting alpha to a release.
Pick one of these and delete the rest.
-->

+ - [X] Feature Gate
+   - Feature gate name: ClusterCIDRConfig
+   - Components depending on the feature gate: kube-controller-manager
+   - The feature gate will control whether the new controller can even be
+     used, while the kube-controller-manager flag below will pick the
+     active controller.
- [X] Other
  - Describe the mechanism:
    - The feature is enabled by setting the kube-controller-manager flag
@@ -755,8 +776,8 @@ too only for nodes created after that point).
Yes, users can switch back to the old controller and delete the
`ClusterCIDRConfig` objects. However, if any Nodes were allocated `PodCIDR` by
the new controller, those allocations will persist for the lifetime of the Node.
- Users will have to restart their Nodes to trigger another `PodCIDR` allocation
- (this time performed by the old controller.)
+ Users will have to recreate their Nodes to trigger another `PodCIDR` allocation
+ (this time performed by the old controller).

There should not be any effect on running workloads. The nodes will continue to
use their allocated `PodCIDR` even if the underlying `ClusterCIDRConfig` object
@@ -765,15 +786,15 @@ is forcibly deleted.
###### What happens if we reenable the feature if it was previously rolled back?

The controller is expected to read the existing set of `ClusterCIDRConfig` as
- well as the existing Node `PodCIDR` allocations and allocate new PorCIDRs
+ well as the existing Node `PodCIDR` allocations and allocate new PodCIDRs
appropriately.

###### Are there any tests for feature enablement/disablement?

- Yes, some integraiotn tets will be added to test this case. They will test the
- scenario where some Nodes already have PodCIDRs allocated to them (potentially
- from CIDRs not tracked by any `ClusterCIDRConfig`). THis should be sufficient to
- cover the enablement/disablment scenarios.
+ Not yet; they will be added as part of the graduation to alpha. They will test
+ the scenario where some Nodes already have PodCIDRs allocated to them
+ (potentially from CIDRs not tracked by any `ClusterCIDRConfig`). This should be
+ sufficient to cover the enablement/disablement scenarios.

### Rollout, Upgrade and Rollback Planning