@nairashu
Contributor

Reason for Change:
feat: Creating NNC with HomeAz info in AKS-Swift Workflows when CNS starts up behind a configuration flag

Issue Fixed:
N/A

Requirements:

Notes:

@nairashu nairashu requested review from a team as code owners November 19, 2024 21:41
@nairashu nairashu requested a review from paulyufan2 November 19, 2024 21:41
@timraymond
Member

@nairashu If this is still WIP, can you mark this as draft?

@nairashu nairashu changed the title from "[WIP]: feat: Creating NNC with HomeAz info in AKS-Swift Workflows when CNS starts up behind a configuration flag" to "feat: Creating NNC with HomeAz info in AKS-Swift Workflows when CNS starts up behind a configuration flag" Dec 17, 2024
Contributor

@ramiro-gamarra ramiro-gamarra left a comment

A few questions/thoughts

EnableSubnetScarcity bool
EnableSwiftV2 bool
InitializeFromCNI bool
EnableHomeAz bool
Contributor

Maybe this should be EnableNNCCreation. It will be clearer when AKS reviews toggles to CNS config.

Contributor Author

We typically name component configuration flags after our service features, so keeping this one tied to HomeAz makes more sense: the flag exists to use the HomeAzMonitor to retrieve the AZ and then create the NNC from it.

Member

I know that "Az" is everywhere, so it's probably impractical to change at this point, but "AZ" is an abbreviation, so this should be EnableHomeAZ by typical Go naming conventions.

Contributor Author

Changed this to EnableHomeAZ

Comment on lines 1312 to 1319
var nnc *v1alpha.NodeNetworkConfig
if nnc, err = directnnccli.Get(ctx, types.NamespacedName{Namespace: "kube-system", Name: nodeName}); err != nil {
	logger.Errorf("[Azure CNS] failed to get existing NNC: %v", err)
}

newNNC := createBaseNNC(node)
if nnc == nil {
	logger.Printf("[Azure CNS] Creating new base NNC")
Contributor

Error conditions are not clear here. If there is a transient network error, the get should be retried, no? Otherwise, is there a way to detect that we reached the apiserver and it returned "not found"? That should be the only time when we attempt to create the NNC.

Contributor Author

@nairashu nairashu Dec 23, 2024

Adding a retrier in here for transient errors.

func (service *HTTPRestService) GetHomeAz(ctx context.Context) (homeAzResponse cns.GetHomeAzResponse) {
	service.RLock()
	homeAzResponse = service.homeAzMonitor.GetHomeAz(ctx)
	service.RUnlock()
Member

This should be defer-ed to insulate it from panics

Contributor Author

Added the defer but also noticed this file is missing defer in multiple other places

Member

@timraymond timraymond Dec 23, 2024

"Leave the campsite cleaner than you found it" applies -- but also I only care about this instance for the sake of this PR.

Contributor Author

Totally on the same page with that philosophy so did add in the defer above.
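
As a rough sketch of why the deferred unlock matters in this thread: if the guarded call panics and something upstream recovers, an explicit `RUnlock()` after the call is skipped and the read lock stays held, while a deferred one still runs. The `service` type, `getHomeAZ` and `safeGet` below are hypothetical stand-ins, not the CNS types.

```go
package main

import (
	"fmt"
	"sync"
)

// service is a hypothetical stand-in for HTTPRestService: a lock guarding a lookup.
type service struct {
	sync.RWMutex
	lookup func() string
}

// getHomeAZ releases the read lock via defer, so the lock is dropped even
// when lookup panics.
func (s *service) getHomeAZ() string {
	s.RLock()
	defer s.RUnlock()
	return s.lookup()
}

// safeGet converts a panic into an error, the way a recovering caller
// (for example an HTTP handler wrapper) would.
func safeGet(s *service) (az string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return s.getHomeAZ(), nil
}

func main() {
	s := &service{lookup: func() string { panic("monitor unavailable") }}
	_, err := safeGet(s)
	fmt.Println(err)

	// Because RUnlock was deferred, a writer can still take the lock
	// after the panic; without the defer this Lock would block forever.
	s.Lock()
	s.lookup = func() string { return "1" }
	s.Unlock()
	fmt.Println(s.getHomeAZ())
}
```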

EnableSubnetScarcity bool
EnableSwiftV2 bool
InitializeFromCNI bool
EnableHomeAZ bool
Contributor

I think the previous comment got lost but I think we need further discussion on the name of this field. I don't think EnableHomeAZ is the correct name because:

  • Home AZ is already enabled when mode is direct, which makes this flag confusing.
  • The major feature is the NNC creation, which should be independent of Home AZ being available or not.

I still think EnableNNCCreation is a better name. Would like to get feedback from @thatmattlong and @rbtr on this.

Collaborator

agree, flag should tightly describe what function it does, not the abstract name of the scenario.
CreateNNC would be even better

directscopedcli := nncctrl.NewScopedClient(directnnccli, types.NamespacedName{Namespace: "kube-system", Name: nodeName})

// Create the base NNC CRD if HomeAz is enabled
if cnsconfig.EnableHomeAZ {
Contributor

better to extract the scope opened by this flag to a different function. it will help clean up some code paths.

Collaborator

i would like to see NNC creation decoupled completely from HomeAz

  - apiGroups: ["acn.azure.com"]
    resources: ["nodenetworkconfigs"]
-   verbs: ["get", "list", "watch", "patch", "update"]
+   verbs: ["create", "delete", "get", "list", "watch", "patch", "update"]
Collaborator

i don't think we'll need delete

Contributor Author

The design is to move the node reconcile logic over to CNS.
It makes more sense to create and clean up the NNC when the node is created and deleted. For that reason I also gave CNS the ability to delete the NNC, since eventually we can clean up nodes faster this way. DNC, DNC-RC and DNCCleanup can keep working asynchronously, as they do today, to clean up the NC and other associations.

Why do we want to prevent CNS from doing this?

Collaborator

In what scenario could CNS know it should delete an NNC?
I think it's impossible. The NNC should only be deleted once the Node no longer exists. If the Node doesn't exist, CNS cannot be running and can't delete the NNC.
Even DNC-RC does not delete NNCs - when the NNC is created it gets its OwnerRef set to the Node object. Then when the Node object is deleted from Kubernetes, the NNC is automatically garbage collected. These refs use object UUIDs; this is the only safe way to delete an NNC.
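
To make the owner-reference point concrete, here is a minimal sketch using local stand-in types rather than the real Kubernetes ones. `ownerReference` carries the same core fields as `metav1.OwnerReference` (APIVersion, Kind, Name, UID); `node`, `nnc`, `newBaseNNC` and `ownedBy` are hypothetical and trimmed down for illustration.

```go
package main

import "fmt"

// ownerReference is a local stand-in with the same core fields as
// metav1.OwnerReference in k8s.io/apimachinery.
type ownerReference struct {
	APIVersion string
	Kind       string
	Name       string
	UID        string
}

// node and nnc are hypothetical, trimmed-down objects for illustration.
type node struct {
	Name string
	UID  string
}

type nnc struct {
	Name            string
	Namespace       string
	OwnerReferences []ownerReference
}

// newBaseNNC builds an NNC owned by the given Node. The owner is identified
// by UID, so a Node deleted and recreated under the same name is a
// different owner.
func newBaseNNC(n node) nnc {
	return nnc{
		Name:      n.Name,
		Namespace: "kube-system",
		OwnerReferences: []ownerReference{{
			APIVersion: "v1",
			Kind:       "Node",
			Name:       n.Name,
			UID:        n.UID,
		}},
	}
}

// ownedBy reports whether the NNC's owner references point at this exact
// Node instance, matching on UID rather than name.
func ownedBy(c nnc, n node) bool {
	for _, ref := range c.OwnerReferences {
		if ref.Kind == "Node" && ref.UID == n.UID {
			return true
		}
	}
	return false
}

func main() {
	original := node{Name: "node-0", UID: "uid-a"}
	recreated := node{Name: "node-0", UID: "uid-b"}
	c := newBaseNNC(original)
	fmt.Println(ownedBy(c, original), ownedBy(c, recreated))
}
```

Matching on UID is what makes garbage collection safe here: a name-based delete could remove the NNC of a new Node that happens to reuse the old name.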

Comment on lines +637 to +647
// GetHomeAz - Get the Home Az for the Node where CNS is running.
func (service *HTTPRestService) GetHomeAz(ctx context.Context) (cns.GetHomeAzResponse, error) {
	service.RLock()
	defer service.RUnlock()
	homeAzResponse := service.homeAzMonitor.GetHomeAz(ctx)
	if homeAzResponse.Response.ReturnCode == types.NotFound {
		return homeAzResponse, errors.New(homeAzResponse.Response.Message)
	}

	return homeAzResponse, nil
}
Collaborator

what value is this adding over calling service.homeAzMonitor.GetHomeAz directly?

@ramiro-gamarra ramiro-gamarra self-requested a review January 3, 2025 23:29
@github-actions

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Jan 22, 2025
@github-actions

Pull request closed due to inactivity.

@github-actions github-actions bot closed this Jan 29, 2025
@github-actions github-actions bot deleted the asn/HomeAz branch January 29, 2025 00:01
5 participants