Add flex node into / remove from Private AKS cluster #55
weiliu2dev wants to merge 12 commits into Azure:main from
Conversation
- Add private-join command to join Private AKS cluster via Gateway
- Add private-leave command with --mode=local|full cleanup options
- Add private-install.sh and private-uninstall.sh scripts
- Add pkg/privatecluster package with embedded scripts
- Add documentation for creating and configuring Private AKS cluster
Force-pushed from d4d925c to 69604af
@@ -0,0 +1,796 @@
#!/bin/bash
We don't use shell scripts except for the agent installation/uninstallation step; can you change everything to Go?
Agree with you. Ack: this is a POC to show the process of adding a node into a private cluster; I'll create a tracking issue to convert the shell to Go before GA. The only shell that will remain is the Arc agent installation (official script).
Already replaced with Go.
NC='\033[0m' # No Color

# Configuration
GATEWAY_NAME="wg-gateway"
Please define a new struct for any new configuration needed in pkg/config/structs.go
Yes, I'll address both together. Since this is a POC, I'll convert the shell to Go and define the GatewayConfig struct in a follow-up PR.
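For illustration, a minimal sketch of what such a struct in pkg/config/structs.go could look like (field names, JSON tags, and defaults below are assumptions for this POC, not the final definition):

```go
// Hypothetical sketch only; the actual struct will be defined in pkg/config/structs.go.
// GatewayConfig would capture the values currently hard-coded in the shell script
// (e.g. GATEWAY_NAME="wg-gateway").
type GatewayConfig struct {
	Name          string `json:"name"`          // gateway VM name, e.g. "wg-gateway"
	VMSize        string `json:"vmSize"`        // Azure VM size for the gateway VM
	SubnetPrefix  string `json:"subnetPrefix"`  // CIDR for the dedicated gateway subnet
	WireGuardPort int    `json:"wireguardPort"` // UDP port for the WireGuard tunnel
}
```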
commands.go
Outdated
// NewPrivateJoinCommand creates a new private-join command
func NewPrivateJoinCommand() *cobra.Command {
	cmd := &cobra.Command{
		Use: "private-join",
Please do not introduce new commands. The aks-flex-node agent --config command handles both node bootstrapping and self-healing and is also the command used by the systemd service. For private clusters, continue using this same command and add a new configuration property as needed.
Similarly, for unbootstrapping, continue using the existing aks-flex-node unbootstrap command, with the appropriate configuration for private clusters.
OK, I will remove the commands. Note that without the commands, the inline help will no longer be available.
The commands have already been removed and merged into the agent command.
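For context, the agent --config flow could then be driven by a property on the existing target cluster config instead of a separate command. A hedged sketch of the config file (key nesting and the renamed isPrivateCluster tag are assumptions based on this review, not the merged schema):

```json
{
  "azure": {
    "targetCluster": {
      "resourceId": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<name>",
      "location": "eastus",
      "isPrivateCluster": true
    }
  }
}
```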
fi

# Remove Arc Agent and Azure resource
log_info "Removing Arc Agent..."
This duplicates a lot of scripts/uninstall.sh; just add a new uninstaller in /pkg/bootstrapper/bootstrapper.go to take care of setting up/cleaning up anything related to the private cluster.
gosec found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Force-pushed from ffdc990 to bd3daff
Force-pushed from ad84d14 to 81f6179
Force-pushed from 81f6179 to 204ad26
@microsoft-github-policy-service agree company="Microsoft"

I suggest spending a bit more time reviewing our codebase before starting to contribute. Thanks!
pkg/components/arc/arc_installer.go
Outdated
// Step 1: Clean up local agent state
i.logger.Info("Cleaning up local agent state...")
disconnectCmd := exec.CommandContext(ctx, "azcmagent", "disconnect", "--force-local-only")
Use utils.RunCommandWithOutput() so it automatically adds sudo if needed. Same for every place that uses exec.CommandContext below.
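For example, the azcmagent call above might become something along these lines (the exact signature of utils.RunCommandWithOutput is assumed here; check pkg/utils before adopting):

```go
// Sketch: let the shared helper decide whether sudo is required.
output, err := utils.RunCommandWithOutput(ctx, "azcmagent", "disconnect", "--force-local-only")
if err != nil {
	return fmt.Errorf("azcmagent disconnect failed: %w (output: %s)", err, output)
}
i.logger.Info("Arc agent disconnected (local-only)")
```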
pkg/privatecluster/vpn.go
Outdated
}

// InstallJQ installs jq locally
func InstallJQ(ctx context.Context, logger *Logger) error {
pkg/config/structs.go
Outdated
type TargetClusterConfig struct {
	ResourceID string `json:"resourceId"` // Full resource ID of the target AKS cluster
	Location   string `json:"location"`   // Azure region of the cluster (e.g., "eastus", "westus2")
	Private    bool   `json:"private"`    // Whether this is a private AKS cluster (requires Gateway/VPN setup)
rename to IsPrivateCluster
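i.e. roughly (the JSON tag shown is an assumption; it may keep the existing "private" tag for compatibility):

```go
// IsPrivateCluster reports whether the target is a private AKS cluster
// (requires Gateway/VPN setup).
IsPrivateCluster bool `json:"isPrivateCluster"`
```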
pkg/privatecluster/azure.go
Outdated
// CheckInstalled verifies Azure CLI is installed
func (az *AzureCLI) CheckInstalled() error {
	if !CommandExists("az") {
The installation of az is enforced in install.sh, but the CLI credential isn't the only credential we support; customers can also use SP or MSI, so we shouldn't always require az login.
Removed the unused CheckInstalled/CheckLogin/CheckAndRefreshToken functions. The broader migration from az CLI to Azure SDK (which supports SP/MSI natively) is already in progress with azure_client.go and will be continued in follow-up PRs.
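A hedged sketch of credential selection once the az-login requirement is dropped (the azidentity constructors are the standard SDK ones; how the values are sourced from config is an assumption):

```go
import (
	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

// newCredential picks a token credential without forcing `az login`.
func newCredential(tenantID, clientID, clientSecret string) (azcore.TokenCredential, error) {
	if clientID != "" && clientSecret != "" {
		// Service principal (SP) auth.
		return azidentity.NewClientSecretCredential(tenantID, clientID, clientSecret, nil)
	}
	// Falls back through environment, workload identity, managed identity (MSI),
	// and finally the Azure CLI, so no single method is required.
	return azidentity.NewDefaultAzureCredential(nil)
}
```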
pkg/privatecluster/azure.go
Outdated
// AKSClusterExists checks if an AKS cluster exists
func (az *AzureCLI) AKSClusterExists(ctx context.Context, resourceGroup, clusterName string) bool {
	return RunCommandSilent(ctx, "az", "aks", "show",
Do not use any az command for resource management; use the Track 2 SDK client, since we support multiple auth methods. Read https://github.com/wenxuan0923/AKSFlexNode/blob/main/pkg/auth/auth.go
Please convert all the commands in this file to use the SDK client. FYI, something like this: https://github.com/wenxuan0923/AKSFlexNode/blob/6bdb4d86237cacb426e363cc008f3211441b563b/pkg/components/arc/arc_base.go#L42
Agreed. This is a larger refactor to migrate all az CLI commands to Track 2 SDK clients. The current PR already started in this direction with azure_client.go using the SDK. I'll convert the remaining az CLI calls in a follow-up PR to keep this one focused.
I think for a POC of a node joining a private cluster, the current PR is mainly verifying that it is doable. We can improve it later in follow-up PRs, right?
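For reference, a minimal sketch of what an az CLI call like `az aks show` could become with the Track 2 SDK (module paths are the standard azure-sdk-for-go ones; wiring into this repo's pkg/auth is assumed):

```go
import (
	"context"
	"errors"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice"
)

// aksClusterExists works with any azcore.TokenCredential (CLI, SP, or MSI).
func aksClusterExists(ctx context.Context, cred azcore.TokenCredential, subscriptionID, resourceGroup, clusterName string) (bool, error) {
	client, err := armcontainerservice.NewManagedClustersClient(subscriptionID, cred, nil)
	if err != nil {
		return false, err
	}
	if _, err := client.Get(ctx, resourceGroup, clusterName, nil); err != nil {
		var respErr *azcore.ResponseError
		if errors.As(err, &respErr) && respErr.StatusCode == http.StatusNotFound {
			return false, nil // cluster does not exist
		}
		return false, err
	}
	return true, nil
}
```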
pkg/privatecluster/utils.go
Outdated
}

// IsRoot checks if the current process is running as root
func IsRoot() bool {
No, it should not run as root.
Modifying /etc/hosts is just one of many privileged operations in the private cluster flow; WireGuard configuration, systemctl, etc. all require elevated access. Changing the entire private cluster package from "require root" to "sudo on demand" is a big change that I'll address in a follow-up PR.
}

// RemoveHostsEntries removes entries matching a pattern from /etc/hosts
func RemoveHostsEntries(pattern string) error {
This is for private AKS cluster DNS resolution. The private cluster API server FQDN (e.g., xxx.privatelink.eastus.azmk8s.io) is not resolvable from outside the VNet. During install, we add the API server IP → FQDN mapping to /etc/hosts so kubelet can reach it via VPN. RemoveHostsEntries cleans up these entries during uninstall.
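A sketch of the install-time counterpart (the marker comment format is an assumption; RemoveHostsEntries would strip these lines again on uninstall):

```go
// addHostsEntry maps the private API server FQDN to its IP so kubelet can
// resolve it over the VPN tunnel.
func addHostsEntry(apiServerIP, privateFQDN string) error {
	f, err := os.OpenFile("/etc/hosts", os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "%s %s # added by aks-flex-node private cluster setup\n", apiServerIP, privateFQDN)
	return err
}
```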
pkg/privatecluster/scripts.go
Outdated
)

// ScriptRunner provides backward compatibility (Deprecated: use Installer/Uninstaller directly)
type ScriptRunner struct {
What is this for? We only need the installer and uninstaller in bootstrapper.go:
https://github.com/wenxuan0923/AKSFlexNode/blob/main/pkg/bootstrapper/bootstrapper.go
In the next PR, I'll migrate the az CLI calls in the private cluster package to the Azure Go SDK, and then integrate it into the bootstrapper as a proper Executor step. This will also unify the auth model to support CLI/SP/MSI via azcore.TokenCredential.
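Conceptually, the integration could look like the sketch below; the step/type names are assumptions based on this discussion, not the repo's actual bootstrapper API:

```go
// Hypothetical step shape; adjust to the real Executor interface in pkg/bootstrapper.
type PrivateClusterStep struct {
	cred   azcore.TokenCredential // injected from pkg/auth (CLI, SP, or MSI)
	cfg    *config.Config
	logger *logrus.Logger
}

func (s *PrivateClusterStep) Name() string { return "private-cluster-gateway" }

func (s *PrivateClusterStep) Install(ctx context.Context) error {
	if !s.cfg.Azure.TargetCluster.IsPrivateCluster {
		return nil // nothing to do for public clusters
	}
	// Gateway/VPN setup would run here using SDK clients built from s.cred.
	return nil
}
```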
@@ -0,0 +1,96 @@
package privatecluster
All the config should be put inside https://github.com/wenxuan0923/AKSFlexNode/blob/main/pkg/config/structs.go
The user-facing configuration (like GatewayConfig) can be moved to config/structs.go. However, runtime types like AKSClusterInfo, VPNConfig, and SSHConfig are populated dynamically during installation, not from user config — they should stay in the package. I'll consolidate this when integrating the private cluster into the bootstrapper framework in the follow-up PR.
pkg/privatecluster/utils.go
Outdated
)

// Logger provides colored logging for the private cluster operations
type Logger struct {
Why is this needed? We already have a logger set up at the agent level.
Agreed. It exists because the private cluster package was developed independently. When I integrate it into the bootstrapper framework in the follow-up PR, it will use the shared logrus.Logger like all other components, and this custom Logger will be removed.
pkg/privatecluster/installer.go
Outdated
i.clusterInfo.PrivateFQDN = clusterInfo.PrivateFQDN
i.logger.Success("AKS cluster: %s (AAD/RBAC enabled)", i.clusterInfo.ClusterName)

vnetName, vnetRG, err := i.azure.GetVNetInfo(ctx, i.clusterInfo.NodeResourceGroup)
You are assuming the VNet will always be in the node resource group, which is not true for BYO VNet.
The code actually handles BYO VNet correctly: it uses the node resource group only to locate the VMSS, then extracts the VNet's actual resource group from the VMSS subnet ID. So a BYO VNet in a different resource group works fine. I changed the function description here for clarity.
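For clarity, the resource group can be read straight out of the subnet resource ID, so it never has to match the node resource group. A sketch (helper name is hypothetical):

```go
// vnetFromSubnetID parses an ARM subnet ID of the form
// /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>
// and returns the VNet's own resource group and name (which may belong to a BYO VNet).
func vnetFromSubnetID(subnetID string) (vnetRG, vnetName string, err error) {
	parts := strings.Split(strings.Trim(subnetID, "/"), "/")
	for i := 0; i < len(parts)-1; i++ {
		switch strings.ToLower(parts[i]) {
		case "resourcegroups":
			vnetRG = parts[i+1]
		case "virtualnetworks":
			vnetName = parts[i+1]
		}
	}
	if vnetRG == "" || vnetName == "" {
		return "", "", fmt.Errorf("unexpected subnet ID format: %s", subnetID)
	}
	return vnetRG, vnetName, nil
}
```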
pkg/privatecluster/installer.go
Outdated
if err := InstallJQ(ctx, i.logger); err != nil {
	return fmt.Errorf("failed to install jq: %w", err)
}
if !CommandExists("kubectl") || !CommandExists("kubelogin") {
Why install kubectl and kubelogin here? kubectl is already being installed by kube_binaries_installer.go.
The private cluster installer currently runs before the bootstrapper (commands.go:97-101), so kube_binaries_installer hasn't executed yet at that point.
kubectl and kubelogin are needed during the private cluster setup to verify API server connectivity through the VPN tunnel. Specifically, at line 358, we run kubelogin convert-kubeconfig -l azurecli to convert the kubeconfig from static token auth to Azure AD (Entra ID) auth. This is required because the target AKS cluster uses Azure RBAC, and kubelogin acts as a credential plugin that lets kubectl authenticate via Azure AD automatically.
Also note that kube_binaries_installer only installs standard Kubernetes binaries (kubectl, kubelet); it does not install kubelogin, which is an AKS-specific tool for Azure AD integration.
This is closely related to the refactoring discussed above. In the follow-up PR, the private cluster setup will be integrated into the bootstrapper as a step. At that point, the kubectl/kubelogin installation can be consolidated with kube_binaries_installer by having it also handle kubelogin, and the step ordering in the bootstrapper will ensure they are installed before the private cluster connectivity verification runs.
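For reference, the conversion step itself is a single command; run from Go with the sudo-aware helper suggested above, it might look like this (helper signature assumed):

```go
// Convert the kubeconfig from static token auth to Entra ID auth so kubectl
// can authenticate against the Azure RBAC-enabled private cluster.
if _, err := utils.RunCommandWithOutput(ctx, "kubelogin", "convert-kubeconfig", "-l", "azurecli"); err != nil {
	return fmt.Errorf("kubelogin convert-kubeconfig failed: %w", err)
}
```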
pkg/privatecluster/uninstaller.go
Outdated
u.removeNodeFromCluster(ctx, hostname)

// Stop any running aks-flex-node agent process
u.stopFlexNodeAgent(ctx)
Why stopFlexNodeAgent in the privatecluster uninstaller? You should only clean up things related to your own component.
Removed stopFlexNodeAgent and removeArcAgent from the private cluster uninstaller. These are now handled by the bootstrapper's services.UnInstaller and arc.UnInstaller steps respectively. Kept removeNodeFromCluster here because it needs VPN connectivity to reach the private cluster API server — after the VPN is torn down in the next step, kubectl delete node would no longer work.
pkg/privatecluster/uninstaller.go
Outdated
}

// removeArcAgent removes Azure Arc agent
func (u *Uninstaller) removeArcAgent(ctx context.Context, nodeName string) {
This is done by arc uninstaller already.
Agreed, already removed.
Makes sense.
}

// ServicePrincipalConfig holds Azure service principal authentication configuration.
// When provided, service principal authentication will be used instead of Azure CLI.
Why remove those comments?
pipClient *armnetwork.PublicIPAddressesClient
nicClient *armnetwork.InterfacesClient
aksClient *armcontainerservice.ManagedClustersClient
arcClient *armhybridcompute.MachinesClient
What is arcClient used for?
pipName := gateway.Name + "-pip"
location := i.clusterInfo.Location

if err := i.azureClient.CreateSubnet(ctx, i.clusterInfo.VNetResourceGroup, i.clusterInfo.VNetName,
How do you know whether the VNet address space is big enough for a new subnet of size SubnetPrefix?
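For illustration, one way such a check could be done before calling CreateSubnet, using only the standard library (helper name hypothetical; assumes prefixes are given in canonical network-address form):

```go
// subnetFits reports whether the proposed prefix lies inside the VNet address
// space and does not overlap any existing subnet.
func subnetFits(vnetSpace string, existingSubnets []string, proposed string) (bool, error) {
	space, err := netip.ParsePrefix(vnetSpace)
	if err != nil {
		return false, err
	}
	p, err := netip.ParsePrefix(proposed)
	if err != nil {
		return false, err
	}
	if !space.Contains(p.Addr()) || p.Bits() < space.Bits() {
		return false, nil // not contained in the VNet address space
	}
	for _, s := range existingSubnets {
		existing, err := netip.ParsePrefix(s)
		if err != nil {
			return false, err
		}
		if existing.Overlaps(p) {
			return false, nil // collides with an existing subnet
		}
	}
	return true, nil
}
```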
i.clusterInfo.PrivateFQDN = clusterInfo.PrivateFQDN
i.logger.Infof("AKS cluster: %s (AAD/RBAC enabled)", i.clusterInfo.ClusterName)

vnetName, vnetRG, err := i.azureClient.GetVNetInfo(ctx, i.clusterInfo.NodeResourceGroup)
What about BYO VNet? It's not in the node resource group.
ClusterName: clusterName,
}

if err := i.phase1EnvironmentCheck(ctx); err != nil {
Don't name methods this way; what if we need to add an extra step in the middle in the future?
// gatewayConfig returns the Gateway configuration, applying any overrides from config.
func (i *Installer) gatewayConfig() GatewayConfig {
	gw := DefaultGatewayConfig()
	if i.config.Azure.TargetCluster.GatewayVMSize != "" {
How is this different from the Azure VPN Gateway feature I'm working on? Is one a managed solution and the other manual?
// GetVNetInfo discovers VNet name and resource group from VMSS subnet configuration.
func (c *AzureClient) GetVNetInfo(ctx context.Context, nodeResourceGroup string) (vnetName, vnetRG string, err error) {
	pager := c.vmssClient.NewListPager(nodeResourceGroup, nil)
Don't assume a VMSS node pool. If you need the VNet ID, just make it a config option?
}

// DeleteDisks deletes disks matching a name pattern.
func (c *AzureClient) DeleteDisks(ctx context.Context, resourceGroup, pattern string) error {
I believe there is a configuration on VM deletion that can delete all attached resources together.
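For reference, a hedged sketch of that option in the armcompute model (field and enum names recalled from the SDK; worth verifying against the version this repo pins):

```go
// Setting DeleteOption at VM creation time tells Azure to delete the OS disk
// together with the gateway VM, so a separate DeleteDisks pass may be unnecessary.
vm := armcompute.VirtualMachine{
	Properties: &armcompute.VirtualMachineProperties{
		StorageProfile: &armcompute.StorageProfile{
			OSDisk: &armcompute.OSDisk{
				CreateOption: to.Ptr(armcompute.DiskCreateOptionTypesFromImage),
				DeleteOption: to.Ptr(armcompute.DiskDeleteOptionTypesDelete),
			},
		},
	},
}
```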
}

// GetAKSClusterInfo retrieves AKS cluster information in a single API call.
func (c *AzureClient) GetAKSClusterInfo(ctx context.Context, resourceGroup, clusterName string) (*AKSClusterInfo, error) {
I don't see the point of this method; everything except PrivateFQDN is already covered in our code. Your code should focus on the private cluster part only.
// Install installs and configures VPN server
func (m *VPNServerManager) Install(ctx context.Context) error {
	script := `set -e
Please don't use shell scripts.
Usage