Add flex node added into / remove from Private AKS cluster #55

Open
weiliu2dev wants to merge 12 commits into Azure:main from weiliu2dev:weiliu2/private-cluster-flex
@weiliu2dev weiliu2dev commented Feb 2, 2026

  - Add private-join command to join Private AKS cluster via Gateway
  - Add private-leave command with --mode=local|full cleanup options
  - Add private-install.sh and private-uninstall.sh scripts
  - Add pkg/privatecluster package with embedded scripts
  - Add documentation for creating and configuring Private AKS cluster

Usage

# Join private cluster
sudo ./aks-flex-node private-join --aks-resource-id "<AKS_RESOURCE_ID>"

# Leave - local cleanup (keep Gateway for other nodes)
sudo ./aks-flex-node private-leave --mode=local

# Leave - full cleanup (remove all Azure resources)
sudo ./aks-flex-node private-leave --mode=full --aks-resource-id "<AKS_RESOURCE_ID>"

Files Changed

- commands.go - Add private-join/leave CLI commands
- main.go - Register new commands
- pkg/privatecluster/scripts.go - Shell script wrapper
- pkg/privatecluster/private-install.sh - Join script
- pkg/privatecluster/private-uninstall.sh - Leave script
- pkg/privatecluster/README.md - Usage documentation
- pkg/privatecluster/create_private_cluster.md - Cluster setup guide

@weiliu2dev weiliu2dev force-pushed the weiliu2/private-cluster-flex branch from d4d925c to 69604af Compare February 2, 2026 08:44
@@ -0,0 +1,796 @@
#!/bin/bash
Collaborator

We don't use shell scripts except for the agent installation/uninstallation step. Can you convert everything to Go?

Collaborator Author

@weiliu2dev weiliu2dev Feb 2, 2026

Agreed. This is a POC to show the process of adding a node to a private cluster. I'll create a tracking issue to convert the shell to Go before GA. The only shell that will remain is the Arc agent installation (official script).

Collaborator Author

Already replaced with Go.

NC='\033[0m' # No Color

# Configuration
GATEWAY_NAME="wg-gateway"
Collaborator

Please define a new struct for any new configuration needed in pkg/config/structs.go

Collaborator Author

Yes, I will address both together. Since this is a POC, I'll convert the shell to Go and define the GatewayConfig struct in a follow-up PR.
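
As a rough illustration of the struct requested above, here is a minimal sketch of a user-facing gateway configuration with defaults and overrides. Only the name `GatewayConfig`, the helper name `DefaultGatewayConfig`, and the default value "wg-gateway" appear in this PR; the field names, JSON keys, the `WithOverrides` helper, and the example VM size and subnet prefix are assumptions made for the sketch, not the repository's actual definitions.

```go
package main

import "fmt"

// GatewayConfig holds user-facing settings for the gateway VM.
// Field names and JSON keys are illustrative assumptions.
type GatewayConfig struct {
	Name         string `json:"name"`
	VMSize       string `json:"vmSize"`
	SubnetPrefix string `json:"subnetPrefix"`
}

// DefaultGatewayConfig returns placeholder defaults. Only "wg-gateway"
// comes from the reviewed script; the other values are made-up examples.
func DefaultGatewayConfig() GatewayConfig {
	return GatewayConfig{
		Name:         "wg-gateway",
		VMSize:       "Standard_B2s",  // assumed example size
		SubnetPrefix: "10.0.100.0/24", // assumed example prefix
	}
}

// WithOverrides (hypothetical helper) copies every non-empty field of o
// over the receiver and returns the merged configuration.
func (g GatewayConfig) WithOverrides(o GatewayConfig) GatewayConfig {
	if o.Name != "" {
		g.Name = o.Name
	}
	if o.VMSize != "" {
		g.VMSize = o.VMSize
	}
	if o.SubnetPrefix != "" {
		g.SubnetPrefix = o.SubnetPrefix
	}
	return g
}

func main() {
	gw := DefaultGatewayConfig().WithOverrides(GatewayConfig{VMSize: "Standard_D2s_v5"})
	fmt.Printf("%+v\n", gw)
}
```

Keeping defaults in one constructor means the user config only needs to list the values that differ.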

commands.go Outdated
// NewPrivateJoinCommand creates a new private-join command
func NewPrivateJoinCommand() *cobra.Command {
cmd := &cobra.Command{
Use: "private-join",
Collaborator

Please do not introduce new commands. The aks-flex-node agent --config command handles both node bootstrapping and self-healing and is also the command used by the systemd service. For private clusters, continue using this same command and add a new configuration property as needed.

Similarly, for unbootstrapping, continue using the existing aks-flex-node unbootstrap command, with the appropriate configuration for private clusters.
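
One way to read this request is that a configuration property, rather than a new command, decides whether the private cluster work runs. The sketch below assumes a boolean `IsPrivateCluster` field (the name suggested later in this review) with a guessed JSON key, and uses hypothetical step names; the real bootstrapper's step list and ordering may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TargetClusterConfig mirrors the struct quoted in this PR.
// The JSON key for IsPrivateCluster is an assumption.
type TargetClusterConfig struct {
	ResourceID       string `json:"resourceId"`
	Location         string `json:"location"`
	IsPrivateCluster bool   `json:"isPrivateCluster"`
}

// parseTargetCluster decodes the target cluster section of a config file.
func parseTargetCluster(raw []byte) (TargetClusterConfig, error) {
	var tc TargetClusterConfig
	if err := json.Unmarshal(raw, &tc); err != nil {
		return tc, fmt.Errorf("parse target cluster config: %w", err)
	}
	return tc, nil
}

// planSteps returns the bootstrap step names for a cluster.
// Step names and their order are hypothetical.
func planSteps(tc TargetClusterConfig) []string {
	steps := []string{"arc", "kube-binaries", "kubelet"}
	if tc.IsPrivateCluster {
		// Connectivity to the private API server is set up first.
		steps = append([]string{"private-cluster"}, steps...)
	}
	return steps
}

func main() {
	tc, err := parseTargetCluster([]byte(`{"resourceId":"<AKS_RESOURCE_ID>","isPrivateCluster":true}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(planSteps(tc))
}
```

With this shape the systemd service keeps invoking the same agent command, and only the config file changes between public and private clusters.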

Collaborator Author

OK, I will remove the commands. Note that without them, the inline help will no longer be available.

Collaborator Author

The commands have been removed and merged into the agent command.

fi

# Remove Arc Agent and Azure resource
log_info "Removing Arc Agent..."
Collaborator

This duplicates a lot of scripts/uninstall.sh. Just add a new uninstaller in /pkg/bootstrapper/bootstrapper.go to take care of setup and cleanup for anything related to the private cluster.

@github-advanced-security github-advanced-security bot left a comment

gosec found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@weiliu2dev weiliu2dev force-pushed the weiliu2/private-cluster-flex branch 3 times, most recently from ffdc990 to bd3daff Compare February 4, 2026 22:43
@weiliu2dev weiliu2dev force-pushed the weiliu2/private-cluster-flex branch 2 times, most recently from ad84d14 to 81f6179 Compare February 6, 2026 21:48
@weiliu2dev weiliu2dev force-pushed the weiliu2/private-cluster-flex branch from 81f6179 to 204ad26 Compare February 6, 2026 23:02
@weiliu2dev
Collaborator Author

@Neo-NZ please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement
@microsoft-github-policy-service agree company="Microsoft“

@microsoft-github-policy-service agree company="Microsoft"

@weiliu2dev weiliu2dev closed this Feb 6, 2026
@weiliu2dev weiliu2dev reopened this Feb 6, 2026
@wenxuan0923
Collaborator

I suggest spending a bit more time reviewing our codebase before starting to contribute. Thanks!


// Step 1: Clean up local agent state
i.logger.Info("Cleaning up local agent state...")
disconnectCmd := exec.CommandContext(ctx, "azcmagent", "disconnect", "--force-local-only")
Collaborator

Use utils.RunCommandWithOutput() so that sudo is added automatically when needed. The same applies to every use of exec.CommandContext below.
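
To illustrate the idea behind this comment, here is a minimal sketch of a "sudo on demand" wrapper. It is not the repository's `utils.RunCommandWithOutput`, whose signature is not shown in this thread; `buildArgv` and `runWithOutput` are hypothetical names, and the rule "prefix with sudo unless already root" is an assumption about the intended behavior.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
)

// buildArgv returns the argument vector to execute, prefixed with sudo
// when the caller is not already root.
func buildArgv(isRoot bool, name string, args ...string) []string {
	argv := make([]string, 0, len(args)+2)
	if !isRoot {
		argv = append(argv, "sudo")
	}
	argv = append(argv, name)
	return append(argv, args...)
}

// runWithOutput runs the command (through sudo if needed) and returns
// its combined stdout and stderr.
func runWithOutput(ctx context.Context, name string, args ...string) (string, error) {
	argv := buildArgv(os.Geteuid() == 0, name, args...)
	out, err := exec.CommandContext(ctx, argv[0], argv[1:]...).CombinedOutput()
	return string(out), err
}

func main() {
	// Only prints the argv a non-root caller would execute; nothing is run.
	fmt.Println(buildArgv(false, "azcmagent", "disconnect", "--force-local-only"))
}
```

Routing every privileged call through one helper keeps the elevation decision in a single place instead of requiring the whole process to run as root.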

}

// InstallJQ installs jq locally
func InstallJQ(ctx context.Context, logger *Logger) error {
Collaborator

Collaborator Author

removed.

type TargetClusterConfig struct {
ResourceID string `json:"resourceId"` // Full resource ID of the target AKS cluster
Location string `json:"location"` // Azure region of the cluster (e.g., "eastus", "westus2")
Private bool `json:"private"` // Whether this is a private AKS cluster (requires Gateway/VPN setup)
Collaborator

rename to IsPrivateCluster

Collaborator Author

changed.


// CheckInstalled verifies Azure CLI is installed
func (az *AzureCLI) CheckInstalled() error {
if !CommandExists("az") {
Collaborator

@wenxuan0923 wenxuan0923 Feb 7, 2026

Installation of az is enforced in install.sh, but the CLI credential isn't the only credential we support. Customers can also use SP or MSI, so we shouldn't always require az login.

Collaborator Author

@weiliu2dev weiliu2dev Feb 10, 2026

Removed the unused CheckInstalled/CheckLogin/CheckAndRefreshToken functions. The broader migration from az CLI to Azure SDK (which supports SP/MSI natively) is already in progress with azure_client.go and will be continued in follow-up PRs.
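
The point about supporting CLI, SP, and MSI credentials can be sketched as a small selection function. Everything here is hypothetical: the `AuthConfig` fields, the `AuthMethod` names, and the precedence order (service principal, then managed identity, then Azure CLI) are assumptions, and a real implementation would build the matching Azure SDK credential instead of returning a label.

```go
package main

import "fmt"

// AuthMethod names a supported way of authenticating to Azure.
type AuthMethod string

const (
	AuthServicePrincipal AuthMethod = "service-principal"
	AuthManagedIdentity  AuthMethod = "managed-identity"
	AuthAzureCLI         AuthMethod = "azure-cli"
)

// AuthConfig is a hypothetical subset of the agent's auth settings.
type AuthConfig struct {
	TenantID           string
	ClientID           string
	ClientSecret       string
	UseManagedIdentity bool
}

// selectAuthMethod picks a method from the config. The precedence
// (SP, then MSI, then CLI) is an assumption of this sketch.
func selectAuthMethod(c AuthConfig) AuthMethod {
	switch {
	case c.TenantID != "" && c.ClientID != "" && c.ClientSecret != "":
		return AuthServicePrincipal
	case c.UseManagedIdentity:
		return AuthManagedIdentity
	default:
		return AuthAzureCLI
	}
}

// requiresAzLogin reports whether an az login session is needed at all.
func requiresAzLogin(c AuthConfig) bool {
	return selectAuthMethod(c) == AuthAzureCLI
}

func main() {
	cfg := AuthConfig{UseManagedIdentity: true}
	fmt.Println(selectAuthMethod(cfg), requiresAzLogin(cfg))
}
```

With a check like `requiresAzLogin`, an az login precondition only applies when the CLI credential is actually the one in use.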


// AKSClusterExists checks if an AKS cluster exists
func (az *AzureCLI) AKSClusterExists(ctx context.Context, resourceGroup, clusterName string) bool {
return RunCommandSilent(ctx, "az", "aks", "show",
Collaborator

Do not use any az command for resource management. Use the track2 SDK client, since we support multiple auth methods. Read https://github.com/wenxuan0923/AKSFlexNode/blob/main/pkg/auth/auth.go

Please convert all the commands in this file to use SDK client. FYI something like this https://github.com/wenxuan0923/AKSFlexNode/blob/6bdb4d86237cacb426e363cc008f3211441b563b/pkg/components/arc/arc_base.go#L42

Collaborator Author

Agreed. This is a larger refactor to migrate all az CLI commands to Track2 SDK clients. The current PR already started in this direction with azure_client.go using the SDK. I'll convert the remaining az CLI calls in a follow-up PR to keep this one focused.

Collaborator Author

I think that for a POC of a node joining a private cluster, the current PR mainly verifies feasibility. Can we improve it later in follow-up PRs?

}

// IsRoot checks if the current process is running as root
func IsRoot() bool {
Collaborator

No, it should not run as root.

Collaborator Author

Modifying /etc/hosts is just one of many privileged operations in the private cluster flow: WireGuard configuration, systemctl, etc. all require elevated access. Changing the entire private cluster package from "require root" to "sudo on demand" is a big change that I'll address in a follow-up PR.

}

// RemoveHostsEntries removes entries matching a pattern from /etc/hosts
func RemoveHostsEntries(pattern string) error {
Collaborator

What is this for?

Collaborator Author

This is for private AKS cluster DNS resolution. The private cluster API server FQDN (e.g., xxx.privatelink.eastus.azmk8s.io) is not resolvable from outside the VNet. During install, we add the API server IP → FQDN mapping to /etc/hosts so kubelet can reach it via VPN. RemoveHostsEntries cleans up these entries during uninstall.
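
The add-then-clean-up behavior described above can be sketched on the file content as a string, which keeps the logic testable without touching /etc/hosts. The function names `addHostsEntry` and `removeHostsEntries` (lowercase) and the substring-based matching are assumptions of this sketch; the PR's actual `RemoveHostsEntries` may match patterns differently. The IP and FQDN are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// addHostsEntry appends an "IP FQDN" line to hosts-file content,
// making sure the existing content ends with a newline first.
func addHostsEntry(content, ip, fqdn string) string {
	if content != "" && !strings.HasSuffix(content, "\n") {
		content += "\n"
	}
	return content + ip + " " + fqdn + "\n"
}

// removeHostsEntries drops every line containing pattern and returns
// the remaining content unchanged.
func removeHostsEntries(content, pattern string) string {
	var kept []string
	for _, line := range strings.Split(content, "\n") {
		if strings.Contains(line, pattern) {
			continue
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n")
}

func main() {
	hosts := "127.0.0.1 localhost\n"
	hosts = addHostsEntry(hosts, "10.0.0.4", "demo.privatelink.eastus.azmk8s.io")
	fmt.Print(hosts)
	fmt.Print(removeHostsEntries(hosts, "privatelink.eastus.azmk8s.io"))
}
```

A broad pattern could also remove unrelated user entries, so a real implementation would probably match the exact FQDN that was added during install.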

)

// ScriptRunner provides backward compatibility (Deprecated: use Installer/Uninstaller directly)
type ScriptRunner struct {
Collaborator

What is this for? We only need an installer and an uninstaller in bootstrapper.go.

https://github.com/wenxuan0923/AKSFlexNode/blob/main/pkg/bootstrapper/bootstrapper.go

Collaborator Author

In the next PR, I'll migrate the az CLI calls in the private cluster package to Azure Go SDK, and then integrate it into the bootstrapper as a proper Executor step. This will also unify the auth model to support CLI/SP/MSI via azcore.TokenCredential.

@@ -0,0 +1,96 @@
package privatecluster
Collaborator

Collaborator Author

The user-facing configuration (like GatewayConfig) can be moved to config/structs.go. However, runtime types like AKSClusterInfo, VPNConfig, and SSHConfig are populated dynamically during installation, not from user config — they should stay in the package. I'll consolidate this when integrating the private cluster into the bootstrapper framework in the follow-up PR.

)

// Logger provides colored logging for the private cluster operations
type Logger struct {
Collaborator

@wenxuan0923 wenxuan0923 Feb 7, 2026

Why is this needed? We already have a logger set up at the agent level.

Collaborator Author

Agreed. It exists because the private cluster package was developed independently. When I integrate it into the bootstrapper framework in the follow-up PR, it will use the shared logrus.Logger like all other components, and this custom Logger will be removed.

i.clusterInfo.PrivateFQDN = clusterInfo.PrivateFQDN
i.logger.Success("AKS cluster: %s (AAD/RBAC enabled)", i.clusterInfo.ClusterName)

vnetName, vnetRG, err := i.azure.GetVNetInfo(ctx, i.clusterInfo.NodeResourceGroup)
Collaborator

You are assuming the VNet will always be in the node resource group, which is not true for BYO VNet.

Collaborator Author

@weiliu2dev weiliu2dev Feb 10, 2026

The code actually handles BYO VNet correctly: it uses the node resource group only to locate the VMSS, then extracts the VNet's actual resource group from the VMSS subnet ID. So a BYO VNet in a different resource group works fine. I updated the function description here for clarity.
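
The extraction step described in this reply can be sketched as plain string parsing of a subnet resource ID, which has the standard form `/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet}`. The function name `vnetFromSubnetID` is hypothetical, and the PR's `GetVNetInfo` may parse the ID differently; the IDs in the example are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// vnetFromSubnetID extracts the VNet resource group and VNet name from a
// subnet resource ID by walking its key/value path segments.
func vnetFromSubnetID(subnetID string) (vnetRG, vnetName string, err error) {
	parts := strings.Split(strings.Trim(subnetID, "/"), "/")
	for i := 0; i+1 < len(parts); i += 2 {
		// Azure resource ID keys are case-insensitive.
		switch strings.ToLower(parts[i]) {
		case "resourcegroups":
			vnetRG = parts[i+1]
		case "virtualnetworks":
			vnetName = parts[i+1]
		}
	}
	if vnetRG == "" || vnetName == "" {
		return "", "", fmt.Errorf("not a subnet resource ID: %q", subnetID)
	}
	return vnetRG, vnetName, nil
}

func main() {
	id := "/subscriptions/0000/resourceGroups/byo-network-rg/providers/Microsoft.Network/virtualNetworks/byo-vnet/subnets/nodes"
	rg, name, err := vnetFromSubnetID(id)
	if err != nil {
		panic(err)
	}
	fmt.Println(rg, name)
}
```

Because the resource group comes from the subnet ID itself, the result does not depend on which resource group hosts the VMSS.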

if err := InstallJQ(ctx, i.logger); err != nil {
return fmt.Errorf("failed to install jq: %w", err)
}
if !CommandExists("kubectl") || !CommandExists("kubelogin") {
Collaborator

@wenxuan0923 wenxuan0923 Feb 7, 2026

Why install kubectl and kubelogin here? kubectl is already installed by kube_binaries_installer.go.

Collaborator Author

The private cluster installer currently runs before the bootstrapper (commands.go:97-101), so kube_binaries_installer hasn't executed yet at that point.

kubectl and kubelogin are needed during the private cluster setup to verify API server connectivity through the VPN tunnel. Specifically, at line 358, we run kubelogin convert-kubeconfig -l azurecli to convert the kubeconfig from static token auth to Azure AD (Entra ID) auth. This is required because the target AKS cluster uses Azure RBAC, and kubelogin acts as a credential plugin that lets kubectl authenticate via Azure AD automatically.

Also note that kube_binaries_installer only installs standard Kubernetes binaries (kubectl, kubelet). It does not install kubelogin, which is an AKS-specific tool for Azure AD integration.

Collaborator Author

This is closely related to the refactoring discussed above. In the follow-up PR, the private cluster setup will be integrated into the bootstrapper as a step. At that point, the kubectl/kubelogin installation can be consolidated with kube_binaries_installer by having it also handle kubelogin, and the step ordering in the bootstrapper will ensure they are installed before the private cluster connectivity verification runs.

u.removeNodeFromCluster(ctx, hostname)

// Stop any running aks-flex-node agent process
u.stopFlexNodeAgent(ctx)
Collaborator

Why call stopFlexNodeAgent in the privatecluster uninstaller? You should only clean up things related to your own component.

Collaborator Author

Removed stopFlexNodeAgent and removeArcAgent from the private cluster uninstaller. These are now handled by the bootstrapper's services.UnInstaller and arc.UnInstaller steps respectively. Kept removeNodeFromCluster here because it needs VPN connectivity to reach the private cluster API server — after the VPN is torn down in the next step, kubectl delete node would no longer work.

}

// removeArcAgent removes Azure Arc agent
func (u *Uninstaller) removeArcAgent(ctx context.Context, nodeName string) {
Collaborator

This is done by arc uninstaller already.

Collaborator Author

Agreed, already removed.

@weiliu2dev
Collaborator Author

I suggest spending a bit more time reviewing our codebase before starting to contribute. Thanks!

Makes sense.

}

// ServicePrincipalConfig holds Azure service principal authentication configuration.
// When provided, service principal authentication will be used instead of Azure CLI.
Collaborator

Why are you removing these comments?

pipClient *armnetwork.PublicIPAddressesClient
nicClient *armnetwork.InterfacesClient
aksClient *armcontainerservice.ManagedClustersClient
arcClient *armhybridcompute.MachinesClient
Collaborator

What is arcClient used for?

pipName := gateway.Name + "-pip"
location := i.clusterInfo.Location

if err := i.azureClient.CreateSubnet(ctx, i.clusterInfo.VNetResourceGroup, i.clusterInfo.VNetName,
Collaborator

How do you know whether the VNet address space is big enough for a new subnet of size SubnetPrefix?
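
A pre-flight check along these lines could answer the question before CreateSubnet is called. This is a minimal sketch using the standard `net/netip` package; the function name `subnetFits` and its inputs (the VNet address spaces and the prefixes of existing subnets as CIDR strings) are assumptions, and in real code those values would come from the VNet resource returned by the SDK.

```go
package main

import (
	"fmt"
	"net/netip"
)

// subnetFits reports whether candidate lies fully inside one of the VNet
// address spaces and does not overlap any existing subnet.
func subnetFits(addressSpaces, existing []string, candidate string) (bool, error) {
	cand, err := netip.ParsePrefix(candidate)
	if err != nil {
		return false, fmt.Errorf("parse candidate prefix: %w", err)
	}
	cand = cand.Masked()

	inside := false
	for _, s := range addressSpaces {
		space, err := netip.ParsePrefix(s)
		if err != nil {
			return false, fmt.Errorf("parse address space: %w", err)
		}
		// The space must be at least as wide as the candidate and contain
		// the candidate's network address.
		if space.Bits() <= cand.Bits() && space.Contains(cand.Addr()) {
			inside = true
			break
		}
	}
	if !inside {
		return false, nil
	}

	for _, s := range existing {
		sub, err := netip.ParsePrefix(s)
		if err != nil {
			return false, fmt.Errorf("parse existing subnet: %w", err)
		}
		if sub.Overlaps(cand) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, err := subnetFits([]string{"10.0.0.0/16"}, []string{"10.0.0.0/24"}, "10.0.100.0/24")
	fmt.Println(ok, err)
}
```

If the check fails, the installer could stop with a clear error or let the user supply a different prefix through configuration, rather than surfacing a raw API failure.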

i.clusterInfo.PrivateFQDN = clusterInfo.PrivateFQDN
i.logger.Infof("AKS cluster: %s (AAD/RBAC enabled)", i.clusterInfo.ClusterName)

vnetName, vnetRG, err := i.azureClient.GetVNetInfo(ctx, i.clusterInfo.NodeResourceGroup)
Collaborator

What about BYO VNet? It's not in the node resource group.

ClusterName: clusterName,
}

if err := i.phase1EnvironmentCheck(ctx); err != nil {
Collaborator

Don't name methods this way. What if we need to add an extra step in the middle in the future?
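
One common alternative to numbered method names is an ordered list of named steps, so a new step can be inserted without renaming anything. The sketch below is only an illustration of that pattern; the `step` type, `runSteps`, `runDemo`, and the step names are hypothetical and are not taken from the repository's bootstrapper.

```go
package main

import (
	"context"
	"fmt"
)

// step pairs a descriptive name with the work to perform.
type step struct {
	name string
	run  func(ctx context.Context) error
}

// runSteps executes the steps in slice order and stops at the first
// failure, wrapping the error with the step's name.
func runSteps(ctx context.Context, steps []step) error {
	for _, s := range steps {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := s.run(ctx); err != nil {
			return fmt.Errorf("step %q failed: %w", s.name, err)
		}
	}
	return nil
}

// runDemo runs three no-op steps and returns the order they ran in.
func runDemo() ([]string, error) {
	var order []string
	record := func(name string) func(context.Context) error {
		return func(context.Context) error {
			order = append(order, name)
			return nil
		}
	}
	steps := []step{
		{name: "check-environment", run: record("check-environment")},
		{name: "setup-gateway", run: record("setup-gateway")},
		{name: "configure-vpn", run: record("configure-vpn")},
	}
	err := runSteps(context.Background(), steps)
	return order, err
}

func main() {
	order, err := runDemo()
	fmt.Println(order, err)
}
```

Adding a step in the middle is then a one-line change to the slice, and error messages carry the step name instead of a phase number.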

// gatewayConfig returns the Gateway configuration, applying any overrides from config.
func (i *Installer) gatewayConfig() GatewayConfig {
gw := DefaultGatewayConfig()
if i.config.Azure.TargetCluster.GatewayVMSize != "" {
Collaborator

How is this different from the Azure VPNGateway feature I'm working on? Is one a managed solution and the other manual?


// GetVNetInfo discovers VNet name and resource group from VMSS subnet configuration.
func (c *AzureClient) GetVNetInfo(ctx context.Context, nodeResourceGroup string) (vnetName, vnetRG string, err error) {
pager := c.vmssClient.NewListPager(nodeResourceGroup, nil)
Collaborator

Don't assume a VMSS node pool. If you need the VNet ID, why not make it a config value?

}

// DeleteDisks deletes disks matching a name pattern.
func (c *AzureClient) DeleteDisks(ctx context.Context, resourceGroup, pattern string) error {
Collaborator

I believe VM deletion can be configured to delete all attached resources together.

}

// GetAKSClusterInfo retrieves AKS cluster information in a single API call.
func (c *AzureClient) GetAKSClusterInfo(ctx context.Context, resourceGroup, clusterName string) (*AKSClusterInfo, error) {
Collaborator

I don't see the point of this method. Everything except PrivateFQDN is already covered in our code. Your code should focus on the private cluster part only.


// Install installs and configures VPN server
func (m *VPNServerManager) Install(ctx context.Context) error {
script := `set -e
Collaborator

Please don't use shell scripts.


2 participants