WIP node setup#77

Draft
bcho wants to merge 9 commits into main from hbc/stretch-setup

@bcho commented on Feb 12, 2026

No description provided.

@bcho changed the title from "WIP node setpu" to "WIP node setup" on Feb 12, 2026
Comment on lines +21 to +23
- We will limit the support scope to Linux-based nodes and focus on Ubuntu distro for now.
This is because Ubuntu is the widely and commonly available Linux distribution
across the target environments.

Member:
I think we should consider focusing on Ubuntu and Azure Linux from the start.

Author (@bcho):
For 3P clouds, do we want to offer Azure Linux (AzLinux)?

Comment on lines +44 to +45
* `containerd` w/ 2.0+ version;
* `runc`

Member:
Both? Or one?

Author (@bcho):
Both!

* TLS bootstrap configurations;
* Other cloud provider binaries;
- NFTables / IPtables installed for Kubernetes network policies;
- Network forward, IP masquerade and bridge settings configured for Kubernetes networking;

Member:
VPN components?

Author (@bcho):
Yeah, I think we need to check which VPN components are needed here. I don't have a concrete answer yet, so I left it out for now.

- Detailed GPU device plugin requirements and enablement strategies will be addressed in
a separate document.

## Baseline Environment Requirements

Member:
Maybe it would be possible to more clearly separate (a) binaries and configuration, and (b) stuff that could/should be baked in (if we're baking) and stuff that might not be?

docs/node-env.md Outdated

**Expected behaviors**:

- Produced image is **immutable** and **reproducible** giving the same inputs.

Member:
Immutable? Well, it's an image, and we don't prevent files being modified at runtime?
Reproducible? As far as the contents of the filesystem, maybe -- byte for byte at an image level is probably hard.

Author (@bcho):
Immutable means we don't update a published version after release. Reproducible means we can rebuild the same image with the same set of components/binaries at any time. Other files inside the image that are not critical to kubelet functionality are fine to drift.

Author (@bcho):
I will find a way to document this in the doc. I called these out because I have seen a few incidents on the AgentBaker side where a change in the inputs (for example, an artifact naming change in the package registry) caused an outage. If we can find a good contract to limit and pin the critical components, incidents like that could be avoided. cc @cameronmeissner

Author (@bcho):
Added two footnotes to explain the terms here.


**Inputs**:

- Cluster endpoint (API server URL, CA bundle)

Member:
Perhaps network/VPN configuration/credentials?

Member:
Perhaps identity credentials

Author (@bcho):
Added two more items to cover them.


- All failure handling mechanisms from both Node VHD Image Baking and Node Bootstrapping

### Node Rebooting & Repairing

Member:
Is any of the stuff below really in scope for this document?

Author (@bcho):
I added these operations to make sure the tool we are building here supports all of these scenarios.

* `containerd` w/ 2.0+ version;
* `runc`
- Kubernetes components:
* `kubelet` matching with the target worker node version;

Member:
Will images be Kubernetes-version specific, or will one image support multiple k8s versions?

Author (@bcho):
This is actually debatable. In the current AgentBaker, we bake multiple k8s versions because AKS supports multiple versions. But for flex nodes, I think we can take a more limited approach, reducing the support matrix while providing more stable functionality.

docs/node-env.md Outdated
- Kubernetes components:
* `kubelet` matching with the target worker node version;
* Control plane public CA certificate(s);
* TLS bootstrap configurations;

Reviewer:
For flex nodes, it looks like we plan on using bootstrap tokens (at least to start). Have we thought about mechanisms that would let us avoid bootstrap tokens and would work across all clouds and on-prem environments? (This is probably a project in and of itself, just curious.)

Author (@bcho):
Not every cloud supports the same set of features, hence I just put "TLS bootstrap configurations" here. It can be a static token, Arc, or something similar to a secure TLS bootstrap setup.
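
For the static-token option mentioned above, a small sketch of validating a Kubernetes bootstrap token before it goes into the TLS bootstrap configuration. The `<6 chars>.<16 chars>` lowercase alphanumeric format is the documented Kubernetes bootstrap token format; `parseBootstrapToken` is a hypothetical helper, and whether flex nodes end up using static tokens at all is still open in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// bootstrapTokenRE matches the Kubernetes bootstrap token format
// "<token-id>.<token-secret>": 6 and 16 lowercase alphanumeric characters.
var bootstrapTokenRE = regexp.MustCompile(`^([a-z0-9]{6})\.([a-z0-9]{16})$`)

// parseBootstrapToken splits a token into its public ID and its secret.
// The error never includes the token, so it is safe to log.
func parseBootstrapToken(token string) (id, secret string, err error) {
	m := bootstrapTokenRE.FindStringSubmatch(token)
	if m == nil {
		return "", "", errors.New("bootstrap token must have the form <6 chars>.<16 chars>")
	}
	return m[1], m[2], nil
}

func main() {
	// Example value only; never hard-code a real token.
	id, _, err := parseBootstrapToken("abcdef.0123456789abcdef")
	fmt.Println("token ID:", id, "error:", err)

	_, _, err = parseBootstrapToken("not-a-token")
	fmt.Println("error:", err)
}
```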


### Additional Requirements

- Node identity for identifying and authenticating the node to cluster control plane;

Reviewer:
Node identity: are you referring to the client certificate obtained through TLS bootstrapping here?

Author (@bcho):
Yep!


### Node VHD Image Baking

**Purpose**: Produce a base node image (VHD or similar) that satisfies baseline

Reviewer:
Have we thought about what distribution of said base node image would look like?

Author (@bcho):
Yeah, unfortunately different clouds will have different ways of doing this. I guess we will end up maintaining a couple of supported node images in every cloud.


### Node Bootstrapping w/ Baking

**Purpose**: In environments without pre-baked images, the bootstrapping process

Reviewer:
Artur and I have talked about embedding the provisioning scripts we have in AgentBaker directly into aks-node-controller to accomplish this. I haven't looked around the rest of this repo yet, but it seems like you're handling component installation natively in Golang from the start? If so, you wouldn't need something like script embedding, but I thought it was worth mentioning.

Author (@bcho):
I think we will end up having a binary embedded in the node that does things similar to aks-node-controller.

3 participants