articles/operator-nexus/concepts-compute.md
Lines changed: 2 additions & 2 deletions
@@ -45,7 +45,7 @@ Huge page usage in workloads refers to the utilization of large memory pages, ty
 
 Workloads that involve large data sets or intensive memory operations such as network packet processing, can benefit from huge page usage because it enhances memory performance and reduces memory-related bottlenecks. As a result, users see improved throughput and reduced latency.
 
-All virtual machines created on Azure Operator Nexus are backed by 1GiB(1G) hugepages for the requested memory. The kernel running inside the VM can manage these available memory anyway it likes, including the allocation of memory to support hugepages (2M or 1G).
+All virtual machines created on Azure Operator Nexus are backed by 1GiB(1G) hugepages for the requested memory. The kernel running inside the VM can manage these available memory anyway it likes, including the allocation of memory to support hugepages (2M or 1G).
 
 ### Dual-stack support
 
@@ -88,7 +88,7 @@ The following properties reflect the operational state of a BMM:
 -`Control plane`: These BMM runs the Kubernetes control plane agents for Nexus platform cluster.
 -`Management plane`: The BMM runs the Nexus platform agents including controllers and extensions.
 -`Compute plane`: The BMM responsible for running actual tenant workloads including Nexus Kubernetes Clusters and Virtual Machines.
-
+
 Refer this [link](reference-near-edge-baremetal-machine-roles.md) for more details on Machine Roles.
articles/operator-nexus/concepts-rack-resiliency.md
Lines changed: 9 additions & 12 deletions
@@ -71,26 +71,23 @@ To maintain Kubernetes control plane (KCP) quorum, Operator Nexus provides autom
 
 Here are the triggers for automated remediation:
 
-- For all servers (Compute, Management and KCP): if a server fails to provision successfully after six hours, automated remediation occurs. This check includes provisioning a new Bare Metal Machine (BMM) at initial deployment time or provisioning during a Replace action.
-- For all servers (Compute, Management and KCP): if a running node is stuck in a read only root file system mode for 10 minutes, automated remediation occurs.
-- For KCP and Management Plane servers only, if a Kubernetes node is in an Unknown state for 30 minutes, automated remediation occurs.
+* For all servers (Compute, Management and KCP): if a server fails to provision successfully after six hours, automated remediation occurs. This check includes provisioning a new Bare Metal Machine (BMM) at initial deployment time or provisioning during a Replace action.
+* For all servers (Compute, Management and KCP): if a running node is stuck in a read only root file system mode for 10 minutes, automated remediation occurs.
+* For KCP and Management Plane servers only, if a Kubernetes node is in an Unknown state for 30 minutes, automated remediation occurs.
 
 ### Remediation process
 
-- Remediation of a Compute node is now one reprovisioning attempt. If the reprovisioning fails, the node is marked Unhealthy. Reprovisioning no longer continues to retry infinitely, and the Bare Metal Machine is powered off.
-- Remediation of a Management Plane node is to attempt one reboot and then one reprovisioning attempt. If those steps fail, the node is marked Unhealthy.
-- Remediation of a KCP node is to attempt one reboot. If the reboot fails, the node is marked Unhealthy and Nexus triggers the immediate provisioning of the spare KCP node. This process is outlined in the [KCP remediation details](#kcp-remediation-details) section.
-- In all instances, when the Bare Metal Machine is marked unhealthy, the BMM's `detailedStatusMessage` is updated to read `Warning: BMM Node is unhealthy and may require hardware replacement.` The Bare Metal Machine's node is removed from the Kubernetes Cluster, which triggers a node drain. Users need to run a BMM Replace action to return the BMM into service and have it rejoin the Kubernetes Cluster.
-
-> [!TIP]
-> When you run a BMM Replace to remediate an unhealthy node, you can monitor progress and steps in the Azure portal JSON view under `properties.actionStates` (Operator Nexus 2509.1+ and API 2025-07-01-preview+). See [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties).
+* Remediation of a Compute node is now one reprovisioning attempt. If the reprovisioning fails, the node is marked Unhealthy. Reprovisioning no longer continues to retry infinitely, and the Bare Metal Machine is powered off.
+* Remediation of a Management Plane node is to attempt one reboot and then one reprovisioning attempt. If those steps fail, the node is marked Unhealthy.
+* Remediation of a KCP node is to attempt one reboot. If the reboot fails, the node is marked Unhealthy and Nexus triggers the immediate provisioning of the spare KCP node. This process is outlined in the [KCP remediation details](#kcp-remediation-details) section.
+* In all instances, when the Bare Metal Machine is marked unhealthy, the BMM's `detailedStatusMessage` is updated to read `Warning: BMM Node is unhealthy and may require hardware replacement.` The Bare Metal Machine's node is removed from the Kubernetes Cluster, which triggers a node drain. Users need to run a BMM Replace action to return the BMM into service and have it rejoin the Kubernetes Cluster.
 
 ### KCP remediation details
 
 Ongoing control plane resiliency requires a spare KCP node. When KCP node fails remediation and is marked Unhealthy, a deprovisioning of the node occurs. The unhealthy KCP node is exchanged with a suitable healthy Management Plane server. This Management Plane server becomes the new spare KCP node. The failed KCP node is updated and labeled as a Management Plane node. Once the label changes, an attempt to provision the newly labeled management plane node occurs. If it fails to provision, the management plane remediation process takes over. If it fails provisioning or doesn't run successfully, the machine's status remains unhealthy, and the user must fix. The unhealthy condition surfaces to the Bare Metal Machine's (BMM) `detailedStatus` and `detailedStatusMessage` fields in Azure and clears through a BMM Replace action.
 
-> [!NOTE]
->The provisioning retry process doesn't execute on compute and management node pool nodes for systems running the 4.1 NetworkCloud runtime. This capability is available when the Nexus Cluster is updated to the 4.4 runtime.
+> [!NOTE]
+>The provisioning retry process doesn't execute on compute and management node pool nodes for systems running the 4.1 NetworkCloud runtime. This capability is available when the Nexus Cluster is updated to the 4.4 runtime.
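
Reviewer sketch (not part of the commit): the remediation triggers and per-role steps described in the concepts-rack-resiliency.md text above can be modeled roughly as follows. This is a minimal illustration, not the Operator Nexus implementation. The names `Role`, `NodeObservation`, `needs_remediation`, and `remediation_steps` are hypothetical, and treating each threshold as "at or beyond" the stated duration is an assumption, since the text only says "after six hours", "for 10 minutes", and "for 30 minutes".

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    COMPUTE = "compute"
    MANAGEMENT = "management"
    KCP = "kcp"


# Durations come from the trigger list in the document; the constant
# names are illustrative and do not correspond to any real setting.
PROVISION_TIMEOUT_MIN = 6 * 60
READ_ONLY_ROOTFS_MIN = 10
UNKNOWN_STATE_MIN = 30


@dataclass
class NodeObservation:
    """Hypothetical snapshot of a server's condition, in minutes."""
    role: Role
    provisioning_minutes: int = 0      # time spent provisioning without success
    read_only_rootfs_minutes: int = 0  # time stuck with a read-only root file system
    unknown_state_minutes: int = 0     # time the Kubernetes node has been Unknown


def needs_remediation(obs: NodeObservation) -> bool:
    """Return True if any documented trigger applies (assumes >= thresholds)."""
    # The first two triggers apply to all server roles.
    if obs.provisioning_minutes >= PROVISION_TIMEOUT_MIN:
        return True
    if obs.read_only_rootfs_minutes >= READ_ONLY_ROOTFS_MIN:
        return True
    # The Unknown-state trigger applies to KCP and Management Plane only.
    if obs.role in (Role.KCP, Role.MANAGEMENT):
        return obs.unknown_state_minutes >= UNKNOWN_STATE_MIN
    return False


def remediation_steps(role: Role) -> list[str]:
    """Ordered attempts made before the node is marked Unhealthy."""
    steps = {
        Role.COMPUTE: ["reprovision"],
        Role.MANAGEMENT: ["reboot", "reprovision"],
        Role.KCP: ["reboot"],
    }
    return steps[role]
```

In this sketch a Compute node that has been Unknown for 45 minutes does not trigger remediation, while a KCP node in the same state does, which mirrors the third trigger in the list.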
articles/operator-nexus/concepts-resource-types.md
Lines changed: 8 additions & 11 deletions
@@ -43,7 +43,7 @@ You can manage the lifecycle of a Network Fabric via Azure using any of the supp
 
 ### Network racks
 
-Network Rack resource is a representation of your on-premises racks from the networking perspective. The number of network racks in an Operator Nexus instance depends on the Network Fabric SKU that was chosen during creation.
+Network Rack resource is a representation of your on-premises racks from the networking perspective. The number of network racks in an Operator Nexus instance depends on the Network Fabric SKU that was chosen during creation.
 
 Each network rack consists of Network Devices that are part of that rack. For example - Customer Edge (CE) routers, Top of Rack (ToR) Switches, Management Switches, and Network Packet Brokers (NPB).
 
@@ -63,25 +63,25 @@ The lifecycle of the Network Device resources depends on the network rack resour
 
 ### Isolation domains
 
-Isolation Domains enable east-west or north-south connectivity across Operator Nexus instance. They provide the required network connectivity between infrastructure components and also workload components. In principle, there are two types of networks that are established by isolation domains - management network and workload or tenant network.
+Isolation Domains enable east-west or north-south connectivity across Operator Nexus instance. They provide the required network connectivity between infrastructure components and also workload components. In principle, there are two types of networks that are established by isolation domains - management network and workload or tenant network.
 
 A management network provides private connectivity that enables communication between the Network Fabric instance that is deployed on-premises and Azure Virtual Network. You can create workload or tenant networks to enable communication between the workloads that are deployed across the Operator Nexus instance.
 
 Each isolation domain is associated with a specific Network Fabric resource and has the option to be enabled/disabled. Only when an isolation domain is enabled, it's configured on the network devices, and the configuration is removed once the isolation domain is removed.
 
 Primarily, there are two types of isolation domains:
 
-- Layer 2 or L2 Isolation Domains
-- Layer 3 or L3 Isolation Domains
+* Layer 2 or L2 Isolation Domains
+* Layer 3 or L3 Isolation Domains
 
 Layer 2 isolation domains enable your infrastructure and workloads communicate with each other within or across racks over a Layer 2 network. Layer 2 networks enable east-west communication within your Operator Nexus instance. You can configure an L2 isolation domain with a desired Vlan ID and MTU size, see [Nexus Limits and Quotas](./reference-limits-and-quotas.md) for MTU limits.
 
 Layer 3 isolation domains enable your infrastructure and workloads communicate with each other within or across racks over a Layer 3 network. Layer 3 networks enable east-west and north-south communication within and outside your Operator Nexus instance.
 
 There are two types of Layer 3 networks that you can create:
 
-- Internal Network
-- External Network
+* Internal Network
+* External Network
 
 Internal networks enable layer 3 east-west connectivity across racks within the Operator Nexus instance and external networks enable layer 3 north-south connectivity from the Operator Nexus instance to networks outside the instance. A Layer 3 isolation domain must be configured with at least one internal network; external networks are optional.
 
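
Reviewer sketch (not part of the commit): the Layer 3 rule stated in the hunk above, at least one internal network with external networks optional, could be expressed as a simple check. The `L3IsolationDomain` class, its fields, and the example names are hypothetical and exist only to illustrate the constraint; they are not the Azure resource schema.

```python
from dataclasses import dataclass, field


@dataclass
class L3IsolationDomain:
    """Hypothetical model of a Layer 3 isolation domain (illustration only)."""
    name: str
    internal_networks: list[str] = field(default_factory=list)  # east-west
    external_networks: list[str] = field(default_factory=list)  # north-south, optional

    def validate(self) -> None:
        # The document requires at least one internal network;
        # external networks may be omitted entirely.
        if not self.internal_networks:
            raise ValueError(
                f"L3 isolation domain '{self.name}' needs at least one internal network"
            )


# Valid: one internal network, no external networks.
L3IsolationDomain("example-l3-domain", internal_networks=["internal-a"]).validate()
```

A domain that lists only external networks would fail `validate()` in this sketch, because the internal network is the mandatory part.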
@@ -110,17 +110,14 @@ Storage Appliances represent storage arrays used for persistent data storage in
 Bare Metal Machines represent the physical servers in a rack. They are lifecycle managed by the Cluster Manager.
 Bare Metal Machines are used by workloads to host Virtual Machines and Kubernetes clusters.
 
-> [!NOTE]
-> Recent or in-progress lifecycle actions for a Bare Metal Machine (for example, Replace, Reimage, Restart) appear in the Azure portal JSON view under `properties.actionStates` (Operator Nexus 2509.1+ and API 2025-07-01-preview+). See [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties).
-
 ## Workload components
 
 Workload components are resources that you use in hosting your workloads.
 
 ### Network resources
 
-The Network resources represent the virtual networking in support of your workloads hosted on VMs or Kubernetes clusters.
-There are four Network resource types that represent a network attachment to an underlying isolation-domain.
+The Network resources represent the virtual networking in support of your workloads hosted on VMs or Kubernetes clusters.
+There are four Network resource types that represent a network attachment to an underlying isolation-domain.
 
 -**Cloud Services Network Resource**: provides VMs/Kubernetes clusters access to cloud services such as DNS, NTP, and user-specified Azure PaaS services. You must create at least one Cloud Services Network (CSN) in each of your Operator Nexus instances. Each CSN can be reused by many VMs and/or tenant clusters.
> If you trigger a corrective action such as Reimage or Replace, you can monitor its status in the Azure portal JSON view under `properties.actionStates` (requires Operator Nexus 2509.1+ and API 2025-07-01-preview+). See [Monitor status in Bare Metal Machine JSON properties](./howto-bare-metal-best-practices.md#monitor-status-in-bare-metal-machine-json-properties).
-
 **Example Azure CLI output**
 
 This example shows a deployment with two currently degraded BMMs (`compute01` and `compute04`), and two cordoned BMMs (`compute02` and `compute04`).