Commit d79c4a3

Update k8s compute TSG and log info
1 parent 7f38428 commit d79c4a3

1 file changed: +1 -34 lines changed

articles/machine-learning/reference-kubernetes.md

Lines changed: 1 addition & 34 deletions
@@ -103,40 +103,7 @@ For AzureML extension deployment on ARO or OCP cluster, grant privileged access
 > * `{EXTENSION-NAME}`: is the extension name specified with the `az k8s-extension create --name` CLI command.
 >* `{KUBERNETES-COMPUTE-NAMESPACE}`: is the namespace of the Kubernetes compute specified when attaching the compute to the Azure Machine Learning workspace. Skip configuring `system:serviceaccount:{KUBERNETES-COMPUTE-NAMESPACE}:default` if `KUBERNETES-COMPUTE-NAMESPACE` is `default`.
 
-## AzureML extension components
-
-For Arc-connected cluster, AzureML extension deployment will create [Azure Relay](../azure-relay/relay-what-is-it.md) in Azure cloud, used to route traffic between Azure services and the Kubernetes cluster. For AKS cluster without Arc connected, Azure Relay resource won't be created.
-
-Upon AzureML extension deployment completes, it will create following resources in Kubernetes cluster, depending on each AzureML extension deployment scenario:
-
-|Resource name |Resource type |Training |Inference |Training and Inference| Description | Communication with cloud|
-|--|--|--|--|--|--|--|
-|relayserver|Kubernetes deployment|**✓**|**✓**|**✓**|relay server is only needed in arc-connected cluster, and won't be installed in AKS cluster. Relay server works with Azure Relay to communicate with the cloud services.|Receive the request of job creation, model deployment from cloud service; sync the job status with cloud service.|
-|gateway|Kubernetes deployment|**✓**|**✓**|**✓**|The gateway is used to communicate and send data back and forth.|Send nodes and cluster resource information to cloud services.|
-|aml-operator|Kubernetes deployment|**✓**|N/A|**✓**|Manage the lifecycle of training jobs.| Token exchange with the cloud token service for authentication and authorization of Azure Container Registry.|
-|metrics-controller-manager|Kubernetes deployment|**✓**|**✓**|**✓**|Manage the configuration for Prometheus|N/A|
-|{EXTENSION-NAME}-kube-state-metrics|Kubernetes deployment|**✓**|**✓**|**✓**|Export the cluster-related metrics to Prometheus.|N/A|
-|{EXTENSION-NAME}-prometheus-operator|Kubernetes deployment|Optional|Optional|Optional| Provide Kubernetes native deployment and management of Prometheus and related monitoring components.|N/A|
-|amlarc-identity-controller|Kubernetes deployment|N/A|**✓**|**✓**|Request and renew Azure Blob/Azure Container Registry token through managed identity.|Token exchange with the cloud token service for authentication and authorization of Azure Container Registry and Azure Blob used by inference/model deployment.|
-|amlarc-identity-proxy|Kubernetes deployment|N/A|**✓**|**✓**|Request and renew Azure Blob/Azure Container Registry token through managed identity.|Token exchange with the cloud token service for authentication and authorization of Azure Container Registry and Azure Blob used by inference/model deployment.|
-|azureml-fe-v2|Kubernetes deployment|N/A|**✓**|**✓**|The front-end component that routes incoming inference requests to deployed services.|Send service logs to Azure Blob.|
-|inference-operator-controller-manager|Kubernetes deployment|N/A|**✓**|**✓**|Manage the lifecycle of inference endpoints. |N/A|
-|volcano-admission|Kubernetes deployment|Optional|N/A|Optional|Volcano admission webhook.|N/A|
-|volcano-controllers|Kubernetes deployment|Optional|N/A|Optional|Manage the lifecycle of Azure Machine Learning training job pods.|N/A|
-|volcano-scheduler |Kubernetes deployment|Optional|N/A|Optional|Used to perform in-cluster job scheduling.|N/A|
-|fluent-bit|Kubernetes daemonset|**✓**|**✓**|**✓**|Gather the components' system log.| Upload the components' system log to cloud.|
-|{EXTENSION-NAME}-dcgm-exporter|Kubernetes daemonset|Optional|Optional|Optional|dcgm-exporter exposes GPU metrics for Prometheus.|N/A|
-|nvidia-device-plugin-daemonset|Kubernetes daemonset|Optional|Optional|Optional|nvidia-device-plugin-daemonset exposes GPUs on each node of your cluster| N/A|
-|prometheus-prom-prometheus|Kubernetes statefulset|**✓**|**✓**|**✓**|Gather and send job metrics to cloud.|Send job metrics like cpu/gpu/memory utilization to cloud.|
-
-> [!IMPORTANT]
-> * Azure Relay resource is under the same resource group as the Arc cluster resource. It is used to communicate with the Kubernetes cluster and modifying them will break attached compute targets.
-> * By default, the kubernetes deployment resources are randomly deployed to 1 or more nodes of the cluster, and daemonset resources are deployed to ALL nodes. If you want to restrict the extension deployment to specific nodes, use `nodeSelector` configuration setting described as below.
-
-> [!NOTE]
-> * **{EXTENSION-NAME}:** is the extension name specified with ```az k8s-extension create --name``` CLI command.
-
-### Collected log details
+## Collected log details
 
 Some logs about AzureML workloads in the cluster will be collected through extension components, such as status, metrics, life cycle, etc. The following list shows all the log details collected, including the type of logs collected and where they were sent to or stored.
