Update k8s compute TSG and log info

jiaochenlu · jiaochenlu · commit fce59e0a15c7 · 2023-01-02T20:31:49.000+08:00
diff --git a/articles/machine-learning/how-to-deploy-kubernetes-extension.md b/articles/machine-learning/how-to-deploy-kubernetes-extension.md
@@ -46,7 +46,7 @@ In this article, you can learn:
 - Azure Machine Learning does not guarantee support for all preview stage features in AKS. For example, [Azure AD pod identity](../aks/use-azure-ad-pod-identity.md) is not supported.
 - If you've previously followed the steps from [AzureML AKS v1 document](./v1/how-to-create-attach-kubernetes.md) to create or attach your AKS as inference cluster, use the following link to [clean up the legacy azureml-fe related resources](./v1/how-to-create-attach-kubernetes.md#delete-azureml-fe-related-resources) before you continue the next step.
 - We currently don't support attaching your AKS cluster across subscription, which means that your AKS cluster must be in the same subscription as your workspace. 
-   - The workaround to meet your cross-subscription needs is to first connect AKS to Azure-ARC and then attach this ARC-Kubernetes resource.
+   - To work around this cross-subscription limitation, first connect your AKS cluster to Azure Arc, and then attach the resulting Arc-Kubernetes resource.
 
 ## Review AzureML extension configuration settings
 
diff --git a/articles/machine-learning/how-to-troubleshoot-kubernetes-compute.md b/articles/machine-learning/how-to-troubleshoot-kubernetes-compute.md
@@ -99,7 +99,7 @@ The compute information is invalid.
 There is a compute target validation process when deploying models to your Kubernetes cluster. This error should occur when the compute information is invalid when validating, for example the compute target is not found, or the configuration of Azure Machine Learning extension has been updated in your Kubernetes cluster. 
 
 You can check the following items to troubleshoot the issue:
-* Check whether the compute target you used is correct and exsiting in your workspace.
+* Check whether the compute target you used is correct and exists in your workspace.
 * Try to detach and reattach the compute to the workspace. Pay attention to more notes on [reattach](#error-genericcomputeerror).
 
 #### ERROR: InvalidComputeNoKubernetesConfiguration
@@ -112,7 +112,7 @@ The compute kubeconfig is invalid.
 
 This error should occur when the system failed to find any configuration to connect to cluster, such as:
 * For Arc-Kubernetes cluster, there is no Azure Relay configuration can be found.
-* For AKS cluster, there is no AKS configuraiton can be found.
+* For an AKS cluster, no AKS configuration can be found.
 
 To rebuild the configuration of compute connection in your cluster, you can try to detach and reattach the compute to the workspace. Pay attention to more notes on [reattach](#error-genericcomputeerror).
 
@@ -173,11 +173,11 @@ Cannot found Kubernetes cluster.
 This error should occur when the system cannot find the AKS/Arc-Kubernetes cluster.
 
 You can check the following items to troubleshoot the issue:
-* First, check the cluster resource ID in the Azure Portal to verify whether Kubernetes cluster resource still exist and is running normally.
+* First, check the cluster resource ID in the Azure portal to verify whether the Kubernetes cluster resource still exists and is running normally.
 * If the cluster exists and is running, then you can try to detach and reattach the compute to the workspace. Pay attention to more notes on [reattach](#error-genericcomputeerror).
 
 > [!TIP]
-   > More troubleshoot guide of common errors when creating/updating the Kubernetes online endpoints and deployments, you can find in [How to troubleshoot online endpoints](#how-to-troubleshoot-online-endpoints.md).
+   > For more guidance on common errors when creating or updating Kubernetes online endpoints and deployments, see [How to troubleshoot online endpoints](how-to-troubleshoot-online-endpoints.md).
 
 
 ## Training guide
diff --git a/articles/machine-learning/how-to-troubleshoot-online-endpoints.md b/articles/machine-learning/how-to-troubleshoot-online-endpoints.md
@@ -485,32 +485,32 @@ Below is a list of reasons you might run into this error when creating/updating
 
 ### ERROR: EndpointNotFound
 
-The reason you might run into this error when creating/updating a Kubernetes online deployments is because the system can't find the endpoint resource for the deployment in the cluster. You should create the deployment in a exist endpoint or create this endpoint first in your cluster.
+You might run into this error when creating/updating Kubernetes online deployments because the system can't find the endpoint resource for the deployment in the cluster. Create the deployment in an existing endpoint, or create the endpoint in your cluster first.
 
 ### ERROR: ValidateScoringFailed
 
-The reason you might run into this error when creating/updating a Kubernetes online deployments is because the scoring request URL validation failed when processing the model deploying. 
+You might run into this error when creating/updating Kubernetes online deployments because the scoring request URL validation failed during model deployment.
 
 In this case, you can first check the endpoint URL and then try to re-deploy the deployment.
 
 ### ERROR: InvalidDeploymentSpec
 
-The reason you might run into this error when creating/updating a Kubernetes online deployments is because the deployment spec is invalid.
+You might run into this error when creating/updating Kubernetes online deployments because the deployment spec is invalid.
 
 In this case, you can check the error message.
 * Make sure the `instance count` is valid.
 * If you have enabled auto scaling, make sure the `minimum instance count` and `maximum instance count` are both valid.
 
 ### ERROR: ImagePullLoopBackOff
 
-The reason you might run into this error when creating/updating a Kubernetes online deployments is because the images can't be downloaded from the container registry, resulting in the images pull failure. <message>
+You might run into this error when creating/updating Kubernetes online deployments because the images can't be downloaded from the container registry, which results in an image pull failure.
 
 In this case, you can check the cluster network policy and the workspace container registry if cluster can pull image from the container registry.
 
 ### ERROR: KubernetesCrashLoopBackOff
 
 Below is a list of reasons you might run into this error when creating/updating the Kubernetes online endpoints/deployments:
-* One or more pod(s) stuck in CrashLoopBackoff status, you can check if the deployment log exist, and check if there are error messgaes in the log.
+* One or more pods are stuck in CrashLoopBackoff status. Check whether the deployment log exists and whether it contains error messages.
 * There is an error in `score.py` and the container crashed when init your score code, please following [ERROR: ResourceNotReady](#error-resourcenotready) part. 
 * Your scoring process needs more memory that your deployment config limit is insufficient, you can try to update the deployment with a larger memory limit. 
 
@@ -523,7 +523,7 @@ Below is a list of reasons you might run into this error when creating/updating
 To mitigate this error, refer to the following steps: 
 * Check the `node selector` definition of the `instance type` you used, and `node label` configuration of your cluster nodes. 
 * Check `instance type` and the node SKU size for AKS cluster or the node resource for Arc-Kubernetes cluster.
-  * If the cluster is under-resourced, you can reduce the instance type resource requirement or use the anohter instance type with smaller resource required. 
+  * If the cluster is under-resourced, you can reduce the instance type resource requirement or use another instance type that requires fewer resources. 
 * If the cluster has no more resource to meet the requirement of the deployment, delete some deployment to release resources.
 
 
diff --git a/articles/machine-learning/reference-kubernetes.md b/articles/machine-learning/reference-kubernetes.md
@@ -147,7 +147,7 @@ Some logs about AzureML workloads in the cluster, such as status, metrics, life
 |aml-operator	| Manage the lifecycle of training jobs.	|The logs contain AML training job pod status in the cluster.|
 |azureml-fe-v2|	The front-end component that routes incoming inference requests to deployed services.	|Access logs at request level, including request Id, start time, response code, error details and durations for request latency. Trace logs for service metadata changes, service running healthy status, etc. for debugging purpose.|
 | gateway	| The gateway is used to communicate and send data back and forth.	| Trace logs on requests from AML services to the clusters.|
-|healthcheck	|--| 	The logs contain azureml namespace resource (AML extension) status to diagnostic what make the extension not functional. |
+|healthcheck	|--| 	The logs contain azureml namespace resource (AML extension) status to diagnose what makes the extension non-functional. |
 |inference-operator-controller-manager|	Manage the lifecycle of inference endpoints.	|The logs contain AML inference endpoint and deployment pod status in the cluster.|
 | metrics-controller-manager	| Manage the configuration for Prometheus.|Trace logs for status of uploading training job and inference  deployment metrics on CPU utilization and memory utilization.|
 | relayserver	| relayserver is only needed in arc-connected cluster and will not be installed in AKS cluster.| Relayserver works with Azure Relay to communicate with the cloud services.	The logs contain request level info from Azure relay.  |