
Commit e985900

Add 4519 to v120 upgrade known issue (#466)
Signed-off-by: Jian Wang <[email protected]>
1 parent 216c41a commit e985900

3 files changed: +224 −7 lines

docs/upgrade/v1-1-2-to-v1-2-0.md

Lines changed: 112 additions & 3 deletions
@@ -298,7 +298,7 @@ If you notice the upgrade is stuck in the **Upgrading System Service** state for
1. Check if the `prometheus-rancher-monitoring-prometheus-0` pod is stuck with the status `Terminating`.

```
-$ kubectl -n cattle-monitoring-system get pods
+$ kubectl -n cattle-monitoring-system get pods
NAME                                          READY   STATUS        RESTARTS   AGE
prometheus-rancher-monitoring-prometheus-0    0/3     Terminating   0          19d
```
@@ -399,15 +399,16 @@ If an upgrade is stuck in an `Upgrading System Service` state for an extended pe

---
-### 8. The `registry.suse.com/harvester-beta/vmdp:latest` image is not available in airgapped environment
+### 8. The `registry.suse.com/harvester-beta/vmdp:latest` image is not available in an air-gapped environment

Harvester does not package the `registry.suse.com/harvester-beta/vmdp:latest` image in the ISO file as of v1.1.0. Windows VMs created before v1.1.0 use this image as a container disk, but the kubelet may remove old images to free up disk space. Once the image is removed, Windows VMs in an air-gapped environment can no longer pull it. You can fix this issue by changing the image to `registry.suse.com/suse/vmdp/vmdp:2.5.4.2` and restarting the Windows VMs.
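
A minimal sketch of that change, assuming a Windows VM named `win10-example` in the `default` namespace (all names are illustrative):

```
# Point the VM's VMDP container disk at the packaged image.
kubectl -n default edit vm win10-example

# In the editor, update the image under spec.template.spec.volumes, for example:
#   - name: vmdp                  # the volume name may differ in your VM
#     containerDisk:
#       image: registry.suse.com/suse/vmdp/vmdp:2.5.4.2

# Afterwards, restart the VM (for example, from the Harvester UI) so the
# updated container disk is used.
```
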
- Related issue:
  - [[BUG] VMDP Image wrong after upgrade to Harvester 1.2.0](https://github.com/harvester/harvester/issues/4534)

---
-### 9. Upgrade stuck in the Post-draining state
+
+### 9. An upgrade is stuck in the Post-draining state

The node might be stuck in the OS upgrade process if you encounter the **Post-draining** state, as shown below.
@@ -483,3 +484,111 @@ After performing the steps above, you should pass post-draining with the next re
- [A potential bug in NewElementalPartitionsFromList which caused upgrade error code 33](https://github.com/rancher/elemental-toolkit/issues/1827)
- Workaround:
  - https://github.com/harvester/harvester/issues/4526#issuecomment-1732853216

---

### 10. An upgrade is stuck in the Upgrading System Service state due to the `customer provided SSL certificate without IP SAN` error in `fleet-agent`

If an upgrade is stuck in the **Upgrading System Service** state for an extended period, follow these steps to investigate the issue:

1. Find the pods related to the upgrade:

```
kubectl get pods -A | grep upgrade
```

Example output:

```
# kubectl get pods -A | grep upgrade
cattle-system      system-upgrade-controller-5685d568ff-tkvxb   1/1   Running   0   85m
harvester-system   hvst-upgrade-vq4hl-apply-manifests-65vv8     1/1   Running   0   87m   // waiting for managedchart to be ready
...
```

2. The pod `hvst-upgrade-vq4hl-apply-manifests-65vv8` logs the following messages in a loop:

```
Current version: 102.0.0+up40.1.2, Current state: WaitApplied, Current generation: 23
Sleep for 5 seconds to retry
```

3. Check the status of all bundles. Note that a couple of bundles are `OutOfSync`:

```
# kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
...
fleet-local   mcc-local-managed-system-upgrade-controller   1/1
fleet-local   mcc-rancher-logging                           0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-logging-crd                       0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-monitoring                        0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-monitoring-crd                    0/1                       WaitApplied(1) [Cluster fleet-local/local]
```
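
To see why a particular bundle is out of sync, you can inspect its status conditions, for example:

```
# Bundle name taken from the listing above.
kubectl -n fleet-local get bundle mcc-rancher-monitoring -o yaml
# Look at .status.conditions for the reason the bundle is not applied.
```
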
4. The `fleet-agent-*` pod has the following error log:

```
fleet-agent pod log:

time="2023-09-19T12:18:10Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.122.199/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.122.199 because it doesn't contain any IP SANs"
```
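
The log can be pulled with a command along these lines (a sketch; the `app=fleet-agent` label and the `cattle-fleet-local-system` namespace are the usual Fleet defaults and may differ in your setup):

```
kubectl -n cattle-fleet-local-system logs -l app=fleet-agent --tail=50
```
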
5. Check the `ssl-certificates` setting in Harvester:

From the command line:

```
# kubectl get settings.harvesterhci.io ssl-certificates
NAME               VALUE
ssl-certificates   {"publicCertificate":"-----BEGIN CERTIFICATE-----\nMIIFNDCCAxygAwIBAgIUS7DoHthR/IR30+H/P0pv6HlfOZUwDQYJKoZIhvcNAQEL\nBQAwFjEUMBIGA1UEAwwLZXhhbXBsZS5j...."}
```

From the Harvester Web UI:

![](/img/v1.2/upgrade/known_issues/4519-harvester-settings-ssl-certificates.png)
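
To confirm that this certificate carries only DNS entries and no IP SAN, you can decode it from the setting and print its Subject Alternative Name extension (a sketch; it assumes `jq` is available and OpenSSL 1.1.1 or later):

```
kubectl get settings.harvesterhci.io ssl-certificates -o jsonpath='{.value}' \
  | jq -r '.publicCertificate' \
  | openssl x509 -noout -ext subjectAltName
# Only DNS entries in the output means the certificate has no IP SAN,
# which matches the fleet-agent error above.
```
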
6. Check the `server-url` setting; its value is the VIP:

```
# kubectl get settings.management.cattle.io -n cattle-system server-url
NAME         VALUE
server-url   https://192.168.122.199
```

7. Identify the root cause:

The user configured a self-signed certificate that covers only the FQDN in the `ssl-certificates` setting, while `server-url` still points to the VIP, so the `fleet-agent` pod fails to register.

```
# For example, a self-signed certificate created for example.com and *.example.com:
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
  -keyout example.key -out example.crt -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:*.example.com"

# The outputs are example.crt and example.key.
```

8. Apply the workaround:

Update `server-url` to `https://harv31.example.com`, an FQDN covered by the certificate:

```
# kubectl edit settings.management.cattle.io -n cattle-system server-url
setting.management.cattle.io/server-url edited
...

# kubectl get settings.management.cattle.io -n cattle-system server-url
NAME         VALUE
server-url   https://harv31.example.com
```
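
The same change can be made non-interactively; a sketch, assuming the certificate covers `harv31.example.com`:

```
kubectl -n cattle-system patch settings.management.cattle.io server-url \
  --type merge -p '{"value": "https://harv31.example.com"}'
```
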
After the workaround is applied, Rancher automatically replaces the `fleet-agent` pod, the pod registers successfully, and the upgrade continues.

- Related issue:
  - [[BUG] Upgrade to Harvester 1.2.0 fails in fleet-agent due to customer provided SSL certificate without IP SAN](https://github.com/harvester/harvester/issues/4519)
- Workaround:
  - https://github.com/harvester/harvester/issues/4519#issuecomment-1727132383

---
The new screenshot `4519-harvester-settings-ssl-certificates.png` (123 KB; binary file not shown).

versioned_docs/version-v1.2/upgrade/v1-1-2-to-v1-2-0.md

Lines changed: 112 additions & 4 deletions
@@ -298,7 +298,7 @@ If you notice the upgrade is stuck in the **Upgrading System Service** state for
1. Check if the `prometheus-rancher-monitoring-prometheus-0` pod is stuck with the status `Terminating`.

```
-$ kubectl -n cattle-monitoring-system get pods
+$ kubectl -n cattle-monitoring-system get pods
NAME                                          READY   STATUS        RESTARTS   AGE
prometheus-rancher-monitoring-prometheus-0    0/3     Terminating   0          19d
```
@@ -330,7 +330,7 @@ If you notice the upgrade is stuck in the **Upgrading System Service** state for

---
-### 7. Upgrade stuck in the `Upgrading System Service` state
+### 7. An upgrade is stuck in the `Upgrading System Service` state

If an upgrade is stuck in an `Upgrading System Service` state for an extended period, some system services' certificates may have expired. To investigate and resolve this issue, follow these steps:
@@ -399,15 +399,16 @@ If an upgrade is stuck in an `Upgrading System Service` state for an extended pe

---
-### 8. The `registry.suse.com/harvester-beta/vmdp:latest` image is not available in airgapped environment
+### 8. The `registry.suse.com/harvester-beta/vmdp:latest` image is not available in an air-gapped environment

Harvester does not package the `registry.suse.com/harvester-beta/vmdp:latest` image in the ISO file as of v1.1.0. Windows VMs created before v1.1.0 use this image as a container disk, but the kubelet may remove old images to free up disk space. Once the image is removed, Windows VMs in an air-gapped environment can no longer pull it. You can fix this issue by changing the image to `registry.suse.com/suse/vmdp/vmdp:2.5.4.2` and restarting the Windows VMs.

- Related issue:
  - [[BUG] VMDP Image wrong after upgrade to Harvester 1.2.0](https://github.com/harvester/harvester/issues/4534)

---
-### 9. Upgrade stuck in the Post-draining state
+
+### 9. An upgrade is stuck in the Post-draining state

The node might be stuck in the OS upgrade process if you encounter the **Post-draining** state, as shown below.
@@ -484,3 +485,110 @@ After performing the steps above, you should pass post-draining with the next re
- Workaround:
  - https://github.com/harvester/harvester/issues/4526#issuecomment-1732853216

---

### 10. An upgrade is stuck in the Upgrading System Service state due to the `customer provided SSL certificate without IP SAN` error in `fleet-agent`

If an upgrade is stuck in the **Upgrading System Service** state for an extended period, follow these steps to investigate the issue:

1. Find the pods related to the upgrade:

```
kubectl get pods -A | grep upgrade
```

Example output:

```
# kubectl get pods -A | grep upgrade
cattle-system      system-upgrade-controller-5685d568ff-tkvxb   1/1   Running   0   85m
harvester-system   hvst-upgrade-vq4hl-apply-manifests-65vv8     1/1   Running   0   87m   // waiting for managedchart to be ready
...
```

2. The pod `hvst-upgrade-vq4hl-apply-manifests-65vv8` logs the following messages in a loop:

```
Current version: 102.0.0+up40.1.2, Current state: WaitApplied, Current generation: 23
Sleep for 5 seconds to retry
```

3. Check the status of all bundles. Note that a couple of bundles are `OutOfSync`:

```
# kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
...
fleet-local   mcc-local-managed-system-upgrade-controller   1/1
fleet-local   mcc-rancher-logging                           0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-logging-crd                       0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-monitoring                        0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-monitoring-crd                    0/1                       WaitApplied(1) [Cluster fleet-local/local]
```

4. The `fleet-agent-*` pod has the following error log:

```
fleet-agent pod log:

time="2023-09-19T12:18:10Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.122.199/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.122.199 because it doesn't contain any IP SANs"
```

5. Check the `ssl-certificates` setting in Harvester:

From the command line:

```
# kubectl get settings.harvesterhci.io ssl-certificates
NAME               VALUE
ssl-certificates   {"publicCertificate":"-----BEGIN CERTIFICATE-----\nMIIFNDCCAxygAwIBAgIUS7DoHthR/IR30+H/P0pv6HlfOZUwDQYJKoZIhvcNAQEL\nBQAwFjEUMBIGA1UEAwwLZXhhbXBsZS5j...."}
```

From the Harvester Web UI:

![](/img/v1.2/upgrade/known_issues/4519-harvester-settings-ssl-certificates.png)

6. Check the `server-url` setting; its value is the VIP:

```
# kubectl get settings.management.cattle.io -n cattle-system server-url
NAME         VALUE
server-url   https://192.168.122.199
```

7. Identify the root cause:

The user configured a self-signed certificate that covers only the FQDN in the `ssl-certificates` setting, while `server-url` still points to the VIP, so the `fleet-agent` pod fails to register.

```
# For example, a self-signed certificate created for example.com and *.example.com:
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
  -keyout example.key -out example.crt -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:*.example.com"

# The outputs are example.crt and example.key.
```

8. Apply the workaround:

Update `server-url` to `https://harv31.example.com`, an FQDN covered by the certificate:

```
# kubectl edit settings.management.cattle.io -n cattle-system server-url
setting.management.cattle.io/server-url edited
...

# kubectl get settings.management.cattle.io -n cattle-system server-url
NAME         VALUE
server-url   https://harv31.example.com
```

After the workaround is applied, Rancher automatically replaces the `fleet-agent` pod, the pod registers successfully, and the upgrade continues.

- Related issue:
  - [[BUG] Upgrade to Harvester 1.2.0 fails in fleet-agent due to customer provided SSL certificate without IP SAN](https://github.com/harvester/harvester/issues/4519)
- Workaround:
  - https://github.com/harvester/harvester/issues/4519#issuecomment-1727132383

---
