# Past Issues

This document lists past issues that could be of interest if you encounter
issues with the cluster/presubmit.

## Workflows are failing: DNS resolution of github.com fails

### Date: 2025-01-27

### Symptoms

We noticed GitHub jobs were failing, and the logs showed `hostname lookup
failed for github.com`.

### Investigation

The first step was to check the GitHub status page: no outages. We then
checked the internal incident page: no outages either.

Next, we looked at the instance logs.
When a node fails, GCP removes it, meaning its logs are no longer accessible
from the node page, but they can still be retrieved from the global logs.
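
For reference, here is a minimal sketch of pulling those global logs
programmatically with the `google-cloud-logging` Python client; the project ID,
node name, and time window are placeholders, and the same query can be run from
the Logs Explorer.

```python
# Sketch: fetch recent log entries for a node GCP already removed.
# Project ID, node name, and timestamp are placeholders.
from google.cloud import logging

client = logging.Client(project="my-gcp-project")
log_filter = (
    'resource.type="k8s_node" '
    'AND resource.labels.node_name="removed-node-name" '
    'AND timestamp>="2025-01-27T00:00:00Z"'
)
for i, entry in enumerate(client.list_entries(filter_=log_filter, order_by=logging.DESCENDING)):
    print(entry.timestamp, entry.payload)
    if i >= 50:  # only look at the most recent entries
        break
```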

Looking at the logs, we discovered other services failing to resolve
hostnames, like the metrics container.

In Kubernetes, each cluster runs a `kube-dns` service/pod, which other pods
use for DNS requests.
This pod was crashing.
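
A quick way to check this (a sketch, assuming cluster access and the standard
GKE `k8s-app=kube-dns` label) is to list the `kube-dns` pods in `kube-system`
and look at their restart counts, for example with the official Kubernetes
Python client:

```python
# Sketch: list kube-dns pods, print restart counts and last termination reason.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() from inside the cluster
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("kube-system", label_selector="k8s-app=kube-dns")
for pod in pods.items:
    for status in pod.status.container_statuses or []:
        last = status.last_state.terminated
        print(
            pod.metadata.name,
            status.name,
            "restarts:", status.restart_count,
            "last exit:", last.reason if last else None,  # "OOMKilled" confirms an OOM kill
        )
```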

Looking at the node this pod was running on showed RAM usage close to the
VM limit.
At the time, the service instances were running on `e2-small` VMs, which only
have 2GB of RAM.
In addition, we had recently added more runner nodes by creating a new Windows
pool. The larger cluster size caused the cluster management services to need
more resources, which pushed us just above the 2GB limit.

This caused the kube-dns service to be OOM-killed, which in turn caused
various DNS failures in the cluster.

### Solution

Change the machine type of the service pool to `e2-highcpu-4`, doubling the
RAM and CPU budget. We also increased the pool size from 2 to 3.

## LLVM dashboard graphs are empty for presubmits

### Date: 2025-01-28

### Symptoms

The LLVM dashboard was showing empty graphs for the presubmit job runtime and
queue time. Autoscaling graphs were still working.

### Investigation

The graphs were empty because no new metrics were being received, but other
GCP metrics were still showing.
Our dashboard has multiple data sources:
 - the GCP cluster.
 - the metrics container.

Because we still had GCP metrics, the Grafana instance was working, and the
Grafana Alloy component running in the cluster was also fine.

The metrics container was therefore the likely culprit.
We checked its heartbeat metric: `metrics_container_heartbeat`.
This is a simple ping recorded every minute by the container; if this metric
stops being emitted, something is wrong with the job.
This metric was still being recorded.
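
For illustration, this is roughly what such a heartbeat looks like when emitted
with `prometheus_client`; the port and the publishing mechanism are assumptions,
and the real container may publish its metrics differently.

```python
# Illustration only: a once-a-minute heartbeat gauge.
# The port and publishing mechanism are assumptions.
import time
from prometheus_client import Gauge, start_http_server

heartbeat = Gauge("metrics_container_heartbeat", "Timestamp of the last metrics loop")

start_http_server(8000)  # expose /metrics for scraping
while True:
    heartbeat.set_to_current_time()
    time.sleep(60)
```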

A recent change added the Windows version of the premerge check.
This changed the job name, and thus changed the recorded metric names from
`llvm_premerge_checks_linux_run_time` to
`llvm_premerge_checks_premerge_checks_linux_run_time`.
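
A hypothetical sketch of how such a prefix doubling can happen when metric
names are derived from the workflow and job names; the helper and the names
below are illustrative, not the container's actual code.

```python
# Hypothetical: build a metric name from workflow + job names.
def _sanitize(name: str) -> str:
    return name.lower().replace(" ", "_")

def make_metric_name(workflow_name: str, job_name: str, suffix: str) -> str:
    return f"{_sanitize(workflow_name)}_{_sanitize(job_name)}_{suffix}"

# Old job name:
make_metric_name("LLVM Premerge Checks", "Linux", "run_time")
# -> "llvm_premerge_checks_linux_run_time"

# After the rename, the job name repeats the workflow prefix:
make_metric_name("LLVM Premerge Checks", "Premerge Checks Linux", "run_time")
# -> "llvm_premerge_checks_premerge_checks_linux_run_time"
```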

### Solution

Change the dashboards to read the new metric name instead of the previous
name, allowing new data to be shown.
SLO definitions and alerts also had to be adjusted to look at the new metrics.

## LLVM dashboard graphs are empty for run/queue times

### Date: 2025-01-10

### Symptoms

The LLVM dashboard was showing empty graphs for the presubmit job runtime and
queue time. Autoscaling graphs were still working.

### Investigation

Grafana was still recording GCP metrics, but no new data was coming from the
metrics container.

A quick look at the Google Cloud console showed the metrics container pod was
crashing.
Looking at the logs, we saw the script failed to connect to GitHub to get
the workflow status. The reason was a bad GitHub token.
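
A quick way to confirm a bad PAT from a workstation is to call the GitHub API
with it and check the status code; a short sketch, with a placeholder token
value:

```python
# Sketch: a 401 from /user means the token is expired or revoked.
import requests

token = "ghp_placeholder"  # never commit a real token
resp = requests.get(
    "https://api.github.com/user",
    headers={"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"},
    timeout=10,
)
print(resp.status_code)  # 200 = valid, 401 = bad/expired token
```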

Because we have no admin access to the LLVM organization on GitHub, we cannot
issue LLVM-owned tokens. A Googler had used their personal account to set up
a PAT. This token expired in December, causing the metrics container to fail
ever since.

### Solution

Another Googler generated a new token and replaced it in
`Menu > Security > Secrets Manager > llvm-premerge-github-pat`.
Note: this secret is in the general Secret Manager, not in
`Kubernetes Engine > Secrets & ConfigMaps`.
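
The same update can also be scripted with the `google-cloud-secret-manager`
client; this is a sketch, and the project ID and token value are placeholders.

```python
# Sketch: add a new version of the PAT secret (placeholders for project/token).
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
parent = "projects/my-gcp-project/secrets/llvm-premerge-github-pat"
client.add_secret_version(
    request={"parent": parent, "payload": {"data": b"ghp_placeholder"}}
)
```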

Once the secret was updated, the metrics container had to be restarted:
- `Menu > Kubernetes Engine > Workloads`
- select `metrics`.
- click `Rolling update`.
- set all thresholds to `100%`, then click update.

This allows GCP to delete the only metrics container pod and recreate it
using the new secret value.
Because we have a single metrics container instance running, we have to set
all thresholds to `100%`.
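
If console access is not handy, a rollout restart can also be triggered with
the Kubernetes Python client, which is what `kubectl rollout restart` does
under the hood; the deployment name and namespace below are assumptions based
on the workload name above.

```python
# Sketch: trigger a rollout restart of the metrics Deployment.
# Deployment name and namespace are assumptions.
import datetime
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    # kubectl uses this annotation to force a rolling restart
                    "kubectl.kubernetes.io/restartedAt":
                        datetime.datetime.now(datetime.timezone.utc).isoformat()
                }
            }
        }
    }
}
apps.patch_namespaced_deployment(name="metrics", namespace="default", body=patch)
```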

In addition, we added a heartbeat metric to the container, and Grafana
alerting to make sure we detect this kind of failure early.