Commit cd78185
Matt Pryor and m-bull authored
Documentation for debugging CRD-based Zenith (azimuth-cloud#145)
* Documentation for debugging CRD-based Zenith
* Add link to CRD docs
* Small tweak to wording
* And another
* More small tweaks
* Update docs/debugging/zenith-services.md (review suggestions, applied over several commits)

Co-authored-by: Matt Anson <[email protected]>
1 parent aab342d commit cd78185

1 file changed

docs/debugging/zenith-services.md

Lines changed: 130 additions & 44 deletions
````diff
@@ -20,29 +20,74 @@ persist, try restarting the Zenith SSHD:
 kubectl -n azimuth rollout restart deployment/zenith-server-sshd
 ```
 
-## Client not registered in Consul
+## Client not appearing in Zenith CRDs
 
-Once a client has connected to SSHD successfully, it should get registered in
-[Consul](https://www.consul.io/).
+The components of Zenith communicate using three [Kubernetes CRDs](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/):
 
-To determine if this is the case, it is useful to access the Consul UI. As discussed
-in [Monitoring and alerting](../configuration/14-monitoring.md), the Consul UI
-is exposed as `consul.<ingress base domain>`, e.g. `consul.azimuth.example.org`,
-and is protected by a username and password.
+* `services.zenith.stackhpc.com`
+  A reserved domain and associated SSH public key.
+* `endpoints.zenith.stackhpc.com`
+  The current endpoints for a Zenith service.
+  This resource is updated with the address, port and configuration of each Zenith SSH tunnel as it is created.
+* `leases.zenith.stackhpc.com`
+  Heartbeat information for an individual SSH tunnel.
+  Each Zenith SSH tunnel has its own lease resource that is regularly updated with a heartbeat.
````
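If none of these resources appear at all, it is worth confirming that the Zenith CRDs are actually installed before debugging an individual service. A minimal check, assuming `kubectl` is pointed at the cluster running Zenith:

```sh
# List the resource types registered under the Zenith API group.
# All three kinds (services, endpoints, leases) should appear.
kubectl api-resources --api-group=zenith.stackhpc.com
```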
````diff
 
-The default view shows Consul's view of the services, where you can check if the
-service is being registered correctly.
+If a Zenith service is not functioning as expected, check the state of the CRDs for
+that service.
 
-Clients not registering correctly in Consul usually indicates an issue with Consul
-itself. Further information for debugging Consul issues is provided in
-[Debugging Consul](consul.md).
+First, check that the service exists and has an SSH key associated:
 
-If the issue persists once Consul issues are ruled out, try restarting SSHD:
+```command title="On the K3s node, targeting the HA cluster if deployed"
+$ kubectl -n zenith-services get services.zenith
+NAME                                   FINGERPRINT                                   AGE
+igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn   WLo15SbKRadA5q1WIn6dToWT4Q+j05rZ5T+Zc/so4M0   13m
+sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31   G6sdXwUfvdlosCB2yi40TEf5//ie2bgCxytrig4xpTA   13m
+```
````
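The FINGERPRINT column can be compared against the public key the client is configured with. Assuming it is the unpadded SHA256 fingerprint of that key (an assumption, and the key path below is hypothetical), `ssh-keygen` prints the same value:

```sh
# Print the key's fingerprint; the base64 part after "SHA256:" should match the
# FINGERPRINT column above (hypothetical path to the client's public key).
ssh-keygen -lf /path/to/zenith-client-key.pub
```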
````diff
 
-```sh title="On the K3s node, targeting the HA cluster if deployed"
-kubectl -n azimuth rollout restart deployment/zenith-server-sshd
+Next, check that there is at least one lease for the service and verify that it is being
+regularly renewed:
+
+```command title="On the K3s node, targeting the HA cluster if deployed"
+$ kubectl -n zenith-services get leases.zenith
+NAME                                         RENEWED   TTL   REAP AFTER   AGE
+sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-fn75d   7s        20    120          13m
+igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-7tnqm   5s        20    120          14m
 ```
````
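To confirm a lease really is being renewed, rather than having simply been created recently, it can be watched for a short while; the RENEWED value should keep dropping back to a few seconds, well inside the TTL. A quick sketch:

```sh
# Watch the leases and check that RENEWED repeatedly resets for the tunnel in question.
kubectl -n zenith-services get leases.zenith --watch
```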
````diff
 
+Finally, check that the endpoints are registered correctly in the endpoints resource:
+
+```command title="On the K3s node, targeting the HA cluster if deployed"
+$ kubectl -n zenith-services get endpoints.zenith igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn -o yaml
+apiVersion: zenith.stackhpc.com/v1alpha1
+kind: Endpoints
+metadata:
+  creationTimestamp: "2024-05-01T13:24:12Z"
+  generation: 3
+  name: igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn
+  namespace: zenith-services
+  ownerReferences:
+  - apiVersion: zenith.stackhpc.com/v1alpha1
+    blockOwnerDeletion: true
+    kind: Service
+    name: igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn
+    uid: 378b39dc-9fce-4553-865e-edad2dd8d8b0
+  resourceVersion: "7260"
+  uid: 62badd6e-a5a7-452d-b346-04c2efd75a6c
+spec:
+  endpoints:
+    7tnqm:
+      address: 10.42.0.71
+      config:
+        backend-protocol: http
+        skip-auth: false
+      port: 42109
+      status: passing
+```
+
+The address should be the pod IP of a Zenith SSHD pod, and the port should be the allocated port
+reported by the client.
+
````
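To cross-check the `address`, list the SSHD pods with their pod IPs; it should match one of them. A sketch, using the `zenith-server-sshd` deployment referenced earlier in this guide:

```sh
# Show the Zenith SSHD pods along with their pod IPs (-o wide adds the IP column).
kubectl -n azimuth get pods -o wide | grep zenith-server-sshd
```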
````diff
 ## OIDC credentials not created
 
 Keycloak OIDC credentials for Zenith services for platforms deployed using Azimuth are created
````
````diff
@@ -67,45 +112,87 @@ the identity operator:
 kubectl -n azimuth rollout restart deployment/azimuth-identity-operator
 ```
 
+If this doesn't work, check the logs for errors:
+
+```sh title="On the K3s node, targeting the HA cluster if deployed"
+kubectl -n azimuth logs deployment/azimuth-identity-operator [-f]
+```
+
 ## Kubernetes resources for the Zenith service have not been created
 
-If the service exists in Consul, it is possible that the process that synchronises Consul
-services with Kubernetes resources is not functioning correctly. To check if Kubernetes
-resources are being created, run the following command and check that the `Ingress`,
-`Service` and `Endpoints` resources have been created for the service:
+If the CRDs for the service look correct, it is possible that the component that watches
+the Zenith CRDs and creates the Kubernetes ingress for those services is not functioning
+correctly.
+
+This component creates Helm releases to deploy the resources for a service, so first check
+that a Helm release exists for the service and is in the `deployed` state:
+
+```sh title="On the K3s node, targeting the HA cluster if deployed"
+$ helm -n zenith-services list -a
+NAME                                   NAMESPACE         REVISION   UPDATED                                   STATUS     CHART                           APP VERSION
+igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn   zenith-services   1          2024-05-01 13:24:13.36622944 +0000 UTC    deployed   zenith-service-0.1.0+846a545e   main
+sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31   zenith-services   1          2024-05-01 13:24:41.219330845 +0000 UTC   deployed   zenith-service-0.1.0+846a545e   main
+```
````
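If a release is missing, or is in a state such as `failed` or `pending-install` rather than `deployed`, Helm can usually say why. A sketch using one of the release names from the listing above:

```sh
# Show the current state of the release, including any notes from the last action.
helm -n zenith-services status igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn

# Show the revision history, including the description attached to any failed revision.
helm -n zenith-services history igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn
```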
````diff
+
+Also check the state of the `Ingress`, `Service` and `EndpointSlice`s for the service:
 
 ```command title="On the K3s node, targeting the HA cluster if deployed"
-$ kubectl -n zenith-services get ingress,service,endpoints
-NAME                                                              CLASS   HOSTS                                                      ADDRESS         PORTS     AGE
-ingress.networking.k8s.io/cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g     nginx   cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g.azimuth.example.org    96.241.100.96   80, 443   2d
-ingress.networking.k8s.io/i03xvflgk1zmtcsdm2x5z5lz9qz05027euw     nginx   i03xvflgk1zmtcsdm2x5z5lz9qz05027euw.azimuth.example.org    96.241.100.96   80, 443   2d
-ingress.networking.k8s.io/pxmvy7235x2ggfvf2op615gvz2v59wkqglc     nginx   pxmvy7235x2ggfvf2op615gvz2v59wkqglc.azimuth.example.org    96.241.100.96   80, 443   2d
-ingress.networking.k8s.io/txn3zidfdnru5rg109voh848n51rvicmr1s     nginx   txn3zidfdnru5rg109voh848n51rvicmr1s.azimuth.example.org    96.241.100.96   80, 443   2d
-
-NAME                                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
-service/cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g   ClusterIP   172.27.10.109    <none>        80/TCP    2d
-service/i03xvflgk1zmtcsdm2x5z5lz9qz05027euw   ClusterIP   172.30.203.227   <none>        80/TCP    2d
-service/pxmvy7235x2ggfvf2op615gvz2v59wkqglc   ClusterIP   172.28.51.148    <none>        80/TCP    2d
-service/txn3zidfdnru5rg109voh848n51rvicmr1s   ClusterIP   172.29.136.245   <none>        80/TCP    2d
-
-NAME                                            ENDPOINTS                                                      AGE
-endpoints/cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g   172.18.152.99:34665                                            2d
-endpoints/i03xvflgk1zmtcsdm2x5z5lz9qz05027euw   172.18.152.99:39409                                            2d
-endpoints/pxmvy7235x2ggfvf2op615gvz2v59wkqglc   172.18.152.99:44761,172.18.152.99:36483,172.18.152.99:46449    2d
-endpoints/txn3zidfdnru5rg109voh848n51rvicmr1s   172.18.152.99:45379                                            2d
+$ kubectl -n zenith-services get ingress,service,endpointslice
+NAME                                                                    CLASS   HOSTS                                                              ADDRESS        PORTS     AGE
+ingress.networking.k8s.io/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc     nginx   igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn.apps.45-135-57-238.sslip.io   192.168.3.49   80, 443   25m
+ingress.networking.k8s.io/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn          nginx   igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn.apps.45-135-57-238.sslip.io   192.168.3.49   80, 443   25m
+ingress.networking.k8s.io/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31          nginx   sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31.apps.45-135-57-238.sslip.io   192.168.3.49   80, 443   24m
+ingress.networking.k8s.io/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-oidc     nginx   sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31.apps.45-135-57-238.sslip.io   192.168.3.49   80, 443   24m
+
+NAME                                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)            AGE
+service/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc   ClusterIP   10.43.247.54    <none>        80/TCP,44180/TCP   25m
+service/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn        ClusterIP   10.43.125.138   <none>        80/TCP             25m
+service/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31        ClusterIP   10.43.234.107   <none>        80/TCP             24m
+service/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-oidc   ClusterIP   10.43.7.75      <none>        80/TCP,44180/TCP   24m
+
+NAME                                                                               ADDRESSTYPE   PORTS        ENDPOINTS    AGE
+endpointslice.discovery.k8s.io/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-3010d          IPv4          42109        10.42.0.71   25m
+endpointslice.discovery.k8s.io/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc-kqx8h     IPv4          4180,44180   10.42.0.86   25m
+endpointslice.discovery.k8s.io/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-f2086          IPv4          33857        10.42.0.71   24m
+endpointslice.discovery.k8s.io/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-oidc-tkvqv     IPv4          4180,44180   10.42.0.88   24m
 ```
 
-!!! tip
+!!! tip "Ingress address not assigned"
 
-    If an ingress resource does not have an IP, this may be a sign that the ingress controller
+    If an ingress resource does not have an address, this may be a sign that the ingress controller
     is not correctly configured or not functioning correctly.
````
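If no address is being assigned at all, the ingress controller itself is the next thing to check. A rough sketch, assuming the standard `ingress-nginx` controller (the namespace it runs in varies between deployments):

```sh
# Confirm the ingress controller pods are Running.
kubectl get pods --all-namespaces | grep ingress-nginx

# Confirm the controller's LoadBalancer service has an external IP assigned.
kubectl get svc --all-namespaces | grep ingress-nginx
```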
````diff
 
-If they do not exist, try restarting the Zenith sync component:
+!!! info "Services with OIDC authentication"
+
+    When a service has OIDC authentication enabled, there will be two of each resource for each
+    service, one of which will have the suffix `-oidc`.
+
+    Each service with OIDC authentication enabled gets a standalone service that is responsible
+    for handling the interactions with the OIDC provider. To check the state of these resources, use:
+
+    ```command title="On the K3s node, targeting the HA cluster if deployed"
+    $ kubectl -n zenith-services get deploy,po
+    NAME                                                        READY   UP-TO-DATE   AVAILABLE   AGE
+    deployment.apps/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc   1/1     1            1           33m
+    deployment.apps/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-oidc   1/1     1            1           33m
+
+    NAME                                                             READY   STATUS    RESTARTS   AGE
+    pod/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc-7f6656bd98-9rrj2   1/1     Running   0          33m
+    pod/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-oidc-7ffbff4cd6-sr75w   1/1     Running   0          33m
+    ```
````
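If an `-oidc` pod is not ready, or logins to the service are failing, its logs are the first place to look. A sketch using one of the deployments listed above:

```sh
# Follow the logs of the OIDC component for a service.
kubectl -n zenith-services logs deployment/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc -f
```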
````diff
+
+If any of these resources look incorrect, try restarting the Zenith sync component:
 
 ```sh title="On the K3s node, targeting the HA cluster if deployed"
 kubectl -n azimuth rollout restart deployment/zenith-server-sync
 ```
 
+If this doesn't work, check the logs for errors:
+
+```sh title="On the K3s node, targeting the HA cluster if deployed"
+kubectl -n azimuth logs deployment/zenith-server-sync [-f]
+```
+
 ## cert-manager fails to obtain a certificate
 
 If you are using cert-manager to dynamically allocate certificates for Zenith services it
````
````diff
@@ -116,10 +203,9 @@ To check if this is the case, check the state of the certificates for the Zenith
 
 ```command title="On the K3s node, targeting the HA cluster if deployed"
 $ kubectl -n zenith-services get certificate
-NAME                                      READY   SECRET                                    AGE
-tls-cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g   True    tls-cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g   2d
-tls-i03xvflgk1zmtcsdm2x5z5lz9qz05027euw   True    tls-i03xvflgk1zmtcsdm2x5z5lz9qz05027euw   2d
-tls-txn3zidfdnru5rg109voh848n51rvicmr1s   True    tls-txn3zidfdnru5rg109voh848n51rvicmr1s   2d
+NAME                                       READY   SECRET                                     AGE
+tls-igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn   True    tls-igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn   30m
+tls-sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31   True    tls-sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31   29m
 ```
 
 If the certificate for the service is not ready, check the details for the certificate using
````
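If a certificate is stuck with `READY` as `False`, its events usually explain why. A sketch using one of the names from the listing above:

```sh
# Describe the certificate; the Status and Events sections normally show why
# issuance is failing (for example, a failed ACME challenge).
kubectl -n zenith-services describe certificate tls-igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn
```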
