
Commit 37c90c2

add argus info from Chris
1 parent b2311f9 commit 37c90c2


6 files changed (+67, -9 lines)


docs/explanations/kubernetes_cluster.rst

Lines changed: 67 additions & 9 deletions
@@ -40,21 +40,79 @@ Three cluster topologies were considered for this project.

DLS Argus Cluster
-----------------

This section gives details of the topology and special
configuration used by the DLS Argus cluster to enable running
IOCs on a beamline.

Overview
~~~~~~~~

Argus is the production DLS cluster. It comprises 22 bare-metal worker nodes and a three-node control plane that runs in VMs. The control plane nodes run the K8s master processes such as the API server and controller manager, and each control plane node also runs an etcd member (a stacked etcd topology).

.. image:: ../images/clusterHA.png

To load balance across the K8s API servers running on the control plane nodes, there is a haproxy load balancer. The DNS endpoint argus.api.diamond.ac.uk (which all nodes use as the main API endpoint) points to a single haproxy IP. That IP is made highly available by a pair of VMs that both run haproxy, bind on all IPs, and use VRRP (keepalived) to ensure the IP is always up. Haproxy has the three control plane nodes as its target backend.

.. image:: ../images/kubeadm-ha-topology-stacked-etcd.png

The cluster uses Kubeadm to deploy the K8s control plane in containers. Kubeadm is provided by K8s upstream, is architecturally similar to Rancher Kubernetes Engine (RKE), and supports upgrades/downgrades and easy provisioning of nodes. The cluster is connected using Weave as the CNI; Weave is the only CNI tested that passes Broadcast/Unicast/Multicast (BUM) traffic through the iptables rules that control network access for pods. Metallb is used to support K8s LoadBalancer Service objects. Ingress nginx from nginxinc is used as the ingress controller. Logs are collected from the stdout of all pods by a fluentd daemonset, which ships them to a centralised Graylog server. Cluster authentication is via Keycloak.
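
As a rough illustration (not the actual Argus configuration), a kubeadm ``ClusterConfiguration`` for a stacked-etcd HA cluster of this shape points ``controlPlaneEndpoint`` at the load-balanced API DNS name, roughly as follows. All field values other than the DNS endpoint are illustrative assumptions:

.. code-block:: yaml

    # Hypothetical sketch - not taken from the real cluster.
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterConfiguration
    kubernetesVersion: v1.21.0                            # example version only
    controlPlaneEndpoint: "argus.api.diamond.ac.uk:6443"  # haproxy-fronted API endpoint
    networking:
      podSubnet: "10.32.0.0/12"                           # example: Weave's default allocation range
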
The cluster sits in one rack, with a top-of-rack (TOR) switch/router connecting it to the rest of the network. The cluster nodes sit on the same /22 network, which is routable via the TOR router (this router advertises the /22 subnet to other racks via OSPF). Metallb pool IPs are allocated from within this /22 to ensure they are globally routable by the OSPF network; the metallb speaker pods respond to ARP requests from the TOR router looking for load-balanced Service IPs.
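
For illustration, a layer 2 (ARP) address pool of this kind is typically declared in metallb's ConfigMap format along the following lines. The address range shown is a placeholder, not the real /22 allocation:

.. code-block:: yaml

    # Hypothetical sketch of a metallb layer 2 pool.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: metallb-system
      name: config
    data:
      config: |
        address-pools:
        - name: default
          protocol: layer2
          addresses:
          - 192.0.2.240-192.0.2.250   # placeholder range, not the real pool
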

**One of the Argus racks**

.. image:: ../images/argus3.jpg

The cluster is built and managed using Ansible. Heavy use of the k8s module enables K8s components to be installed by talking directly to the K8s API. Ansible also configures the haproxy API load balancer. The Prometheus Operator provides the monitoring stack.
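
For example, a task of roughly this shape applies a manifest directly against the K8s API via the k8s module (the resource shown is an example, not taken from the real playbooks):

.. code-block:: yaml

    # Hypothetical sketch of an Ansible task using the k8s module.
    - name: Ensure the metallb-system namespace exists
      k8s:
        state: present
        definition:
          apiVersion: v1
          kind: Namespace
          metadata:
            name: metallb-system
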
Argus is a multi-tenant cluster, with namespaces used to enforce multi-tenancy. A namespace is created on demand for each user, acting as a sandbox for them to get familiar with K8s. Applications deployed in production get their own “project” namespace. The project namespace has an associated policy that determines who can run pods in the namespace, what data can be accessed, and whether pods can run with elevated privileges. This is enforced by a combination of RBAC and Pod Security Policy (PSP); the latter is deprecated as of K8s 1.21 and will soon be replaced with Open Policy Agent.
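
A minimal sketch of what the RBAC side of such per-namespace policy can look like is shown below. The namespace and group names are examples, not the actual Argus policy:

.. code-block:: yaml

    # Hypothetical sketch of per-project RBAC.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: project-developers
      namespace: my-project              # example project namespace
    subjects:
    - kind: Group
      name: my-project-developers        # example group, e.g. asserted by the identity provider
      apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: ClusterRole
      name: edit                         # built-in aggregated "edit" role
      apiGroup: rbac.authorization.k8s.io
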
Beamline Local Cluster Nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As part of the investigation work, some worker nodes that are physically located at a beamline have been joined to Argus. These nodes do not share the Argus rack, and hence sit on a different routed subnet from the /22 that contains the control plane and main workers. This model assumes one centralised control plane (plus some generic workers) and a set of beamline cluster nodes that may be distributed across the network in different subnets.

The beamline cluster nodes require some special configuration to make this architecture work; see the following subsections for details.

Metallb Pools
+++++++++++++

Metallb cannot be used to provide LoadBalancer Services for pods running on the beamline cluster nodes, because metallb currently only supports a single pool of IPs to allocate from. In the case of Argus, the pool is allocated from within the /22 in which the control plane (and a few generic workers) sit. Should a pod with a LoadBalancer Service IP be brought up on a beamline cluster node, the traffic would not be routable, because the beamline TOR switch does not send ARP messages for subnets that it is not directly connected to. This is not an issue for running IOCs, since they do not make use of LoadBalancer Services. A feature request for metallb to support multiple address pools is currently pending.

Node Labelling and Taints
+++++++++++++++++++++++++

The beamline cluster worker nodes are labelled and tainted with the name of the beamline. This ensures that only pods running IOCs relevant to that beamline can be started on the beamline worker nodes. Pods that are to be scheduled there must tolerate the taint and use node selection based on the label.

Certain utility pods must also tolerate the beamline name taint. Pods such as fluentd (which provides pod log aggregation and shipping to a centralised graylog) need an additional toleration of the taint. However, most standard utilities such as Prometheus, Weave (the CNI itself runs in a pod) and kube-proxy have a toleration of all "NoSchedule" taints built in.
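
A minimal sketch of the scheduling constraints this implies for an IOC pod is shown below. The ``beamline`` key and ``bl00x`` value are example names, not the real labels and taints:

.. code-block:: yaml

    # Hypothetical sketch of pinning a pod to a beamline worker node.
    apiVersion: v1
    kind: Pod
    metadata:
      name: example-ioc
    spec:
      nodeSelector:
        beamline: bl00x             # only schedule on nodes labelled for this beamline
      tolerations:
      - key: beamline
        operator: Equal
        value: bl00x
        effect: NoSchedule          # tolerate the beamline taint
      containers:
      - name: ioc
        image: example-ioc-image:latest   # placeholder image
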
Host Network
++++++++++++

In order for IOCs to work within K8s pods, they typically need to see BUM traffic, because EPICS uses UDP broadcast for IOC discovery. There are also other network quirks that IOCs exhibit which make the CNI network overlay unsuitable. To get around this, pods running IOCs use the host network namespace: they see the interfaces of the underlying worker node, rather than the virtual interface, connected to the cluster-internal network, that normal pods see. This is done by setting ``hostNetwork: true`` in the pod spec. Access to the host network namespace requires privileged pods. Whilst this is allowed (Argus uses Pod Security Policy to enforce the attributes of the pods that are scheduled), we drop the capabilities that are not needed, which reduces the attack surface somewhat; we drop everything except NET_ADMIN and NET_BROADCAST.
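
A minimal sketch of the resulting pod settings is shown below. Pod and image names are placeholders, and the exact manifest used at DLS may differ:

.. code-block:: yaml

    # Hypothetical sketch of an IOC pod on the host network with reduced capabilities.
    # The Pod Security Policy applied to the namespace must permit hostNetwork
    # and these capabilities.
    apiVersion: v1
    kind: Pod
    metadata:
      name: example-ioc-hostnet
    spec:
      hostNetwork: true                  # share the worker node's network namespace
      containers:
      - name: ioc
        image: example-ioc-image:latest  # placeholder image
        securityContext:
          capabilities:
            drop: ["ALL"]
            add: ["NET_ADMIN", "NET_BROADCAST"]
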

Uses for Argus
--------------

The central cluster is used for many services other than EPICS IOCs. Below is
a list of current and potential use cases:

- Controls IOCs
- Kafka and Spark
- Jenkins
- Sonarqube
- Zocalo
- Jupyterhub
- Business apps (Confluence, Jira etc.)
- Monitoring stacks (ElasticSearch, Graylog, Graphite, Nagdash etc.)
- Core services (LDAP, Kerberos, Gitlab etc.)
- Netbox
- MariaDB
- HT Condor
- Machine Learning toolkits (Kubeflow)
- VM orchestration (Kubevirt/Virtlet)
- Relion
- Storage Systems deployment (Ceph-rook, Portworx etc.)
- XChem Fragalysis

images/argus1.jpg (2.05 MB)

images/argus2.jpg (2.12 MB)

images/argus3.jpg (2.23 MB)

images/clusterHA.png (38.5 KB)
images/kubeadm-ha-topology-stacked-etcd.png (69 KB)
