---
description: >
  Tips and notes for running a production-grade kcp setup.
---

# Production Setup

This document collects notes and tips on how to run a production-grade kcp setup.

## Overview

Running kcp consists mainly of two challenges:

* Running reliable **etcd** clusters for each kcp shard.
* Running **kcp** and dealing with its **sharding** to distribute load and limit the impact of
  downtimes to a subset of the entire kcp setup.

## Running etcd

Just like Kubernetes, kcp uses [etcd](https://etcd.io/) as its database: each (root) shard uses its
own etcd cluster.

The etcd documentation already contains a great number of [operations guides](https://etcd.io/docs/v3.7/op-guide/)
for common tasks like performing backups, monitoring cluster health, etc. Administrators should
familiarize themselves with the practices laid out there.

### Kubernetes

When running etcd inside Kubernetes, an operator can take over much of the operational burden.
[Etcd Druid](https://gardener.github.io/etcd-druid/) is one such operator and offers great support
for operational tasks and the entire etcd lifecycle. Etcd clusters managed by Etcd Druid can be
used seamlessly with kcp.
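
Etcd Druid manages etcd clusters through its own `Etcd` custom resource. The skeleton below is a
hedged sketch, not a complete, applyable manifest: the name is a placeholder, and Etcd Druid
requires further `spec` fields (etcd image, storage, selector/labels, backup configuration) that
are documented upstream.

```yaml
# Illustrative skeleton only; consult the Etcd Druid documentation for the
# required spec fields omitted here.
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: kcp-root-shard # placeholder name
  namespace: kcp
spec:
  replicas: 3
```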

### High Availability

Care should be taken to distribute the etcd pods across availability zones and/or different nodes.
This ensures that a node failure will not immediately bring down an entire etcd cluster. Please refer
to the [Etcd Druid documentation](https://gardener.github.io/etcd-druid/proposals/01-multi-node-etcd-clusters.html?h=affinity#high-availability)
for more details and configuration examples.
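
Outside of Etcd Druid, the same result can be achieved with plain Kubernetes scheduling
constraints. The following is a minimal sketch, assuming an `app: etcd` label and a three-member
cluster; the names and image tag are illustrative:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      # Spread members across zones; refuse to schedule rather than co-locate
      # two members in the same zone.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: etcd
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.17
```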

### TLS

It is highly recommended to enable TLS in etcd to encrypt traffic in transit between kcp and etcd.
When using Kubernetes, [cert-manager](https://cert-manager.io/) is a great choice for managing CAs
and certificates in your cluster, and it can also provide certificates for use in etcd.
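
For example, a client certificate for kcp to present to etcd could be requested with a Certificate
object like the following sketch; the issuer and all names are assumptions:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: etcd-client # placeholder
  namespace: kcp
spec:
  secretName: etcd-client-tls
  issuerRef:
    name: etcd-ca # placeholder: an Issuer backed by your etcd CA
    kind: Issuer
  commonName: kcp
  usages:
    - client auth
  dnsNames:
    - etcd.kcp.svc.cluster.local
```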

On the kcp side, all that is required is to configure three CLI flags:

* `--etcd-certfile`
* `--etcd-keyfile`
* `--etcd-cafile`

When using cert-manager, all three files are available in the Secret that is created for the
Certificate object.
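
Assuming the Secret above is mounted into the kcp pod at `/etc/etcd/pki` (cert-manager uses the
standard key names `tls.crt`, `tls.key` and `ca.crt`), the flags could be wired up as in this
sketch; the `--etcd-servers` endpoint is an assumption matching the Certificate example:

```sh
kcp start \
  --etcd-servers=https://etcd.kcp.svc.cluster.local:2379 \
  --etcd-certfile=/etc/etcd/pki/tls.crt \
  --etcd-keyfile=/etc/etcd/pki/tls.key \
  --etcd-cafile=/etc/etcd/pki/ca.crt
```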

When using Etcd Druid, you have to create the necessary certificates manually or use one of the
community Helm charts like [hajowieland/etcd-druid-certs](https://artifacthub.io/packages/helm/hajowieland/etcd-druid-certs).

### Backups

As with any database, etcd clusters should be backed up regularly. This is especially important with
etcd because a permanent quorum loss can make the entire database unavailable, even though the data
itself is technically still there in some form.

An operator like the aforementioned Etcd Druid can greatly help with performing backups and
restores.
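
For a manual backup, a snapshot can be taken with `etcdctl`; the endpoint and certificate paths
below are assumptions matching the TLS example above:

```sh
etcdctl \
  --endpoints=https://etcd.kcp.svc.cluster.local:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/tls.crt \
  --key=/etc/etcd/pki/tls.key \
  snapshot save /backups/etcd-$(date +%Y%m%d-%H%M%S).db
```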

### Encryption

kcp supports encryption-at-rest for its storage backend, allowing administrators to configure
encryption keys or integration with external key-management systems to encrypt data written to disk.

Please refer to the [Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/)
for more information on configuring and using encryption in kcp.

Since each shard and its etcd are independent of other shards, the encryption configuration can
differ per shard, if desired.
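
Because kcp reuses the Kubernetes apiserver machinery, a standard `EncryptionConfiguration`
applies; the sketch below encrypts Secrets with AES-CBC. The key material is a placeholder
(generate one with `head -c 32 /dev/urandom | base64`), and passing the file via
`--encryption-provider-config` is an assumption based on the inherited kube-apiserver flag:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # Writes use the first provider; reads try all of them, so the identity
      # provider keeps previously unencrypted data readable.
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      - identity: {}
```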

### Scaling

etcd can be scaled to some degree by adding more resources and/or more members to an etcd cluster;
however, [hard limits](https://etcd.io/docs/v3.7/dev-guide/limit/) set an upper boundary. It is
important to monitor etcd performance and assign resources accordingly.
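
One of those limits is etcd's storage quota, which defaults to 2 GiB and which the upstream
documentation suggests keeping below roughly 8 GB. As a sketch, the quota could be raised like
this (all other etcd flags omitted):

```sh
# 8 GiB backend quota; exceeding it raises a cluster-wide alarm that puts
# etcd into a limited mode accepting only reads and deletes.
etcd --quota-backend-bytes=8589934592
```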

Note that when using scaling solutions like the Vertical Pod Autoscaler (VPA), care must be taken
that not too many etcd members restart simultaneously; otherwise a permanent loss of quorum can
occur, which would require restoring etcd from a backup.
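
A PodDisruptionBudget limits how many members voluntary evictions (such as those triggered by the
VPA updater) may take down at once. A minimal sketch, again assuming an `app: etcd` label:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd
spec:
  # Never allow more than one member to be evicted at a time, preserving
  # quorum during voluntary disruptions.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: etcd
```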

## Running kcp

Kubernetes is the native habitat of kcp and its recommended runtime environment. The kcp project
offers two ways of running kcp in Kubernetes:

* via a [Helm chart](https://github.com/kcp-dev/helm-charts/)
* using the [kcp-operator](https://docs.kcp.io/kcp-operator/)

While still in its early stages, the kcp-operator aims to become the recommended approach to running
kcp: it offers more features than the Helm charts and can actively reconcile missing or changed
resources on its own.
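
For the Helm chart route, an installation could look like the following sketch; the repository
URL, chart name and values file are assumptions based on the kcp-dev/helm-charts repository:

```sh
helm repo add kcp https://kcp-dev.github.io/helm-charts
helm repo update
helm install kcp kcp/kcp --namespace kcp --create-namespace --values values.yaml
```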

### Sharding

kcp supports the concept of sharding to spread the workload horizontally across kcp processes. Even
if the database behind kcp offered infinite performance at zero cost, kcp itself cannot scale
vertically indefinitely: each logical cluster requires a minimum of runtime resources, even if the
cluster is not actively used.

New workspaces in kcp are spread evenly across all available shards; however, as of kcp 0.28, this
does not take into account the current number of logical clusters on each shard. This means that
once every existing shard has reached its administrator-defined limit, simply adding a new shard
will not make kcp schedule all new clusters onto it; kcp will still distribute them evenly. There
is currently no mechanism to mark shards as "full" or unavailable for scheduling, and the kcp
scheduler does not take shard metrics into account.

It is therefore recommended to start with a sharded setup instead of working with a single root
shard only. This not only improves reliability and performance, but can also help ensure newly
developed kcp client software does not accidentally make false assumptions about sharding.

### High Availability

To improve resilience against node failures, it is strongly recommended to not just spread the
workload across multiple shards, but also to ensure that shard pods are distributed across nodes or
availability zones. The same advice for etcd applies to kcp as well: use anti-affinities to ensure
pods are scheduled properly.
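
As a sketch, the relevant fragment of a shard Deployment's pod template could look like this; the
`app: kcp-shard` label is an assumption:

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            # Never place two shard pods in the same zone.
            - labelSelector:
                matchLabels:
                  app: kcp-shard
              topologyKey: topology.kubernetes.io/zone
```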

### Backups

All kcp data is stored in etcd, so there is no need to perform a dedicated kcp backup.