
Commit e9d9d41

Merge pull request #3592 from xrstf/prod-docs
begin docs page with production tips
2 parents 7bd0106 + 4ea9d6d commit e9d9d41

2 files changed: +129 −0 lines


docs/content/setup/.pages

Lines changed: 1 addition & 0 deletions
@@ -4,3 +4,4 @@ nav:
  - helm.md
  - kubectl-plugin.md
  - integrations.md
+ - production.md

docs/content/setup/production.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
---
description: >
  Tips and notes for running a production-grade kcp setup.
---

# Production Setup

This document collects notes and tips on how to run a production-grade kcp setup.

## Overview

Running kcp mainly consists of two challenges:

* Running reliable **etcd** clusters for each kcp shard.
* Running **kcp** and dealing with its **sharding** to distribute load and limit the impact of
  downtimes to a subset of the entire kcp setup.

## Running etcd

Just like Kubernetes, kcp uses [etcd](https://etcd.io/) as its database: each (root) shard uses its own
etcd cluster.

The etcd documentation already contains a great number of [operations guides](https://etcd.io/docs/v3.7/op-guide/)
for common tasks like performing backups, monitoring cluster health, etc. Administrators should
familiarize themselves with the practices laid out there.

### Kubernetes

When running etcd inside Kubernetes, an operator can greatly help with managing etcd.
[Etcd Druid](https://gardener.github.io/etcd-druid/) is one such operator and offers great support for
operational tasks and the entire etcd lifecycle. Etcd clusters managed by Etcd Druid can be seamlessly
used with kcp.
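
As a rough sketch, a minimal `Etcd` resource for a shard's etcd cluster could look like the
following. The resource name, sizes and the backup store below are assumptions for illustration
only; please consult the Etcd Druid API reference for the authoritative field names and values.

```yaml
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: kcp-root-etcd          # hypothetical name for the root shard's etcd
  namespace: kcp
spec:
  replicas: 3                  # three members tolerate the loss of a single member
  labels:
    app: etcd
    instance: kcp-root-etcd
  selector:
    matchLabels:
      app: etcd
      instance: kcp-root-etcd
  etcd:
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
  storageCapacity: 20Gi        # size the volume for the expected etcd database size
  backup:
    fullSnapshotSchedule: "0 */6 * * *"   # periodic full snapshots (see the Backups section)
    store:
      provider: S3                        # assumption: an S3-compatible object store
      container: kcp-etcd-backups         # bucket name, an assumption
      secretRef:
        name: etcd-backup-credentials     # Secret holding object store credentials
```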

### High Availability

Care should be taken to distribute the etcd pods across availability zones and/or different nodes.
This ensures that node failure will not immediately bring down an entire etcd cluster. Please refer
to the [Etcd Druid documentation](https://gardener.github.io/etcd-druid/proposals/01-multi-node-etcd-clusters.html?h=affinity#high-availability)
for more details and configuration examples.
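
If you manage the etcd pods yourself, a plain pod anti-affinity on the pod template achieves this
spreading. The `app: etcd` label below is an assumption and must match the labels of your etcd pods;
use `kubernetes.io/hostname` as the topology key to spread across nodes instead of zones.

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: etcd                              # must match your etcd pod labels
        topologyKey: topology.kubernetes.io/zone   # at most one member per availability zone
```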

### TLS

It is highly recommended to enable TLS in etcd to encrypt traffic in transit between kcp and etcd.
When using Kubernetes, [cert-manager](https://cert-manager.io/) is a great choice for managing CAs
and certificates in your cluster, and it can also provide certificates for use in etcd.

On the kcp side, all that is required is to configure three CLI flags:

* `--etcd-certfile`
* `--etcd-keyfile`
* `--etcd-cafile`

When using cert-manager, all three files are available in the Secret that is created for the
Certificate object.
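
As a sketch (the Issuer name, Secret name and mount path are assumptions), a client certificate for
kcp could be requested like this, with the resulting Secret mounted into the kcp pod and referenced
by the three flags:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kcp-etcd-client
  namespace: kcp
spec:
  secretName: kcp-etcd-client-tls   # the Secret will contain ca.crt, tls.crt and tls.key
  commonName: kcp-etcd-client
  usages:
    - client auth
  issuerRef:
    name: etcd-ca                   # assumption: a cert-manager Issuer backed by the etcd CA
    kind: Issuer
```

With the Secret mounted at e.g. `/etc/etcd/tls`, the flags would be set to
`--etcd-certfile=/etc/etcd/tls/tls.crt`, `--etcd-keyfile=/etc/etcd/tls/tls.key` and
`--etcd-cafile=/etc/etcd/tls/ca.crt`.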

When using Etcd Druid you have to manually create the necessary certificates or make use of one of
the community Helm charts like [hajowieland/etcd-druid-certs](https://artifacthub.io/packages/helm/hajowieland/etcd-druid-certs).

### Backups

As with any database, etcd clusters should be backed up regularly. This is especially important with
etcd because a permanent quorum loss can make the entire database unavailable, even though the data
is technically still there in some form.

Using an operator like the aforementioned Etcd Druid can greatly help in performing backups and
restores.

### Encryption

kcp supports encryption-at-rest for its storage backend, allowing administrators to configure
encryption keys or integrate with external key-management systems to encrypt data written to disk.

Please refer to the [Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/)
for more information on configuring and using encryption in kcp.

Since each shard and its etcd are independent from other shards, the encryption configuration can be
different per shard, if desired.
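
The configuration format is the `EncryptionConfiguration` known from Kubernetes. A minimal sketch
that encrypts Secrets with a static AES-CBC key could look like this; the key name and value are
placeholders, and it is assumed here that kcp reuses the upstream apiserver flag
(`--encryption-provider-config`) to point at the file:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>   # placeholder, generate your own key
      - identity: {}   # allows reading data written before encryption was enabled
```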

### Scaling

etcd can be scaled to some degree by adding more resources and/or more members to an etcd cluster;
however, [hard limits](https://etcd.io/docs/v3.7/dev-guide/limit/) set an upper boundary. It is
important to monitor etcd performance to assign resources accordingly.

Note that when using scaling solutions like the Vertical Pod Autoscaler (VPA), care must be taken that
not too many etcd members restart simultaneously, or a permanent loss of quorum can occur, which would
require restoring etcd from a backup.
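
A `PodDisruptionBudget` limits how many members may be evicted at the same time; since the VPA
updater evicts pods via the eviction API, such a budget is honored. For a three-member cluster,
`minAvailable: 2` ensures quorum is never voluntarily broken (the label selector is an assumption
and must match your etcd pods):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd
  namespace: kcp
spec:
  minAvailable: 2        # for a 3-member cluster: at most one member down at a time
  selector:
    matchLabels:
      app: etcd          # must match your etcd pod labels
```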

## Running kcp

Kubernetes is the native habitat of kcp and its recommended runtime environment. The kcp project
offers two ways of running kcp in Kubernetes:

* via the [Helm chart](https://github.com/kcp-dev/helm-charts/)
* using the [kcp-operator](https://docs.kcp.io/kcp-operator/)

While still in its early stages, the kcp-operator is intended to become the recommended approach to
running kcp: it offers more features than the Helm charts and can actively reconcile missing or
changed resources on its own.

### Sharding

kcp supports the concept of sharding to spread the workload horizontally across kcp processes. Even
if the database behind kcp offered infinite performance at zero cost, kcp itself cannot scale
vertically indefinitely: each logical cluster requires a minimum of runtime resources, even if the
cluster is not actively used.

New workspaces in kcp are spread evenly across all available shards; however, as of kcp 0.28, this
does not take into account the current number of logical clusters on each shard. This means that once
every existing shard has reached its administrator-defined limit, simply adding a new shard will not
make kcp schedule all new clusters onto it, but will still distribute them evenly. There is currently
no mechanism to mark shards as "full" or unavailable for scheduling, and the kcp scheduler does not
take shard metrics into account.

It is therefore recommended to start with a sharded setup instead of working with a single root shard
only. This not only improves reliability and performance, but can also help ensure that newly developed
kcp client software does not accidentally make false assumptions about sharding.

### High Availability

To improve resilience against node failures, it is strongly recommended to not just spread the
workload across multiple shards, but also to ensure that shard pods are distributed across nodes or
availability zones. The same advice for etcd applies to kcp as well: use anti-affinities to ensure
pods are scheduled properly.
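
As an alternative to anti-affinity rules, topology spread constraints on the shard pod template
achieve the same effect. The `app: kcp-shard` label below is an assumption and must match whatever
labels your Helm release or the kcp-operator puts on the shard pods:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # spread replicas across availability zones
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: kcp-shard   # assumption: adjust to your shard pod labels
```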

### Backups

All kcp data is stored in etcd, so there is no need to perform a dedicated kcp backup.
