
Commit dd9fa36
Merge pull request #27466 from chrisnegus/prod-env-text
Adding new Production Environment section
2 parents b02027a + 2117217

1 file changed: content/en/docs/setup/production-environment (+290 -1 lines)
@@ -1,4 +1,293 @@
---
title: "Production environment"
description: Create a production-quality Kubernetes cluster
weight: 30
no_list: true
---

<!-- overview -->

A production-quality Kubernetes cluster requires planning and preparation.
If your Kubernetes cluster is to run critical workloads, it must be configured to be resilient.
This page explains steps you can take to set up a production-ready cluster,
or to promote an existing cluster for production use.
If you're already familiar with production setup and want the links, skip to
[What's next](#what-s-next).

<!-- body -->

## Production considerations

Typically, a production Kubernetes cluster environment has more requirements than a
personal learning, development, or test environment. A production environment may require
secure access by many users, consistent availability, and the resources to adapt
to changing demands.

As you decide where you want your production Kubernetes environment to live
(on premises or in a cloud) and the amount of management you want to take
on or hand to others, consider how your requirements for a Kubernetes cluster
are influenced by the following issues:

- *Availability*: A single-machine Kubernetes [learning environment](/docs/setup/#learning-environment)
  has a single point of failure. Creating a highly available cluster means considering:
  - Separating the control plane from the worker nodes.
  - Replicating the control plane components on multiple nodes.
  - Load balancing traffic to the cluster’s {{< glossary_tooltip term_id="kube-apiserver" text="API server" >}}.
  - Having enough worker nodes available, or able to quickly become available, as changing workloads warrant it.
- *Scale*: If you expect your production Kubernetes environment to receive a stable amount of
  demand, you might be able to set up for the capacity you need and be done. However,
  if you expect demand to grow over time or change dramatically based on things like
  season or special events, you need to plan how to scale to relieve increased
  pressure from more requests to the control plane and worker nodes or scale down to reduce unused
  resources.
- *Security and access management*: You have full admin privileges on your own
  Kubernetes learning cluster. But shared clusters with important workloads, and
  more than one or two users, require a more refined approach to who and what can
  access cluster resources. You can use role-based access control
  ([RBAC](/docs/reference/access-authn-authz/rbac/)) and other
  security mechanisms to make sure that users and workloads can get access to the
  resources they need, while keeping workloads, and the cluster itself, secure.
  You can set limits on the resources that users and workloads can access
  by managing [policies](https://kubernetes.io/docs/concepts/policy/) and
  [container resources](/docs/concepts/configuration/manage-resources-containers/).

Before building a Kubernetes production environment on your own, consider
handing off some or all of this job to
[Turnkey Cloud Solutions](/docs/setup/production-environment/turnkey-solutions/)
providers or other [Kubernetes Partners](https://kubernetes.io/partners/).
Options include:

- *Serverless*: Just run workloads on third-party equipment without managing
  a cluster at all. You will be charged for things like CPU usage, memory, and
  disk requests.
- *Managed control plane*: Let the provider manage the scale and availability
  of the cluster's control plane, as well as handle patches and upgrades.
- *Managed worker nodes*: Configure pools of nodes to meet your needs,
  then the provider makes sure those nodes are available and ready to implement
  upgrades when needed.
- *Integration*: There are providers that integrate Kubernetes with other
  services you may need, such as storage, container registries, authentication
  methods, and development tools.

Whether you build a production Kubernetes cluster yourself or work with
partners, review the following sections to evaluate your needs as they relate
to your cluster’s *control plane*, *worker nodes*, *user access*, and
*workload resources*.

## Production cluster setup

In a production-quality Kubernetes cluster, the control plane manages the
cluster from services that can be spread across multiple computers
in different ways. Each worker node, however, represents a single entity that
is configured to run Kubernetes pods.

### Production control plane

The simplest Kubernetes cluster has the entire control plane and worker node
services running on the same machine. You can grow that environment by adding
worker nodes, as reflected in the diagram illustrated in
[Kubernetes Components](/docs/concepts/overview/components/).
If the cluster is meant to be available for a short period of time, or can be
discarded if something goes seriously wrong, this might meet your needs.

If you need a more permanent, highly available cluster, however, you should
consider ways of extending the control plane. By design, control plane
services running on a single machine are not highly available.
If keeping the cluster up and running
and ensuring that it can be repaired if something goes wrong is important,
consider these steps:

- *Choose deployment tools*: You can deploy a control plane using tools such
  as kubeadm, kops, and kubespray. See
  [Installing Kubernetes with deployment tools](/docs/setup/production-environment/tools/)
  to learn tips for production-quality deployments using each of those deployment
  methods. Different [Container Runtimes](/docs/setup/production-environment/container-runtimes/)
  are available to use with your deployments.
- *Manage certificates*: Secure communications between control plane services
  are implemented using certificates. Certificates are automatically generated
  during deployment or you can generate them using your own certificate authority.
  See [PKI certificates and requirements](/docs/setup/best-practices/certificates/) for details.
- *Configure load balancer for apiserver*: Configure a load balancer
  to distribute external API requests to the apiserver service instances running on different nodes. See
  [Create an External Load Balancer](/docs/tasks/access-application-cluster/create-external-load-balancer/)
  for details; a kubeadm configuration sketch showing a load-balanced control plane endpoint
  appears at the end of this section.
- *Separate and back up etcd service*: The etcd services can either run on the
  same machines as other control plane services or run on separate machines, for
  extra security and availability. Because etcd stores cluster configuration data,
  backing up the etcd database should be done regularly to ensure that you can
  repair that database if needed.
  See the [etcd FAQ](https://etcd.io/docs/v3.4/faq/) for details on configuring and using etcd.
  See [Operating etcd clusters for Kubernetes](/docs/tasks/administer-cluster/configure-upgrade-etcd/)
  and [Set up a High Availability etcd cluster with kubeadm](/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/)
  for details.
- *Create multiple control plane systems*: For high availability, the
  control plane should not be limited to a single machine. If the control plane
  services are run by an init service (such as systemd), each service should run on at
  least three machines. However, running control plane services as pods in
  Kubernetes ensures that the replicated number of services that you request
  will always be available.
  The scheduler should be fault tolerant,
  but not highly available. Some deployment tools set up the [Raft](https://raft.github.io/)
  consensus algorithm to do leader election of Kubernetes services. If the
  primary goes away, another service elects itself and takes over.
- *Span multiple zones*: If keeping your cluster available at all times is
  critical, consider creating a cluster that runs across multiple data centers,
  referred to as zones in cloud environments. Groups of zones are referred to as regions.
  Spreading a cluster across
  multiple zones in the same region improves the chances that your
  cluster will continue to function even if one zone becomes unavailable.
  See [Running in multiple zones](/docs/setup/best-practices/multiple-zones/) for details.
- *Manage ongoing features*: If you plan to keep your cluster over time,
  there are tasks you need to do to maintain its health and security. For example,
  if you installed with kubeadm, there are instructions to help you with
  [Certificate Management](/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/)
  and [Upgrading kubeadm clusters](/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/).
  See [Administer a Cluster](/docs/tasks/administer-cluster/)
  for a longer list of Kubernetes administrative tasks.

To learn about available options when you run control plane services, see
[kube-apiserver](/docs/reference/command-line-tools-reference/kube-apiserver/),
[kube-controller-manager](/docs/reference/command-line-tools-reference/kube-controller-manager/),
and [kube-scheduler](/docs/reference/command-line-tools-reference/kube-scheduler/)
component pages. For highly available control plane examples, see
[Options for Highly Available topology](/docs/setup/production-environment/tools/kubeadm/ha-topology/),
[Creating Highly Available clusters with kubeadm](/docs/setup/production-environment/tools/kubeadm/high-availability/),
and [Operating etcd clusters for Kubernetes](/docs/tasks/administer-cluster/configure-upgrade-etcd/).
See [Backing up an etcd cluster](/docs/tasks/administer-cluster/configure-upgrade-etcd/#backing-up-an-etcd-cluster)
for information on making an etcd backup plan.

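For example, with kubeadm, pointing the cluster at a load-balanced endpoint is what
lets several API server instances sit behind one stable address. A minimal sketch,
assuming a hypothetical load balancer at `lb.example.com` that you would replace
with your own address or VIP:

```yaml
# kubeadm-config.yaml: a minimal HA sketch. The endpoint below is a
# placeholder for your own load balancer, not a real address.
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: stable
# Stable address (DNS name or VIP) of the load balancer that fronts
# every kube-apiserver instance in the cluster.
controlPlaneEndpoint: "lb.example.com:6443"
etcd:
  local:
    dataDir: /var/lib/etcd
```

You might then create the first control plane node with
`kubeadm init --config kubeadm-config.yaml --upload-certs` and join the remaining
control plane nodes using the `kubeadm join ... --control-plane` command it prints.
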
### Production worker nodes

Production-quality workloads need to be resilient, and anything they rely
on needs to be resilient (such as CoreDNS). Whether you manage your own
control plane or have a cloud provider do it for you, you still need to
consider how you want to manage your worker nodes (also referred to
simply as *nodes*).

- *Configure nodes*: Nodes can be physical or virtual machines. If you want to
  create and manage your own nodes, you can install a supported operating system,
  then add and run the appropriate
  [Node services](/docs/concepts/overview/components/#node-components). Consider:
  - The demands of your workloads when you set up nodes by having appropriate memory, CPU, and disk speed and storage capacity available.
  - Whether generic computer systems will do or you have workloads that need GPU processors, Windows nodes, or VM isolation.
- *Validate nodes*: See [Valid node setup](/docs/setup/best-practices/node-conformance/)
  for information on how to ensure that a node meets the requirements to join
  a Kubernetes cluster.
- *Add nodes to the cluster*: If you are managing your own cluster you can
  add nodes by setting up your own machines and either adding them manually or
  having them register themselves to the cluster’s apiserver. See the
  [Nodes](/docs/concepts/architecture/nodes/) section for information on how to set up Kubernetes to add nodes in these ways.
- *Add Windows nodes to the cluster*: Kubernetes offers support for Windows
  worker nodes, allowing you to run workloads implemented in Windows containers. See
  [Windows in Kubernetes](/docs/setup/production-environment/windows/) for details.
- *Scale nodes*: Have a plan for expanding the capacity your cluster will
  eventually need. See [Considerations for large clusters](/docs/setup/best-practices/cluster-large/)
  to help determine how many nodes you need, based on the number of pods and
  containers you need to run. If you are managing nodes yourself, this can mean
  purchasing and installing your own physical equipment.
- *Autoscale nodes*: Most cloud providers support
  [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#readme)
  to replace unhealthy nodes or grow and shrink the number of nodes as demand requires. See the
  [Frequently Asked Questions](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
  for how the autoscaler works and
  [Deployment](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#deployment)
  for how it is implemented by different cloud providers. For on-premises clusters, there
  are some virtualization platforms that can be scripted to spin up new nodes
  based on demand.
- *Set up node health checks*: For important workloads, you want to make sure
  that the nodes and pods running on those nodes are healthy. Using the
  [Node Problem Detector](/docs/tasks/debug-application-cluster/monitor-node-health/)
  daemon, you can monitor node health and surface problems to the control plane;
  a deployment sketch follows this list.

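Node Problem Detector is typically deployed as a DaemonSet so that one copy runs
on every node. A minimal sketch, assuming the upstream image (the registry path
and tag shown here are assumptions; check the project's releases for current values):

```yaml
# node-problem-detector.yaml: a minimal sketch, not the project's full
# manifest; the image path and tag are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector/node-problem-detector:v0.8.7
        securityContext:
          privileged: true        # needed to read kernel and system logs on the host
        volumeMounts:
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
      volumes:
      - name: kmsg
        hostPath:
          path: /dev/kmsg
```

Problems it finds are reported as Node conditions and Events, which your
monitoring stack can then alert on.
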
## Production user management

In production, you may be moving from a model where you or a small group of
people are accessing the cluster to where there may potentially be dozens or
hundreds of people. In a learning environment or platform prototype, you might have a single
administrative account for everything you do. In production, you will want
more accounts with different levels of access to different namespaces.

211+
Taking on a production-quality cluster means deciding how you
212+
want to selectively allow access by other users. In particular, you need to
213+
select strategies for validating the identities of those who try to access your
214+
cluster (authentication) and deciding if they have permissions to do what they
215+
are asking (authorization):
216+
217+
- *Authentication*: The apiserver can authenticate users using client
  certificates, bearer tokens, an authenticating proxy, or HTTP basic auth.
  You can choose which authentication methods you want to use.
  Using plugins, the apiserver can leverage your organization’s existing
  authentication methods, such as LDAP or Kerberos. See
  [Authentication](/docs/reference/access-authn-authz/authentication/)
  for a description of these different methods of authenticating Kubernetes users.
- *Authorization*: When you set out to authorize your regular users, you will probably choose
  between RBAC and ABAC authorization. See [Authorization Overview](/docs/reference/access-authn-authz/authorization/)
  to review different modes for authorizing user accounts (as well as service account access to your cluster):
  - *Role-based access control* ([RBAC](/docs/reference/access-authn-authz/rbac/)): Lets you assign access to your cluster by allowing specific sets of permissions to authenticated users. Permissions can be assigned for a specific namespace (Role) or across the entire cluster (ClusterRole). Then using RoleBindings and ClusterRoleBindings, those permissions can be attached to particular users. A sketch of a Role and RoleBinding follows this list.
  - *Attribute-based access control* ([ABAC](/docs/reference/access-authn-authz/abac/)): Lets you create policies based on resource attributes in the cluster and will allow or deny access based on those attributes. Each line of a policy file identifies versioning properties (apiVersion and kind) and a map of spec properties to match the subject (user or group), resource property, non-resource property (/version or /apis), and readonly. See [Examples](/docs/reference/access-authn-authz/abac/#examples) for details.

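To make the RBAC model concrete, here is a minimal sketch granting one user
read-only access to Pods in a single namespace; the user name `jane` and
namespace `team-a` are hypothetical:

```yaml
# A namespaced Role listing the allowed verbs on a resource, plus a
# RoleBinding attaching it to one (hypothetical) authenticated user.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-a            # hypothetical namespace
  name: pod-reader
rules:
- apiGroups: [""]              # "" means the core API group
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: team-a
subjects:
- kind: User
  name: jane                   # hypothetical user from your authentication method
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

An ABAC policy file, by contrast, holds one JSON object per line; a line such as
`{"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "jane", "namespace": "team-a", "resource": "pods", "readonly": true}}`
expresses a similar read-only grant.
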
As someone setting up authentication and authorization on your production Kubernetes cluster, here are some things to consider:

- *Set the authorization mode*: When the Kubernetes API server
  ([kube-apiserver](/docs/reference/command-line-tools-reference/kube-apiserver/))
  starts, the supported authorization modes must be set using the *--authorization-mode*
  flag. For example, that flag in the *kube-apiserver.yaml* file (in */etc/kubernetes/manifests*)
  could be set to Node,RBAC. This would allow Node and RBAC authorization for authenticated
  requests; a manifest fragment illustrating this flag appears after this list.
- *Create user certificates and role bindings (RBAC)*: If you are using RBAC
  authorization, users can create a CertificateSigningRequest (CSR) that can be
  signed by the cluster CA. Then you can bind Roles and ClusterRoles to each user.
  See [Certificate Signing Requests](/docs/reference/access-authn-authz/certificate-signing-requests/)
  for details.
- *Create policies that combine attributes (ABAC)*: If you are using ABAC
  authorization, you can assign combinations of attributes to form policies to
  authorize selected users or groups to access particular resources (such as a
  pod), namespace, or apiGroup. For more information, see
  [Examples](/docs/reference/access-authn-authz/abac/#examples).
- *Consider Admission Controllers*: Additional forms of authorization for
  requests that can come in through the API server include
  [Webhook Token Authentication](/docs/reference/access-authn-authz/authentication/#webhook-token-authentication).
  Webhooks and other special authorization types need to be enabled by adding
  [Admission Controllers](/docs/reference/access-authn-authz/admission-controllers/)
  to the API server.

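As a minimal sketch of the authorization-mode setting above, the relevant
fragment of a kubeadm-generated static pod manifest might look like this
(most fields are omitted, and the image version is an assumption):

```yaml
# Fragment of /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm layout);
# only the authorization-related flag is shown in this sketch.
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: k8s.gcr.io/kube-apiserver:v1.21.0   # version is an assumption
    command:
    - kube-apiserver
    # Evaluate Node authorization first, then RBAC, for each request.
    - --authorization-mode=Node,RBAC
    # ...the many other required flags are omitted from this sketch...
```
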
## Set limits on workload resources

Demands from production workloads can cause pressure both inside and outside
of the Kubernetes control plane. Consider these items when setting up for the
needs of your cluster's workloads:

- *Set namespace limits*: Set per-namespace quotas on things like memory and CPU
  (a quota sketch follows this list). See
  [Manage Memory, CPU, and API Resources](/docs/tasks/administer-cluster/manage-resources/)
  for details. You can also set
  [Hierarchical Namespaces](/blog/2020/08/14/introducing-hierarchical-namespaces/)
  for inheriting limits.
- *Prepare for DNS demand*: If you expect workloads to massively scale up,
  your DNS service must be ready to scale up as well. See
  [Autoscale the DNS service in a Cluster](/docs/tasks/administer-cluster/dns-horizontal-autoscaling/).
- *Create additional service accounts*: User accounts determine what users can
  do on a cluster, while a service account defines pod access within a particular
  namespace. By default, a pod takes on the default service account from its namespace.
  See [Managing Service Accounts](/docs/reference/access-authn-authz/service-accounts-admin/)
  for information on creating a new service account. For example, you might want to:
  - Add secrets that a pod could use to pull images from a particular container registry. See [Configure Service Accounts for Pods](/docs/tasks/configure-pod-container/configure-service-account/) for an example.
  - Assign RBAC permissions to a service account. See [ServiceAccount permissions](/docs/reference/access-authn-authz/rbac/#service-account-permissions) for details.

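A minimal per-namespace quota sketch; the namespace name and the limits are
hypothetical values you would tune for your own workloads:

```yaml
# A ResourceQuota capping aggregate CPU and memory requests/limits
# across all pods in one (hypothetical) namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-quota
  namespace: team-a        # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```

Once the quota is in place, the API server rejects any new pod in `team-a` whose
aggregate requests or limits would exceed these totals; pods in that namespace
must therefore declare requests and limits (a LimitRange can supply defaults).
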
## What's next {#what-s-next}

- Decide if you want to build your own production Kubernetes or obtain one from
  available [Turnkey Cloud Solutions](/docs/setup/production-environment/turnkey-solutions/)
  or [Kubernetes Partners](https://kubernetes.io/partners/).
- If you choose to build your own cluster, plan how you want to
  handle [certificates](/docs/setup/best-practices/certificates/)
  and set up high availability for features such as
  [etcd](/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/)
  and the
  [API server](/docs/setup/production-environment/tools/kubeadm/ha-topology/).
- Choose from [kubeadm](/docs/setup/production-environment/tools/kubeadm/), [kops](/docs/setup/production-environment/tools/kops/), or [Kubespray](/docs/setup/production-environment/tools/kubespray/)
  deployment methods.
- Configure user management by determining your
  [Authentication](/docs/reference/access-authn-authz/authentication/) and
  [Authorization](/docs/reference/access-authn-authz/authorization/) methods.
- Prepare for application workloads by setting up
  [resource limits](/docs/tasks/administer-cluster/manage-resources/),
  [DNS autoscaling](/docs/tasks/administer-cluster/dns-horizontal-autoscaling/),
  and [service accounts](/docs/reference/access-authn-authz/service-accounts-admin/).
