---
title: "Production environment"
description: Create a production-quality Kubernetes cluster
weight: 30
no_list: true
---

<!-- overview -->

A production-quality Kubernetes cluster requires planning and preparation.
If your Kubernetes cluster is to run critical workloads, it must be configured to be resilient.
This page explains steps you can take to set up a production-ready cluster,
or to promote an existing cluster for production use.
If you're already familiar with production setup and want the links, skip to
[What's next](#what-s-next).

<!-- body -->

## Production considerations

Typically, a production Kubernetes cluster environment has more requirements than a
personal learning, development, or test environment. A production environment may require
secure access by many users, consistent availability, and the resources to adapt
to changing demands.

As you decide where you want your production Kubernetes environment to live
(on premises or in a cloud) and the amount of management you want to take
on or hand to others, consider how your requirements for a Kubernetes cluster
are influenced by the following issues:

- *Availability*: A single-machine Kubernetes [learning environment](/docs/setup/#learning-environment)
has a single point of failure. Creating a highly available cluster means considering:
  - Separating the control plane from the worker nodes.
  - Replicating the control plane components on multiple nodes.
  - Load balancing traffic to the cluster’s {{< glossary_tooltip term_id="kube-apiserver" text="API server" >}}.
  - Having enough worker nodes available, or able to quickly become available, as changing workloads warrant it.

- *Scale*: If you expect your production Kubernetes environment to receive a stable amount of
demand, you might be able to set up for the capacity you need and be done. However,
if you expect demand to grow over time or change dramatically based on things like
season or special events, you need to plan how to scale to relieve increased
pressure from more requests to the control plane and worker nodes or scale down to reduce unused
resources.

- *Security and access management*: You have full admin privileges on your own
Kubernetes learning cluster. But shared clusters with important workloads, and
more than one or two users, require a more refined approach to who and what can
access cluster resources. You can use role-based access control
([RBAC](/docs/reference/access-authn-authz/rbac/)) and other
security mechanisms to make sure that users and workloads can get access to the
resources they need, while keeping workloads, and the cluster itself, secure.
You can set limits on the resources that users and workloads can access
by managing [policies](https://kubernetes.io/docs/concepts/policy/) and
[container resources](/docs/concepts/configuration/manage-resources-containers/).

Before building a Kubernetes production environment on your own, consider
handing off some or all of this job to
[Turnkey Cloud Solutions](/docs/setup/production-environment/turnkey-solutions/)
providers or other [Kubernetes Partners](https://kubernetes.io/partners/).
Options include:

- *Serverless*: Just run workloads on third-party equipment without managing
a cluster at all. You will be charged for things like CPU usage, memory, and
disk requests.
- *Managed control plane*: Let the provider manage the scale and availability
of the cluster's control plane, as well as handle patches and upgrades.
- *Managed worker nodes*: Configure pools of nodes to meet your needs,
then the provider makes sure those nodes are available and ready to implement
upgrades when needed.
- *Integration*: There are providers that integrate Kubernetes with other
services you may need, such as storage, container registries, authentication
methods, and development tools.

Whether you build a production Kubernetes cluster yourself or work with
partners, review the following sections to evaluate your needs as they relate
to your cluster’s *control plane*, *worker nodes*, *user access*, and
*workload resources*.

## Production cluster setup

In a production-quality Kubernetes cluster, the control plane manages the
cluster from services that can be spread across multiple computers
in different ways. Each worker node, however, represents a single entity that
is configured to run Kubernetes pods.

### Production control plane

The simplest Kubernetes cluster has the entire control plane and worker node
services running on the same machine. You can grow that environment by adding
worker nodes, as illustrated in the diagram in
[Kubernetes Components](/docs/concepts/overview/components/).
If the cluster is meant to be available for a short period of time, or can be
discarded if something goes seriously wrong, this might meet your needs.

If you need a more permanent, highly available cluster, however, you should
consider ways of extending the control plane. By design, control plane
services running on a single machine are not highly available.
If keeping the cluster up and running
and ensuring that it can be repaired if something goes wrong is important,
consider these steps:

- *Choose deployment tools*: You can deploy a control plane using tools such
as kubeadm, kops, and kubespray. See
[Installing Kubernetes with deployment tools](/docs/setup/production-environment/tools/)
to learn tips for production-quality deployments using each of those deployment
methods. Different [Container Runtimes](/docs/setup/production-environment/container-runtimes/)
are available to use with your deployments.
- *Manage certificates*: Secure communications between control plane services
are implemented using certificates. Certificates are automatically generated
during deployment or you can generate them using your own certificate authority.
See [PKI certificates and requirements](/docs/setup/best-practices/certificates/) for details.
- *Configure load balancer for apiserver*: Configure a load balancer
to distribute external API requests to the apiserver service instances running on different nodes. See
[Create an External Load Balancer](/docs/tasks/access-application-cluster/create-external-load-balancer/)
for details, and see the configuration sketch after this list.
- *Separate and backup etcd service*: The etcd services can either run on the
same machines as other control plane services or run on separate machines, for
extra security and availability. Because etcd stores cluster configuration data,
backing up the etcd database should be done regularly to ensure that you can
repair that database if needed; a sample backup sketch appears at the end of this section.
See the [etcd FAQ](https://etcd.io/docs/v3.4/faq/) for details on configuring and using etcd.
See [Operating etcd clusters for Kubernetes](/docs/tasks/administer-cluster/configure-upgrade-etcd/)
and [Set up a High Availability etcd cluster with kubeadm](/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/)
for details.
- *Create multiple control plane systems*: For high availability, the
control plane should not be limited to a single machine. If the control plane
services are run by an init service (such as systemd), each service should run on at
least three machines. However, running control plane services as pods in
Kubernetes ensures that the replicated number of services that you request
will always be available.
The scheduler should be fault tolerant,
but not highly available. Some deployment tools set up the [Raft](https://raft.github.io/)
consensus algorithm to do leader election of Kubernetes services. If the
primary goes away, another service elects itself and takes over.
- *Span multiple zones*: If keeping your cluster available at all times is
critical, consider creating a cluster that runs across multiple data centers,
referred to as zones in cloud environments. Groups of zones are referred to as regions.
Spreading a cluster across
multiple zones in the same region improves the chances that your
cluster will continue to function even if one zone becomes unavailable.
See [Running in multiple zones](/docs/setup/best-practices/multiple-zones/) for details.
- *Manage on-going features*: If you plan to keep your cluster over time,
there are tasks you need to do to maintain its health and security. For example,
if you installed with kubeadm, there are instructions to help you with
[Certificate Management](/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/)
and [Upgrading kubeadm clusters](/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/).
See [Administer a Cluster](/docs/tasks/administer-cluster/)
for a longer list of Kubernetes administrative tasks.

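As a concrete illustration of the load balancer and multiple control plane items above, here is a minimal sketch of a kubeadm configuration that points every control plane member at a shared, load-balanced endpoint. The host name, port, and version shown are assumptions for illustration, not values prescribed by this page:

```yaml
# Minimal sketch: kubeadm init configuration for an HA control plane.
# Assumes a load balancer already answers on k8s-api.example.com:6443
# (a hypothetical name) and forwards to every control plane node.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.0          # assumed version; match your deployment
# Certificates and kubeconfigs are issued for the load-balanced endpoint,
# so the loss of any single control plane node does not break API access.
controlPlaneEndpoint: "k8s-api.example.com:6443"
etcd:
  local:
    dataDir: /var/lib/etcd          # default location for stacked etcd
```

Additional control plane nodes would then join through the same endpoint (`kubeadm join k8s-api.example.com:6443 --control-plane ...`); the kubeadm high-availability guides linked below describe the authoritative procedure.
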
To learn about available options when you run control plane services, see
[kube-apiserver](/docs/reference/command-line-tools-reference/kube-apiserver/),
[kube-controller-manager](/docs/reference/command-line-tools-reference/kube-controller-manager/),
and [kube-scheduler](/docs/reference/command-line-tools-reference/kube-scheduler/)
component pages. For highly available control plane examples, see
[Options for Highly Available topology](/docs/setup/production-environment/tools/kubeadm/ha-topology/),
[Creating Highly Available clusters with kubeadm](/docs/setup/production-environment/tools/kubeadm/high-availability/),
and [Operating etcd clusters for Kubernetes](/docs/tasks/administer-cluster/configure-upgrade-etcd/).
See [Backing up an etcd cluster](/docs/tasks/administer-cluster/configure-upgrade-etcd/#backing-up-an-etcd-cluster)
for information on making an etcd backup plan.
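
One way to put that backup plan on a schedule is a CronJob that runs `etcdctl snapshot save` on a control plane node. The sketch below assumes a kubeadm-style layout (etcd serving on the node's localhost, certificates under */etc/kubernetes/pki/etcd*) and an invented host path for snapshots; treat it as a starting point rather than a drop-in manifest:

```yaml
# Sketch: periodic etcd snapshots taken on a control plane node.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"            # every six hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true          # reach etcd on the node's localhost
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""  # assumed label
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: snapshot
            image: registry.k8s.io/etcd:3.5.9-0        # assumed image tag
            command: ["/bin/sh", "-c"]
            args:
            - >-
              etcdctl --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/server.crt
              --key=/etc/kubernetes/pki/etcd/server.key
              snapshot save /backup/etcd-snapshot-$(date +%Y%m%d%H%M).db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/backups/etcd  # invented path; use durable storage
```

A snapshot written this way can later be restored with `etcdctl snapshot restore`, as described in the etcd guides linked above.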

### Production worker nodes

Production-quality workloads need to be resilient and anything they rely
on needs to be resilient (such as CoreDNS). Whether you manage your own
control plane or have a cloud provider do it for you, you still need to
consider how you want to manage your worker nodes (also referred to
simply as *nodes*).

- *Configure nodes*: Nodes can be physical or virtual machines. If you want to
create and manage your own nodes, you can install a supported operating system,
then add and run the appropriate
[Node services](/docs/concepts/overview/components/#node-components). Consider:
  - The demands of your workloads when you set up nodes by having appropriate memory, CPU, and disk speed and storage capacity available.
  - Whether generic computer systems will do or you have workloads that need GPU processors, Windows nodes, or VM isolation.
- *Validate nodes*: See [Valid node setup](/docs/setup/best-practices/node-conformance/)
for information on how to ensure that a node meets the requirements to join
a Kubernetes cluster.
- *Add nodes to the cluster*: If you are managing your own cluster you can
add nodes by setting up your own machines and either adding them manually or
having them register themselves to the cluster’s apiserver. See the
[Nodes](/docs/concepts/architecture/nodes/) section for information on how to set up Kubernetes to add nodes in these ways, and see the join configuration sketch after this list.
- *Add Windows nodes to the cluster*: Kubernetes offers support for Windows
worker nodes, allowing you to run workloads implemented in Windows containers. See
[Windows in Kubernetes](/docs/setup/production-environment/windows/) for details.
- *Scale nodes*: Have a plan for expanding the capacity your cluster will
eventually need. See [Considerations for large clusters](/docs/setup/best-practices/cluster-large/)
to help determine how many nodes you need, based on the number of pods and
containers you need to run. If you are managing nodes yourself, this can mean
purchasing and installing your own physical equipment.
- *Autoscale nodes*: Most cloud providers support
[Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#readme)
to replace unhealthy nodes or grow and shrink the number of nodes as demand requires. See the
[Frequently Asked Questions](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
for how the autoscaler works and
[Deployment](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#deployment)
for how it is implemented by different cloud providers. For on-premises, there
are some virtualization platforms that can be scripted to spin up new nodes
based on demand.
- *Set up node health checks*: For important workloads, you want to make sure
that the nodes and pods running on those nodes are healthy. Using the
[Node Problem Detector](/docs/tasks/debug-application-cluster/monitor-node-health/)
daemon, you can monitor node health.

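For the node-adding step above, a kubeadm-managed cluster typically registers a worker through a join configuration. The following is a minimal sketch; the endpoint, token, and hash are placeholders for values you would generate yourself (for example with `kubeadm token create --print-join-command` on a control plane node):

```yaml
# Minimal sketch of a kubeadm join configuration for a worker node.
# All values shown are placeholders, not real cluster credentials.
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: "k8s-api.example.com:6443"   # hypothetical endpoint
    token: "abcdef.0123456789abcdef"                # placeholder token
    caCertHashes:
      - "sha256:<hash-of-the-cluster-ca-certificate>"
nodeRegistration:
  name: "worker-01"                                 # assumed node name
  kubeletExtraArgs:
    node-labels: "workload-type=general"            # optional example label
```

Passing this file to `kubeadm join --config <file>` on the new machine registers it with the cluster's apiserver; nodes can also register themselves if the kubelet is started with appropriate credentials, as described in the Nodes page linked above.
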
## Production user management

In production, you may be moving from a model where you or a small group of
people are accessing the cluster to where there may potentially be dozens or
hundreds of people. In a learning environment or platform prototype, you might have a single
administrative account for everything you do. In production, you will want
more accounts with different levels of access to different namespaces.

Taking on a production-quality cluster means deciding how you
want to selectively allow access by other users. In particular, you need to
select strategies for validating the identities of those who try to access your
cluster (authentication) and deciding if they have permissions to do what they
are asking (authorization):

- *Authentication*: The apiserver can authenticate users using client
certificates, bearer tokens, an authenticating proxy, or HTTP basic auth.
You can choose which authentication methods you want to use.
Using plugins, the apiserver can leverage your organization’s existing
authentication methods, such as LDAP or Kerberos. See
[Authentication](/docs/reference/access-authn-authz/authentication/)
for a description of these different methods of authenticating Kubernetes users.
- *Authorization*: When you set out to authorize your regular users, you will probably choose between RBAC and ABAC authorization. See [Authorization Overview](/docs/reference/access-authn-authz/authorization/) to review different modes for authorizing user accounts (as well as service account access to your cluster):
  - *Role-based access control* ([RBAC](/docs/reference/access-authn-authz/rbac/)): Lets you assign access to your cluster by allowing specific sets of permissions to authenticated users. Permissions can be assigned for a specific namespace (Role) or across the entire cluster (ClusterRole). Then using RoleBindings and ClusterRoleBindings, those permissions can be attached to particular users. A small sketch follows this list.
  - *Attribute-based access control* ([ABAC](/docs/reference/access-authn-authz/abac/)): Lets you create policies based on resource attributes in the cluster and will allow or deny access based on those attributes. Each line of a policy file identifies versioning properties (apiVersion and kind) and a map of spec properties to match the subject (user or group), resource property, non-resource property (/version or /apis), and readonly. See [Examples](/docs/reference/access-authn-authz/abac/#examples) for details.

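As a minimal sketch of the RBAC pattern described above, a Role grants a set of permissions within one namespace, and a RoleBinding attaches those permissions to a user. The namespace and user names here are invented for illustration:

```yaml
# Sketch: grant a hypothetical user read-only access to Pods in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: web-team              # assumed namespace
  name: pod-reader
rules:
- apiGroups: [""]                  # "" refers to the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: web-team
subjects:
- kind: User
  name: jane                       # hypothetical authenticated user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```
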
As someone setting up authentication and authorization on your production Kubernetes cluster, here are some things to consider:

- *Set the authorization mode*: When the Kubernetes API server
([kube-apiserver](/docs/reference/command-line-tools-reference/kube-apiserver/))
starts, the supported authorization modes must be set using the *--authorization-mode*
flag. For example, that flag in the *kube-apiserver.yaml* file (in */etc/kubernetes/manifests*)
could be set to Node,RBAC. This would allow Node and RBAC authorization for authenticated requests; see the manifest excerpt after this list.
- *Create user certificates and role bindings (RBAC)*: If you are using RBAC
authorization, users can create a CertificateSigningRequest (CSR) that can be
signed by the cluster CA. Then you can bind Roles and ClusterRoles to each user.
See [Certificate Signing Requests](/docs/reference/access-authn-authz/certificate-signing-requests/)
for details, and see the CSR sketch after this list.
- *Create policies that combine attributes (ABAC)*: If you are using ABAC
authorization, you can assign combinations of attributes to form policies to
authorize selected users or groups to access particular resources (such as a
pod), namespace, or apiGroup. For more information, see
[Examples](/docs/reference/access-authn-authz/abac/#examples).
- *Consider Admission Controllers*: Additional forms of authorization for
requests that can come in through the API server include
[Webhook Token Authentication](/docs/reference/access-authn-authz/authentication/#webhook-token-authentication).
Webhooks and other special authorization types need to be enabled by adding
[Admission Controllers](/docs/reference/access-authn-authz/admission-controllers/)
to the API server.
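
To make the *--authorization-mode* item concrete, here is an abbreviated sketch of what a kubeadm-generated static Pod manifest for the API server might contain. Real manifests carry many more flags; only the settings discussed here are shown, and the image version is an assumption:

```yaml
# Abbreviated sketch of /etc/kubernetes/manifests/kube-apiserver.yaml.
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.28.0    # assumed version
    command:
    - kube-apiserver
    - --authorization-mode=Node,RBAC                 # authorization modes, in order
    - --client-ca-file=/etc/kubernetes/pki/ca.crt    # client certificate authentication
    - --enable-admission-plugins=NodeRestriction     # example admission controller
```

Because the kubelet watches the manifests directory, saving a change to this file restarts the API server with the new flags.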
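
For the user certificate step, the request itself is a CertificateSigningRequest object. This sketch uses a hypothetical user `jane` and a placeholder for the base64-encoded CSR data:

```yaml
# Sketch: ask the cluster CA to sign a client certificate for a user.
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: jane-client-cert                   # hypothetical name
spec:
  request: <base64-encoded-PKCS10-CSR>     # placeholder; generate with openssl
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - client auth
```

After an administrator approves the request (`kubectl certificate approve jane-client-cert`), the signed certificate appears in the object's `status.certificate` field, and Roles or ClusterRoles can be bound to `jane` as shown earlier.
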
## Set limits on workload resources

Demands from production workloads can cause pressure both inside and outside
of the Kubernetes control plane. Consider these items when setting up for the
needs of your cluster's workloads:

- *Set namespace limits*: Set per-namespace quotas on things like memory and CPU. See
[Manage Memory, CPU, and API Resources](/docs/tasks/administer-cluster/manage-resources/)
for details. You can also set
[Hierarchical Namespaces](/blog/2020/08/14/introducing-hierarchical-namespaces/)
for inheriting limits. A quota sketch follows this list.
- *Prepare for DNS demand*: If you expect workloads to massively scale up,
your DNS service must be ready to scale up as well. See
[Autoscale the DNS service in a Cluster](/docs/tasks/administer-cluster/dns-horizontal-autoscaling/).
- *Create additional service accounts*: User accounts determine what users can
do on a cluster, while a service account defines pod access within a particular
namespace. By default, a pod takes on the default service account from its namespace.
See [Managing Service Accounts](/docs/reference/access-authn-authz/service-accounts-admin/)
for information on creating a new service account. For example, you might want to:
  - Add secrets that a pod could use to pull images from a particular container registry. See [Configure Service Accounts for Pods](/docs/tasks/configure-pod-container/configure-service-account/) for an example.
  - Assign RBAC permissions to a service account. See [ServiceAccount permissions](/docs/reference/access-authn-authz/rbac/#service-account-permissions) for details.

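As a sketch of the namespace limits and service account items above (names and numbers are invented for illustration), a ResourceQuota caps what one namespace can consume, and a ServiceAccount carrying an image pull secret lets pods in that namespace pull from a private registry:

```yaml
# Sketch: cap compute consumption in one namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: web-team                # assumed namespace
spec:
  hard:
    requests.cpu: "10"               # total CPU requests across the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"                       # cap on the number of Pods
---
# Sketch: a service account that pulls from a private registry.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer
  namespace: web-team
imagePullSecrets:
- name: private-registry-creds       # hypothetical docker-registry Secret
```

Pods that set `serviceAccountName: ci-deployer` then pull images with that secret automatically, and any resource requests they make count against the quota.
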
## What's next {#what-s-next}

- Decide if you want to build your own production Kubernetes or obtain one from
available [Turnkey Cloud Solutions](/docs/setup/production-environment/turnkey-solutions/)
or [Kubernetes Partners](https://kubernetes.io/partners/).
- If you choose to build your own cluster, plan how you want to
handle [certificates](/docs/setup/best-practices/certificates/)
and set up high availability for features such as
[etcd](/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/)
and the
[API server](/docs/setup/production-environment/tools/kubeadm/ha-topology/).
- Choose from [kubeadm](/docs/setup/production-environment/tools/kubeadm/), [kops](/docs/setup/production-environment/tools/kops/), or [Kubespray](/docs/setup/production-environment/tools/kubespray/)
deployment methods.
- Configure user management by determining your
[Authentication](/docs/reference/access-authn-authz/authentication/) and
[Authorization](/docs/reference/access-authn-authz/authorization/) methods.
- Prepare for application workloads by setting up
[resource limits](/docs/tasks/administer-cluster/manage-resources/),
[DNS autoscaling](/docs/tasks/administer-cluster/dns-horizontal-autoscaling/),
and [service accounts](/docs/reference/access-authn-authz/service-accounts-admin/).