
Commit ee8dec9

Merge pull request #207 from Yelp/mpiano/node_migration_docs
2 parents c231584 + ee5b960 commit ee8dec9

File tree: 4 files changed (+170 −1 lines)

docs/source/configuration.rst

Lines changed: 25 additions & 0 deletions
@@ -67,6 +67,14 @@ The following is an example configuration file for the core Clusterman service a
         # How frequently the batch should run to collect metrics.
         run_interval_seconds: 60
 
+    node_migration:
+        # Maximum number of worker processes the batch can spawn
+        # (every worker can handle a single migration for a pool)
+        max_worker_processes: 6
+
+        # How frequently the batch should check for migration triggers.
+        run_interval_seconds: 60
+
 clusters:
     cluster-name:
         aws_region: us-west-2
@@ -153,6 +161,18 @@ The following is an example configuration file for a particular Clusterman pool:
         - paramA: 'typeA'
         - paramB: 10
 
+node_migration:
+    trigger:
+        max_uptime: 90d
+        event: true
+    strategy:
+        rate: 5
+        prescaling: '2%'
+        precedence: highest_uptime
+        bootstrap_wait: 5m
+        bootstrap_timeout: 15m
+    disable_autoscaling: false
+    expected_duration: 2h
 
 The ``resource-groups`` section provides information for loading resource groups in the pool manager.

@@ -167,6 +187,11 @@ not present, then the ``autoscale_signal`` from the service configuration will b
 For required metrics, there can be any number of sections, each defining one desired metric. The metric type must be
 one of :ref:`metric_types`.
 
+The ``node_migration`` section contains settings controlling how Clusterman should recycle nodes
+inside the pool. Enabling this configuration is useful for keeping the average uptime of your pool low and/or
+for performing ad-hoc migrations of the nodes according to some conditional parameter.
+See :ref:`node_migration_configuration` for full details.
+
 Reloading
 ---------
 The Clusterman batches will automatically reload on changes to the clusterman service config file and the AWS

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ and simulate how changes to autoscaling logic will impact the cost and performan
    manage
    simulator
    tools
+   node_migration
 
 
 .. toctree::

docs/source/node_migration.rst

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
Node Migration
==============

*Node Migration* is a feature that allows Clusterman to recycle the nodes of a pool
according to various criteria, in order to reduce the amount of manual work necessary
when performing infrastructure migrations.

**NOTE**: this is only compatible with Kubernetes clusters.


Node Migration Batch
--------------------

The *Node Migration batch* is the entrypoint of the migration logic. It takes care of fetching migration trigger
events, spawning the worker processes that actually perform the node recycling procedures, and monitoring their health.

Batch-specific configuration values are described as part of the main service configuration in :ref:`service_configuration`;
a reference sketch is shown below.

The batch code can be invoked from the ``clusterman.batch.node_migration`` Python module.
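
For reference, a minimal sketch of the batch settings as they appear in the example service configuration of this commit (see :ref:`service_configuration` for the surrounding context):

.. code-block:: yaml

    node_migration:
        # Maximum number of worker processes the batch can spawn
        # (every worker can handle a single migration for a pool)
        max_worker_processes: 6

        # How frequently the batch should check for migration triggers.
        run_interval_seconds: 60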


.. _node_migration_configuration:

Pool Configuration
------------------

The behaviour of the migration logic for a pool is controlled by the ``node_migration`` section of the pool configuration.
The allowed values for the migration settings are as follows:

* ``trigger``:

  * ``max_uptime``: if set, monitor nodes' uptime to ensure it stays lower than the provided value; human-readable time string (e.g. 30d).
  * ``event``: if set to ``true``, accept async migration triggers for this pool; details about event triggers are described below in :ref:`node_migration_trigger`.

* ``strategy``:

  * ``rate``: rate at which nodes are selected for termination; percentage or absolute value (required).
  * ``prescaling``: if set, pool size is increased by this amount before performing node recycling; percentage or absolute value (0 by default).
  * ``precedence``: precedence with which nodes are selected for termination; ``highest_uptime`` or ``lowest_task_count`` (uptime by default).
  * ``bootstrap_wait``: indicative time necessary for a node to be ready to run workloads after boot; human-readable time string (3 minutes by default).
  * ``bootstrap_timeout``: maximum wait for nodes to be ready after boot; human-readable time string (10 minutes by default).

* ``disable_autoscaling``: turn off the autoscaler while recycling instances (false by default).

* ``expected_duration``: estimated duration for the migration of the whole pool; human-readable time string (1 day by default).

See :ref:`pool_configuration` for what an example configuration block looks like; the relevant block is also sketched below.
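
A minimal sketch of such a block, using the illustrative values from the example pool configuration in this commit:

.. code-block:: yaml

    node_migration:
        trigger:
            max_uptime: 90d
            event: true
        strategy:
            rate: 5
            prescaling: '2%'
            precedence: highest_uptime
            bootstrap_wait: 5m
            bootstrap_timeout: 15m
        disable_autoscaling: false
        expected_duration: 2h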


.. _node_migration_trigger:

Migration Event Trigger
-----------------------

Migration trigger events are submitted as Kubernetes custom resources of type ``nodemigration``.
They can be easily generated and submitted by using the ``clusterman migrate`` CLI sub-command and its related options.
The manifest for the custom resource definition is as follows:


.. code-block:: yaml

    ---
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: nodemigrations.clusterman.yelp.com
    spec:
      scope: Cluster
      group: clusterman.yelp.com
      names:
        plural: nodemigrations
        singular: nodemigration
        kind: NodeMigration
      versions:
        - name: v1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              required:
                - spec
              properties:
                spec:
                  type: object
                  required:
                    - cluster
                    - pool
                    - condition
                  properties:
                    cluster:
                      type: string
                    pool:
                      type: string
                    label_selectors:
                      type: array
                      items:
                        type: string
                    condition:
                      type: object
                      properties:
                        trait:
                          type: string
                          enum: [kernel, lsbrelease, instance_type, uptime]
                        target:
                          type: string
                        operator:
                          type: string
                          enum: [gt, ge, eq, ne, lt, le, in, notin]


In more readable terms, an example resource manifest would look like the following:

.. code-block:: yaml

    ---
    apiVersion: "clusterman.yelp.com/v1"
    kind: NodeMigration
    metadata:
      name: my-test-migration-220912
      labels:
        clusterman.yelp.com/migration_status: pending
    spec:
      cluster: kubestage
      pool: default
      condition:
        trait: uptime
        operator: lt
        target: 90d


The fields in each migration event control which nodes are affected by the event
and the desired final state for them. More specifically:

* ``cluster``: name of the cluster to be targeted.
* ``pool``: name of the pool to be targeted.
* ``label_selectors``: list of additional Kubernetes label selectors to filter affected nodes.
* ``condition``: the desired final state for the nodes, e.g. all nodes must have a kernel version higher than X.

  * ``trait``: metadata to be compared; currently supports ``kernel``, ``lsbrelease``, ``instance_type``, or ``uptime``.
  * ``operator``: comparison operator; supports ``gt``, ``ge``, ``eq``, ``ne``, ``lt``, ``le``, ``in``, ``notin``.
  * ``target``: right side of the comparison expression, e.g. a kernel version or an instance type;
    may be a single string or a comma-separated list when using the ``in`` / ``notin`` operators (see the example below).
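
As an illustration of the list form, a hypothetical manifest requiring all nodes of a pool to run one of a set of instance types could use the ``in`` operator with a comma-separated target (the metadata name, label selector, and instance types below are made-up examples):

.. code-block:: yaml

    ---
    apiVersion: "clusterman.yelp.com/v1"
    kind: NodeMigration
    metadata:
      name: my-instance-type-migration
      labels:
        clusterman.yelp.com/migration_status: pending
    spec:
      cluster: kubestage
      pool: default
      label_selectors:
        - clusterman.yelp.com/pool=default
      condition:
        trait: instance_type
        operator: in
        target: m5.4xlarge,r5.2xlarge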

tox.ini

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ commands =
 [testenv:docs]
 envdir = .tox/docs
 deps =
-    -rrequirements-doc.txt
+    -rrequirements-docs.txt
 changedir = docs
 commands =
     sphinx-build -b html -d build/doctrees source build/html
