Commit f45a39b

Merge pull request ceph#53988 from ljflores/wip-read-balancer-mgr-module
2 parents 2218c35 + cb10c0d commit f45a39b

12 files changed, 403 additions and 28 deletions

PendingReleaseNotes

Lines changed: 4 additions & 0 deletions
@@ -113,6 +113,10 @@ CephFS: Disallow delegating preallocated inode ranges to clients. Config
 * RBD: The ``try-netlink`` mapping option for rbd-nbd has become the default
   and is now deprecated. If the NBD netlink interface is not supported by the
   kernel, then the mapping is retried using the legacy ioctl interface.
+* RADOS: Read balancing may now be managed automatically via the balancer
+  manager module. Users may choose between two new modes: ``upmap-read``, which
+  offers upmap and read optimization simultaneously, or ``read``, which may be used
+  to only optimize reads. For more detailed information see https://docs.ceph.com/en/latest/rados/operations/read-balancer/#online-optimization.
 
 >=18.0.0

doc/dev/balancer-design.rst

Lines changed: 0 additions & 1 deletion
@@ -55,4 +55,3 @@ Plans for the Next Version
 --------------------------
 
 1. Improve behavior for heterogeneous OSDs in a pool
-2. Offer read balancing as an online option to the balancer manager module

doc/rados/operations/balancer.rst

Lines changed: 42 additions & 6 deletions
@@ -21,9 +21,9 @@ To check the current status of the balancer, run the following command:
 Automatic balancing
 -------------------
 
-When the balancer is in ``upmap`` mode, the automatic balancing feature is
-enabled by default. For more details, see :ref:`upmap`. To disable the
-balancer, run the following command:
+When the balancer is in ``upmap`` mode, which is the default, the automatic
+upmap balancing feature is enabled. For more details, see :ref:`upmap`.
+To disable the balancer, run the following command:
 
 .. prompt:: bash $

@@ -34,6 +34,10 @@ The balancer mode can be changed from ``upmap`` mode to ``crush-compat`` mode.
 ``crush-compat`` mode, the balancer automatically makes small changes to the
 data distribution in order to ensure that OSDs are utilized equally.
 
+Additional modes include ``upmap-read`` and ``read``. ``upmap-read`` mode
+combines the upmap balancer with the read balancer so that both writes
+and reads are optimized. ``read`` mode can be used when only read optimization
+is desired. For more details, see :ref:`read_balancer`.
 
 Throttling
 ----------
@@ -102,7 +106,7 @@ and then run the following command:
 Modes
 -----
 
-There are two supported balancer modes:
+There are four supported balancer modes:
 
 #. **crush-compat**. This mode uses the compat weight-set feature (introduced
    in Luminous) to manage an alternative set of weights for devices in the
@@ -135,13 +139,45 @@ There are two supported balancer modes:
 
    To use ``upmap``, all clients must be Luminous or newer.
 
-The default mode is ``upmap``. The mode can be changed to ``crush-compat`` by
-running the following command:
+#. **read**. In Reef and later releases, the OSDMap can store explicit
+   mappings for individual primary OSDs as exceptions to the normal CRUSH
+   placement calculation. These ``pg-upmap-primary`` entries provide fine-grained
+   control over primary PG mappings. This mode optimizes the placement of individual
+   primary PGs in order to achieve balanced reads, or primary PGs, in a cluster.
+   In ``read`` mode, upmap behavior is not exercised, so this mode is best for
+   use cases in which only read balancing is desired.
+
+   To use ``pg-upmap-primary``, all clients must be Reef or newer. For more
+   details about client compatibility, see :ref:`read_balancer`.
+
+#. **upmap-read**. This balancer mode combines optimization benefits of
+   both ``upmap`` and ``read`` mode. As in ``read`` mode, ``upmap-read``
+   makes use of ``pg-upmap-primary``. As such, only Reef and later clients
+   are compatible. For more details about client compatibility, see
+   :ref:`read_balancer`.
+
+   ``upmap-read`` is highly recommended for achieving the ``upmap`` mode's
+   offering of balanced PG distribution as well as the ``read`` mode's
+   offering of balanced reads.
+
+The default mode is ``upmap``. The mode can be changed to ``crush-compat`` by running the following command:
 
 .. prompt:: bash $
 
    ceph balancer mode crush-compat
 
+The mode can be changed to ``read`` by running the following command:
+
+.. prompt:: bash $
+
+   ceph balancer mode read
+
+The mode can be changed to ``upmap-read`` by running the following command:
+
+.. prompt:: bash $
+
+   ceph balancer mode upmap-read
+
 Supervised optimization
 -----------------------
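
To make the goal of ``read`` mode described in the hunk above more concrete, here is a minimal Python sketch. It is a toy heuristic and not Ceph's actual balancing algorithm: it counts how many PGs each OSD leads as primary and greedily hands the primary role to a less-loaded OSD in the same acting set. The PG ids, OSD ids, and the names ``primary_counts`` and ``greedy_primary_remap`` are hypothetical; the returned dict merely stands in for ``pg-upmap-primary`` entries.

```python
from collections import Counter


def primary_counts(acting, osds):
    """Count how many PGs each OSD serves as primary (first OSD in the acting set)."""
    counts = Counter({osd: 0 for osd in osds})
    for osd_list in acting.values():
        counts[osd_list[0]] += 1
    return counts


def greedy_primary_remap(acting):
    """Return hypothetical {pg: new_primary} entries that even out primaries.

    Toy heuristic only: for each PG, hand the primary role to the least-loaded
    OSD in that PG's acting set if the current primary leads it by more than one.
    """
    osds = sorted({osd for osd_list in acting.values() for osd in osd_list})
    counts = primary_counts(acting, osds)
    remap = {}
    for pg, osd_list in sorted(acting.items()):
        current = osd_list[0]
        # Ties are broken by OSD id so the result is deterministic.
        candidate = min(osd_list, key=lambda o: (counts[o], o))
        if counts[current] - counts[candidate] > 1:
            counts[current] -= 1
            counts[candidate] += 1
            remap[pg] = candidate
    return remap


# Hypothetical pool: four PGs on three OSDs, all initially led by osd.0.
acting = {
    "1.0": [0, 1, 2],
    "1.1": [0, 2, 1],
    "1.2": [0, 1, 2],
    "1.3": [0, 2, 1],
}
remap = greedy_primary_remap(acting)
```

In this made-up pool, osd.0 starts with four primaries and ends with two, while osd.1 and osd.2 each take one. Data placement is untouched because a new primary is always chosen from the PG's existing acting set, which is the property that distinguishes primary balancing from ``upmap`` balancing.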

doc/rados/operations/read-balancer.rst

Lines changed: 44 additions & 3 deletions
@@ -17,9 +17,50 @@ you may want to try improving your read performance with the read balancer.
 Online Optimization
 ===================
 
-At present, there is no online option for the read balancer. However, we plan to add
-the read balancer as an option to the :ref:`balancer` in the next Ceph version
-so it can be enabled to run automatically in the background like the upmap balancer.
+Enabling
+--------
+
+To enable automatic read balancing, you must turn on the *balancer module*
+(enabled by default in new clusters) and set the mode to ``read`` or ``upmap-read``:
+
+.. prompt:: bash $
+
+   ceph balancer on
+   ceph balancer mode <read|upmap-read>
+
+Both ``read`` and ``upmap-read`` mode make use of ``pg-upmap-primary``. In order
+to use ``pg-upmap-primary``, the cluster cannot have any pre-Reef clients.
+
+If you want to use a different balancer or if you want to make your
+own custom ``pg-upmap-primary`` entries, you might want to turn off the balancer in
+order to avoid conflict:
+
+.. prompt:: bash $
+
+   ceph balancer off
+
+To allow use of the new feature on an existing cluster, you must restrict the
+cluster to supporting only Reef (and newer) clients. To do so, run the
+following command:
+
+.. prompt:: bash $
+
+   ceph osd set-require-min-compat-client reef
+
+This command will fail if any pre-Reef clients or daemons are connected to
+the monitors. To see which client versions are in use, run the following
+command:
+
+.. prompt:: bash $
+
+   ceph features
+
+Balancer Module
+---------------
+
+The `balancer` module for ``ceph-mgr`` will automatically balance the number of
+primary PGs per OSD if set to ``read`` or ``upmap-read`` mode. See :ref:`balancer`
+for more information.
 
 Offline Optimization
 ====================
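
The documentation added above says that ``ceph osd set-require-min-compat-client reef`` fails while pre-Reef clients or daemons are connected, and points to ``ceph features`` for finding them. The Python sketch below shows one way such a check might be scripted. It assumes the command's output has already been parsed into a dict shaped like ``sample``; that shape is modeled on typical output and is an assumption, as are the sample counts and the helper name ``pre_reef_entries``.

```python
# Ceph release names in chronological order (Luminous onward).
RELEASES = ["luminous", "mimic", "nautilus", "octopus",
            "pacific", "quincy", "reef", "squid"]


def pre_reef_entries(features):
    """Return (group, release, num) tuples for connected entities older than Reef."""
    reef = RELEASES.index("reef")
    old = []
    for group, entries in features.items():
        for entry in entries:
            release = entry["release"]
            # Release names missing from the list are conservatively treated as pre-Reef.
            if release not in RELEASES or RELEASES.index(release) < reef:
                old.append((group, release, entry["num"]))
    return old


# Hypothetical parsed output; real output also carries a feature bitmask per entry.
sample = {
    "mon": [{"release": "reef", "num": 3}],
    "client": [
        {"release": "luminous", "num": 2},
        {"release": "reef", "num": 5},
    ],
}
blockers = pre_reef_entries(sample)
```

With this sample, ``blockers`` contains the two Luminous clients, which would need to be upgraded or disconnected before the minimum client release can be raised.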
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+tasks:
+- exec:
+    mon.a:
+      - ceph config set mgr mgr/balancer/log_level debug
+      - ceph balancer status
+      - ceph osd set-require-min-compat-client reef
+      - ceph balancer mode read
+      - ceph balancer on
+      - ceph balancer status
+      - ceph balancer status detail
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+tasks:
+- exec:
+    mon.a:
+      - ceph config set mgr mgr/balancer/log_level debug
+      - ceph balancer status
+      - ceph osd set-require-min-compat-client reef
+      - ceph balancer mode upmap-read
+      - ceph balancer on
+      - ceph balancer status
+      - ceph balancer status detail
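
The two test fragments above differ only in the balancer mode they select. As a rough illustration, the Python sketch below generates the same command sequence for either mode. The helper name ``balancer_setup_commands`` is hypothetical and not part of the commit; the command strings are copied from the fragments.

```python
VALID_READ_MODES = ("read", "upmap-read")


def balancer_setup_commands(mode):
    """Return, in order, the CLI commands that the test fragments run for `mode`."""
    if mode not in VALID_READ_MODES:
        raise ValueError(f"expected one of {VALID_READ_MODES}, got {mode!r}")
    return [
        "ceph config set mgr mgr/balancer/log_level debug",
        "ceph balancer status",
        # The client floor is raised before the mode is selected,
        # since both modes rely on pg-upmap-primary.
        "ceph osd set-require-min-compat-client reef",
        f"ceph balancer mode {mode}",
        "ceph balancer on",
        "ceph balancer status",
        "ceph balancer status detail",
    ]


commands = balancer_setup_commands("upmap-read")
```

The ordering mirrors the fragments: the minimum client release is set to Reef first, then the mode is chosen, and only then is the balancer switched on.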

src/mgr/PyOSDMap.cc

Lines changed: 32 additions & 0 deletions
@@ -162,6 +162,36 @@ static PyObject *osdmap_calc_pg_upmaps(BasePyOSDMap* self, PyObject *args)
   return PyLong_FromLong(r);
 }
 
+static PyObject *osdmap_balance_primaries(BasePyOSDMap* self, PyObject *args)
+{
+  int pool_id;
+  BasePyOSDMapIncremental *incobj;
+  if (!PyArg_ParseTuple(args, "iO:balance_primaries",
+                        &pool_id, &incobj)) {
+    return nullptr;
+  }
+  auto check_pool = self->osdmap->get_pg_pool(pool_id);
+  if (!check_pool) {
+    derr << __func__ << " pool '" << pool_id
+         << "' does not exist" << dendl;
+    return nullptr;
+  }
+  dout(10) << __func__ << " osdmap " << self->osdmap
+           << " pool_id " << pool_id
+           << " inc " << incobj->inc
+           << dendl;
+  PyThreadState *tstate = PyEval_SaveThread();
+  OSDMap tmp_osd_map;
+  tmp_osd_map.deepish_copy_from(*(self->osdmap));
+  int r = self->osdmap->balance_primaries(g_ceph_context,
+                                          pool_id,
+                                          incobj->inc,
+                                          tmp_osd_map);
+  PyEval_RestoreThread(tstate);
+  dout(10) << __func__ << " r = " << r << dendl;
+  return PyLong_FromLong(r);
+}
+
 static PyObject *osdmap_map_pool_pgs_up(BasePyOSDMap* self, PyObject *args)
 {
   int poolid;
@@ -324,6 +354,8 @@ PyMethodDef BasePyOSDMap_methods[] = {
    "Get pools that have CRUSH rules that TAKE the given root"},
   {"_calc_pg_upmaps", (PyCFunction)osdmap_calc_pg_upmaps, METH_VARARGS,
    "Calculate new pg-upmap values"},
+  {"_balance_primaries", (PyCFunction)osdmap_balance_primaries, METH_VARARGS,
+   "Calculate new pg-upmap-primary values"},
   {"_map_pool_pgs_up", (PyCFunction)osdmap_map_pool_pgs_up, METH_VARARGS,
    "Calculate up set mappings for all PGs in a pool"},
   {"_pg_to_up_acting_osds", (PyCFunction)osdmap_pg_to_up_acting_osds, METH_VARARGS,

src/osd/OSDMap.cc

Lines changed: 6 additions & 0 deletions
@@ -5143,6 +5143,12 @@ int OSDMap::balance_primaries(
         num_changes++;
       }
     }
+  } else { // clear out any mappings that were made since the score didn't improve
+    for (auto [pg, mapped] : prim_pgs_to_check) {
+      if (mapped) {
+        pending_inc->new_pg_upmap_primary.erase(pg);
+      }
+    }
   }
 
   ldout(cct, 10) << __func__ << " num_changes " << num_changes << dendl;
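
The ``else`` branch added above discards mappings made during the pass when the score did not improve. The Python sketch below illustrates that rollback pattern in isolation. It is not a translation of ``OSDMap::balance_primaries``: the function name ``finalize_pass`` is hypothetical, the plain dict stands in for ``pending_inc->new_pg_upmap_primary``, and the sketch assumes that a lower score is better.

```python
def finalize_pass(new_pg_upmap_primary, prim_pgs_to_check, score_before, score_after):
    """Keep this pass's tentative mappings only if the score improved.

    new_pg_upmap_primary -- dict of pending {pg: primary_osd} entries
    prim_pgs_to_check    -- {pg: mapped}; mapped is True if this pass added an entry
    Returns the number of changes kept (0 after a rollback).
    """
    if score_after < score_before:  # assumption: lower score means better balance
        return sum(1 for mapped in prim_pgs_to_check.values() if mapped)
    # No improvement: erase only the entries this pass created.
    for pg, mapped in prim_pgs_to_check.items():
        if mapped:
            new_pg_upmap_primary.pop(pg, None)
    return 0


# Hypothetical state: "2.5" was pending before this pass and must survive a rollback.
pending = {"1.0": 1, "1.1": 2, "2.5": 0}
checked = {"1.0": True, "1.1": True, "1.2": False}
kept = finalize_pass(pending, checked, score_before=1.5, score_after=1.5)
```

Because the score is unchanged in this example, both entries created in the pass are erased and only the unrelated ``"2.5"`` entry remains. Tracking a per-PG ``mapped`` flag is what lets the rollback leave other pending entries alone.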
