
Commit 46cfee7

Merge pull request ceph#51845 from ssdohammer-sl/doc-update-deduplication

doc/dev: update how to use deduplication

Reviewed-by: Zac Dover <[email protected]>

2 parents 69cfd0e + 9b74a33

File tree

1 file changed: +189 -20 lines


doc/dev/deduplication.rst

Lines changed: 189 additions & 20 deletions
@@ -157,25 +157,46 @@ How to use deduplication
Ceph provides deduplication using RADOS machinery.
Below we explain how to perform deduplication.

Prerequisite
------------

If the Ceph cluster is started from Ceph mainline, users need to check
that the ``ceph-test`` package, which includes ``ceph-dedup-tool``, is
installed.
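
For example, a quick check could look like this (a minimal sketch; the
package-manager query depends on the distribution):

.. code:: bash

   # Confirm the binary is available on the PATH.
   command -v ceph-dedup-tool

   # Or query the package manager directly.
   rpm -q ceph-test     # RPM-based distributions
   dpkg -s ceph-test    # Debian-based distributions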

Detailed Instructions
---------------------

Users can use ceph-dedup-tool with the ``estimate``, ``sample-dedup``,
``chunk-scrub``, and ``chunk-repair`` operations. For the users'
convenience, the necessary operations are exposed through
ceph-dedup-tool, so they can be driven freely from any type of script,
as the sketch below illustrates.
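
For instance, a wrapper script might chain these operations (a minimal
sketch; the pool names and tuning values below are placeholders, not
recommendations):

.. code:: bash

   #!/usr/bin/env bash
   set -e
   BASE_POOL=mypool        # placeholder base pool name
   CHUNK_POOL=mychunkpool  # placeholder chunk pool name

   # Estimate first, then deduplicate, then scrub the references.
   ceph-dedup-tool --op estimate --pool "$BASE_POOL" \
      --chunk-size 8192 --chunk-algorithm fastcdc \
      --fingerprint-algorithm sha1 --max-thread 4

   ceph-dedup-tool --op sample-dedup --pool "$BASE_POOL" \
      --chunk-pool "$CHUNK_POOL" --chunk-size 8192 \
      --chunk-algorithm fastcdc --fingerprint-algorithm sha1 \
      --chunk-dedup-threshold 2 --sampling-ratio 100 --max-thread 4

   ceph-dedup-tool --op chunk-scrub --chunk-pool "$CHUNK_POOL" --max-thread 4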

1. Estimate space saving ratio of a target pool using ``ceph-dedup-tool``.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

   ceph-dedup-tool --op estimate \
      --pool [BASE_POOL] \
      --chunk-size [CHUNK_SIZE] \
      --chunk-algorithm [fixed|fastcdc] \
      --fingerprint-algorithm [sha1|sha256|sha512] \
      --max-thread [THREAD_COUNT]

This CLI command shows how much storage space can be saved when deduplication
is applied to the pool. If the amount of saved space is higher than the user's
expectation, the pool is probably worth deduplicating.
Users should specify the ``BASE_POOL``, within which the objects targeted for
deduplication are stored. The users also need to run ceph-dedup-tool multiple
times with varying ``chunk_size`` values to find the optimal chunk size. Note
that the optimal value probably differs depending on the content of each
object in the case of the fastcdc chunk algorithm (not fixed); the sketch
below shows one way to sweep candidate sizes.
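
A minimal sketch of such a sweep (the pool name ``mypool`` and the size
list are placeholders):

.. code:: bash

   for CHUNK_SIZE in 4096 8192 16384 32768; do
      echo "=== chunk size: $CHUNK_SIZE ==="
      ceph-dedup-tool --op estimate --pool mypool \
         --chunk-size "$CHUNK_SIZE" --chunk-algorithm fastcdc \
         --fingerprint-algorithm sha1 --max-thread 4
   done

Each run prints a summary like the example output below, so the results
can be compared across chunk sizes.
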
Example output:

.. code:: bash

   {
     "chunk_algo": "fastcdc",
@@ -204,54 +225,202 @@ represents the standard deviation of the chunk size.


2. Create chunk pool.
^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

   ceph osd pool create [CHUNK_POOL]
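
For instance, to create the chunk pool named ``chunk`` that the examples
below use:

.. code:: bash

   ceph osd pool create chunk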

3. Run dedup command (there are two ways).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **sample-dedup**

.. code:: bash

   ceph-dedup-tool --op sample-dedup \
      --pool [BASE_POOL] \
      --chunk-pool [CHUNK_POOL] \
      --chunk-size [CHUNK_SIZE] \
      --chunk-algorithm [fastcdc] \
      --fingerprint-algorithm [sha1|sha256|sha512] \
      --chunk-dedup-threshold [THRESHOLD] \
      --max-thread [THREAD_COUNT] \
      --sampling-ratio [SAMPLE_RATIO] \
      --wakeup-period [WAKEUP_PERIOD] \
      --loop \
      --snap

The ``sample-dedup`` command spawns the number of threads specified by
``THREAD_COUNT`` to deduplicate objects on the ``BASE_POOL``. Guided by the
sampling ratio (a full search is performed if ``SAMPLE_RATIO`` is 100), the
threads selectively perform deduplication if a chunk is redundant more than
``THRESHOLD`` times during iteration.
If ``--loop`` is set, the threads will wake up after ``WAKEUP_PERIOD``;
otherwise, they will exit after one iteration.

Example output:

.. code:: bash

   $ bin/ceph df
   --- RAW STORAGE ---
   CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
   ssd      303 GiB  294 GiB  9.0 GiB  9.0 GiB   2.99
   TOTAL    303 GiB  294 GiB  9.0 GiB  9.0 GiB   2.99

   --- POOLS ---
   POOL   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
   .mgr    1    1  577 KiB        2  1.7 MiB      0     97 GiB
   base    2   32  2.0 GiB      517  6.0 GiB   2.02     97 GiB
   chunk   3   32      0 B        0      0 B      0     97 GiB

   $ bin/ceph-dedup-tool --op sample-dedup --pool base --chunk-pool chunk \
      --fingerprint-algorithm sha1 --chunk-algorithm fastcdc --loop \
      --sampling-ratio 100 --chunk-dedup-threshold 2 --chunk-size 8192 \
      --max-thread 4 --wakeup-period 60

   $ bin/ceph df
   --- RAW STORAGE ---
   CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
   ssd      303 GiB  298 GiB  5.4 GiB  5.4 GiB   1.80
   TOTAL    303 GiB  298 GiB  5.4 GiB  5.4 GiB   1.80

   --- POOLS ---
   POOL   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
   .mgr    1    1  577 KiB        2  1.7 MiB      0     98 GiB
   base    2   32  452 MiB      262  1.3 GiB   0.50     98 GiB
   chunk   3   32  258 MiB   25.91k  938 MiB   0.31     98 GiB

- **object dedup**

.. code:: bash

   ceph-dedup-tool --op object-dedup \
      --pool [BASE_POOL] \
      --object [OID] \
      --chunk-pool [CHUNK_POOL] \
      --fingerprint-algorithm [sha1|sha256|sha512] \
      --dedup-cdc-chunk-size [CHUNK_SIZE]

The ``object-dedup`` command triggers deduplication on the RADOS object
specified by ``OID``. All parameters shown above must be specified.
``CHUNK_SIZE`` should be taken from the results of step 1 above.
Note that when this command is executed, ``fastcdc`` will be set by default
and other parameters such as ``fingerprint-algorithm`` and ``CHUNK_SIZE``
will be set as defaults for the pool.
Deduplicated objects will appear in the chunk pool. If the object is mutated
over time, the user needs to re-run ``object-dedup`` because the chunk
boundaries must be recalculated based on the updated contents.
The user needs to specify ``snap`` if the target object is snapshotted.
After deduplication is done, the target object's size in ``BASE_POOL`` is
zero (evicted) and chunk objects are generated; these appear in
``CHUNK_POOL``.
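
For example, an invocation using the pools from the examples in this
section (a minimal sketch; ``testfile2`` and the 8192-byte chunk size are
taken from the surrounding examples, not defaults):

.. code:: bash

   ceph-dedup-tool --op object-dedup --pool base --object testfile2 \
      --chunk-pool chunk --fingerprint-algorithm sha1 \
      --dedup-cdc-chunk-size 8192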

4. Read/write I/Os
^^^^^^^^^^^^^^^^^^

After step 3, the users don't need to consider anything about I/Os.
Deduplicated objects are completely compatible with existing RADOS
operations; reads and writes work as shown in the sketch below.
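
For instance (a minimal sketch, reusing the ``base`` pool and the
``testfile2`` object from the examples in this section), ordinary
``rados`` commands keep working on a deduplicated object:

.. code:: bash

   # Reads fetch the object transparently; chunk lookups are hidden.
   rados -p base get testfile2 /tmp/testfile2.out

   # Writes also work as usual; note that after a mutation the user
   # should re-run object-dedup, as described in step 3.
   rados -p base put testfile2 /tmp/new-contents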


5. Run scrub to fix reference count
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reference mismatches can, on rare occasions, occur due to false positives
when handling reference counts for deduplicated RADOS objects. These
mismatches will be fixed by periodically scrubbing the pool:

.. code:: bash

   ceph-dedup-tool --op chunk-scrub \
      --chunk-pool [CHUNK_POOL] \
      --pool [POOL] \
      --max-thread [THREAD_COUNT]

The ``chunk-scrub`` command identifies reference mismatches between a
metadata object and a chunk object. The ``chunk-pool`` parameter tells
ceph-dedup-tool where the target chunk objects are located.

Example output:

A reference mismatch is intentionally created by inserting a reference
(dummy-obj) into a chunk object (2ac67f70d3dd187f8f332bb1391f61d4e5c9baae)
by using ``chunk-get-ref``.

.. code:: bash

   $ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk \
      --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
   {
     "type": "by_object",
     "count": 2,
     "refs": [
       {
         "oid": "testfile2",
         "key": "",
         "snapid": -2,
         "hash": 2905889452,
         "max": 0,
         "pool": 2,
         "namespace": ""
       },
       {
         "oid": "dummy-obj",
         "key": "",
         "snapid": -2,
         "hash": 1203585162,
         "max": 0,
         "pool": 2,
         "namespace": ""
       }
     ]
   }

   $ bin/ceph-dedup-tool --op chunk-scrub --chunk-pool chunk --max-thread 10
   10 seconds is set as report period by default
   join
   join
   2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
   --done--
   2ac67f70d3dd187f8f332bb1391f61d4e5c9baae ref 10:5102bde2:::dummy-obj:head: referencing pool does not exist
   --done--
   Total object : 1
   Examined object : 1
   Damaged object : 1

6. Repair a mismatched chunk reference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If ``chunk-scrub`` reports any reference mismatches, it is recommended to
run the ``chunk-repair`` operation, which resolves the mismatch and
restores consistency.

.. code:: bash

   ceph-dedup-tool --op chunk-repair \
      --chunk-pool [CHUNK_POOL_NAME] \
      --object [CHUNK_OID] \
      --target-ref [TARGET_OID] \
      --target-ref-pool-id [TARGET_POOL_ID]

``chunk-repair`` fixes the ``target-ref``, which is a wrong reference held
by an ``object``. To fix it correctly, the users must enter the correct
``TARGET_OID`` and ``TARGET_POOL_ID``.

.. code:: bash

   $ bin/ceph-dedup-tool --op chunk-repair --chunk-pool chunk \
      --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae \
      --target-ref dummy-obj --target-ref-pool-id 10
   2ac67f70d3dd187f8f332bb1391f61d4e5c9baae has 1 references for dummy-obj
   dummy-obj has 0 references for 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
   fix dangling reference from 1 to 0

   $ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk \
      --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
   {
     "type": "by_object",
     "count": 1,
     "refs": [
       {
         "oid": "testfile2",
         "key": "",
         "snapid": -2,
         "hash": 2905889452,
         "max": 0,
         "pool": 2,
         "namespace": ""
       }
     ]
   }
