Skip to content

Commit c9ab079

Browse files
authored
Merge pull request ceph#56782 from joscollin/wip-B65171-support-replication-start-end-notifications
cephfs_mirror: provide metrics for last successful snapshot sync Reviewed-by: Venky Shankar <[email protected]>
2 parents 92ad0c8 + 091625e commit c9ab079

File tree

4 files changed

+161
-26
lines changed

4 files changed

+161
-26
lines changed

doc/cephfs/cephfs-mirroring.rst

Lines changed: 98 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -341,7 +341,8 @@ command is of format `filesystem-name@filesystem-id peer-uuid`::
341341
"id": 120,
342342
"name": "snap1",
343343
"sync_duration": 0.079997898999999997,
344-
"sync_time_stamp": "274900.558797s"
344+
"sync_time_stamp": "274900.558797s",
345+
"sync_bytes": 52428800
345346
},
346347
"snaps_synced": 2,
347348
"snaps_deleted": 0,
@@ -359,6 +360,32 @@ A directory can be in one of the following states::
359360
- `syncing`: The directory is currently being synchronized
360361
- `failed`: The directory has hit upper limit of consecutive failures
361362

363+
When a directory is currently being synchronized, the mirror daemon marks it as `syncing` and
364+
`fs mirror peer status` shows the snapshot being synchronized under the `current_syncing_snap`::
365+
366+
$ ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror peer status cephfs@360 a2dc7784-e7a1-4723-b103-03ee8d8768f8
367+
{
368+
"/d0": {
369+
"state": "syncing",
370+
"current_syncing_snap": {
371+
"id": 121,
372+
"name": "snap2"
373+
},
374+
"last_synced_snap": {
375+
"id": 120,
376+
"name": "snap1",
377+
"sync_duration": 0.079997898999999997,
378+
"sync_time_stamp": "274900.558797s",
379+
"sync_bytes": 52428800
380+
},
381+
"snaps_synced": 2,
382+
"snaps_deleted": 0,
383+
"snaps_renamed": 0
384+
}
385+
}
386+
387+
The mirror daemon marks it back to `idle`, when the syncing completes.
388+
362389
When a directory experiences a configured number of consecutive synchronization failures, the
363390
mirror daemon marks it as `failed`. Synchronization for these directories is retried.
364391
By default, the number of consecutive failures before a directory is marked as failed
@@ -374,12 +401,13 @@ E.g., adding a regular file for synchronization would result in failed status::
374401
"/d0": {
375402
"state": "idle",
376403
"last_synced_snap": {
377-
"id": 120,
378-
"name": "snap1",
379-
"sync_duration": 0.079997898999999997,
380-
"sync_time_stamp": "274900.558797s"
404+
"id": 121,
405+
"name": "snap2",
406+
"sync_duration": 300,
407+
"sync_time_stamp": "500900.600797s",
408+
"sync_bytes": 78643200
381409
},
382-
"snaps_synced": 2,
410+
"snaps_synced": 3,
383411
"snaps_deleted": 0,
384412
"snaps_renamed": 0
385413
},
@@ -398,6 +426,69 @@ created, the mirror daemon will clear the failed state upon successful synchroni
398426
When mirroring is disabled, the respective `fs mirror status` command for the file system
399427
will not show up in command help.
400428

429+
Metrics
430+
-------
431+
432+
CephFS exports mirroring metrics as :ref:`Labeled Perf Counters` which will be consumed by the OCP/ODF Dashboard to provide monitoring of the Geo Replication. These metrics can be used to measure the progress of cephfs_mirror syncing and thus provide the monitoring capability. CephFS exports the following mirroring metrics, which are displayed using the ``counter dump`` command.
433+
434+
.. list-table:: Mirror Status Metrics
435+
:widths: 25 25 75
436+
:header-rows: 1
437+
438+
* - Name
439+
- Type
440+
- Description
441+
* - mirroring_peers
442+
- Gauge
443+
- The number of peers involved in mirroring
444+
* - directory_count
445+
- Gauge
446+
- The total number of directories being synchronized
447+
* - mirrored_filesystems
448+
- Gauge
449+
- The total number of filesystems which are mirrored
450+
* - mirror_enable_failures
451+
- Counter
452+
- Enable mirroring failures
453+
454+
.. list-table:: Replication Metrics
455+
:widths: 25 25 75
456+
:header-rows: 1
457+
458+
* - Name
459+
- Type
460+
- Description
461+
* - snaps_synced
462+
- Counter
463+
- The total number of snapshots successfully synchronized
464+
* - sync_bytes
465+
- Counter
466+
- The total bytes being synchronized
467+
* - sync_failures
468+
- Counter
469+
- The total number of failed snapshot synchronizations
470+
* - snaps_deleted
471+
- Counter
472+
- The total number of snapshots deleted
473+
* - snaps_renamed
474+
- Counter
475+
- The total number of snapshots renamed
476+
* - avg_sync_time
477+
- Gauge
478+
- The average time (ms) taken by all snapshot synchronizations
479+
* - last_synced_start
480+
- Gauge
481+
- The sync start time (ms) of the last synced snapshot
482+
* - last_synced_end
483+
- Gauge
484+
- The sync end time (ms) of the last synced snapshot
485+
* - last_synced_duration
486+
- Gauge
487+
- The time duration (ms) of the last synchronization
488+
* - last_synced_bytes
489+
- counter
490+
- The total bytes being synchronized for the last synced snapshot
491+
401492
Configuration Options
402493
---------------------
403494

@@ -410,6 +501,7 @@ Configuration Options
410501
.. confval:: cephfs_mirror_retry_failed_directories_interval
411502
.. confval:: cephfs_mirror_restart_mirror_on_failure_interval
412503
.. confval:: cephfs_mirror_mount_timeout
504+
.. confval:: cephfs_mirror_perf_stats_prio
413505

414506
Re-adding Peers
415507
---------------

qa/tasks/cephfs/test_mirroring.py

Lines changed: 15 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -549,7 +549,8 @@ def test_cephfs_mirror_stats(self):
549549

550550
# create a bunch of files in a directory to snap
551551
self.mount_a.run_shell(["mkdir", "d0"])
552-
self.mount_a.create_n_files('d0/file', 50, sync=True)
552+
for i in range(50):
553+
self.mount_a.write_n_mb(os.path.join('d0', f'file.{i}'), 1)
553554

554555
self.enable_mirroring(self.primary_fs_name, self.primary_fs_id)
555556
self.add_directory(self.primary_fs_name, self.primary_fs_id, '/d0')
@@ -567,17 +568,20 @@ def test_cephfs_mirror_stats(self):
567568
"client.mirror_remote@ceph", '/d0', 'snap0', 1)
568569
self.verify_snapshot('d0', 'snap0')
569570

570-
# check snaps_synced
571+
# check perf counters
571572
res = self.mirror_daemon_command(f'counter dump for fs: {self.primary_fs_name}', 'counter', 'dump')
572573
second = res[TestMirroring.PERF_COUNTER_KEY_NAME_CEPHFS_MIRROR_PEER][0]
573574
self.assertGreater(second["counters"]["snaps_synced"], first["counters"]["snaps_synced"])
575+
self.assertGreater(second["counters"]["last_synced_start"], first["counters"]["last_synced_start"])
576+
self.assertGreater(second["counters"]["last_synced_end"], second["counters"]["last_synced_start"])
577+
self.assertGreater(second["counters"]["last_synced_duration"], 0)
578+
self.assertEquals(second["counters"]["last_synced_bytes"], 52428800) # last_synced_bytes = 50 files of 1MB size each
574579

575580
# some more IO
576-
self.mount_a.run_shell(["mkdir", "d0/d00"])
577-
self.mount_a.run_shell(["mkdir", "d0/d01"])
581+
for i in range(75):
582+
self.mount_a.write_n_mb(os.path.join('d0', f'more_file.{i}'), 1)
578583

579-
self.mount_a.create_n_files('d0/d00/more_file', 20, sync=True)
580-
self.mount_a.create_n_files('d0/d01/some_more_file', 75, sync=True)
584+
time.sleep(60)
581585

582586
# take another snapshot
583587
self.mount_a.run_shell(["mkdir", "d0/.snap/snap1"])
@@ -587,10 +591,14 @@ def test_cephfs_mirror_stats(self):
587591
"client.mirror_remote@ceph", '/d0', 'snap1', 2)
588592
self.verify_snapshot('d0', 'snap1')
589593

590-
# check snaps_synced
594+
# check perf counters
591595
res = self.mirror_daemon_command(f'counter dump for fs: {self.primary_fs_name}', 'counter', 'dump')
592596
third = res[TestMirroring.PERF_COUNTER_KEY_NAME_CEPHFS_MIRROR_PEER][0]
593597
self.assertGreater(third["counters"]["snaps_synced"], second["counters"]["snaps_synced"])
598+
self.assertGreater(third["counters"]["last_synced_start"], second["counters"]["last_synced_end"])
599+
self.assertGreater(third["counters"]["last_synced_end"], third["counters"]["last_synced_start"])
600+
self.assertGreater(third["counters"]["last_synced_duration"], 0)
601+
self.assertEquals(third["counters"]["last_synced_bytes"], 78643200) # last_synced_bytes = 75 files of 1MB size each
594602

595603
# delete a snapshot
596604
self.mount_a.run_shell(["rmdir", "d0/.snap/snap0"])

src/tools/cephfs_mirror/PeerReplayer.cc

Lines changed: 38 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,10 @@ enum {
3838
l_cephfs_mirror_peer_replayer_snap_sync_failures,
3939
l_cephfs_mirror_peer_replayer_avg_sync_time,
4040
l_cephfs_mirror_peer_replayer_sync_bytes,
41+
l_cephfs_mirror_peer_replayer_last_synced_start,
42+
l_cephfs_mirror_peer_replayer_last_synced_end,
43+
l_cephfs_mirror_peer_replayer_last_synced_duration,
44+
l_cephfs_mirror_peer_replayer_last_synced_bytes,
4145
l_cephfs_mirror_peer_replayer_last,
4246
};
4347

@@ -203,6 +207,14 @@ PeerReplayer::PeerReplayer(CephContext *cct, FSMirror *fs_mirror,
203207
"avg_sync_time", "Average Sync Time", "asyn", prio);
204208
plb.add_u64_counter(l_cephfs_mirror_peer_replayer_sync_bytes,
205209
"sync_bytes", "Sync Bytes", "sbye", prio);
210+
plb.add_time(l_cephfs_mirror_peer_replayer_last_synced_start,
211+
"last_synced_start", "Last Synced Start", "lsst", prio);
212+
plb.add_time(l_cephfs_mirror_peer_replayer_last_synced_end,
213+
"last_synced_end", "Last Synced End", "lsen", prio);
214+
plb.add_time(l_cephfs_mirror_peer_replayer_last_synced_duration,
215+
"last_synced_duration", "Last Synced Duration", "lsdn", prio);
216+
plb.add_u64_counter(l_cephfs_mirror_peer_replayer_last_synced_bytes,
217+
"last_synced_bytes", "Last Synced Bytes", "lsbt", prio);
206218
m_perf_counters = plb.create_perf_counters();
207219
m_cct->get_perfcounters_collection()->add(m_perf_counters);
208220
}
@@ -751,8 +763,9 @@ int PeerReplayer::remote_file_op(const std::string &dir_root, const std::string
751763
return r;
752764
}
753765
if (m_perf_counters) {
754-
m_perf_counters->inc(l_cephfs_mirror_peer_replayer_sync_bytes, stx.stx_size);
766+
m_perf_counters->inc(l_cephfs_mirror_peer_replayer_sync_bytes, stx.stx_size);
755767
}
768+
inc_sync_bytes(dir_root, stx.stx_size);
756769
} else if (S_ISLNK(stx.stx_mode)) {
757770
// free the remote link before relinking
758771
r = ceph_unlinkat(m_remote_mount, fh.r_fd_dir_root, epath.c_str(), 0);
@@ -1668,9 +1681,17 @@ int PeerReplayer::do_sync_snaps(const std::string &dir_root) {
16681681
"cephfs_mirror_max_snapshot_sync_per_cycle");
16691682

16701683
dout(10) << ": synchronizing from snap-id=" << it->first << dendl;
1684+
double start = 0;
1685+
double end = 0;
1686+
double duration = 0;
16711687
for (; it != local_snap_map.end(); ++it) {
1688+
if (m_perf_counters) {
1689+
start = std::chrono::duration_cast<std::chrono::milliseconds>(clock::now().time_since_epoch()).count();
1690+
utime_t t;
1691+
t.set_from_double(start);
1692+
m_perf_counters->tset(l_cephfs_mirror_peer_replayer_last_synced_start, t);
1693+
}
16721694
set_current_syncing_snap(dir_root, it->first, it->second);
1673-
auto start = clock::now();
16741695
boost::optional<Snapshot> prev = boost::none;
16751696
if (last_snap_id != 0) {
16761697
prev = std::make_pair(last_snap_name, last_snap_id);
@@ -1684,16 +1705,18 @@ int PeerReplayer::do_sync_snaps(const std::string &dir_root) {
16841705
}
16851706
if (m_perf_counters) {
16861707
m_perf_counters->inc(l_cephfs_mirror_peer_replayer_snaps_synced);
1687-
}
1688-
std::chrono::duration<double> duration = clock::now() - start;
1689-
1690-
utime_t d;
1691-
d.set_from_double(duration.count());
1692-
if (m_perf_counters) {
1693-
m_perf_counters->tinc(l_cephfs_mirror_peer_replayer_avg_sync_time, d);
1694-
}
1695-
1696-
set_last_synced_stat(dir_root, it->first, it->second, duration.count());
1708+
end = std::chrono::duration_cast<std::chrono::milliseconds>(clock::now().time_since_epoch()).count();
1709+
utime_t t;
1710+
t.set_from_double(end);
1711+
m_perf_counters->tset(l_cephfs_mirror_peer_replayer_last_synced_end, t);
1712+
duration = end - start;
1713+
t.set_from_double(duration);
1714+
m_perf_counters->tinc(l_cephfs_mirror_peer_replayer_avg_sync_time, t);
1715+
m_perf_counters->tset(l_cephfs_mirror_peer_replayer_last_synced_duration, t);
1716+
m_perf_counters->set(l_cephfs_mirror_peer_replayer_last_synced_bytes, m_snap_sync_stats.at(dir_root).sync_bytes);
1717+
}
1718+
1719+
set_last_synced_stat(dir_root, it->first, it->second, duration);
16971720
if (--snaps_per_cycle == 0) {
16981721
break;
16991722
}
@@ -1801,6 +1824,9 @@ void PeerReplayer::peer_status(Formatter *f) {
18011824
f->dump_float("sync_duration", *sync_stat.last_sync_duration);
18021825
f->dump_stream("sync_time_stamp") << sync_stat.last_synced;
18031826
}
1827+
if (sync_stat.last_sync_bytes) {
1828+
f->dump_unsigned("sync_bytes", *sync_stat.last_sync_bytes);
1829+
}
18041830
f->close_section();
18051831
}
18061832
f->dump_unsigned("snaps_synced", sync_stat.synced_snap_count);

src/tools/cephfs_mirror/PeerReplayer.h

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -149,6 +149,8 @@ class PeerReplayer {
149149
uint64_t renamed_snap_count = 0;
150150
monotime last_synced = clock::zero();
151151
boost::optional<double> last_sync_duration;
152+
boost::optional<uint64_t> last_sync_bytes; //last sync bytes for display in status
153+
uint64_t sync_bytes = 0; //sync bytes counter, independently for each directory sync.
152154
};
153155

154156
void _inc_failed_count(const std::string &dir_root) {
@@ -187,6 +189,8 @@ class PeerReplayer {
187189
const std::string &snap_name) {
188190
std::scoped_lock locker(m_lock);
189191
_set_last_synced_snap(dir_root, snap_id, snap_name);
192+
auto &sync_stat = m_snap_sync_stats.at(dir_root);
193+
sync_stat.sync_bytes = 0;
190194
}
191195
void set_current_syncing_snap(const std::string &dir_root, uint64_t snap_id,
192196
const std::string &snap_name) {
@@ -216,9 +220,14 @@ class PeerReplayer {
216220
auto &sync_stat = m_snap_sync_stats.at(dir_root);
217221
sync_stat.last_synced = clock::now();
218222
sync_stat.last_sync_duration = duration;
223+
sync_stat.last_sync_bytes = sync_stat.sync_bytes;
219224
++sync_stat.synced_snap_count;
220225
}
221-
226+
void inc_sync_bytes(const std::string &dir_root, const uint64_t& b) {
227+
std::scoped_lock locker(m_lock);
228+
auto &sync_stat = m_snap_sync_stats.at(dir_root);
229+
sync_stat.sync_bytes += b;
230+
}
222231
bool should_backoff(const std::string &dir_root, int *retval) {
223232
if (m_fs_mirror->is_blocklisted()) {
224233
*retval = -EBLOCKLISTED;

0 commit comments

Comments
 (0)