Add recover_status parser lustrefs_exporter by breuhan · Pull Request #118 · whamcloud/lustrefs-exporter

breuhan · 2025-10-27T10:52:06Z

Added capture of recovery_status to lustrefs_exporter and added these new metrics:
RecoveryDuration
RecoveryTimeRemaining
RecoveryTotalClients

Demo:

Input:

obdfilter.200NVX2-OST0000.recovery_status=
status: COMPLETE
recovery_start: 1761567698
recovery_duration: 1
completed_clients: 8/8
replayed_requests: 0
last_transno: 17184814233
VBR: DISABLED
IR: ENABLED
obdfilter.200NVX2-OST0003.recovery_status=
status: COMPLETE
recovery_start: 1759494116
recovery_duration: 15
completed_clients: 8/8
replayed_requests: 0
last_transno: 12934942643
VBR: DISABLED
IR: DISABLED
obdfilter.200NVX2-OST0004.recovery_status=
status: COMPLETE
recovery_start: 1759494116
recovery_duration: 14
completed_clients: 8/8
replayed_requests: 0
last_transno: 12934956643
VBR: DISABLED
IR: DISABLED
obdfilter.200NVX2-OST0007.recovery_status=
status: COMPLETE
recovery_start: 1759494116
recovery_duration: 14
completed_clients: 8/8
replayed_requests: 0
last_transno: 12934943652
VBR: DISABLED
IR: DISABLED
mdt.200NVX2-MDT0000.recovery_status=
status: COMPLETE
recovery_start: 1759494099
recovery_duration: 55
completed_clients: 22/22
replayed_requests: 0
last_transno: 8763765977
VBR: DISABLED
IR: DISABLED
mdt.200NVX2-MDT0003.recovery_status=
status: COMPLETE
recovery_start: 1759494104
recovery_duration: 50
completed_clients: 22/22
replayed_requests: 0
last_transno: 8719550640
VBR: DISABLED
IR: DISABLED
mdt.200NVX2-MDT0004.recovery_status=
status: COMPLETE
recovery_start: 1759494104
recovery_duration: 50
completed_clients: 22/22
replayed_requests: 0
last_transno: 8763716647
VBR: DISABLED
IR: DISABLED
mdt.200NVX2-MDT0007.recovery_status=
status: COMPLETE
recovery_start: 1759494104
recovery_duration: 50
completed_clients: 22/22
replayed_requests: 0
last_transno: 13014521934
VBR: DISABLED
IR: DISABLED

Output:

# HELP recovery_status Gives the recovery status off a target. 0=Complete 1=Inactive 2=Waiting 3=WaitingForClients 4=Recovering 5=Unknown.
# TYPE recovery_status gauge
recovery_status{kind="MDT",target="200NVX2-MDT0000"} 0
recovery_status{kind="MDT",target="200NVX2-MDT0003"} 0
recovery_status{kind="MDT",target="200NVX2-MDT0004"} 0
recovery_status{kind="MDT",target="200NVX2-MDT0007"} 0
recovery_status{kind="OST",target="200NVX2-OST0000"} 0
recovery_status{kind="OST",target="200NVX2-OST0003"} 0
recovery_status{kind="OST",target="200NVX2-OST0004"} 0
recovery_status{kind="OST",target="200NVX2-OST0007"} 0
# HELP recovery_status_completed_clients Gives the count of clients that complete the recovery on a target.
# TYPE recovery_status_completed_clients gauge
recovery_status_completed_clients{kind="MDT",target="200NVX2-MDT0000"} 22
recovery_status_completed_clients{kind="MDT",target="200NVX2-MDT0003"} 22
recovery_status_completed_clients{kind="MDT",target="200NVX2-MDT0004"} 22
recovery_status_completed_clients{kind="MDT",target="200NVX2-MDT0007"} 22
recovery_status_completed_clients{kind="OST",target="200NVX2-OST0000"} 8
recovery_status_completed_clients{kind="OST",target="200NVX2-OST0003"} 8
recovery_status_completed_clients{kind="OST",target="200NVX2-OST0004"} 8
recovery_status_completed_clients{kind="OST",target="200NVX2-OST0007"} 8
# HELP recovery_status_duration_seconds Gives the total duration in seconds of the recovery on a target.
# TYPE recovery_status_duration_seconds gauge
recovery_status_duration_seconds{kind="MDT",target="200NVX2-MDT0000"} 55
recovery_status_duration_seconds{kind="MDT",target="200NVX2-MDT0003"} 50
recovery_status_duration_seconds{kind="MDT",target="200NVX2-MDT0004"} 50
recovery_status_duration_seconds{kind="MDT",target="200NVX2-MDT0007"} 50
recovery_status_duration_seconds{kind="OST",target="200NVX2-OST0000"} 1
recovery_status_duration_seconds{kind="OST",target="200NVX2-OST0003"} 15
recovery_status_duration_seconds{kind="OST",target="200NVX2-OST0004"} 14
recovery_status_duration_seconds{kind="OST",target="200NVX2-OST0007"} 14
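
The mapping from the input blocks to the gauges above can be sketched in plain Rust. This is only an illustration, not the parser from this PR: the names `RecoveryStatus`, `TargetRecovery` and `parse_recovery_status` are hypothetical, the status spellings other than `COMPLETE` are assumed (only `COMPLETE` appears in the demo), and the `obdfilter` to `OST` / `mdt` to `MDT` mapping is inferred from the `kind` labels in the output.

```rust
/// Numeric codes taken from the `recovery_status` HELP line in the demo output.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecoveryStatus {
    Complete = 0,
    Inactive = 1,
    Waiting = 2,
    WaitingForClients = 3,
    Recovering = 4,
    Unknown = 5,
}

impl RecoveryStatus {
    /// Only "COMPLETE" is shown in the demo; the other spellings are assumptions.
    fn from_text(s: &str) -> Self {
        match s.trim() {
            "COMPLETE" => Self::Complete,
            "INACTIVE" => Self::Inactive,
            "WAITING" => Self::Waiting,
            "WAITING_FOR_CLIENTS" => Self::WaitingForClients,
            "RECOVERING" => Self::Recovering,
            _ => Self::Unknown,
        }
    }
}

/// Hypothetical per-target record holding the fields used by the gauges.
#[derive(Debug, PartialEq)]
struct TargetRecovery {
    kind: String,   // "OST" or "MDT" (inferred from the parameter prefix)
    target: String, // e.g. "200NVX2-OST0000"
    status: RecoveryStatus,
    duration_secs: Option<u64>,
    completed_clients: Option<u64>,
    total_clients: Option<u64>,
}

/// Parses `<prefix>.<target>.recovery_status=` headers followed by `key: value` lines.
fn parse_recovery_status(input: &str) -> Vec<TargetRecovery> {
    let mut out: Vec<TargetRecovery> = Vec::new();
    for line in input.lines().map(str::trim).filter(|l| !l.is_empty()) {
        if let Some(header) = line.strip_suffix(".recovery_status=") {
            // A header starts a new target block; skip headers without a prefix.
            let Some((prefix, target)) = header.split_once('.') else {
                continue;
            };
            let kind = match prefix {
                "obdfilter" => "OST",
                "mdt" => "MDT",
                other => other,
            };
            out.push(TargetRecovery {
                kind: kind.to_string(),
                target: target.to_string(),
                status: RecoveryStatus::Unknown,
                duration_secs: None,
                completed_clients: None,
                total_clients: None,
            });
        } else if let (Some(cur), Some((key, value))) = (out.last_mut(), line.split_once(':')) {
            let value = value.trim();
            match key.trim() {
                "status" => cur.status = RecoveryStatus::from_text(value),
                "recovery_duration" => cur.duration_secs = value.parse().ok(),
                "completed_clients" => {
                    // "8/8" is read as completed/total.
                    if let Some((done, total)) = value.split_once('/') {
                        cur.completed_clients = done.trim().parse().ok();
                        cur.total_clients = total.trim().parse().ok();
                    }
                }
                _ => {} // recovery_start, VBR, IR, ... are ignored in this sketch
            }
        }
    }
    out
}
```

Casting the enum with `status as i64` gives the gauge value, so a `COMPLETE` target is exported as 0. Fields that are missing or fail to parse stay `None` instead of aborting the whole scrape, which is one possible design choice for a collector that reads live `lctl` output.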

codecov · 2025-10-27T10:53:20Z

Codecov Report

❌ Patch coverage is 76.00000% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.34%. Comparing base (6efdfe1) to head (7268720).

Files with missing lines	Patch %	Lines
lustrefs-exporter/src/brw_stats.rs	16.66%	30 Missing ⚠️

Additional details and impacted files

@@                             Coverage Diff                             @@
##           spoutn1k/EHT-1348-history-in-the-making     #118      +/-   ##
===========================================================================
- Coverage                                    89.60%   89.34%   -0.26%     
===========================================================================
  Files                                           44       44              
  Lines                                         5375     5470      +95     
  Branches                                      5375     5470      +95     
===========================================================================
+ Hits                                          4816     4887      +71     
- Misses                                         484      509      +25     
+ Partials                                        75       74       -1

Flag	Coverage Δ
2_14_0_ddn133	`36.61% <22.47%> (+1.87%)`	⬆️
2_14_0_ddn145	`38.55% <34.83%> (+2.82%)`	⬆️
all-tests	`89.34% <76.00%> (-0.26%)`	⬇️

github-actions · 2025-10-27T10:54:23Z

Bencher Report

Branch	breuhan/parsing_full_recovery_status
Testbed	ci-runner

Benchmark	Latency, nanoseconds (ns) (Result Δ%)	Lower Boundary, nanoseconds (ns) (Limit %)	Upper Boundary, nanoseconds (ns) (Limit %)
parse_benchmarks/combine_performance	124,620,000.00 ns (-57.53%) Baseline: 293,399,285.71 ns	-764,455,415.43 ns (-613.43%)	1,351,253,986.86 ns (9.22%)

github-actions · 2025-10-27T10:56:36Z

Bencher Report

Branch	breuhan/parsing_full_recovery_status
Testbed	ci-runner

⚠️ WARNING: No Threshold found!
Without a Threshold, no Alerts will ever be generated.
LL Hits (hits)
Estimated Cycles (cycles)
I1mr (misses (reads))
LL Hit Rate (hits (%))
Total read+write (reads/writes)
Dr (reads)
RAM Hit Rate (hits (%))
D1mw (misses (writes))
L1 Hits (hits)
Dw (writes)
I1 Miss Rate (misses (%))
DLmw (misses (writes))
DLmr (misses (reads))
ILmr (misses (reads))
L1 Hit Rate (hits (%))
D1mr (misses (reads))
LL Miss Rate (misses (%))
LLd Miss Rate (misses (%))
LLi Miss Rate (misses (%))
RAM Hits (hits)
D1 Miss Rate (misses (%))
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

Benchmark: lustre_metrics::memory_benches::bench_encode_lustre_metrics with_setup:generate_records()

Measure	Unit	Result
D1 Miss Rate	misses (%)	0.93
D1mr	misses (reads) x 1e3	25.00
D1mw	misses (writes) x 1e3	9.21
DLmr	misses (reads)	117.00
DLmw	misses (writes) x 1e3	6.45
Dr	reads x 1e6	2.47
Dw	writes x 1e6	1.23
Estimated Cycles	cycles x 1e6	14.81
I1 Miss Rate	misses (%)	0.01
I1mr	misses (reads) x 1e3	1.06
ILmr	misses (reads)	890.00
Instructions	instructions x 1e6	10.75 (-23.14%), Baseline: 13.99; Lower Boundary: 2.43 (22.65%); Upper Boundary: 25.54 (42.09%)
L1 Hit Rate	hits (%)	99.76
L1 Hits	hits x 1e6	14.41
LL Hit Rate	hits (%)	0.19
LL Hits	hits x 1e3	27.83
LL Miss Rate	misses (%)	0.05
LLd Miss Rate	misses (%)	0.18
LLi Miss Rate	misses (%)	0.01
RAM Hit Rate	hits (%)	0.05
RAM Hits	hits x 1e3	7.46
Total read+write	reads/writes x 1e6	14.44

github-actions · 2025-10-27T10:57:36Z

Bencher Report

Branch	breuhan/parsing_full_recovery_status
Testbed	ci-runner

⚠️ WARNING: No Threshold found!
Without a Threshold, no Alerts will ever be generated.
avg_runtime_rss_mib (Measure (MiB))
peak_over_start_rss_ratio (Measure (units))
virtual_growth_mib (Measure (MiB))
avg_runtime_virtual_mib (Measure (MiB))
peak_virtual_mib (Measure (MiB))
end_rss_mib (Measure (MiB))
memory_growth_mib (Measure (MiB))
start_rss_mib (Measure (MiB))
start_virtual_mib (Measure (MiB))
end_virtual_mib (Measure (MiB))
peak_over_start_virtual_ratio (Measure (units))
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

Benchmark: scrape_allocations

Measure	Unit	Result
avg_runtime_rss_mib	MiB	43.09
avg_runtime_virtual_mib	MiB	893.91
end_rss_mib	MiB	43.16
end_virtual_mib	MiB	894.01
memory_growth_mib	MiB	0.33
peak_over_start_rss_ratio	units	1.03
peak_over_start_virtual_ratio	units	1.02
peak_rss_mib	MiB	45.14 (-46.82%), Baseline: 84.88; Lower Boundary: -142.07 (-314.72%); Upper Boundary: 311.82 (14.48%)
peak_virtual_mib	MiB	944.80
start_rss_mib	MiB	42.83
start_virtual_mib	MiB	888.06
virtual_growth_mib	MiB	5.95

Copilot

Pull Request Overview

This PR enhances the recovery status parser to collect and export four additional metrics for Lustre filesystem recovery monitoring: completed clients, duration, time remaining, and total clients involved in recovery operations.

Added support for parsing recovery_duration, time_remaining, and total_clients fields from recovery status output
Introduced new Prometheus metrics for recovery duration, time remaining, and total client counts
Updated test fixtures and snapshots to validate the new metrics extraction
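
As a rough illustration of what exporting one of these gauges involves, the snippet below renders samples in the Prometheus text exposition format using only the standard library. It is a sketch under stated assumptions, not the exporter's code: `Sample` and `render_gauge` are hypothetical names, and label values are assumed to contain no quotes, backslashes or newlines, so no escaping is done.

```rust
/// Hypothetical sample: one gauge value for one Lustre target.
struct Sample<'a> {
    kind: &'a str,   // "OST" or "MDT"
    target: &'a str, // e.g. "200NVX2-MDT0000"
    value: u64,
}

/// Renders a single gauge family as Prometheus text exposition lines.
/// Assumes label values need no escaping.
fn render_gauge(name: &str, help: &str, samples: &[Sample]) -> String {
    let mut out = format!("# HELP {name} {help}\n# TYPE {name} gauge\n");
    for s in samples {
        out.push_str(&format!(
            "{name}{{kind=\"{}\",target=\"{}\"}} {}\n",
            s.kind, s.target, s.value
        ));
    }
    out
}
```

Calling `render_gauge` with the name `recovery_status_duration_seconds` and a sample for `200NVX2-MDT0000` with value 55 yields the same three-line shape as the demo output in the PR description. A real exporter would more likely register the gauges with a metrics library rather than format strings by hand.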

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

File	Description
lustre-collector/src/recovery_status_parser.rs	Extended parser to extract duration, time remaining, and total client metrics from recovery status
lustre-collector/src/types.rs	Added new TargetStats variants for the additional recovery metrics
lustre-collector/src/parser.rs	Integrated recovery status parser into main parsing flow
lustrefs-exporter/src/brw_stats.rs	Added Prometheus metric families and registration for the new recovery metrics
lustrefs-exporter/src/lib.rs	Added new metric names to the validation list and improved error handling
Test fixtures and snapshots	Updated test data and expected outputs to validate new metric extraction

lustre-collector/src/recovery_status_parser.rs

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

notion-workspace · 2025-10-28T14:52:40Z

GCP-38 Information on recovery_status of Lustre targets

enhance recovery status parser to include additional metrics: - RecoveryDuration - RecoveryTimeRemaining - RecoveryTotalClients

lustre-collector/src/parser.rs

lustre-collector/src/recovery_status_parser.rs

...efs_exporter__tests__valid_fixture_lustre-2.14.0_ddn212__2.14.0_ddn212_recovery.txt.histsnap

johnsonw

A few comments. Also, since this updates the lustre-collector as well, I believe we will need to update the version in EMF once this lands.

breuhan self-assigned this Oct 27, 2025

breuhan force-pushed the breuhan/parsing_full_recovery_status branch from 9334a78 to 1478c07 Compare October 27, 2025 11:15

breuhan added the enhancement New feature or request label Oct 27, 2025

breuhan added this to the next-calver milestone Oct 27, 2025

breuhan force-pushed the breuhan/parsing_full_recovery_status branch from 1478c07 to 4daf4d8 Compare October 27, 2025 16:35

breuhan requested review from Copilot, johnsonw, jparris and spoutn1k October 27, 2025 16:40

Copilot AI reviewed Oct 27, 2025

lustre-collector/src/recovery_status_parser.rs (resolved)

lustre-collector/src/recovery_status_parser.rs (resolved)

Copilot AI reviewed Oct 27, 2025

breuhan changed the title from "Enhance recovery status parser to include additional metrics" to "Add recover_status parser lustrefs_exporter" Oct 28, 2025

breuhan force-pushed the breuhan/parsing_full_recovery_status branch from 4daf4d8 to bcfff9a Compare October 28, 2025 15:04

breuhan marked this pull request as ready for review October 28, 2025 15:04

breuhan requested a review from jgrund as a code owner October 28, 2025 15:04

Commit e42468a: Add recovery_status to lustrefs_exporter and enhance recovery status parser to include additional metrics:
RecoveryDuration
RecoveryTimeRemaining
RecoveryTotalClients

breuhan force-pushed the breuhan/parsing_full_recovery_status branch from bcfff9a to e42468a Compare November 6, 2025 19:15

jparris previously approved these changes Nov 10, 2025