
Add recover_status parser lustrefs_exporter #118

Merged: jparris merged 3 commits into main from breuhan/parsing_full_recovery_status on Dec 19, 2025


breuhan (Contributor) commented Oct 27, 2025

Added capture of recovery_status to lustrefs_exporter, along with these new metrics:
- RecoveryDuration
- RecoveryTimeRemaining
- RecoveryTotalClients

Demo:

Input:

obdfilter.200NVX2-OST0000.recovery_status=
status: COMPLETE
recovery_start: 1761567698
recovery_duration: 1
completed_clients: 8/8
replayed_requests: 0
last_transno: 17184814233
VBR: DISABLED
IR: ENABLED
obdfilter.200NVX2-OST0003.recovery_status=
status: COMPLETE
recovery_start: 1759494116
recovery_duration: 15
completed_clients: 8/8
replayed_requests: 0
last_transno: 12934942643
VBR: DISABLED
IR: DISABLED
obdfilter.200NVX2-OST0004.recovery_status=
status: COMPLETE
recovery_start: 1759494116
recovery_duration: 14
completed_clients: 8/8
replayed_requests: 0
last_transno: 12934956643
VBR: DISABLED
IR: DISABLED
obdfilter.200NVX2-OST0007.recovery_status=
status: COMPLETE
recovery_start: 1759494116
recovery_duration: 14
completed_clients: 8/8
replayed_requests: 0
last_transno: 12934943652
VBR: DISABLED
IR: DISABLED
mdt.200NVX2-MDT0000.recovery_status=
status: COMPLETE
recovery_start: 1759494099
recovery_duration: 55
completed_clients: 22/22
replayed_requests: 0
last_transno: 8763765977
VBR: DISABLED
IR: DISABLED
mdt.200NVX2-MDT0003.recovery_status=
status: COMPLETE
recovery_start: 1759494104
recovery_duration: 50
completed_clients: 22/22
replayed_requests: 0
last_transno: 8719550640
VBR: DISABLED
IR: DISABLED
mdt.200NVX2-MDT0004.recovery_status=
status: COMPLETE
recovery_start: 1759494104
recovery_duration: 50
completed_clients: 22/22
replayed_requests: 0
last_transno: 8763716647
VBR: DISABLED
IR: DISABLED
mdt.200NVX2-MDT0007.recovery_status=
status: COMPLETE
recovery_start: 1759494104
recovery_duration: 50
completed_clients: 22/22
replayed_requests: 0
last_transno: 13014521934
VBR: DISABLED
IR: DISABLED
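As an illustration of the input format above, here is a minimal Rust sketch that parses one recovery_status block. It is not the code in recovery_status_parser.rs: RecoveryRecord, parse_block and status_value are hypothetical names. It assumes that the obdfilter./mdt. prefixes map to OST/MDT (as in the output below), that the raw status strings are upper-case (e.g. WAITING_FOR_CLIENTS), that the field is named time_remaining (taken from the Copilot summary), and that the total client count is the denominator of completed_clients.

```rust
use std::collections::HashMap;

/// Maps a raw status string to the gauge value documented in the exporter's
/// HELP text (0=Complete 1=Inactive 2=Waiting 3=WaitingForClients
/// 4=Recovering 5=Unknown). The upper-case spellings are assumptions.
fn status_value(raw: &str) -> u8 {
    match raw {
        "COMPLETE" => 0,
        "INACTIVE" => 1,
        "WAITING" => 2,
        "WAITING_FOR_CLIENTS" => 3,
        "RECOVERING" => 4,
        _ => 5,
    }
}

/// Hypothetical record type, not the crate's actual TargetStats variants.
#[derive(Debug)]
struct RecoveryRecord {
    kind: &'static str,               // "OST" for obdfilter.*, "MDT" for mdt.*
    target: String,                   // e.g. "200NVX2-OST0000"
    status: u8,                       // value from status_value()
    duration_secs: Option<u64>,       // recovery_duration
    time_remaining_secs: Option<u64>, // time_remaining (absent once COMPLETE)
    completed_clients: Option<u64>,   // numerator of completed_clients
    total_clients: Option<u64>,       // denominator of completed_clients (assumed)
}

/// Parses one "<prefix>.<target>.recovery_status=" block in the format shown above.
fn parse_block(block: &str) -> Option<RecoveryRecord> {
    let mut lines = block.lines();
    // Header, e.g. "obdfilter.200NVX2-OST0000.recovery_status=".
    let header = lines.next()?.trim().strip_suffix(".recovery_status=")?;
    let (prefix, target) = header.split_once('.')?;
    let kind = match prefix {
        "obdfilter" => "OST",
        "mdt" => "MDT",
        _ => return None,
    };
    // Remaining lines are "key: value" pairs.
    let fields: HashMap<&str, &str> = lines
        .filter_map(|line| line.split_once(':'))
        .map(|(k, v)| (k.trim(), v.trim()))
        .collect();
    let num = |key: &str| fields.get(key).and_then(|v| v.parse::<u64>().ok());
    // "completed_clients: 8/8" -> (Some(8), Some(8)).
    let (completed_clients, total_clients) =
        match fields.get("completed_clients").and_then(|v| v.split_once('/')) {
            Some((done, total)) => (done.parse::<u64>().ok(), total.parse::<u64>().ok()),
            None => (None, None),
        };
    Some(RecoveryRecord {
        kind,
        target: target.to_string(),
        status: status_value(fields.get("status").copied().unwrap_or("")),
        duration_secs: num("recovery_duration"),
        time_remaining_secs: num("time_remaining"),
        completed_clients,
        total_clients,
    })
}

fn main() {
    let block = "obdfilter.200NVX2-OST0000.recovery_status=\n\
                 status: COMPLETE\n\
                 recovery_start: 1761567698\n\
                 recovery_duration: 1\n\
                 completed_clients: 8/8\n\
                 VBR: DISABLED";
    println!("{:?}", parse_block(block));
}
```

Splitting the input into per-target blocks (on lines ending in ".recovery_status=") is left out to keep the sketch short.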

Output:

# HELP recovery_status Gives the recovery status off a target. 0=Complete 1=Inactive 2=Waiting 3=WaitingForClients 4=Recovering 5=Unknown.
# TYPE recovery_status gauge
recovery_status{kind="MDT",target="200NVX2-MDT0000"} 0
recovery_status{kind="MDT",target="200NVX2-MDT0003"} 0
recovery_status{kind="MDT",target="200NVX2-MDT0004"} 0
recovery_status{kind="MDT",target="200NVX2-MDT0007"} 0
recovery_status{kind="OST",target="200NVX2-OST0000"} 0
recovery_status{kind="OST",target="200NVX2-OST0003"} 0
recovery_status{kind="OST",target="200NVX2-OST0004"} 0
recovery_status{kind="OST",target="200NVX2-OST0007"} 0
# HELP recovery_status_completed_clients Gives the count of clients that complete the recovery on a target.
# TYPE recovery_status_completed_clients gauge
recovery_status_completed_clients{kind="MDT",target="200NVX2-MDT0000"} 22
recovery_status_completed_clients{kind="MDT",target="200NVX2-MDT0003"} 22
recovery_status_completed_clients{kind="MDT",target="200NVX2-MDT0004"} 22
recovery_status_completed_clients{kind="MDT",target="200NVX2-MDT0007"} 22
recovery_status_completed_clients{kind="OST",target="200NVX2-OST0000"} 8
recovery_status_completed_clients{kind="OST",target="200NVX2-OST0003"} 8
recovery_status_completed_clients{kind="OST",target="200NVX2-OST0004"} 8
recovery_status_completed_clients{kind="OST",target="200NVX2-OST0007"} 8
# HELP recovery_status_duration_seconds Gives the total duration in seconds of the recovery on a target.
# TYPE recovery_status_duration_seconds gauge
recovery_status_duration_seconds{kind="MDT",target="200NVX2-MDT0000"} 55
recovery_status_duration_seconds{kind="MDT",target="200NVX2-MDT0003"} 50
recovery_status_duration_seconds{kind="MDT",target="200NVX2-MDT0004"} 50
recovery_status_duration_seconds{kind="MDT",target="200NVX2-MDT0007"} 50
recovery_status_duration_seconds{kind="OST",target="200NVX2-OST0000"} 1
recovery_status_duration_seconds{kind="OST",target="200NVX2-OST0003"} 15
recovery_status_duration_seconds{kind="OST",target="200NVX2-OST0004"} 14
recovery_status_duration_seconds{kind="OST",target="200NVX2-OST0007"} 14
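The output above is in the Prometheus text exposition format. The following is a minimal sketch of rendering one such gauge family from (kind, target, value) samples. Sample, escape_label and render_gauge are hypothetical names, not the exporter's API; the actual metric families are registered in brw_stats.rs.

```rust
use std::fmt::Write;

/// Hypothetical sample type: one value for one (kind, target) pair.
struct Sample<'a> {
    kind: &'a str,
    target: &'a str,
    value: u64,
}

/// Escapes a label value as the exposition format requires
/// (backslash, double quote and newline).
fn escape_label(v: &str) -> String {
    v.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

/// Renders one gauge family. Sorting by (kind, target) reproduces the
/// MDT-before-OST order of the demo output; the exporter may order differently.
/// The HELP text is assumed to need no escaping.
fn render_gauge(name: &str, help: &str, samples: &mut [Sample]) -> String {
    samples.sort_by(|a, b| (a.kind, a.target).cmp(&(b.kind, b.target)));
    let mut out = String::new();
    // Writing into a String cannot fail, so unwrap is safe here.
    writeln!(out, "# HELP {name} {help}").unwrap();
    writeln!(out, "# TYPE {name} gauge").unwrap();
    for s in samples.iter() {
        writeln!(
            out,
            "{name}{{kind=\"{}\",target=\"{}\"}} {}",
            escape_label(s.kind),
            escape_label(s.target),
            s.value
        )
        .unwrap();
    }
    out
}

fn main() {
    let mut samples = vec![
        Sample { kind: "OST", target: "200NVX2-OST0000", value: 1 },
        Sample { kind: "MDT", target: "200NVX2-MDT0000", value: 55 },
    ];
    print!(
        "{}",
        render_gauge(
            "recovery_status_duration_seconds",
            "Gives the total duration in seconds of the recovery on a target.",
            &mut samples,
        )
    );
}
```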

breuhan self-assigned this Oct 27, 2025
codecov bot commented Oct 27, 2025

Codecov Report

❌ Patch coverage is 76.00000% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.34%. Comparing base (6efdfe1) to head (7268720).

File                                   Patch %   Lines
lustrefs-exporter/src/brw_stats.rs     16.66%    30 Missing ⚠️
Additional details and impacted files
@@                             Coverage Diff                             @@
##           spoutn1k/EHT-1348-history-in-the-making     #118      +/-   ##
===========================================================================
- Coverage                                    89.60%   89.34%   -0.26%     
===========================================================================
  Files                                           44       44              
  Lines                                         5375     5470      +95     
  Branches                                      5375     5470      +95     
===========================================================================
+ Hits                                          4816     4887      +71     
- Misses                                         484      509      +25     
+ Partials                                        75       74       -1     
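The percentages in the diff above follow from the line counts: Codecov computes coverage as hits / (hits + misses + partials), so partially covered lines count against it. A small sketch reproducing the base, head and change figures (coverage_pct is a hypothetical helper):

```rust
/// Codecov-style line coverage in percent: hits / (hits + misses + partials).
fn coverage_pct(hits: u64, misses: u64, partials: u64) -> f64 {
    let lines = hits + misses + partials;
    100.0 * hits as f64 / lines as f64
}

fn main() {
    let base = coverage_pct(4816, 484, 75); // 5375 lines on the base branch
    let head = coverage_pct(4887, 509, 74); // 5470 lines on this PR's head
    println!("base {base:.2}%, head {head:.2}%, change {:+.2}", head - base);
}
```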
Flag             Coverage Δ
2_14_0_ddn133    36.61% <22.47%> (+1.87%) ⬆️
2_14_0_ddn145    38.55% <34.83%> (+2.82%) ⬆️
all-tests        89.34% <76.00%> (-0.26%) ⬇️

Flags with carried forward coverage won't be shown.


github-actions bot commented Oct 27, 2025

🐰 Bencher Report

Branch: breuhan/parsing_full_recovery_status
Testbed: ci-runner

Benchmark: parse_benchmarks/combine_performance
Measure: Latency, nanoseconds (ns)
Benchmark result: 124,620,000.00 ns (-57.53% vs. baseline 293,399,285.71 ns)
Lower boundary: -764,455,415.43 ns (limit -613.43%)
Upper boundary: 1,351,253,986.86 ns (limit 9.22%)
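The "Result Δ%" column is the relative change against the baseline, and Bencher only raises an alert when a result falls outside the threshold's lower and upper boundaries. A small sketch of both checks using the latency numbers above (delta_pct and within_bounds are hypothetical helpers, not Bencher's API):

```rust
/// Percentage change of a benchmark result relative to its baseline.
fn delta_pct(result: f64, baseline: f64) -> f64 {
    100.0 * (result - baseline) / baseline
}

/// True when the result lies inside the threshold boundaries (no alert).
fn within_bounds(result: f64, lower: f64, upper: f64) -> bool {
    (lower..=upper).contains(&result)
}

fn main() {
    // Latency figures copied from the report above, in nanoseconds.
    let (result, baseline) = (124_620_000.0, 293_399_285.71);
    let (lower, upper) = (-764_455_415.43, 1_351_253_986.86);
    println!(
        "delta {:.2}%, inside boundaries: {}",
        delta_pct(result, baseline),
        within_bounds(result, lower, upper)
    );
}
```

Applied to the displayed (rounded) instruction counts of the next report, 10.75 vs. 13.99 x 1e6, the same formula gives about -23.16% rather than the reported -23.14%, because Bencher works from the unrounded values.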

github-actions bot commented Oct 27, 2025

🐰 Bencher Report

Branch: breuhan/parsing_full_recovery_status
Testbed: ci-runner

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

Benchmark: lustre_metrics::memory_benches::bench_encode_lustre_metrics with_setup:generate_records()

Instructions (threshold set): 10.75 x 1e6 (-23.14% vs. baseline 13.99 x 1e6)
  Lower boundary: 2.43 x 1e6 (limit 22.65%)
  Upper boundary: 25.54 x 1e6 (limit 42.09%)

Other measures (no threshold set):
  D1 Miss Rate        0.93 %
  D1mr                25.00 x 1e3 misses (reads)
  D1mw                9.21 x 1e3 misses (writes)
  DLmr                117.00 misses (reads)
  DLmw                6.45 x 1e3 misses (writes)
  Dr                  2.47 x 1e6 reads
  Dw                  1.23 x 1e6 writes
  Estimated Cycles    14.81 x 1e6 cycles
  I1 Miss Rate        0.01 %
  I1mr                1.06 x 1e3 misses (reads)
  ILmr                890.00 misses (reads)
  L1 Hit Rate         99.76 %
  L1 Hits             14.41 x 1e6 hits
  LL Hit Rate         0.19 %
  LL Hits             27.83 x 1e3 hits
  LL Miss Rate        0.05 %
  LLd Miss Rate       0.18 %
  LLi Miss Rate       0.01 %
  RAM Hit Rate        0.05 %
  RAM Hits            7.46 x 1e3 hits
  Total read+write    14.44 x 1e6 reads/writes
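The derived rows of this report (L1, LL and RAM hits, estimated cycles) can be recomputed from the raw Cachegrind counters. The sketch below assumes the report comes from iai-callgrind, which estimates cycles as L1 hits + 5 × LL hits + 35 × RAM hits. CacheEvents, Derived, report_events and derive_hits are hypothetical names. Because the inputs are the rounded figures shown above, the results match the report only to within rounding.

```rust
/// Raw Cachegrind event counts (rounded values as displayed in the report).
struct CacheEvents {
    ir: f64,   // instructions executed
    dr: f64,   // data reads
    dw: f64,   // data writes
    i1mr: f64, // L1 instruction-cache read misses
    d1mr: f64, // L1 data-cache read misses
    d1mw: f64, // L1 data-cache write misses
    ilmr: f64, // last-level instruction read misses
    dlmr: f64, // last-level data read misses
    dlmw: f64, // last-level data write misses
}

struct Derived {
    l1_hits: f64,
    ll_hits: f64,
    ram_hits: f64,
    estimated_cycles: f64,
}

/// Hypothetical helper returning the counters shown in the report above.
fn report_events() -> CacheEvents {
    CacheEvents {
        ir: 10.75e6, dr: 2.47e6, dw: 1.23e6,
        i1mr: 1.06e3, d1mr: 25.00e3, d1mw: 9.21e3,
        ilmr: 890.0, dlmr: 117.0, dlmw: 6.45e3,
    }
}

fn derive_hits(e: &CacheEvents) -> Derived {
    let total = e.ir + e.dr + e.dw; // "Total read+write"
    let l1_misses = e.i1mr + e.d1mr + e.d1mw;
    let ram_hits = e.ilmr + e.dlmr + e.dlmw; // missed every cache level
    let l1_hits = total - l1_misses;
    let ll_hits = l1_misses - ram_hits; // missed L1 but hit the last-level cache
    Derived {
        l1_hits,
        ll_hits,
        ram_hits,
        // Assumed iai-callgrind weighting: LL hit ~5 cycles, RAM hit ~35 cycles.
        estimated_cycles: l1_hits + 5.0 * ll_hits + 35.0 * ram_hits,
    }
}

fn main() {
    let d = derive_hits(&report_events());
    println!(
        "L1 hits {:.2} x 1e6, LL hits {:.2} x 1e3, RAM hits {:.2} x 1e3, cycles {:.2} x 1e6",
        d.l1_hits / 1e6,
        d.ll_hits / 1e3,
        d.ram_hits / 1e3,
        d.estimated_cycles / 1e6
    );
}
```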

github-actions bot commented Oct 27, 2025

🐰 Bencher Report

Branch: breuhan/parsing_full_recovery_status
Testbed: ci-runner

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

Benchmark: scrape_allocations

peak_rss_mib (threshold set): 45.14 MiB (-46.82% vs. baseline 84.88 MiB)
  Lower boundary: -142.07 MiB (limit -314.72%)
  Upper boundary: 311.82 MiB (limit 14.48%)

Other measures (no threshold set):
  avg_runtime_rss_mib              43.09 MiB
  avg_runtime_virtual_mib          893.91 MiB
  end_rss_mib                      43.16 MiB
  end_virtual_mib                  894.01 MiB
  memory_growth_mib                0.33 MiB
  peak_over_start_rss_ratio        1.03 units
  peak_over_start_virtual_ratio    1.02 units
  peak_virtual_mib                 944.80 MiB
  start_rss_mib                    42.83 MiB
  start_virtual_mib                888.06 MiB
  virtual_growth_mib               5.95 MiB
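The two growth measures in this report are consistent with a simple end-minus-start difference (43.16 - 42.83 = 0.33 MiB RSS, 894.01 - 888.06 = 5.95 MiB virtual). A sketch under that assumption follows; MemSamples and growth are hypothetical names. The two ratio measures are left out because they do not follow from the listed peak and start values (45.14 / 42.83 ≈ 1.05, not 1.03), so they are presumably computed from differently sampled data.

```rust
/// Start and end samples of one scrape_allocations run, in MiB (hypothetical type).
struct MemSamples {
    start_rss: f64,
    end_rss: f64,
    start_virtual: f64,
    end_virtual: f64,
}

/// Assumed definitions: memory_growth = end_rss - start_rss,
/// virtual_growth = end_virtual - start_virtual.
fn growth(m: &MemSamples) -> (f64, f64) {
    (m.end_rss - m.start_rss, m.end_virtual - m.start_virtual)
}

fn main() {
    let m = MemSamples { start_rss: 42.83, end_rss: 43.16, start_virtual: 888.06, end_virtual: 894.01 };
    let (rss, virt) = growth(&m);
    println!("memory_growth_mib {rss:.2}, virtual_growth_mib {virt:.2}");
}
```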

breuhan force-pushed the breuhan/parsing_full_recovery_status branch from 9334a78 to 1478c07 October 27, 2025 11:15
breuhan added the enhancement label Oct 27, 2025
breuhan added this to the next-calver milestone Oct 27, 2025
breuhan force-pushed the breuhan/parsing_full_recovery_status branch from 1478c07 to 4daf4d8 October 27, 2025 16:35
Copilot AI left a comment

Pull Request Overview

This PR enhances the recovery status parser to collect and export four additional metrics for Lustre filesystem recovery monitoring: completed clients, duration, time remaining, and total clients involved in recovery operations.

  • Added support for parsing recovery_duration, time_remaining, and total_clients fields from recovery status output
  • Introduced new Prometheus metrics for recovery duration, time remaining, and total client counts
  • Updated test fixtures and snapshots to validate the new metrics extraction

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

Summary per file:
  • lustre-collector/src/recovery_status_parser.rs: Extended parser to extract duration, time remaining, and total client metrics from recovery status
  • lustre-collector/src/types.rs: Added new TargetStats variants for the additional recovery metrics
  • lustre-collector/src/parser.rs: Integrated recovery status parser into main parsing flow
  • lustrefs-exporter/src/brw_stats.rs: Added Prometheus metric families and registration for the new recovery metrics
  • lustrefs-exporter/src/lib.rs: Added new metric names to the validation list and improved error handling
  • Test fixtures and snapshots: Updated test data and expected outputs to validate new metric extraction


Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

breuhan changed the title from "Enhance recovery status parser to include additional metrics" to "Add recover_status parser lustrefs_exporter" Oct 28, 2025

breuhan force-pushed the breuhan/parsing_full_recovery_status branch from 4daf4d8 to bcfff9a October 28, 2025 15:04
breuhan marked this pull request as ready for review October 28, 2025 15:04
breuhan requested a review from jgrund as a code owner October 28, 2025 15:04
enhance recovery status parser to include additional metrics:

- RecoveryDuration
- RecoveryTimeRemaining
- RecoveryTotalClients
breuhan force-pushed the breuhan/parsing_full_recovery_status branch from bcfff9a to e42468a November 6, 2025 19:15
jparris previously approved these changes Nov 10, 2025

johnsonw (Contributor) left a comment

A few comments. Since this PR also updates lustre-collector, I believe we will need to update the version in EMF once it lands.

breuhan changed the base branch from main to spoutn1k/EHT-1348-history-in-the-making November 18, 2025 12:01
breuhan force-pushed the breuhan/parsing_full_recovery_status branch from 0240dbd to 7268720 November 18, 2025 12:06
breuhan requested review from johnsonw and jparris November 18, 2025 18:14
spoutn1k force-pushed the spoutn1k/EHT-1348-history-in-the-making branch from a789529 to dbc3217 December 3, 2025 10:48
breuhan force-pushed the breuhan/parsing_full_recovery_status branch from 7268720 to d2519eb December 9, 2025 15:47
breuhan changed the base branch from spoutn1k/EHT-1348-history-in-the-making to main December 9, 2025 15:48
jparris merged commit ee3df94 into main Dec 19, 2025
24 checks passed
jparris deleted the breuhan/parsing_full_recovery_status branch December 19, 2025 15:04

Labels

enhancement New feature or request


5 participants