Add support for lustre OSD cache statistics and update related tests#97

Merged
johnsonw merged 6 commits into main from breuhan/add_lustre_cache_metrics
Oct 22, 2025

@breuhan (Contributor) commented Jun 18, 2025

This PR adds Lustre cache metrics to lustrefs-exporter. The underlying stats already exist in Lustre; the exporter now exposes them.
Example of the source stats:

#  lctl get_param osd-ldiskfs.*.stats
osd-ldiskfs.testfs-MDT0000.stats=
snapshot_time             1749240333.839663548 secs.nsecs
start_time                1748242077.730760020 secs.nsecs
elapsed_time              998256.108903528 secs.nsecs
osd-ldiskfs.testfs-OST0000.stats=
snapshot_time             1749240333.839676322 secs.nsecs
start_time                1748242070.560760890 secs.nsecs
elapsed_time              998263.278915432 secs.nsecs
get_page                  1324419919 samples [usecs] 0 3365 1283811895 24240675855
cache_access              1130512151 samples [pages] 1 1024 83720152057
cache_miss                1130512151 samples [pages] 1 1024 83720152057
osd-ldiskfs.testfs-OST0001.stats=
snapshot_time             1749240333.839699900 secs.nsecs
start_time                1748242070.558966984 secs.nsecs
elapsed_time              998263.280732916 secs.nsecs
get_page                  1309638591 samples [usecs] 0 5169 1269579026 23871296444
cache_access              1117889640 samples [pages] 1 1024 82781693532
cache_miss                1117889640 samples [pages] 1 1024 82781693532
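A minimal sketch of how one counter line from the stats above could be split into fields. It is not the parser in lustre-collector; `StatLine` and `parse_stat_line` are hypothetical names, and the field order after the unit (min, max, sum, then an optional sum of squares) is an assumption based on the sample lines.

```rust
/// One counter line from an OSD `stats` file (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct StatLine {
    name: String,
    samples: u64,
    unit: String,
    min: Option<u64>,
    max: Option<u64>,
    sum: Option<u64>,
}

/// Parses lines shaped like `cache_access 1130512151 samples [pages] 1 1024 83720152057`.
/// Returns `None` for anything else, including the `snapshot_time`, `start_time`
/// and `elapsed_time` lines, whose second field is not an integer.
fn parse_stat_line(line: &str) -> Option<StatLine> {
    let mut fields = line.split_whitespace();
    let name = fields.next()?.to_string();
    let samples: u64 = fields.next()?.parse().ok()?;
    // The literal keyword "samples" separates the count from the unit.
    if fields.next()? != "samples" {
        return None;
    }
    let unit = fields
        .next()?
        .trim_matches(|c| c == '[' || c == ']')
        .to_string();
    // Assumed order of the remaining integers: min, max, sum (a trailing sumsq is ignored).
    let mut numbers = fields.filter_map(|f| f.parse::<u64>().ok());
    let min = numbers.next();
    let max = numbers.next();
    let sum = numbers.next();
    Some(StatLine { name, samples, unit, min, max, sum })
}

fn main() {
    let sample = "cache_access              1130512151 samples [pages] 1 1024 83720152057";
    println!("{:?}", parse_stat_line(sample));
}
```

The `samples` field is the value that would feed a counter such as `lustre_cache_access_total`.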

This results in new metrics exposed via OTel, for example:

# HELP lustre_cache_access_total The total number cache accesses.
# TYPE lustre_cache_access_total counter
lustre_cache_access_total{component="ost",operation="cache_access",target="exatest-OST0003",otel_scope_name="lustre"} 297
# HELP lustre_cache_hit_total The total of cache hits.
# TYPE lustre_cache_hit_total counter
lustre_cache_hit_total{component="ost",operation="cache_hit",target="exatest-OST0003",otel_scope_name="lustre"} 0
# HELP lustre_cache_miss_total The total number cache misses.
# TYPE lustre_cache_miss_total counter
lustre_cache_miss_total{component="ost",operation="cache_miss",target="exatest-OST0003",otel_scope_name="lustre"} 297
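For readers unfamiliar with the text exposition format used above, here is a small sketch of how one such sample line is assembled from a metric name, labels, and a value. The exporter relies on its OpenTelemetry and Prometheus libraries for this; `render_counter` and `escape_label_value` are hypothetical helpers written only for illustration.

```rust
/// Escapes a label value for the Prometheus text format:
/// backslash, double quote, and newline must be escaped.
fn escape_label_value(v: &str) -> String {
    v.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

/// Renders a single counter sample as `name{k="v",...} value`.
/// Hypothetical helper, not part of lustrefs-exporter.
fn render_counter(metric: &str, labels: &[(&str, &str)], value: u64) -> String {
    let body = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{}\"", escape_label_value(v)))
        .collect::<Vec<_>>()
        .join(",");
    // `{{` and `}}` emit literal braces around the label set.
    format!("{metric}{{{body}}} {value}")
}

fn main() {
    let line = render_counter(
        "lustre_cache_miss_total",
        &[
            ("component", "ost"),
            ("operation", "cache_miss"),
            ("target", "exatest-OST0003"),
            ("otel_scope_name", "lustre"),
        ],
        297,
    );
    println!("{line}");
}
```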

@breuhan breuhan self-assigned this Jun 18, 2025
codecov bot commented Jun 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.93%. Comparing base (2c177ec) to head (1fce52e).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #97      +/-   ##
==========================================
+ Coverage   93.85%   93.93%   +0.08%     
==========================================
  Files          44       44              
  Lines        5498     5573      +75     
  Branches     5498     5573      +75     
==========================================
+ Hits         5160     5235      +75     
  Misses        269      269              
  Partials       69       69              
Flag Coverage Δ
2_14_0_ddn133 34.74% <6.06%> (+<0.01%) ⬆️
2_14_0_ddn145 35.72% <6.06%> (+<0.01%) ⬆️
all-tests 93.93% <100.00%> (+0.08%) ⬆️

@github-actions

Benchmark for 9d8d245

Test Base PR %
jobstats otel 100 1675.7±33.03µs 1682.6±56.87µs +0.41%
jobstats otel 1000 15.5±0.24ms 15.5±0.14ms 0.00%
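The percentage column in these benchmark tables appears to be the relative change of the PR mean against the base mean, ignoring the ± spread. A short sketch of that arithmetic follows; `percent_change` is a hypothetical helper and not part of the benchmark tooling.

```rust
/// Relative change of `pr` against `base`, in percent.
fn percent_change(base: f64, pr: f64) -> f64 {
    (pr - base) / base * 100.0
}

fn main() {
    // Means taken from the "jobstats otel 100" row (microseconds).
    println!("{:+.2}%", percent_change(1675.7, 1682.6));
    // Means taken from the "jobstats otel 1000" row (milliseconds).
    println!("{:.2}%", percent_change(15.5, 15.5));
}
```

For the first row this gives (1682.6 - 1675.7) / 1675.7 ≈ 0.41%, which matches the reported +0.41%.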

@breuhan breuhan force-pushed the breuhan/add_lustre_cache_metrics branch from c8e95b5 to a409782 Compare June 18, 2025 12:34
@github-actions

Benchmark for b2bde47

Test Base PR %
jobstats otel 100 1678.4±36.49µs 1707.2±79.15µs +1.72%
jobstats otel 1000 15.6±0.13ms 15.7±0.13ms +0.64%

@whamcloud whamcloud deleted a comment from notion-workspace bot Jun 18, 2025
@breuhan breuhan requested review from RDruon and Copilot June 18, 2025 13:38
@breuhan breuhan marked this pull request as ready for review June 18, 2025 13:38
@breuhan breuhan requested a review from jgrund as a code owner June 18, 2025 13:38
Copilot AI left a comment

Pull Request Overview

This PR extends Lustre OSD support by adding new cache-related statistics to the exporter and collector, and updates accompanying tests and fixtures to reflect these metrics.

  • Added OpenTelemetry counters and handlers for get_page, cache_access, cache_hit, cache_miss, and many_credits in the exporter.
  • Extended the OSD parser in the collector to emit these stats and updated its test snapshots.
  • Updated JSON fixtures and Prometheus/OTel snapshot files to include the new stats.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

File Description
lustrefs-exporter/src/stats.rs Added new cache metrics counters, descriptions, and handlers
lustrefs-exporter/src/fixtures/stats.json Inserted JSON entries for OSD cache stats
lustrefs-exporter/src/snapshots/lustrefs_exporter__tests__stats.snap Added Prometheus snapshot lines for new cache metrics
lustrefs-exporter/src/snapshots/lustrefs_exporter__tests__stats_otel.snap Added OTel snapshot lines for new cache metrics
lustre-collector/src/stats_parser.rs Updated parser to recognize osd in name_count_units
lustre-collector/src/osd_parser.rs Introduced Stats variant in OSD parser with new stats()
lustre-collector/src/fixtures/osd.txt Added OSD stats lines to base fixture
lustre-collector/src/fixtures/osd_active.txt Added OSD stats lines to active fixture
lustre-collector/src/snapshots/lustre_collector__tests__params.snap Included osd-*.*.stats in parameter list snapshot
lustre-collector/src/snapshots/lustre_collector__stats_parser__tests__stats.snap Added new Stat entries for cache metrics to parser snapshot
lustre-collector/src/snapshots/lustre_collector__osd_parser__tests__osd_stats.snap Updated empty-stats test for OSD parser
lustre-collector/src/snapshots/lustre_collector__osd_parser__tests__osd_active_stats.snap Added active-stats snapshot for cache metrics in OSD parser
Comments suppressed due to low confidence (5)

lustrefs-exporter/src/stats.rs:94

  • The description for cache_access_total is missing the word "of"; consider updating to "The total number of cache accesses."
                    .with_description("The total number cache accesses.")

lustrefs-exporter/src/stats.rs:98

  • The description for cache_hit_total is unclear; update it to something like "The total number of cache hits."
                    .with_description("The total number hits misses.")

lustrefs-exporter/src/stats.rs:102

  • The description for cache_miss_total should include "of"; consider changing it to "The total number of cache misses."
                    .with_description("The total number cache misses.")

lustrefs-exporter/src/snapshots/lustrefs_exporter__tests__stats_otel.snap:13

  • The snapshot is missing entries for lustre_cache_hit_total (HELP, TYPE, and metric line) to cover the implemented metric.
lustre_cache_access_total{component="ost",operation="cache_access",target="exatest-OST0003",otel_scope_name="lustre"} 297

lustrefs-exporter/src/snapshots/lustrefs_exporter__tests__stats.snap:861

  • The Prometheus snapshot is missing entries for lustre_cache_hit_total; please add HELP, TYPE, and the metric line for it.
lustre_cache_miss_total{component="ost",operation="cache_miss",target="exatest-OST0003",otel_scope_name="lustre"} 297

@breuhan breuhan force-pushed the breuhan/add_lustre_cache_metrics branch from a409782 to 882eb89 Compare June 18, 2025 13:51
@github-actions

Benchmark for b20ced4

Test Base PR %
jobstats otel 100 1671.8±38.75µs 1692.5±31.65µs +1.24%
jobstats otel 1000 15.5±0.10ms 15.5±0.14ms 0.00%

RDruon previously approved these changes Jun 19, 2025
@johnsonw (Contributor)

Please add a description to the top.

@johnsonw (Contributor) left a comment

I think this looks good overall. There are a couple of spots where we should add a newline. This is minor.

@github-actions

Benchmark for c08f345

Test Base PR %
jobstats otel 100 1671.8±26.60µs 1678.7±40.99µs +0.41%
jobstats otel 1000 15.5±0.11ms 15.5±0.12ms 0.00%

@breuhan breuhan requested a review from johnsonw June 23, 2025 18:32
RDruon previously approved these changes Jun 24, 2025
johnsonw previously approved these changes Jun 24, 2025
@jgrund (Contributor) left a comment

Question

@breuhan breuhan dismissed stale reviews from johnsonw and RDruon via 5bf27ef June 26, 2025 14:06
@breuhan breuhan force-pushed the breuhan/add_lustre_cache_metrics branch from 2edf1ab to 5bf27ef Compare June 26, 2025 14:06
@breuhan breuhan requested a review from RDruon June 26, 2025 14:07
@mdiep25 mdiep25 requested review from johnsonw and spoutn1k October 15, 2025 15:28
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

@breuhan breuhan force-pushed the breuhan/add_lustre_cache_metrics branch from f77ce82 to 1e5a266 Compare October 15, 2025 20:29
@breuhan breuhan requested a review from jparris October 15, 2025 20:30
@johnsonw (Contributor) left a comment

A couple of small comments, but one thing that may be missing is cache_total. I see it in stats.json, but it doesn't look like it's being handled in the code. Can you confirm? Either way, we need to add a test to ensure it's covered.

@breuhan breuhan force-pushed the breuhan/add_lustre_cache_metrics branch 2 times, most recently from be69529 to b623db1 Compare October 16, 2025 14:52
@breuhan breuhan requested a review from johnsonw October 16, 2025 15:37
@johnsonw (Contributor) left a comment

Couple of small comments

@johnsonw (Contributor)

Please post a demo where you hit the scrape endpoint and show the new metrics being collected in the output.

@breuhan breuhan force-pushed the breuhan/add_lustre_cache_metrics branch from b623db1 to 9342644 Compare October 17, 2025 10:34
jparris previously approved these changes Oct 20, 2025
@breuhan breuhan requested a review from johnsonw October 20, 2025 16:41
@breuhan (Contributor, Author) commented Oct 21, 2025

This is the output from a real system running the latest version:

# HELP lustre_get_page_total The total number of times the linux page cache was used.
# TYPE lustre_get_page_total counter
lustre_get_page_total{component="ost",operation="get_page",target="200NVX2-OST0000"} 12933703
lustre_get_page_total{component="ost",operation="get_page",target="200NVX2-OST0003"} 22435773
lustre_get_page_total{component="ost",operation="get_page",target="200NVX2-OST0004"} 20381702
lustre_get_page_total{component="ost",operation="get_page",target="200NVX2-OST0007"} 26645652
# HELP lustre_cache_access_total The total number of cache accesses.
# TYPE lustre_cache_access_total counter
lustre_cache_access_total{component="ost",operation="cache_access",target="200NVX2-OST0000"} 7019565
lustre_cache_access_total{component="ost",operation="cache_access",target="200NVX2-OST0003"} 7305334
lustre_cache_access_total{component="ost",operation="cache_access",target="200NVX2-OST0004"} 6828081
lustre_cache_access_total{component="ost",operation="cache_access",target="200NVX2-OST0007"} 6863591
# HELP lustre_cache_miss_total The total number of cache misses.
# TYPE lustre_cache_miss_total counter
lustre_cache_miss_total{component="ost",operation="cache_miss",target="200NVX2-OST0000"} 7019565
lustre_cache_miss_total{component="ost",operation="cache_miss",target="200NVX2-OST0003"} 7305334
lustre_cache_miss_total{component="ost",operation="cache_miss",target="200NVX2-OST0004"} 6828081
lustre_cache_miss_total{component="ost",operation="cache_miss",target="200NVX2-OST0007"} 6863591

@johnsonw (Contributor)

Tested on GCP instance and verified that lustre_get_page_total, lustre_cache_access_total, lustre_cache_miss_total, and lustre_cache_hit_total are included in the exporter output:

# HELP lustre_get_page_total The total number of times the linux page cache was used.
# TYPE lustre_get_page_total counter
lustre_get_page_total{component="ost",operation="get_page",target="emftest-OST0000"} 1963
lustre_get_page_total{component="ost",operation="get_page",target="emftest-OST0001"} 1591
lustre_get_page_total{component="ost",operation="get_page",target="emftest-OST0002"} 1933
lustre_get_page_total{component="ost",operation="get_page",target="emftest-OST0003"} 1778
lustre_get_page_total{component="ost",operation="get_page",target="emftest-OST0004"} 1597
lustre_get_page_total{component="ost",operation="get_page",target="emftest-OST0005"} 1974
lustre_get_page_total{component="ost",operation="get_page",target="emftest-OST0006"} 1780
lustre_get_page_total{component="ost",operation="get_page",target="emftest-OST0007"} 1590
# HELP lustre_cache_access_total The total number of cache accesses.
# TYPE lustre_cache_access_total counter
lustre_cache_access_total{component="ost",operation="cache_access",target="emftest-OST0000"} 372
lustre_cache_access_total{component="ost",operation="cache_access",target="emftest-OST0001"} 264
lustre_cache_access_total{component="ost",operation="cache_access",target="emftest-OST0002"} 372
lustre_cache_access_total{component="ost",operation="cache_access",target="emftest-OST0003"} 317
lustre_cache_access_total{component="ost",operation="cache_access",target="emftest-OST0004"} 263
lustre_cache_access_total{component="ost",operation="cache_access",target="emftest-OST0005"} 371
lustre_cache_access_total{component="ost",operation="cache_access",target="emftest-OST0006"} 317
lustre_cache_access_total{component="ost",operation="cache_access",target="emftest-OST0007"} 255
# HELP lustre_cache_miss_total The total number of cache misses.
# TYPE lustre_cache_miss_total counter
lustre_cache_miss_total{component="ost",operation="cache_miss",target="emftest-OST0000"} 372
lustre_cache_miss_total{component="ost",operation="cache_miss",target="emftest-OST0001"} 264
lustre_cache_miss_total{component="ost",operation="cache_miss",target="emftest-OST0002"} 372
lustre_cache_miss_total{component="ost",operation="cache_miss",target="emftest-OST0003"} 317
lustre_cache_miss_total{component="ost",operation="cache_miss",target="emftest-OST0004"} 263
lustre_cache_miss_total{component="ost",operation="cache_miss",target="emftest-OST0005"} 371
lustre_cache_miss_total{component="ost",operation="cache_miss",target="emftest-OST0006"} 317
lustre_cache_miss_total{component="ost",operation="cache_miss",target="emftest-OST0007"} 255
# HELP lustre_cache_hit_total The total number of cache hits.
# TYPE lustre_cache_hit_total counter
lustre_cache_hit_total{component="ost",operation="cache_hit",target="emftest-OST0000"} 739
lustre_cache_hit_total{component="ost",operation="cache_hit",target="emftest-OST0001"} 519
lustre_cache_hit_total{component="ost",operation="cache_hit",target="emftest-OST0002"} 742
lustre_cache_hit_total{component="ost",operation="cache_hit",target="emftest-OST0003"} 633
lustre_cache_hit_total{component="ost",operation="cache_hit",target="emftest-OST0004"} 528
lustre_cache_hit_total{component="ost",operation="cache_hit",target="emftest-OST0005"} 741
lustre_cache_hit_total{component="ost",operation="cache_hit",target="emftest-OST0006"} 633
lustre_cache_hit_total{component="ost",operation="cache_hit",target="emftest-OST0007"} 508
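A check like the one described above can be automated. The sketch below scans a scrape body for `# TYPE` lines and reports which expected metric families are absent. `metric_families` and `missing_families` are hypothetical helpers, and the scrape body in `main` is a shortened stand-in rather than real exporter output.

```rust
use std::collections::BTreeSet;

/// Collects the metric family names declared by `# TYPE <name> <kind>` lines.
fn metric_families(body: &str) -> BTreeSet<&str> {
    body.lines()
        .filter_map(|line| line.strip_prefix("# TYPE "))
        .filter_map(|rest| rest.split_whitespace().next())
        .collect()
}

/// Returns the expected families that do not appear in the scrape body,
/// in the order they were given.
fn missing_families<'a>(body: &str, expected: &[&'a str]) -> Vec<&'a str> {
    let found = metric_families(body);
    expected
        .iter()
        .copied()
        .filter(|name| !found.contains(*name))
        .collect()
}

fn main() {
    // Shortened stand-in for a scrape response (assumption, for illustration only).
    let body = "\
# TYPE lustre_get_page_total counter
lustre_get_page_total{component=\"ost\",operation=\"get_page\",target=\"emftest-OST0000\"} 1963
# TYPE lustre_cache_access_total counter
lustre_cache_access_total{component=\"ost\",operation=\"cache_access\",target=\"emftest-OST0000\"} 372
";
    let expected = [
        "lustre_get_page_total",
        "lustre_cache_access_total",
        "lustre_cache_miss_total",
        "lustre_cache_hit_total",
    ];
    println!("missing: {:?}", missing_families(body, &expected));
}
```

Matching on `# TYPE` lines rather than sample lines keeps the check independent of label sets and target names.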

johnsonw previously approved these changes Oct 22, 2025
spoutn1k previously approved these changes Oct 22, 2025
github-actions bot commented Oct 22, 2025

🐰 Bencher Report

Branch: breuhan/add_lustre_cache_metrics
Testbed: ci-runner

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.


Benchmark results
Benchmark: lustre_metrics::memory_benches::bench_encode_lustre_metrics with_setup:generate_records()
Only the Instructions measure has a threshold; all other measures report NO THRESHOLD.

D1 Miss Rate (misses, %): 0.92 %
D1mr (misses, reads): 25.00 x 1e3
D1mw (misses, writes): 9.09 x 1e3
DLmr (misses, reads): 111.00
DLmw (misses, writes): 6.48 x 1e3
Dr (reads): 2.47 x 1e6
Dw (writes): 1.22 x 1e6
Estimated Cycles (cycles): 14.79 x 1e6
I1 Miss Rate (misses, %): 0.01 %
I1mr (misses, reads): 1.03 x 1e3
ILmr (misses, reads): 875.00
Instructions, benchmark result: 10.74 x 1e6 (result Δ -20.24%, baseline 13.46 x 1e6)
Instructions, lower boundary: 2.76 x 1e6 (limit 25.73%)
Instructions, upper boundary: 24.16 x 1e6 (limit 44.44%)
L1 Hit Rate (hits, %): 99.76 %
L1 Hits (hits): 14.39 x 1e6
LL Hit Rate (hits, %): 0.19 %
LL Hits (hits): 27.66 x 1e3
LL Miss Rate (misses, %): 0.05 %
LLd Miss Rate (misses, %): 0.18 %
LLi Miss Rate (misses, %): 0.01 %
RAM Hit Rate (hits, %): 0.05 %
RAM Hits (hits): 7.46 x 1e3
Total read+write (reads/writes): 14.43 x 1e6
🐰 View full continuous benchmarking report in Bencher

Labels

enhancement New feature or request

9 participants