Add support for lustre OSD cache statistics and update related tests #97
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #97 +/- ##
==========================================
+ Coverage 93.85% 93.93% +0.08%
==========================================
Files 44 44
Lines 5498 5573 +75
Branches 5498 5573 +75
==========================================
+ Hits 5160 5235 +75
Misses 269 269
Partials 69 69
c8e95b5 to a409782
Pull Request Overview
This PR extends Lustre OSD support by adding new cache-related statistics to the exporter and collector, and updates accompanying tests and fixtures to reflect these metrics.
- Added OpenTelemetry counters and handlers for `get_page`, `cache_access`, `cache_hit`, `cache_miss`, and `many_credits` in the exporter.
- Extended the OSD parser in the collector to emit these stats and updated its test snapshots.
- Updated JSON fixtures and Prometheus/OTel snapshot files to include the new stats.
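As an illustration of the exporter side of this change, the following is a minimal sketch of mapping raw OSD stat names to exported counter names. It is not the code in `stats.rs`: `cache_metric_name` is a hypothetical helper, and only the three metric names that appear elsewhere in this conversation are mapped. The exported names for `get_page` and `many_credits` are not shown in the thread, so the sketch leaves them unmapped rather than guessing.

```rust
/// Hypothetical helper: map a raw OSD stat name to an exported counter name.
/// Only names quoted in this conversation are included; anything else,
/// including `get_page` and `many_credits`, returns `None` in this sketch.
fn cache_metric_name(stat: &str) -> Option<&'static str> {
    match stat {
        "cache_access" => Some("lustre_cache_access_total"),
        "cache_hit" => Some("lustre_cache_hit_total"),
        "cache_miss" => Some("lustre_cache_miss_total"),
        _ => None,
    }
}
```

A handler could call such a lookup once per parsed stat and skip entries that return `None`.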
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| lustrefs-exporter/src/stats.rs | Added new cache metrics counters, descriptions, and handlers |
| lustrefs-exporter/src/fixtures/stats.json | Inserted JSON entries for OSD cache stats |
| lustrefs-exporter/src/snapshots/lustrefs_exporter__tests__stats.snap | Added Prometheus snapshot lines for new cache metrics |
| lustrefs-exporter/src/snapshots/lustrefs_exporter__tests__stats_otel.snap | Added OTel snapshot lines for new cache metrics |
| lustre-collector/src/stats_parser.rs | Updated parser to recognize osd in name_count_units |
| lustre-collector/src/osd_parser.rs | Introduced Stats variant in OSD parser with new stats() |
| lustre-collector/src/fixtures/osd.txt | Added OSD stats lines to base fixture |
| lustre-collector/src/fixtures/osd_active.txt | Added OSD stats lines to active fixture |
| lustre-collector/src/snapshots/lustre_collector__tests__params.snap | Included osd-*.*.stats in parameter list snapshot |
| lustre-collector/src/snapshots/lustre_collector__stats_parser__tests__stats.snap | Added new Stat entries for cache metrics to parser snapshot |
| lustre-collector/src/snapshots/lustre_collector__osd_parser__tests__osd_stats.snap | Updated empty-stats test for OSD parser |
| lustre-collector/src/snapshots/lustre_collector__osd_parser__tests__osd_active_stats.snap | Added active-stats snapshot for cache metrics in OSD parser |
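For readers unfamiliar with the collector side, here is a minimal sketch of pulling the name, sample count, and units out of one stats line. It assumes the common Lustre stats layout `name count samples [units] ...`; the `Stat` struct and `parse_stat_line` function are hypothetical and are not the parser in `stats_parser.rs` or `osd_parser.rs`.

```rust
/// Hypothetical, simplified view of one parsed stats entry.
#[derive(Debug, PartialEq)]
struct Stat {
    name: String,
    samples: u64,
    units: String,
}

/// Parse a line of the assumed form `name count samples [units] ...`.
/// Returns `None` for lines that do not match, such as `snapshot_time`.
fn parse_stat_line(line: &str) -> Option<Stat> {
    let mut fields = line.split_whitespace();
    let name = fields.next()?;
    // The second field must be an unsigned integer sample count.
    let samples = fields.next()?.parse::<u64>().ok()?;
    if fields.next()? != "samples" {
        return None;
    }
    // Units are wrapped in square brackets, e.g. `[pages]`.
    let units = fields.next()?.strip_prefix('[')?.strip_suffix(']')?;
    Some(Stat {
        name: name.to_string(),
        samples,
        units: units.to_string(),
    })
}
```

Any trailing min, max, and sum fields are ignored here, since the cache counters discussed in this PR are exported as sample counts.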
Comments suppressed due to low confidence (5)
lustrefs-exporter/src/stats.rs:94
- The description for `cache_access_total` is missing the word "of"; consider updating to "The total number of cache accesses."
.with_description("The total number cache accesses.")
lustrefs-exporter/src/stats.rs:98
- The description for `cache_hit_total` is unclear; update it to something like "The total number of cache hits."
.with_description("The total number hits misses.")
lustrefs-exporter/src/stats.rs:102
- The description for `cache_miss_total` should include "of"; consider changing it to "The total number of cache misses."
.with_description("The total number cache misses.")
lustrefs-exporter/src/snapshots/lustrefs_exporter__tests__stats_otel.snap:13
- The snapshot is missing entries for `lustre_cache_hit_total` (HELP, TYPE, and metric line) to cover the implemented metric.
lustre_cache_access_total{component="ost",operation="cache_access",target="exatest-OST0003",otel_scope_name="lustre"} 297
lustrefs-exporter/src/snapshots/lustrefs_exporter__tests__stats.snap:861
- The Prometheus snapshot is missing entries for `lustre_cache_hit_total`; please add HELP, TYPE, and the metric line for it.
lustre_cache_miss_total{component="ost",operation="cache_miss",target="exatest-OST0003",otel_scope_name="lustre"} 297
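To make the "HELP, TYPE, and metric line" triple concrete, here is a minimal sketch that renders one counter in the Prometheus text exposition format. `render_counter` is a hypothetical helper, not the exporter's encoder; it does no label-value escaping, and any value passed for `lustre_cache_hit_total` is a placeholder because the thread does not show the real one.

```rust
/// Hypothetical helper: render a single counter sample with its HELP and TYPE
/// lines in the Prometheus text exposition format.
/// Label values are inserted verbatim (no escaping), which is only safe for
/// simple values like the ones quoted in this review.
fn render_counter(name: &str, help: &str, labels: &[(&str, &str)], value: u64) -> String {
    let label_str = labels
        .iter()
        .map(|(key, val)| format!("{}=\"{}\"", key, val))
        .collect::<Vec<_>>()
        .join(",");
    format!(
        "# HELP {} {}\n# TYPE {} counter\n{}{{{}}} {}\n",
        name, help, name, name, label_str, value
    )
}
```

Calling it with the name `lustre_cache_hit_total`, the corrected description, and the `component`, `operation`, and `target` labels seen in the snapshot lines yields the three lines the review asks for.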
a409782 to 882eb89
Please add a description to the top.
johnsonw left a comment
I think this looks good overall. There are a couple of spots where we should add a newline. This is minor.
2edf1ab to 5bf27ef
Pull Request Overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
f77ce82 to 1e5a266
johnsonw left a comment
A couple of small comments, but one thing that may be missing is `cache_total`. I see it in stats.json, but it doesn't look like it's being handled in the code. Can you confirm? Either way, we need to add a test to ensure it's covered.
be69529 to b623db1
johnsonw left a comment
Couple of small comments
Please post a demo where you hit the scrape endpoint and show the new metrics being collected in the output.
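One way to check such a demo is to filter the scrape body for the new metrics. The sketch below assumes the body has already been fetched as text and that the new metric names share the `lustre_cache_` prefix seen in the snapshot lines quoted earlier in this conversation; `cache_metric_lines` is a hypothetical helper, not part of the exporter.

```rust
/// Hypothetical helper: return the sample lines for the cache metrics from a
/// Prometheus-format scrape body. HELP and TYPE lines start with `#`, so the
/// prefix check excludes them.
fn cache_metric_lines(body: &str) -> Vec<&str> {
    body.lines()
        .map(str::trim)
        .filter(|line| line.starts_with("lustre_cache_"))
        .collect()
}
```

An empty result from a target that serves OSD stats would suggest the new counters are not being exported.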
b623db1 to 9342644
This is the output from a real system running on the latest version
Tested on GCP instance and verified that
| Branch | breuhan/add_lustre_cache_metrics |
| Testbed | ci-runner |
⚠️ WARNING: No Threshold found! Without a Threshold, no Alerts will ever be generated.
- RAM Hits (hits)
- LLi Miss Rate (misses (%))
- D1mr (misses (reads))
- RAM Hit Rate (hits (%))
- DLmr (misses (reads))
- Total read+write (reads/writes)
- Dw (writes)
- DLmw (misses (writes))
- LL Hits (hits)
- LLd Miss Rate (misses (%))
- LL Miss Rate (misses (%))
- L1 Hits (hits)
- I1mr (misses (reads))
- D1 Miss Rate (misses (%))
- D1mw (misses (writes))
- L1 Hit Rate (hits (%))
- LL Hit Rate (hits (%))
- Dr (reads)
- Estimated Cycles (cycles)
- I1 Miss Rate (misses (%))
- ILmr (misses (reads))
To only post results if a Threshold exists, set the `--ci-only-thresholds` flag.
Benchmark results for lustre_metrics::memory_benches::bench_encode_lustre_metrics with_setup:generate_records():
| Measure | Units | Value |
|---|---|---|
| D1 Miss Rate | misses (%) | 0.92 |
| D1mr | misses (reads) | 25.00 x 1e3 |
| D1mw | misses (writes) | 9.09 x 1e3 |
| DLmr | misses (reads) | 111.00 |
| DLmw | misses (writes) | 6.48 x 1e3 |
| Dr | reads | 2.47 x 1e6 |
| Dw | writes | 1.22 x 1e6 |
| Estimated Cycles | cycles | 14.79 x 1e6 |
| I1 Miss Rate | misses (%) | 0.01 |
| I1mr | misses (reads) | 1.03 x 1e3 |
| ILmr | misses (reads) | 875.00 |
| Instructions | instructions | 10.74 x 1e6 (-20.24%); baseline 13.46 x 1e6; lower boundary 2.76 x 1e6 (25.73%); upper boundary 24.16 x 1e6 (44.44%) |
| L1 Hit Rate | hits (%) | 99.76 |
| L1 Hits | hits | 14.39 x 1e6 |
| LL Hit Rate | hits (%) | 0.19 |
| LL Hits | hits | 27.66 x 1e3 |
| LL Miss Rate | misses (%) | 0.05 |
| LLd Miss Rate | misses (%) | 0.18 |
| LLi Miss Rate | misses (%) | 0.01 |
| RAM Hit Rate | hits (%) | 0.05 |
| RAM Hits | hits | 7.46 x 1e3 |
| Total read+write | reads/writes | 14.43 x 1e6 |
This PR adds Lustre cache metrics to `lustrefs-exporter`. These stats already exist and are now exposed.
Example of the source stats:
This will result in new OTEL exposed metrics (just an example):