H_0
RAPL metrics are stable against coherent workloads, and the metrics produced by Perf, HWPC and SmartWatts stay stable (i.e., do not introduce instability) when using this interface as base input.
Open Questions
- Q: Is stress-ng a correct candidate?
  - Stress can be configured to reproduce the same usage at a given report frequency (see the sketch after this list)
- Q: What is the current baseline/SOTA to be considered?
  - RAPL
- Q: What are the metrics to be considered, both for the Measured Metric (MM) and the Outside Metrics (OM)?
  - MM
    - Joules
  - OM
    - CPU consumption
    - Memory usage
    - Global system call impact
- Q: What is the "most specific" zoom to consider in sub-benchmark results?
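A minimal sketch of what a reproducible stress-ng driver could look like, assuming a Linux host that exposes RAPL through the powercap sysfs interface (`/sys/class/powercap/intel-rapl:0/energy_uj`) and has `stress-ng` installed; the worker count, load percentage and duration are placeholder values, not project decisions:

```python
# Hypothetical driver: run a fixed stress-ng workload and read the RAPL
# package energy counter before/after through the powercap sysfs interface.
import subprocess
import time
from pathlib import Path

RAPL_PKG = Path("/sys/class/powercap/intel-rapl:0/energy_uj")  # package domain 0 (assumed present)

def read_energy_uj() -> int:
    """Read the cumulative package energy counter, in microjoules."""
    return int(RAPL_PKG.read_text().strip())

def run_stress(workers: int = 4, load_pct: int = 50, seconds: int = 30) -> float:
    """Run a fixed stress-ng workload and return the energy consumed, in joules."""
    before = read_energy_uj()
    subprocess.run(
        ["stress-ng", "--cpu", str(workers), "--cpu-load", str(load_pct),
         "--timeout", f"{seconds}s", "--metrics-brief"],
        check=True,
    )
    after = read_energy_uj()
    # The counter eventually wraps around; wrap handling is ignored in this sketch.
    return (after - before) / 1e6

if __name__ == "__main__":
    for i in range(3):
        print(f"run {i}: {run_stress():.2f} J")
        time.sleep(5)  # cool-off between runs
```

Running this back to back also feeds the "same twice in a row" check listed under Best Practice below.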
To Avoid: Benchmarking Crimes
- Selective Benchmarking
  - Not evaluating degradation in other areas: performance must significantly improve in the area of interest AND must not significantly degrade elsewhere
    - Check for a decrease of the std, variance and CV of the Measured Metric (MM) while the Outside Metrics (OM) stay stable (see the sketches after this list)
  - Cherry-picking subsets of a suite without justification (and concluding about the whole suite): justify any avoided subset, no arbitrary choices
    - Run the benchmark against all available types of infrastructure (with reasonable combinations)
  - Selective data sets hiding deficiencies: do not restrict a dataset if the removed data tells a different story
    - Present the whole benchmark results
- Improper handling of benchmark results
  - Micro-benchmarks cannot be pictured alone; they can be used as an example before presenting macro-benchmarks of real-world workloads
    - Should this benchmark conclude stability against stress-ng use cases, ensure that it still holds against a more complex, yet still controlled, use case
  - "Throughput degraded by X% => overhead is X%": accompany throughput comparisons with the complete CPU load AND compare I/O throughput in terms of processing time per bit
    - If comparing two stability results for different PowerAPI versions, do so against the OM
- Downplaying overheads
  - "X% to Y% is a (Y-X)% increase/decrease": going from 1% to 2% is a doubling
    - The baseline is the denominator in relative comparisons
  - No indication of the significance of data: always report variance/standard deviation and any obvious indicators (R²...), even consider min/max; use Student's t-test to check significance (see the sketches after this list)
    - The geometric mean shall be preferred to the arithmetic mean, at least with normalized values
- Using the wrong benchmarks
  - Benchmarking a simplified simulated system: use a representative system AND do not make any simplifying assumption with an impact
    - Ensure that stress-ng is a correct candidate for benchmarking
  - Inappropriate and misleading benchmarks: measure what you are changing
    - Use the same formula/sensor inputs for a given benchmark
  - Same dataset for calibration and validation: keep both data sets disjoint
    - Use different programs/options to calibrate SmartWatts during the development process
- Improper comparison of results
  - No proper baseline: always compare to the "real" baseline (SOTA solution, theoretically best)
    - Define the baseline (see Open Questions)
  - Only evaluating against yourself: even a re-edition/correction shall refer to the current "real" baseline
    - Present the evolution from the previous paper and from the current SOTA/baseline
  - Unfair benchmarking of competitors: be explicit and complete about how the competitor's tool is used (even consider having them validate the results)
    - If comparing to Scaphandre...
  - Inflating gains by not comparing against the current SOTA: avoid "they improved by X and we by Y"; prefer "we improve the newly established baseline X by Z"
    - Update the results of the previous paper with the current baseline
- Missing information
  - Missing specification of the evaluation platform: give as many details as possible about ALL hardware used to ensure reproducibility: processor architecture, number of cores, clock rate, memory sizes, size of every cache level, core type, microarchitecture, OS & version, hypervisor & version
    - Enrich with as much metadata as possible to have supplementary resources
  - Presenting only top-level/aggregated results: sub-benchmarks shall be presented to avoid loss of information
    - Define the adequate grain for "most specific" sub-benchmarks (see Open Questions)
  - Relative numbers only: present raw values in addition to ratios so people can sanity-check
    - To do! :)
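A sketch of the MM/OM stability check from "Selective benchmarking" above: compute the std, variance and CV of the measured joules, and only treat a comparison as meaningful if the outside metrics stayed stable. The 0.1% CV threshold and the argument names are assumptions, not fixed project values:

```python
# Hypothetical stability check over repeated runs of the same workload.
import statistics

def coefficient_of_variation(values: list[float]) -> float:
    """CV = standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

def is_stable(values: list[float], max_cv: float = 0.001) -> bool:
    """Stable if the coefficient of variation stays below the chosen threshold."""
    return coefficient_of_variation(values) < max_cv

def check_run(mm_joules: list[float], om_cpu: list[float], om_mem: list[float]) -> dict:
    """Report MM dispersion and whether the OM stayed stable during the runs."""
    return {
        "mm_std": statistics.stdev(mm_joules),
        "mm_variance": statistics.variance(mm_joules),
        "mm_cv": coefficient_of_variation(mm_joules),
        "om_stable": is_stable(om_cpu) and is_stable(om_mem),
    }
```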
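And a sketch of the significance/aggregation points under "Downplaying overheads" (Student's t-test between two sets of runs, geometric mean over normalized values); `scipy` is assumed to be available, and `statistics.geometric_mean` needs Python 3.8+:

```python
# Hypothetical significance check and aggregation of normalized results.
import statistics
from scipy import stats

def significantly_different(runs_a: list[float], runs_b: list[float], alpha: float = 0.05) -> bool:
    """Two-sample Student's t-test on two sets of benchmark runs."""
    t_stat, p_value = stats.ttest_ind(runs_a, runs_b)
    return p_value < alpha

def summarize_normalized(ratios: list[float]) -> float:
    """Aggregate normalized results (measured / baseline) with the geometric mean."""
    return statistics.geometric_mean(ratios)
```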
Best Practice
- Document what I do
  - In /**/*.md files on GitHub
- Run warm-up iterations that aren't timed
  - May be useful to literally warm up the CPU
- Do several runs and check the std (expected to be < 0.1%) (see the sketch after this list)
  - [ ] We may define 0.1% as the ideal threshold before focusing on something else (and then keep 0.1% as a threshold for accepting further features)
- Use a combination of successive and separate runs
  - Same run twice in a row (may exhibit caching effects)
    - May be relevant for the HWPC Sensor
  - Explore the data set in both directions
    - "Directions" might be "more stressed" to "more relaxed" and back? May need a big "cool-off" step?
- If making use of regular strides (2, 4, 8, 16...), also use "random" points to avoid pathological cases, but keep the regular strides in order to identify those pathological cases (see the sketch after this list)
  - May be considered in formulas?
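A minimal sketch of the warm-up / repeated-runs practice: untimed warm-up iterations, then a series of measured runs that is flagged when the relative standard deviation exceeds 0.1%. `run_once` is a placeholder for whichever driver ends up being used (e.g. the stress-ng sketch above), and the counts and threshold are the values discussed above, not final choices:

```python
# Hypothetical measurement loop with warm-up and a relative-std check.
import statistics

def measure(run_once, warmup: int = 3, runs: int = 10, max_rel_std: float = 0.001):
    """Return the measured samples, their relative std, and whether it passes the threshold."""
    for _ in range(warmup):
        run_once()                      # warm-up iterations, not recorded
    samples = [run_once() for _ in range(runs)]
    rel_std = statistics.stdev(samples) / statistics.mean(samples)
    return samples, rel_std, rel_std < max_rel_std
```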
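And a sketch of the "regular strides plus random points" idea: keep the powers-of-two series so pathological cases remain identifiable, and mix in seeded random points so the exploration is not limited to those cases. Bounds and counts are placeholders:

```python
# Hypothetical generation of workload sizes: regular strides plus random points.
import random

def workload_points(max_exp: int = 6, n_random: int = 4, seed: int = 42) -> list[int]:
    regular = [2 ** k for k in range(1, max_exp + 1)]          # 2, 4, 8, ..., 2**max_exp
    rng = random.Random(seed)                                   # seeded for reproducibility
    randoms = [rng.randint(2, 2 ** max_exp) for _ in range(n_random)]
    return sorted(set(regular + randoms))
```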