rrd: Fix absolute rate calculations #6646

last-genius · 2025-09-02T08:50:41Z

Absolute metric should work as follows:

Timestamp    = 300, 600, 900, 1200
Step         = 300 seconds
ABSOLUTE DS  =    1,  2,   3,    4

But they do not seem to have ever worked correctly - they were previously divided by the interval but incorrectly handled NaNs, resulting in wrong behaviour. Then the refactoring (including 73ca3cc ("CA-404597: rrd - Pass Gauge and Absolute data source values as-is")) broke them further.

Divide absolute metrics by the interval in process_ds_value. Fix the unit test to expect the right behaviour - it passes with this fix.

This means absolute metrics are now correctly calculated as unit/s, changing the behaviour in #6640

I am not 100% sure this fix goes all the way either - unit tests should cover expected values in RRAs as well, and the logic is still really confusing to me.

Absolute metric should work as follows: Timestamp = 300, 600, 900, 1200 Step = 300 seconds ABSOLUTE DS = 1, 2, 3, 4 But they do not seem to have ever worked correctly - they were previously divided by the interval but incorrectly handled NaNs, resulting in wrong behaviour. Then the refactoring (including 73ca3cc ("CA-404597: rrd - Pass Gauge and Absolute data source values as-is")) broke them further. Divide absolute metrics by the interval in process_ds_value. Fix the unit test to expect the right behaviour - it passes with this fix. Signed-off-by: Andrii Sultanov <[email protected]>

edwintorok · 2025-09-02T09:18:18Z

Would be good to test against what rrdtool does (since the code tries to mimic its specification): feed the same input data to both rrdtool and xcp-rrdd (e.g. with a synthetic plugin), and check the output.
(Although I don't think there is a way to feed synthetic time to rrdtool)

edwintorok · 2025-09-02T09:20:25Z

Would be good to test against what rrdtool does (since the code tries to mimic its specification): feed the same input data to both rrdtool and xcp-rrdd (e.g. with a synthetic plugin), and check the output. (Although I don't think there is a way to feed synthetic time to rrdtool)

Ah actually you can feed timestamps to rrdupdate, see https://oss.oetiker.ch/rrdtool/doc/rrdupdate.en.html

{N | timestamp}:value[:value]...
The data used for updating the RRD was acquired at a certain time. This time can either be defined in seconds since 1970->01-01 or by using the letter 'N', in which case the update time is set to be the current time

That probably means we can't feed it fractional times, but that should be OK, we can generate synthetic data using integer timestamps only.

minglumlu · 2025-09-02T10:08:24Z

The Absolute is mainly used for IO stats, which are from outputs of /usr/bin/iostat.
Would the units of the stats be already something like "rMB/s"?
I think dividing interval seems right per the definition of Absolute. But would it get the wrong value for the existing IO stats?

robhoes · 2025-09-02T16:29:08Z

rrdp_iostat.ml currently defines 8 datasources of type Absolute. They all list their unit as x/s, except Tapdisks_in_low_memory_mode. That means that the value that come in must have unit x. Also, those counters must be reset to 0 when they are read (otherwise we'd need the type to be Derive).

Assuming that is indeed the case, then the implication is that those 8 datasource are wrong by a factor of 5 right now :/

robhoes · 2025-09-02T16:38:00Z

Here are 4 of those iostat values getting collected in the make function:

        ; io_throughput_read_mb=
            to_float
              (get_stats_read_sectors s3 -- opt get_stats_read_sectors last_s3)
            *. 512.
            /. 1048576.
        ; io_throughput_write_mb=
            to_float
              (get_stats_write_sectors s3 -- opt get_stats_write_sectors last_s3)
            *. 512.
            /. 1048576.
        ; iops_read=
            get_stats_read_reqs_completed s3
            -- opt get_stats_read_reqs_completed last_s3
        ; iops_write=
            get_stats_write_reqs_completed s3
            -- opt get_stats_write_reqs_completed last_s3

So those do indeed appear to have the right units and "reset" the value by subtracting the "last" value, before passing the values to ds_make, making Absolute the right RRD type.

The iowait datasource similarly looks good. So the 7 Absolute iostat RRDs look correctly defined, and therefore must be out by 5x before the fix in the PR :(

robhoes · 2025-09-02T16:46:57Z

The Tapdisks_in_low_memory_mode RRD looks a bit questionable. The incoming values are just the current number of tapdisks in this mode, an instantaneous value that is not time-dependent. I wonder if this should be a GAUGE instead.

This RRD was introduced in c1e0901.

minglumlu · 2025-09-03T08:09:38Z

I wonder if this should be a GAUGE instead.

Indeed. It should be a GAUGE as it's a measurement at a time point.
May need a new metric for it. The rrd may not be happy for changing the type in-place.

robhoes · 2025-09-03T16:10:19Z

In hindsight, perhaps the code that was changed here should wasn't the right place to do it. It seems the aim of that function is to calculate the difference between values for DERIVE datasource, not to do any time-based calculations. There is code further down in rrd.ml:

          (* CA-404597 - Gauge and Absolute values should be passed as-is,
             without being involved in time-based calculations at all.
             This applies to calculations below as well *)
          match ds.ds_ty with
          | Gauge | Absolute ->
              ds.ds_value <- value
          | Derive ->
              ds.ds_value <- ds.ds_value +. (pre_int *. value /. interval)

Is this a better place to fix this by having a separate Absolute case like this?

         | Absolute ->
              ds.ds_value <- value /. interval

There is a similar block even further down, which uses post_int rather thanpre_int for Derive.

last-genius · 2025-09-03T16:26:57Z

In hindsight, perhaps the code that was changed here should wasn't the right place to do it. It seems the aim of that function is to calculate the difference between values for DERIVE datasource, not to do any time-based calculations. There is code further down in rrd.ml:
          (* CA-404597 - Gauge and Absolute values should be passed as-is,
             without being involved in time-based calculations at all.
             This applies to calculations below as well *)
          match ds.ds_ty with
          | Gauge | Absolute ->
              ds.ds_value <- value
          | Derive ->
              ds.ds_value <- ds.ds_value +. (pre_int *. value /. interval)
Is this a better place to fix this by having a separate Absolute case like this?
         | Absolute ->
              ds.ds_value <- value /. interval
There is a similar block even further down, which uses post_int rather thanpre_int for Derive.

That's how I first approached it. But your suggestion does not pass the unit test (note that it modifies ds.ds_value while leaving v2s in-place, and those are used to reset the accumulators below, there might be more intricacies too).
If both this pre_int block and the one (post_int) below are changed, then it passes the unit test, but then this requires identical logic in two places instead of one.

I don't really see any downsides to the current aproach - the function already considers the interval (when comparing it against mrhb), just in a different way. But if you want to, could probably do it the other way too.

robhoes · 2025-09-04T09:56:56Z

For downsides of the current approach, the main thing is that dividing by the interval now happens if different places for Derive and Absolute. It would be nice if all the /. interval happen in the same place to make it obvious what the differences and commonalities between the RRD types are.

Then again, I don't yet really understand the v2s/post_int/pre_int stuff for Derive... I'll take a closer look.

robhoes · 2025-09-04T11:12:49Z

The main thing I don't understand now is why ds_update still treats Derive and Absolute differently. For Derive, the function process_ds_value calculates the diff between the current and previous value. For Absolute, it doesn't need to do that, because the incoming value is effectively already a diff. So when ds_update receives a value from process_ds_value shouldn't it do the same for Derive and Absolute, i.e. account for the elapsed time in the same way (while Gauge values should be left alone)?

E.g.

       match ds.ds_ty with
      | Gauge ->
          ds.ds_value <- value
      | Derive | Absolute ->
          ds.ds_value <- ds.ds_value +. (pre_int *. value /. interval)

...

            match ds.ds_ty with
            | Gauge ->
                ds.ds_value
            | Derive | Absolute ->
                ds.ds_value
                /. (occu_pdp_st -. proc_pdp_st -. ds.ds_unknown_sec)

etc.

lindig approved these changes Sep 2, 2025

View reviewed changes

minglumlu approved these changes Sep 3, 2025

View reviewed changes

robhoes approved these changes Sep 3, 2025

View reviewed changes

robhoes added this pull request to the merge queue Sep 3, 2025

Merged via the queue into xapi-project:master with commit 6c16a2f Sep 3, 2025
15 checks passed

minglumlu mentioned this pull request Sep 5, 2025

CP-308863: Count vGPU migrations #6640

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

rrd: Fix absolute rate calculations #6646

rrd: Fix absolute rate calculations #6646

Uh oh!

last-genius commented Sep 2, 2025

Uh oh!

edwintorok commented Sep 2, 2025

Uh oh!

edwintorok commented Sep 2, 2025

Uh oh!

minglumlu commented Sep 2, 2025

Uh oh!

robhoes commented Sep 2, 2025

Uh oh!

robhoes commented Sep 2, 2025 •

edited

Loading

Uh oh!

robhoes commented Sep 2, 2025 •

edited

Loading

Uh oh!

minglumlu commented Sep 3, 2025

Uh oh!

Uh oh!

robhoes commented Sep 3, 2025

Uh oh!

last-genius commented Sep 3, 2025

Uh oh!

robhoes commented Sep 4, 2025

Uh oh!

robhoes commented Sep 4, 2025

Uh oh!

Uh oh!

rrd: Fix absolute rate calculations #6646

rrd: Fix absolute rate calculations #6646

Uh oh!

Conversation

last-genius commented Sep 2, 2025

Uh oh!

edwintorok commented Sep 2, 2025

Uh oh!

edwintorok commented Sep 2, 2025

Uh oh!

minglumlu commented Sep 2, 2025

Uh oh!

robhoes commented Sep 2, 2025

Uh oh!

robhoes commented Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

robhoes commented Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

minglumlu commented Sep 3, 2025

Uh oh!

Uh oh!

robhoes commented Sep 3, 2025

Uh oh!

last-genius commented Sep 3, 2025

Uh oh!

robhoes commented Sep 4, 2025

Uh oh!

robhoes commented Sep 4, 2025

Uh oh!

Uh oh!

robhoes commented Sep 2, 2025 •

edited

Loading

robhoes commented Sep 2, 2025 •

edited

Loading