Skip to content

Conversation

last-genius
Copy link
Contributor

Absolute metric should work as follows:

Timestamp    = 300, 600, 900, 1200
Step         = 300 seconds
ABSOLUTE DS  =    1,  2,   3,    4

But they do not seem to have ever worked correctly - they were previously divided by the interval but incorrectly handled NaNs, resulting in wrong behaviour. Then the refactoring (including 73ca3cc ("CA-404597: rrd - Pass Gauge and Absolute data source values as-is")) broke them further.

Divide absolute metrics by the interval in process_ds_value. Fix the unit test to expect the right behaviour - it passes with this fix.


This means absolute metrics are now correctly calculated as unit/s, changing the behaviour in #6640

I am not 100% sure this fix goes all the way either - unit tests should cover expected values in RRAs as well, and the logic is still really confusing to me.

Absolute metric should work as follows:

    Timestamp    = 300, 600, 900, 1200
    Step         = 300 seconds
    ABSOLUTE DS  =    1,  2,   3,    4

But they do not seem to have ever worked correctly - they were previously
divided by the interval but incorrectly handled NaNs, resulting in wrong
behaviour. Then the refactoring (including 73ca3cc ("CA-404597: rrd - Pass
Gauge and Absolute data source values as-is")) broke them further.

Divide absolute metrics by the interval in process_ds_value. Fix the unit test
to expect the right behaviour - it passes with this fix.

Signed-off-by: Andrii Sultanov <[email protected]>
@edwintorok
Copy link
Contributor

Would be good to test against what rrdtool does (since the code tries to mimic its specification): feed the same input data to both rrdtool and xcp-rrdd (e.g. with a synthetic plugin), and check the output.
(Although I don't think there is a way to feed synthetic time to rrdtool)

@edwintorok
Copy link
Contributor

Would be good to test against what rrdtool does (since the code tries to mimic its specification): feed the same input data to both rrdtool and xcp-rrdd (e.g. with a synthetic plugin), and check the output. (Although I don't think there is a way to feed synthetic time to rrdtool)

Ah actually you can feed timestamps to rrdupdate, see https://oss.oetiker.ch/rrdtool/doc/rrdupdate.en.html

{N | timestamp}:value[:value]...
The data used for updating the RRD was acquired at a certain time. This time can either be defined in seconds since 1970->01-01 or by using the letter 'N', in which case the update time is set to be the current time

That probably means we can't feed it fractional times, but that should be OK, we can generate synthetic data using integer timestamps only.

@minglumlu
Copy link
Member

The Absolute is mainly used for IO stats, which are from outputs of /usr/bin/iostat.
Would the units of the stats be already something like "rMB/s"?
I think dividing interval seems right per the definition of Absolute. But would it get the wrong value for the existing IO stats?

@robhoes
Copy link
Member

robhoes commented Sep 2, 2025

rrdp_iostat.ml currently defines 8 datasources of type Absolute. They all list their unit as x/s, except Tapdisks_in_low_memory_mode. That means that the value that come in must have unit x. Also, those counters must be reset to 0 when they are read (otherwise we'd need the type to be Derive).

Assuming that is indeed the case, then the implication is that those 8 datasource are wrong by a factor of 5 right now :/

@robhoes
Copy link
Member

robhoes commented Sep 2, 2025

Here are 4 of those iostat values getting collected in the make function:

        ; io_throughput_read_mb=
            to_float
              (get_stats_read_sectors s3 -- opt get_stats_read_sectors last_s3)
            *. 512.
            /. 1048576.
        ; io_throughput_write_mb=
            to_float
              (get_stats_write_sectors s3 -- opt get_stats_write_sectors last_s3)
            *. 512.
            /. 1048576.
        ; iops_read=
            get_stats_read_reqs_completed s3
            -- opt get_stats_read_reqs_completed last_s3
        ; iops_write=
            get_stats_write_reqs_completed s3
            -- opt get_stats_write_reqs_completed last_s3

So those do indeed appear to have the right units and "reset" the value by subtracting the "last" value, before passing the values to ds_make, making Absolute the right RRD type.

The iowait datasource similarly looks good. So the 7 Absolute iostat RRDs look correctly defined, and therefore must be out by 5x before the fix in the PR :(

@robhoes
Copy link
Member

robhoes commented Sep 2, 2025

The Tapdisks_in_low_memory_mode RRD looks a bit questionable. The incoming values are just the current number of tapdisks in this mode, an instantaneous value that is not time-dependent. I wonder if this should be a GAUGE instead.

This RRD was introduced in c1e0901.

@minglumlu
Copy link
Member

I wonder if this should be a GAUGE instead.

Indeed. It should be a GAUGE as it's a measurement at a time point.
May need a new metric for it. The rrd may not be happy for changing the type in-place.

@robhoes robhoes added this pull request to the merge queue Sep 3, 2025
Merged via the queue into xapi-project:master with commit 6c16a2f Sep 3, 2025
15 checks passed
@robhoes
Copy link
Member

robhoes commented Sep 3, 2025

In hindsight, perhaps the code that was changed here should wasn't the right place to do it. It seems the aim of that function is to calculate the difference between values for DERIVE datasource, not to do any time-based calculations. There is code further down in rrd.ml:

          (* CA-404597 - Gauge and Absolute values should be passed as-is,
             without being involved in time-based calculations at all.
             This applies to calculations below as well *)
          match ds.ds_ty with
          | Gauge | Absolute ->
              ds.ds_value <- value
          | Derive ->
              ds.ds_value <- ds.ds_value +. (pre_int *. value /. interval)

Is this a better place to fix this by having a separate Absolute case like this?

         | Absolute ->
              ds.ds_value <- value /. interval

There is a similar block even further down, which uses post_int rather thanpre_int for Derive.

@last-genius
Copy link
Contributor Author

In hindsight, perhaps the code that was changed here should wasn't the right place to do it. It seems the aim of that function is to calculate the difference between values for DERIVE datasource, not to do any time-based calculations. There is code further down in rrd.ml:

          (* CA-404597 - Gauge and Absolute values should be passed as-is,
             without being involved in time-based calculations at all.
             This applies to calculations below as well *)
          match ds.ds_ty with
          | Gauge | Absolute ->
              ds.ds_value <- value
          | Derive ->
              ds.ds_value <- ds.ds_value +. (pre_int *. value /. interval)

Is this a better place to fix this by having a separate Absolute case like this?

         | Absolute ->
              ds.ds_value <- value /. interval

There is a similar block even further down, which uses post_int rather thanpre_int for Derive.

That's how I first approached it. But your suggestion does not pass the unit test (note that it modifies ds.ds_value while leaving v2s in-place, and those are used to reset the accumulators below, there might be more intricacies too).
If both this pre_int block and the one (post_int) below are changed, then it passes the unit test, but then this requires identical logic in two places instead of one.

I don't really see any downsides to the current aproach - the function already considers the interval (when comparing it against mrhb), just in a different way. But if you want to, could probably do it the other way too.

@robhoes
Copy link
Member

robhoes commented Sep 4, 2025

For downsides of the current approach, the main thing is that dividing by the interval now happens if different places for Derive and Absolute. It would be nice if all the /. interval happen in the same place to make it obvious what the differences and commonalities between the RRD types are.

Then again, I don't yet really understand the v2s/post_int/pre_int stuff for Derive... I'll take a closer look.

@robhoes
Copy link
Member

robhoes commented Sep 4, 2025

The main thing I don't understand now is why ds_update still treats Derive and Absolute differently. For Derive, the function process_ds_value calculates the diff between the current and previous value. For Absolute, it doesn't need to do that, because the incoming value is effectively already a diff. So when ds_update receives a value from process_ds_value shouldn't it do the same for Derive and Absolute, i.e. account for the elapsed time in the same way (while Gauge values should be left alone)?

E.g.

       match ds.ds_ty with
      | Gauge ->
          ds.ds_value <- value
      | Derive | Absolute ->
          ds.ds_value <- ds.ds_value +. (pre_int *. value /. interval)

...

            match ds.ds_ty with
            | Gauge ->
                ds.ds_value
            | Derive | Absolute ->
                ds.ds_value
                /. (occu_pdp_st -. proc_pdp_st -. ds.ds_unknown_sec)

etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants