-
Notifications
You must be signed in to change notification settings - Fork 292
CP-308863: Count vGPU migrations #6640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
ocaml/xapi/xapi_stats.ml
Outdated
let vgpu_migration_count_ds = | ||
( Rrd.Host | ||
, Ds.ds_make ~name:"pool_vgpu_migration_count" | ||
~description:"Number of vGPU migrations occurred since last update" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this since last xapi start?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's reset on every iteration:
+ Atomic.exchange pool_vgpu_migration_count 0 |> Int64.of_int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's reset on every update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wording confused me because I thought it was since the last time the host's packages were updated.
Since this is a rate (both absolute and derive are), "since the last update" should be dropped.
- Absolute: meaning that the incoming data is an absolute rate
- Derive: meaning that the rate must come from the difference between the
incoming data and the previous value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so note that "since last rrd update" is visible to users and they might not know what an rrd update is. In the ned this metric is a rate, and users don't care how it's calculated: the counter is reset every 5 seconds, or a difference is taken between the current measurement and the last one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the "unit" of this metric is something like "number of vGPU migrations per 5 seconds"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the number of migrations per second, like network bandwidth used. If we want a time-independent measure a gauge should be used, like in memory.
I do think a rate is OK to be used here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK cool, then we should reflect that in the description by adding "per second"? And ~units:"count"
should be "migrations/s"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For comparison:
, ds_make ~name:(key_format "iops_total")
~description:"I/O Requests per second"
~value:(Rrd.VT_Int64 (Int64.add value.iops_read value.iops_write))
~ty:Rrd.Absolute ~units:"requests/s" ~min:0. ()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, yes
ocaml/xapi/message_forwarding.ml
Outdated
@@ -2501,6 +2501,10 @@ functor | |||
let snapshot = Db.VM.get_record ~__context ~self:vm in | |||
reserve_memory_for_vm ~__context ~vm ~host ~snapshot | |||
~host_op:`vm_migrate (fun () -> | |||
Db.VM.get_VGPUs ~__context ~self:vm | |||
|> List.iter (fun _ -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a VM has multiple vGPUs, should we not still count it as a single vGPU migration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Counting as a single vGPU migration makes the meaning simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit should be changed to migrations/s
If we want a rate, does it really make sense to pick migrations/s? Why not scale it to migrations/h or /day? Or would this be solved at a different layer? |
But it's actually "migrations/5 secs". It would be complicated if an interval is recorded for the data in tracking. Or we just assume it is always 5 seconds, and divide it by 5 seconds before update it? I really don't want to expose the "5" seconds. I updated it with a stop watch to make it a rate (per second) now.
Yeah, this will be solved at a different layer. The data can be consolidated as migrations/day. |
eb5b3bc
to
4a91d30
Compare
I assumed that xcp-rrdd divides by the time span of ~5s itself. So if you use an Absolute RRD type, it takes a value of unit x and it always becomes x/s. Is that right @psafont? |
AFAIU, xcp-rrdd divides the counter by 5, when calculating the value, as part of the rrd library |
Hmm. I can't find the dividing in rrd lib. Am I missing something? I think it should be in process_ds_value https://github.com/xapi-project/xen-api/blob/master/ocaml/libs/xapi-rrd/lib/rrd.ml#L348. And the testing is showing that the input being divided is indeed as expected. |
That sounds like an issue that might have been introduced by inhttps://github.com/xapi-project/xen-api/commit/73ca3cca49fd604a2cd408c332078535f0f694fe note how on the first chunknof code changed the behaviour of absolute is not changed, but in the second one, it's not treated as a rate anymore @last-genius could you confirm the correct behaviour? |
That does seem to be the case,
But that was not the behaviour prior to 73ca3cc either, the following unit test (both on master and before 73ca3cc) has this output:
let absolute_rrd_CA_404597 () =
let rra = rra_create CF_Average 100 1 0.5 in
let ts = 1000000000.0 in
let ds = ds_create "foo" Absolute (VT_Float 0.0) in
let rrd = rrd_create [|ds|] [|rra|] 1L ts in
for i = 1 to 100000 do
let t = (300. *. float_of_int i) in
let ((_, val1) as v1) =
(0, {value= VT_Float (300.*.float_of_int i); transform= Identity})
in
ds_update rrd t [|v1|] false ;
Printf.printf "calculated %f: submitted %f\n" rrd.rrd_dss.(0).ds_value
(float_of_string (ds_value_to_string val1.value));
compare_float __LOC__ (float_of_string (ds_value_to_string val1.value))
rrd.rrd_dss.(0).ds_value
done So I'm completely confused.... |
The unit test output above is the same on v24.29.0, which is before any of the recent refactoring of RRD |
This patch makes an updated unit test pass: diff --git a/ocaml/libs/xapi-rrd/lib/rrd.ml b/ocaml/libs/xapi-rrd/lib/rrd.ml
index bb516ea6a..414e2cd63 100644
--- a/ocaml/libs/xapi-rrd/lib/rrd.ml
+++ b/ocaml/libs/xapi-rrd/lib/rrd.ml
@@ -361,7 +361,9 @@ let process_ds_value ds value interval new_rrd =
let rate =
match (ds.ds_ty, new_rrd) with
- | Absolute, _ | Derive, true | Gauge, _ ->
+ | Absolute, _ ->
+ value_raw /. interval
+ | Derive, true | Gauge, _ ->
value_raw
| Derive, false -> (
match (ds.ds_last, value) with test, had to adjust let absolute_rrd_CA_404597 () =
let rra = rra_create CF_Average 100 1 0.5 in
let ts = 0.0 in
let ds =
ds_create "foo" Absolute ~mrhb:1000.0 ~min:0. ~max:infinity (VT_Float 0.0)
in
let rrd = rrd_create [|ds|] [|rra|] 1L ts in
let id = Identity in
for i = 1 to 100000 do
let t = 300. *. float_of_int i in
let ((_, val1) as v1) =
(0, {value= VT_Float (300. *. float_of_int i); transform= id})
in
ds_update rrd t [|v1|] false ;
compare_float __LOC__
(float_of_string (ds_value_to_string val1.value) /. 300.)
rrd.rrd_dss.(0).ds_value
done |
Your patch @last-genius is independent of this new metric - is that right? So maybe we should merge it from it's own PR? |
certainly, but it would change the behaviour of the new metric (well, fixing it to work as intended). I'll open a PR |
Is |
indeed |
4a91d30
to
642b5c3
Compare
Absolute metric should work as follows: Timestamp = 300, 600, 900, 1200 Step = 300 seconds ABSOLUTE DS = 1, 2, 3, 4 But they do not seem to have ever worked correctly - they were previously divided by the interval but incorrectly handled NaNs, resulting in wrong behaviour. Then the refactoring (including 73ca3cc ("CA-404597: rrd - Pass Gauge and Absolute data source values as-is")) broke them further. Divide absolute metrics by the interval in process_ds_value. Fix the unit test to expect the right behaviour - it passes with this fix. --- This means absolute metrics are now correctly calculated as unit/s, changing the behaviour in #6640 I am not 100% sure this fix goes all the way either - unit tests should cover expected values in RRAs as well, and the logic is still really confusing to me.
This metric represents the rate of VM migrations with vGPUs per second. The total count can be calculated as AVERAGE * seconds. For example, for one-day data granularity, the total count for one day is AVERAGE * 86400, in which 86400 is the seconds of one day. Signed-off-by: Ming Lu <[email protected]>
642b5c3
to
b657c62
Compare
Hi @psafont
|
This metric represents the rate of VM migrations with vGPUs per second.
The total count can be calculated as AVERAGE * seconds. For example,
for one-day data granularity, the total count for one day is
AVERAGE * 86400, in which 86400 is the seconds of one day.