
refactor: align percentile semantics across metrics#21

Merged
bassosimone merged 1 commit into main from fix/percentiles on Nov 20, 2025
Conversation


@bassosimone bassosimone commented Nov 17, 2025

On page 19 of the [IQB report 2025](https://www.measurementlab.net/publications/IQB_report_2025.pdf) we read:

> IQB uses the 95th percentile of a given
> dataset to evaluate a given metric. In this
> context, the 95th percentile is the value
> below which 95% of the observed
> measurements fall, which effectively
> captures the upper bound of a typical user
> experience while excluding extreme
> outliers. For example, to assess whether a
> region meets the network tier’s packet loss
> criteria for high-quality gaming, IQB
> calculates the 95th percentile of packet
> loss measurements collected from users in
> that region. The value is then compared to
> the predefined threshold.

The formula example (page 33) shows taking the 95p of
metrics without consideration of polarity.

However, to produce consistent results, we need to take
the polarity into account when taking the 95p.

Let us illustrate the polarity issue with an example. We
assume that the following holds in a specific ISP:

(i) the 95p of latency being 10 ms means that 95% of
samples have 10 ms or less

(ii) the 95p of download speed being 22 Mbit/s means that
95% of samples have 22 Mbit/s or less

Let us also assume that we're evaluating online gaming and
that online gaming typically needs:

(a) download speed >= 20 Mbit/s

(b) latency <= 15 ms

The latency percentile allows us to say that "most
samples" (95%) in the given ISP show a latency (10 ms
or better) lower than the required latency (15 ms).

The speed percentile allows us to say that "few samples"
(5%) in the given ISP show a download speed (22 Mbit/s
or better) greater than the required one (20 Mbit/s).

So, is online gaming possible with this ISP? The answer
seems inconclusive because of the imbalance in the
samples we are considering and the different polarity
between latency (higher is worse) and speed (higher
is better).

On paper, a better solution is to say "okay, for the
speed, instead, we consider the 5p". Now, let us assume:

(iii) the 5p of download speed being 21 Mbit/s means that
5% of samples have 21 Mbit/s or less

Based on this we can say that 95% of users have 21 Mbit/s
or more. Now, it's possible to write a statement regarding
the download speed concerning "most samples".

This allows us to conclude "most samples indicate that
users can play online with this ISP".
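To make the arithmetic above concrete, here is a small self-contained sketch; the sample distributions are invented to mirror the example and are not real ISP data:

```python
# Illustration of the polarity issue, with invented sample distributions
# chosen to mirror the example above (not real ISP measurements).
import random

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, min(len(ordered), round(p / 100 * len(ordered))))
    return ordered[rank - 1]

random.seed(0)
latency_ms = [random.gauss(8, 1) for _ in range(1000)]      # mostly below 15 ms
speed_mbits = [random.gauss(24, 1.5) for _ in range(1000)]  # mostly above 20 Mbit/s

lat_p95 = percentile(latency_ms, 95)   # 95% of samples are at or below this
spd_p95 = percentile(speed_mbits, 95)  # only bounds the *best* 5% of samples
spd_p5 = percentile(speed_mbits, 5)    # 95% of samples are at or above this

# With aligned polarity, both checks are statements about "most samples":
print(lat_p95 <= 15)  # True: most samples meet the latency requirement
print(spd_p5 >= 20)   # True: most samples meet the speed requirement
```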

How to translate this into the actual code?

**Approach I**: modify `cache.py` so that, when we request
`percentile=X`, we actually take the complementary percentile
for latency and loss (or for download and upload speed).

This change aligns the polarity and allows us to answer
questions using uniform sample sizes. It is basically
equivalent to what we manually did above.
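A minimal sketch of Approach I follows; the function name, metric names, and the choice of which metric family gets flipped are hypothetical, not the actual `cache.py` API:

```python
# Hypothetical sketch of Approach I: flip the requested percentile to its
# complement for one metric family so that polarities line up at lookup
# time. Which family gets flipped is a convention; here we flip the speeds.
HIGHER_IS_BETTER = {"download_speed", "upload_speed"}

def effective_percentile(metric: str, requested: int) -> int:
    """Return the percentile actually looked up for a requested one."""
    if metric in HIGHER_IS_BETTER:
        return 100 - requested  # e.g., a request for 95 reads the stored 5p
    return requested

print(effective_percentile("latency", 95))         # 95
print(effective_percentile("download_speed", 95))  # 5
```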

**Approach II**: swap the labels for latency and loss (or for
download speed and upload speed) when querying BigQuery.

This means `cache.py` uniformly accesses `percentile=X` with
the understanding that 2/4 of the labels are swapped.

The second approach is more robust because it guarantees that,
if `percentile=X` is there, then the complementary percentile
is there as well. In both cases, people reading the code will
need to be aware of the polarity anyway.

Based on a discussion with @sermpezis, I am going to swap the
labels for latency and packet loss. The actual swapping operation
is anyway irrelevant (it's mostly a matter of convention) and
what matters is that we're aligning the polarity.
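Under the chosen convention, the label swap could be sketched as follows; the names are illustrative only (the real swap happens when querying BigQuery), not the actual code:

```python
# Hypothetical sketch of Approach II with the chosen convention: percentiles
# computed for lower-is-better metrics (latency, packet loss) are stored
# under the complementary label, so label P uniformly means "P% of users
# have worse-or-equal performance". Names do not mirror the real code.
LOWER_IS_BETTER = {"latency", "packet_loss"}

def stored_label(metric: str, computed_percentile: int) -> int:
    """Label under which a computed percentile gets stored."""
    if metric in LOWER_IS_BETTER:
        return 100 - computed_percentile  # computed 5p is stored as label 95
    return computed_percentile            # speeds keep their original labels

print(stored_label("latency", 5))          # 95: 95% of users see higher latency
print(stored_label("download_speed", 95))  # 95: 95% of users see lower speed
```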

The meaning of the approach we are choosing is that the 95p is
the cutoff where 95% of users have worse performance, for the
definition of "worse" implied by the metric (e.g., lower speed
or higher latency). Obviously, the opposite also holds: 5% of
users have better performance.

In conclusion, the implemented change aligns the sample size so
that the same percentile label picked up by `cache.py` allows
us to make comparable statements with respect to better/worse.
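With aligned labels, downstream checks can treat every metric uniformly. A hypothetical sketch of such a check, using the thresholds from the gaming example above (names are illustrative, not the actual code):

```python
# Hypothetical sketch: once polarity is aligned, the label-95 cutoff of every
# metric is a value that 95% of users fail to beat, and a use-case check is
# a uniform comparison against each threshold. Names are illustrative only.
GAMING_REQUIREMENTS = {
    "download_speed": (20.0, True),  # Mbit/s, higher is better
    "latency": (15.0, False),        # ms, lower is better
}

def gaming_ok(cutoffs: dict[str, float]) -> bool:
    """True if every metric's label-95 cutoff satisfies its threshold."""
    for metric, (threshold, higher_is_better) in GAMING_REQUIREMENTS.items():
        value = cutoffs[metric]
        if not (value >= threshold if higher_is_better else value <= threshold):
            return False
    return True

print(gaming_ok({"download_speed": 22.0, "latency": 10.0}))  # True
print(gaming_ok({"download_speed": 18.0, "latency": 10.0}))  # False
```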


@sermpezis sermpezis left a comment


The reasoning you used to implement the changes makes sense.
However, there is one misunderstanding of the IQB formula, which led to incorrect implementation.

Your two main changes are:
(1) Reverse the percentiles in the raw data (i.e., 5th instead of 95th) for "higher is better" metrics (e.g., speed)
(2) Invert at data generation, not run time.

For the change (2), I agree that it makes sense for the reasons you mention.

For the change (1) I see two points:

  • This "custom definition" may create some confusion for users/developers. However, I'm fine with this, and we can mitigate it with a few examples in the documentation (and the specs, as you already did).
  • The main point is that you used the wrong percentiles for the metrics. At the beginning of your comment you write IQB needs to answer: "Can 95% of users perform this use case?". But this is not true. From the [IQB report](https://www.measurementlab.net/publications/IQB_report_2025.pdf), pages 19-20, the goal of the IQB is:
    IQB uses the 95th percentile of a given dataset to evaluate a given metric. In this context, the 95th percentile is the value below which 95% of the observed measurements fall, which effectively captures the upper bound of a typical user experience while excluding extreme outliers
    which actually means that we need to check: Can 5% of users perform this use case?

This would require the following changes:

  • [Change 1] For "higher is better" metrics (e.g., speed) use the original percentiles
  • [Change 2] For "lower is better" metrics (e.g., latency) keep the reverse percentiles

Moreover, the following change:

  • [Change 3] I would keep the True/False instead of the True/None assignment of values (again following the reasoning of the report; e.g., see Appendix II).

@bassosimone bassosimone changed the title from "fix: invert speed percentile labels" to "refactor: align percentile semantics across metrics" on Nov 19, 2025
@bassosimone
Collaborator Author

> The reasoning you used to implement the changes makes sense. However, there is one misunderstanding of the IQB formula, which led to incorrect implementation.
>
> Your two main changes are: (1) Reverse the percentiles in the raw data (i.e., 5th instead of 95th) for "higher is better" metrics (e.g., speed) (2) Invert at data generation, not run time.
>
> For the change (2), I agree that it makes sense for the reasons you mention.

ACK

> For the change (1) I see two points:
>
> * This "custom definition" may create some confusion for users/developers. However, I'm fine with this, and we can mitigate it with a few examples in the documentation (and the specs, as you already did).

Agreed.

> * The main point is that you used the wrong percentiles for the metrics. At the beginning of your comment you write `IQB needs to answer: "Can 95% of users perform this use case?"`. But this is not true. From the [IQB report](https://www.measurementlab.net/publications/IQB_report_2025.pdf), pages 19-20, the goal of the IQB is:
>   `IQB uses the 95th percentile of a given dataset to evaluate a given metric. In this context, the 95th percentile is the value below which 95% of the observed measurements fall, which effectively captures the upper bound of a typical user experience while excluding extreme outliers`
>   which actually means that we need to check: _Can **5%** of users perform this use case?_

I am intuitively surprised that we're making statements about just 5% of users. However, we're going to run an extensive analysis of the correct percentiles to use and of the sensitivity, so those seem like questions for later.

> This would require the following changes:
>
> * **_[Change 1]_** For "higher is better" metrics (e.g., speed) use the original percentiles
>
> * **_[Change 2]_** For "lower is better" metrics (e.g., latency) keep the reverse percentiles

Yup.

> Moreover, the following change:
>
> * **_[Change 3]_** I would keep the True/False instead of the True/None assignment of values (again following the reasoning of the report; e.g., see Appendix II).

Yes, this was not a change in the code, rather pseudocode to illustrate my point. However, since this seems to indicate that the commit message was confusing, I have reworded the commit message to avoid confusion.

@bassosimone bassosimone dismissed sermpezis’s stale review November 19, 2025 15:18

I have modified the code as you requested. Please take another look. Thank you!

@bassosimone bassosimone merged commit 4060464 into main Nov 20, 2025
4 checks passed
@bassosimone bassosimone deleted the fix/percentiles branch November 20, 2025 11:51
