Commit d47f370

Switch from serial to concurrent transmission in virtual world
Also fix note on complexity of maintaining mu(i,t).
1 parent 5a2d5c4 commit d47f370

File tree

  • keps/sig-api-machinery/1040-priority-and-fairness

1 file changed: +215 -83 lines

keps/sig-api-machinery/1040-priority-and-fairness/README.md

Lines changed: 215 additions & 83 deletions
@@ -717,9 +717,11 @@ mu(i,t) = min(rho(i,t), mu_fair(t))
 
 where:
 - `i` identifies a queue,
-- `rho(i,t)` is the rate requested by queue `i` at time `t` and is defined to be
-  the product of `mu_single` and the number of packets of that queue
-  that are unsent in the virtual world at time `t`,
+- `rho(i,t)` is the rate requested by queue `i` at time `t` and is
+  defined to be the product of `mu_single` and the number of packets
+  of that queue that are not fully sent in the virtual world at time
+  `t` (those for which `t_arrive(i,j) <= t` and whose transmission
+  completes strictly after `t`), and
 - `mu_fair(t)` is the smallest non-negative quantity that solves the equation
 ```
 min(mu_single, Sum[over i] rho(i,t)) = Sum[over i] min(rho(i,t), mu_fair(t))
@@ -737,13 +739,13 @@ adjusted at this time, among others. In this virtual world a queue's
 packets are divided into three subsets: those that have been
 completely sent, those that are in the process of being sent, and
 those that have not yet started being sent. That number being sent is
-1 unless it is 0 and the queue has no unsent packets. Unlike the
-original fantasy, this virtual world uses the same clock as the real
-world. Whenever a packet finishes being sent in the real world, the
-next packet to be transmitted is the unsent one that will finish being
-sent soonest in the virtual world. If there is a tie among several,
-we pick the one whose queue is next in round-robin order (following
-the queue last picked).
+1 unless it is 0 and the queue has no unsent packets. This virtual
+world uses the same clock as the real world. Whenever a packet
+finishes being sent in the real world, the next packet to be
+transmitted is the one that is unsent in the real world and will
+finish being sent soonest in the virtual world. If there is a tie
+among several, we pick the one whose queue is next in round-robin
+order (following the queue last picked).

 We can define beginning and end times (B and E) for transmission of
 the j'th packet of queue i in the virtual world, with the following
@@ -757,7 +759,7 @@ Integral[from tau=B(i,j) to tau=E(i,j)] mu(i,tau) dtau = len(i,j)
 This has a practical advantage over the original story: the integrals
 are only over the lifetime of a single request's service --- rather
 than over the lifetime of the server. This makes it easier to use
-floating point representations with sufficient precision.
+floating or fixed point representations with sufficient precision.

 Note that computing an E value before it has arrived requires
 predicting the course of `mu(i,t)` from now until E arrives. However,
@@ -787,7 +789,13 @@ S(i,j) = R(B(i,j))
 (R and max commute because both are monotonically non-decreasing).

 Note that `mu_fair(t)` is exactly the same as `dR/dt` in the original
-story. So we can reason as follows.
+story (excepting inconsequential differences at the instants when
+packets complete: `rho` is defined to exclude the packet at that
+instant and `NAQ` is defined to include the packet; the differences
+are inconsequential because all we do with `mu` and `dR/dt` in the
+argument below is integrate them, and a difference in the integrand at
+a countable number of instants makes zero difference to the integral).
+So we can reason as follows.

 ```
 Integral[tau=B(i,j) to tau=E(i,j)] mu(i,tau) dtau = len(i,j)
@@ -847,69 +855,177 @@ Because we now have more possible values for `mu(i,t)` than 0 and
 `mu_fair(t)`, it is more computationally complex to adjust the
 `mu(i,t)` values when a packet arrives or completes virtual service.
 That complexity is:
-- O(n log n), where n is the number of queues,
-  in a straightforward implementation;
-- O(log n) if the queues are kept in a data structure sorted by `rho(i,t)`.
-
-We can keep the same virtual transmission scheduling scheme as in the
-single-link world --- that is, each queue sends one packet at a time
-in the virtual world. We do this even though a queue can have
-multiple packets being sent at a given moment in the real world. This
-has the virtue of keeping the logic relatively simple; we can use the
-same equations for B and E. In a world where different packets can
-have very different lengths, this choice of virtual transmission
-schedule looks dubious. But once we get to the last step below, where
-are talking about serving requests that all have the same guessed
-service duration, this virtual transmission schedule does not look so
-unreasonable. If some day we wish to make request-specific guesses of
-service duration then we can revisit the virtual transmission
-schedule.
-
-However, the greater diversity of `mu(i,t)` values breaks the
-correspondence with the original story. We can still define `R(t) =
-Integral[from tau=start to tau=t] mu_fair(tau) dtau`. However,
-because sometimes some queues get a rate that is less than
-`mu_fair(t)`, it is not necessarily true that `R(E(i,j)) - R(B(i,j)) =
-len(i,j)`. Because all non-empty queues do not necessarily get the
-rate `mu_fair(t)`, the prediction of that affects the dispatching
-choice. This ruins the simple story about how to get logarithmic cost
-for dispatching.
-
-We can partially recover by dividing queues into three classes rather
-than two: empty queues, those for which `rho(i,t) <= mu_fair(t)`, and
-those for which `mu_fair(t) < rho(i,t)`. We can efficiently make a
-dispatching choice from each of the two latter classes of queues,
-under the assumption that `mu_fair(t)` will be henceforth constant,
-and then efficiently choose between those two choices. However, it is
-also necessary to react to a stimulus that modifies `mu_fair(t)` or
-some `rho(i,t)` so that some queues move between classes --- and this
-costs O(m log n), where m is the number of queues moved and n is the
-number of queues in the larger class. The details are as follows.
-
-For queues where `rho(i,t) <= mu_fair(t)` we can keep track of the
-predicted E for the packet virtually being transmitted. As long as
-that queue's `rho` remains less than or equal to `mu_fair`, these E
-predictions do not change. We can keep these queues in a data
-structure sorted by those E predictions.
-
-For queues `i` where `mu_fair(t) <= rho(i,t)` we can keep track of the
-F (that is, `R(E)`) of the packet virtually being transmitted. As
-long as `mu_fair` remains less than or equal to that queue's `rho`,
-that F does not change. We can keep these queues in a data structure
-sorted by those F predictions.
-
-When a stimulus --- that is, packet arrival or virtual completion ---
-changes `mu_fair(t)` or some `rho(i,t)` in such a way that some queues
-move between classes, those queues get removed from their old class
-data structure and added to the new one.
-
-When it comes time to choose a packet to begin transmitting in the
-real world, we start by choosing the best packet from each non-empty
-class of non-empty queues. Supposing that gives us two packets, we
-have to compare the E of one packet with the F of the other. This is
-done by assuming that `mu_fair` will not change henceforth. Finally,
-we may have to break a tie; that is done by round-robin ordering, as
-usual.
+- `O(n log n)`, where n is the number of queues, in a straightforward
+  implementation that sorts the queues by increasing rho and then
+  enumerates them to find the least demanding, if any, that can not
+  get all it wants;
+- `O((1 + n_delta) * log n)` if the queues are kept in a
+  logarithmic-complexity sorted data structure (such as skip-list or
+  red-black tree) ordered by `rho(i,t)`, `n_delta` is the number of
+  queues that enter or leave the relationship `mu(i,t) == rho(i,t)`,
+  and a pointer to that boundary in the sorted data structure is
+  maintained. Note that in a system that stays out of overload,
+  `n_delta` stays zero. The same result obtains while the system
+  stays overloaded by a fixed few queues.
+
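The straightforward `O(n log n)` route in the first bullet can be sketched as a water-filling computation. The following Go sketch is illustrative only: the package and the names `computeMuFair`, `capacity`, and `rhos` are ours, not the KEP's, and `capacity` stands for the total rate available (`mu_single` in the single-link story, or presumably `C * mu_single` once transmission is concurrent).

```go
package fairness

import (
	"math"
	"sort"
)

// computeMuFair returns the smallest non-negative mu_fair satisfying
//   min(capacity, Sum[over i] rho(i)) = Sum[over i] min(rho(i), mu_fair)
// using the sort-and-scan ("water filling") approach from the first bullet.
func computeMuFair(capacity float64, rhos []float64) float64 {
	if len(rhos) == 0 {
		return 0
	}
	sorted := append([]float64(nil), rhos...)
	sort.Float64s(sorted)

	total := 0.0
	for _, r := range sorted {
		total += r
	}
	remaining := math.Min(capacity, total) // left-hand side of the equation

	for i, r := range sorted {
		share := remaining / float64(len(sorted)-i) // equal split of what remains
		if r > share {
			// This is the least demanding queue that cannot get all it
			// wants; it and every more demanding queue are capped here.
			return share
		}
		// This queue gets everything it asks for.
		remaining -= r
	}
	// Every queue gets all it wants; the smallest solution is the largest demand.
	return sorted[len(sorted)-1]
}
```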
+In order to maintain the useful property that transmissions finish in
+the virtual world no sooner than they do in the real world (which is
+good because it means we do not have to revise history in the virtual
+world when a completion comes earlier than expected --- which
+possibility we introduce below) we suppose in the virtual world that
+each queue `i` has its `min(rho(i,t), C)` oldest unsent packets being
+transmitted at time `t`, using equal shares of `mu(i,t)`. The
+following equations define that set of packets (`SAP`), the size of
+that set (`NAP`), and the `rate` at which each of them is being sent.
+
+```
+SAP(i,t) = {j such that B(i,j) <= t < E(i,j)}
+
+NAP(i,t) = |SAP(i,t)|
+
+rate(i,t) = if NAP(i,t) > 0 then mu(i,t) / NAP(i,t) else 0
+```
+
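Those three definitions translate almost directly into per-queue state. A minimal sketch in the same illustrative package as above (the type and field names are ours, not the implementation's):

```go
package fairness

// virtualQueue is an illustrative per-queue view of the SAP/NAP/rate
// equations above; activeEnds holds the virtual end times E(i,j) of the
// packets currently in SAP(i,t).
type virtualQueue struct {
	mu         float64   // mu(i,t): the rate currently granted to this queue
	activeEnds []float64 // one entry per packet in SAP(i,t)
}

// nap returns NAP(i,t), the number of packets being virtually transmitted.
func (q *virtualQueue) nap() int { return len(q.activeEnds) }

// rate returns rate(i,t): the queue's granted rate split equally among its
// active packets, or 0 when nothing is being transmitted.
func (q *virtualQueue) rate() float64 {
	if q.nap() == 0 {
		return 0
	}
	return q.mu / float64(q.nap())
}
```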
+Following is an outline of a proof that `rate(i,t) <= mu_single` ---
+that is, a packet is transmitted no faster in the virtual world than
+in the real world. When `rate(i,t) == 0` we are already done. When
+`rho(i,t) >= C`: `mu(i,t) <= mu_single * C` and `NAP(i,t) = C`, so
+their quotient can not exceed `mu_single`. When `0 < rho(i,t) < C`:
+`mu(i,t) <= mu_single * rho(i,t)` and `NAP(i,t) = rho(i,t)`, whose
+quotient is also thusly limited.
+
+The following equations say when transmissions begin and end in this
+virtual world. To make the logic simple, we assume that each packet
+arrives at a different time (the implementation will run this logic
+with a mutex locked and thus naturally process arrivals serially,
+effectively standing them apart in time even if the clock does not).
+
+```
+B(i,j) = if NAP(i,t_arrive(i,j)) <= C then t_arrive(i,j)
+         else min[k in SAP(i,t_arrive(i,j))] E(i,k)
+
+Integral[from tau=B(i,j) to tau=E(i,j)] rate(i,tau) dtau = len(i,j)
+```
+
+Those equations look dangerously close to circular logic: `B` is
+defined in terms of `SAP`, and `SAP` is defined in terms of `B`. But
+note that the equation for `B` says that the start of transmission for
+a packet (i) can only be delayed because of `C` other packets that
+started transmission earlier (remember, distinct arrival times) and
+have not finished yet and (ii) can only be delayed until the first one
+of those finishes. There is only one choice of `B` for each packet
+that makes all the equations hold.
+
+Note that when C is 1 these equations produce the same begin and end
+times as the single-link design.
+
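Read operationally, the `B` equation says a newly arrived packet joins the active set at once if there is room, and otherwise waits for the earliest virtual completion among the packets already active. A sketch under that reading (the function name is ours, and `activeEnds` is the same illustrative field as above):

```go
package fairness

// beginTime returns B(i,j) for a packet arriving at arriveTime, given the
// virtual end times of the queue's currently active packets and the
// concurrency bound c. Illustrative only; not the implementation's API.
func beginTime(arriveTime float64, activeEnds []float64, c int) float64 {
	if len(activeEnds) < c {
		// Room in the active set: virtual transmission begins on arrival.
		return arriveTime
	}
	// Otherwise the packet waits for the earliest active packet to finish
	// in the virtual world.
	earliest := activeEnds[0]
	for _, e := range activeEnds[1:] {
		if e < earliest {
			earliest = e
		}
	}
	return earliest
}
```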
+As in the single-link case, at any given time we can estimate expected
+end times for packets in progress. These estimates may not be
+accurate, but simple estimates can be defined that nonetheless yield
+the correct ordering among a queue's packets. Furthermore, these
+estimates will correctly identify the next packet to complete among
+all the queues, even though they may say incorrect things about
+subsequent events. That is enough, because the implementation will
+update the estimates every time a packet begins or ends transmission.
+
+To help define these estimates we first define a concept `P(i,t)`, the
+"progress" made by a given queue up to a given time. It might be
+described as the number of bits transmitted serially (that is,
+considering only one link at any given time) since an arbitrary
+queue-specific starting time `epoch(i)`. A given active packet gets
+transmitted at the rate that `P` increases.
+
+```
+P(i,t) = Integral[from tau=epoch(i) to tau=t] rate(i,tau) dtau
+```
+
+We can accumulate `P(i)` in a 64-bit number and only rarely need to
+advance `epoch(i)` in order to prevent overflow or troublesome loss of
+precision. Advancing `epoch(i)` will cost O(number of active
+packets), to make the corresponding updates to the `PEnd` values
+introduced below.
+
+For a given queue `i` and packet `j`, by looking at the `P` value when
+the packet begins transmission and adding the length of the packet, we
+get the `P` value when the packet will finish transmission. By
+focusing on `P` values instead of wall clock time we gain independence
+from the variations in `rate`. This is similar to the use of `R`
+values in the original Fair Queuing scheme.
+
+```
+PEnd(i,j) = P(i, B(i,j)) + len(i,j)
+```
+
+For a given queue `i` at a given time `t` we can write the expected
+end (EE) time of each active packet `j` as the current time plus the
+expected amount of time needed to transmit the bits that have not
+already been transmitted (making the assumption that the current rate
+will continue into the future):
+
+```
+EE(i,j,t) = t + (PEnd(i,j) - P(i,t)) / rate(i,t)
+```
+
+Notice that the remaining time to transmit the packet, `EE(i,j,t)-t`,
+is a function of:
+- a packet-specific quantity (`PEnd`) that does not change over time,
+  and
+- queue-specific quantities (`P`, `rate`) that change over time and
+  are independent of packet.
+
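A minimal sketch of that per-queue bookkeeping (all names are ours; it simply mirrors the `P`, `PEnd`, and `EE` equations, not the real data structures):

```go
package fairness

// progressQueue tracks the progress quantities behind the P / PEnd / EE
// equations above, for one queue.
type progressQueue struct {
	mu       float64 // mu(i,t): rate currently granted to the queue
	active   int     // NAP(i,t): packets being virtually transmitted
	progress float64 // P(i,t): accumulated since the queue's epoch
	lastTick float64 // time at which progress was last brought up to date
}

// rate returns rate(i,t) = mu / NAP, or 0 when the queue is idle.
func (q *progressQueue) rate() float64 {
	if q.active == 0 {
		return 0
	}
	return q.mu / float64(q.active)
}

// advance accumulates P up to now; rate is constant between events, so a
// single multiply suffices.
func (q *progressQueue) advance(now float64) {
	q.progress += q.rate() * (now - q.lastTick)
	q.lastTick = now
}

// pEnd returns PEnd(i,j) = P(i,B(i,j)) + len(i,j) for a packet whose
// virtual transmission begins now.
func (q *progressQueue) pEnd(now, length float64) float64 {
	q.advance(now)
	return q.progress + length
}

// expectedEnd returns EE(i,j,t) = t + (PEnd - P(t)) / rate(t) for an active
// packet; rate is nonzero whenever the queue has active packets.
func (q *progressQueue) expectedEnd(now, pEnd float64) float64 {
	q.advance(now)
	return now + (pEnd-q.progress)/q.rate()
}
```

Because `PEnd` is fixed per packet, a change in `mu` or in the number of active packets touches only the queue-level fields, which is the `O(1)` update property described next.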
+Updating this representation of a queue's expected end times to
+account for the passage of time or a change in `mu` does not require
+modifying the packet-specific data (the `PEnd` values), and thus
+costs `O(1)`. Adding or removing a packet or changing its length (see
+below) does not require changing the packet-specific data of the
+other active packets.
+
+We can keep the active packets of a queue in a logarithmic-complexity
+sorted data structure ordered by expected end time. Adding or
+removing a packet from the active set or changing the packet's length
+will cost O(log(size of the active set)).
+
+We can divide the non-empty queues into two categories and keep each
+in its own data structure. For the queues that get `mu(i,t) ==
+rho(i,t)`, keep them in a logarithmic-complexity sorted data structure
+ordered by the earliest expected end time of the queue's active
+packets. Changes to `mu_fair` do not affect this data structure,
+except to the degree that they cause queues to enter or leave this
+category.
+
+Similarly, we can keep the queues for which `mu(i,t) == mu_fair(t)` in
+another sorted data structure ordered by earliest expected end time.
+Since `mu(i,t)` is the same for all queues in this category, the
+passage of time and changes in `mu_fair` do not change the ordering of
+packets or queues in this data structure, except to the degree that
+queues enter or leave this category. The representation of expected
+end times in this category gets one more level of indirection, through
+that shared `mu_fair`.
+
+When a change in `mu_fair` causes `n_delta` queues to move from one
+category to another, it costs `O(n_delta * log num_queues)` to update
+these data structures by those moves.
+
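Because the queues can also be kept sorted by `rho(i,t)`, the split between the two categories is just a boundary index in that ordering, and `n_delta` is how far the boundary moves when `mu_fair` changes. A tiny illustrative helper, using `sort.Search` from the standard library (the names are ours):

```go
package fairness

import "sort"

// categoryBoundary returns how many queues currently get mu(i,t) == rho(i,t):
// with the demands sorted in ascending order, those are exactly the queues
// whose rho does not exceed mu_fair; queues at or beyond the returned index
// are capped at mu_fair. When mu_fair changes, n_delta is the distance the
// boundary moves, and only those queues change category.
func categoryBoundary(rhosAscending []float64, muFair float64) int {
	return sort.Search(len(rhosAscending), func(i int) bool {
		return rhosAscending[i] > muFair
	})
}
```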
+Updating the data structures for a mere change in one queue's `NAP`
+has logarithmic cost.
+
+The above discussion concerns the virtual world, which transmits each
+packet no more quickly than the real world. Usually a packet will
+finish transmission in the real world before it finishes in the
+virtual world. But it is important to keep each packet in the virtual
+world data structure until it is fully transmitted in the virtual
+world. Yet, our ultimate goal is to select the next packet to
+complete transmission in the virtual world _from among those packets
+that have not yet started transmission in the real world_. To do this
+we maintain, in addition to the full virtual data structures above,
+filtered variants that contain only packets that have not yet started
+transmission in the real world. We use the `mu` and `rho` values from
+the virtual world in the calculations for the packets in the filtered
+data structures. Whenever it is necessary to identify the earliest
+expected end time among all the filtered packets, this can be done
+with O(1) complexity: finding the earliest from each of the two
+categories costs O(1), and finding the earliest of those (at most)
+two also takes O(1).
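When it is time to dispatch in the real world, the choice therefore reduces to comparing the best candidate from each of the two filtered category structures and breaking ties in round-robin order. A sketch of that final comparison (the `candidate` type and these names are ours, not the implementation's):

```go
package fairness

// candidate summarizes the best waiting packet from one of the two filtered
// category structures: the queue it belongs to and the earliest expected
// end time (EE) among that queue's packets not yet started in the real world.
type candidate struct {
	queueIndex  int
	expectedEnd float64
}

// pickNext returns the index (0 or 1) of the winning candidate, or -1 if
// neither category has a waiting packet. lastPicked is the queue index of
// the previously dispatched packet; ties go to the queue that comes next
// after it in round-robin order.
func pickNext(cands [2]*candidate, lastPicked, numQueues int) int {
	best := -1
	for i, c := range cands {
		if c == nil {
			continue
		}
		if best == -1 ||
			c.expectedEnd < cands[best].expectedEnd ||
			(c.expectedEnd == cands[best].expectedEnd &&
				roundRobinDistance(lastPicked, c.queueIndex, numQueues) <
					roundRobinDistance(lastPicked, cands[best].queueIndex, numQueues)) {
			best = i
		}
	}
	return best
}

// roundRobinDistance is how many steps queue q is after lastPicked in cyclic
// order; the smallest positive distance is "next in round-robin order".
func roundRobinDistance(lastPicked, q, numQueues int) int {
	return ((q-lastPicked-1)%numQueues + numQueues) % numQueues
}
```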
 
 ##### From packets to requests
 
@@ -921,18 +1037,34 @@ measured in seconds. The units change: `mu_single` and `mu_i` are no
 longer in bits per second but rather are in service-seconds per
 second; we call this unit "seats" for short. We now say `mu_single`
 is 1 seat. As before: when it is time to dispatch the next request in
-the real world we pick from the queue whose current packet
-transmission would finish soonest in the virtual world, using
-round-robin ordering to break ties.
+the real world we pick from a queue with a request that will complete
+soonest in the virtual world, using round-robin ordering to break
+ties.

 ##### Not knowing service duration up front

 The final change removes the up-front knowledge of the service
-duration of a request. Instead, we use a guess `G`. When a request
-finishes execution in the real world, we learn its actual service
-duration `D`. At this point we adjust the explicitly represented B
-and E (and F, if using those) values of following requests in that
-queue to account for the difference `D - G`.
+duration of a request. Instead, we use a guess `G`. If and when the
+guess turns out to be too short --- that is, its expected end time
+arrives in the virtual world but the request has not finished in the
+real world --- the guess is increased. Remember that the virtual
+world never serves a request faster than the real world, so whenever
+that adjustment is made we are sure that the guess really was too
+short.
+
+Essentially always the (eventually adjusted, as necessary) guess will
+turn out to be too long. When the request finishes execution in the
+real world, we learn its actual service duration `D`. The completion
+in the virtual world is concurrent or in the future, never in the
+past. At this point we adjust the expected end time of the request in
+the virtual world to be based on `D` rather than the guess.
+
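A sketch of those two adjustments, reusing the illustrative progress-based bookkeeping above (the names and the padding policy are ours; the text only says the guess "is increased"):

```go
package fairness

// requestState tracks one request's virtual-world end in progress units.
type requestState struct {
	pBegin float64 // queue progress P when virtual service of this request began
	pEnd   float64 // currently assumed end: pBegin + guess, later pBegin + D
}

// onVirtualDeadline is called when the queue's progress reaches pEnd but the
// request is still executing in the real world: the guess was too short, so
// extend it. Doubling is just one possible policy.
func (r *requestState) onVirtualDeadline() {
	r.pEnd = r.pBegin + 2*(r.pEnd-r.pBegin)
}

// onRealCompletion is called when the request finishes in the real world
// after consuming d seat-seconds; from then on the virtual world uses the
// true duration instead of the guess.
func (r *requestState) onRealCompletion(d float64) {
	r.pEnd = r.pBegin + d
}
```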
+When the request finishes execution in the virtual world --- which by
+this time is an accurate reflection of the true service duration `D`
+--- either another request is dispatched from the same queue or all
+the remaining requests in that queue start getting faster service. In
+both cases, the service delivery in the virtual world has reacted
+properly to the true service duration.

 ### Example Configuration
