[WIP] Experimental ObjectDiffusion v2 (just inbound side changes) for efficient vote retrieval #1698
base: peras-staging
Conversation
Co-authored-by: nbacquey <[email protected]>
> - the inbound peer that submitted the object to the pool might have acked it already at the moment the object is rejected by the pool, but the rejection indicates that the outbound peer which sent us the object is adversarial, and we should disconnect from it anyway. So there is no harm in having acked the object to the adversarial outbound peer, as we won't want to re-download this object from it again (or any other object whatsoever).
> - any other inbound peer that has this ID available from its outbound peer won't be able to ack it, because this ID isn't in **their** `dpsObjectsOwtPool` and is not in the pool either, so we will be able to download it from these other peers until we find a valid one.
>
> As in TxSubmission, acknowledgement is done by indicating to the outbound peer the length of the (longest) prefix of the outstanding FIFO that we no longer care about (i.e. for which all IDs are eligible for acknowledgement by the definition above). The field `dpsOutstandingFifo` on the inbound peer is supposed to mirror exactly the state of the FIFO of the outbound peer, bar eventual discrepancies due to in-flight information.
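The "longest acknowledgeable prefix" rule quoted above can be sketched as follows. This is a simplified model, not the actual implementation: the FIFO is a plain list, and `isEligible` is a stand-in predicate for the real eligibility conditions.

```haskell
-- Simplified model of prefix acknowledgement, as in TxSubmission.
-- The ack sent to the outbound peer is the length of the longest
-- prefix of the outstanding FIFO whose IDs are all eligible;
-- 'isEligible' is a hypothetical stand-in for the real check.
ackCount :: (objId -> Bool) -> [objId] -> Int
ackCount isEligible = length . takeWhile isEligible

-- The acknowledged prefix is then dropped from our mirror of the
-- outbound peer's FIFO, keeping the two FIFOs in sync.
acknowledge :: (objId -> Bool) -> [objId] -> (Int, [objId])
acknowledge isEligible fifo =
  let n = ackCount isEligible fifo
  in (n, drop n fifo)
```

For instance, `acknowledge even [2,4,6,1,2]` yields `(3, [1,2])`: the first ineligible ID blocks acknowledgement of everything behind it, even if later IDs are themselves eligible.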
I probably missed something along the way, but what happens when a peer receives all but the first object it requests from an outbound peer? Can that object be re-requested from the same peer, or do we rely on some other peer sending it successfully and thus allowing us to move ahead in the outstanding FIFO?
The outbound peer is not allowed to reply partially to one of our requests. It could (maybe) reply out of order to different requests we made, but if request A asks for 30 specific objects, the reply to request A must contain exactly those 30 objects; otherwise we disconnect immediately from that peer for a protocol fault.
That makes sense. I had missed that the requested objects all have to be sent in the same reply rather than in individual ones.
I think it is a requirement of the typed-protocols machinery that a single request leads to a single reply, because we count (as a type-level integer) how many requests have been pipelined to know how many replies we can expect.
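The type-level counting mentioned here can be illustrated with a toy model (this is not the actual typed-protocols API): a session only typechecks as finished when every pipelined request has been matched by exactly one reply.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

-- Type-level natural counting outstanding pipelined requests.
data N = Z | S N

-- A toy pipelined session: 'n' is the number of requests
-- not yet matched by a reply.
data Session (n :: N) a where
  Done    :: a -> Session 'Z a                 -- may finish only when balanced
  Request :: Session ('S n) a -> Session n a   -- pipeline one more request
  Collect :: Session n a -> Session ('S n) a   -- collect exactly one reply

-- Runs a session; unbalanced traces are rejected by the type checker.
run :: Session n a -> a
run (Done x)    = x
run (Request k) = run k
run (Collect k) = run k

balanced :: Session 'Z Int
balanced = Request (Collect (Done 42))
-- 'Request (Done 42)' would not typecheck: one request, zero replies.
```

The real framework is considerably richer, but the idea is the same: the protocol state carries the pipeline depth in its type, so partial or extra replies are impossible by construction.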
> When making decisions, we first divide the peers into two groups:
>
> - Those who are currently executing a decision, i.e., those for which the (previous) decision in the decision channel verifies `pdStatus == DecisionBeingActedUpon`. These are further called _frozen peers_.
> - Those who are not currently executing a decision, i.e., those for which the (previous) decision in the decision channel verifies `pdStatus == DecisionUnread || pdStatus == DecisionCompleted`. The former are the ones who haven't had time to read the previous decision yet, so it makes sense to recompute a more up-to-date decision for them. The latter are the ones who have completed executing the previous decision, so it also makes sense to compute a new decision for them. These two categories of peers are further called _active peers_.
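The two groups described above amount to a simple partition on `pdStatus`. A schematic rendering, with illustrative types and names rather than the actual code:

```haskell
import qualified Data.Map.Strict as Map

-- Statuses as described in the quoted documentation.
data DecisionStatus
  = DecisionUnread          -- decision not read yet: worth recomputing
  | DecisionBeingActedUpon  -- decision being executed: peer is "frozen"
  | DecisionCompleted       -- decision done: worth computing a new one
  deriving (Eq, Show)

-- Splits peers into (frozen, active) according to the status of the
-- previous decision in each peer's decision channel.
splitPeers :: Map.Map peer DecisionStatus
           -> (Map.Map peer DecisionStatus, Map.Map peer DecisionStatus)
splitPeers = Map.partition (== DecisionBeingActedUpon)
```

Only the active peers are then considered by the decision thread; the frozen ones keep their current decision until they report completion.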
> The former are the ones who didn't have time to read the previous decision yet, so it makes sense to recompute a more up-to-date decision for them.
Could this somehow lead to starvation? I.e. the peer never leaving the DecisionUnread state because the decision thread keeps recomputing a new one.
Usually, as soon as a peer has completed a decision, it will start again at the top of `go` and ask for a new one. As soon as `psaReadDecision` returns, the decision is marked as `DecisionBeingActedUpon`. And because the decision thread is rate-limited, there is always a window in which the "lock" (I'm not sure exactly how the STM operates to ensure atomic mutations of shared mutable variables) on the decision is not held by the decision thread, allowing the peer to read it and mark it as being acted upon.
The only case where a peer would not ask for a decision immediately after completing one is when it has made a blocking request for IDs, in which case it will block and collect the reply before starting at the top of `go` again. This is the only case I see where the decision logic could update the decision (with status `DecisionUnread`) a significant number of times before the peer asks for it, but in that case the decision should be trivial (i.e. do nothing, because there are no IDs in the FIFO, as this is the precondition for making a blocking request for IDs).
I'm not sure I answered your question actually. Feel free to ask for more details.
Also if you see how the documentation could be improved to make it clearer for future readers, please suggest an edit ;)
Thanks. Yes, I think this answers my question.
Maybe a follow-up question: let's say that all our inbound peers are blocked waiting for IDs. Would this make the decision thread enter a temporary busy-waiting loop where it does nothing but update its decisions over and over, or would it block as well until some condition flag switches state?
Sorry in advance if this is nonsensical :)
Indeed, at the beginning of the protocol, all inbound peers would block waiting for IDs, so the decision loop would keep replacing each trivial decision with a new trivial decision.
It could make sense to pre-filter out the peers that are waiting for IDs, so that they are not considered by the decision thread. Even simpler, actually, would be to only call `psaOnDecisionExecuted` once the IDs have been received in the blocking case, so those peers would be filtered out automatically (considered as "frozen" peers). With this change, there would theoretically be no case where a decision is updated a significant number of times before being read and frozen in.
NOTE: a peer can only block requesting IDs when there are no longer any IDs in its FIFO, so the decision for this peer must be really trivial, as there is nothing we can ask it to do.
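One way to avoid the busy-waiting scenario discussed above is for the decision thread to block in STM until at least one peer is worth deciding for. A sketch, assuming each peer's decision status lives in a `TVar` and that (as suggested) ID-blocked peers also count as frozen; none of these names come from the actual code:

```haskell
import Control.Concurrent.STM

-- Illustrative status type mirroring the documentation's pdStatus values.
data DecisionStatus = DecisionUnread | DecisionBeingActedUpon | DecisionCompleted
  deriving (Eq, Show)

-- Blocks (via 'retry', which 'check' calls on False) until at least one
-- peer is not frozen, instead of looping and rewriting trivial decisions.
-- The transaction is automatically re-run when any read TVar changes.
waitForActivePeer :: [TVar DecisionStatus] -> STM ()
waitForActivePeer statusVars = do
  statuses <- mapM readTVar statusVars
  check (any (/= DecisionBeingActedUpon) statuses)
```

With this in front of the decision computation, the decision thread sleeps while every peer is frozen and wakes up exactly when some status `TVar` flips back to an active state.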
> - They are not already in the pool
> - They are available from the peer, and won't be acked by the peer on its next request for IDs according to the partial decision computed by `computeAck` at the previous step (NOTE: theoretically, this second condition is redundant with other constraints and invariants of the current implementation).
>
> Then we "reverse" this mapping to obtain a map of object IDs to the set of active peers that have the corresponding interesting objects available (according to the criteria above), further called _potential providers_.
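The "reversal" into potential providers is essentially an index inversion. Schematically (an illustrative helper, not the actual code):

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Inverts a peer -> available-interesting-object-IDs mapping into an
-- objectID -> potential-providers mapping, unioning the peer sets of
-- object IDs offered by several peers.
potentialProviders :: (Ord peer, Ord objId)
                   => Map.Map peer (Set.Set objId)
                   -> Map.Map objId (Set.Set peer)
potentialProviders m = Map.fromListWith Set.union
  [ (oid, Set.singleton p)
  | (p, oids) <- Map.toList m
  , oid <- Set.toList oids ]
```

An object available from several peers thus ends up keyed once, with all of its providers collected in one set, which is the shape the subsequent download decisions need.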
Is there any notion of quality of service that one could use to prioritize certain potential providers over others at this point?
We could try to balance the load, e.g. prioritize peers with the fewest objects in flight at a given time.
Given that any invalid object triggers protocol termination for a peer, we can't have a metric of "quality" for the objects provided by an outbound peer: they must be perfect or we disconnect.
It could maybe make sense to use information from the network layer, to see how quickly peers distribute objects and prioritize the fastest providers. But that would probably require some architectural changes; or require that we measure the time taken between a request and its reply ourselves, at the application layer, which might be inconvenient/inaccurate. I really don't know at this point.
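The load-balancing idea (prefer providers with the fewest objects in flight) could be as simple as a stable sort on the in-flight count; `inFlight` and `rankProviders` are hypothetical names for this sketch:

```haskell
import Data.List (sortOn)
import qualified Data.Map.Strict as Map

-- Orders candidate providers by how many objects are currently in
-- flight from each, fewest first; peers absent from the map count as 0.
-- 'sortOn' is stable, so equally loaded peers keep their relative order.
rankProviders :: Ord peer
              => Map.Map peer Int  -- objects in flight per peer
              -> [peer]            -- candidate providers for an object
              -> [peer]            -- preferred download order
rankProviders inFlight = sortOn (\p -> Map.findWithDefault 0 p inFlight)
```

Anything fancier (latency-based ranking, network-layer hints) would, as noted above, need measurements this layer doesn't currently have.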
Probably this should be part of the optimization effort for voting logic, which is not due for several months.