-
Notifications
You must be signed in to change notification settings - Fork 44
Saving websocket RTT samples #234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
darkk
wants to merge
7
commits into
m-lab:main
Choose a base branch
from
censoredplanet:l7-ping-pt2
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 2 commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
9be4dcf
ndt7: measure and report L7 RTT based on websocket ping
darkk dc0cdbf
ndt: add websocket-based ping to the spec
darkk 1ac78b3
Rename WSInfo to WSPingInfo
darkk da9614c
ndt7: wrap websocket ping into json to protect against unsolicited po…
darkk 1de3f7a
ndt7: implement /ping handler as a self-contained module
darkk 1c72e32
ndt7: add comments on the initial ping frame
darkk 6f3810e
Fix typo
darkk File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| /* jshint esversion: 6, asi: true, worker: true */ | ||
| // WebWorker that runs the ndt7 ping test | ||
| onmessage = function (ev) { | ||
| 'use strict' | ||
| let url = new URL(ev.data.href) | ||
| url.protocol = (url.protocol === 'https:') ? 'wss:' : 'ws:' | ||
| url.pathname = '/ndt/v7/ping' | ||
| const sock = new WebSocket(url.toString(), 'net.measurementlab.ndt.v7') | ||
| sock.onclose = function () { | ||
| postMessage(null) | ||
| } | ||
| sock.onopen = function () { | ||
| sock.onmessage = function (ev) { | ||
| if (!(ev.data instanceof Blob)) { | ||
| let m = JSON.parse(ev.data) | ||
| m.Origin = 'server' | ||
| m.Test = 'ping' | ||
| postMessage(m) | ||
| } | ||
| } | ||
| } | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| package model | ||
|
|
||
| // WSInfo contains an application level (websocket) ping measurement data. | ||
| // It may be melded into AppInfo. | ||
| // FIXME: describe this structure is in the ndt7 specification. | ||
| type WSInfo struct { | ||
| ElapsedTime int64 | ||
| LastRTT int64 // TCPInfo.RTT is smoothed RTT, LastRTT is just a sample. | ||
| MinRTT int64 | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused by the fact that a new channel has been introduced. Isn't it possible to use the
measurerchfor passing around information? I understood that the PING is anothernullable pointer within a Measurement.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
download.sender.loopgoroutine may be naturally blocked onconn.WritePreparedMessage, so there is a risk thatsend(measurerch)will blockreceiver.loopgoroutine. Blockedmeasurer.loopis good, it'll just skip a few ticks. Blockedreceiver.loopleads to skewed RTT measurements and I'd rather avoid that.There are several possible options to overcome that.
Current implementation creates buffered channel for RTT measurements with a buffer size equal of maximum possible amount of pings issued during the test. It was good enough for PoC, but I consider it a wasteful behavior optimized for the worst case.
I'm refactoring the code to send less
WSPingInfomessages to the client. Download sub-test will send only the first message back to the client via a buffered channel with small buffer and tail-drop other messages if the channel is not writable due to sender being blocked. TheWSPingInfomessages would be still logged, and I assume that writing to the logging channel does not naturally block for an unpredictable amount of time (unlikeconn.WritePreparedMessageindownload.sender.loop). I think, loggingWSPingInfoinClientMeasurementssection is okay. Anyway, theelapsedvalue in ping/pong frame is not signed, so client can fake it, so putting that toClientMeasurementsmakes sense to me :-)BTW, should ping be signed or formatted as
{ndt7: $value}as a safety net against a client sending unsolicited pongs with some numerical value? Clients can send unsolicited pongs for heartbeat purposes per RFC.It may make sense to apply the same logic to the measurements coming from
measurer.loop. Currently the messages are logged if and only ifdstis available for writing (sender is not blocked). Maybe it makes sense to split that channel into two: one channel coming to the log (that blocks on "disk") and another coming to the client (that may tail-drop messages if the client is not fast enough to fetchWritePreparedMessageblob). What do you think?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, slow client blocking TCPInfo logging seems to be a real-world case, not a hypothetical situation. I've checked my few ndt7 data samples, some of them have a gap as large as 1.2s between TCPInfo.ElapsedTime samples (that's twice as large as
spec.MaxPoissonSamplingIntervalbeing 625ms). Those 11 files with 321 TCPInfo samples have 15 TCPInfo samples being more than 650ms apart.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you talking about client side traces or server side traces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That you don't get measurements on the client side because of head of line blocking is a know issue that I think we cannot address. However, the channel has buffer, so the measurements were actually saved on disk when I checked. The improvement that I foresee here is actually separating saving measurements from sending measurements to client. In the second case, I think it is reasonable to consider sending the client the most recent sample not the most ancient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought that I have is that I'd try to not change the channel structure in this PR but rather consider changing the channel structure later. The code is using a pipeline pattern and channels are closed to signal EOF. However, there are better patterns where there are multiple workers per stage and where contexts are used rather than closing channels. I have a sense that the right solution here is to measure and reorganise channels to have more parallelism when needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I see, the created channels do not have buffers, all
make(chan...)calls underndt7/do not specify any non-zero capacity. Do you mean golang channels here?I was worried about the following backpressure situation:
tor)download.sender.loopgoroutine blocks onconn.WritePreparedMessagefor some time (torcase gives me delays like 8500ms)download.sender.loopdoes not fetch messages fromm, ok := <-srcas the goroutine is blockedmeasurer.loopgoroutine blocks on sending todstas the channel is not drained and has no buffermeasurer.loopsamples TCP_INFO and BRR_INFO possibly less often thatMaxPoissonSamplingIntervalas it's blocked on sending todstand skips some ofticker.CticksNote, I assume that
ndt7has a (weak) guarantee to take TCPInfo samples at least as often asMaxPoissonSamplingInterval. That's just my assumption and it may be wrong.This backpressure case worries me as it means that
receiver.loopbeing blocked on writing RTT estimate to the channel drained bysendermay provide wrong RTT estimate for the next sample in a queue.receivershould be mostly blocked onconn.ReadMessageto provide reasonably accurate data.The orignal goal to introduce
pongchwas to mitigate that backpressure with quite a deep buffer. Bug I agree that this backpressure issue should be out of the scope of this pull request.Back to answering your original question. My goal is go have some of
WSPingInfosamples being sent back to the ndt7 client during the ping test runtime. Themasterbranch has the channel passing both the messages and the EOF signal frommeasurer.loopto${testname}.sender.loop. Having two concurrent writers (measurer.loopandreceiver.loop) sending to the channel that is eventually closed(!) will lead to goroutine panic (on an attempt to send to closed channel).So, I'm still going to use
pongchas a channel fromreceiver.loopto${testname}.sender.loopto keep the "pipeline + EOF" pattern.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR I acknowledge the issue you report (did see it myself). I will not accept a patch with pongch. I explain how I’d change the architecture to avoid losing measurements. I am not asking you to implement that, however. I am just asking you to kindly focus here on a simple patch that implements the /ping you need and does not intersect with existing subtests.
I was wrong. I used a buffer in a development branch then. I must have concluded that was not the right fix, which is to change the architecture.
That pattern is wrong. A pattern where an expired ctx tells you a goroutine to stop is right. We’ll fix that.
Your attitude is wrong. The right attitude is to listen. We are on the same page on more than 99% of the topics here.
Let me summarize my position: thank your for exposing a bunch of architectural issues. I am not going to accept a patch that complicates significantly the code base. Please focus on writing a /ping endpoint that does its job and does not intersect with the implementation of other experiments.
That patch, I will expedite its review. I am not willing to spend further time to argue on the merits of adding complexity to the code base when this can be avoided. That is 無駄無駄無駄無駄.
Your analysis is right. Your remediation is wrong. Thank your for spelling out this issue so clearly. As I mentioned to you (I guess I’m private?) I did see the same for a 2G connection.
This is right or wrong depending on the meaning of take. It’s right if take also implies save. It’s wrong if take also imply sending the data to the client. We send data to the client on a best effort basis but we should be doing our best to collect and save samples.
(On this note: M-Lab has a sidecar service that samples TCPInfo at high frequency. The data is collected separately and can be joined after submission. We are still collecting BBR and TCPInfo here, why? The original design was to stop early when BBR bandwidth estimation did become stable, but we have not implemented this yet. Anyway, this should help you understand why dealing with this issue has priority medium and not high.)
The right fix that significantly mitigates all that have been said so far is to change the code architecture as follows:
the measurer is a class and stores measurements in itself in a lock free way
the channels become all non blocking when writing
LIFO is better than FIFO when passing measurements downstream
we use a WaitGroup to wait for all the goroutines to complete
then we’re again single threaded and we just save on disk accessing directly what was measured
This is the right architectural fix for the current situation.
We already agreed that your use case is best served by a separate subtest called ping. We already agreed that we also need a single sample at the beginning of the /download. Everything else seems irrelevant to the objective, hence wrong.
If adding support for WebSocket level ping to download and upload is such a burden with the current architecture, the right thing to do is to open an issue, just take the first sample of the connection, and get rid of the remaining code. A future code architecture may accommodate this feature with less effort. This is something that would tell us that the architecture is right. Now it is clearly wrong because it does not allow us to do that easily.
Instead, please focus on writing /ping to fulfill your goals. Please do so without complicating the implementation of existing endpoints in exchange for seriously marginal gains.
✌️
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(writing that down, so the essence is not lost in a log of a private chat)
I've refactored the code under following rules:
measurerandsaverintactchan <- valueMAY happen in some case, but this case is an "erratic" code path, not a "usual" one.saver, depends on EOF)Yeah, I'm sorry, that's an (unlucky) trait I have indeed! Thanks (no kidding) for nudging me towards awareness, better focus and simplicity.
I've also completely forgotten about that. That's why I overestimated the utility of TCPInfo samples being logged.