fix: Disable partial download if file being downloaded is modified between retries #3941
Conversation
    let mut response = self.request_range_from(url, offset).await?;
    if was_resource_modified(&response, &prev_response) {
There is nothing wrong with this impl. But, instead of this manual check and adjusting the start position, using the IfRange header along with the ETag seems like the "accepted convention" to resume a partial download. Avoids that additional HTTP call as well (not that it really matters in the context of a file download).
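To make the suggestion concrete, here is a minimal sketch of what an `If-Range` based resume could look like, reduced to plain header construction (function name and the `HashMap` representation are illustrative, not the PR's actual code). With `If-Range`, the server itself decides: if the ETag still matches it honours the `Range` and returns 206, otherwise it returns 200 with the full body, so no separate modification check is needed.

```rust
use std::collections::HashMap;

// Hedged sketch: build headers for a resumable range request using the
// If-Range conditional, as suggested in the review. If the validator
// (the ETag) no longer matches on the server, the Range is ignored and
// the full resource is returned instead.
fn range_request_headers(etag: Option<&str>, offset: u64) -> HashMap<String, String> {
    let mut headers = HashMap::new();
    // Request everything from `offset` to the end of the resource.
    headers.insert("Range".to_string(), format!("bytes={offset}-"));
    if let Some(etag) = etag {
        // Conditional: only honour the Range if the ETag is unchanged.
        headers.insert("If-Range".to_string(), etag.to_string());
    }
    headers
}
```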
IfRange indeed is something that I wanted to add as well, but in a follow-up PR.
    (None, None) => {
        // no etags in either request, assume resource is unchanged
        false
    }
It would be safer to assume the contrary. But, I also understand this choice: if the source doesn't even try to provide support to detect changes, then why bother?
Yeah, particularly with downloading files from Cumulocity where ETag is not given, I didn't want to disable partial downloads there, as people might be surprised why it stopped working... But in cases where ETag is actually available we shouldn't ignore it.
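The trade-off discussed above can be sketched as a small comparison function (names and signature are illustrative, not the PR's actual implementation): only matching ETags on both responses prove the resource is unchanged, absent ETags on both sides are treated optimistically, and an ETag appearing or disappearing is treated as a modification.

```rust
// Hedged sketch of the modification check discussed above: compare the
// ETag values between the previous and the current response.
fn was_resource_modified(prev_etag: Option<&str>, curr_etag: Option<&str>) -> bool {
    match (prev_etag, curr_etag) {
        // Both responses carry an ETag: the resource changed iff they differ.
        (Some(prev), Some(curr)) => prev != curr,
        // No ETag on either response: the server offers no way to detect
        // changes, so assume unchanged rather than disabling partial
        // downloads (e.g. Cumulocity Inventory Binaries).
        (None, None) => false,
        // An ETag appeared or disappeared between requests: safest to
        // treat the resource as modified.
        _ => true,
    }
}
```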
albinsuresh left a comment
Approving the impl though there are a few minor things to fix.
There was one additional issue: if the file was modified but the server returned a full response anyway, we would discard the response and retry instead of downloading the new content.
albinsuresh left a comment
Re-confirming my approval for the updated logic.
    use reqwest::StatusCode;
    use reqwest::{header, Response};

    pub(super) enum PartialResponse {
Optional:

    - pub(super) enum PartialResponse {
    + pub(super) enum RangeResponse {
IMO better to keep the name PartialResponse as it's related to the Partial Content status code.
    StatusCode::PARTIAL_CONTENT => {
        if was_resource_modified(response, prev_response) {
            return Ok(PartialResponse::ResourceModified);
At first, I thought this inefficiency (where the already fetched partial content has to be discarded after the Etag check) could be completely avoided with the usage of If-Range header. But I just realised that a server supporting ETag doesn't guarantee that it supports If-Range as well. So, we'll need this to handle that corner case. But yeah, for the servers that support it, this path would be skipped.
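The corner case above can be summarised in a small classification sketch (the enum mirrors the `PartialResponse` type in the diff, but variant payloads and the `classify` helper are illustrative): even on a 206, the ETag must still be checked, because a server that supports ETag is not guaranteed to support `If-Range`; on a 200 the full body was returned and writing restarts from offset zero.

```rust
// Hedged sketch of the status-code handling described above.
// Variants and names are illustrative, not the PR's exact definitions.
#[allow(dead_code)]
enum PartialResponse {
    PartialContent(u64), // 206: resume writing at this offset
    CompleteContent,     // 200: full body returned, restart from 0
    ResourceModified,    // 206 but ETag changed: discard and re-request
}

fn classify(status: u16, offset: u64, etag_changed: bool) -> PartialResponse {
    match status {
        // The server honoured the range, but the resource changed under
        // us: the partial body belongs to a different version.
        206 if etag_changed => PartialResponse::ResourceModified,
        206 => PartialResponse::PartialContent(offset),
        // Anything else (typically 200 OK): treat as a full download.
        _ => PartialResponse::CompleteContent,
    }
}
```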
    use reqwest::header;
    use reqwest::header::HeaderValue;
    use reqwest::StatusCode;
    use reqwest::{header, Response};
To be fixed to make the formatter happy.
Signed-off-by: Marcel Guzik <[email protected]>
    let mut response = self.request_range_from(url, request_offset).await?;
    let offset = match partial_response::response_range_start(&response, &prev_response)? {
        PartialResponse::CompleteContent => 0,
I don't fully get how we are sure the download is making progress, as we restart here from zero in a loop, with the backoff retry being inner to this loop.
We restart from zero only if the server returns 200 OK and we have to download the entire file again. self.request_range_from performs the HTTP request, and is subject to the backoff retry policy, but reading response body and writing it to file happens in save_chunks_to_file_at, which if it completes without errors, breaks out of the loop.
which if it completes without errors, breaks out of the loop.
That if is my concern. What if we repeatedly fail to consume from the network the last bytes of a file that is served in its entirety each time? Having an infinite retry loop is never safe, even when the likelihood is low.
Indeed you've convinced me that the loop being unbounded is a problem and it should be bounded somehow, but how? Should we just limit the number of iterations the loop can do, such that if we retry too many times we fail, even if theoretically every retry could make a little progress before e.g. timing out? Or should we try to do something smarter, like only count towards the limit requests that haven't made any progress (although it's not clear to me how to precisely define it)?
Ah, this whole thing turns out to be more complicated than initially anticipated; everywhere there is some little edge case. Maybe there are some dependencies that could help with this, or if not, I should take a more thorough look at how projects like wget are doing this.
Should we just limit the number of iterations the loop can do
Simple and effective. I don't think we need something smarter as the point is only to avoid insane traps.
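The agreed-upon bound can be sketched as a simple capped loop (names are illustrative; `try_download_once` stands in for the real request-plus-save step): the loop terminates either on success or after a fixed number of attempts, so a server that keeps invalidating our partial progress cannot trap us forever.

```rust
// Hedged sketch of the bounded download loop discussed above: cap the
// number of restarts instead of retrying indefinitely.
fn download_with_restart_limit<F>(max_attempts: u32, mut try_download_once: F) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    for _ in 0..max_attempts {
        // One attempt = one request plus writing the body to the file;
        // on failure (resource modified, transfer error) we restart.
        if try_download_once().is_ok() {
            return Ok(());
        }
    }
    Err("download did not complete within the attempt limit".to_string())
}
```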
If during the download the resource was modified on the server to be smaller, and we've already written more bytes to disk than the new size of the file, then after retrying and overwriting the file with the new version we could end up with garbage data from the old version of the file at the end. To fix this, after completing the download we call `set_len` to discard any extra bytes that might be present after the cursor when we finish writing.

Signed-off-by: Marcel Guzik <[email protected]>
The issue with leftover bytes from older versions of the downloaded resource was addressed in a7763e4.
Proposed changes
In the current download implementation, where in case of interruption we try to resume the download using an HTTP range request to avoid re-downloading parts we already have, there is a potential issue: if the file is updated on the server between retries, we can miss this and corrupt the file.
This PR adds a check: if the file was modified, we abort the range request and request the full file again with a normal GET request. In particular, the modification check compares the ETag header value, if it exists, between the current and the previous request. If it is different from the ETag of the previous request, we request the full range of the file again. If there's no ETag, the behaviour is unchanged and we proceed with the range request, so as to not disable partial requests for servers that don't send an ETag, for example Cumulocity Inventory Binaries.

Types of changes
Paste Link to the issue
Checklist

- `just prepare-dev` (once)
- `just format` as mentioned in CODING_GUIDELINES
- `just check` as mentioned in CODING_GUIDELINES

Further comments
One more thing we could do is check if the Content-Length of the resource is the same, hopefully catching some updates where the size of the file changes. Unfortunately, I am currently unable to test this, because to test partial requests we use chunked transfer encoding, and when it's in use, reqwest ignores the Content-Length header and doesn't report it (the idea maybe being that chunked transfer encoding is inherently used for streaming requests, where you most often don't know the size ahead of time). So currently this check is not added, and we only check the ETag if it's present.

Some servers also send a Last-Modified header, support for which could be added (unfortunately Cumulocity also doesn't support it).

There's also the option to use the Want-Content-Digest header to request a digest from the server, which we could then also compute locally to perform an integrity check before finishing the download, but so far I haven't seen a server that supports it.
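The Content-Length idea floated above, if reqwest did report the header, would reduce to a small comparison (a sketch only; this check is explicitly not part of the PR, and the function name is hypothetical): differing sizes prove the resource changed, while equal or absent sizes are inconclusive.

```rust
// Hedged sketch of the Content-Length check discussed above. A size
// mismatch between the previous and current response proves the resource
// changed; equal or missing lengths prove nothing (the content could
// still differ at the same size).
fn size_definitely_changed(prev_len: Option<u64>, curr_len: Option<u64>) -> bool {
    match (prev_len, curr_len) {
        (Some(prev), Some(curr)) => prev != curr,
        // Length missing on either side (e.g. chunked transfer encoding):
        // cannot tell, so fall back to the ETag check.
        _ => false,
    }
}
```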