Update libnetwork to fix port binding issue #428
Merged
Conversation
This new version has a patch cherry-picked from here: moby/libnetwork#1805. This patch is meant to avoid cases in which libnetwork internal state gets inconsistent in case of crashes.

Signed-off-by: Leandro Motta Barros <leandro@balena.io>
Change-type: patch
majorz (Contributor) approved these changes on Apr 21, 2023:

Very clever test case reproduction, looks good to me!
This updates balena-libnetwork to a version that should fix some port binding issues that may happen after balenaEngine or device crashes. Specifically, this balena-libnetwork version cherry-picks this unmerged upstream patch (with minor changes to make it compatible with recent Moby versions).
I cannot comment on the precise details, but this patch essentially changes the order of initialization of some network-related components in order to avoid getting into an inconsistent state.

Fixes #272 (at least it should fix some of its occurrences).
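Just to give a rough, hypothetical picture of the kind of hazard involved (this is not libnetwork code, and none of the names below come from it): if state describing a port binding is persisted before the binding is fully set up, a crash in that window leaves a record that the next startup dutifully restores, spawning a proxy for a port that no container actually owns.

```go
package main

import (
	"fmt"
	"net"
)

// Hypothetical sketch only -- not libnetwork code. It just illustrates the
// "persist first, act second" crash window that this kind of patch is about.

// persisted stands in for on-disk state that survives a crash/reboot.
var persisted []string

// publishPort records the binding, then actually takes the port. A crash
// between the two steps leaves a record with no live binding behind it.
func publishPort(addr string, crashBeforeBind bool) error {
	persisted = append(persisted, addr) // 1. record the binding
	if crashBeforeBind {
		return fmt.Errorf("simulated crash before %s was actually bound", addr)
	}
	_, err := net.Listen("tcp", addr) // 2. actually take the port
	return err
}

// restoreOnStartup mimics a daemon re-creating proxies from saved state.
func restoreOnStartup() {
	for _, addr := range persisted {
		if l, err := net.Listen("tcp", addr); err == nil {
			fmt.Println("stale proxy now holds", addr)
			_ = l // kept open, so the real service can never bind this port
		}
	}
}

func main() {
	// Simulate the bad timing: the record is written, the bind never happens.
	_ = publishPort("127.0.0.1:8080", true) // example address only
	// On the "next startup" the stale record is restored and the port grabbed:
	restoreOnStartup()
}
```

The upstream patch does something much more specific inside libnetwork's initialization, but this is the general class of problem it guards against.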
Testing
Tested for regressions: Engine unit tests and integration tests passing. Tried it in a meta-balena branch; all tests passed. Also did some manual testing on a Pi 3.
Testing for effectiveness is another story. We don't have a reliable way to reproduce the issue, so I created a version of the Engine meant to crash at a point that triggers the issue. Now, I cannot tell for sure that this reproduces exactly the same case we are seeing in practice, but to me the symptoms look close enough to give good confidence that this is a step in the right direction.
I'll describe in detail what I did to reproduce the issue and test the patch, because this might be a good future reference should other similar issues appear (or this one reappear).

First, based on this analysis, we see that the issue happens when the Engine crashes at a more or less specific point. I tried to locate such a point; I'm not sure I found it exactly, but I found something -- and then added some code that allows us to force a crash right there:
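(The actual snippet isn't reproduced here. The idea is to crash hard -- skipping any deferred cleanup -- whenever a sentinel file exists, so the crash can be triggered on demand at that exact code path. A minimal sketch follows; the function name and call site are made up, only the sentinel path comes from the steps below.)

```go
package main

import (
	"fmt"
	"os"
)

// crashIfRequested force-crashes the process when a sentinel file exists.
// Dropping a call to it into the code path under investigation makes the
// daemon die at exactly that point. (Illustrative sketch, not the actual
// change used for this test.)
func crashIfRequested() {
	const sentinel = "/mnt/data/crash-the-engine.please"
	if _, err := os.Stat(sentinel); err == nil {
		fmt.Fprintln(os.Stderr, "forced crash requested via", sentinel)
		os.Exit(2) // exit immediately, bypassing deferred cleanup
	}
}

func main() {
	crashIfRequested()
	fmt.Println("sentinel not present, carrying on normally")
}
```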
For the test itself, I prepared two Engine versions: one containing the patch we are testing (`balena-engine-patched`), another containing the "crash code" above (`balena-engine-crashable`). I copied both to the data partition of a Pi 3, so that I can symlink `/usr/bin/balena-engine` to either of them as needed. And then:

- `balena-engine-crashable` in place (but not forcing a crash yet!), user service (container) running, all nice and fine.
- `ps aux | grep proxy`, check the PIDs. In my case, 2216 and 2226.
- `touch /mnt/data/crash-the-engine.please`
- Bring the service up again: the Engine hits the forced crash, leaving `balena-engine-proxy` processes holding the ports. Check with `lsof -nP -iTCP -sTCP:LISTEN` and `ps aux | grep proxy` (or with the small bind-check sketch after this list). Notice these are new processes (PIDs 2984 and 2993 in my case) created while bringing up the service again, before the forced crash.
- `reboot`
- There are `balena-engine-proxy` processes even before we try to start the service (IIUC, they are created as the Engine initializes the network subsystem; it's basically trying to restore the pre-reboot state). Trying to start the service now fails: the leftover proxies are holding its ports.
- `mount -o remount,rw /`, `cd /usr/bin/`, `ln -nfs /mnt/data/balena-engine-patched balena-engine`.
- `reboot`

This time the service starts fine -- so, looks like the patch helped, Q.E.D. 🙂
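As a complement to the `lsof` check in the steps above, one quick way to confirm a port is still held by a leftover proxy is simply to try to bind it. A tiny helper along these lines (the default port is just an example, not something from this PR):

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// Tries to bind the given TCP port; if a leftover balena-engine-proxy is
// still holding it, the bind fails with "address already in use".
func main() {
	port := "80" // example default; pass the port your service publishes
	if len(os.Args) > 1 {
		port = os.Args[1]
	}
	l, err := net.Listen("tcp", ":"+port)
	if err != nil {
		fmt.Printf("port %s is still held: %v\n", port, err)
		os.Exit(1)
	}
	l.Close()
	fmt.Printf("port %s is free\n", port)
}
```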
Side note: if we reboot a second time -- i.e., right after the first `reboot` above, while still on `balena-engine-crashable` and before trying to start the service -- the service starts successfully. In this case, we apparently don't create `balena-engine-proxy` processes before attempting to start the service. I don't know why this happens -- why does this second reboot (apparently) make the internal state consistent again?