BUG: fix several race conditions #477
alpeb
left a comment
Thanks @sdickhoven for this. This certainly makes the logic more robust and less repetitive 👍
Can I ask you to please create a separate PR for the changes unrelated to the fix? We always squash commits when merging, so it'd be nice to have those as a separate commit.
You can disregard for now the CI failure; we're currently addressing that in other PRs.
|
hi @alpeb 👋 i have taken out the unrelated changes. ✅ happy to submit another pr for those unrelated changes but they are 100% cosmetic and 0% functional. so there's really no need. |
|
by the way, i just thought of a way in which specifically, the current picture this scenario: two new cni config files are created at very nearly the same time. let's call them
moving the updated but before we can deal with that inotifywait event, we first have to deal with the one that is already waiting in the queue for
moving the updated but before we can deal with that inotifywait event, we first have to deal with the one that is already waiting in the queue for ...and since so the cycle repeats... forever.
should i fix this behavior in this pr or create a new pr? it is technically a different error case (not a race condition but an infinite loop).
i would address this by updating the sha sum logic to store all sha sums in a variable so that the sha sum for the correct file can be looked up.
if you want me to submit a separate pr, i'm going to wait until this pr is merged... because the required changes would be different for pre-/post-merge
|
this is what my fix for the infinite patching loop would look like (using this pr as base):
diff --git a/cni-plugin/deployment/scripts/install-cni.sh b/cni-plugin/deployment/scripts/install-cni.sh
index cdf166c..b6156d7 100755
--- a/cni-plugin/deployment/scripts/install-cni.sh
+++ b/cni-plugin/deployment/scripts/install-cni.sh
@@ -264,14 +264,16 @@ sync() {
# monitor_cni_config starts a watch on the host's CNI config directory
monitor_cni_config() {
+ local new_sha
inotifywait -m "${HOST_CNI_NET}" -e create,moved_to,modify |
while read -r directory action filename; do
if [[ "$filename" =~ .*.(conflist|conf)$ ]]; then
log "Detected change in $directory: $action $filename"
- sync "$filename" "$action" "$cni_conf_sha"
+ sync "$filename" "$action" "$(jq -r --arg file "$filename" '.[$file] | select(.)' <<< "$cni_conf_sha")"
# calculate file SHA to use in the next iteration
if [[ -e "$directory/$filename" ]]; then
- cni_conf_sha="$(sha256sum "$directory/$filename" | while read -r s _; do echo "$s"; done)"
+ new_sha="$(sha256sum "$directory/$filename" | while read -r s _; do echo "$s"; done)"
+ cni_conf_sha="$(jq -c --arg file "$filename" --arg sha "$new_sha" '. * {$file: $sha}' <<< "$cni_conf_sha")"
fi
fi
done
@@ -315,7 +317,7 @@ install_cni_bin
# Otherwise, new CNI config files can be created just _after_ the initial round
# of patching and just _before_ we set up the `inotifywait` loop to detect new
# CNI config files.
-cni_conf_sha="__init__"
+cni_conf_sha='{}'
monitor_cni_config &
# Append our config to any existing config file (*.conflist or *.conf)
this will maintain cni config sha hashes in a json object like this:
{
  "05-cilium.conflist" : "0a08ee0b9360e2ee2c3ed1d83263a3168832101346d0528a2474c3f80b7c73d6",
  "10-aws.conflist" : "7ed380c9100362003cde9861cc6ef09307245eba3ea963cdba36186a30284acd"
}
the since the string also
... <<< "$cni_conf_sha"
to
... < <(echo "$cni_conf_sha")
or
echo "$cni_conf_sha" | ...
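To make the difference between one remembered SHA and one SHA per file concrete, here is a minimal simulation in plain bash. It is only a sketch under stated assumptions: the function names and the `file:sha` event format are hypothetical, it uses a bash associative array (bash 4+) instead of the `jq`-maintained JSON object from the diff above, and it touches no real files or inotify events.

```shell
#!/usr/bin/env bash
# Hypothetical simulation: each argument is "file:sha", i.e. the file an event
# refers to and the SHA that file has when the event is handled.

# Count re-patches when only ONE previous SHA is remembered for all files.
count_patches_single_sha() {
  local prev='__init__' count=0 event sha
  for event in "$@"; do
    sha=${event#*:}
    if [ "$sha" != "$prev" ]; then
      count=$((count + 1))
    fi
    prev=$sha
  done
  echo "$count"
}

# Count re-patches when the previous SHA is looked up per file.
count_patches_per_file_sha() {
  local -A seen=()
  local count=0 event file sha
  for event in "$@"; do
    file=${event%%:*}
    sha=${event#*:}
    if [ "${seen[$file]:-}" != "$sha" ]; then
      count=$((count + 1))
    fi
    seen[$file]=$sha
  done
  echo "$count"
}

# Two files whose events interleave; their contents do not change in between.
events=(a.conflist:aaa b.conflist:bbb a.conflist:aaa b.conflist:bbb)
echo "single SHA:   $(count_patches_single_sha "${events[@]}") patches"
echo "per-file SHA: $(count_patches_per_file_sha "${events[@]}") patches"
```

With a single remembered SHA, every interleaved event looks like a change, because the comparison is always made against the other file's SHA. In the real script each patch emits a further event, which is how the loop keeps feeding itself. With a per-file lookup, the repeated events for unchanged files are recognized and ignored.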
|
Sorry for the late reply... As you correctly realized, this script was designed supposing the existence of a single cni config file. I'd like to spend more time investigating how linkerd-cni should behave in the presence of multiple such files before evaluating your solution. So let's for now leave your last diff posted as a comment out of the current PR. I'll give another look and round of tests to the current commits in the following days, so we can move forward. |
|
Testing this results in the following output: Those extra events at the bottom are new with this change. Given that we're now calling |
which block are you referring to? this?:
# Append our config to any existing config file (*.conflist or *.conf)
config_files=$(find "${HOST_CNI_NET}" -maxdepth 1 -type f \( -iname '*conflist' -o -iname '*conf' \))
if [ -z "$config_files" ]; then
log "No active CNI configuration files found"
else
config_file_count=$(echo "$config_files" | grep -v linkerd | sort | wc -l)
if [ "$config_file_count" -eq 0 ]; then
log "No active CNI configuration files found"
else
find "${HOST_CNI_NET}" -maxdepth 1 -type f \( -iname '*conflist' -o -iname '*conf' \) -print0 |
while read -r -d $'\0' file; do
log "Installing CNI configuration for $file"
...
this block is definitely needed! because cni config files can already exist before the logic is:
i guess that the change in the block that follows but i figured that consolidating the patching logic made sense.
does that answer your question? apologies if i did not understand your question correctly.
the log output looks exactly like what i would expect from my changes.
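For reference, the enumeration step quoted above (find existing `*.conflist` / `*.conf` files, skip linkerd's own) can be sketched as a standalone function. This is an illustration under assumptions, not the script's code: `list_cni_configs` and the demo directory are hypothetical, the patterns require a literal dot before the extension (the quoted block uses `-iname '*conflist'`), and the linkerd filter is applied to the file name rather than the whole path.

```shell
#!/usr/bin/env bash
# List candidate CNI config files in a directory, one path per line, skipping
# any file whose name contains "linkerd". NUL separators are used between
# find, sort and read so that unusual file names survive intact.
list_cni_configs() {
  local dir=$1 file
  find "$dir" -maxdepth 1 -type f \( -name '*.conflist' -o -name '*.conf' \) -print0 |
    sort -z |
    while IFS= read -r -d '' file; do
      case "${file##*/}" in
        *linkerd*) continue ;;
      esac
      printf '%s\n' "$file"
    done
}

# Demo with throwaway files (names are illustrative only).
demo_dir=$(mktemp -d)
touch "$demo_dir/10-aws.conflist" "$demo_dir/05-cilium.conflist" \
      "$demo_dir/01-linkerd-cni.conf" "$demo_dir/README.txt"
list_cni_configs "$demo_dir"
```

Because the watcher is started before this enumeration runs, a file created while the loop is still iterating should be picked up by the watcher instead of being missed.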
|
oh wait... i think i understand what you're asking...
you are right. i would not expect those last two events. however, i thought that this was maybe your ci check making sure that trying to update the same file twice would not lead to double-patching. 🤷
if that's not the case then you are correct: those last two events shouldn't happen and i'm not sure where they would be coming from. 😕
...but only the last two. the two events before the last two are expected and a result of this patch.
yes. before this change, so there wouldn't have been an now, when so the act of patching will trigger so:
1st
log "Trigger CNI config detection for $file"
...
mv -f "$tmp_file" "$file"
2nd
both of these events did not show up before because but, as you can see, the second event is ignored (as expected/intended). but then there are two more
|
i hope this answers your question. please let me know if i'm not making sense. |
|
any progress on this? cni chaining is kind of a thing and it would be good for linkerd to work correctly in environments where cni chaining is already in place... i.e. where linkerd isn't the only chained cni plugin. chaining cilium to aws vpc cni is definitely a pretty common setup based on what i see in the cilium slack workspace. let me know if there's anything else i can do to help get this very real race condition fixed. |
|
Hi, thanks for pinging back; looking again into this... As a side note, I've just realized libcni recently released a way of doing safe subdirectory-based plugin config loading, which would save us from all this config patching nonsense... but that approach will have to wait till it gets picked up by the major cloud providers. Perhaps we could have users opt into that behavior. Not asking you to implement any of that, just wanted to share what I found out 🙂 |
yes. good point.
😍 that's great! i hope it becomes available soon. that would certainly make things easier. do you want me to update my pr with the also, do you want me to open another pr for the other race condition i pointed out about the infinite patching loop due to the flawed shasum logic? if so, then i'd want to wait until this pr is merged before i submit the second pr. |
|
Yes, please update the PR with the |
i personally think that this is a mistake and here's why:
handling the concurrency (whether in bash or in another language) is actually not very complex. the the only thing you have to do is make sure that your logic can deal with filesystem events for multiple files in any order.
and i can trivially provoke an infinite patching loop (given the current shasum logic) by running the following two commands in quick succession:
...which i just did on one of my worker nodes and it sent linkerd-cni into said infinite patching loop:
...and it's perfectly plausible that two cni plugin files are created in short succession.
the race condition that is fixed by the current version of my pr may be "masking" the infinite patch loop race condition to a large extent. so fixing this race condition will probably make the other race condition more likely to occur.
anyway, my patch to address the infinite patch loop is quite trivial imo. and i would personally include it in this pr. but... i don't want to talk you into doing something that you're not comfortable with. so... your call.
log "Trigger CNI config detection for $file"
# The following will trigger the `sync()` function via `inotifywait` in
# `monitor_cni_config()`.
touch "$file"
fyi: there may be filesystems out there that round the mtime attribute to seconds.
this may render a touch ineffective (because the old and new mtime could end up being the same which may then not trigger an ATTRIB event) but i have no way of knowing/testing that.
the safest / most universally compatible way to do this is probably what i had before. but this is definitely "cleaner".
up to you if you want to keep the touch logic or revert back to the mv logic. happy to revert if you tell me to.
also... no rush. we currently have the workaround outlined in the pr description deployed in all of our clusters. so we are fine. ✅ if you'd rather rewrite the cni patch logic in golang or wait for the new subdirectory-based plugin config, that's fine by us. but other linkerd-cni users may run into this race condition and won't be able to effectively troubleshoot / mitigate it. 🤷 |
|
Thanks for the detailed repro for the infinite loop issue. I wasn't entirely clear on how multiple config files would coexist under /etc/cni/net.d (leaving aside the new libcni drop-in config stuff since, IIUC, the separate config files would be under separate directories, so it shouldn't affect our existing inotifywait setup). So here's a summary of what I tested out, just for reference.
I set up a test EKS cluster with aws-cni and cilium-cni. When installing cilium, I see it copies aws-cni's conflist file and appends its own config into its plugins array, naming the new file with a higher precedence so it overrides the original conflist:
The original conflist (10-aws.conflist) should no longer be modified, so I guess this is why we haven't encountered the condition you describe (btw this is probably a better approach than mutating files, but that's another discussion).
That being said, we have no guarantees about what touches those files and I think you made a good case about your infinite loop remediation which should provide improved guardrails.
As for the
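The precedence behavior described in this comment can be illustrated with a small sketch. It assumes, as the comment implies, that the config file that sorts first by name is the one that takes effect; `active_cni_config` is a hypothetical helper, and the file names are borrowed from the JSON example earlier in this thread.

```shell
#!/usr/bin/env bash
# Print the config file that sorts first by name in the given directory.
# Assumption: the runtime uses the lexically first *.conf / *.conflist file.
active_cni_config() {
  local dir=$1
  find "$dir" -maxdepth 1 -type f \( -name '*.conflist' -o -name '*.conf' \) |
    LC_ALL=C sort |
    head -n 1
}

# Demo: the chained copy (05-...) shadows the original (10-...).
demo_dir=$(mktemp -d)
touch "$demo_dir/10-aws.conflist" "$demo_dir/05-cilium.conflist"
active_cni_config "$demo_dir"
```

Under that assumption both files still exist on disk and both still generate filesystem events, even though only one of them is in effect.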
it's not bad but it comes with its own set of challenges. 🙁 but, yeah, instead of using shasum to keep track of which files have already been patched, we could just check if a newly created file in
yes. we basically already have that: there are several things about this script that are a bit kludgy and inconsistent that i would be happy to clean up but i didn't want to make such a big change that you would be reluctant to accept my pr. i.e. i wanted to make as small a change as possible that basically preserves most of the existing logic to make it easier for you to reason about my change. but i've been writing shell scripts since the early 90s so i feel more than comfortable refactoring (at least some aspects of) this script if you are ok with that.
that wouldn't work without some exceptionally ugly kludges (because bash isn't a "real" programming language)... so, no to that.
...or maybe you meant to say "trigger" (instead of "setting up") the i.e. have a separate function that does
log "Trigger CNI config detection for $file"
tmp_file="$(mktemp -u /tmp/linkerd-cni.patch-candidate.XXXXXX)"
cp -fp "$file" "$tmp_file"
# The following will trigger the `sync()` function via `inotifywait` in
# `monitor_cni_config()`.
mv "$tmp_file" "$file"
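Wrapped into a function, the snippet above might look like the following sketch. The function name `retrigger_cni_config`, the optional staging-directory argument, and the `log` stub are assumptions for illustration. It also uses `mktemp` without `-u`, so the temporary file is actually created before `cp` overwrites it, which differs from the snippet as posted.

```shell
#!/usr/bin/env bash
log() { echo "[demo] $*"; }   # hypothetical stand-in for the script's log()

# Replace a config file with an identical copy of itself so that a watcher on
# its directory sees a fresh event, without changing the file's content.
retrigger_cni_config() {
  local file=$1 staging_dir=${2:-/tmp} tmp_file
  log "Trigger CNI config detection for $file"
  tmp_file=$(mktemp "$staging_dir/linkerd-cni.patch-candidate.XXXXXX")
  cp -fp "$file" "$tmp_file"
  mv -f "$tmp_file" "$file"
}

# Demo: the content (and therefore the SHA) is the same before and after.
demo_dir=$(mktemp -d)
printf '{"name":"demo","plugins":[]}\n' > "$demo_dir/10-demo.conflist"
before=$(sha256sum "$demo_dir/10-demo.conflist" | awk '{print $1}')
retrigger_cni_config "$demo_dir/10-demo.conflist" "$demo_dir"
after=$(sha256sum "$demo_dir/10-demo.conflist" | awk '{print $1}')
[ "$before" = "$after" ] && echo "content unchanged"
```

Since the SHA is unchanged, a second event for the same file compares equal to the stored SHA and can be ignored. Whether the final `mv` is a plain rename depends on the staging directory being on the same filesystem as the target.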
|
so... i found a number of additional race conditions in this script. none of them have to do with bash but with the fact that we are dealing with a filesystem. so you have to really think about atomicity... which we would have to do in e.g. golang as well.
anyway... below is a diff based on the current pr of all the changes that i would like to submit in order to remove all (except for one very unlikely) race conditions (that i have discovered while refactoring)... plus a few formatting nits:
diff --git a/cni-plugin/deployment/scripts/install-cni.sh b/cni-plugin/deployment/scripts/install-cni.sh
index fe2ffa0..f0ba4f5 100755
--- a/cni-plugin/deployment/scripts/install-cni.sh
+++ b/cni-plugin/deployment/scripts/install-cni.sh
@@ -25,7 +25,7 @@
# - Expects the desired CNI config in the CNI_NETWORK_CONFIG env variable.
# Ensure all variables are defined, and that the script fails when an error is hit.
-set -u -e -o pipefail
+set -u -e -o pipefail +o noclobber
# Helper function for raising errors
# Usage:
@@ -77,7 +77,7 @@ cleanup() {
# Find all conflist files and print them out using a NULL separator instead of
# writing each file in a new line. We will subsequently read each string and
# attempt to rm linkerd config from it using jq helper.
- local cni_data=''
+ local cni_data
find "${HOST_CNI_NET}" -maxdepth 1 -type f \( -iname '*conflist' \) -print0 |
while read -r -d $'\0' file; do
log "Removing linkerd-cni config from $file"
@@ -176,104 +176,113 @@ create_cni_conf() {
CNI_NETWORK_CONFIG="${CNI_NETWORK_CONFIG:-}"
# If the CNI Network Config has been overwritten, then use template from file
- if [ -e "${CNI_NETWORK_CONFIG_FILE}" ]; then
- log "Using CNI config template from ${CNI_NETWORK_CONFIG_FILE}."
- cp "${CNI_NETWORK_CONFIG_FILE}" "${TMP_CONF}"
- elif [ "${CNI_NETWORK_CONFIG}" ]; then
+ if [ -e "$CNI_NETWORK_CONFIG_FILE" ]; then
+ log "Using CNI config template from $CNI_NETWORK_CONFIG_FILE."
+ cp -fp "$CNI_NETWORK_CONFIG_FILE" "$TMP_CONF"
+ elif [ "$CNI_NETWORK_CONFIG" ]; then
log 'Using CNI config template from CNI_NETWORK_CONFIG environment variable.'
- cat >"${TMP_CONF}" <<EOF
-${CNI_NETWORK_CONFIG}
+ cat <<EOF > "$TMP_CONF"
+$CNI_NETWORK_CONFIG
EOF
fi
# Use alternative command character "~", since these include a "/".
- sed -i s~__KUBECONFIG_FILEPATH__~"${DEST_CNI_NET_DIR}/${KUBECONFIG_FILE_NAME}"~g ${TMP_CONF}
+ sed -i s~__KUBECONFIG_FILEPATH__~"$DEST_CNI_NET_DIR/$KUBECONFIG_FILE_NAME"~g "$TMP_CONF"
- log "CNI config: $(cat ${TMP_CONF})"
+ log "CNI config: $(cat "$TMP_CONF")"
}
install_cni_conf() {
local cni_conf_path=$1
- local tmp_data=''
- local conf_data=''
- if [ -e "${cni_conf_path}" ]; then
- # Add the linkerd-cni plugin to the existing list
- tmp_data=$(cat "${TMP_CONF}")
- conf_data=$(jq --argjson CNI_TMP_CONF_DATA "${tmp_data}" -f /linkerd/filter.jq "${cni_conf_path}")
- echo "${conf_data}" > ${TMP_CONF}
- fi
+ local tmp_data=$(cat "$TMP_CONF")
+ local conf_data=$(jq --argjson CNI_TMP_CONF_DATA "$tmp_data" -f /linkerd/filter.jq "$cni_conf_path" || true)
+
+ # Ensure that CNI config file did not disappear during processing.
+ [ -n "$conf_data" ] || return 0
- # If the old config filename ends with .conf, rename it to .conflist, because it has changed to be a list
- filename=${cni_conf_path##*/}
- extension=${filename##*.}
+ # Add the linkerd-cni plugin to the existing list.
+ echo "$conf_data" > "$TMP_CONF"
+
+ # If the old config filename ends with .conf, rename it to .conflist because
+ # it has changed to be a list.
+ local filename=${cni_conf_path##*/}
+ local extension=${filename##*.}
# When this variable has a file, we must delete it later.
old_file_path=
- if [ "${filename}" != '01-linkerd-cni.conf' ] && [ "${extension}" = 'conf' ]; then
- old_file_path=${cni_conf_path}
- log "Renaming ${cni_conf_path} extension to .conflist"
- cni_conf_path="${cni_conf_path}list"
+ if [ "$filename" != '01-linkerd-cni.conf' ] && [ "$extension" = 'conf' ]; then
+ old_file_path=$cni_conf_path
+ log "Renaming $cni_conf_path extension to .conflist"
+ cni_conf_path="${cni_conf_path}list"
fi
+ # Store SHA of each patched file in global `CNI_CONF_SHA` variable.
+ #
+ # This must happen in a non-concurrent access context!
+ #
+ # The below logic assumes that the `CNI_CONF_SHA` variable is already a
+ # valid JSON object. So this variable must be initialized with '{}'!
+ #
+ # E.g. (pretty-printed; actual variable stores compact JSON object)
+ #
+ # {
+ # "/etc/cni/net.d/05-foo.conflist": "b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c",
+ # "/etc/cni/net.d/10-bar.conflist": "7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730"
+ # }
+ local new_sha=$( (sha256sum "$TMP_CONF" || true) | awk '{print $1}' )
+ CNI_CONF_SHA=$(jq -c --arg f "$cni_conf_path" --arg sha "$new_sha" '. * {$f: $sha}' <<< "$CNI_CONF_SHA")
+
# Move the temporary CNI config into place.
- mv "${TMP_CONF}" "${cni_conf_path}" || exit_with_error 'Failed to mv files.'
- [ -n "$old_file_path" ] && rm -f "${old_file_path}" && log "Removing unwanted .conf file"
+ mv "$TMP_CONF" "$cni_conf_path" || exit_with_error 'Failed to mv files.'
+ [ -n "$old_file_path" ] && rm -f "$old_file_path" && log "Removing unwanted .conf file"
- log "Created CNI config ${cni_conf_path}"
+ log "Created CNI config $cni_conf_path"
}
-# Sync() is responsible for reacting to file system changes. It is used in
-# conjunction with inotify events; sync() is called with the name of the file
-# that has changed, the event type (which can be either 'CREATE', 'DELETE',
-# 'MOVED_TO' or 'MODIFY', and the previously observed SHA of the configuration
-# file.
+# `sync()` is responsible for reacting to file system changes. It is used in
+# conjunction with inotify events; `sync()` is called with the event type (which
+# can be either 'CREATE', 'DELETE', 'MOVED_TO' or 'MODIFY') and the name of the
+# file that has changed.
#
-# Based on the changed file and event type, sync() might re-install the CNI
+# Based on the changed file and event type, `sync()` might re-install the CNI
# plugin's configuration file.
sync() {
- local filename=$1
- local ev=$2
- local filepath="${HOST_CNI_NET}/$filename"
-
- local prev_sha=$3
-
- local config_file_count
- local new_sha
- if [ "$ev" = 'CREATE' ] || [ "$ev" = 'MOVED_TO' ] || [ "$ev" = 'MODIFY' ] || [ "$ev" = 'ATTRIB' ]; then
- # When the event type is 'CREATE', 'MOVED_TO' or 'MODIFY', we check the
- # previously observed SHA (updated with each file watch) and compare it
- # against the new file's SHA. If they differ, it means something has
- # changed.
- new_sha=$(sha256sum "${filepath}" | while read -r s _; do echo "$s"; done)
- if [ "$new_sha" != "$prev_sha" ]; then
- # Create but don't rm old one since we don't know if this will be configured
- # to run as _the_ cni plugin.
- log "New/changed file [$filename] detected; re-installing"
- create_kubeconfig
- create_cni_conf
- install_cni_conf "$filepath"
- else
- # If the SHA hasn't changed or we get an unrecognised event, ignore it.
- # When the SHA is the same, we can get into infinite loops whereby a file has
- # been created and after re-install the watch keeps triggering CREATE events
- # that never end.
- log "Ignoring event: $ev $filepath; no real changes detected"
- fi
+ local ev=$1
+ local file=${2//\/\//\/} # replace "//" with "/"
+
+ [[ "$file" =~ .*.(conflist|conf)$ ]] || return 0
+
+ log "Detected event: $ev $file"
+
+ # Retrieve previous SHA of detected file (if any) and compute current SHA.
+ local previous_sha=$(jq -r --arg f "$file" '.[$f] | select(.)' <<< "$CNI_CONF_SHA")
+ local current_sha=$( (sha256sum "$file" || true) | awk '{print $1}' )
+
+ # If the SHA hasn't changed or the detected file has disappeared, ignore it.
+ # When the SHA is the same, we can get into infinite loops whereby a file
+ # has been created and after re-install the watch keeps triggering MOVED_TO
+ # events that never end.
+ # If the `current_sha` variable is blank then the detected CNI config file has
+ # disappeared and no further action is required.
+ # There exists an unhandled (highly improbable) edge case where a CNI plugin
+ # creates a config file and then _immediately_ removes it again _while_ we are
+ # in the process of patching it. If this happens, we may create a patched CNI
+ # config file that should *not* exist.
+ if [ -n "$current_sha" ] && [ "$current_sha" != "$previous_sha" ]; then
+ log "New/changed file [$file] detected; re-installing"
+ create_kubeconfig
+ create_cni_conf
+ install_cni_conf "$file"
+ else
+ log "Ignoring event: $ev $file; no real changes detected or file disappeared"
fi
}
# monitor_cni_config starts a watch on the host's CNI config directory
monitor_cni_config() {
- inotifywait -m "${HOST_CNI_NET}" -e create,moved_to,modify,attrib |
+ inotifywait -m "$HOST_CNI_NET" -e create,moved_to,modify |
while read -r directory action filename; do
- if [[ "$filename" =~ .*.(conflist|conf)$ ]]; then
- log "Detected change in $directory: $action $filename"
- sync "$filename" "$action" "$cni_conf_sha"
- # calculate file SHA to use in the next iteration
- if [[ -e "$directory/$filename" ]]; then
- cni_conf_sha="$(sha256sum "$directory/$filename" | while read -r s _; do echo "$s"; done)"
- fi
- fi
+ sync "$action" "$directory/$filename"
done
}
@@ -284,16 +293,16 @@ monitor_cni_config() {
# only reacting to direct creation of a "token" file, or creation of
# directories containing a "token" file.
monitor_service_account_token() {
- inotifywait -m "${SERVICEACCOUNT_PATH}" -e create |
- while read -r directory _ filename; do
- target=$(realpath "$directory/$filename")
- if [[ (-f "$target" && "${target##*/}" == "token") || (-d "$target" && -e "$target/token") ]]; then
- log "Detected creation of file in $directory: $filename; recreating kubeconfig file"
- create_kubeconfig
- else
- log "Detected creation of file in $directory: $filename; ignoring"
- fi
- done
+ inotifywait -m "$SERVICEACCOUNT_PATH" -e create |
+ while read -r directory _ filename; do
+ target=$(realpath "$directory/$filename")
+ if [[ (-f "$target" && "${target##*/}" == "token") || (-d "$target" && -e "$target/token") ]]; then
+ log "Detected creation of file in $directory: $filename; recreating kubeconfig file"
+ create_kubeconfig
+ else
+ log "Detected creation of file in $directory: $filename; ignoring"
+ fi
+ done
}
log() {
@@ -306,35 +315,32 @@ log() {
# Delete old "interface mode" file, possibly left over from previous versions
# TODO(alpeb): remove this on stable-2.15
-rm -f "${DEFAULT_CNI_CONF_PATH}"
+rm -f "$DEFAULT_CNI_CONF_PATH"
install_cni_bin
-# The CNI config monitor must be set up _before_ we start patching CNI config
-# files!
+# The CNI config monitor must be set up _before_ we start patching existing CNI
+# config files!
# Otherwise, new CNI config files can be created just _after_ the initial round
# of patching and just _before_ we set up the `inotifywait` loop to detect new
# CNI config files.
-cni_conf_sha="__init__"
+CNI_CONF_SHA='{}'
monitor_cni_config &
# Append our config to any existing config file (*.conflist or *.conf)
-config_files=$(find "${HOST_CNI_NET}" -maxdepth 1 -type f \( -iname '*conflist' -o -iname '*conf' \))
+config_files=$(find "$HOST_CNI_NET" -maxdepth 1 -type f \( -iname '*conflist' -o -iname '*conf' \) | grep -v linkerd || true)
if [ -z "$config_files" ]; then
- log "No active CNI configuration files found"
+ log "No active CNI configuration files found"
else
- config_file_count=$(echo "$config_files" | grep -v linkerd | sort | wc -l)
- if [ "$config_file_count" -eq 0 ]; then
- log "No active CNI configuration files found"
- else
- find "${HOST_CNI_NET}" -maxdepth 1 -type f \( -iname '*conflist' -o -iname '*conf' \) -print0 |
- while read -r -d $'\0' file; do
- log "Trigger CNI config detection for $file"
- # The following will trigger the `sync()` function via `inotifywait` in
- # `monitor_cni_config()`.
- touch "$file"
- done
- fi
+ find "${HOST_CNI_NET}" -maxdepth 1 -type f \( -iname '*conflist' -o -iname '*conf' \) -print0 |
+ while read -r -d $'\0' file; do
+ log "Trigger CNI config detection for $file"
+ tmp_file="$(mktemp -u /tmp/linkerd-cni.patch-candidate.XXXXXX)"
+ cp -fp "$file" "$tmp_file"
+ # The following will trigger the `sync()` function via filesystem event.
+ # This requires `monitor_cni_config()` to be up and running!
+ mv "$tmp_file" "$file" || exit_with_error 'Failed to mv files.'
+ done
fi
# Watch in bg so we can receive interrupt signals through 'trap'. From 'man
anyway, as you can see, it's quite a comprehensive change.
...actually, ☝️ this would make the bash code execute concurrently and cause lots of problems with global variables. so we definitely shouldn't do this. we want to serialize all CNI config file changes in this script. and that would NOT be the case if we put |
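The serialization point can be sketched without inotify at all: a single `while read` loop consumes events one line at a time, so the handler never runs concurrently with itself. Everything here is illustrative. `handle_event` and `process_events` are hypothetical names, events are fed with `printf` instead of `inotifywait -m`, and the extension check escapes the dot, which is slightly stricter than the pattern in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for sync(): numbers events in the order handled.
handle_event() {
  local action=$1 path=$2
  HANDLED=$((HANDLED + 1))
  echo "$HANDLED $action $path"
}

# Read "directory action filename" lines from stdin and handle them strictly
# one after another; unrelated file names are skipped.
process_events() {
  local directory action filename
  HANDLED=0
  while read -r directory action filename; do
    [[ "$filename" =~ \.(conflist|conf)$ ]] || continue
    handle_event "$action" "${directory%/}/$filename"
  done
}

printf '%s\n' \
  '/etc/cni/net.d/ CREATE 05-cilium.conflist' \
  '/etc/cni/net.d/ MODIFY notes.txt' \
  '/etc/cni/net.d/ MOVED_TO 10-aws.conflist' | process_events
```

Because the loop sits on the receiving end of a pipe, bash runs it in its own subshell: variables updated inside it (such as a SHA map) are consistent within the loop but are not visible to the parent script. That is one reason to route all patching through events rather than calling the handler from elsewhere.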
e.g. this is a race condition. if the cni config file is removed between
if [ -e "${cni_conf_path}" ]; then
and
conf_data=$(jq --argjson CNI_TMP_CONF_DATA "${tmp_data}" -f /linkerd/filter.jq "${cni_conf_path}")
the script will exit and linkerd-cni's ability to patch cni config files will be permanently disabled.
as you can see, i am doing the following instead which is robust against this edge case:
conf_data=$(jq --argjson CNI_TMP_CONF_DATA "$tmp_data" -f /linkerd/filter.jq "$cni_conf_path" || true)
[ -n "$conf_data" ] || return 0
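The pattern in this comment (attempt the read, tolerate failure, then check whether anything came back) can be tried in isolation. In this sketch `cat` stands in for the `jq` filter call, and `render_config` and `patch_config` are hypothetical names; the point is only that a vanished file leads to a quiet skip instead of an exit under `set -e`.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the jq call: prints the file's content, or nothing
# if the file cannot be read. "|| true" keeps a failure from aborting callers
# that run with "set -e".
render_config() {
  cat "$1" 2>/dev/null || true
}

patch_config() {
  local file=$1 conf_data
  conf_data=$(render_config "$file")
  # Empty output means the file disappeared (or was unreadable): skip it.
  [ -n "$conf_data" ] || return 0
  echo "would install: $conf_data"
}

demo_dir=$(mktemp -d)
echo '{"plugins":[]}' > "$demo_dir/10-demo.conflist"
(
  set -e -o pipefail
  patch_config "$demo_dir/10-demo.conflist"
  patch_config "$demo_dir/does-not-exist.conflist"
  echo "still running"
)
```

There is no separate existence check whose answer could go stale before the file is read: the read itself is the check.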
…rd#490) Bumps [docker/setup-qemu-action](https://github.com/docker/setup-qemu-action) from 3.4.0 to 3.5.0. - [Release notes](https://github.com/docker/setup-qemu-action/releases) - [Commits](docker/setup-qemu-action@4574d27...5964de0) --- updated-dependencies: - dependency-name: docker/setup-qemu-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…rd#493) Bumps [docker/setup-qemu-action](https://github.com/docker/setup-qemu-action) from 3.5.0 to 3.6.0. - [Release notes](https://github.com/docker/setup-qemu-action/releases) - [Commits](docker/setup-qemu-action@5964de0...2910929) --- updated-dependencies: - dependency-name: docker/setup-qemu-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/cache](https://github.com/actions/cache) from 4.2.1 to 4.2.2. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](actions/cache@0c907a7...d4323d4) --- updated-dependencies: - dependency-name: actions/cache dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Fixes linkerd/linkerd2#12573 In linkerd#440 we introduced an inotifywait call to detect changes in the service account token, to be run in parallel to the existing inotifywait tracking cni config file changes. The cleanup() function didn't account for the additional inotifywait background process, so the call `kill -s KILL "$(pgrep inotifywait)"` errored out as it was expecting just one PID. An improper cleanup leaves the linkerd-cni binary deployed in the node, along with its config in the cni config file, so it continues to be called even if the linkerd-cni pod is deleted. The problem is that as soon as the pod is deleted, the service account token that the linkerd-cni binary relies on is revoked, causing the linkerd-cni invocation triggered upon each pod creation in the node to fail with an Unauthorized error. When rolling out the linkerd-cni pod this will happen sometimes, if k8s revokes that token fast enough, blocking any further creation of pods in that node.
cargo-deny-action is broken: EmbarkStudios/cargo-deny-action#91 This change replaces the action with a manual invocation via the dev container image.
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.35.1 to 1.38.2. - [Release notes](https://github.com/tokio-rs/tokio/releases) - [Commits](tokio-rs/tokio@tokio-1.35.1...tokio-1.38.2) --- updated-dependencies: - dependency-name: tokio dependency-version: 1.38.2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…erd#518) Fixes linkerd/linkerd2#13976 When the Linkerd CNI-plugin acts on a pod with a `config.linkerd.io/skip-inbound-ports` annotation, it replaces the inbound skip ports that the CNI-plugin was configured with instead of adding to them. The CNI-plugin is configured with the Linkerd proxy's admin and tap ports as skip inbound ports. Therefore, on any pod with the `config.linkerd.io/skip-inbound-ports` annotation, the admin and tap ports will no longer be skipped by the iptables redirect rules, making these proxy ports inaccessible and causing tap and promethus scraping to no longer function. Instead of replacing, we append the ports specified in `config.linkerd.io/skip-inbound-ports` to the inbound skip ports. This allows the admin and tap ports to continue to skip iptables redirection and reach the proxy as desired. This matches the behavior of the linkerd-init container: https://github.com/linkerd/linkerd2/blob/main/charts/partials/templates/_proxy-init.tpl#L25 Tested manually on k3d with Calico and validating that tap works as expected on workloads with `config.linkerd.io/skip-inbound-ports` configured. Due to the architecture of the cni-plugin executable, it is not well set up to be tested automatically without further refactoring so we omit automated tests here. Signed-off-by: Alex Leong <[email protected]>
…erd#517) Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4.1.9 to 4.3.0. - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](actions/download-artifact@cc20338...d3f86a1) --- updated-dependencies: - dependency-name: actions/download-artifact dependency-version: 4.3.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…erd#510) Bumps [sigstore/cosign-installer](https://github.com/sigstore/cosign-installer) from 3.8.1 to 3.8.2. - [Release notes](https://github.com/sigstore/cosign-installer/releases) - [Commits](sigstore/cosign-installer@d7d6bc7...3454372) --- updated-dependencies: - dependency-name: sigstore/cosign-installer dependency-version: 3.8.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [ring](https://github.com/briansmith/ring) from 0.17.9 to 0.17.14. - [Changelog](https://github.com/briansmith/ring/blob/main/RELEASES.md) - [Commits](https://github.com/briansmith/ring/commits) --- updated-dependencies: - dependency-name: ring dependency-version: 0.17.14 dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.33.0 to 0.38.0. - [Commits](golang/net@v0.33.0...v0.38.0) --- updated-dependencies: - dependency-name: golang.org/x/net dependency-version: 0.38.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…nkerd#509) Bumps [softprops/action-gh-release](https://github.com/softprops/action-gh-release) from 2.2.1 to 2.2.2. - [Release notes](https://github.com/softprops/action-gh-release/releases) - [Changelog](https://github.com/softprops/action-gh-release/blob/master/CHANGELOG.md) - [Commits](softprops/action-gh-release@c95fe14...da05d55) --- updated-dependencies: - dependency-name: softprops/action-gh-release dependency-version: 2.2.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…3.0 (linkerd#506) Bumps [github.com/containernetworking/cni](https://github.com/containernetworking/cni) from 1.2.3 to 1.3.0. - [Release notes](https://github.com/containernetworking/cni/releases) - [Commits](containernetworking/cni@v1.2.3...v1.3.0) --- updated-dependencies: - dependency-name: github.com/containernetworking/cni dependency-version: 1.3.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/cache](https://github.com/actions/cache) from 4.2.2 to 4.2.3. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](actions/cache@d4323d4...5a3ec84) --- updated-dependencies: - dependency-name: actions/cache dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [docker/login-action](https://github.com/docker/login-action) from 3.3.0 to 3.4.0. - [Release notes](https://github.com/docker/login-action/releases) - [Commits](docker/login-action@9780b0c...74a5d14) --- updated-dependencies: - dependency-name: docker/login-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [DavidAnson/markdownlint-cli2-action](https://github.com/davidanson/markdownlint-cli2-action) from 19.1.0 to 20.0.0. - [Release notes](https://github.com/davidanson/markdownlint-cli2-action/releases) - [Commits](DavidAnson/markdownlint-cli2-action@05f3221...992badc) --- updated-dependencies: - dependency-name: DavidAnson/markdownlint-cli2-action dependency-version: 20.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps alpine from 3.21.3 to 3.22.0. --- updated-dependencies: - dependency-name: alpine dependency-version: 3.22.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This updates dependabot.yml to also look into ./github/actions/* Support for this was recently introduced, see https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference#directories-or-directory--
…inkerd#527) Bumps [actions/checkout](https://github.com/actions/checkout) from 3df4ab11eba7bda6032a0b82a6bb43b11571feac to 09d2acae674a48949e3602304ab46fd20ae0c42f. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@3df4ab1...09d2aca) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '09d2acae674a48949e3602304ab46fd20ae0c42f' dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
|
ok. that seems to have worked. 😌 now that i know how to edit past commits, would you like me to also prepend my commit messages with e.g. ? ...though that would require another force-push. |
Thanks again, LGTM! -- Tested fine in EKS with aws-cni and chained cilium-cni 👍
|
hi @alpeb 👋 any idea when you are going to create a new release / docker image for the cni plugin with the updated |
hey @sdickhoven just a heads up that @alpeb is out of the office this week, but we can likely create a new release once he is back. |
thanks, @cratelyn 🙏 no rush. i just wanted to get an idea of roughly when we can expect a release. that's all. happy to wait until @alpeb is back from vacation (and longer too). 🏝️ |
this pr fixes a race condition that occurs when new cni config files are created while the `install-cni.sh` script is executing anywhere here: https://github.com/linkerd/linkerd2-proxy-init/blob/cni-plugin/v1.6.0/cni-plugin/deployment/scripts/install-cni.sh#L323-L348
we have observed this race condition several times in our eks clusters over the past few days where we are chaining cilium to the aws vpc cni.
the `install-cni.sh` script simply fails to patch the cilium cni config sometimes. i.e. cilium and linkerd-cni must be starting up and manipulating `/etc/cni/net.d` at just about the same time.
i have a temporary workaround for this race condition in the form of the following kustomize patch:
this pr furthermore addresses several additional race conditions that can occur when two or more cni config files are created in quick succession and/or when a cni config file is created and then removed immediately afterwards.
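to illustrate the kind of defensive handling these cases call for, here is a minimal bash sketch. it is not the code from this pr or from `install-cni.sh`: the `handle_event` function, the `ACTION` variable and the `seen_sha` map are hypothetical names made up for this example. the idea is to look up the sha sum per file (so that two files created back to back cannot be confused with each other) and to tolerate a file that has already been removed by the time its event is processed.

```bash
#!/usr/bin/env bash
# hypothetical sketch only; names and structure are assumptions,
# not taken from install-cni.sh.
set -u

# maps config file path -> sha256 of the last version we handled
declare -A seen_sha=()

# result of the most recent handle_event call:
# "patch", "skip-unchanged" or "skip-missing"
ACTION=""

# handle_event FILE
# decides what to do for a create/modify event on FILE.
# the result is stored in ACTION rather than printed, so that calling the
# function never requires a subshell (which would lose updates to seen_sha).
handle_event() {
  local file=$1 sha

  # the file may have been removed between the event firing and now
  if ! sha=$(sha256sum "$file" 2>/dev/null); then
    seen_sha[$file]=""   # forget it, so a re-created file is handled again
    ACTION="skip-missing"
    return 0
  fi
  sha=${sha%% *}         # keep only the hash, drop the file name

  # compare against the sha recorded for *this* file, not a single global one
  if [[ ${seen_sha[$file]:-} == "$sha" ]]; then
    ACTION="skip-unchanged"
    return 0
  fi

  # this sketch records the sha of the version it just saw; a real script
  # would patch the file here and record the sha of the patched result
  seen_sha[$file]=$sha
  ACTION="patch"
}
```

in an event loop this would be called once per reported file name, e.g. `handle_event "/etc/cni/net.d/$filename"`, followed by a check of `$ACTION`. the associative array requires bash 4 or newer.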