Add logVerbosity Helm chart parameter, reduce default log verbosity
#633
I will chunk this PR into individual patches.
| # - Checkpoint file updates | ||
| # - Kubelet plugin GRPC request/response detail | ||
| # | ||
| logVerbosity: "1" |
I'm hesitant to add this as a top-level helm variable. Can we start with just exposing an envvar on the relevant components (without any helm value at all), with the default set to (1) in the code if no envvar is passed.
> start with just exposing an envvar on the relevant components
We can do that. But let's work towards a specific goal:
I'd love to have a simple way to set the log verbosity across all components (either upon Helm install, or right thereafter).
With that env var: are those ergonomics good enough for us, for actually using that?
How will we (you, I) use this e.g. in a test suite, and in debugging?
Where/how and when do we set that env var?
> hesitant to add this as a top-level helm variable
and a (smiling!) side note is that we thought of this task as "allow verbosity of kubelet plugins and controller to be set via helm" (#609)
You still set the envvars via helm. There is just no explicit helm value added to the helm API.
Maybe we take a hybrid approach. Rename this to defaultLogVerbosity and then allow it to be overridden by an envvar on a component by component basis.
> You still set the envvars via helm
Sorry for digging in -- but how exactly do we imagine a user doing that?
Are you thinking about e.g.
| env: [] |
and then doing something like --set controller.containers.computeDomain.env.LOG_VERBOSITY=5?
> Rename this to defaultLogVerbosity
We can do that.
> allow it to be overridden by an envvar on a component by component basis.
We can also do that! But we can maybe also figure that one out in another patch.
I looked into doing this (set verbosity via environment) when starting to work on this patch. What I found rather disappointing: for regular klog/v2, setting the -v CLI flag is the only way to set the verbosity, and there is no API for changing it during runtime.
Of course, there is a way to mutate the parsed CLI flags (say, based on an env var) before klog interprets them. But that kind of approach may almost be appropriate for an obfuscation contest.
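For illustration, the flag-mutation idea could look roughly like the following. This is a sketch using only the standard `flag` package with a stand-in `-v` flag; klog registers its own flags on a `flag.FlagSet` via `klog.InitFlags`, and this example does not import klog. The function name `parseWithEnvFallback` is hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseWithEnvFallback (hypothetical) parses args into a fresh FlagSet that
// has a single integer flag "v". If "-v" was not passed explicitly and
// envValue is non-empty, envValue is applied through the flag machinery,
// as if it had been given on the command line.
func parseWithEnvFallback(args []string, envValue string) (int, error) {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	v := fs.Int("v", 0, "log verbosity")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}

	// Visit only walks flags that were set explicitly.
	explicit := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "v" {
			explicit = true
		}
	})

	if !explicit && envValue != "" {
		if err := fs.Set("v", envValue); err != nil {
			return 0, err
		}
	}
	return *v, nil
}

func main() {
	v, err := parseWithEnvFallback(os.Args[1:], os.Getenv("LOG_VERBOSITY"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("effective verbosity:", v)
}
```

Here an explicit `-v` wins over the environment variable; the opposite precedence would be just as easy to implement.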
Edit: maybe something like this works (using an env var in the command specification)?
command:
  - foo
  - --v=$(LOG_VERBOSITY)
env:
  - name: LOG_VERBOSITY
    value: "4"
Because I just found:
Kubernetes uses round parentheses for environment variable substitution in Pod command and args fields
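To make the substitution rule concrete, here is a simplified sketch of how `$(VAR)` references in `command`/`args` are expanded: a defined variable is substituted, an undefined reference is left unchanged, and `$$` escapes a literal `$`. The function `expand` is a toy written for this note and is not the Kubernetes implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// expand (toy re-implementation) replaces $(NAME) with env[NAME].
// Unknown names are left as-is; "$$" yields a single literal "$".
func expand(input string, env map[string]string) string {
	var b strings.Builder
	for i := 0; i < len(input); i++ {
		c := input[i]
		if c != '$' || i+1 >= len(input) {
			b.WriteByte(c)
			continue
		}
		switch input[i+1] {
		case '$':
			// Escaped dollar sign: emit one "$" and skip the second.
			b.WriteByte('$')
			i++
		case '(':
			end := strings.IndexByte(input[i+2:], ')')
			if end < 0 {
				// Unterminated reference: treat "$" literally.
				b.WriteByte(c)
				continue
			}
			name := input[i+2 : i+2+end]
			if val, ok := env[name]; ok {
				b.WriteString(val)
			} else {
				b.WriteString("$(" + name + ")")
			}
			i += 2 + end // now at ")"; the loop increment moves past it
		default:
			b.WriteByte(c)
		}
	}
	return b.String()
}

func main() {
	env := map[string]string{"LOG_VERBOSITY": "4"}
	fmt.Println(expand("--v=$(LOG_VERBOSITY)", env))
}
```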
Related topic: dynamically changing verbosity at runtime: I have seen a few pointers here and maybe nowadays there may be a better (even if not yet properly documented) way to dynamically set the verbosity (such as based on env var).
I want to explore other ways for changing the verbosity during runtime, and for that I think a starting point would be kubernetes/klog#368 and related resources -- klog is really limited in many ways compared to other logging libraries.
Other pointers (related, but not much of help IMO):
> Rename this to defaultLogVerbosity
I have done this now.
But are we sure about that variable name? I think the "default" prefix doesn't really help.
In configuration management, often there is an order of precedence -- the same parameter can be specified on various layers. Those layers then have a well-known hierarchical order.
I want to say that it's not uncommon to have an outer "logVerbosity" setting that can be overridden via some more "inner" mechanism, using the same parameter name.
(I mean, of course you know all that -- I just want to spell it out explicitly here, that I think this is one such case)
Another point of view, terminology: when not specifying --set defaultLogVerbosity=X -- what takes effect then? The default defaultLogVerbosity 🤔.
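The layering argument above can be sketched as a small resolver: walk the layers from most specific to least specific and take the first one that is set. The layer names and the built-in default of 1 are assumptions for illustration; this is not code from the chart or the driver.

```go
package main

import "fmt"

// layer is one place where the verbosity may be configured.
type layer struct {
	name  string
	value *int // nil means "not set on this layer"
}

// resolve returns the value of the first layer that is set, together with
// that layer's name. Layers must be ordered from most to least specific.
// If no layer is set, the built-in default applies.
func resolve(layers []layer, builtin int) (int, string) {
	for _, l := range layers {
		if l.value != nil {
			return *l.value, l.name
		}
	}
	return builtin, "built-in default"
}

func intPtr(i int) *int { return &i }

func main() {
	layers := []layer{
		{name: "per-component env override", value: nil},
		{name: "chart-level logVerbosity", value: intPtr(4)},
	}
	v, src := resolve(layers, 1)
	fmt.Printf("verbosity %d (from %s)\n", v, src)
}
```

With this shape the same parameter name can appear on every layer, which is the point made above about the "default" prefix not adding much.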
> doing something like --set controller.containers.computeDomain.env.LOG_VERBOSITY=5?
Update: well, that doesn't quite work.
I have now adjusted the patch so that one can do a per-component override.
The clumsy way to do that (tested):
helm install ... \
  --set controller.containers.computeDomain.env[0].name=LOG_VERBOSITY \
  --set-string controller.containers.computeDomain.env[0].value=6 \
  ...
Well, --set-string or
--set 'controller.containers.computeDomain.env[0].value="6"'
Note to self. Also viable:
kubectl set env pod/<pname> name=value
Triggers a pod restart.
| // history), checks the global rate against the token bucket, and picks the | ||
| // longest delay from either strategy, ensuring that both per-item and overall | ||
| // queue health are respected. | ||
| func DefaultPrepUnprepRateLimiter() workqueue.TypedRateLimiter[any] { |
Can we call this DefaultKubeletPluginRateLimiter to be symmetrical with DefaultControllerRateLimiter?
Also, can you explain how this change is relevant to a PR focused on logging?
> can you explain how this change is relevant to a PR focused on logging?
As so often, I kindly refer to my PR description :-)
#633 (comment), point (3).
As indicated in that PR description and also in another comment: I am happy to distribute the commits across different PRs to keep things separate.
- took this part of the patch to Introduce DefaultPrepUnprepRateLimiter (less aggressive) #656
- took this part of the discussion to https://github.com/NVIDIA/k8s-dra-driver-gpu/pull/656/files#r2413832932
| // Runs after `Action` (regardless of success/error). In urfave cli | ||
| // v2, the final error reported will be from either Action, Before, | ||
| // or After (whichever is non-nil and last executed). | ||
| klog.Infof("shutdown") |
| klog.Infof("shutdown") | |
| klog.Infof("Shutdown") |
cmd/gpu-kubelet-plugin/main.go
Outdated
| return flags.loggingConfig.Apply() | ||
| }, | ||
| Action: func(c *cli.Context) error { | ||
| klog.Infof("config: %v", flags) |
| klog.Infof("config: %v", flags) | |
| klog.Infof("Config: %v", flags) |
| // Runs after `Action` (regardless of success/error). In urfave cli | ||
| // v2, the final error reported will be from either Action, Before, | ||
| // or After (whichever is non-nil and last executed). | ||
| klog.Infof("shutdown") |
| klog.Infof("shutdown") | |
| klog.Infof("Shutdown") |
| }, | ||
| Action: func(c *cli.Context) error { | ||
| ctx := c.Context | ||
| klog.Infof("config: %v", render.Render(flags)) |
| klog.Infof("config: %v", render.Render(flags)) | |
| klog.Infof("Config: %v", render.Render(flags)) |
| // Check implements [grpc_health_v1.HealthServer]. | ||
| // Check implements [grpc_health_v1.HealthServer.Check]. |
No -- the comment explains that this implements the grpc_health_v1.HealthServer interface
But Check does not implement the full HealthServer interface, right? If something claims to implement an interface, I'd expect it to implement all of that interface, not just a part. That seems to be a language / convention aspect.
carried to #657
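The language aspect behind this exchange: in Go, a type satisfies an interface only if it has every method of that interface; a single method by itself "implements" nothing. A small sketch with a local stand-in interface (simplified signatures, not the real grpc_health_v1 types):

```go
package main

import "fmt"

// HealthServer is a local stand-in with two methods and simplified
// signatures. It is NOT the generated grpc_health_v1.HealthServer interface.
type HealthServer interface {
	Check() string
	Watch() string
}

// checkOnly has just one of the two methods.
type checkOnly struct{}

func (checkOnly) Check() string { return "SERVING" }

// full embeds checkOnly and adds the missing method.
type full struct{ checkOnly }

func (full) Watch() string { return "SERVING" }

// satisfiesHealthServer reports whether v's method set covers the interface.
func satisfiesHealthServer(v any) bool {
	_, ok := v.(HealthServer)
	return ok
}

func main() {
	fmt.Println("checkOnly:", satisfiesHealthServer(checkOnly{}))
	fmt.Println("full:", satisfiesHealthServer(full{}))
}
```

So "Check implements [HealthServer.Check]" describes the method, while "implements [HealthServer]" is a statement about the receiver type as a whole; both readings of the doc comment are defensible.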
| } | ||
| func startHealthcheck(ctx context.Context, config *Config) (*healthcheck, error) { | ||
| func setupHealthcheckPrimitives(ctx context.Context, config *Config) (*healthcheck, error) { |
Why this change? Does it not start the healthcheck server?
Notably, this does not start a health check, as I found by following the code.
Which is why I do not like the name "startHealthcheck" (I was misled by that, I thought this starts a health check and I wondered why).
carried to #657
| results[claim.UID] = err | ||
| wg.Done() | ||
| if err != nil { | ||
| klog.V(0).Infof("Permanent error unpreparing devices for claim %v: %v", claim.UID, err) |
Can the V(0) be dropped?
Also, can the log be done inside nodeUnprepareResource so as not to clutter things here?
Yes, level 0 is implicit when doing klog.Infof().
> Can the V(0) be dropped?
I remember: I did this to make this explicit -- to set the precedent for always using an explicit level, to enhance code readability.
Because this is a question one naturally has when reading code: what level does Info log on by default? One needs to have that additional knowledge.
But I will remove this now again to eradicate a potential point of friction.
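To spell out the "additional knowledge" in question, here is a toy leveled logger that mimics the gating discussed in this thread: a message at level N is emitted only if N is at most the configured verbosity, and a plain Infof behaves like level 0. The types are invented for this note and only imitate the shape of the klog calls; this is not klog.

```go
package main

import "fmt"

// logger is a toy stand-in that collects emitted lines.
type logger struct {
	verbosity int
	lines     []string
}

// verbose is the result of V(level); it logs only when enabled.
type verbose struct {
	enabled bool
	l       *logger
}

// V gates on the configured verbosity: level <= verbosity means enabled.
func (l *logger) V(level int) verbose {
	return verbose{enabled: level <= l.verbosity, l: l}
}

func (v verbose) Infof(format string, args ...any) {
	if v.enabled {
		v.l.lines = append(v.l.lines, fmt.Sprintf(format, args...))
	}
}

// Infof without an explicit level is modeled as level 0 here, matching the
// statement in this thread.
func (l *logger) Infof(format string, args ...any) {
	l.V(0).Infof(format, args...)
}

func main() {
	log := &logger{verbosity: 1}
	log.Infof("Shutdown")
	log.V(0).Infof("explicit level 0")
	log.V(2).Infof("Unprepare noop: %s", "dropped at verbosity 1")
	for _, line := range log.lines {
		fmt.Println(line)
	}
}
```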
> can the log be done inside nodeUnprepareResource
In d.nodeUnprepareResource(ctx, claim) we return isPermanentError(err) directly w/o inspecting its return value. We only look at done here, at the call site.
We can change this of course, but let's not do this here.
| // Prepare+Checkpoint are done transactionally). Note that | ||
| // claimRef.String() contains namespace, name, UID. | ||
| klog.Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String()) | ||
| klog.V(2).Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String()) |
| klog.V(2).Infof("unprepare noop: claim not found in checkpoint data: %v", claimRef.String()) | |
| klog.V(2).Infof("Unprepare noop: claim not found in checkpoint data: %v", claimRef.String()) |
| // Runs after `Action` (regardless of success/error). In urfave cli | ||
| // v2, the final error reported will be from either Action, Before, | ||
| // or After (whichever is non-nil and last executed). | ||
| klog.Infof("shutdown") |
| klog.Infof("shutdown") | |
| klog.Infof("Shutdown") |
I like consistency.
We have currently many log messages starting with a lower-case character:
$ grep -nR 'klog' | grep -E '\(\"[[:lower:]]' | wc -l
62
That's spread across levels:
$ grep -nR 'klog' | grep -E '\(\"[[:lower:]]' | grep Info | wc -l
35
We should do this in a follow-up; and maybe add a lint / check in CI.
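A possible starting point for such a check, mirroring the grep pipeline above (line mentions `klog` and contains `("` followed by a lower-case letter). This is a rough, line-based sketch; the function name is hypothetical, and a real linter would more likely work on the Go AST.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// lowercaseMsg matches an opening parenthesis directly followed by a string
// literal that starts with a lower-case letter, like the grep -E pattern.
var lowercaseMsg = regexp.MustCompile(`\("[[:lower:]]`)

// findLowercaseLogLines (hypothetical) returns the 1-based numbers of lines
// that mention klog and start a log message with a lower-case letter.
func findLowercaseLogLines(src string) []int {
	var hits []int
	for i, line := range strings.Split(src, "\n") {
		if strings.Contains(line, "klog") && lowercaseMsg.MatchString(line) {
			hits = append(hits, i+1)
		}
	}
	return hits
}

func main() {
	src := `klog.Infof("shutdown")
klog.Infof("Shutdown")
klog.V(2).Infof("unprepare noop: %v", ref)`
	fmt.Println("offending lines:", findLowercaseLogLines(src))
}
```

Like the grep, this also flags messages that intentionally start with a lower-case identifier, so a CI check would probably need an allowlist.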
| return flags.loggingConfig.Apply() | ||
| }, | ||
| Action: func(c *cli.Context) error { | ||
| klog.Infof("config: %v", render.Render(flags)) |
| klog.Infof("config: %v", render.Render(flags)) | |
| klog.Infof("Config: %v", render.Render(flags)) |
916b42d to
cac5cd9
Compare
| kubectl get pod \ | ||
| -l nvidia-dra-driver-gpu-component=controller \ | ||
| -n nvidia-dra-driver-gpu \ | ||
| | grep -iv "NAME" \ |
note: ignore column headers
| assert_output --partial "channel2047" | ||
| assert_output --partial "channel222" | ||
| kubectl delete -f demo/specs/imex/channel-injection-all.yaml | ||
| kubectl wait --for=delete pods imex-channel-injection-all --timeout=10s |
flyby: wait for cleanup after test.
| # Clean up. | ||
| kubectl delete "${POD}" | ||
| kubectl delete resourceclaim batssuite-rc-bad-opaque-config | ||
| kubectl wait --for=delete "${POD}" --timeout=10s |
flyby: wait for cleanup after test.
| # new LOG_VERBOSITY_CD_DAEMON setting applies), and make sure controller | ||
| # deployment is still READY before moving on (make sure 1/1 READY). | ||
| CPOD_OLD="$(get_current_controller_pod_name)" | ||
| kubectl set env deployment nvidia-dra-driver-gpu-controller -n nvidia-dra-driver-gpu LOG_VERBOSITY_CD_DAEMON=0 |
Until we have a flip-verbosity-at-runtime mechanism, this is probably the method we will recommend.
I started to document it here: https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Troubleshooting#controlling-log-verbosity
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Helm values.yaml: defaultLogVerbosity incl. docs
values.yaml: tweak, based on in log level insights
improve helm chart artifact commentary
squash: tweak docs
Rename chart var, start building tests
tests: cover log verbosity set per-component via env
helm: rename defaultLogVerbosity to logVerbosity
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
This also had a side effect on subsequent tests, with the controller starting with _no_ LOG_VERBOSITY environment variable set. I don't understand that, but that must be a funky Helm-ism.
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
a6ef300 to
4cf3d9b
Compare
20f9365 to
4ced422
Compare
|
Since last review:
Test suite passes: |
Exhaustively discussed with Kevin and agreed that this is good to go. Landing this. Nice. (also need to move on from the topic, this was a lot :)).
Let's learn from these changes by observing how things behave in practice.
Let's anticipate more changes on this front in the near term.
- [Release notes](https://github.com/NVIDIA/nvidia-container-toolkit/releases) - [Changelog](https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/CHANGELOG.md) - [Commits](NVIDIA/nvidia-container-toolkit@v1.18.0-rc.5...v1.18.0-rc.6) --- updated-dependencies: - dependency-name: github.com/NVIDIA/nvidia-container-toolkit dependency-version: 1.18.0-rc.6 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> commit 11f6c02 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sun Oct 12 17:02:18 2025 +0000 build(deps): bump golang.org/x/time from 0.9.0 to 0.14.0 Bumps [golang.org/x/time](https://github.com/golang/time) from 0.9.0 to 0.14.0. - [Commits](golang/time@v0.9.0...v0.14.0) --- updated-dependencies: - dependency-name: golang.org/x/time dependency-version: 0.14.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> commit c614e61 Merge: a79a9fd 4ced422 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Sat Oct 11 16:54:20 2025 +0200 Merge pull request NVIDIA#633 from jgehrcke/jp/verbosity-vs-debuggability-improvements Add `logVerbosity` Helm chart parameter, reduce default log verbosity commit 4ced422 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Sat Oct 11 14:45:52 2025 +0000 Remove newline, document env-based log verb flip Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 4cf3d9b Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Fri Oct 10 17:57:00 2025 +0000 Fix a typo in an error message Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 2c943f7 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Sat Oct 11 13:48:35 2025 +0000 tests: remove sinful duplicate env strategy This also had a side effect on subsequent tests, with the controller starting with _no_ LOG_VERBOSITY environment variable set. 
I don't understand that, but that must be a funky Helm-ism. Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 3d5c51f Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Sat Oct 11 13:01:21 2025 +0000 tests: fix: wait for controller flip Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit b172342 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Sat Oct 11 12:21:07 2025 +0000 tests: replace hard-coded sleep with dynamic wait Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 9748095 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Fri Oct 10 19:36:24 2025 +0000 tests: cover CD daemon log levels Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 4767092 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 11:38:40 2025 +0000 Helm logVerbosity param: add docs, start building tests Helm values.yaml: defaultLogVerbosity incl. docs Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> values.yaml: tweak, based on in log level insights Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> improve helm chart artifact commentary Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> squash: tweak docs Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Rename chart var, start building tests Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> tests: cover log verbosity set per-component via env Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> helm: rename defaultLogVerbosity to logVerbosity Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 3828da9 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Fri Oct 10 18:28:50 2025 +0000 CD daemon: change verbosity of "wait for nodes update" message Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 6d35ac1 Author: Dr. 
Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Fri Oct 10 17:56:31 2025 +0000 CD controller: make CD daemon verbosity a required arg Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 84530ab Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 14:07:33 2025 +0000 CD controller: log manager config on startup Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit bb16c33 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 11:31:36 2025 +0000 CD controller/plugins/daemon: introduce LOG_VERBOSITY Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 7e89b22 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 11:29:46 2025 +0000 CD controller: introduce LOG_VERBOSITY_CD_DAEMON Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit c5b147b Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 14:10:33 2025 +0000 tests: add note about instability around chart flip Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 4cc705a Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 11:46:10 2025 +0000 Helm: expose kubelet plugin env via chart variables Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 5f143b2 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 15:53:07 2025 +0000 Upper-case log msg, no explicit verb 0 Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 8321983 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Sep 30 12:16:17 2025 +0000 Change log message levels according to new system Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit a36e214 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Sep 30 12:12:38 2025 +0000 Add logVerbosity Helm chart parameter Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit a79a9fd Merge: 3903df7 6e56823 Author: Dr. 
Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Sat Oct 11 13:43:17 2025 +0200 Merge pull request NVIDIA#646 from jgehrcke/jp/no-clique-update-cd-node-status Release workload on a non-MNNVL node in a CD commit 6e56823 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Fri Oct 10 19:47:48 2025 +0000 CD plugin: move CDI edit gen into computeDomainDaemonSettings Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> make diff smaller, rename func Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit f7e4a45 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Fri Oct 10 16:30:27 2025 +0000 CD daemon: always mount in IMEX daemon config files Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> CD plugin: always prepare IMEX config on the host and mount it in Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit c040429 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Fri Oct 10 15:59:07 2025 +0000 Fix typos in comments and log message Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit deccb4d Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 11:39:33 2025 +0000 CD plugin: always inject CD details via CDI Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Rename 'domain' to 'domainID' Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> squash: review feedback Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> shorten comment Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 023e7f9 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 11:38:23 2025 +0000 Enrich error message with CD detail when CD not found Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 32180ad Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 11:37:43 2025 +0000 CD daemon: unconditionally write IMEX daemon config Signed-off-by: Dr. 
Jan-Philip Gehrcke <jgehrcke@nvidia.com> Break out of select/case, MkdirAll() before writing file Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 13df4da Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 09:58:50 2025 +0000 CD daemon: init node status as NotReady, misc log msg & comment tweaks Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 3cbd5a4 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 09:55:47 2025 +0000 CD daemon: keep business logic in no-IMEX-daemon noop mode Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit fffcea2 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 09:50:30 2025 +0000 Introduce maxNodesPerIMEXDomain special case for empty cliqueID Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit e0b8990 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 09:49:07 2025 +0000 Update code comments Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 3903df7 Merge: 14dc9fe 72e39e9 Author: Kevin Klues <kklues@nvidia.com> Date: Fri Oct 10 13:22:31 2025 +0200 Merge pull request NVIDIA#661 from jgehrcke/jp/flush-logs-on-shutdown Flush logs in CLI app `After` hook commit 14dc9fe Merge: 8788dd1 d34a12f Author: Kevin Klues <kklues@nvidia.com> Date: Fri Oct 10 13:16:53 2025 +0200 Merge pull request NVIDIA#656 from jgehrcke/jp/custom-rate-limiting Introduce DefaultPrepUnprepRateLimiter (less aggressive) commit 8788dd1 Merge: 23d205f 0770c0a Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Fri Oct 10 12:43:33 2025 +0200 Merge pull request NVIDIA#666 from klueska/rbac-update Separate controller and kubeletplugin into separate RBAC permissions commit 0770c0a Author: Kevin Klues <kklues@nvidia.com> Date: Thu Oct 9 13:41:03 2025 +0000 Separate controller and kubeletplugin into separate RBAC permissions Signed-off-by: Kevin Klues <kklues@nvidia.com> commit 23d205f Merge: fca1c08 816c7a1 Author: Dr. 
Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 10:01:28 2025 +0200 Merge pull request NVIDIA#664 from NVIDIA/dependabot/docker/deployments/container/main/nvidia/distroless/cc-v3.1.13-dev build(deps): bump nvidia/distroless/cc from v3.1.12-dev to v3.1.13-dev in /deployments/container commit fca1c08 Merge: e089759 b15d633 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Thu Oct 9 09:56:21 2025 +0200 Merge pull request NVIDIA#665 from NVIDIA/dependabot/docker/deployments/devel/main/golang-1.25.2 build(deps): bump golang from 1.25.1 to 1.25.2 in /deployments/devel commit b15d633 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Oct 8 17:15:07 2025 +0000 build(deps): bump golang from 1.25.1 to 1.25.2 in /deployments/devel Bumps golang from 1.25.1 to 1.25.2. --- updated-dependencies: - dependency-name: golang dependency-version: 1.25.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> commit 816c7a1 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Oct 8 17:15:03 2025 +0000 build(deps): bump nvidia/distroless/cc in /deployments/container Bumps nvidia/distroless/cc from v3.1.12-dev to v3.1.13-dev. --- updated-dependencies: - dependency-name: nvidia/distroless/cc dependency-version: v3.1.13-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> commit 72e39e9 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Sep 30 09:42:57 2025 +0000 Flush logs in CLI app `After` hook Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit d34a12f Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Wed Oct 8 15:18:53 2025 +0200 Adjust go.mod to recent changes Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 7e18c33 Author: Dr. 
Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Sep 30 12:18:41 2025 +0000 Introduce DefaultPrepUnprepRateLimiter (less aggressive) Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit e089759 Merge: 765892d e9f647e Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Wed Oct 8 13:09:32 2025 +0200 Merge pull request NVIDIA#651 from jgehrcke/jp/issue-694 CD daemon: coordinate CD updates on shutdown via mutation cache commit e9f647e Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 17:55:26 2025 +0000 tests: cover CD daemon cleanup-on-shutdown Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 980a6a1 Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 17:06:42 2025 +0000 CD daemon: pod mngr: store UpdateStatus return value in mutation cache This makes sure that fast incremental mutations on the same CD object performed during shutdown are done conflict-free (i.e., in actual, incremental fashion using intermediate state returned by the API server). Without this patch: I1007 16:49:01.678050 1 podmanager.go:196] Successfully updated node gb-nvl-043-compute06 status to NotReady E1007 16:49:01.681345 1 computedomain.go:161] Failed to remove node from ComputeDomain during shutdown: [...] \ "the object has been modified" [...] With this patch: I1007 16:59:55.350436 1 podmanager.go:200] Successfully updated node gb-nvl-043-compute07 status to NotReady I1007 16:59:55.353551 1 computedomain.go:402] Successfully removed node with IP 192.168.34.153 from ComputeDomain default/imex-channel-injection Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 4b91fce Author: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com> Date: Tue Oct 7 15:50:06 2025 +0000 CD daemon: coordinate CD updates on shutdown via mutationcache Signed-off-by: Dr. 
Jan-Philip Gehrcke <jgehrcke@nvidia.com> commit 765892d Merge: 2b7e899 754a758 Author: Kevin Klues <kklues@nvidia.com> Date: Wed Oct 8 09:52:51 2025 +0200 Merge pull request NVIDIA#650 from NVIDIA/dependabot/github_actions/github/codeql-action-4 build(deps): bump github/codeql-action from 3 to 4 commit 754a758 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue Oct 7 17:08:56 2025 +0000 build(deps): bump github/codeql-action from 3 to 4 Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3 to 4. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@v3...v4) --- updated-dependencies: - dependency-name: github/codeql-action dependency-version: '4' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>
Resolves #609.
This PR has a set of related changes that we can also discuss in separate PRs if preferred:
- For better debuggability: log full component config (`flags` object) upon component startup (with `go-render` to quickly get to a stringified version of a more or less complex, nested struct -- open to using a different strategy). Edit: now here: Log full startup config in all CLIs in `Before` hook #658
- For control: introduction of a component-global `logVerbosity` Helm chart parameter, including documentation laying out the starting point for a verbosity system (comments very welcome).
- For less noise in default config: reduced default log verbosity (see `logVerbosity` level documentation).
- For robustness: explicit log flushing as part of component shutdown (I think we missed this so far). Edit: now here: Flush logs in CLI app `After` hook #661

One interesting change that I propose here is to not have those "updated/added object callback" confirmations logged on the default log level in the CD controller -- I think we should try to not scale log volume with the number of objects created (at least in default config).
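To illustrate the intent (not the actual implementation in this PR), here is a minimal, stdlib-only sketch. It assumes the verbosity arrives via the `LOG_VERBOSITY` environment variable, falls back to a default of 1, logs the startup config on level 0, and gates per-object confirmations behind a higher level. The names `parseVerbosity` and `leveledLogger`, the fallback on invalid input, and the choice of level 2 for per-object messages are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultVerbosity is the assumed default when LOG_VERBOSITY is not set.
const defaultVerbosity = 1

// parseVerbosity turns the raw environment value into a verbosity level.
// Empty, non-numeric, or negative input falls back to the default
// (hypothetical policy, chosen for this sketch only).
func parseVerbosity(raw string) int {
	if raw == "" {
		return defaultVerbosity
	}
	v, err := strconv.Atoi(raw)
	if err != nil || v < 0 {
		return defaultVerbosity
	}
	return v
}

// leveledLogger is a hypothetical stand-in for a klog-style leveled logger.
type leveledLogger struct {
	verbosity int
}

// enabled reports whether a message at the given level would be emitted.
func (l leveledLogger) enabled(level int) bool {
	return level <= l.verbosity
}

// logf writes the message to stderr only if its level is enabled.
func (l leveledLogger) logf(level int, format string, args ...any) {
	if l.enabled(level) {
		fmt.Fprintf(os.Stderr, format+"\n", args...)
	}
}

func main() {
	log := leveledLogger{verbosity: parseVerbosity(os.Getenv("LOG_VERBOSITY"))}

	// Startup config is emitted on level 0, i.e. always.
	log.logf(0, "Startup config: verbosity=%d", log.verbosity)

	// Per-object confirmations sit on an assumed level 2, so that log
	// volume does not grow with the number of objects in default config.
	for i := 0; i < 3; i++ {
		log.logf(2, "Object %d added", i)
	}
}
```

With no environment variable set, only the startup line is written; running with `LOG_VERBOSITY=2` also shows the per-object lines.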