How to troubleshoot feature.d changes? #2160
-
@cmontemuino looks strange, indeed. Could you run nfd-worker with @ArangoGutierrez PTAL
-
@marquiz thanks for the help! I took another cluster where the same problem is present, just to confirm it's not a one-off case. According to the worker's logs, the features.d file is properly loaded and the NodeFeature object is updated:
I notice the labels do not get updated. The worker keeps logging "feature discovery completed", but the "updating NodeFeature object" message does not show up again. So I've removed a couple of labels to verify, including annotation Also noticed
I also increased the master's logging and this is what I see:
I've removed labels and annotations as before, and then the master's logs show the following:
All values reverted back. Where is the NodeFeature object being stored? Perhaps I could take a look and see if it has outdated info in it.
-
🗣️ Found the issue: Lesson learnt: check whether there are multiple NodeFeature resources for the same node in different namespaces. @marquiz, perhaps an action point for the project would be to log the name and namespace of the NodeFeature resource that's being updated/read. I'll close the issue.
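For anyone hitting the same symptom, here is a minimal sketch of that check in Python. It assumes the input is the `items` list from `kubectl get nodefeatures.nfd.k8s-sigs.io --all-namespaces -o json`, and that each NodeFeature object carries the `nfd.node.kubernetes.io/node-name` label identifying its node (falling back to the object name if the label is missing). The function name `find_duplicate_nodefeatures` is made up for this example.

```python
from collections import defaultdict

# Label assumed to identify the target node on a NodeFeature object.
NODE_NAME_LABEL = "nfd.node.kubernetes.io/node-name"


def find_duplicate_nodefeatures(items):
    """Map node name -> sorted "namespace/name" refs, for nodes that have
    NodeFeature objects in more than one namespace."""
    by_node = defaultdict(set)
    for item in items:
        meta = item.get("metadata") or {}
        labels = meta.get("labels") or {}
        # Fall back to the object name when the node-name label is absent.
        node = labels.get(NODE_NAME_LABEL, meta.get("name", ""))
        by_node[node].add((meta.get("namespace", ""), meta.get("name", "")))

    duplicates = {}
    for node, refs in by_node.items():
        namespaces = {ns for ns, _ in refs}
        if len(namespaces) > 1:
            duplicates[node] = sorted(f"{ns}/{name}" for ns, name in refs)
    return duplicates
```

To use it, save the kubectl output to a file, load it with `json.load`, and pass the `"items"` list to the function. Any node that shows up in the result has competing NodeFeature objects, and a stale one may be what keeps restoring the old labels.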
-
I don't get why NFD keeps adding labels to some of my GPU nodes, when it shouldn't. What I need is some pointers to identify the root cause.
If I get the labels from a node and ...| grep nvidia:
These are the labels that do not make sense from above:
So, I've removed the labels and added this one:
nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
Note: I've also tried to remove annotation nfd.node.kubernetes.io/feature-labels first.
👉 After a few seconds, everything gets reverted back, and this is what I don't understand 🤔.
This is a GPU node and I have the gpu-operator running behind the scenes. This operator has a DaemonSet that writes a set of labels to a file in a directory, from which NFD then applies them to the node. Logs:
I0519 09:40:50.420211 51 output.go:82] Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I0519 09:40:50.420737 51 main.go:287] Sleeping for 30000000000
Looking at that file:
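(The file contents are not reproduced here.) As a rough aid for comparing such a file against what actually ends up on the node, the sketch below parses a feature file and reports mismatches. It assumes the plain `key=value` per line format, with `#` comment lines skipped and a bare key defaulting to the value `true`; the helper names `parse_feature_file` and `diff_labels` are made up for this example, and any label prefixing done by NFD is not modelled.

```python
def parse_feature_file(text):
    """Parse key=value lines from a features.d file into a dict."""
    labels = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, sep, value = line.partition("=")
        # Assumption: a key without "=" is treated as having the value "true".
        labels[key.strip()] = value.strip() if sep else "true"
    return labels


def diff_labels(expected, actual):
    """Return {key: (expected_value, actual_value)} for every mismatch.

    actual_value is None when the label is missing from the node.
    """
    return {
        key: (value, actual.get(key))
        for key, value in expected.items()
        if actual.get(key) != value
    }
```

Here `expected` would come from the file under features.d and `actual` from the node's `metadata.labels`. If the file is correct but the node still differs, the stale values are coming from somewhere other than this file.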
Logs from nfd-worker running in the node look pretty normal:
Logs from nfd-master:
ConfigMaps:
I'm using version v0.17.3.
🥺 Any help is highly appreciated!!