Cache and print devices for debugging future outages #2141
Open
cemakd wants to merge 23 commits into kubernetes-sigs:master from cemakd:logs-for-device-mappings
Changes from 22 commits
23 commits
dba53e0 (julianKatz) Cache devices and their symlinks in node driver, periodically noting
d3a824e (julianKatz) Some doc comment updates
485beaf (julianKatz) Add unit tests
afb2393 (julianKatz) improve partition unit test
643817f (julianKatz) Log on removal as well
fa2d2f9 (julianKatz) Updated unit tests to be clearer, relying on asserting linkCache
95163b7 (julianKatz) Remove unused broken function
8d8d926 (julianKatz) Move partition checking into the inner linkcache type. This makes it
448b0ba (julianKatz) Log when linkcache Run is triggered
2ef351c (julianKatz) New implementation that is hooked into nodestage/unstage. Just linux
9ab0d2f (julianKatz) Made a no-op windows implementation of the linkcache package
b88e5f4 (julianKatz) Made test device caches in node_test.go
d76c44c (julianKatz) Fix sanity test
4abd540 (julianKatz) Only warn on failure to create cache
c4b69f3 (julianKatz) Only warn on windows instantiation
042176a (julianKatz) Make non-implemented on windows an info
bc8defa (julianKatz) Improved some error messages to provide better test failure feedback
170e24b (julianKatz) Always print helpful logs in failing area
be4b045 (julianKatz) Remove now unnecessary corp-helper when running from cloudtop
6ef07a2 (julianKatz) Only run device cache if successfully created
3437efd (julianKatz) Replace verbosities
3824cbb (cemakd) Add nil checks around the usage of the device cache
61aeffd (cemakd) Add support for NVMe disk types by using deviceutils
@@ -0,0 +1,49 @@
package k8sclient

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

func GetNodeWithRetry(ctx context.Context, nodeName string) (*v1.Node, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	kubeClient, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	return getNodeWithRetry(ctx, kubeClient, nodeName)
}

func getNodeWithRetry(ctx context.Context, kubeClient *kubernetes.Clientset, nodeName string) (*v1.Node, error) {
	var nodeObj *v1.Node
	backoff := wait.Backoff{
		Duration: 1 * time.Second,
		Factor:   2.0,
		Steps:    5,
	}
	err := wait.ExponentialBackoffWithContext(ctx, backoff, func(_ context.Context) (bool, error) {
		node, err := kubeClient.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			klog.Warningf("Error getting node %s: %v, retrying...\n", nodeName, err)
			return false, nil
		}
		nodeObj = node
		klog.V(4).Infof("Successfully retrieved node info %s\n", nodeName)
		return true, nil
	})

	if err != nil {
		klog.Errorf("Failed to get node %s after retries: %v\n", nodeName, err)
	}
	return nodeObj, err
}
Can we make the time period configurable so that we can adjust it if it ends up spamming too many logs? In fact I would probably start the logging at 10 mins initially.
In the future, if we want to turn this into a metric, I would be more comfortable reducing the polling period but removing logs to avoid log spam.
Done, added a flag `disk-cache-sync-period` with a default value of 10 minutes when unspecified.