[core] periodically reload Ray service account tokens#60778
[core] periodically reload Ray service account tokens#60778andrewsykim wants to merge 9 commits intoray-project:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to periodically reload Ray service account tokens, which is a crucial improvement for Kubernetes deployments where tokens expire. The implementation correctly uses std::chrono::steady_clock to track the token's age and invalidate the cache when necessary. My review focuses on improving code maintainability by addressing duplicated logic. Specifically, the token expiration check is duplicated in GetToken and TryLoadToken and should be refactored into a single helper function. I've also noted a minor style improvement for a constant declaration and a misleading comment that should be clarified.
| "or store the token in any file and set RAY_AUTH_TOKEN_PATH to point to it, " | ||
| "or set the RAY_AUTH_TOKEN environment variable."; | ||
|
|
||
| const int32_t kRaySATokenTTLSeconds = 300; |
There was a problem hiding this comment.
| // If k8s token auth is enabled, revoke cached token as Kubelet | ||
| // will expire and auto rotate new service account tokens every hour by default. | ||
| // Use 5 minutes as a default as users can configure the expiration time. | ||
| if (IsK8sTokenAuthEnabled()) { | ||
| if (cached_token_ && std::chrono::steady_clock::now() - last_load_time_ > | ||
| std::chrono::seconds(kRaySATokenTTLSeconds)) { | ||
| cached_token_ = nullptr; | ||
| } | ||
| } |
There was a problem hiding this comment.
This logic for expiring the cached token is duplicated in TryLoadToken (lines 100-108). To improve maintainability and avoid future inconsistencies, consider extracting it into a private helper function that can be called from both methods after acquiring the lock.
Additionally, the comment on line 60 is a bit misleading as it suggests the expiration time is user-configurable, while kRaySATokenTTLSeconds is a constant. If configurability is not intended, consider updating the comment to something like: // Use a 5-minute interval for reloading to ensure the token is fresh, as Kubelet rotates service account tokens every hour by default.
There was a problem hiding this comment.
I think this is fine to ignore, there's already some duplication between GetToken and TryLoadToken, not sure it's worth refactoring
| // If k8s token auth is enabled, revoke cached token as Kubelet | ||
| // will expire and auto rotate new service account tokens every hour by default. | ||
| // Use 5 minutes as a default as users can configure the expiration time. | ||
| if (IsK8sTokenAuthEnabled()) { | ||
| if (cached_token_ && std::chrono::steady_clock::now() - last_load_time_ > | ||
| std::chrono::seconds(kRaySATokenTTLSeconds)) { | ||
| cached_token_ = nullptr; | ||
| } | ||
| } |
There was a problem hiding this comment.
I think this is fine to ignore, there's already some duplication between GetToken and TryLoadToken, not sure it's worth refactoring
ba63f2a to
b1d659f
Compare
| "or store the token in any file and set RAY_AUTH_TOKEN_PATH to point to it, " | ||
| "or set the RAY_AUTH_TOKEN environment variable."; | ||
|
|
||
| constexpr int kRaySATokenTTLSeconds = 300; |
There was a problem hiding this comment.
is it possible that the token is loaded right before it is rotated by k8s? in this case all RPC calls using the cached token would fail for 5min.
There was a problem hiding this comment.
k8s serviceaccount tokens are JWT's and JWT tokens usually have an exp field which gives us an idea of the tokens validity. can we rely on this instead of a blanket 5min TTL?
https://kubernetes.io/docs/concepts/security/service-accounts/#authenticating-credentials
There was a problem hiding this comment.
also lets add a unit test for this
There was a problem hiding this comment.
In the case of the default 1 hour expiration, I believe a new token will be automounted well before 5 minutes. But agree that checking the expiration in the token is better. I believe Go clients watch for filesystem changes to the token path but I wasn't srue how to implement that in C++ :)
Will look into this a bit and try to add unit tests
There was a problem hiding this comment.
@sampan-s-nayak added logic to check expiration in the JWT token, but kept a buffer so we attempt reload at least 5 minute before the expiration. Also added a unit test
b1d659f to
fe96dcc
Compare
86f2c22 to
2125d90
Compare
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
2125d90 to
17f987a
Compare
17f987a to
c855f26
Compare
…iration when k8s token auth is enabled Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
| if (exp) { | ||
| cached_token_expiration_time_ = | ||
| *exp - std::chrono::seconds(kRaySATokenExpirationBufferSeconds); | ||
| } else { |
There was a problem hiding this comment.
Token thrashing when remaining lifetime under buffer
High Severity
When a K8s token has less than 5 minutes remaining before expiration, cached_token_expiration_time_ is set to a past time (token's exp minus 300 seconds). This causes every subsequent GetToken() call to immediately invalidate the cached token and reload it from the filesystem. In busy systems making frequent RPC calls, this results in excessive filesystem reads until the token is actually rotated by Kubernetes, defeating the caching mechanism and potentially causing performance degradation.
There was a problem hiding this comment.
This won't be a problem as Kubernetes enforces a minimum expiration time of 10 minutes
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>


Description
Ray token auth (with k8s) loads a projected Kubernetes service account token in
/var/run/secrets/ray.io/serviceaccount/token. However, this token expires every hour by default. This PR updates Ray to periodically reload this token every 5 minutes to ensure expired service tokens are not cached.Related issues
Additional information