Skip to content

[core] periodically reload Ray service account tokens#60778

Open
andrewsykim wants to merge 9 commits intoray-project:masterfrom
andrewsykim:ray-reload-cached-token
Open

[core] periodically reload Ray service account tokens#60778
andrewsykim wants to merge 9 commits intoray-project:masterfrom
andrewsykim:ray-reload-cached-token

Conversation

@andrewsykim
Copy link
Member

Description

Ray token auth (with k8s) loads a projected Kubernetes service account token in /var/run/secrets/ray.io/serviceaccount/token. However, this token expires every hour by default. This PR updates Ray to periodically reload this token every 5 minutes to ensure expired service tokens are not cached.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@andrewsykim andrewsykim requested a review from a team as a code owner February 5, 2026 18:21
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to periodically reload Ray service account tokens, which is a crucial improvement for Kubernetes deployments where tokens expire. The implementation correctly uses std::chrono::steady_clock to track the token's age and invalidate the cache when necessary. My review focuses on improving code maintainability by addressing duplicated logic. Specifically, the token expiration check is duplicated in GetToken and TryLoadToken and should be refactored into a single helper function. I've also noted a minor style improvement for a constant declaration and a misleading comment that should be clarified.

"or store the token in any file and set RAY_AUTH_TOKEN_PATH to point to it, "
"or set the RAY_AUTH_TOKEN environment variable.";

const int32_t kRaySATokenTTLSeconds = 300;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For a compile-time constant like this, it's preferable to use constexpr instead of const. This allows the value to be used in more compile-time contexts and makes the intent clearer.

Suggested change
const int32_t kRaySATokenTTLSeconds = 300;
constexpr int32_t kRaySATokenTTLSeconds = 300;

Comment on lines 58 to 110
// If k8s token auth is enabled, revoke cached token as Kubelet
// will expire and auto rotate new service account tokens every hour by default.
// Use 5 minutes as a default as users can configure the expiration time.
if (IsK8sTokenAuthEnabled()) {
if (cached_token_ && std::chrono::steady_clock::now() - last_load_time_ >
std::chrono::seconds(kRaySATokenTTLSeconds)) {
cached_token_ = nullptr;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic for expiring the cached token is duplicated in TryLoadToken (lines 100-108). To improve maintainability and avoid future inconsistencies, consider extracting it into a private helper function that can be called from both methods after acquiring the lock.

Additionally, the comment on line 60 is a bit misleading as it suggests the expiration time is user-configurable, while kRaySATokenTTLSeconds is a constant. If configurability is not intended, consider updating the comment to something like: // Use a 5-minute interval for reloading to ensure the token is fresh, as Kubelet rotates service account tokens every hour by default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine to ignore, there's already some duplication between GetToken and TryLoadToken, not sure it's worth refactoring

Comment on lines 100 to 159
// If k8s token auth is enabled, revoke cached token as Kubelet
// will expire and auto rotate new service account tokens every hour by default.
// Use 5 minutes as a default as users can configure the expiration time.
if (IsK8sTokenAuthEnabled()) {
if (cached_token_ && std::chrono::steady_clock::now() - last_load_time_ >
std::chrono::seconds(kRaySATokenTTLSeconds)) {
cached_token_ = nullptr;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This is a duplicate of the token expiration logic found in GetToken. As mentioned in my other comment, this should be extracted to a private helper function to avoid code duplication and improve maintainability.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine to ignore, there's already some duplication between GetToken and TryLoadToken, not sure it's worth refactoring

@ray-gardener ray-gardener bot added the community-contribution Contributed by the community label Feb 5, 2026
@andrewsykim andrewsykim force-pushed the ray-reload-cached-token branch from ba63f2a to b1d659f Compare February 5, 2026 21:30
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Feb 5, 2026
"or store the token in any file and set RAY_AUTH_TOKEN_PATH to point to it, "
"or set the RAY_AUTH_TOKEN environment variable.";

constexpr int kRaySATokenTTLSeconds = 300;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible that the token is loaded right before it is rotated by k8s? in this case all RPC calls using the cached token would fail for 5min.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k8s serviceaccount tokens are JWT's and JWT tokens usually have an exp field which gives us an idea of the tokens validity. can we rely on this instead of a blanket 5min TTL?

https://kubernetes.io/docs/concepts/security/service-accounts/#authenticating-credentials

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also lets add a unit test for this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of the default 1 hour expiration, I believe a new token will be automounted well before 5 minutes. But agree that checking the expiration in the token is better. I believe Go clients watch for filesystem changes to the token path but I wasn't srue how to implement that in C++ :)

Will look into this a bit and try to add unit tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sampan-s-nayak added logic to check expiration in the JWT token, but kept a buffer so we attempt reload at least 5 minute before the expiration. Also added a unit test

@andrewsykim andrewsykim force-pushed the ray-reload-cached-token branch from b1d659f to fe96dcc Compare February 6, 2026 19:24
@andrewsykim andrewsykim force-pushed the ray-reload-cached-token branch from 86f2c22 to 2125d90 Compare February 9, 2026 14:11
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
@andrewsykim andrewsykim force-pushed the ray-reload-cached-token branch from 2125d90 to 17f987a Compare February 9, 2026 15:47
@andrewsykim andrewsykim force-pushed the ray-reload-cached-token branch from 17f987a to c855f26 Compare February 9, 2026 16:51
…iration when k8s token auth is enabled

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
if (exp) {
cached_token_expiration_time_ =
*exp - std::chrono::seconds(kRaySATokenExpirationBufferSeconds);
} else {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Token thrashing when remaining lifetime under buffer

High Severity

When a K8s token has less than 5 minutes remaining before expiration, cached_token_expiration_time_ is set to a past time (token's exp minus 300 seconds). This causes every subsequent GetToken() call to immediately invalidate the cached token and reload it from the filesystem. In busy systems making frequent RPC calls, this results in excessive filesystem reads until the token is actually rotated by Kubernetes, defeating the caching mechanism and potentially causing performance degradation.

Fix in Cursor Fix in Web

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be a problem as Kubernetes enforces a minimum expiration time of 10 minutes

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

3 participants