-
Notifications
You must be signed in to change notification settings - Fork 20
Convert checkpointing annotations to env vars #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughA new constant for a checkpoint annotation prefix was introduced. The Changes
Sequence Diagram(s)sequenceDiagram
participant PyTorchJob
participant Controller
participant Pod
participant Container
PyTorchJob->>Controller: Submit PyTorchJob with annotations
Controller->>Controller: extractCheckpointEnvVars(PyTorchJob)
Controller->>Pod: Create Pod spec
loop For each Container
Controller->>Container: Inject env vars from checkpoint annotations
end
Pod->>Container: Start with injected env vars
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
Note 🔌 MCP (Model Context Protocol) integration is now available in Early Access!Pro users can now connect to remote MCP servers under the Integrations page to get reviews and chat conversations that understand additional development context. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/controller.v1/pytorch/envvar.go (2)
36-38
: Exported-constant comment doesn’t match identifierGo lint expects the comment to start with the constant’s name.
Rename the comment to avoidgolint
noise.- // CheckpointAnnotationEnvVarPrefix is the prefix of the environment variable name for the annotation. + // CheckpointAnnotationPrefix is the prefix used to detect checkpoint-config annotations.
180-196
: Minor: compute once & deduplicate
extractCheckpointEnvVars
iterates over annotations each time it’s called; invoked inside a loop this repeats work and returns identical slices.
Consider:checkpointEnvVars := extractCheckpointEnvVars(pytorchjob) envKeys := map[string]struct{}{} for _, ev := range checkpointEnvVars { envKeys[ev.Name] = struct{}{} } for i := range podTemplateSpec.Spec.Containers { // ... for _, ev := range checkpointEnvVars { if _, exists := envKeys[ev.Name]; !exists { podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, ev) } } }This avoids redundant work and prevents accidental duplicates.
Not critical, but improves efficiency and robustness.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pkg/controller.v1/pytorch/envvar.go
(3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pkg/controller.v1/pytorch/envvar.go (1)
pkg/apis/kubeflow.org/v1/pytorch_types.go (1)
PyTorchJob
(56-68)
|
||
// Inject checkpointing environment variables from annotations. | ||
checkpointEnvVars := extractCheckpointEnvVars(pytorchjob) | ||
if len(checkpointEnvVars) > 0 { | ||
podTemplateSpec.Spec.Containers[i].Env = append( | ||
podTemplateSpec.Spec.Containers[i].Env, checkpointEnvVars...) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Duplicate-name env vars can break pod creation
extractCheckpointEnvVars
is called for every container and its slice is blindly appended.
If an annotation produces WORLD_SIZE
, RANK
, etc., the later code appends the same names again, and the API server rejects the pod with “duplicate env var name”.
Guard against clashes (e.g. build a map[string]struct{}
before appending or skip/override) or place the annotation-derived vars after all built-ins so they intentionally override.
🤖 Prompt for AI Agents
In pkg/controller.v1/pytorch/envvar.go around lines 56 to 63, the code appends
checkpoint environment variables to each container's Env slice without checking
for duplicates, causing pod creation to fail due to duplicate env var names. To
fix this, before appending checkpointEnvVars, build a map of existing env var
names in the container to detect duplicates, then either skip adding duplicates
or override existing entries. Alternatively, append the annotation-derived vars
after all built-in vars to intentionally override duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add something to stop built-in env variables like RANK etc from being propogated if they are added to annotations with the checkpointing prefix - if this is necessary. wdyt @efazal @astefanutti @sutaakar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Replaced by #75 |
RHOAIENG-29321
What this PR does / why we need it:
This PR adds support for utilising annotations to propogate checkpointing related env vars to the PyTorchJob pod. The annotations are specified in the PyTorchJob yaml (via create_job api) for example:
The values are extracted and injected as environment variables:
The env vars can be accessed in the training function:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #RHOAIENG-29321
Checklist:
Summary by CodeRabbit