What you would like to be added?
I would like the `CustomTrainer` API in the Kubeflow Trainer SDK to support running a Python script directly by specifying a file path (e.g., `python myscript.py`), instead of requiring users to provide a Python function.
Proposed API:
`CustomTrainer(python_file="run_kubernetes.py", ...)`
- If `python_file` is provided, the SDK should set the container entrypoint to `["python", "run_kubernetes.py"]` (or the specified file).
- No function serialization, no wrapper scripts, no subprocesses, and no use of `runpy` inside another Python process.
- This should be mutually exclusive with the existing `func` argument.
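A minimal sketch of the proposed behavior follows. It uses a hypothetical `CustomTrainerSketch` class and `build_entrypoint` helper, neither of which is part of the Kubeflow Trainer SDK. The sketch rejects configurations that set both `func` and `python_file`, and it maps `python_file` straight to the container command. It also assumes that at least one of the two must be set.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CustomTrainerSketch:
    """Hypothetical stand-in for CustomTrainer; NOT the real Kubeflow Trainer SDK class."""

    func: Optional[Callable] = None
    python_file: Optional[str] = None

    def __post_init__(self) -> None:
        # Proposal: `func` and `python_file` are mutually exclusive.
        # Assumption of this sketch: exactly one of them must be provided.
        if (self.func is None) == (self.python_file is None):
            raise ValueError("Specify exactly one of `func` or `python_file`.")


def build_entrypoint(trainer: CustomTrainerSketch) -> List[str]:
    """Return the container command for the trainer (hypothetical helper)."""
    if trainer.python_file is not None:
        # Run the script directly as the container's main process:
        # no serialization, no wrapper script, no subprocess, no runpy.
        return ["python", trainer.python_file]
    # The existing function-serialization path is out of scope for this sketch.
    raise NotImplementedError("func-based entrypoint is handled by existing SDK logic")
```

For example, `build_entrypoint(CustomTrainerSketch(python_file="run_kubernetes.py"))` returns `["python", "run_kubernetes.py"]`, while passing both `func` and `python_file` raises `ValueError`.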
Why is this needed?
- Simplicity & Familiarity: Most users already have training scripts and expect to run them as `python myscript.py`, just as they do locally or in YAML-based jobs.
- Avoids Indirection: The current approach requires wrapping code in a function, which is then serialized, deserialized, and run via a generated entrypoint script. This is convoluted, harder to debug, and not how most ML workflows are structured.
- Better UX: Direct script execution is more transparent, easier to reason about, and matches user expectations from other ML platforms and Kubernetes YAML jobs.
- Migration Path: This makes it much easier for users to migrate from script-based workflows (e.g., Slurm, bash, or direct Kubernetes Jobs) to Kubeflow Trainer.
- Cleaner Container Lifecycle: Running the script as the main process ensures correct signal handling, exit codes, and resource cleanup.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.