trainer: Flux policy user-guide by vsoch · Pull Request #4283 · kubeflow/website

vsoch · 2026-01-19T02:18:43Z

Description of Changes

This changeset adds documentation (a user guide) for using the Flux Policy in the Kubeflow Trainer. The example runs a popular simulation, LAMMPS, on CPU. A GPU example is desired and will likely be added in a separate changeset.

Related Issues

This is linked with a pull request to the trainer:

Related: kubeflow/trainer#3064.

I did not open an issue here (and can if needed, please let me know).

Checklist

You have signed off your commits
Ensure you follow best practices from our contributing guide.
(for big changes) I will post screenshots of the changes in a PR comment

cc @milroy

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

google-oss-prow · 2026-01-19T02:18:53Z

Hi @vsoch. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

github-actions · 2026-01-19T02:19:04Z

🚫 This command cannot be processed. Only organization members or owners can use the commands.

Arhell

/ok-to-test

andreyvelich · 2026-03-10T10:03:55Z

@vsoch Can you update this guide for Flux in Trainer please?

vsoch · 2026-03-10T19:56:55Z

Sure thing - I'll bring up a cluster on AWS today and test out using the (now merged) main branch to run it.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

google-oss-prow · 2026-03-11T01:39:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

content/en/docs/components/trainer/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vsoch · 2026-03-11T01:40:23Z

@andreyvelich I've updated the demo after testing on AWS, and it now uses assets from the trainer repository directly. Do you want me to include an example that uses AWS EFA? Apologies, because I think I asked this before, but does the Kubeflow Trainer support one-off resources like EFA?

vsoch · 2026-03-11T01:50:58Z

Found it! Putting here so I remember next time. 🙃

$ kubectl explain trainjob.spec.trainer.resourcesPerNode
GROUP:      trainer.kubeflow.org
KIND:       TrainJob
VERSION:    v1alpha1

FIELD: resourcesPerNode <Object>


DESCRIPTION:
    resourcesPerNode defines the compute resources for each training node.
    
FIELDS:
  claims	<[]Object>
    Claims lists the names of resources, defined in spec.resourceClaims,
    that are used by this container.
    
    This field depends on the
    DynamicResourceAllocation feature gate.
    
    This field is immutable. It can only be set for containers.

  limits	<map[string]Object>
    Limits describes the maximum amount of compute resources allowed.
    More info:
    https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

  requests	<map[string]Object>
    Requests describes the minimum amount of compute resources required.
    If Requests is omitted for a container, it defaults to Limits if that is
    explicitly specified,
    otherwise to an implementation-defined value. Requests cannot exceed Limits.
    More info:
    https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
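
For reference, a minimal sketch of how `resourcesPerNode` might carry a one-off resource such as EFA. The `apiVersion`, `kind`, and `spec.trainer.resourcesPerNode` path come from the `kubectl explain` output above. The job name `lammps-flux-efa`, the runtime name `flux-runtime`, the node count, and the CPU amount are hypothetical placeholders. `vpc.amazonaws.com/efa` is the resource name advertised by the AWS EFA device plugin, assumed here to be installed on the cluster. This has not been tested against the Flux policy.

```python
import json


def build_trainjob(name, runtime_name, num_nodes, resources):
    """Return a TrainJob manifest as a plain dict.

    Extended resources (such as a device-plugin resource) cannot be
    overcommitted, so requests and limits are set to the same values.
    """
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            "runtimeRef": {"name": runtime_name},
            "trainer": {
                "numNodes": num_nodes,
                "resourcesPerNode": {
                    "requests": dict(resources),
                    "limits": dict(resources),
                },
            },
        },
    }


manifest = build_trainjob(
    name="lammps-flux-efa",       # hypothetical job name
    runtime_name="flux-runtime",  # hypothetical runtime name
    num_nodes=2,                  # illustrative value
    resources={"cpu": "4", "vpc.amazonaws.com/efa": "1"},
)

print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since kubectl accepts JSON as well as YAML.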

vsoch · 2026-03-11T01:52:48Z

@andreyvelich FYI, we are going to be showing off Kubeflow Trainer at the High Performance Software Foundation (HPSF) meeting next week!

https://events.linuxfoundation.org/hpsf-conference/program/schedule/

The best part of that abstract might be the title :)

https://www.youtube.com/watch?v=hdcTmpvDO0I

andreyvelich · 2026-03-11T14:31:46Z

> @andreyvelich I've updated the demo after testing on AWS, and it now uses assets from the trainer repository directly. Do you want me to include an example that uses AWS EFA? Apologies, because I think I asked this before, but does the Kubeflow Trainer support one-off resources like EFA?

Sure, you can add this into Flux examples: https://github.com/kubeflow/trainer/tree/master/examples/flux or Trainer documentation.

> @andreyvelich FYI, we are going to be showing off Kubeflow Trainer at the High Performance Software Foundation (HPSF) meeting next week!

This is awesome! We should definitely promote it through our outreach channels. cc: @kubeflow/kubeflow-outreach-committee @kubeflow/kubeflow-steering-committee @kubeflow/kubeflow-trainer-team.

@tarekabouzeid @yashpal2104, could you help share this on Kubeflow’s social channels? Highlighting that Kubeflow Trainer is being used for HPC workloads would be especially impactful and highly relevant for the AI community.

yashpal2104 · 2026-03-11T14:40:47Z

Yes @andreyvelich, I will share a LinkedIn post for this. I can also record a video to promote the HPSF event talk on the Trainer update.

andreyvelich

Thanks @vsoch, this looks great. I left a few comments.

andreyvelich · 2026-03-11T14:34:31Z