trainer: Flux policy user-guide #4283

Open: vsoch wants to merge 2 commits into kubeflow:master from vsoch:user-guide/flux

vsoch commented Jan 19, 2026

Description of Changes

This changeset adds documentation (a user guide) for using the Flux Policy in the Kubeflow Trainer. The example runs a popular simulation, LAMMPS, on CPU. A GPU example is desired and will likely be added in a separate changeset.

Related Issues

This is linked to a pull request in the trainer repository:

Related: kubeflow/trainer#3064.

I did not open an issue here (and can if needed, please let me know).

Checklist

cc @milroy

This changeset adds documentation (a user guide) to use
the Flux Policy in the Kubeflow Trainer. The example includes
running a popular simulation, LAMMPS, with CPU. A GPU example
is desired and will be added likely in a separate changeset.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@google-oss-prow

Hi @vsoch. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow bot added the area/trainer and size/L labels Jan 19, 2026
@github-actions

🚫 This command cannot be processed. Only organization members or owners can use the commands.

Member

Arhell left a comment

/ok-to-test

@andreyvelich
Member

@vsoch Can you update this guide for Flux in Trainer please?

@vsoch
Author

vsoch commented Mar 10, 2026

Sure thing - I'll bring up a cluster on AWS today and test out using the (now merged) main branch to run it.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vsoch
Author

vsoch commented Mar 11, 2026

@andreyvelich I've updated the demo after testing on AWS, and it now uses assets from the trainer repository directly. Do you want me to include an example that uses AWS EFA? Apologies because I think I asked this before, but does the Kubeflow Trainer support one-off resources like EFA?

@vsoch
Author

vsoch commented Mar 11, 2026

Found it! Putting here so I remember next time. 🙃

$ kubectl explain trainjob.spec.trainer.resourcesPerNode
GROUP:      trainer.kubeflow.org
KIND:       TrainJob
VERSION:    v1alpha1

FIELD: resourcesPerNode <Object>


DESCRIPTION:
    resourcesPerNode defines the compute resources for each training node.
    
FIELDS:
  claims	<[]Object>
    Claims lists the names of resources, defined in spec.resourceClaims,
    that are used by this container.
    
    This field depends on the
    DynamicResourceAllocation feature gate.
    
    This field is immutable. It can only be set for containers.

  limits	<map[string]Object>
    Limits describes the maximum amount of compute resources allowed.
    More info:
    https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

  requests	<map[string]Object>
    Requests describes the minimum amount of compute resources required.
    If Requests is omitted for a container, it defaults to Limits if that is
    explicitly specified,
    otherwise to an implementation-defined value. Requests cannot exceed Limits.
    More info:
    https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
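
A minimal sketch of how a one-off device such as EFA might be requested through this field. The job name `lammps-flux-efa` and the runtime name `flux-runtime` are placeholders that do not come from this thread, and `vpc.amazonaws.com/efa` assumes the AWS EFA Kubernetes device plugin is installed and advertises that resource on the nodes.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: lammps-flux-efa          # placeholder name
spec:
  runtimeRef:
    name: flux-runtime           # assumption: name of an installed Flux runtime
  trainer:
    numNodes: 2
    resourcesPerNode:
      limits:
        vpc.amazonaws.com/efa: 1 # assumption: resource exposed by the EFA device plugin
      requests:
        vpc.amazonaws.com/efa: 1 # optional; defaults to the limit when omitted
```

As the field description above notes, requests default to limits when only limits are given, so the `requests` entry is shown here only for clarity.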

@vsoch
Author

vsoch commented Mar 11, 2026

@andreyvelich FYI, we are going to be showing off Kubeflow Trainer at the High Performance Software Foundation (HPSF) meeting next week!

https://events.linuxfoundation.org/hpsf-conference/program/schedule/

The best part of that abstract might be the title :)

https://www.youtube.com/watch?v=hdcTmpvDO0I

@andreyvelich
Member

> @andreyvelich I've updated the demo after testing on AWS, and it now uses assets from the trainer repository directly. Do you want me to include an example that uses AWS EFA? Apologies because I think I asked this before, but does the Kubeflow Trainer support one-off resources like EFA?

Sure, you can add this into Flux examples: https://github.com/kubeflow/trainer/tree/master/examples/flux or Trainer documentation.

> @andreyvelich FYI, we are going to be showing off Kubeflow Trainer at the High Performance Software Foundation (HPSF) meeting next week!

This is awesome! We should definitely promote it through our outreach channels. cc: @kubeflow/kubeflow-outreach-committee @kubeflow/kubeflow-steering-committee @kubeflow/kubeflow-trainer-team.

@tarekabouzeid @yashpal2104, could you help share this on Kubeflow’s social channels? Highlighting that Kubeflow Trainer is being used for HPC workloads would be especially impactful and highly relevant for the AI community.

@yashpal2104
Contributor

Yes @andreyvelich, I will share a LinkedIn post for this. I can also record a video to promote the HPSF event talk on the Trainer update.

Member

andreyvelich left a comment

Thanks @vsoch, this looks great. I left a few comments.


## Flux Policy Example: LAMMPS

This example demonstrates how to use the **Flux Framework** policy for the Kubeflow Trainer to run distributed HPC workloads. Flux is a next-generation high-performance computing (HPC) scheduler that provides sophisticated resource management in Kubernetes. For more about Flux and its context for Kubeflow Trainer, see [KEP-2841](https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc).
Member

I would move the KEP link to the bottom of this page, under Next Steps, for users who want to explore more.


## Flux Policy Example: LAMMPS

This example demonstrates how to use the **Flux Framework** policy for the Kubeflow Trainer to run distributed HPC workloads. Flux is a next-generation high-performance computing (HPC) scheduler that provides sophisticated resource management in Kubernetes. For more about Flux and its context for Kubeflow Trainer, see [KEP-2841](https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc).
Member

andreyvelich Mar 11, 2026

Can you also move this sentence to the Flux Framework Overview section: "Flux is a next-generation high-performance computing (HPC) scheduler that provides sophisticated resource management in Kubernetes", and add a link to the official Flux docs?

Suggested change
- This example demonstrates how to use the **Flux Framework** policy for the Kubeflow Trainer to run distributed HPC workloads. Flux is a next-generation high-performance computing (HPC) scheduler that provides sophisticated resource management in Kubernetes. For more about Flux and its context for Kubeflow Trainer, see [KEP-2841](https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc).
+ This example demonstrates how to use the **Flux Framework** policy for the Kubeflow Trainer to run distributed HPC workloads.

Comment on lines +41 to +45
### Prerequisites

1. **Kubeflow Trainer** installed in your cluster.
2. **JobSet** operator installed (dependency of the Trainer).

Member

You can remove this since it is part of the Trainer overview page.

Suggested change
### Prerequisites
1. **Kubeflow Trainer** installed in your cluster.
2. **JobSet** operator installed (dependency of the Trainer).

Comment on lines +48 to +52
You'll need to first install the Kubeflow Trainer.

```bash
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=master"
```
Member

same here

Suggested change
You'll need to first install the Kubeflow Trainer.
```bash
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=master"
```

Comment on lines +66 to +68
```bash
kubectl get pods -w
```
Member

Can you show output of this command?

Comment on lines +75 to +83
```bash
kubectl logs lammps-flux-node-0-0-mvjsf -c node -f
```

You can look at the second pod to see the follower broker bootstrap with the lead broker, and then cleanup when LAMMPS is done running.

```bash
kubectl logs lammps-flux-node-0-1-glj22 -c node -f
```
Member

Same here with the output

Comment on lines +89 to +95
### Switch to Interactive Mode

Delete the current job and create an interactive lammps cluster:

```bash
kubectl delete -f https://raw.githubusercontent.com/kubeflow/trainer/refs/heads/master/examples/flux/lammps-train-job.yaml
kubectl apply -f https://raw.githubusercontent.com/kubeflow/trainer/refs/heads/master/examples/flux/flux-interactive.yaml
Member

This is really cool. I could imagine building a Jupyter Kernel for interactive Flux development.
cc @akshaychitneni @bigsur0 @shravan-achar

Comment on lines +91 to +94
Delete the current job and create an interactive lammps cluster:

```bash
kubectl delete -f https://raw.githubusercontent.com/kubeflow/trainer/refs/heads/master/examples/flux/lammps-train-job.yaml
Member

We don't really need to delete the previous Job, right?

Suggested change
Delete the current job and create an interactive lammps cluster:
```bash
kubectl delete -f https://raw.githubusercontent.com/kubeflow/trainer/refs/heads/master/examples/flux/lammps-train-job.yaml
Create an interactive lammps cluster:
```bash

- FLUX_VIEW_IMAGE: The flux view base image, which defaults to `ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy`
- FLUX_NETWORK_DEVICE: The network device for the Flux overlay network only (not necessarily your application). Defaults to `eth0`

This can be easily expanded. [Let us know](https://github.com/flux-framework).
Member

It would be better to say that if users want a custom image, they can create an issue in the appropriate repository in the Flux GitHub org.
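
For reference, a hedged sketch of how the two variables listed in the hunk above (`FLUX_VIEW_IMAGE` and `FLUX_NETWORK_DEVICE`) might be set. It assumes the Flux policy reads them from the trainer environment of the TrainJob (`spec.trainer.env`), which this thread does not confirm. The job name and runtime name are placeholders. The values shown are the defaults quoted in the guide.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: lammps-flux-custom       # placeholder name
spec:
  runtimeRef:
    name: flux-runtime           # assumption: name of an installed Flux runtime
  trainer:
    numNodes: 2
    env:
      # Assumption: the Flux policy picks these up from the trainer env.
      - name: FLUX_VIEW_IMAGE
        value: ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy
      - name: FLUX_NETWORK_DEVICE
        value: eth0
```

If the assumption holds, swapping in a different view image or network device would only require changing the two `value` fields.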

Labels: area/trainer, ok-to-test, size/L


4 participants