Conversation
This changeset adds documentation (a user guide) for using the Flux Policy in the Kubeflow Trainer. The example runs a popular simulation, LAMMPS, on CPU. A GPU example is desired and will likely be added in a separate changeset.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
|
Hi @vsoch. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
|
🚫 This command cannot be processed. Only organization members or owners can use the commands. |
|
@vsoch Can you update this guide for Flux in Trainer please? |
|
Sure thing - I'll bring up a cluster on AWS today and test out using the (now merged) main branch to run it. |
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
|
@andreyvelich I've updated the demo after testing on AWS, and it now uses assets from the trainer repository directly. Do you want me to include an example that uses AWS EFA? Apologies because I think I asked this before, but does the Kubeflow Trainer have support for one-off resources like EFA?
|
Found it! Putting here so I remember next time. 🙃
$ kubectl explain trainjob.spec.trainer.resourcesPerNode
GROUP: trainer.kubeflow.org
KIND: TrainJob
VERSION: v1alpha1
FIELD: resourcesPerNode <Object>
DESCRIPTION:
    resourcesPerNode defines the compute resources for each training node.
FIELDS:
  claims <[]Object>
    Claims lists the names of resources, defined in spec.resourceClaims,
    that are used by this container.
    This field depends on the
    DynamicResourceAllocation feature gate.
    This field is immutable. It can only be set for containers.
  limits <map[string]Object>
    Limits describes the maximum amount of compute resources allowed.
    More info:
    https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
  requests <map[string]Object>
    Requests describes the minimum amount of compute resources required.
    If Requests is omitted for a container, it defaults to Limits if that is
    explicitly specified,
    otherwise to an implementation-defined value. Requests cannot exceed Limits.
    More info:
    https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
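For anyone landing here later, here is a minimal sketch of how the `resourcesPerNode` field described above might carry a one-off resource such as EFA. It only writes a manifest to a temporary file so it can be inspected before applying. The job name `lammps-flux-efa`, the runtime name `flux-runtime`, the node count, and the quantities are placeholders assumed for illustration and are not taken from this thread; `vpc.amazonaws.com/efa` is the extended resource name advertised by the AWS EFA device plugin.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Write a TrainJob manifest that requests one EFA device per node.
# ASSUMPTIONS: metadata.name, runtimeRef.name, numNodes and the cpu/efa
# quantities are hypothetical placeholders; adjust them to your cluster.
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: lammps-flux-efa
spec:
  runtimeRef:
    name: flux-runtime
  trainer:
    numNodes: 2
    resourcesPerNode:
      # Extended resources must have matching requests and limits.
      requests:
        cpu: "4"
        vpc.amazonaws.com/efa: "1"
      limits:
        cpu: "4"
        vpc.amazonaws.com/efa: "1"
EOF

echo "Wrote TrainJob manifest to $manifest"
# Once the values match your cluster, apply it with:
#   kubectl apply -f "$manifest"
```

Whether the Flux policy needs anything beyond this resource request to use EFA is not established in this thread.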
|
@andreyvelich FYI, we are going to be showing off Kubeflow Trainer at the High Performance Software Foundation (HPSF) meeting next week! https://events.linuxfoundation.org/hpsf-conference/program/schedule/ The best part of that abstract might be the title :)
Sure, you can add this into Flux examples: https://github.com/kubeflow/trainer/tree/master/examples/flux or Trainer documentation.
This is awesome! We should definitely promote it through our outreach channels. cc: @kubeflow/kubeflow-outreach-committee @kubeflow/kubeflow-steering-committee @kubeflow/kubeflow-trainer-team. @tarekabouzeid @yashpal2104, could you help share this on Kubeflow’s social channels? Highlighting that Kubeflow Trainer is being used for HPC workloads would be especially impactful and highly relevant for the AI community. |
|
Yes @andreyvelich, I will share a LinkedIn post for this. I can also record a video to promote the HPSF event talk on the Trainer update.
andreyvelich left a comment
Thanks @vsoch, this looks great. I left a few comments.
| ## Flux Policy Example: LAMMPS
|
| This example demonstrates how to use the **Flux Framework** policy for the Kubeflow Trainer to run distributed HPC workloads. Flux is a next-generation high-performance computing (HPC) scheduler that provides sophisticated resource management in Kubernetes. For more about Flux and its context for Kubeflow Trainer, see [KEP-2841](https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc).
I would move KEP to the bottom of this page if users want to explore more (Next Steps).
Can you also move this message to the Flux Framework Overview section: Flux is a next-generation high-performance computing (HPC) scheduler that provides sophisticated resource management in Kubernetes + add link to the official Flux docs.
Suggested change:
- This example demonstrates how to use the **Flux Framework** policy for the Kubeflow Trainer to run distributed HPC workloads. Flux is a next-generation high-performance computing (HPC) scheduler that provides sophisticated resource management in Kubernetes. For more about Flux and its context for Kubeflow Trainer, see [KEP-2841](https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc).
+ This example demonstrates how to use the **Flux Framework** policy for the Kubeflow Trainer to run distributed HPC workloads.
| ### Prerequisites
|
| 1. **Kubeflow Trainer** installed in your cluster.
| 2. **JobSet** operator installed (dependency of the Trainer).
You can remove this since it is part of overview Trainer page.
Suggested change:
- ### Prerequisites
- 1. **Kubeflow Trainer** installed in your cluster.
- 2. **JobSet** operator installed (dependency of the Trainer).
| You'll need to first install the Kubeflow Trainer.
|
| ```bash
| kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=master"
| ```
Same here.
Suggested change:
- You'll need to first install the Kubeflow Trainer.
- ```bash
- kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=master"
- ```
| ```bash
| kubectl get pods -w
| ```
Can you show the output of this command?
| ```bash
| kubectl logs lammps-flux-node-0-0-mvjsf -c node -f
| ```
|
| You can look at the second pod to see the follower broker bootstrap with the lead broker, and then cleanup when LAMMPS is done running.
|
| ```bash
| kubectl logs lammps-flux-node-0-1-glj22 -c node -f
| ```
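As an aside on the pod names used in these commands: assuming the usual JobSet naming pattern `<jobset>-<replicatedJob>-<jobIndex>-<podIndex>-<suffix>`, the second-to-last field is what distinguishes the two pods, while the trailing suffix is random and will differ on every run. The sketch below parses that field. `pod_index` is a hypothetical helper, and treating index 0 as the lead broker is inferred from the guide text rather than confirmed by it.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Extract the pod index from a JobSet-style pod name of the form
#   <jobset>-<replicatedJob>-<jobIndex>-<podIndex>-<suffix>
pod_index() {
  local name="$1"
  local without_suffix="${name%-*}"   # drop the random suffix
  echo "${without_suffix##*-}"        # keep the last remaining field
}

# Pod names copied from the guide; yours will carry different suffixes.
for pod in lammps-flux-node-0-0-mvjsf lammps-flux-node-0-1-glj22; do
  if [ "$(pod_index "$pod")" = "0" ]; then
    echo "$pod: lead broker (assumed)"
  else
    echo "$pod: follower broker (assumed)"
  fi
done
```

In practice the actual names come from `kubectl get pods`, so the hard-coded names in the guide need to be replaced with whatever that command reports.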
| ### Switch to Interactive Mode
|
| Delete the current job and create an interactive lammps cluster:
|
| ```bash
| kubectl delete -f https://raw.githubusercontent.com/kubeflow/trainer/refs/heads/master/examples/flux/lammps-train-job.yaml
| kubectl apply -f https://raw.githubusercontent.com/kubeflow/trainer/refs/heads/master/examples/flux/flux-interactive.yaml
This is really cool, I would imagine building a Jupyter Kernel for interactive Flux development.
cc @akshaychitneni @bigsur0 @shravan-achar
We don't really need to delete the previous Job, right?
Suggested change:
- Delete the current job and create an interactive lammps cluster:
- ```bash
- kubectl delete -f https://raw.githubusercontent.com/kubeflow/trainer/refs/heads/master/examples/flux/lammps-train-job.yaml
+ Create an interactive lammps cluster:
+ ```bash
| - FLUX_VIEW_IMAGE: The flux view base image, which defaults to `ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy`
| - FLUX_NETWORK_DEVICE: The network device for the Flux overlay network only (not necessarily your application). Defaults to `eth0`
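To make the defaulting concrete, here is a minimal sketch that resolves both settings the way the two bullets describe: an explicit value wins, otherwise the documented default applies. It illustrates the semantics only. How the Flux policy actually reads these values (for example, from the TrainJob environment) is an assumption not shown in this thread.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Fall back to the documented default when a setting is unset or empty.
# ASSUMPTION: plain environment variables are used here for illustration.
FLUX_VIEW_IMAGE="${FLUX_VIEW_IMAGE:-ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy}"
FLUX_NETWORK_DEVICE="${FLUX_NETWORK_DEVICE:-eth0}"

echo "FLUX_VIEW_IMAGE=$FLUX_VIEW_IMAGE"
echo "FLUX_NETWORK_DEVICE=$FLUX_NETWORK_DEVICE"
```

Exporting either variable before running the script overrides the corresponding default.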
|
| This can be easily expanded. [Let us know](https://github.com/flux-framework).
I would rather say that if users want a custom image, they can create an issue in the appropriate repository in the Flux GitHub org.
Description of Changes
This changeset adds documentation (a user guide) for using the Flux Policy in the Kubeflow Trainer. The example runs a popular simulation, LAMMPS, on CPU. A GPU example is desired and will likely be added in a separate changeset.
Related Issues
This is linked with a pull request to the trainer,
Related: kubeflow/trainer#3064.
I did not open an issue here (and can if needed, please let me know).
Checklist
cc @milroy