This guide explains how to contribute to the Kubeflow Trainer V2 project. For the Kubeflow Trainer documentation, please check the official Kubeflow documentation.
- Go (1.23 or later)
- Docker (23 or later)
- Lima (an alternative to Docker Desktop) (0.21.0 or later)
- Colima (Lima specifically for macOS) (0.6.8 or later)
- Python (3.11 or later)
- kustomize (4.0.5 or later)
- Kind (0.27.0 or later)
- pre-commit
Note: for Lima, the link points to the Adopters page, which covers several different container environments.
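As a quick sanity check before building, the sketch below reports which tools are missing from your PATH. It is an illustrative helper, not part of the repository: the function name `missing_tools` is hypothetical, and the binary names passed to it (`go`, `docker`, `python3`, `kustomize`, `kind`, `pre-commit`) are assumed to match a default install. It checks presence only, not versions.

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is not found on PATH.
# Returns 0 when every tool is present, 1 otherwise.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Binary names are assumptions based on the list above; adjust for your setup.
missing_tools go docker python3 kustomize kind pre-commit ||
  echo "Install the tools printed above before continuing." >&2
```

Lima and Colima are left out of the call because only one container environment is needed; add `limactl` or `colima` if you rely on them.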
The Kubeflow Trainer project includes a Makefile with several helpful commands to streamline your development workflow:
# Generate manifests, APIs and SDK
make generate
You can see all available commands by running:
make help
The Kubeflow Trainer project includes several types of tests to ensure code quality and functionality.
Run the Go unit tests with:
make test
You can also run Python unit tests:
make test-python
Run the Go integration tests with:
make test-integration
For Python integration tests:
make test-python-integration
To set up a Kind cluster for e2e testing:
make test-e2e-setup-cluster
Run the end-to-end tests with:
make test-e2e
You can also run Jupyter notebook tests with Papermill:
make test-e2e-notebook
When coding:
- Follow the Effective Go guidelines.
- Run make generate locally to verify that your changes follow best practices before submitting PRs.
When writing tests:
- Use cmp.Diff instead of reflect.DeepEqual to get readable comparisons when a test fails.
- Define test cases as maps instead of slices to avoid dependencies on the running order.
- The map key should be equal to the test case name.
On Ubuntu, the default Go package appears to be gccgo-go, which has problems. We recommend installing Go from the official tarballs.
Make sure to install pre-commit (pip install pre-commit) and run pre-commit install from the root of the repository at least once before creating git commits.
The pre-commit hooks ensure code quality and consistency. They are also executed in CI, and PRs that fail the hooks will not pass the corresponding CI gate. The hooks run only against staged files unless you run pre-commit run --all-files, in which case they run against every file in the repository.
Specific programmatically generated files listed in the exclude field in .pre-commit-config.yaml are deliberately excluded from the hooks.
Create a symbolic link inside your GOPATH to the location you checked out the code:
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
ln -sf ${GIT_TRAINING} $(go env GOPATH)/src/github.com/kubeflow/training-operator
- GIT_TRAINING should be the location where you checked out https://github.com/kubeflow/training-operator
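If GIT_TRAINING is unset or empty, the ln command above receives a single argument and does not create the intended link. A slightly more defensive variant of the same two commands might look like the sketch below. The function name `link_checkout` and its two arguments are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: link a checkout into a GOPATH-style source tree.
#   $1: path to the checked-out repository (what GIT_TRAINING points to)
#   $2: GOPATH root
link_checkout() {
  src="$1"
  gopath="$2"
  if [ -z "$src" ] || [ ! -d "$src" ]; then
    echo "checkout directory not found: '$src'" >&2
    return 1
  fi
  if [ -z "$gopath" ]; then
    echo "GOPATH is empty" >&2
    return 1
  fi
  mkdir -p "$gopath/src/github.com/kubeflow" || return 1
  # -n replaces an existing link instead of creating a new link inside it.
  ln -sfn "$src" "$gopath/src/github.com/kubeflow/training-operator"
}
```

It could then be called as `link_checkout "$GIT_TRAINING" "$(go env GOPATH)"`. Running it twice is harmless, since the existing link is replaced.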
Install dependencies:
go mod tidy
Build the library:
go install github.com/kubeflow/training-operator/cmd/training-operator.v1
Running the operator locally (as opposed to deploying it on a K8s cluster) is convenient for debugging/development.
You can create a kind cluster by running:
kind create cluster
This will update your Kubernetes config file with the new cluster.
After creating the cluster, you can check the nodes with the command below, which should show the kind-control-plane node:
kubectl get nodes
The output should look something like this:
$ kubectl get nodes
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   32s   v1.27.3
From here we can apply the manifests to the cluster:
kubectl apply --server-side -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
Then we can patch it with the latest operator image:
kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "kubeflow/training-operator:latest"}]'
After setting up the cluster, you can submit a sample job using a TrainJob:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-mnist-example
spec:
  runtimeRef:
    name: torch-distributed
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
Apply the job:
kubectl apply -f pytorch-job.yaml
Check the job status:
kubectl get trainjobs
kubectl describe trainjob pytorch-mnist-example
You can also run a traditional PyTorch job example:
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
And we can see the output of the job from the logs:
kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple --follow
To generate the Python SDK for the operator, run:
./hack/python-sdk/gen-sdk.sh
This command will re-generate the API and model files together with the documentation and model tests.
The following files/folders in sdk/python are auto-generated and should not be modified directly:
sdk/python/docs
sdk/python/kubeflow/training/models
sdk/python/kubeflow/training/*.py
sdk/python/test/*.py
The Training Operator client and public APIs are located here:
sdk/python/kubeflow/training/api