Autoconf Resource Requirements for AI Jobs on Kubernetes

This Kubernetes controller integrates with the ADO autoconf custom experiment to automatically set resource requirements for AI tuning jobs that use fms-hf-tuning.


What the controller does

The controller inspects the command line of your AI jobs (either PyTorchJob objects or PyTorchJob objects wrapped inside an AppWrapper) to extract details such as:

  • Model name
  • Tuning method
  • Effective batch size (i.e., per_device_batch_size * NUM_GPUS)
  • Maximum sequence length

It combines these with the target GPU model to request recommendations from the autoconf experiment and then creates new derived objects with the recommended resource requirements:

  • For PyTorchJob objects, and
  • For AppWrapper objects that contain a PyTorchJob object

We intend this controller to improve the execution of AI workloads on Kubernetes clusters by assigning them the right number of GPUs, so that jobs do not run out of GPU memory. Its design also lets us plug in different resource-recommendation algorithms, which we plan to explore in the future.
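
For illustration, these details typically appear as flags on the fms-hf-tuning command line inside the PyTorchJob container spec. The fragment below is a hypothetical sketch; the module path, flag names, and values follow common fms-hf-tuning conventions and are not taken from this repository:

# Hypothetical PyTorchJob container fragment (values are examples only)
containers:
  - name: pytorch
    image: quay.io/example/fms-hf-tuning:latest      # placeholder image
    command: ["python", "-m", "tuning.sft_trainer"]  # assumed fms-hf-tuning entrypoint
    args:
      - --model_name_or_path=ibm-granite/granite-3.1-8b-base  # model name
      - --peft_method=lora                                     # tuning method
      - --per_device_train_batch_size=4    # effective batch size = this * NUM_GPUS
      - --max_seq_length=4096              # maximum sequence length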

Kueue collaboration (design work in progress)

We are working with the Kueue maintainers on a Kubernetes Enhancement Proposal (KEP) to improve how external Kubernetes controllers interact with jobs managed by Kueue (including AI workloads). The design discussion is tracked here: kubernetes-sigs/kueue#6915.

Until that work lands, this controller demonstrates a way to interact with Kueue-managed jobs while operating within current Kueue capabilities.


Examples

A demo recording (2026-02-11-fms-autoconf-k8s-controller-compressed.mp4) shows the controller in action.

Test locally (no Docker image)

You can run the controller as a local process while it manages one or more namespaces on your cluster.

Assumptions for the example below

  • You use a Kueue-managed namespace called tuning to run AppWrapper workloads.
  • These workloads:
    1. use the Kueue LocalQueue named default-queue,
    2. wrap a PyTorchJob that uses fms-hf-tuning,
    3. request one or more NVIDIA GPUs of the same model (e.g., NVIDIA-A100-SXM4-80GB),
    4. are subject to Kyverno policies requiring the kueue.x-k8s.io/queue-name label on AppWrapper objects.

Steps

  1. Create and activate a Python virtual environment, then install the ADO autoconf client:
    python3 -m venv .venv && source .venv/bin/activate
    pip install ado-autoconf==1.5.0 ipython
  2. Log in to your cluster (via kubectl or oc).
  3. Build the controller locally:
    make
  4. Start the controller with flags appropriate for the scenario:
    ./bin/manager \
      --done-label-key=kueue.x-k8s.io/queue-name \
      --done-label-value=default-queue \
      --namespaces "tuning" \
      --enable-appwrapper=true \
      --enable-pytorchjob=true \
      --unsuspend-derived-jobs=false \
      --default-gpu-model=NVIDIA-A100-SXM4-80GB \
      --path-wrapper-script=./cmd/wrapper_autoconf.py
  5. Create an AppWrapper or PyTorchJob workload with the following labels:
    # This setup both satisfies Kyverno (requires a queue-name) and
    # allows Kueue to temporarily ignore the job until the controller updates it.
    kueue.x-k8s.io/queue-name: fake
    autoconf-plugin-name: resource-requirements-appwrapper

Example AppWrapper and PyTorchJob manifests are available in the examples directory.
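
For reference, a minimal PyTorchJob fragment labeled as in step 5 might look like the following sketch; the apiVersion and object names are assumptions, not taken from the examples directory:

apiVersion: kubeflow.org/v1      # PyTorchJob API group/version (assumed)
kind: PyTorchJob
metadata:
  name: my-tuning-job            # hypothetical name
  namespace: tuning              # the watched, Kueue-managed namespace
  labels:
    kueue.x-k8s.io/queue-name: fake                         # satisfies Kyverno; Kueue ignores the job for now
    autoconf-plugin-name: resource-requirements-appwrapper  # matches --watch-label-key / --watch-label-value
spec:
  # ... fms-hf-tuning PyTorchJob spec ...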


Deploy to the cluster (high-level)

If you prefer to run the controller in-cluster (e.g., as a Deployment), the high-level process is:

  1. Build an image for the controller.
  2. Create RBAC: a ServiceAccount, a Role/ClusterRole, and bindings that allow the controller to read and update the resources you plan to manage (i.e., AppWrapper and/or PyTorchJob) and to create the derived objects.
  3. Deploy a Deployment for the controller, setting the desired command-line flags (see Configuration below); a minimal sketch follows this list. Enable leader election if you run multiple replicas.
  4. Optionally expose metrics/webhooks via a Service if you enable those endpoints.
  5. Label workloads so the controller can discover them (see --watch-label-key / --watch-label-value), then create your AppWrapper/PyTorchJob objects.
  6. Observe logs and job status to confirm resources are being recommended and applied as expected. The controller will create new derived objects with owner references to the originals.
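
As a rough sketch (not the repository's official manifests), a Deployment under these assumptions might look as follows; the namespace, service account, and image name are placeholders, and the flags are the ones documented under Configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fms-autoconf-controller
  namespace: fms-autoconf-system                 # placeholder namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fms-autoconf-controller
  template:
    metadata:
      labels:
        app: fms-autoconf-controller
    spec:
      serviceAccountName: fms-autoconf-controller            # bound to the RBAC from step 2
      containers:
        - name: manager
          image: registry.example.com/fms-autoconf-k8s-controller:latest   # placeholder image
          args:
            - --namespaces=tuning
            - --enable-appwrapper=true
            - --enable-pytorchjob=true
            - --default-gpu-model=NVIDIA-A100-SXM4-80GB
            - --url-ado=http://ado-autoconf.example.com      # or --path-wrapper-script=... (exactly one of the two)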

Important Configuration flags

Below are the controller’s command-line options:

Core behavior

  • --default-autoconf-model-version string — Default autoconf model version to use (default 3.1.0).
  • --default-gpu-model string — Default GPU model if not specified in the job.
  • --patch-cpu-request — Set job CPU request/limit to max(1, 2 * NUM_GPUS) in derived objects (default true).
  • --unsuspend-derived-jobs — Unsuspend derived jobs after creation.
  • --path-wrapper-script string — Path to the local Python wrapper for running models. Mutually exclusive with --url-ado. Exactly one of these must be set.
  • --url-ado string — URL of the ADO REST API serving the models. Mutually exclusive with --path-wrapper-script. Exactly one of these must be set.

Discovery & scope

  • --namespaces string — Comma-separated list of namespaces to watch.
  • --watch-label-key string — Limit monitoring to objects labeled key=value (default key autoconf-plugin-name).
  • --watch-label-value string — Label value used with --watch-label-key (default resource-requirements-appwrapper).
  • --enable-appwrapper — Watch AppWrapper objects and create derived objects with recommendations.
  • --enable-pytorchjob — Watch PyTorchJob objects and create derived objects with recommendations.

Completion labeling

  • --done-label-key string — Label key inserted on the original object when processing is complete (default autoconf-plugin-done). Not set if the key is kueue.x-k8s.io/queue-name.
  • --done-label-value string — Label value inserted on the original object when processing is complete (default yes).
  • --waiting-for-ado-request-id-label string — Label used to mark jobs waiting for an ADO request ID (default waiting-for-ado-request-id).
  • --recommendation-annotation-key string — Annotation key to store recommendation results in JSON format (default ado-autoconf.ibm.com/recommendation).

Recommendation Annotations

The controller adds an annotation to each processed object containing the recommendation result in JSON format:

Successful Recommendation:

{
  "recommendation": {
    "workers": 2,
    "gpus": 4
  }
}

No Recommendation Available:

{
  "error": "No recommendation"
}

When the recommendation engine cannot generate recommendations (e.g., can_recommend != 1 from the ADO experiment), the controller:

  1. Adds the error annotation to the object
  2. Sets the done label to mark it as processed
  3. Allows the object to proceed through standard Kubernetes workflows without modifications
  4. Emits a warning event indicating no recommendation was available

This ensures objects are not stuck in a retry loop when recommendations cannot be generated.
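
Under the default flag values, a processed original object might therefore carry metadata along these lines (the object name is hypothetical):

metadata:
  name: my-tuning-job
  labels:
    autoconf-plugin-done: "yes"       # default --done-label-key / --done-label-value
  annotations:
    ado-autoconf.ibm.com/recommendation: '{"recommendation": {"workers": 2, "gpus": 4}}'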

Logging

  • --zap-devel — Use development mode defaults (console encoder, debug log level, stack traces at warn level); when false, production mode defaults are used (JSON encoder, info log level, stack traces at error level). Default true.
  • --zap-encoder [json|console] — Zap log encoding.
  • --zap-log-level value — Log verbosity (debug, info, error, panic, or integer > 0 for custom levels).
  • --zap-stacktrace-level [info|error|panic] — Level at and above which stack traces are captured.
  • --zap-time-encoding [epoch|millis|nano|iso8601|rfc3339|rfc3339nano] — Time encoding (default epoch).
