Conversation

@yeazelm (Contributor) commented Dec 31, 2025

Issue number:

Related to: bottlerocket-os/bottlerocket#4673

Description of changes:
This builds the mps-control-daemon binary from the device plugin, which enables MPS support. We have to patch the hardcoded paths for Bottlerocket usage, since the device plugin assumes it can write to /, which doesn't work on Bottlerocket.

This change also adds a new service that starts this binary when the settings request it. Otherwise, the service runs `sleep infinity` as a placeholder so that systemd can `try-restart` it when the MPS settings change.

The change should be safe to take without bottlerocket-os/bottlerocket-kernel-kit#347 or the upcoming settings change, but the daemon will not work until the kmod update lands and the settings are properly set.
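The runtime toggle works through a systemd drop-in that overrides ExecStart. As a minimal sketch of that override mechanism (the local path here is illustrative; the real drop-in is rendered under /etc/systemd/system by Bottlerocket's templating):

```shell
# systemd drop-ins replace ExecStart by first clearing it with an empty
# assignment, then supplying the new command. Sketch of the rendered
# drop-in when MPS is enabled:
mkdir -p ./nvidia-mps-control-daemon.service.d
cat > ./nvidia-mps-control-daemon.service.d/exec-start.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/mps-control-daemon --config-file /etc/nvidia-k8s-device-plugin/settings.yaml
EOF
# Both lines matter: without the empty ExecStart=, systemd would reject a
# second ExecStart for a Type=simple service.
grep -c '^ExecStart' ./nvidia-mps-control-daemon.service.d/exec-start.conf
```

Re-rendering just this drop-in avoids a full unit-file change, which is what keeps the `try-restart` flow possible without a daemon reload.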

Testing done:
Built images with the kernel change and the settings changes, validated that a node comes up with MPS working when it is set in user data, and confirmed that the services restart so MPS can be enabled at runtime as well.

Setting it in user data for a g6.2xlarge, which has only one GPU


eksctl config snippet for setting it at the beginning:

    bottlerocket:
      settings:
        kubelet-device-plugins:
          nvidia:
            device-sharing-strategy: "mps"
            mps:
              replicas: 2

Results in a node reporting nvidia.com/gpu.shared:

Capacity:
  cpu:                    8
  ephemeral-storage:      81854Mi
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 31619656Ki
  nvidia.com/gpu.shared:  2
  pods:                   58
Allocatable:
  cpu:                    7910m
  ephemeral-storage:      76173383962
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 30602824Ki
  nvidia.com/gpu.shared:  2
  pods:                   58

Setting MPS after boot


Start with a node with no configuration for MPS:

# apiclient get settings.kubelet-device-plugins.nvidia
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "cdi-cri",
        "device-partitioning-strategy": "none",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}

# systemctl status
● ip-192-168-12-91.us-west-2.compute.internal
    State: running
    Units: 458 loaded (incl. loaded aliases)
     Jobs: 0 queued
   Failed: 0 units
    Since: Wed 2025-12-31 22:32:18 UTC; 5min ago
  systemd: 257.9
  Tainted: unmerged-bin
   CGroup: /
....

# systemctl status nvidia-mps-control-daemon
● nvidia-mps-control-daemon.service - NVIDIA MPS Control Daemon
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-mps-control-daemon.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
             /etc/systemd/system/nvidia-mps-control-daemon.service.d
             └─exec-start.conf
     Active: active (running) since Wed 2025-12-31 22:32:32 UTC; 5min ago
 Invocation: d1565c1130dc4d9e87108f540f1178da
   Main PID: 3111 (/usr/bin/sleep)
      Tasks: 1 (limit: 36988)
     Memory: 308K (peak: 1.2M)
        CPU: 5ms
     CGroup: /system.slice/nvidia-mps-control-daemon.service
             └─3111 /usr/bin/sleep infinity

Dec 31 22:32:32 ip-... systemd[1]: Started NVIDIA MPS Control Daemon.

# systemctl cat nvidia-mps-control-daemon
# /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-mps-control-daemon.service
[Unit]
Description=NVIDIA MPS Control Daemon
After=nvidia-k8s-device-plugin.service
Requires=nvidia-k8s-device-plugin.service

[Service]
Type=simple
ExecStart=/bin/true
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

# /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d/00-aws-config.conf
[Service]
# Set the AWS_SDK_LOAD_CONFIG system-wide instead of at the individual service
# level, to make sure new system services that use the AWS SDK for Go read the
# shared AWS config
Environment=AWS_SDK_LOAD_CONFIG=true

# /etc/systemd/system/nvidia-mps-control-daemon.service.d/exec-start.conf
[Service]
ExecStart=
ExecStart=/usr/bin/sleep infinity

The node shows one GPU:

Capacity:
  cpu:                8
  ephemeral-storage:  81854Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             31619660Ki
  nvidia.com/gpu:     1
  pods:               58
Allocatable:
  cpu:                7910m
  ephemeral-storage:  76173383962
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             30602828Ki
  nvidia.com/gpu:     1
  pods:               58

Then set MPS:

apiclient set settings.kubelet-device-plugins.nvidia.device-sharing-strategy=mps settings.kubelet-device-plugins.nvidia.mps.replicas=8


bash-5.1# apiclient get settings.kubelet-device-plugins.nvidia
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "cdi-cri",
        "device-partitioning-strategy": "none",
        "device-sharing-strategy": "mps",
        "mps": {
          "replicas": 8
        },
        "pass-device-specs": true
      }
    }
  }
}

Now check the rest of the system:

# systemctl status
● ip-192-168-12-91.us-west-2.compute.internal
    State: running
    Units: 458 loaded (incl. loaded aliases)
     Jobs: 0 queued
   Failed: 0 units
    Since: Wed 2025-12-31 22:32:18 UTC; 7min ago
  systemd: 257.9
  Tainted: unmerged-bin
   CGroup: /
           ├─default
...
# systemctl status nvidia-mps-control-daemon
● nvidia-mps-control-daemon.service - NVIDIA MPS Control Daemon
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-mps-control-daemon.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
             /etc/systemd/system/nvidia-mps-control-daemon.service.d
             └─exec-start.conf
     Active: active (running) since Wed 2025-12-31 22:39:41 UTC; 36s ago
 Invocation: 7191c5bc120246709e113d50d3ce3c54
   Main PID: 6994 (mps-control-dae)
      Tasks: 12 (limit: 36988)
     Memory: 49.1M (peak: 62M)
        CPU: 227ms
     CGroup: /system.slice/nvidia-mps-control-daemon.service
             ├─6994 /usr/bin/mps-control-daemon --config-file /etc/nvidia-k8s-device-plugin/settings.yaml
             ├─7015 nvidia-cuda-mps-control -d
             └─7021 tail -n +1 -f /run/mps/nvidia.com/gpu.shared/log/control.log

Dec 31 22:39:41 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:39:41.892 Control  7015] Accepting connection...
Dec 31 22:39:41 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:39:41.892 Control  7015] NEW UI
Dec 31 22:39:41 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:39:41.892 Control  7015] Cmd:set_default_active_thread_percentage 12
Dec 31 22:39:41 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:39:41.892 Control  7015] 12.0
Dec 31 22:39:41 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:39:41.892 Control  7015] UI closed
Dec 31 22:40:11 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:40:11.832 Control  7015] Accepting connection...
Dec 31 22:40:11 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:40:11.832 Control  7015] NEW UI
Dec 31 22:40:11 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:40:11.832 Control  7015] Cmd:get_default_active_thread_percentage
Dec 31 22:40:11 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:40:11.832 Control  7015] 12.0
Dec 31 22:40:11 ip-192-168-12-91.us-west-2.compute.internal mps-control-daemon[7021]: [2025-12-31 22:40:11.832 Control  7015] UI closed

# systemctl cat nvidia-mps-control-daemon
# /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-mps-control-daemon.service
[Unit]
Description=NVIDIA MPS Control Daemon
After=nvidia-k8s-device-plugin.service
Requires=nvidia-k8s-device-plugin.service

[Service]
Type=simple
ExecStart=/bin/true
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

# /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d/00-aws-config.conf
[Service]
# Set the AWS_SDK_LOAD_CONFIG system-wide instead of at the individual service
# level, to make sure new system services that use the AWS SDK for Go read the
# shared AWS config
Environment=AWS_SDK_LOAD_CONFIG=true

# /etc/systemd/system/nvidia-mps-control-daemon.service.d/exec-start.conf
[Service]
ExecStart=
ExecStart=/usr/bin/mps-control-daemon --config-file /etc/nvidia-k8s-device-plugin/settings.yaml

# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  mpsRoot: "/run/nvidia/mps"
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: cdi-cri
    deviceIDStrategy: index
    containerDriverRoot: "/"
sharing:
  mps:
    renameByDefault: true
    resources:
    - name: "nvidia.com/gpu"
      replicas: 8

And the node shows the now-empty nvidia.com/gpu resource alongside the new shared one:

Capacity:
  cpu:                    8
  ephemeral-storage:      81854Mi
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 31619660Ki
  nvidia.com/gpu:         1
  nvidia.com/gpu.shared:  8
  pods:                   58
Allocatable:
  cpu:                    7910m
  ephemeral-storage:      76173383962
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 30602828Ki
  nvidia.com/gpu:         0
  nvidia.com/gpu.shared:  8
  pods:                   58

This is a known edge case and is similar to how time-slicing works. To avoid the stale nvidia.com/gpu resource, you'd need to start with the user-data approach.

Shifting to rename-by-default=false (apiclient set settings.kubelet-device-plugins.nvidia.mps.rename-by-default=false) exposes the original nvidia.com/gpu resource instead:

Capacity:
  cpu:                    8
  ephemeral-storage:      81854Mi
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 31619660Ki
  nvidia.com/gpu:         8
  nvidia.com/gpu.shared:  8
  pods:                   58
Allocatable:
  cpu:                    7910m
  ephemeral-storage:      76173383962
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 30602828Ki
  nvidia.com/gpu:         8
  nvidia.com/gpu.shared:  0
  pods:                   58

And finally, setting the sharing strategy to none disables MPS:

# apiclient set settings.kubelet-device-plugins.nvidia.device-sharing-strategy=none
# systemctl status nvidia-mps-control-daemon
● nvidia-mps-control-daemon.service - NVIDIA MPS Control Daemon
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-mps-control-daemon.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
             /etc/systemd/system/nvidia-mps-control-daemon.service.d
             └─exec-start.conf
     Active: active (running) since Wed 2025-12-31 22:44:41 UTC; 2s ago
 Invocation: 82664a64fd044762a81ecef6d1cc0462
   Main PID: 9436 (/usr/bin/sleep)
      Tasks: 1 (limit: 36988)
     Memory: 308K (peak: 1.2M)
        CPU: 4ms
     CGroup: /system.slice/nvidia-mps-control-daemon.service
             └─9436 /usr/bin/sleep infinity

Dec 31 22:44:41 ip-192-168-12-91.us-west-2.compute.internal systemd[1]: Started NVIDIA MPS Control Daemon.

And the resource count goes back down to 1.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Add support for NVIDIA Multi-Process Service (MPS) control daemon,
including service configuration and device plugin updates.

Signed-off-by: Matthew Yeazel <[email protected]>
ExecStart=/usr/bin/mps-control-daemon --config-file /etc/nvidia-k8s-device-plugin/settings.yaml
{{else}}
ExecStart=
ExecStart=/usr/bin/sleep infinity

As discussed, options to remove the sleep are limited, since a combination of our tooling and systemd behaviour prevents us from:

  • leaving the service in an exited state, since systemd will not try-restart it
  • letting the service fail until MPS is available, since the system would then be in a degraded state
  • having the service without an [Install] section, since we don't have logic to conditionally start and stop a service
  • having the entire service file rendered on enabling MPS, since that would require a daemon reload, which is not possible with the existing logic

Without writing new logic to conditionally enable and disable services, this is the way forward to enable MPS support.


[Service]
Type=simple
ExecStart=/bin/true

I'd prefer /usr/bin/false here so that the start fails in a more obvious way.

[Service]
{{#if (eq settings.kubelet-device-plugins.nvidia.device-sharing-strategy "mps")}}
ExecStart=
ExecStart=/usr/bin/mps-control-daemon --config-file /etc/nvidia-k8s-device-plugin/settings.yaml

In the code, there's this log message:

klog.Info("No devices are configured for MPS sharing; Waiting indefinitely.")

That seems like the infinite wait we want. Can we render the config in a way that triggers this?
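One way to exercise that path might be to render an MPS sharing section that matches no resources. Whether the plugin accepts an empty resources list (rather than requiring the mps key to be omitted entirely) is an assumption that would need checking; sketch below writes to a local path for illustration:

```shell
# Hypothetical settings.yaml rendering with MPS selected but no resources
# configured, which may hit the "Waiting indefinitely" branch in the plugin.
cat > ./settings.yaml <<'EOF'
version: v1
sharing:
  mps:
    renameByDefault: true
    resources: []
EOF
```

If that works, the `sleep infinity` placeholder could be dropped and the daemon itself would idle until MPS resources appear in the config.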
