
Conversation

@yeazelm
Contributor

@yeazelm yeazelm commented Jan 16, 2026

Issue number:

Closes #4673

Description of changes:
Update the shared defaults configuration files and services to accommodate the nvidia-mps-control-daemon. Add a migration for the new settings.

Testing done:
cargo make completes. For full testing of the feature, look at bottlerocket-os/bottlerocket-core-kit#789

Migration testing passed.

From the console when downgrading:

         Starting Write network status...
[  OK  ] Finished Prepare Containerd Directory (/var/lib/containerd).
[    5.718341] migrator[1537]: 22:11:30 [INFO] Running migration 'migrate_v1.54.0_kubelet-device-plugins-mps-prefix-settings.lz4'
[  OK  ] Finished Prepare Kubelet Directory (/var/lib/kubelet).
[  OK  ] Finished Prepare Opt Directory (/opt).
[  OK  ] Finished Prepare Var Directory (/var).
[  OK  ] Finished Write network status.
[  OK  ] Reached target Network is Online.
         Mounting CNI Plugin Directory (/opt/cni)...
[    5.977593] migrator[1537]: 22:11:31 [INFO] Running migration 'migrate_v1.54.0_kubelet-device-plugins-mps-settings.lz4'
[    6.070127] migrator[1537]: 22:11:31 [INFO] Removing the weak settings and metadata.
         Mounting CSI Helper Directory (/opt/csi)...

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

/// GPU sharing in the device plugin.
fn run() -> Result<()> {
    migrate(AddPrefixesMigration(vec![
        "settings.kubelet-device-plugins.nvidia.mps",
    ]))
}
Contributor

You might have to add "configuration-files.nvidia-mps-control-daemon-exec-start-conf" as well.

Contributor

@piyush-jena piyush-jena Jan 21, 2026

I was wondering if you have to add a migration for "settings.kubelet-device-plugins.nvidia.device-sharing-strategy" as well because the allowed value can also be mps now.

fn run() -> Result<()> {
    migrate(RestrictListsMigration(vec![ListRestriction {
        setting: "settings.kubelet-device-plugins.nvidia.device-sharing-strategy",
        allowed_vals: &["time-slicing"],
    }]))
}

Contributor Author

I added a custom function for this since we don't have a helper for Enum variants.

Contributor

It seems like the custom function would still allow "mps" upon downgrade.

Contributor Author

I saw this after re-writing my code; this is a nice find! My implementation is a bit more targeted and specific to this one use case, but that example would have worked as well.
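
For reference, a targeted handler along those lines could look roughly like the sketch below. It leans on the migration_helpers API shown later in this thread; the ReplaceMpsStrategy name and the "none" fallback value are placeholders rather than the code that was merged.

use migration_helpers::{migrate, Migration, MigrationData, Result};
use std::process;

const STRATEGY_SETTING: &str =
    "settings.kubelet-device-plugins.nvidia.device-sharing-strategy";

/// Placeholder for a targeted migration: on downgrade, the new "mps"
/// variant is replaced with a value the older enum still accepts.
pub struct ReplaceMpsStrategy;

impl Migration for ReplaceMpsStrategy {
    fn forward(&mut self, input: MigrationData) -> Result<MigrationData> {
        // Nothing to do on upgrade; "mps" only becomes valid going forward.
        Ok(input)
    }

    fn backward(&mut self, mut input: MigrationData) -> Result<MigrationData> {
        if let Some(serde_json::Value::String(s)) = input.data.get_mut(STRATEGY_SETTING) {
            if s == "mps" {
                // "none" is an assumed fallback; the real migration may choose differently.
                *s = "none".to_string();
                println!("Reset {STRATEGY_SETTING} from 'mps' on downgrade.");
            }
        }
        Ok(input)
    }
}

fn run() -> Result<()> {
    migrate(ReplaceMpsStrategy)
}

fn main() {
    if let Err(e) = run() {
        eprintln!("{e}");
        process::exit(1);
    }
}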

Add the additional services and configuration files
for the nvidia-mps-control-daemon

Signed-off-by: Matthew Yeazel <yeazelm@amazon.com>
@yeazelm
Contributor Author

yeazelm commented Jan 22, 2026

^ Updated with core and kernel kit releases, Bottlerocket SDK update, and comments for migrations.

@yeazelm yeazelm marked this pull request as ready for review January 22, 2026 17:59
Signed-off-by: Matthew Yeazel <yeazelm@amazon.com>
@yeazelm
Contributor Author

yeazelm commented Jan 22, 2026

^ Updated the code to break the work into two migrations; the first migration's datastore changes would otherwise be lost when the second one ran. Keeping them separate ensures both complete.
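
For context, the two migrations named in the downgrade log from the testing section above are built as independent binaries. Below is a rough sketch of the first one, assuming the same helpers shown elsewhere in this thread (the module path for AddPrefixesMigration is assumed); the second binary would wrap the device-sharing-strategy handling the same way.

use migration_helpers::common_migrations::AddPrefixesMigration; // module path assumed
use migration_helpers::{migrate, Result};
use std::process;

// migrate_v1.54.0_kubelet-device-plugins-mps-prefix-settings: drops the new
// settings.kubelet-device-plugins.nvidia.mps tree when moving to an older release.
fn run() -> Result<()> {
    migrate(AddPrefixesMigration(vec![
        "settings.kubelet-device-plugins.nvidia.mps",
    ]))
}

fn main() {
    if let Err(e) = run() {
        eprintln!("{e}");
        process::exit(1);
    }
}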

@KCSesh
Contributor

KCSesh commented Jan 22, 2026

Just a note: if we wanted to do 2 sets of migrations in 1, this works as well:

use migration_helpers::{migrate, Migration, MigrationData, Result};
use std::process;

const MPS_PREFIX: &str = "settings.kubelet-device-plugins.nvidia.mps";
const DEVICE_SHARING_STRATEGY_SETTING: &str =
    "settings.kubelet-device-plugins.nvidia.device-sharing-strategy";

pub struct MpsMigration;

impl Migration for MpsMigration {
    fn forward(&mut self, input: MigrationData) -> Result<MigrationData> {
        println!("MpsMigration has no work to do on upgrade.");
        Ok(input)
    }

    fn backward(&mut self, mut input: MigrationData) -> Result<MigrationData> {
        // Remove all settings with the MPS prefix
        let to_remove: Vec<_> = input.data.keys()
            .filter(|k| k.starts_with(MPS_PREFIX))
            .cloned()
            .collect();
        for key in to_remove {
            if let Some(data) = input.data.remove(&key) {
                println!("Removed {key}, which was set to '{data}'");
            }
        }

        // Change device-sharing-strategy from "mps" to "none"
        if let Some(data) = input.data.get_mut(DEVICE_SHARING_STRATEGY_SETTING) {
            if let serde_json::Value::String(s) = data {
                if s == "mps" {
                    *data = serde_json::Value::String("none".to_string());
                    println!("Changed device-sharing-strategy from 'mps' to 'none' on downgrade.");
                }
            }
        }
        Ok(input)
    }
}

fn run() -> Result<()> {
    migrate(MpsMigration)
}

fn main() {
    if let Err(e) = run() {
        eprintln!("{e}");
        process::exit(1);
    }
}

But I think I like the 2 separate migrations since they are different trees anyway.

@yeazelm yeazelm merged commit b33e7b6 into bottlerocket-os:develop Jan 22, 2026
2 checks passed