Use new -c option on faux-mgs component-active-slot command

lzrd · lzrd · commit 350535792aa5 · 2025-07-26T15:06:40.000-07:00
Also: Update scripts test plan, scripts todo list, and remove Hubris issue #2093 workaround.
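
The clear-then-fall-back decision described in this change can be sketched as follows. This is an illustration only, not the code in `update-helper.rhai`: the function name `clear_method_for` and the sample JSON strings are hypothetical. The only details taken from this change are that the result carries an `Ok` or `Err` field and that an unsupported `-c` request surfaces as a "WrongVersion" error.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the JSON text produced by an attempt to run
# `component-active-slot -c`. The JSON shape used here is an assumption.
clear_method_for() {
    local result_json="$1"
    if printf '%s' "$result_json" | grep -q '"Ok"'; then
        # The SP accepted -c, so the pending preference was cleared directly.
        echo "direct"
    elif printf '%s' "$result_json" | grep -q 'WrongVersion'; then
        # Older SP firmware: fall back to the RoT reset workaround.
        echo "reset-fallback"
    else
        # Any other failure is reported rather than masked by the fallback.
        echo "error"
    fi
}

# Made-up sample results, for illustration only.
clear_method_for '{"Ok": {}}'
clear_method_for '{"Err": "WrongVersion { sp: 19, request: 20 }"}'
```

Matching on the error text keeps the sketch free of a JSON parser. The real scripts receive the result as a Rhai map and can inspect its fields directly.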
diff --git a/scripts/README.md b/scripts/README.md
@@ -57,6 +57,8 @@ and personal workflows.
     `faux-mgs` Rust integration:
     * `faux_mgs(["arg0", .., "argN"]) -> map`: Runs any `faux-mgs` command
         internally (using `--json=pretty`) and returns the result as a Rhai map.
+        This map will contain either an `Ok` or `Err` field, reflecting the
+        command's success or failure.
         *Do not call this directly in test scripts; use wrappers from `util.rhai`.*
     * `new_archive(path) -> ArchiveInspector`: Loads a Hubris archive (.zip).
     * `ArchiveInspector[<zip_path>]`: Access files within the archive (returns
@@ -90,7 +92,7 @@ import `${script_dir}/util` as util;
     -   `rot_boot_info()`: Gets formatted RoT Boot Info.
     -   `check_update_in_progress(component)`: Checks SP/RoT update status.
     -   `update_rot_image_file(slot, path, label)`: Updates RoT image.
-    -   `set_rot_boot_preference(slot, use_transient, label)`: Sets RoT pref.
+    -   `rot_boot_preference(slot, action, use_transient, label)`: Sets or clears the RoT boot preference. When clearing (`action = util::PREF_CLEAR`), uses the `component-active-slot -c` command if the firmware supports it, and otherwise falls back automatically to the reset workaround for version compatibility.
     -   `reset_rot_and_get_rbi(desc, label)`: Resets RoT and gets RBI.
     -   `update_sp_image(path)`: Updates SP image.
     -   `reset_sp()`: Resets SP.
diff --git a/scripts/TEST_PLAN.md b/scripts/TEST_PLAN.md
@@ -1,28 +1,28 @@
 # Validation Test Plan
 
 This document outlines a full set of tests to validate the `upgrade-rollback.rhai` test harness. It is split into two parts:
-1.  **Part 1**: Tests the current, real-world scenario where the new `baseline` firmware has features that the older `under-test` firmware lacks.
+1.  **Part 1**: Tests the current, real-world scenario where the new `under-test` firmware has features that the older `baseline` firmware lacks.
 2.  **Part 2**: Describes tests for a future state where both `baseline` and `under-test` firmware are fully compliant with the transient boot preference feature.
 
 ## Prerequisites and Setup
 
 ### 1. Environment Variables
 
-Before running these tests, for convenience, set two environment variables to point to your local Hubris build repositories:
+Before running these tests, for convenience, set environment variables to point to your local Hubris build repositories. For example:
 ```bash
 export REPO_BL=/path/to/your/baseline/hubris
 export REPO_UT=/path/to/your/under-test/hubris
+export UT_WORKTREE=${REPO_UT}
 ```
 
 These repositories must have SP and RoT Hubris build products in their
 respective `target/` directories.
 
-Examine and edit the `scripts/targets.json` file or make your own if you need
-to use images from other locations.
+**Note for other users**: The `scripts/targets.json` file uses these environment variables (e.g., `UT_WORKTREE`) to locate firmware images. If simple environment variable overrides are not convenient, then you will want your own configuration file like `scripts/targets.json` that reflects your local test environment.
 
 ### 2. The `FMR` Wrapper Script
 
-The test commands use a helper script named `FMR`, which is a wrapper around the main `cargo run --bin faux-mgs` command. Its purpose is to simplify running tests by automatically including common arguments.
+The test commands use a helper script named `FMR` (faux-mgs with Rhai scripting), which is a wrapper around the main `cargo run --bin faux-mgs` command. Its purpose is to simplify running tests by automatically including common arguments.
 
 * **Functionality**: The script automatically adds required arguments like `--features=rhaiscript`, `--json=pretty`, timeouts, and attempts to discover the correct network `--interface` setting.
 * **Log Levels**: The name used to call the script sets the log level for the test run. For example, `FMR-info` sets `--log-level=info`, while `FMR-trace` sets `--log-level=trace`.
@@ -37,8 +37,35 @@ Each numbered test case should be run from a known-clean state. Before starting
 ```bash
 FMR-info reset-component rot
 FMR-info reset-component rot
+```
+This ensures that any version of RoT firmware being used has no pending Hubris
+image preferences in effect.
+
+### 3. Copy and customize `scripts/targets.json` for your environment
+
+```bash
+TARGETS=targets-$(uname -n).json
+cp scripts/targets.json $TARGETS
+# Edit $TARGETS appropriately
 ```
 
+Note that the `upgrade-rollback.rhai` script has `-b` and `-u` options to
+override the baseline and under-test paths in `scripts/targets.json`, so if those
+are the only things you want to change, you can just use those CLI flags.
+
+
+---
+
+## Version Compatibility and Graceful Degradation
+
+The test scripts include robust version compatibility handling for the `--cancel-pending` feature:
+
+* **Preferred Method**: When supported, the scripts use `faux-mgs component-active-slot -c` to directly clear pending persistent preferences.
+* **Fallback Method**: When the SP firmware doesn't support the command (indicated by a "WrongVersion" error), the scripts automatically fall back to the RoT reset workaround.
+* **Seamless Operation**: This compatibility layer ensures tests work across different firmware versions without manual intervention.
+
+During the transition period where some devices have updated firmware and others don't, the test suite will automatically use the appropriate method for each device.
+
 ---
 
 ## Part 1: Testing Asymmetric Feature Support (Current State)
@@ -50,8 +77,7 @@ FMR-info reset-component rot
 * **Purpose**: To verify that the primary upgrade and rollback functionality works correctly without using any of the new features.
 * **Command**:
     ```bash
-    FMR-info rhai scripts/upgrade-rollback.rhai -c scripts/targets.json \
-      -b $REPO_BL -u $REPO_UT
+    ./FMR-info rhai scripts/upgrade-rollback.rhai -c $TARGETS
     ```
 * **Expected Outcome**: The script should complete successfully with an exit code of 0. It will upgrade to the `under-test` image and then roll back to the `baseline` image using persistent updates.
 
@@ -60,8 +86,7 @@ FMR-info reset-component rot
 * **Purpose**: To verify the script correctly handles the feature asymmetry when the transient update path is requested.
 * **Command**:
     ```bash
-    FMR-info rhai scripts/upgrade-rollback.rhai -c scripts/targets.json \
-      -b $REPO_BL -u $REPO_UT -t
+    ./FMR-info rhai scripts/upgrade-rollback.rhai -c $TARGETS -t
     ```
 * **Expected Outcome**: The script should complete successfully with an exit code of 0. The log should show:
     * **Upgrade**: The active `baseline` firmware does not support the feature. The script will log a warning and use a persistent update.
@@ -72,8 +97,7 @@ FMR-info reset-component rot
 * **Purpose**: To verify the logic that runs (or skips) the `test_and_recover...` negative test based on feature support.
 * **Command**:
     ```bash
-    FMR-info rhai scripts/upgrade-rollback.rhai -c scripts/targets.json \
-      -b $REPO_BL -u $REPO_UT -N
+    ./FMR-info rhai scripts/upgrade-rollback.rhai -c $TARGETS -N
     ```
 * **Expected Outcome**: The script will **fail with exit code 1**. This is the correct behavior.
     * **Upgrade**: The `baseline` firmware is active and does not support the transient feature. The script will detect this and, because the test is for the `ut` branch, it will log a `FATAL` error stating the `under-test` image must support the feature. This check is known to be flawed for this specific asymmetric case but correctly protects against regressions.
@@ -83,20 +107,16 @@ FMR-info reset-component rot
 * **Purpose**: To verify that the test harness can recover from a pre-existing `pending_persistent` preference fault.
 * **Command**:
     ```bash
-    FMR-info rhai scripts/upgrade-rollback.rhai -c scripts/targets.json \
-      -b $REPO_BL -u $REPO_UT \
-      --inject-fault=pending --hubris-2093
+    ./FMR-info rhai scripts/upgrade-rollback.rhai -c $TARGETS --inject-fault=pending
     ```
-* **Expected Outcome**: The script should run the "pending" fault injection test and exit with code 0. The log will show the sanitizer detecting the fault and using the reset-based workaround to clear it before the main test flow runs successfully.
+* **Expected Outcome**: The script should run the "pending" fault injection test and exit with code 0. The log will show the sanitizer detecting the fault and attempting to use the `faux-mgs component-active-slot -c` command to clear it. If the firmware supports the command, it will clear the fault directly. If there's a version mismatch (e.g., "WrongVersion { sp: 19, request: 20 }"), the system will fall back to the RoT reset workaround and still complete successfully.
 
 ### Test 1.5: Fault Injection - Conflicting `transient` Preference
 
 * **Purpose**: To verify the test harness correctly handles the inability to inject a fault into non-compliant firmware.
 * **Command**:
     ```bash
-    FMR-info rhai scripts/upgrade-rollback.rhai -c scripts/targets.json \
-      -b $REPO_BL -u $REPO_UT \
-      --inject-fault=transient
+    ./FMR-info rhai scripts/upgrade-rollback.rhai -c $TARGETS --inject-fault=transient
     ```
 * **Expected Outcome**: The script is **expected to fail with exit code 1**. This is the correct outcome. The log will show:
     1. The script first installs the `baseline` (`master`) firmware.
@@ -115,8 +135,7 @@ FMR-info reset-component rot
 * **Purpose**: To verify that when both images are compliant, the script uses the transient update path for both the upgrade and the rollback.
 * **Command**:
     ```bash
-    FMR-info rhai scripts/upgrade-rollback.rhai -c scripts/targets.json \
-      -b $REPO_BL -u $REPO_UT -t
+    ./FMR-info rhai scripts/upgrade-rollback.rhai -c $TARGETS -t
     ```
 * **Expected Outcome**: The script should complete successfully with an exit code of 0. The log should show a transient update is used for **both** the upgrade to `ut` and the subsequent rollback to `base`.
 
@@ -125,7 +144,6 @@ FMR-info reset-component rot
 * **Purpose**: To verify that the negative test runs successfully in both directions when all firmware is compliant.
 * **Command**:
     ```bash
-    FMR-info rhai scripts/upgrade-rollback.rhai -c scripts/targets.json \
-      -b $REPO_BL -u $REPO_UT -N
+    ./FMR-info rhai scripts/upgrade-rollback.rhai -c $TARGETS -N
     ```
 * **Expected Outcome**: The script should complete successfully with an exit code of 0. The `test_and_recover_from_preferred_slot_update_failure` function should be executed and pass for the `ut` branch during the upgrade, and then be executed and pass **again** for the `base` branch during the rollback.
diff --git a/scripts/TODO.md b/scripts/TODO.md
@@ -4,10 +4,9 @@ This document tracks known issues, planned features, and refactoring opportuniti
 
 ## High Priority / Bugs & Workarounds
 
-* **Remove `--hubris-2093` Workaround**
-    * **Issue**: The `lpc55-update-server` firmware has a bug where setting a persistent preference does not correctly clear a pre-existing pending preference. This is tracked as "Hubris issue #2093".
-    * **Workaround**: The `sanitize_boot_preferences` function in `update-helper.rhai` uses a reset to reliably clear a pending preference when the `--hubris-2093` flag is active.
-    * **Action**: Once the firmware bug is fixed, the workaround logic should be removed from `sanitize_boot_preferences` and the `--hubris-2093` flag should be removed from `upgrade-rollback.rhai`. The "ideal" logic path should become the only path.
+*   **Hubris #2093 Workaround Status**
+    *   **Issue**: The `lpc55-update-server` firmware had a bug where setting a persistent preference did not correctly clear a pre-existing pending preference. This was tracked as "Hubris issue #2093".
+    *   **Current Status**: The `faux-mgs component-active-slot -c` command is now implemented and is the preferred solution. However, the workaround logic is retained in `sanitize_boot_preferences` for version compatibility: when the SP firmware doesn't support the new `-c` command (version mismatch), the scripts fall back to the RoT reset workaround. This keeps the tests working across firmware versions during the transition period.
 
 * **Fix `faux-mgs` Error Reporting for `reset-component`**
     * **Issue**: When the SP debugger is attached, the `reset-component sp` command fails. However, the `faux-mgs` Rust code does not gracefully package the detailed error message (`watchdog: RoT error: the SP programming dongle is connected`) into the JSON passed to Rhai. It returns a generic error.
diff --git a/scripts/update-helper.rhai b/scripts/update-helper.rhai
@@ -317,18 +317,18 @@ fn rot_supports_transient_boot_preference() {
         debug(`error|Cannot get active RoT slot. Error: ${r.error}`);
         return false;
     }
-    if r.ok.transient != () {
-        // Evidence shows that the feature is supported.
-        // Don't alter the state
+    if r.ok.transient != () || r.ok.pending_persistent != () {
+        // Indirect evidence since reporting non-() values
+        // was implemented in the same commit as transient image selection.
+        // State is not altered.
         return true;
     }
-    // For compliant firmware, this command succeeds and ensures the transient
-    // preference is cleared. For non-compliant firmware, it will fail.
-    let pref_check_result = util::rot_boot_preference(r.ok.active, util::PREF_SET, true, "transient_support_test");
+    // Now we know that if the feature is supported, then it isn't being used.
+    // If the feature isn't supported, then this call will fail:
+    let pref_check_result = util::rot_boot_preference(r.ok.active, util::PREF_SET, true, "transient_support_test_set");
     if pref_check_result?.ok == true {
-        // Setting the transient preference to the active slot is a no-op
-        // but shows that the feature is supported.
         debug("info|transient boot preference feature is supported.");
+        let _r = util::rot_boot_preference(r.ok.active, util::PREF_CLEAR, true, "transient_support_test_clear");
         return true;
     }
     debug("warn|transient boot preference feature is not supported.");
@@ -529,15 +529,30 @@ fn rot_validate_final_persistent_boot_state(
 }
 
 fn rot_validate_direct_persistent_boot_state(
-    rbi, target_update_slot, target_label
+    rbi, target_update_slot, target_label, images
 ) {
     debug("info|--- rot_validate_direct_persistent_boot_state (update-helper) ---");
-    if rbi.active != target_update_slot {
-        debug(`error|Validation FAILED: Unexpected active slot for '${target_label}'.`);
+    let active_gitc = util::caboose_value("rot", `${rbi.active}`, "GITC");
+
+    if active_gitc == () {
+        debug(`error|Validation FAILED: Could not get GITC for active slot ${rbi.active}.`);
+        return false;
+    }
+    let expected_gitc_entries = images.by_gitc?.get(active_gitc);
+
+    // Check if the active GITC belongs to the target_label (e.g., 'base_rot_a' or 'base_rot_b')
+    // The `images.by_gitc` map stores an array of labels for each GITC.
+    // We need to check if any of the labels for the active GITC match the target branch.
+    if expected_gitc_entries != () && (
+        `${target_label}_rot_a` in expected_gitc_entries ||
+        `${target_label}_rot_b` in expected_gitc_entries
+    ) {
+        debug(`info|Validation PASSED: RoT active GITC (${active_gitc}) matches expected for '${target_label}'.`);
+        return true;
+    } else {
+        debug(`error|Validation FAILED: Active GITC (${active_gitc}) does not match expected for '${target_label}'. RBI: ${rbi}`);
         return false;
     }
-    debug(`info|Validation PASSED: RoT correctly booted for '${target_label}'.`);
-    return true;
 }
 
 /// Attempts to power cycle the DUT and log its RoT state afterwards.
@@ -617,7 +632,8 @@ fn update_rot_hubris(
     path_b,
     use_transient,
     target_label,
-    conf
+    conf,
+    images
 ) {
     debug(`info|update_rot_hubris target=${target_label}`);
     debug(`info|transient=${use_transient}`);
@@ -774,10 +790,19 @@ fn update_rot_hubris(
         }
     } else {
         // Direct Persistent Boot Flow
+        // After setting persistent preference, the device should boot into the target slot.
+        // We perform one reset and check if the preference took effect.
+        if (rbi.active != target_slot) {
+            debug(`error|Validation FAILED: Persistent preference for slot ${target_slot} did not take effect after first reset. Active slot is ${rbi.active}. This indicates a firmware issue.`);
+            if (conf?.rot_hubris_power_cycle_on_failure == true) {
+                power_cycle_dut(conf, "RoT persistent preference not applied after reset");
+            }
+            return false;
+        }
+
         if (!rot_validate_direct_persistent_boot_state(
-            rbi, target_slot, target_label
+            rbi, target_slot, target_label, images
         )) {
-            // No specific error type from this local helper yet to condition power cycle
             if (conf?.rot_hubris_power_cycle_on_failure == true) {
                 power_cycle_dut(conf, "RoT direct persistent boot validation failed");
             }
@@ -860,7 +885,7 @@ fn ensure_initial_baseline_state(conf, images) {
         debug(`info|Device needs baseline flashing. SP: ${flash_sp}, RoT: ${flash_rot}`);
         if flash_rot {
             debug("info|Updating RoT to baseline (persistent).");
-            if !update_rot_hubris(conf.base.rot_a, conf.base.rot_b, false, "baseline_setup", conf) {
+            if !update_rot_hubris(conf.base.rot_a, conf.base.rot_b, false, "base", conf, images) {
                 debug("error|Failed to update RoT to baseline.");
                 return false;
             }
@@ -1087,15 +1112,6 @@ fn sanitize_boot_preferences(conf) {
         } else {
             debug("info|sanitize_boot_preferences: Successfully cleared pending persistent preference.");
         }
-
-        // Final verification
-        let final_rbi = util::rot_boot_info();
-        if final_rbi?.error != () || final_rbi.pending_persistent_boot_preference != () {
-             debug(`error|sanitize_boot_preferences: Failed to verify pending pref was cleared. RBI: ${final_rbi}`);
-             return false;
-        }
-    } else {
-      debug("info|No pending persistent preference");
     }
 
     debug("info|Boot preferences sanitized successfully.");
@@ -1106,7 +1122,7 @@ fn sanitize_boot_preferences(conf) {
 /// the SP having a pending update.
 /// Returns:
 ///   bool: `true` if no debugger and pending update detected, `false` otherwise.
-//
+///
 /// Fixing Hubris issue 2066 will give us more definitive information to use in
 /// testing.
 fn check_for_sp_debugger_and_sp_pending_update() {
diff --git a/scripts/upgrade-rollback.rhai b/scripts/upgrade-rollback.rhai