
Conversation

@hawkw (Member) commented Jan 21, 2026

The Problem

We have recently seen a number of Cosmo-Bs shut down by the thermal control loop under load, and we would really prefer they didn't do this. This has generally occurred when the machine suddenly starts an intensive CPU and NVMe workload (initializing a large number of raw zvols) while inlet air temperature is uncomfortable but within our specified range of 35°F (2°C) to 95°F (35°C): ambient temperatures in "the fridge" in Oxide EMY have been between 75°F and 90°F when thermal shutdowns have occurred. In all observed cases of these thermal shutdowns, we have not seen CPU Tctl reach the thresholds of 95 Magic AMD Thermal Sadness Units (at which point the CPU throttles itself) or 115 (at which point THERMTRIP_L is asserted). Instead, these systems have shut down due to a timeout in the thermal control loop, which limits the amount of time the system spends in the Overheated state to 60 seconds. See #2361 for details on the observed pathology.

Background

To elaborate a bit here, the Hubris thermal control loop operates in two primary control regimes: the "normal" regime (represented by ThermalControlState::Running, in which fan speeds are governed by PID control with temperature sensor readings as the process variable), and the "overheating" regime (represented by ThermalControlState::Overheated, in which we operate the fans at 100% of their PWM duty cycle until we return to the normal control regime).

For each device whose temperature measurements influence the control loop (on Cosmo, the CPU Tctl, the M.2 and U.2 SSDs, and the T6 NIC), we define three temperature thresholds: target_temperature, critical_temperature, and power_down_temperature. These affect the behavior of the control loop as follows:

  • If any device ever reports a temperature over its power_down_temperature, the thermal control loop powers the system off immediately;
  • If we are in the normal control regime and any device reports a temperature over its critical_temperature threshold, we transition to the "overheating" control regime;
  • Once we are in the "overheating" control regime, we remain there (running at 100% duty cycle) until every device's temperature measurement is less than or equal to its target_temperature (i.e., it has returned to nominal).
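
As a sketch, the per-device classification described above could look like the following Rust. The type and function names here are illustrative only, not the actual Hubris code:

```rust
/// Where a single temperature reading falls relative to its device's
/// thresholds. Note that a reading exactly at `target_temperature` counts
/// as nominal ("less than or equal to").
#[derive(Clone, Copy, PartialEq, Debug)]
enum ThresholdStatus {
    Nominal,     // at or below target_temperature
    AboveTarget, // above target, at or below critical
    Critical,    // above critical_temperature
    PowerDown,   // above power_down_temperature: power off immediately
}

/// Hypothetical per-device threshold configuration.
struct Thresholds {
    target_temperature: f32,
    critical_temperature: f32,
    power_down_temperature: f32,
}

fn check_device(temp: f32, t: &Thresholds) -> ThresholdStatus {
    if temp > t.power_down_temperature {
        ThresholdStatus::PowerDown
    } else if temp > t.critical_temperature {
        ThresholdStatus::Critical
    } else if temp > t.target_temperature {
        ThresholdStatus::AboveTarget
    } else {
        ThresholdStatus::Nominal
    }
}
```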

There's one additional factor, which will become important shortly: when we enter the "overheating" control regime, we start a timeout of overheat_timeout_ms, which is currently always set to 60 seconds. If 60 seconds elapse without returning to nominal, the thermal control loop powers the system off. In effect, if the thermal loop feels that things have not gotten better fast enough, it decides to kill itself immediately.
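
A minimal sketch of that timeout check (the constant and function names are assumptions for illustration, not the real task's identifiers):

```rust
/// Time the control loop will tolerate in the "overheating" regime before
/// powering the system off (60 seconds, per the current configuration).
const OVERHEAT_TIMEOUT_MS: u64 = 60_000;

/// Returns true if the system should power itself off: we have spent
/// longer than the allowed timeout in the "overheating" regime.
fn overheat_timed_out(now_ms: u64, overheat_start_ms: u64) -> bool {
    now_ms.saturating_sub(overheat_start_ms) > OVERHEAT_TIMEOUT_MS
}
```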

The present behavior was introduced in PR #2317. Prior to that change, we would exit the "overheating" control regime as soon as all temperatures were at least one degree¹ below each device's critical threshold, and return to normal control. The effect of this change is that we now remain in the "overheating" regime for longer. Perhaps a bit unintuitively, though, the intent (and, I believe, the effect!) of this change was to reduce the likelihood of thermal shutdowns: when we return to normal PID control after dropping 1 degree under critical, the fan speed will go down and the system will immediately start overheating again. In this rapid oscillation between control regimes, we didn't spend enough time at 100% PWM to cool down sufficiently, and something would hit its power_down_temperature threshold.

However, the other impact of #2317 is that the system spends a lot more time in the control regime in which it considers itself to be "overheating", and does not stop the "guess I should kill myself" timeout until temperatures have decreased by a much greater margin than was previously required. Matt² explicitly calls this out in the PR:

Note that this change means the system will stay in Overheated longer, because it has to get to the normal temperature (instead of critical - 1°). I didn't change overheat_timeout_ms from its current value of 60 seconds; we really should be able to get from critical → target temperature in 60 seconds.

That last sentence --- "we really should be able to get from critical → target temperature in 60 seconds" --- is demonstrably not the case when the inlet air temperature is as hot as it's been in the fridge for the last couple of days. Looking at data collected from an overheating Cosmo this morning, we see that once we transition to the "overheating" control regime, the CPU Tctl values steadily decrease towards the threshold value...just not fast enough to make it there in less than a minute. See this comment for details.

Okay, that was a lot. But now we should have enough context to discuss...

The (Possible) Solution

This change is based on the following observations:

  1. If anything goes over a critical threshold, we want to remain at the max PWM duty cycle until everything returns to nominal (this is what #2317, "Wait for nominal temperature", did, and we want to preserve that property)
  2. We want to shut the system down safely if running the fans at full blast fails to improve the thermal situation (this is what the 60-second overheat_timeout_ms is supposed to be for, and we don't want to get rid of that behavior)
  3. However, the rate at which we can reject heat decreases as inlet temperature increases (I think? I barely remember high-school physics class), so if it's warmer than a typical air-conditioned data center cold aisle, it will take us longer to reach a nominal temperature and transition back to the normal control regime. This may take longer than 60 seconds, which is the problem.
  4. Just increasing the timeout feels gross. All timeouts are arbitrary. I don't really have a convincing argument for why 120 seconds is meaningfully different from 60 seconds except "it would have stopped that specific Cosmo from shutting down while Alan was running his raw zvol test this morning".
  5. Instead, what we really want is to distinguish between "something got too hot, but things are actively getting better and so we shouldn't kill ourselves" and "no matter how much we spin our fans, we can't seem to make the computer cool off and we should therefore probably just call it".
  6. BONUS OBSERVATION: I don't really know that much about control theory and just kinda randomly twiddling the PID gains feels scary.

Based on this, this branch introduces a new state to the control loop: TURBO MODE!!!!!.

TURBO MODE!!!!! is an extension of the "overheating" control regime in which we continue to run the fans at full blast, but we don't continue counting down for sixty seconds and then kill ourselves. When a temperature reading goes over the device's critical threshold, we still enter the Overheated state, as we did previously. If we remain in Overheated for 60 seconds, we shut down the system; again, as we did previously. And, if any temperature hits the device's power_down_temperature threshold, we still power down the system immediately, as we did previously. But, we add a new transition out of ThermalControlState::Overheated: if we see that all monitored sensors have gone below their critical thresholds, we transition from the overheated state to TURBO MODE!!!!!!

In TURBO MODE!!!, we continue to run the fans at 100% duty cycle, but no longer track the timeout. If all temperatures return to nominal, we exit TURBO MODE!!!!!!!!! and return to normal control. If anything goes back above its critical threshold, we transition from TURBO MODE!!!! back to the "overheating" state, restarting the timeout. This way, if things are below critical but not yet nominal, we keep running at full fan speed for as long as it takes to get back to nominal, rather than giving up if it takes slightly longer than 60 seconds.
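
The transitions described above can be sketched as a standalone state machine. This is a deliberately simplified model: the real ThermalControlState variants carry sensor readings and other bookkeeping, and the names and signatures here are illustrative, not the actual Hubris code:

```rust
/// Simplified model of the control loop states, including the new Turbo
/// ("TURBO MODE!!!!!") state.
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Running,
    Overheated { start_ms: u64 },
    Turbo,
    Uncontrollable, // power the system off
}

const OVERHEAT_TIMEOUT_MS: u64 = 60_000;

/// One step of the transition logic, given summarized sensor readings:
/// whether anything is over power_down, whether anything is over critical,
/// and whether everything is at or below its target (nominal).
fn step(
    state: State,
    now_ms: u64,
    any_power_down: bool,
    any_critical: bool,
    all_nominal: bool,
) -> State {
    // A power_down threshold violation shuts us down from any state.
    if any_power_down {
        return State::Uncontrollable;
    }
    match state {
        State::Running if any_critical => State::Overheated { start_ms: now_ms },
        // All the way back to nominal: resume normal PID control.
        State::Overheated { .. } if all_nominal => State::Running,
        // Below critical but not yet nominal: full fans, no timeout.
        State::Overheated { .. } if !any_critical => State::Turbo,
        // Still over critical after 60 seconds: give up.
        State::Overheated { start_ms }
            if now_ms.saturating_sub(start_ms) > OVERHEAT_TIMEOUT_MS =>
        {
            State::Uncontrollable
        }
        // Back over critical: re-enter Overheated, restarting the timeout.
        State::Turbo if any_critical => State::Overheated { start_ms: now_ms },
        State::Turbo if all_nominal => State::Running,
        s => s,
    }
}
```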

Footnotes

  1. Or one Happy Fun AMD Tctl Badness Amount Unit.

  2. Who I'm not going to tag in this as he's on leave.

@hawkw hawkw changed the title TURBO MODE ENGAGE!!! thermal: TURBO MODE ENGAGE!!! Jan 21, 2026
@hawkw hawkw marked this pull request as ready for review January 21, 2026 23:33
@hawkw hawkw requested a review from labbott January 22, 2026 03:19
@labbott (Collaborator) left a comment:

Going to re-review in a bit, but I think this is a good approach. I feel like I want a state machine doc here showing the ThermalControlState transitions, though. We can go from Running -> Overheated but never Running -> Turbo. We go Overheated -> Running if we cool down fast, and Overheated -> Turbo if we need more time. We can go Running -> Uncontrollable, Overheated -> Uncontrollable, and Turbo -> Uncontrollable.

It does seem like we could potentially bounce between Overheated and Turbo indefinitely. Is there a point where we should lie down/try not to cry/cry a lot/give up if it seems like we will never reach nominal?

@hawkw (Member, Author) commented Jan 22, 2026

@labbott:

I feel like I want a state machine doc here showing The ThermalControlState transitions though.

here's a Mermaid diagram that's kinda gross-looking:

```mermaid
stateDiagram-v2
    [*] --> Running
    Running --> Overheat: any temp over critical
    Overheat --> Running: all temps nominal
    Overheat --> Turbo: all temps under critical
    Turbo --> Overheat: any temp over critical
    Turbo --> Running: all temps nominal
    Running --> Uncontrollable: any temp over power_down
    Overheat --> Uncontrollable: any temp over power_down
    Overheat --> Uncontrollable: 60 seconds elapse
    Turbo --> Uncontrollable: any temp over power_down
    Uncontrollable --> [*]
```

I may try to turn this into an ASCII graphic that can go in a source code comment.

@hawkw (Member, Author) commented Jan 22, 2026

It does seem like we could potentially bounce between Overheated and Turbo indefinitely. Is there a point where we should lie down/try not to cry/cry a lot/give up if it seems like we will never reach nominal?

Yeah, I've wondered about this. Since the way we ended up here was kind of a case of "timeouts, timeouts, always wrong, some too short and some too long", I was a little reluctant to add a second layer of timeouts to the state that was added specifically to work around the overly aggressive timeout.

Ideally, I think we would have a timeout that tracks how long we've spent over critical without reaching nominal: one that isn't reset when we transition from Overheat -> Turbo, but that only counts time spent in Overheat. The timeout would only be reset if we return to normal control, but would be paused while below critical thresholds. Implementing that is a little more annoying because it requires tracking the sum of time spent in Overheat as well as the current start time, but we could do that.
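
That accumulate-but-pause idea could be sketched roughly like this (all names here are hypothetical, invented for illustration):

```rust
/// Tracks total time spent over critical since we last ran normally.
/// The budget pauses (but is not reset) while in Turbo, and resets only
/// on a return to normal control.
struct OverheatBudget {
    /// Time accumulated in previous Overheat stints.
    accumulated_ms: u64,
    /// `Some(t)` while currently in Overheat, where `t` is when we entered.
    entered_overheat_ms: Option<u64>,
}

impl OverheatBudget {
    fn new() -> Self {
        Self { accumulated_ms: 0, entered_overheat_ms: None }
    }

    /// Called on the Running/Turbo -> Overheat transition.
    fn enter_overheat(&mut self, now_ms: u64) {
        self.entered_overheat_ms = Some(now_ms);
    }

    /// Called on the Overheat -> Turbo transition: pause, don't reset.
    fn leave_critical(&mut self, now_ms: u64) {
        if let Some(t) = self.entered_overheat_ms.take() {
            self.accumulated_ms += now_ms.saturating_sub(t);
        }
    }

    /// Called on return to normal PID control: reset the budget.
    fn return_to_normal(&mut self) {
        *self = Self::new();
    }

    /// Total time spent over critical since we last ran normally.
    fn total_ms(&self, now_ms: u64) -> u64 {
        let current = self
            .entered_overheat_ms
            .map_or(0, |t| now_ms.saturating_sub(t));
        self.accumulated_ms + current
    }
}
```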

I did also think about fancier things like trying to look at the rate of change, but that felt a little too complex. I think it's worth keeping the behavior in the "overheated" control regime simple, and I didn't wanna have to store a bunch of past measurements to track change over time for RAM reasons.

@labbott (Collaborator) commented Jan 22, 2026

Ideally, I think maybe we would have a timeout that tracks how long we spent over critical without reaching nominal that isn't reset when we transition from Overheat -> Turbo but that only counts time spent in Overheat. The timeout would only be reset if we return to normal control, but is paused while below critical thresholds. Implementing that is just a little bit more annoying because it requires tracking a sum of time spent in Overheat as well as the current start time, but we could do that.

Is there any way we could gather this data to have it available but not use it for control decisions? I guess technically if we're not over critical we're supposed to be "fine" but I'm trying to think of data to gather to make it easier to refine this loop. Maybe this is already logged in the counted ringbuf with the number of times we hit turbo?

@hawkw (Member, Author) commented Jan 22, 2026

Ideally, I think maybe we would have a timeout that tracks how long we spent over critical without reaching nominal that isn't reset when we transition from Overheat -> Turbo but that only counts time spent in Overheat. The timeout would only be reset if we return to normal control, but is paused while below critical thresholds. Implementing that is just a little bit more annoying because it requires tracking a sum of time spent in Overheat as well as the current start time, but we could do that.

Is there any way we could gather this data to have it available but not use it for control decisions? I guess technically if we're not over critical we're supposed to be "fine" but I'm trying to think of data to gather to make it easier to refine this loop. Maybe this is already logged in the counted ringbuf with the number of times we hit turbo?

Ah, whoops, 2923d0e already makes the change to count time against the timeout across transitions between Critical and Turbo; I can change that back to just logging the data if you'd prefer.

```rust
/// This gives us an opportunity to recover from overheating by running the
/// fans aggressively without also deciding to give up and kill ourselves
/// while things are improving but not fast enough.
Turbo {
```
@hawkw (Member, Author):
considered extremely unimportant, but: Y'know...I chose this name because I thought it was funny, but it occurs to me now that perhaps the existing colloquialism "fan party" (meaning: "when one or more compute sleds in the office goes MAX RPM") deserves to be enshrined in the code.

@labbott (Collaborator):
SorryForFanPartying

@labbott (Collaborator) left a comment:
I do like the refactor! I'm going to take another pass after the coffee kicks in some more.

hawkw added 2 commits January 23, 2026 10:37
Trying to make it clearer that both `Critical` and `FanParty` are
considered "overheated" but only one of them is over critical
thresholds.
```rust
///
/// - `Overheat`, in which at least one component is critical and the timeout
///   is being tracked, and
/// - `Turbo`, in which all temperatures are below critical, and we will run
```
@labbott (Collaborator):
Need to update the docs with FanParty here if that's what we're using

Comment on lines 1296 to 1307
```rust
    // If all temperatures have gone below critical, but are
    // still above nominal, stop the overheat timeout but
    // continue running at 100% PWM until things go below
    // nominal.
    let values = *values;
    self.record_leaving_critical(now_ms);
    self.state = ThermalControlState::FanParty { values };
    ringbuf_entry!(Trace::AutoState(self.get_state()));

    ControlResult::Pwm(PWMDuty(pwm as u8))
} else if now_ms > *start_time + self.overheat_timeout_ms {
    ControlResult::Pwm(PWMDuty(
        self.pid_config.max_output as u8,
    ))
```
@labbott (Collaborator):
Worth it to pull this out to a separate function for consistency?

@hawkw (Member, Author):
I went back and forth on it, since the other transition_to_whatever functions were factored out because they're called in multiple places, and this one isn't. But I agree that it's more consistent, so why not.
