
Conversation

@hawkw (Member) commented Jan 21, 2026

The Problem

We have recently seen a number of Cosmo-Bs shut down by the thermal control loop under load, and we would really prefer they didn't do this. This has generally occurred when the machine suddenly starts an intensive CPU and NVMe workload (initializing a large number of raw zvols) while inlet air temperature is uncomfortable but within our specified range of 35°F (2°C) to 95°F (35°C): ambient temperatures in "the fridge" in Oxide EMY have been between 75°F and 90°F when thermal shutdowns have occurred. In all observed cases of these thermal shutdowns, we have not seen CPU Tctl reach the thresholds of 95 Magic AMD Thermal Sadness Units (at which point the CPU throttles itself) or 115 (at which point THERMTRIP_L is asserted). Instead, these systems have shut down due to a timeout in the thermal control loop, which limits the amount of time the system spends in the Overheated state to 60 seconds. See #2361 for details on the observed pathology.

Background

To elaborate a bit here, the Hubris thermal control loop operates in two primary control regimes: the "normal" regime (represented by ThermalControlState::Running, in which fan speeds are governed by PID control with temperature sensor readings as the process variable), and the "overheating" regime (represented by ThermalControlState::Overheated, in which we operate the fans at 100% of their PWM duty cycle until we return to the normal control regime).

For each device whose temperature measurements influence the control loop (on Cosmo, the CPU Tctl, the M.2 and U.2 SSDs, and the T6 NIC), we define three temperature thresholds: target_temperature, critical_temperature, and power_down_temperature. These affect the behavior of the control loop as follows:

  • If any device ever reports a temperature over its power_down_temperature, the thermal control loop powers the system off immediately;
  • If we are in the normal control regime and any device reports a temperature over its critical_temperature threshold, we transition to the "overheating" control regime;
  • Once we are in the "overheating" control regime, we remain there (running at 100% duty cycle) until every device's temperature measurement is less than or equal to its target_temperature (i.e., it has returned to nominal).
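
As a sketch, the per-device classification described above could look like the following Rust. The type and function names here are illustrative only, not the actual Hubris code:

```rust
/// Where a single temperature reading falls relative to its device's
/// thresholds. Note that a reading exactly at `target_temperature` counts
/// as nominal ("less than or equal to").
#[derive(Clone, Copy, PartialEq, Debug)]
enum ThresholdStatus {
    Nominal,     // at or below target_temperature
    AboveTarget, // above target, at or below critical
    Critical,    // above critical_temperature
    PowerDown,   // above power_down_temperature: power off immediately
}

/// Hypothetical per-device threshold configuration.
struct Thresholds {
    target_temperature: f32,
    critical_temperature: f32,
    power_down_temperature: f32,
}

fn check_device(temp: f32, t: &Thresholds) -> ThresholdStatus {
    if temp > t.power_down_temperature {
        ThresholdStatus::PowerDown
    } else if temp > t.critical_temperature {
        ThresholdStatus::Critical
    } else if temp > t.target_temperature {
        ThresholdStatus::AboveTarget
    } else {
        ThresholdStatus::Nominal
    }
}
```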

There's one additional factor, which will become important shortly: when we enter the "overheating" control regime, we start a timeout of overheat_timeout_ms, which is currently always set to 60 seconds. If 60 seconds elapse without returning to nominal, the thermal control loop powers the system off. In effect, if the thermal loop feels that things have not gotten better fast enough, it decides to kill itself immediately.
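
A minimal sketch of that timeout check (the constant and function names are assumptions for illustration, not the real task's identifiers):

```rust
/// Time the control loop will tolerate in the "overheating" regime before
/// powering the system off (60 seconds, per the current configuration).
const OVERHEAT_TIMEOUT_MS: u64 = 60_000;

/// Returns true if the system should power itself off: we have spent
/// longer than the allowed timeout in the "overheating" regime.
fn overheat_timed_out(now_ms: u64, overheat_start_ms: u64) -> bool {
    now_ms.saturating_sub(overheat_start_ms) > OVERHEAT_TIMEOUT_MS
}
```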

The present behavior was introduced in PR #2317. Prior to that change, we would exit the "overheating" control regime as soon as all temperatures were at least one degree¹ below each device's critical threshold, and return to normal control. The effect of this change is that we now remain in the "overheating" regime for longer. Perhaps a bit unintuitively, though, the intent (and, I believe, the effect!) of this change was to reduce the likelihood of thermal shutdowns: when we return to normal PID control after dropping 1 degree under critical, the fan speed will go down and the system will immediately start overheating again. In this rapid oscillation between control regimes, we didn't spend enough time at 100% PWM to cool down sufficiently, and something would hit its power_down_temperature threshold.

However, the other impact of #2317 is that the system spends a lot more time in the control regime in which it considers itself to be "overheating", and does not stop the "guess I should kill myself" timeout until temperatures have decreased by a much greater margin than was previously required. Matt² explicitly calls this out in the PR:

Note that this change means the system will stay in Overheated longer, because it has to get to the normal temperature (instead of critical - 1°). I didn't change overheat_timeout_ms from its current value of 60 seconds; we really should be able to get from critical → target temperature in 60 seconds.

That last sentence --- "we really should be able to get from critical → target temperature in 60 seconds" --- is demonstrably not the case when the inlet air temperature is as hot as it's been in the fridge for the last couple of days. Looking at data collected from an overheating Cosmo this morning, we see that once we transition to the "overheating" control regime, the CPU Tctl values steadily decrease towards the threshold value...just not fast enough to make it there in less than a minute. See this comment for details.

Okay, that was a lot. But now we should have enough context to discuss...

The (Possible) Solution

This change is based on the following observations:

  1. If anything goes over a critical threshold, we want to remain at the max PWM duty cycle until everything returns to nominal (this is what #2317, "Wait for nominal temperature", did, and we want to preserve that property)
  2. We want to shut the system down safely if running the fans at full blast fails to improve the thermal situation (this is what the 60-second overheat_timeout_ms is supposed to be for, and we don't want to get rid of that behavior)
  3. However, the rate at which we can reject heat decreases as inlet temperature increases (I think? I barely remember high-school physics class), so if it's warmer than a typical air-conditioned data center cold aisle, it will take us longer to reach a nominal temperature and transition back to the normal control regime. This may take longer than 60 seconds, which is the problem.
  4. Just increasing the timeout feels gross. All timeouts are arbitrary. I don't really have a convincing argument for why 120 seconds is meaningfully different from 60 seconds except "it would have stopped that specific Cosmo from shutting down while Alan was running his raw zvol test this morning".
  5. Instead, what we really want is to distinguish between "something got too hot, but things are actively getting better and so we shouldn't kill ourselves" and "no matter how much we spin our fans, we can't seem to make the computer cool off and we should therefore probably just call it".
  6. BONUS OBSERVATION: I don't really know that much about control theory and just kinda randomly twiddling the PID gains feels scary.

Based on this, this branch introduces a new state to the control loop: TURBO MODE!!!!!.

TURBO MODE!!!!! is an extension of the "overheating" control regime in which we continue to run the fans at full blast, but we don't continue counting down for sixty seconds and then kill ourselves. When a temperature reading goes over the device's critical threshold, we still enter the Overheated state, as we did previously. If we remain in Overheated for 60 seconds, we shut down the system; again, as we did previously. And, if any temperature hits the device's power_down_temperature threshold, we still power down the system immediately, as we did previously. But, we add a new transition out of ThermalControlState::Overheated: if we see that all monitored sensors have gone below their critical thresholds, we transition from the overheated state to TURBO MODE!!!!!!

In TURBO MODE!!!, we continue to run the fans at 100% duty cycle, but no longer track the timeout. If all temperatures return to nominal, we exit TURBO MODE!!!!!!!!! and return to normal control. If anything goes back above its critical threshold, we transition from TURBO MODE!!!! back to the "overheating" state, restarting the timeout. This way, if things are below critical but not yet nominal, we keep running at full fan speed for as long as it takes to get back to nominal, rather than giving up if it takes slightly longer than 60 seconds.
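
The transitions described above can be sketched as a standalone state machine. This is a deliberately simplified model: the real ThermalControlState variants carry sensor readings and other bookkeeping, and the names and signatures here are illustrative, not the actual Hubris code:

```rust
/// Simplified model of the control loop states, including the new Turbo
/// ("TURBO MODE!!!!!") state.
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Running,
    Overheated { start_ms: u64 },
    Turbo,
    Uncontrollable, // power the system off
}

const OVERHEAT_TIMEOUT_MS: u64 = 60_000;

/// One step of the transition logic, given summarized sensor readings:
/// whether anything is over power_down, whether anything is over critical,
/// and whether everything is at or below its target (nominal).
fn step(
    state: State,
    now_ms: u64,
    any_power_down: bool,
    any_critical: bool,
    all_nominal: bool,
) -> State {
    // A power_down threshold violation shuts us down from any state.
    if any_power_down {
        return State::Uncontrollable;
    }
    match state {
        State::Running if any_critical => State::Overheated { start_ms: now_ms },
        // All the way back to nominal: resume normal PID control.
        State::Overheated { .. } if all_nominal => State::Running,
        // Below critical but not yet nominal: full fans, no timeout.
        State::Overheated { .. } if !any_critical => State::Turbo,
        // Still over critical after 60 seconds: give up.
        State::Overheated { start_ms }
            if now_ms.saturating_sub(start_ms) > OVERHEAT_TIMEOUT_MS =>
        {
            State::Uncontrollable
        }
        // Back over critical: re-enter Overheated, restarting the timeout.
        State::Turbo if any_critical => State::Overheated { start_ms: now_ms },
        State::Turbo if all_nominal => State::Running,
        s => s,
    }
}
```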

Footnotes

  1. Or one Happy Fun AMD Tctl Badness Amount Unit.

  2. Who I'm not going to tag in this as he's on leave.

@hawkw hawkw changed the title TURBO MODE ENGAGE!!! thermal: TURBO MODE ENGAGE!!! Jan 21, 2026
@hawkw hawkw marked this pull request as ready for review January 21, 2026 23:33
@hawkw hawkw requested a review from labbott January 22, 2026 03:19
@labbott (Collaborator) left a comment:

Going to re-review in a bit, but I think this is a good approach. I feel like I want a state machine doc here showing the ThermalControlState transitions, though. We can go from Running -> Overheated but never Running -> Turbo. We go Overheated -> Running if we cool down fast, and Overheated -> Turbo if we need more time. We can go Running -> Uncontrollable, Overheated -> Uncontrollable, and Turbo -> Uncontrollable.

It does seem like we could potentially bounce between Overheated and Turbo indefinitely. Is there a point where we should lie down/try not to cry/cry a lot/give up if it seems like we will never reach nominal?

@hawkw (Member, Author) commented Jan 22, 2026

@labbott:

I feel like I want a state machine doc here showing The ThermalControlState transitions though.

here's a Mermaid diagram that's kinda gross-looking:

```mermaid
stateDiagram-v2
    [*] --> Running
    Running --> Overheat: any temp over critical
    Overheat --> Running: all temps nominal
    Overheat --> Turbo: all temps under critical
    Turbo --> Overheat: any temp over critical
    Turbo --> Running: all temps nominal
    Running --> Uncontrollable: any temp over power_down
    Overheat --> Uncontrollable: any temp over power_down
    Overheat --> Uncontrollable: 60 seconds elapse
    Turbo --> Uncontrollable: any temp over power_down
    Uncontrollable --> [*]
```

I may try to turn this into an ASCII graphic that can go in a source code comment.

@hawkw (Member, Author) commented Jan 22, 2026

It does seem like we could potentially bounce between Overheated and Turbo indefinitely. Is there a point where we should lie down/try not to cry/cry a lot/give up if it seems like we will never reach nominal?

Yeah, I've wondered about this. Since the way we ended up here was kind of a case of "timeouts, timeouts, always wrong, some too short and some too long", I was a little reluctant to add a second layer of timeouts to the state that was added specifically to work around the overly aggressive timeout.

Ideally, I think we would have a timeout that tracks how long we've spent over critical without reaching nominal: one that isn't reset when we transition from Overheat -> Turbo, but that only counts time spent in Overheat. The timeout would only be reset if we return to normal control, but would be paused while below critical thresholds. Implementing that is a little more annoying because it requires tracking the sum of time spent in Overheat as well as the current start time, but we could do that.
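
That accumulate-but-pause idea could be sketched roughly like this (all names here are hypothetical, invented for illustration):

```rust
/// Tracks total time spent over critical since we last ran normally.
/// The budget pauses (but is not reset) while in Turbo, and resets only
/// on a return to normal control.
struct OverheatBudget {
    /// Time accumulated in previous Overheat stints.
    accumulated_ms: u64,
    /// `Some(t)` while currently in Overheat, where `t` is when we entered.
    entered_overheat_ms: Option<u64>,
}

impl OverheatBudget {
    fn new() -> Self {
        Self { accumulated_ms: 0, entered_overheat_ms: None }
    }

    /// Called on the Running/Turbo -> Overheat transition.
    fn enter_overheat(&mut self, now_ms: u64) {
        self.entered_overheat_ms = Some(now_ms);
    }

    /// Called on the Overheat -> Turbo transition: pause, don't reset.
    fn leave_critical(&mut self, now_ms: u64) {
        if let Some(t) = self.entered_overheat_ms.take() {
            self.accumulated_ms += now_ms.saturating_sub(t);
        }
    }

    /// Called on return to normal PID control: reset the budget.
    fn return_to_normal(&mut self) {
        *self = Self::new();
    }

    /// Total time spent over critical since we last ran normally.
    fn total_ms(&self, now_ms: u64) -> u64 {
        let current = self
            .entered_overheat_ms
            .map_or(0, |t| now_ms.saturating_sub(t));
        self.accumulated_ms + current
    }
}
```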

I did also think about fancier things like trying to look at the rate of change, but that felt a little too complex. I think it's worth keeping the behavior in the "overheated" control regime simple, and I didn't wanna have to store a bunch of past measurements to track change over time for RAM reasons.

@labbott (Collaborator) commented Jan 22, 2026

Ideally, I think maybe we would have a timeout that tracks how long we spent over critical without reaching nominal that isn't reset when we transition from Overheat -> Turbo but that only counts time spent in Overheat. The timeout would only be reset if we return to normal control, but is paused while below critical thresholds. Implementing that is just a little bit more annoying because it requires tracking a sum of time spent in Overheat as well as the current start time, but we could do that.

Is there any way we could gather this data to have it available but not use it for control decisions? I guess technically if we're not over critical we're supposed to be "fine" but I'm trying to think of data to gather to make it easier to refine this loop. Maybe this is already logged in the counted ringbuf with the number of times we hit turbo?

@hawkw (Member, Author) commented Jan 22, 2026

Ideally, I think maybe we would have a timeout that tracks how long we spent over critical without reaching nominal that isn't reset when we transition from Overheat -> Turbo but that only counts time spent in Overheat. The timeout would only be reset if we return to normal control, but is paused while below critical thresholds. Implementing that is just a little bit more annoying because it requires tracking a sum of time spent in Overheat as well as the current start time, but we could do that.

Is there any way we could gather this data to have it available but not use it for control decisions? I guess technically if we're not over critical we're supposed to be "fine" but I'm trying to think of data to gather to make it easier to refine this loop. Maybe this is already logged in the counted ringbuf with the number of times we hit turbo?

Ah, whoops, 2923d0e already makes the change to count time against the timeout across transitions between Critical and Turbo; I can change that back to just logging the data if you'd prefer.

```rust
/// This gives us an opportunity to recover from overheating by running the
/// fans aggressively without also deciding to give up and kill ourselves
/// while things are improving but not fast enough.
Turbo {
```
@hawkw (Member, Author):
considered extremely unimportant, but: Y'know...I chose this name because I thought it was funny, but it occurs to me now that perhaps the existing colloquialism "fan party" (meaning: "when one or more compute sleds in the office goes MAX RPM") deserves to be enshrined in the code.

@labbott (Collaborator):
SorryForFanPartying

@labbott (Collaborator) left a comment:
I do like the refactor! I'm going to take another pass after the coffee kicks in some more.

hawkw added 2 commits January 23, 2026 10:37
Trying to make it clearer that both `Critical` and `FanParty` are
considered "overheated" but only one of them is over critical
thresholds.
```rust
///
/// - `Overheat`, in which at least one component is critical and the timeout
///   is being tracked, and
/// - `Turbo`, in which all temperatures are below critical, and we will run
```
@labbott (Collaborator):
Need to update the docs with FanParty here if that's what we're using

Comment on lines 1296 to 1307
```rust
    // If all temperatures have gone below critical, but are
    // still above nominal, stop the overheat timeout but
    // continue running at 100% PWM until things go below
    // nominal.
    let values = *values;
    self.record_leaving_critical(now_ms);
    self.state = ThermalControlState::FanParty { values };
    ringbuf_entry!(Trace::AutoState(self.get_state()));

    ControlResult::Pwm(PWMDuty(pwm as u8))
} else if now_ms > *start_time + self.overheat_timeout_ms {
    ControlResult::Pwm(PWMDuty(
        self.pid_config.max_output as u8,
    ))
```
@labbott (Collaborator):
Worth it to pull this out to a separate function for consistency?

@hawkw (Member, Author):
I went back and forth on it, since the other transition_to_whatever functions were factored out because they're called in multiple places, and this one isn't. But I agree that it's more consistent, so why not.
