thermal: TURBO MODE ENGAGE!!! #2362
Conversation
labbott
left a comment
Going to re-review in a bit but I think this is a good approach. I feel like I want a state machine doc here showing the `ThermalControlState` transitions, though. We can go from Running -> Overheated but never go Running -> Turbo. We go Overheated -> Running if we cool down fast and go Overheated -> Turbo if we need more time. We can go Running -> Uncontrollable, Overheated -> Uncontrollable, and Turbo -> Uncontrollable.
It does seem like we could potentially bounce between Overheated and Turbo indefinitely. Is there a point where we should lie down/try not to cry/cry a lot/give up if it seems like we will never reach nominal?
Here's a Mermaid diagram that's kinda gross-looking:

```mermaid
stateDiagram-v2
    [*] --> Running
    Running --> Overheat: any temp over critical
    Overheat --> Running: all temps nominal
    Overheat --> Turbo: all temps under critical
    Turbo --> Overheat: any temp over critical
    Turbo --> Running: all temps nominal
    Running --> Uncontrollable: any temp over power_down
    Overheat --> Uncontrollable: any temp over power_down
    Overheat --> Uncontrollable: 60 seconds elapse
    Turbo --> Uncontrollable: any temp over power_down
    Uncontrollable --> [*]
```
I may try to turn this into an ASCII graphic that can go in a source code comment.
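For reference, here's a rough ASCII rendering of the same state machine, as it might appear in a source comment (content taken from the diagram above; not yet in the code):

```rust
//                any temp over critical
//    +---------+ ---------------------> +----------+
//    | Running |                        | Overheat |
//    +---------+ <--------------------- +----------+
//        ^          all temps nominal     |      ^
//        |                                |      |
//        |    all temps under critical    |      | any temp over
//        |                                v      | critical
//        |     all temps nominal       +-------+
//        +---------------------------- | Turbo |
//                                      +-------+
//
// From any of Running, Overheat, or Turbo, any temp over power_down goes to
// Uncontrollable; Overheat also goes to Uncontrollable after 60 seconds
// elapse.
```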
Yeah, I've wondered about this. Since the way we ended up here was kind of a case of "timeouts, timeouts, always wrong, some too short and some too long", I was a little reluctant to add a second layer of timeouts to the state that was added specifically to work around the overly aggressive timeout. Ideally, I think maybe we would have a timeout that tracks how long we spent over critical without reaching nominal, and that isn't reset when we transition from `Overheated` to `Turbo`.

I did also think about fancier things like trying to look at the rate of change, but that felt a little too complex. I think it's worth keeping the behavior in the "overheated" control regime simple, and I didn't wanna have to store a bunch of past measurements to track change over time for RAM reasons.
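As a purely illustrative sketch of that idea (none of these names exist in the thermal task), a budget that accumulates time spent over critical and is only cleared on a full return to nominal might look like:

```rust
/// Illustrative only, not the actual thermal task types: accumulates time
/// spent with any sensor over its critical threshold, so the budget is not
/// reset by bouncing between the overheated and full-fan states.
struct OverheatBudget {
    /// Milliseconds accumulated while any temperature was over critical.
    over_critical_ms: u64,
    /// Timestamp of the previous control-loop tick, in milliseconds.
    last_tick_ms: u64,
}

impl OverheatBudget {
    fn new(now_ms: u64) -> Self {
        Self { over_critical_ms: 0, last_tick_ms: now_ms }
    }

    /// Called once per control-loop tick.
    fn update(&mut self, now_ms: u64, any_over_critical: bool, all_nominal: bool) {
        if any_over_critical {
            self.over_critical_ms += now_ms.saturating_sub(self.last_tick_ms);
        }
        if all_nominal {
            // Only a full return to nominal clears the budget.
            self.over_critical_ms = 0;
        }
        self.last_tick_ms = now_ms;
    }

    /// True once we've spent the whole timeout budget over critical.
    fn exhausted(&self, timeout_ms: u64) -> bool {
        self.over_critical_ms >= timeout_ms
    }
}
```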
Is there any way we could gather this data to have it available but not use it for control decisions? I guess technically if we're not over critical we're supposed to be "fine" but I'm trying to think of data to gather to make it easier to refine this loop. Maybe this is already logged in the counted ringbuf with the number of times we hit turbo?
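One purely illustrative way to gather that kind of data without feeding it back into control decisions (all names here are made up, separate from the existing trace ringbuf):

```rust
/// Illustrative only: counts how often and for how long we run in the
/// full-fan state, purely for later inspection, never as a control input.
#[derive(Default)]
struct FullFanStats {
    /// Number of times we've entered the full-fan state.
    entries: u32,
    /// Longest single stay in the full-fan state, in milliseconds.
    longest_stay_ms: u64,
    /// Timestamp of the most recent entry, if we're currently in the state.
    entered_at_ms: Option<u64>,
}

impl FullFanStats {
    fn on_enter(&mut self, now_ms: u64) {
        self.entries += 1;
        self.entered_at_ms = Some(now_ms);
    }

    fn on_exit(&mut self, now_ms: u64) {
        if let Some(start) = self.entered_at_ms.take() {
            self.longest_stay_ms =
                self.longest_stay_ms.max(now_ms.saturating_sub(start));
        }
    }
}
```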
Ah, whoops, 2923d0e just makes the change so that it counts against the timeout across transitions between `Overheated` and `Turbo`.
task/thermal/src/control.rs (outdated)

```rust
/// This gives us an opportunity to recover from overheating by running the
/// fans aggressively without also deciding to give up and kill ourselves
/// while things are improving but not fast enough.
Turbo {
```
considered extremely unimportant, but: Y'know...I chose this name because I thought it was funny, but it occurs to me now that perhaps the existing colloquialism "fan party" (meaning: "when one or more compute sleds in the office goes MAX RPM") deserves to be enshrined in the code.
SorryForFanPartying
labbott
left a comment
I do like the refactor! I'm going to take another pass after the coffee kicks in some more.
Trying to make it clearer that both `Critical` and `FanParty` are considered "overheated" but only one of them is over critical thresholds.
task/thermal/src/control.rs (outdated)

```rust
///
/// - `Overheat`, in which at least one component is critical and the timeout
///   is being tracked, and
/// - `Turbo`, in which all temperatures are below critical, and we will run
```
Need to update the docs with FanParty here if that's what we're using
task/thermal/src/control.rs (outdated)

```rust
// If all temperatures have gone below critical, but are
// still above nominal, stop the overheat timeout but
// continue running at 100% PWM until things go below
// nominal.
let values = *values;
self.record_leaving_critical(now_ms);
self.state = ThermalControlState::FanParty { values };
ringbuf_entry!(Trace::AutoState(self.get_state()));
```

```rust
    ControlResult::Pwm(PWMDuty(pwm as u8))
} else if now_ms > *start_time + self.overheat_timeout_ms {
    ControlResult::Pwm(PWMDuty(
        self.pid_config.max_output as u8,
    ))
```
Worth it to pull this out to a separate function for consistency?
I went back and forth on it, since the other transition_to_whatever functions were factored out because they got called in multiple places, and this one wasn't. But, I agree that it's more consistent, so why not.
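For illustration, factoring a transition like this into its own helper might look roughly as follows, using the `FanParty` transition from the hunk above as the example (the function name and the `Readings` parameter type are guesses, not taken from the actual change):

```rust
// Sketch only: the helper name and parameter type are assumed. This just
// moves the FanParty transition from the hunk above into one place, mirroring
// the other transition_to_* helpers.
fn transition_to_fan_party(&mut self, now_ms: u64, values: Readings) {
    self.record_leaving_critical(now_ms);
    self.state = ThermalControlState::FanParty { values };
    ringbuf_entry!(Trace::AutoState(self.get_state()));
}
```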
The Problem
We have recently been seeing a number of Cosmo-Bs being shut down by the thermal control loop under load, and we would really prefer they didn't do this. This has generally occurred due to the machine suddenly starting an intensive CPU and NVMe workload (initializing a large number of raw zvols) while inlet air temperature is uncomfortable but within our specified range of 35º F (2º C) to 95º F (35º C): ambient temperatures in "the fridge" in Oxide EMY have been between 75º F and 90º F when thermal shutdowns have occurred. In all observed cases of these thermal shutdowns, we have not seen CPU Tctl reach the thresholds of 95 Magic AMD Thermal Sadness Units (at which point the CPU throttles itself) or 115 (at which point `THERMTRIP_L` is asserted). Instead, these systems have shut down due to a timeout in the thermal control loop, which limits the amount of time the system spends in the `Overheated` state to 60 seconds. See #2361 for details on the observed pathology.

Background
To elaborate a bit here, the Hubris thermal control loop operates in two primary control regimes: the "normal" regime (represented by `ThermalControlState::Running`, in which fan speeds are governed by PID control with temperature sensor readings as the process variable); and the "overheating" regime (represented by `ThermalControlState::Overheated`, in which we operate fans at 100% of their PWM duty cycle until we return to the normal control regime).

For each device whose temperature measurements influence the control loop (on Cosmo, the CPU Tctl, the M.2 and U.2 SSDs, and the T6 NIC), we define three temperature thresholds: `target_temperature`, `critical_temperature`, and `power_down_temperature`. These impact the behavior of the control loop as follows:

- if any temperature exceeds its `power_down_temperature`, the thermal control loop powers the system off immediately,
- if any temperature exceeds its `critical_temperature` threshold, we transition to the "overheating" control regime,
- and we remain in the "overheating" regime until every temperature has gone back below its `target_temperature` (i.e., it has returned to nominal).

There's one additional factor, which will become important shortly: when we enter the "overheating" control regime, we start a timeout for `overheat_timeout_ms`, which is currently always set to 60 seconds. If 60 seconds have elapsed without returning to nominal, the thermal control loop powers the system off. In effect, if the thermal loop feels that things have not gotten better fast enough, it decides to kill itself immediately.

The present behavior was introduced in PR #2317. Prior to that change, we would exit the "overheating" control regime as soon as all temperatures were at least one degree[^1] below that device's critical threshold, and return to normal control. The effect of this change is that we now remain in the "overheating" regime for longer. Perhaps a bit unintuitively, though, the intent (and, I believe, the effect!) of this change was to reduce the likelihood of thermal shutdowns: when we return to normal PID control after dropping 1 degree under critical, the fan speed will go down and the system will immediately start overheating again. In this rapid oscillation between control regimes, we didn't spend enough time at 100% PWM to cool down sufficiently, and something would hit its `power_down_temperature` threshold.
power_down_temperaturethreshold.However, the other impact of #2317 is that the system spends a lot more time in the control regime in which it considers itself to be "overheating", and does not stop the "guess I should kill myself" timeout until temperatures have decreased by a much greater margin than was required prior. Matt2 explicitly calls this out in PR:
That last sentence --- "we really should be able to get from critical → target temperature in 60 seconds" --- is demonstrably not the case in situations when the inlet air temperature is as hot as it's been in the fridge for the last couple days. Looking at data collected from an overheating Cosmo this morning, we see that once we transition to the "overheating" control regime, the CPU Tctl values are steadily decreasing towards the threshold value...just not fast enough to make it there in less than a minute. See this comment for details.
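To make the threshold ordering described above concrete, here's a small self-contained sketch (the type and enum names are invented for illustration; the real Hubris thermal types differ):

```rust
/// Illustrative per-device thresholds, named after the ones described above.
struct Thresholds {
    target_temperature: f32,
    critical_temperature: f32,
    power_down_temperature: f32,
}

/// What the control loop does about a single reading, per the rules above.
enum Action {
    /// Power the system off immediately.
    PowerDown,
    /// Enter (or remain in) the "overheating" regime: fans at 100% PWM.
    Overheating,
    /// Below critical, but not yet back to nominal.
    AboveTarget,
    /// At or below the target temperature: normal PID control.
    Nominal,
}

fn classify(temp_c: f32, t: &Thresholds) -> Action {
    if temp_c >= t.power_down_temperature {
        Action::PowerDown
    } else if temp_c >= t.critical_temperature {
        Action::Overheating
    } else if temp_c > t.target_temperature {
        Action::AboveTarget
    } else {
        Action::Nominal
    }
}
```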
Okay, that was a lot. But now we should have enough context to discuss...
The (Possible) Solution
This change is based on the following observations:

- if any temperature goes over its `power_down_temperature`, we should still power the system off immediately;
- if we stay over critical and can't cool back down within the timeout, we should still give up and shut down (that's what `overheat_timeout_ms` is supposed to be for, and we don't want to get rid of that behavior);
- but if everything is below critical and improving, just not quickly, shutting ourselves down is not the right answer.

Based on this, this branch introduces a new state to the control loop: TURBO MODE!!!!!.
TURBO MODE!!!!! is an extension of the "overheating" control regime in which we continue to run the fans at full blast, but we don't continue counting down for sixty seconds and then kill ourselves. When a temperature reading goes over the device's critical threshold, we still enter the `Overheated` state, as we did previously. If we remain in `Overheated` for 60 seconds, we shut down the system; again, as we did previously. And, if any temperature hits the device's `power_down_temperature` threshold, we still power down the system immediately, as we did previously. But, we add a new transition out of `ThermalControlState::Overheated`: if we see that all monitored sensors have gone below their critical thresholds, we transition from the overheated state to TURBO MODE!!!!!!

In TURBO MODE!!!, we continue to run the fans at 100% duty cycle, but no longer track the timeout. If all temperatures return to nominal, we exit TURBO MODE!!!!!!!!! and return to normal control. If anything goes back above its critical threshold, we transition from TURBO MODE!!!! back to the "overheating" state, restarting the timeout. This way, if things are below critical but not yet nominal, we keep running at full fan speed for as long as it takes to get back to nominal, rather than giving up if it takes slightly longer than 60 seconds.
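For clarity, here's a compact, self-contained sketch of the transitions just described (names and function shape are simplified illustrations, not the actual control-loop code; the `Turbo` state was later renamed `FanParty` during review):

```rust
/// Simplified states for illustration; the real enum also carries sensor data.
#[derive(Copy, Clone)]
enum State {
    Running,
    Overheated { start_time_ms: u64 },
    Turbo,
    Uncontrollable,
}

/// One step of the transition logic described above; `overheat_timeout_ms`
/// is 60,000 in the current configuration.
fn step(
    state: State,
    now_ms: u64,
    any_over_power_down: bool,
    any_over_critical: bool,
    all_nominal: bool,
    overheat_timeout_ms: u64,
) -> State {
    if any_over_power_down {
        // Any reading over its power_down threshold shuts us down, always.
        return State::Uncontrollable;
    }
    match state {
        State::Running if any_over_critical => {
            State::Overheated { start_time_ms: now_ms }
        }
        State::Overheated { .. } if all_nominal => State::Running,
        State::Overheated { start_time_ms }
            if now_ms > start_time_ms + overheat_timeout_ms =>
        {
            // 60 seconds in Overheated without relief: give up.
            State::Uncontrollable
        }
        State::Overheated { .. } if !any_over_critical => State::Turbo,
        State::Turbo if all_nominal => State::Running,
        State::Turbo if any_over_critical => {
            // Back over critical: re-enter Overheated and restart the timeout.
            State::Overheated { start_time_ms: now_ms }
        }
        other => other,
    }
}
```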
Footnotes

[^1]: Or one Happy Fun AMD Tctl Badness Amount Unit.

[^2]: Who I'm not going to tag in this as he's on leave.