sidecars powered off, possible thermal issue #2328

@Nieuwejaar

I'm not sure what's happening, or if this is even the right place to file the issue, but I don't have a better idea.

On Tuesday, I found that the sidecar on dublin's sled 16 had failed, apparently with a PCI error. I filed this as oxidecomputer/dendrite#173. I captured the scrimlet and dendrite state, but didn't look at the sidecar itself.

On Wednesday, I found both sidecars on madrid powered off. Will looked at one of the sidecar SPs and found a possible thermal issue:

humility: ring buffer task_thermal::__RINGBUF in thermal:
   TOTAL VARIANT
   45223 ControlPwm
      16 AutoState(Boot)
       4 AutoState(Running)
       1 AutoState(Overheated)
       1 AutoState(Uncontrollable)
      13 AddedDynamicInput
       8 FanAdded
       6 RemovedDynamicInput
       2 PowerModeChanged
       2 FanControllerInitialized
       1 Start
       1 ThermalMode(Auto)
       1 CriticalDueTo
       1 PowerDownAt
       1 SetFanWatchdogOk
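
For anyone triaging similar dumps, here is a minimal sketch of how the counts summary above could be scanned for the entries that point at a thermal shutdown. It is not part of humility. The function names and the `SUSPECT` set are hypothetical, chosen only from the variants visible in this one dump.

```python
import re

# Variants that, in the dump above, look related to a thermal shutdown.
# This set is an assumption for illustration, not something humility defines.
SUSPECT = {
    "AutoState(Overheated)",
    "AutoState(Uncontrollable)",
    "CriticalDueTo",
    "PowerDownAt",
}

# Matches summary rows of the form "<count> <variant>"; the banner and the
# "TOTAL VARIANT" header do not start with a number, so they are skipped.
ROW_RE = re.compile(r"^\s*(\d+)\s+(\S.*?)\s*$")

SAMPLE = """\
humility: ring buffer task_thermal::__RINGBUF in thermal:
   TOTAL VARIANT
   45223 ControlPwm
      16 AutoState(Boot)
       4 AutoState(Running)
       1 AutoState(Overheated)
       1 AutoState(Uncontrollable)
       1 CriticalDueTo
       1 PowerDownAt
"""


def parse_ringbuf_summary(text):
    """Return a {variant: count} dict parsed from a counts summary."""
    counts = {}
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            variant = m.group(2)
            counts[variant] = counts.get(variant, 0) + int(m.group(1))
    return counts


def thermal_suspects(counts):
    """Return only the variants listed in SUSPECT, with their counts."""
    return {k: v for k, v in counts.items() if k in SUSPECT}


if __name__ == "__main__":
    for variant, count in thermal_suspects(parse_ringbuf_summary(SAMPLE)).items():
        print(f"{count:>6} {variant}")
```

The summary only gives counts, so this can flag that an overheat and power-down were recorded but not which sensor triggered them. That would need the full ring buffer entries.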

Today (Friday) I tried to use london and again found both sidecars powered off:

03:09:55 castle:/data/local/env/dublin/nils$ echo $PILOT_RACK
london
03:10:05 castle:/data/local/env/dublin/nils$ pilot sp st BRM44220013
BRM44220013        off (A2)
03:16:25 castle:/data/local/env/dublin/nils$ pilot sp st BRM44220004
BRM44220004        off (A2)

I haven't found the humility archive yet, so I haven't looked any deeper.
