UPS Power Manager (ESXi & iDRAC)

🌍 English | 日本語 | 简体中文

This is a UPS power management solution built on Ubuntu Server and the official CyberPower PowerPanel command-line tool (pwrstat). It is designed to monitor a CyberPower OLS1000E series UPS and gracefully shut down ESXi hosts and their virtual machines during a power outage.

Features

  • Hardware Decoupling: The monitoring device (a ThinkPad T450 laptop) can run from its built-in battery independently of the UPS, so the monitoring logic keeps running even if the UPS runs out of power.
  • Enterprise-grade Native Driver: Uses the official CyberPower pwrstatd daemon to handle alerts, avoiding the instability that poorly adapted protocol support in open-source polling tools can cause.
  • Power Outage Reconfirmation Mechanism: After an outage is reported, the script waits for a configurable delay (default 60s) and calls pwrstat repeatedly during that window to confirm the outage. If utility power is restored in the meantime, the shutdown is aborted automatically.
  • Full Lifecycle Logging: Every outage is recorded with structured [LIFECYCLE:*] events, UPS status snapshots, and the outage duration, written to both a log file and the systemd journal.
  • Dual-insurance Hardware Intervention:
    1. Primary method: SSH into ESXi to gracefully shut down the guest operating systems (ACPI shutdown), then shut down ESXi itself.
    2. Ultimate measure: If SSH fails or does not respond, power the server off via IPMI (iDRAC), first with a soft power-off and then, if that does not work, with a hard power-off, ensuring the server hardware is powered down.
  • Power Restoration Detection: The ups-power-monitor service continuously polls the utility power status. When power is restored, it records the outage duration and writes it to the logs.
  • Auto Power-on: Sends a power-on command via IPMI (iDRAC) automatically when power is restored, then waits for ESXi to boot and exits maintenance mode, giving a fully unattended recovery.
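
The reconfirmation mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the project's shutdown.sh: the function names, CHECK_INTERVAL, and STATUS_CMD are hypothetical, and the sketch assumes pwrstat -status prints a line of the form "Power Supply by.............. Utility Power" while on mains.

```shell
#!/bin/sh
# Hypothetical sketch: names and defaults are assumptions, not the project's code.
SHUTDOWN_DELAY="${SHUTDOWN_DELAY:-60}"       # total reconfirmation window, seconds
CHECK_INTERVAL="${CHECK_INTERVAL:-5}"        # seconds between polls
STATUS_CMD="${STATUS_CMD:-pwrstat -status}"  # command that prints the UPS status

# Succeeds when the status text reports utility power.
# Assumes a line like "Power Supply by.............. Utility Power".
on_utility_power() {
    $STATUS_CMD 2>/dev/null | grep -q 'Power Supply by.*Utility Power'
}

# Returns 0 if the outage lasted the whole window, 1 if power came back.
confirm_outage() {
    elapsed=0
    while [ "$elapsed" -lt "$SHUTDOWN_DELAY" ]; do
        if on_utility_power; then
            echo "Utility power restored after ${elapsed}s, aborting shutdown"
            return 1
        fi
        sleep "$CHECK_INTERVAL"
        elapsed=$((elapsed + CHECK_INTERVAL))
    done
    echo "Outage confirmed after ${SHUTDOWN_DELAY}s"
    return 0
}
```

A caller would continue with the shutdown sequence only when confirm_outage returns 0. Making the status command replaceable keeps the loop testable without a UPS attached.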

System Architecture

graph TB
    subgraph S_AC["Utility Power"]
        AC["🔌 Utility AC 220V"]
    end

    subgraph S_UPS["CyberPower OLS1000E"]
        UPS_IN["AC Input"]
        UPS_BAT["🔋 UPS Battery"]
        UPS_OUT["AC Output"]
        UPS_USB["USB Communication"]
    end

    subgraph S_T450["ThinkPad T450 - Monitor"]
        T450_BAT["🔋 Laptop Battery"]
        T450_PWR["⚡ T450 Power"]
        PWRSTATD["pwrstatd Daemon"]
        LOGGER["power-event-logger.sh"]
        SHUTDOWN["shutdown.sh"]
        MONITOR["power-restore-monitor.sh"]
        HEALTH["healthcheck.sh"]
    end

    subgraph S_Dell["Dell R720XD - Protected Node"]
        DELL_PWR["⚡ Dell Power"]
        ESXI["VMware ESXi"]
        VMs["Virtual Machines"]
        IDRAC["iDRAC 8 Express"]
    end

    AC -->|Power Supply| UPS_IN
    UPS_IN --> UPS_OUT
    UPS_BAT -.->|Switch on Outage| UPS_OUT
    UPS_OUT -->|Power Supply| T450_PWR
    UPS_OUT -->|Power Supply| DELL_PWR
    UPS_USB -->|USB Status Comms| PWRSTATD
    T450_BAT -.->|Independent Power| T450_PWR

    PWRSTATD -->|pwrfail Event| LOGGER
    LOGGER -->|Trigger| SHUTDOWN
    MONITOR -->|Poll for Restore| LOGGER

    SHUTDOWN -->|SSH Graceful Shutdown| ESXI
    SHUTDOWN -->|IPMI Hard Power Off| IDRAC
    ESXI --> VMs

    LOGGER -->|Power On via IPMI| IDRAC

    style AC fill:#f9c74f,stroke:#f48c06,color:#000
    style UPS_IN fill:#90be6d,stroke:#43aa8b,color:#000
    style UPS_OUT fill:#90be6d,stroke:#43aa8b,color:#000
    style T450_PWR fill:#577590,stroke:#277da1,color:#fff
    style DELL_PWR fill:#f94144,stroke:#e63946,color:#fff
    style IDRAC fill:#f94144,stroke:#e63946,color:#fff

Shutdown Flowchart

flowchart TD
    START(["⚡ Power Outage"])
    START --> PWRSTATD["pwrstatd Detects Outage"]
    PWRSTATD -->|Delay 60s| LOGGER["power-event-logger.sh"]
    LOGGER --> SESSION["Generate Session ID & UPS Snapshot"]
    SESSION --> SD["shutdown.sh Starts"]

    SD --> DELAY{"Outage Reconfirmation Wait Ns"}
    DELAY -->|Power Restored| CANCEL(["✅ Abort Shutdown"])
    DELAY -->|Still Outage| SSH_CHECK{"ESXi SSH Avail?"}

    SSH_CHECK -->|Reachable| GET_VMS["Get Running VM List"]
    SSH_CHECK -->|Unreachable 120s Timeout| IPMI_SOFT

    GET_VMS --> SHUTDOWN_VMS["Shutdown VMs Iteratively"]
    SHUTDOWN_VMS --> WAIT_VMS["Wait 120s for VMs to stop"]
    WAIT_VMS --> MAINT["Enter Maintenance Mode"]
    MAINT --> SSH_ESXI["SSH Shutdown ESXi"]
    SSH_ESXI --> WAIT60["Wait 60s"]

    style MAINT fill:#f9c74f,stroke:#f48c06,color:#000
    WAIT60 --> CHECK_OFF{"IPMI Check Off?"}

    CHECK_OFF -->|Off| DONE(["✅ Shutdown Complete"])
    CHECK_OFF -->|Still Running| IPMI_SOFT

    IPMI_SOFT["IPMI Soft Power Off"]
    IPMI_SOFT --> WAIT_SOFT{"Check after Timeout"}
    WAIT_SOFT -->|Off| DONE
    WAIT_SOFT -->|Still Running| IPMI_HARD

    IPMI_HARD["⛔ IPMI Hard Power Off"]
    IPMI_HARD --> VERIFY["Verify Shutdown Status"]
    VERIFY --> DONE

    style START fill:#e63946,stroke:#d62828,color:#fff
    style CANCEL fill:#2a9d8f,stroke:#264653,color:#fff
    style DONE fill:#2a9d8f,stroke:#264653,color:#fff
    style IPMI_HARD fill:#e76f51,stroke:#e63946,color:#fff
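
The IPMI fallback branch of the flowchart (status check, soft power-off, hard power-off) could look something like the sketch below. It is an assumption-laden illustration: the variable names, the example host, and the timeouts are hypothetical. It relies only on the standard ipmitool chassis power status/soft/off subcommands and on the -E option, which reads the password from the IPMI_PASSWORD environment variable.

```shell
#!/bin/sh
# Sketch only: variable names, host, and timeouts are assumptions.
IPMI_CMD="${IPMI_CMD:-ipmitool -I lanplus -H ${IDRAC_HOST:-idrac.example} -U ${IDRAC_USER:-root} -E}"
POLL="${POLL:-5}"                    # seconds between status checks
SOFT_TIMEOUT="${SOFT_TIMEOUT:-120}"  # how long to wait after a soft power-off
HARD_TIMEOUT="${HARD_TIMEOUT:-30}"   # how long to wait after a hard power-off

ipmi_power() { $IPMI_CMD chassis power "$1"; }

# ipmitool prints "Chassis Power is off" once the host is down.
is_off() { ipmi_power status 2>/dev/null | grep -qi 'power is off'; }

# wait_off SECONDS: poll until the host is off or the time is up.
wait_off() {
    waited=0
    while [ "$waited" -lt "$1" ]; do
        is_off && return 0
        sleep "$POLL"
        waited=$((waited + POLL))
    done
    is_off
}

# Prints which step powered the host down: already-off, soft, or hard.
escalate_power_off() {
    if is_off; then echo "already-off"; return 0; fi
    ipmi_power soft >/dev/null 2>&1
    if wait_off "$SOFT_TIMEOUT"; then echo "soft"; return 0; fi
    ipmi_power off >/dev/null 2>&1
    if wait_off "$HARD_TIMEOUT"; then echo "hard"; return 0; fi
    echo "failed"
    return 1
}
```

Each step is verified before escalating, so the hard power-off is only issued when the soft request demonstrably did not work.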

Full Lifecycle Sequence Diagram

sequenceDiagram
    participant AC as Utility Power
    participant UPS as UPS
    participant PD as pwrstatd
    participant LOG as event-logger
    participant SD as shutdown.sh
    participant ESXI as ESXi
    participant VM as VM
    participant IDRAC as iDRAC
    participant MON as restore-monitor

    Note over AC,MON: Phase 1 Outage Detection
    AC-xUPS: Utility Power Loss
    UPS->>UPS: Switch to Battery
    UPS->>PD: USB: Power Failure
    PD->>PD: Wait 60s
    PD->>LOG: pwrfail Event
    LOG->>LOG: Generate Session ID
    LOG->>LOG: Record UPS Snapshot
    LOG->>SD: Invoke shutdown.sh

    Note over AC,MON: Phase 2 Delay Reconfirmation
    loop Every 5s for Ns
        SD->>PD: pwrstat -status
        alt Power Restored
            SD-->>SD: Abort Shutdown
        else Still Outage
            SD->>SD: Continue Waiting
        end
    end

    Note over AC,MON: Phase 3 Graceful Shutdown
    SD->>ESXI: SSH Connectivity Check
    SD->>ESXI: SSH get VM list
    ESXI-->>SD: Return VM list

    loop Every VM
        SD->>ESXI: SSH Shutdown VM
        ESXI->>VM: ACPI Shutdown
        VM-->>ESXI: Shutdown Complete
    end

    SD->>SD: Wait for VMs to close (120s)
    SD->>ESXI: Enter Maintenance Mode
    ESXI-->>SD: Maintenance Mode Enabled
    SD->>ESXI: SSH Shutdown ESXi
    SD->>SD: Wait 60s

    SD->>IDRAC: IPMI power status
    alt powered off
        IDRAC-->>SD: off
    else Still running
        SD->>IDRAC: IPMI power soft
        SD->>SD: Wait timeout
        alt Still not off
            SD->>IDRAC: IPMI power off (Hard power cut)
        end
    end

    Note over AC,MON: Phase 4 Power Restoration
    AC-->>UPS: Power Restored
    UPS->>UPS: Switch to AC Power

    loop 10s poll
        MON->>PD: pwrstat -status
        PD-->>MON: Normal
    end

    MON->>LOG: active event
    LOG->>LOG: Record Outage Duration
    LOG->>IDRAC: IPMI power on
    IDRAC->>ESXI: Server Boots

    Note over LOG,ESXI: Phase 5 Exit Maintenance Mode
    loop Wait up to 5 min for ESXi SSH
        LOG->>ESXI: SSH Connectivity Check
    end
    LOG->>ESXI: Check Maintenance Mode
    ESXI-->>LOG: Enabled
    LOG->>ESXI: Exit Maintenance Mode
    ESXI-->>LOG: Disabled
    ESXI->>VM: VM Auto-Start
    LOG->>LOG: MONITORING_RESUMED
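
Phase 5 of the sequence (wait for ESXi SSH, check maintenance mode, exit it) can be approximated with a generic retry helper. This is a sketch under assumptions: wait_until, ESXI_SSH, and the example host are hypothetical names, and key-based SSH is assumed here for brevity even though the project lists sshpass as a dependency. The esxcli system maintenanceMode get/set commands are the standard ESXi ones.

```shell
#!/bin/sh
# Sketch only: helper names and the SSH invocation are assumptions.
ESXI_SSH="${ESXI_SSH:-ssh -o BatchMode=yes -o ConnectTimeout=5 root@${ESXI_HOST:-esxi.example}}"

# wait_until TIMEOUT INTERVAL CMD...: retry CMD until it succeeds or
# roughly TIMEOUT seconds have passed.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    until "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Wait up to 5 minutes for SSH, then leave maintenance mode if it is enabled.
exit_maintenance_mode() {
    wait_until 300 10 $ESXI_SSH true || { echo "ESXi SSH not reachable" >&2; return 1; }
    if $ESXI_SSH esxcli system maintenanceMode get | grep -q Enabled; then
        $ESXI_SSH esxcli system maintenanceMode set --enable false
    fi
}
```

Checking the mode before changing it keeps the step safe to rerun if the host was never placed in maintenance mode.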

Hardware and Software Requirements

  • Monitoring Node: ThinkPad T450 laptop running Ubuntu 22.04, with the BIOS configured to power on when AC is restored and system suspend/sleep disabled, so it runs 24x7.
  • Protected Node: Dell PowerEdge R720XD (equipped with an iDRAC controller, IPMI over LAN enabled), running VMware ESXi.
  • UPS: CyberPower OLS1000E, connected to the T450 via USB.
  • Environment Dependencies:
    • CyberPower PowerPanel for Linux (PPL): Must be installed in advance with dpkg (sudo dpkg -i CyberPower_PPL_Linux...deb) and confirmed to be running.
    • ipmitool, sshpass
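
Before installing, it can help to confirm that the tools listed above are actually on PATH. The helper below is a small hypothetical sketch (missing_deps is not part of the project) that uses only the POSIX command -v builtin.

```shell
#!/bin/sh
# Hypothetical preflight helper, not part of the project.
# Prints every command from the argument list that is not found on PATH.
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example use:
#   missing=$(missing_deps pwrstat ipmitool sshpass ssh)
#   [ -z "$missing" ] || { echo "Missing: $missing" >&2; exit 1; }
```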

Project Structure

ups-power-manager/
├── install.sh                          # One-click install script
├── config/
│   ├── config.env                      # Main configuration (ESXi / iDRAC / parameters)
│   └── logrotate.conf                  # Log rotation config (weekly, keeps 12 weeks)
└── scripts/
    ├── shutdown.sh                     # Core shutdown script (triggered on outage)
    ├── healthcheck.sh                  # Health check script (runs hourly)
    ├── power-event-logger.sh           # Outage/Restoration event logger
    ├── power-restore-monitor.sh        # AC restoration polling monitor (resident service)
    └── laptop-setup.sh                 # Laptop power management setup

Quick Deployment

  1. Clone this repository to your Ubuntu monitoring node:

    git clone https://github.com/erocat/ups-power-manager.git
    cd ups-power-manager
  2. Ensure the UPS USB cable is connected, then verify that pwrstat can read the UPS status:

    sudo pwrstat -status
  3. Edit config/config.env with your ESXi address, iDRAC address, and credentials; adjust SHUTDOWN_DELAY (seconds to wait after an outage) as needed.

  4. Execute the one-click installer with root privileges (automates deployment, sets up pwrstat events, registers systemd services):

    sudo ./install.sh
  5. (First-time deployment) Disable sleep on the monitoring node, then install the logrotate configuration and the restore monitor service:

    sudo /opt/ups-power-manager/laptop-setup.sh
    sudo cp config/logrotate.conf /etc/logrotate.d/ups-power-manager
    # install.sh registers power-restore-monitor automatically; to enable it manually:
    sudo systemctl enable --now ups-power-monitor.service

    ⚠️ You must manually set After Power Loss → Power On in the BIOS so the laptop boots automatically when AC is restored.
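
Since a wrong or empty value in config.env only shows up during a real outage, a sanity check after editing the file may be worthwhile. The sketch below is illustrative only: apart from SHUTDOWN_DELAY, the variable names (ESXI_HOST, ESXI_USER, IDRAC_HOST, IDRAC_USER) are assumptions and may not match the keys in the actual config file.

```shell
#!/bin/sh
# Hypothetical variable names; only SHUTDOWN_DELAY is named in this README.
REQUIRED_VARS="ESXI_HOST ESXI_USER IDRAC_HOST IDRAC_USER SHUTDOWN_DELAY"

# Returns 0 when every required variable is set and SHUTDOWN_DELAY is numeric.
validate_config() {
    rc=0
    for name in $REQUIRED_VARS; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "config: $name is not set" >&2
            rc=1
        fi
    done
    case "${SHUTDOWN_DELAY:-}" in
        ''|*[!0-9]*)
            echo "config: SHUTDOWN_DELAY must be a whole number of seconds" >&2
            rc=1
            ;;
    esac
    return "$rc"
}

# Example use:
#   . /etc/ups-power-manager/config.env && validate_config
```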

Maintenance and Testing

Outage Simulation Testing (From Safe to Real)

Level 1: Dry-Run (Zero risk, recommended first step)

sudo /opt/ups-power-manager/shutdown.sh --dry-run

Sends no shutdown commands; it only verifies SSH/IPMI connectivity and the script logic.
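
One common way to implement such a dry-run mode is to route every state-changing command through a small wrapper while leaving read-only checks untouched. The sketch below shows that pattern; the run function and the DRY_RUN variable are hypothetical and are not taken from shutdown.sh.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper, shown as a pattern only.
DRY_RUN="${DRY_RUN:-0}"

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "[DRY-RUN] $*"
    else
        "$@"
    fi
}

# Read-only checks (connectivity, power status) are called directly;
# only destructive steps go through run, for example:
#   run ipmitool -I lanplus -H "$IDRAC_HOST" -U "$IDRAC_USER" -E chassis power soft
```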

Level 2: pwrstat Software Simulation (Low risk)

sudo pwrstat -test

Triggers a simulated outage event while the UPS keeps supplying power. The script enters the delay wait; within 30s pwrstat returns to Normal on its own, so the shutdown is aborted and nothing is actually shut down.

Level 3: True Power Cut Test (⚠️ Make sure server data is saved before the rehearsal)

Unplug the UPS's AC power cord to trigger the full real-outage protection sequence.

Querying Logs

| Log | Path/Command | Description |
| --- | --- | --- |
| Shutdown Sequence Log | tail -f /var/log/ups-shutdown.log | Complete shutdown steps with SESSION_ID |
| Power Event Log | tail -f /var/log/ups-power-events.log | Outage/Restore snapshots, durations |
| systemd journal | journalctl -t ups-power-manager -f | All events, supports time filtering |

Query full flow by Outage Session:

# 1. First find Session ID (Format: OUTAGE-YYYYMMDD-HHMMSS)
grep "LIFECYCLE:POWER_FAILURE" /var/log/ups-power-events.log

# 2. Filter complete flow by Session ID
grep "OUTAGE-20260305-065407" /var/log/ups-shutdown.log

Lifecycle Event Tags:

LIFECYCLE:POWER_FAILURE_DETECTED   Outage Detected
LIFECYCLE:SHUTDOWN_START           Shutdown Process Started
LIFECYCLE:DELAY_CHECK_START/DONE   Delay Confirmation Phase
LIFECYCLE:VM_SHUTDOWN_START/DONE   VM Shutdown Phase
LIFECYCLE:ESXI_SSH_SHUTDOWN_START  ESXi SSH Shutdown (includes maint mode)
LIFECYCLE:IPMI_SOFT_START/FAILED   IPMI Soft (Degraded path)
LIFECYCLE:IPMI_HARD_START          IPMI Hard (Ultimate Measure)
LIFECYCLE:SHUTDOWN_END             Process Finished (incl. duration)
LIFECYCLE:POWER_RESTORED           Utility Power Restored
LIFECYCLE:OUTAGE_DURATION          Current Outage Duration Time
LIFECYCLE:AUTO_POWER_ON            IPMI Auto Power On
LIFECYCLE:EXIT_MAINTENANCE         Exit ESXi Maint Mode (wait SSH→check→exit)
LIFECYCLE:MONITORING_RESUMED       Resume Normal Monitoring
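
To make the tag scheme concrete, here is a rough sketch of how a session ID and a tagged log line might be produced and sent to both the log file and the journal. The line layout, the function names, and the use of UTC are assumptions; the real scripts may format entries differently. logger -t is the standard way to write to syslog/journal under a given tag.

```shell
#!/bin/sh
# Sketch only: line format and function names are assumptions.
LOG_FILE="${LOG_FILE:-/var/log/ups-power-events.log}"
JOURNAL_TAG="ups-power-manager"

# Session ID in the OUTAGE-YYYYMMDD-HHMMSS format (UTC is an assumption).
new_session_id() { date -u +OUTAGE-%Y%m%d-%H%M%S; }

# log_event TAG MESSAGE: append one structured line to the file and the journal.
log_event() {
    line="$(date -u +%Y-%m-%dT%H:%M:%SZ) [${SESSION_ID:-NO-SESSION}] [LIFECYCLE:$1] $2"
    echo "$line" >> "$LOG_FILE"
    if command -v logger >/dev/null 2>&1; then
        logger -t "$JOURNAL_TAG" "$line" 2>/dev/null || true
    fi
}

# Example use:
#   SESSION_ID=$(new_session_id)
#   log_event POWER_FAILURE_DETECTED "UPS switched to battery"
```

Putting the session ID on every line is what makes the grep-by-session query shown earlier work across both log files.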

Take UPS Snapshot Manually:

sudo /opt/ups-power-manager/power-event-logger.sh snapshot
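
A snapshot essentially boils down to pulling a few fields out of the pwrstat -status output. The sketch below assumes the usual "Field name........ value" layout of that output; ups_field and ups_snapshot are hypothetical helpers and the one-line snapshot format is invented for illustration.

```shell
#!/bin/sh
# Sketch only: assumes "Name........ value" lines as printed by pwrstat -status.

# ups_field NAME: read status text on stdin and print the value of one field.
ups_field() {
    sed -n "s/^[[:space:]]*$1\.\.*[[:space:]]*//p"
}

# ups_snapshot: read status text on stdin and print the key fields on one line.
ups_snapshot() {
    status=$(cat)
    for f in "State" "Power Supply by" "Battery Capacity" "Remaining Runtime"; do
        printf '%s=%s; ' "$f" "$(printf '%s\n' "$status" | ups_field "$f")"
    done
    echo
}

# Example use:
#   pwrstat -status | ups_snapshot
```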

Common Operation Commands

pwrstat -status                              # Current UPS Status
pwrstat -config                              # Event bindings config
systemctl status pwrstatd                    # pwrstatd Daemon Status
systemctl list-timers ups-healthcheck.timer  # Healthcheck Timer Status
systemctl status ups-power-monitor           # Power Restoration Monitor Status
/opt/ups-power-manager/healthcheck.sh        # Trigger health check manually

Note

  • pwrstatd is the trigger at the core of this solution. Make sure it starts automatically on boot; check with: systemctl status pwrstatd.
  • The T450 laptop does not power off during an outage (Enable shutdown system: Off); it runs on its internal battery and carries out the remote shutdown operations.
  • The T450 BIOS must be set to After Power Loss → Power On; otherwise the laptop will not start automatically after its battery has drained.
  • To adjust the trigger configuration, run: pwrstat -pwrfail -cmd <Script Path> -duration <Seconds> -shutdown <on|off>.
  • To reapply the laptop power settings, rerun sudo /opt/ups-power-manager/laptop-setup.sh.

Actual Deployment Log

Initial Deployment: 2026-03-05 14:38 (UTC+8)
Log Enhancement Upgrade: 2026-03-05 15:25 (UTC+8)
Target Server: 192.168.1.117 (Hostname: ups, Ubuntu, kernel 6.8.0-100-generic)

Deployed Files

| Path | Description |
| --- | --- |
| /opt/ups-power-manager/shutdown.sh | Core shutdown script (contains SESSION_ID + LIFECYCLE tags) |
| /opt/ups-power-manager/healthcheck.sh | Health check script |
| /opt/ups-power-manager/power-event-logger.sh | Outage/restore event logic |
| /opt/ups-power-manager/power-restore-monitor.sh | Resident service for AC restoration |
| /opt/ups-power-manager/laptop-setup.sh | Laptop sleep configuration script |
| /opt/ups-power-manager/on-pwrfail.sh | pwrstat pwrfail wrapper script (auto-generated) |
| /opt/ups-power-manager/on-active.sh | pwrstat active wrapper script (auto-generated) |
| /etc/ups-power-manager/config.env | Config file (permissions: 600) |
| /etc/logrotate.d/ups-power-manager | Log rotation configuration |

Registered Systemd Services

| Service Unit | Type | Status | Description |
| --- | --- | --- | --- |
| pwrstatd.service | service | enabled / active | CyberPower Daemon |
| ups-healthcheck.service | service (oneshot) | enabled | Healthcheck runner |
| ups-healthcheck.timer | timer | enabled / active | Runs 2 mins after boot, then hourly |
| ups-power-monitor.service | service | enabled / active | AC status polling, runs persistently |

pwrstat Power Failure Event Binding (Final Config)

| Event | Delay | Script | Script Run Window | Local Shutdown |
| --- | --- | --- | --- | --- |
| pwrfail | 60s | on-pwrfail.sh | 300s | Off (T450 stays on) |
| lowbatt | 0s | shutdown.sh | 60s | Off (T450 stays on) |
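
For reference, the pwrfail binding listed above could plausibly be reproduced with a single pwrstat call. The sketch below only prints the command so it can be reviewed before running it with sudo. The -cmd, -duration, and -shutdown options appear elsewhere in this README; -delay and -active are assumptions based on the PowerPanel for Linux CLI and should be checked against pwrstat's own help output for your PPL version. The variable and function names are hypothetical.

```shell
#!/bin/sh
# Values mirror the pwrfail binding above; -delay and -active are assumed flags.
PWRFAIL_DELAY=60                                     # seconds before the event fires
PWRFAIL_SCRIPT=/opt/ups-power-manager/on-pwrfail.sh  # wrapper script to run
PWRFAIL_DURATION=300                                 # script run window, seconds

# Prints the command instead of running it, so it can be reviewed first.
pwrfail_binding_cmd() {
    echo "pwrstat -pwrfail -delay $PWRFAIL_DELAY -active on -cmd $PWRFAIL_SCRIPT -duration $PWRFAIL_DURATION -shutdown off"
}

# Example use:
#   pwrfail_binding_cmd          # inspect the command
#   sudo pwrstat -config         # compare with the current bindings
```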

Verification Results

  • pwrstatd runs normally, UPS communication validated (OLS1000E online)
  • shutdown.sh --dry-run passed
  • ups-healthcheck.timer enabled / active (Starts on boot)
  • ups-power-monitor.service enabled / active (Starts on boot)
  • laptop-setup.sh successfully executed (Suspend/sleep disabled, auto-start)
  • ✅ T450 does not shut down on outage (Enable shutdown system: Off)
  • ✅ True outage event correctly logged (2026-03-05 06:54 UTC, Battery 95%, Approx. 30 runtime)
  • ✅ logrotate configuration applied (Weekly rotation, keeps 12 weeks)
  • ⚠️ The pwrstat -active event is not supported out of the box because of PPL limitations (restoration is currently handled by the polling service ups-power-monitor.service)
  • ⚠️ After Power Loss → Power On must be configured manually in the T450 BIOS.

Fast Diagnostics

# Execute on 192.168.1.117
pwrstat -status                                    # Current Status
pwrstat -config                                    # Event binding config
systemctl status ups-power-monitor                 # Utility Power Monitoring service
systemctl list-timers ups-healthcheck.timer        # Active Timers
journalctl -t ups-power-manager --since "1 hour ago"  # Past 1 hr events
tail -f /var/log/ups-shutdown.log                  # Realtime shutdown log
tail -f /var/log/ups-power-events.log              # Realtime outage log
/opt/ups-power-manager/shutdown.sh --dry-run       # Dry-run simulate logic