Soft lockup detection in linux through dmesg logs parsing and sending telemetry #3573

Open
adityagarg0911 wants to merge 10 commits into Azure:develop from adityagarg0911:gargaditya/kernel_soft_lockup_detection

@adityagarg0911 commented Mar 5, 2026

Description

Add kernel soft lockup monitoring to the Azure Linux Agent. This new feature periodically parses dmesg output to detect CPU soft lockup events (BUG: soft lockup - CPU#N stuck for Xs!), aggregates them by CPU, and reports summarized telemetry to Azure. This helps detect and diagnose VM health issues caused by CPUs being stuck in kernel code.
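
For illustration only, here is a minimal sketch of how a single dmesg line of the form quoted above could be matched. The pattern and the helper name `parse_line` are assumptions; the regex actually used in kernel_event_monitor.py is not shown in this conversation, and the sketch assumes dmesg prints the default bracketed seconds-since-boot timestamp.

```python
import re

# Hypothetical pattern: assumes the default "[ seconds.micros]" dmesg prefix.
SOFT_LOCKUP_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s.*"
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
)


def parse_line(line):
    """Return (kernel_timestamp, cpu_id, stuck_seconds), or None if no match."""
    match = SOFT_LOCKUP_RE.search(line)
    if match is None:
        return None
    return float(match.group("ts")), int(match.group("cpu")), int(match.group("secs"))
```

A line such as `[ 1234.567890] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/2:0]` would yield CPU 2 and a stuck time of 22 seconds; unrelated lines return None.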

Changes:

  • New module kernel_event_monitor.py — MonitorKernelSoftLockup periodic operation that:
    • Parses dmesg for soft lockup events using regex
    • Aggregates events by CPU ID (count, max stuck time, last kernel timestamp)
    • Reports via telemetry (WALAEventOperation.KernelSoftLockup)
    • Persists watermark to disk to avoid duplicate reporting across agent restarts
    • Detects reboots via boot ID to reset watermark
  • monitor.py — Conditionally adds MonitorKernelSoftLockup to the monitor thread based on config
  • conf.py — New config options: Monitor.KernelSoftLockup (enable/disable) and Monitor.KernelSoftLockupPeriod (check interval, minimum 300s)
  • waagent.conf — Added default config entries (Monitor.KernelSoftLockup=y, Monitor.KernelSoftLockupPeriod=21600)
  • event.py — New WALAEventOperation.KernelSoftLockup operation type
  • test_kernel_event_monitor.py — 16 unit tests covering regex, parsing, aggregation, reporting, state persistence, dmesg error handling, and end-to-end operation
  • test_monitor.py — Updated to verify MonitorKernelSoftLockup is included/excluded based on config flag
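
As a rough sketch of the aggregation and watermark behaviour listed above (not the code in this PR), the following shows one way to summarise events per CPU and to discard a saved watermark when the boot ID changes. All function names, the state-file layout, and the summary field names are hypothetical. The boot ID is passed in as a parameter; on Linux it could be read from /proc/sys/kernel/random/boot_id, but whether the PR does so is an assumption.

```python
import json


def aggregate_by_cpu(events, watermark=None):
    """Summarise (timestamp, cpu, stuck_seconds) tuples newer than the watermark.

    Returns (summary, new_watermark); summary maps cpu -> count, max stuck
    time, and last kernel timestamp.
    """
    summary = {}
    new_watermark = watermark
    for ts, cpu, secs in events:
        if watermark is not None and ts <= watermark:
            continue  # already reported by an earlier run
        entry = summary.setdefault(
            cpu, {"count": 0, "max_stuck_seconds": 0, "last_timestamp": ts}
        )
        entry["count"] += 1
        entry["max_stuck_seconds"] = max(entry["max_stuck_seconds"], secs)
        entry["last_timestamp"] = max(entry["last_timestamp"], ts)
        if new_watermark is None or ts > new_watermark:
            new_watermark = ts
    return summary, new_watermark


def load_watermark(state_path, current_boot_id):
    """Return the saved watermark, or None if state is missing or from another boot."""
    try:
        with open(state_path) as f:
            state = json.load(f)
    except (IOError, OSError, ValueError):
        return None
    if state.get("boot_id") != current_boot_id:
        return None  # kernel timestamps restart from zero after a reboot
    return state.get("watermark")


def save_watermark(state_path, current_boot_id, watermark):
    """Persist the watermark together with the boot ID it belongs to."""
    with open(state_path, "w") as f:
        json.dump({"boot_id": current_boot_id, "watermark": watermark}, f)
```

Storing the boot ID next to the watermark is what lets a restarted agent skip already-reported events while still starting fresh after a reboot, when dmesg timestamps begin again near zero.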

Issue #

PR information

  • Ensure development PR is based on the develop branch.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines


Distro maintenance information, if applicable

  • This is a contribution from a distro maintainer
  • The changes in this PR have been taken as a downstream patch (Note: it is not recommended to patch the agent without upstream review and approval)

@adityagarg0911 adityagarg0911 marked this pull request as ready for review March 5, 2026 15:52

if not found_timestamp:
    logger.periodic_warn(
        logger.EVERY_HOUR,
Contributor

Since soft lockup detection runs every 6 hours, I don't think this periodic logging timer is needed. We also try to avoid the logging timer because we have seen issues with this logic in the past. If you do need to limit frequent logging, consider custom logic instead.

Author

Originally, I kept it this way in case we decrease the period to less than an hour in the future.
But this makes sense; I'll change it to logger.warn since the period is greater than an hour. Thanks.

        log_event=False
    )
except Exception as e:
    logger.periodic_warn(
Contributor

same here

try:
    return run_command(['dmesg'], track_process=False, timeout=self._DMESG_TIMEOUT)
except Exception as e:
    logger.periodic_warn(
Contributor

same here

    json.dump(state, f)
except Exception as e:
    logger.periodic_warn(
        logger.EVERY_HOUR,
Contributor

same here
