automatic troubleshooting endpoint context docs #728
ferullo
left a comment
There's a lot to review here so I'm going to submit a review for each file. The first one is done. It's reasonable/expected to sit on my recommendations until an Endpoint developer reviews too (so they can see my comments and contradict me, etc).
| ## Summary | ||
| Elastic Defend uses a kernel-mode driver (`elastic_endpoint_driver.sys`) for file system filtering, network monitoring, and process/object callbacks. Because this driver runs at the kernel level, bugs or incompatibilities can cause system-wide crashes (BSODs) rather than isolated process failures. Most BSOD issues traced to the endpoint driver fall into a few categories: regressions introduced in specific driver versions, conflicts with other kernel-mode drivers (third-party security products), or running on unsupported OS versions. |
Because this driver runs at the kernel level, bugs or incompatibilities can cause system-wide crashes (BSODs) rather than isolated process failures.
Does an LLM need to be told that?
| Memory dump analysis with WinDbg (`!analyze -v`) is essential for root-cause determination. The bugcheck code alone is not sufficient — the faulting call stack identifies which code path triggered the crash. |
I don't expect customers to do this. Perhaps text that describes how to collect a memory dump and a note that says to share it with us?
| ### ODX-enabled volume crash (8.19.8, 9.1.8, 9.2.2) | ||
| A regression introduced in versions 8.19.8, 9.1.8, and 9.2.2 causes BSODs on systems with ODX (Offloaded Data Transfer) enabled volumes, particularly affecting Hyper-V clusters and Windows Server 2016 Datacenter. The crash occurs in the file system filter driver's post-FsControl handler (`bkPostFsControl`) when processing offloaded write completions. The faulting call stack typically shows `elastic_endpoint_driver!bkPostFsControl` followed by `FLTMGR!FltGetStreamContext`. |
The crash occurs in the file system filter driver's post-FsControl handler (`bkPostFsControl`) when processing offloaded write completions. The faulting call stack typically shows `elastic_endpoint_driver!bkPostFsControl` followed by `FLTMGR!FltGetStreamContext`.
I doubt this is useful info for a Kibana user.
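If the call stack detail is dropped, a simple version check may be more actionable for a Kibana user. A minimal sketch, assuming the agent version is already available as a plain string; the constant and function names are hypothetical, and the set covers only the three releases named in this section (whether other releases are affected is not stated here).

```python
# Releases named in the doc as carrying the ODX regression.
# Assumption: versions are compared as exact strings, with no build suffixes.
ODX_REGRESSION_VERSIONS = {"8.19.8", "9.1.8", "9.2.2"}


def named_in_odx_regression(version: str) -> bool:
    """Return True if the version is one of the releases listed in the doc."""
    return version.strip() in ODX_REGRESSION_VERSIONS


if __name__ == "__main__":
    for candidate in ("9.2.2", "9.2.4"):
        print(candidate, named_in_odx_regression(candidate))
```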
| A regression in the network driver introduced in Elastic Defend versions 8.17.8, 8.18.3, and 9.0.3 can cause kernel pool corruption on systems with a large number of long-lived network connections that remain inactive for 30+ minutes. The corruption manifests as BSODs with various bugcheck codes including `IRQL_NOT_LESS_OR_EQUAL`, `SYSTEM_SERVICE_EXCEPTION`, `KERNEL_MODE_HEAP_CORRUPTION`, or `PAGE_FAULT_IN_NONPAGED_AREA`. | ||
| This is the most frequently reported BSOD pattern and affects Windows Server environments with persistent connections (e.g. database servers, backup servers running Veeam with PostgreSQL). The kernel pool may already be corrupted when the driver attempts a routine memory allocation, causing the crash to appear in unrelated code paths. |
I doubt this paragraph is useful info for a Kibana user.
| **Fixed version**: 9.2.4. | ||
| **Mitigation**: Downgrade to the prior agent version (e.g. 9.2.1) until the upgrade to 9.2.4+ can be performed. |
Agent doesn't support downgrade (or am I mistaken?). This should say "Upgrade to a version with the fix", I think.
| Other security products running kernel-mode drivers can interfere with Elastic Defend's driver initialization or runtime operation. The most commonly reported conflicts include: | ||
| - **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. When another thread within Defend tasks the kernel driver before initialization completes, the system crashes. The call stack typically shows `elastic_endpoint_driver!HandleIrpDeviceControl` calling `bkRegisterCallbacks` with `KeAcquireGuardedMutex` on an uninitialized mutex. This interaction was introduced by a WFP driver refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`). |
| - **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. When another thread within Defend tasks the kernel driver before initialization completes, the system crashes. The call stack typically shows `elastic_endpoint_driver!HandleIrpDeviceControl` calling `bkRegisterCallbacks` with `KeAcquireGuardedMutex` on an uninitialized mutex. This interaction was introduced by a WFP driver refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`). | |
| - **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. This interaction was introduced by an Elastic Defend refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`). |
Question for Defend developers: is "add a Trellix exclusion for the BFE service (`svchost.exe`)" an acceptable recommendation?
| - **CrowdStrike, Kaspersky, Windows Defender coexistence**: Running multiple endpoint security products increases the probability of kernel-level interactions. Each additional kernel-mode filter driver introduces another point of contention for file system, registry, and network callbacks. When BSODs occur on systems with multiple security products, simplify by removing redundant products. | ||
| - **High third-party driver count**: Systems with an unusually high number of third-party kernel drivers (e.g. 168+ drivers) amplify the risk of pool corruption being attributed to or triggered by any one driver. Enable Driver Verifier on suspect third-party drivers to isolate the true source. |
IDK, is this actionable by users?
| ### Unsupported OS version | ||
| Upgrading Elastic Defend to a version that dropped support for the host's Windows version causes immediate BSODs or boot loops. The most common case is upgrading to 8.13+ on Windows Server 2012 R2, which lost support in that release. The system crashes during driver load because the driver uses kernel APIs unavailable on the older OS. |
Something is missing here: we added support for Windows Server 2012 R2 back in 8.16.0.
| In some cases the endpoint driver causes a system deadlock rather than a classic BSOD. The system becomes completely unresponsive — applications freeze, Task Manager hangs, and the Elastic service cannot be stopped. This typically requires a hard reboot. A kernel memory dump captured during the lockup (via keyboard-initiated crash: right Ctrl + Scroll Lock twice) is required for diagnosis. | ||
| This pattern has been observed when the driver's file system filter processing enters a long-running or blocking state while monitoring specific applications. If the lockup is reproducible with a specific application, adding that application as a Trusted Application may resolve the conflict. |
This paragraph doesn't seem useful to a user.
| ## Investigation priorities | ||
| 1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Run WinDbg `!analyze -v` to identify the exact bugcheck code, faulting module, and call stack. Without the dump, root cause cannot be determined. |
| 1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Run WinDbg `!analyze -v` to identify the exact bugcheck code, faulting module, and call stack. Without the dump, root cause cannot be determined. | |
| 1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Share the dump with Elastic. |
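To make the collection step concrete, here is a minimal sketch that lists the dump files to share with Elastic. It assumes the default locations quoted in the doc (`MEMORY.DMP` and a `Minidump` folder under the Windows directory) and that minidumps use a `.dmp` extension; the function name `find_dumps` is hypothetical.

```python
from pathlib import Path


def find_dumps(windows_dir: Path) -> list[Path]:
    """Return crash dump files found under the given Windows directory."""
    found = []

    # Full or kernel memory dump, written as a single file.
    full_dump = windows_dir / "MEMORY.DMP"
    if full_dump.is_file():
        found.append(full_dump)

    # Minidumps are written as individual files in their own folder.
    minidump_dir = windows_dir / "Minidump"
    if minidump_dir.is_dir():
        found.extend(
            sorted(p for p in minidump_dir.iterdir() if p.suffix.lower() == ".dmp")
        )
    return found


if __name__ == "__main__":
    # Assumption: Windows is installed at the default C:\Windows path.
    for dump in find_dumps(Path(r"C:\Windows")):
        print(f"{dump} ({dump.stat().st_size} bytes)")
```

Reading these paths typically requires an elevated prompt, and an empty result may simply mean the host is not configured to write dumps.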
| ## Symptom | ||
| A custom notification message has been configured in the Elastic Defend Device Control policy to display when a USB device is blocked, but the Windows system tray popup does not appear. Instead, the user sees only a generic Windows Explorer error stating the device is not accessible. Alternatively, device-specific allow/block rules based on `device.serial_number` do not match the intended device because the serial number field contains `0` or a seemingly random value. |
These two situations are completely different. Should a single MD doc cover them both or is it better to break this doc up? I'm holding off on reading this file until this is answered.
I have no personal preference, I'm just bringing this up in case it helps with context windows.
| ## Summary | ||
| Elastic Defend on Linux uses eBPF and fanotify to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes. |
| Elastic Defend on Linux uses eBPF and fanotify to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes. | |
| Elastic Defend on Linux uses eBPF or tracefs to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes. |
| CPU returns to normal within approximately 40 seconds after connectivity is restored. The command `sudo /opt/Elastic/Endpoint/elastic-endpoint test output` can be used to verify output connectivity — on affected versions this command itself will spike CPU when the output is unreachable. | ||
| Upgrade to 8.13.4+ where the retry loop includes proper backoff. Check `logs-elastic_agent.endpoint_security-*` for the error patterns above. |
8.13.4 is really old. Do we need to call this out?
| The endpoint will report status as CONFIGURING during this time: | ||
| - `Endpoint is setting status to CONFIGURING, reason: Policy Application Status` | ||
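A minimal sketch of spotting that status message in collected log lines. Only the message text comes from the doc; the surrounding line format (timestamps, levels) is an assumption, and the names `CONFIGURING_PATTERN` and `configuring_reasons` are hypothetical.

```python
import re

# Matches the status message quoted in the doc and captures the reason text.
CONFIGURING_PATTERN = re.compile(
    r"Endpoint is setting status to CONFIGURING, reason: (?P<reason>.+)"
)


def configuring_reasons(lines):
    """Yield the reason from each line that reports the CONFIGURING status."""
    for line in lines:
        match = CONFIGURING_PATTERN.search(line)
        if match:
            yield match.group("reason").strip()


if __name__ == "__main__":
    sample = [
        # Assumed log prefix; only the message itself is taken from the doc.
        "2024-01-01T00:00:00Z info Endpoint is setting status to CONFIGURING, "
        "reason: Policy Application Status",
        "2024-01-01T00:00:05Z info unrelated message",
    ]
    for reason in configuring_reasons(sample):
        print(reason)
```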
| On first run with an empty cache, the CONFIGURING phase can take 5–30 minutes depending on the number and size of running processes. This is expected behavior. Subsequent restarts are fast because the cache persists. |
5-30 minutes? Is that supposed to be seconds?
I suppose not, because 30 seconds wouldn't be a problem.
Change Summary
Adds Automatic Troubleshooting context docs useful for troubleshooting common endpoint issues.