automatic troubleshooting endpoint context docs #728

Open

joeypoon wants to merge 4 commits into main from

feature/at-sdh-context-docs-v2

Member

joeypoon commented Mar 13, 2026

Change Summary

Adds Automatic Troubleshooting context docs useful for troubleshooting common endpoint issues.

joeypoon added 4 commits

March 9, 2026 19:54


          add high_cpu, missed_checkins, bsod, trustedapps context docs

79ea268


          add endpoint_exceptions and incompatible_software context docs

d4c162e


          add output_config context doc for Kafka/Logstash output troubleshooting

537edfe


          add device_control context doc for notification and serial number issues

c0a6c43

joeypoon requested a review from a team as a code owner

March 13, 2026 01:15

joeypoon requested review from pzl and tomsonpl

March 13, 2026 01:15

pzl approved these changes


ferullo reviewed


Contributor

ferullo left a comment

There's a lot to review here so I'm going to submit a review for each file. The first one is done. It's reasonable/expected to sit on my recommendations until an Endpoint developer reviews too (so they can see my comments and contradict me, etc).

package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md


		## Summary

		Elastic Defend uses a kernel-mode driver (`elastic_endpoint_driver.sys`) for file system filtering, network monitoring, and process/object callbacks. Because this driver runs at the kernel level, bugs or incompatibilities can cause system-wide crashes (BSODs) rather than isolated process failures. Most BSOD issues traced to the endpoint driver fall into a few categories: regressions introduced in specific driver versions, conflicts with other kernel-mode drivers (third-party security products), or running on unsupported OS versions.

Contributor

ferullo Mar 18, 2026

Because this driver runs at the kernel level, bugs or incompatibilities can cause system-wide crashes (BSODs) rather than isolated process failures.

Does an LLM need to be told that?

package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md


		Elastic Defend uses a kernel-mode driver (`elastic_endpoint_driver.sys`) for file system filtering, network monitoring, and process/object callbacks. Because this driver runs at the kernel level, bugs or incompatibilities can cause system-wide crashes (BSODs) rather than isolated process failures. Most BSOD issues traced to the endpoint driver fall into a few categories: regressions introduced in specific driver versions, conflicts with other kernel-mode drivers (third-party security products), or running on unsupported OS versions.

		Memory dump analysis with WinDbg (`!analyze -v`) is essential for root-cause determination. The bugcheck code alone is not sufficient — the faulting call stack identifies which code path triggered the crash.

Contributor

ferullo Mar 18, 2026

I don't expect customers to do this. Perhaps text that describes how to collect a memory dump and a note that says to share it with us?

package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md


		### ODX-enabled volume crash (8.19.8, 9.1.8, 9.2.2)

		A regression introduced in versions 8.19.8, 9.1.8, and 9.2.2 causes BSODs on systems with ODX (Offloaded Data Transfer) enabled volumes, particularly affecting Hyper-V clusters and Windows Server 2016 Datacenter. The crash occurs in the file system filter driver's post-FsControl handler (`bkPostFsControl`) when processing offloaded write completions. The faulting call stack typically shows `elastic_endpoint_driver!bkPostFsControl` followed by `FLTMGR!FltGetStreamContext`.

Contributor

ferullo Mar 18, 2026

The crash occurs in the file system filter driver's post-FsControl handler (bkPostFsControl) when processing offloaded write completions. The faulting call stack typically shows elastic_endpoint_driver!bkPostFsControl followed by FLTMGR!FltGetStreamContext.

I doubt this is useful info for a Kibana user.

package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md


		A regression in the network driver introduced in Elastic Defend versions 8.17.8, 8.18.3, and 9.0.3 can cause kernel pool corruption on systems with a large number of long-lived network connections that remain inactive for 30+ minutes. The corruption manifests as BSODs with various bugcheck codes including `IRQL_NOT_LESS_OR_EQUAL`, `SYSTEM_SERVICE_EXCEPTION`, `KERNEL_MODE_HEAP_CORRUPTION`, or `PAGE_FAULT_IN_NONPAGED_AREA`.

		This is the most frequently reported BSOD pattern and affects Windows Server environments with persistent connections (e.g. database servers, backup servers running Veeam with PostgreSQL). The kernel pool may already be corrupted when the driver attempts a routine memory allocation, causing the crash to appear in unrelated code paths.

Contributor

ferullo Mar 18, 2026

I doubt this paragraph is useful info for a Kibana user.

package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md


		Fixed version: 9.2.4.

		Mitigation: Downgrade to the prior agent version (e.g. 9.2.1) until the upgrade to 9.2.4+ can be performed.

Contributor

ferullo Mar 18, 2026

Agent doesn't support downgrade (or am I mistaken?). I think this should say "Upgrade to a version with the fix".

package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md


		Other security products running kernel-mode drivers can interfere with Elastic Defend's driver initialization or runtime operation. The most commonly reported conflicts include:

		- Trellix Access Control: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. When another thread within Defend tasks the kernel driver before initialization completes, the system crashes. The call stack typically shows `elastic_endpoint_driver!HandleIrpDeviceControl` calling `bkRegisterCallbacks` with `KeAcquireGuardedMutex` on an uninitialized mutex. This interaction was introduced by a WFP driver refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`).

Contributor

ferullo Mar 18, 2026

Suggested change

      
            - **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. When another thread within Defend tasks the kernel driver before initialization completes, the system crashes. The call stack typically shows `elastic_endpoint_driver!HandleIrpDeviceControl` calling `bkRegisterCallbacks` with `KeAcquireGuardedMutex` on an uninitialized mutex. This interaction was introduced by a WFP driver refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`).
          
            - **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. This interaction was introduced by an Elastic Defend refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`).

Question for Defend developers: is "add a Trellix exclusion for the BFE service (svchost.exe)." an acceptable recommendation?
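For illustration, the version ranges in the Trellix bullet can be turned into a small check. This is a minimal sketch, not part of the PR: the names `is_affected` and `FIXED_IN` are hypothetical, and it assumes that the 8.16.x line never received the fix and that release lines newer than those listed (8.19+, 9.1+) already include it. Neither assumption is confirmed in this thread.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]

# Version that introduced the interaction, per the doc text.
INTRODUCED: Version = (8, 16, 0)

# First fixed patch release per (major, minor) line, per the doc text.
FIXED_IN: Dict[Tuple[int, int], int] = {(8, 17): 6, (8, 18): 1, (9, 0): 1}


def parse_version(text: str) -> Version:
    """Parse a plain 'major.minor.patch' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def is_affected(text: str) -> bool:
    """Return True if the version falls in the affected range described above."""
    version = parse_version(text)
    if version < INTRODUCED:
        return False
    line = (version[0], version[1])
    if line in FIXED_IN:
        return version[2] < FIXED_IN[line]
    # Assumption: 8.16.x has no fixed patch; any other unlisted line is newer
    # than the fix and is treated as unaffected.
    return line == (8, 16)
```

For example, `is_affected("8.17.5")` is true and `is_affected("8.17.6")` is false under these assumptions.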

package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md


		- CrowdStrike, Kaspersky, Windows Defender coexistence: Running multiple endpoint security products increases the probability of kernel-level interactions. Each additional kernel-mode filter driver introduces another point of contention for file system, registry, and network callbacks. When BSODs occur on systems with multiple security products, simplify by removing redundant products.

		- High third-party driver count: Systems with an unusually high number of third-party kernel drivers (e.g. 168+ drivers) amplify the risk of pool corruption being attributed to or triggered by any one driver. Enable Driver Verifier on suspect third-party drivers to isolate the true source.

Contributor

ferullo Mar 18, 2026

IDK, is this actionable by users?

package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md


		### Unsupported OS version

		Upgrading Elastic Defend to a version that dropped support for the host's Windows version causes immediate BSODs or boot loops. The most common case is upgrading to 8.13+ on Windows Server 2012 R2, which lost support in that release. The system crashes during driver load because the driver uses kernel APIs unavailable on the older OS.

Contributor

ferullo Mar 18, 2026

Something is missing here: we added support for Windows Server 2012 R2 back in 8.16.0.

package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md


		In some cases the endpoint driver causes a system deadlock rather than a classic BSOD. The system becomes completely unresponsive — applications freeze, Task Manager hangs, and the Elastic service cannot be stopped. This typically requires a hard reboot. A kernel memory dump captured during the lockup (via keyboard-initiated crash: right Ctrl + Scroll Lock twice) is required for diagnosis.

		This pattern has been observed when the driver's file system filter processing enters a long-running or blocking state while monitoring specific applications. If the lockup is reproducible with a specific application, adding that application as a Trusted Application may resolve the conflict.

Contributor

ferullo Mar 18, 2026

This paragraph doesn't seem useful to a user.

package/endpoint/docs/knowledge_base/bsod/windows_bsod_endpoint_driver.md


		## Investigation priorities

		1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Run WinDbg `!analyze -v` to identify the exact bugcheck code, faulting module, and call stack. Without the dump, root cause cannot be determined.

Contributor

ferullo Mar 18, 2026

Suggested change

      
            1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Run WinDbg `!analyze -v` to identify the exact bugcheck code, faulting module, and call stack. Without the dump, root cause cannot be determined.
          
            1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Share the dump with Elastic.
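As a rough illustration of the collection step, the two dump locations named above could be enumerated with a short script. This is a sketch under stated assumptions, not tooling from the PR: the function name `find_crash_dumps` is hypothetical, the Windows directory is passed in as a parameter (normally `C:\Windows`), and minidumps are assumed to carry a lowercase `.dmp` extension.

```python
from pathlib import Path
from typing import List


def find_crash_dumps(windows_dir: Path) -> List[Path]:
    """List the full kernel dump and any minidumps under the given Windows directory."""
    dumps: List[Path] = []

    # Full kernel memory dump, e.g. C:\Windows\MEMORY.DMP
    full_dump = windows_dir / "MEMORY.DMP"
    if full_dump.is_file():
        dumps.append(full_dump)

    # Minidumps, e.g. C:\Windows\Minidump\*.dmp (extension case is an assumption)
    minidump_dir = windows_dir / "Minidump"
    if minidump_dir.is_dir():
        dumps.extend(sorted(minidump_dir.glob("*.dmp")))

    return dumps
```

Calling `find_crash_dumps(Path(r"C:\Windows"))` on an affected host returns the files to share with Elastic. An empty list means no dump was written to either location.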

ferullo reviewed


package/endpoint/docs/knowledge_base/device_control/device_control_notification.md


		## Symptom

		A custom notification message has been configured in the Elastic Defend Device Control policy to display when a USB device is blocked, but the Windows system tray popup does not appear. Instead, the user sees only a generic Windows Explorer error stating the device is not accessible. Alternatively, device-specific allow/block rules based on `device.serial_number` do not match the intended device because the serial number field contains `0` or a seemingly random value.

Contributor

ferullo Mar 18, 2026

These two situations are completely different. Should a single MD doc cover them both or is it better to break this doc up? I'm holding off on reading this file until this is answered.

I have no personal preference, I'm just bringing this up in case it helps with context windows.

nicholasberlin reviewed


package/endpoint/docs/knowledge_base/high_cpu/linux_high_cpu.md


		## Summary

		Elastic Defend on Linux uses eBPF and fanotify to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes.

Contributor

nicholasberlin Mar 19, 2026

Suggested change

      
            Elastic Defend on Linux uses eBPF and fanotify to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes.
          
            Elastic Defend on Linux uses eBPF or tracefs to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes.

package/endpoint/docs/knowledge_base/high_cpu/linux_high_cpu.md


		CPU returns to normal within approximately 40 seconds after connectivity is restored. The command `sudo /opt/Elastic/Endpoint/elastic-endpoint test output` can be used to verify output connectivity — on affected versions this command itself will spike CPU when the output is unreachable.

		Upgrade to 8.13.4+ where the retry loop includes proper backoff. Check `logs-elastic_agent.endpoint_security-*` for the error patterns above.

Contributor

nicholasberlin Mar 19, 2026

8.13.4 is really old, do we need to call this out?

package/endpoint/docs/knowledge_base/high_cpu/linux_high_cpu.md

+              The endpoint will report status as CONFIGURING during this time:
+              - `Endpoint is setting status to CONFIGURING, reason: Policy Application Status`
+              On first run with an empty cache, the CONFIGURING phase can take 5–30 minutes depending on the number and size of running processes. This is expected behavior. Subsequent restarts are fast because the cache persists.

Contributor

nicholasberlin Mar 19, 2026

5-30 minutes? Is that supposed to be seconds?

Contributor

nicholasberlin Mar 19, 2026

I suppose not, because 30 seconds wouldn't be a problem.
