-
Notifications
You must be signed in to change notification settings - Fork 709
BadLogLines speed ups #28692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BadLogLines speed ups #28692
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes the BadLogLines (BLL) search mechanism that runs after tests to identify problematic log entries, addressing timeout issues observed in scale tests. The optimization employs two key strategies: replacing grep with the faster ripgrep tool and parallelizing log searches across nodes using Python's ThreadPoolExecutor.
Key changes:
- Switch from
greptoripgrep(rg) for faster log searching - Implement parallel log scanning across cluster nodes
- Add comprehensive type hints to the affected Python modules
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/rptest/services/utils.py | Core BLL logic refactored with parallel scanning, type hints, and ripgrep support |
| tests/rptest/services/redpanda.py | Updated function signature to use non-nullable default for test_start_time |
| tests/rptest/services/openmessaging_benchmark.py | Updated type hint to use new NodeToLines type |
| tests/rptest/tests/services_self_test.py | Added tests for BLL functionality and LogMessage admin API |
| tests/docker/ducktape-deps/tool-pkgs | Added ripgrep installation to Docker environment |
| tools/type-checking/type-check-strictness.json | Removed utils.py from type-checking exclusion list |
| proto/redpanda/core/admin/internal/v1/debug.proto | Added LogMessage RPC and LogLevel enum definitions |
| src/v/redpanda/admin/services/internal/debug.{h,cc} | Implemented LogMessage RPC handler |
| tests/rptest/clients/admin/proto/redpanda/core/admin/internal/v1/debug_pb2* | Generated protobuf bindings for LogMessage API |
tests/rptest/clients/admin/proto/redpanda/core/admin/internal/v1/debug_pb2.pyi
Outdated
Show resolved
Hide resolved
tests/rptest/clients/admin/proto/redpanda/core/admin/internal/v1/debug_pb2.pyi
Outdated
Show resolved
Hide resolved
7b83f1e to
a2fb898
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
a2fb898 to
c6357d8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
| "UndefinedBehaviorSanitizer", | ||
| "Aborting on shard", | ||
| "libc++abi: terminating due to uncaught exception", | ||
| "terminating due to uncaught exception", |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The pattern was changed from 'libc++abi: terminating due to uncaught exception' to just 'terminating due to uncaught exception', which may match unintended log lines. Consider whether this broader pattern could introduce false positives in error detection.
| "terminating due to uncaught exception", | |
| "libc++abi: terminating due to uncaught exception", | |
| "terminate called after throwing an instance of", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was by design. BRE and ERE differ in the treatment of + (both can use it as a quantifier, but BRE wants it escaped to do so, ERE the reverse) so I just removed it to avoid any possible issues.
Pretty sure the remaining string is good enough!
|
|
||
| # Prepare matching terms | ||
| self.match_terms = self.DEFAULT_MATCH_TERMS | ||
| self.match_terms: list[str] = list(self.DEFAULT_MATCH_TERMS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before this copy, we were actually modifying DEFAULT_MATCH_TERMS (an class variable) every time we invoked the constructor, this this was growing over time with more ^ERROR terms. Doesn't affect the match and I don't even think it affects the perf since the algos used essentially remove redundant strings as a side-effect.
c6357d8 to
a359d5a
Compare
Retry command for Build#76817please wait until all jobs are finished before running the slash command |
StephanDollberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parallel search may not help on docker where one underlying device is used
Are you saying we are disk bound on those?
No, only that if we are disk bound (or close to it, i.e., using more than 1/Nth of the disk BW for N redpanda node log search) then this won't help (or will get less N speedup if we are close). m5d.12xlarge has at least 1.4 GB/s thoughput on its local disks, while grep CPU side seems like it will be < 500 MB/s, so I don't think we are disk limited in CI (plus page caching, which is hard to evaluate). |
7665019 to
f3f2bc9
Compare
|
f3f2bc9 is a pure rebase |
|
7665019 converts the expression from GNU BRE to ERE, which will fix the two self test failures. |
Adds ability to log a message to the Redpanda server logs via an internal admin API. Use for DT tests.
I am going to make some changes here, so add a self-test. It uses the new log line API to send log lines to one Redpanda node and checks that raise on bad logs catches them.
Fully types at strict. This required a couple of touch-up changes in other files, and exposed in a bug on the cloud tests side where we passed tuples instead of list of tuples.
Install ripgrep from github. In my testing ripgrep is close to 5x faster when searching large logfiles, though it needs to be ripgrep 15, not <= 13 that we'd get on older Ubuntu distros. So install 15.1 from github releases.
Use ripgrep (rg) instead of grep for log searching in rptest. In my benchmark this results in about a 3x speedup when searching through 3 logs in parallel resulting from creating 10 topics of 1000 partitions. The benchmark is also included in this change, though @ignored so it does not run CI (it takes a minute or so). This requires translating GNU BRE (used by grep by default) to ERE (used by grep -E, and ripgrep). It was probably a mistake to use BRE in the first place, but it is what it is.
We query each node one by one to check the log for bad log lines, but we can do this in parallel: it puts almost no load on the runner, and the work happens on the nodes, so let's speed it up. This results in about a 3x speed in my benchmark, on top of the 3x speedup from switching to rg (so about 9x overall).
f3f2bc9 to
e62fe62
Compare
|
e62fe62 is a pure rebase in an attempt to get green CI. |
We have observed timeouts, especially in scale tests, while doing the BadLogLines (BLL) search after the test finishes. Speed this up in two ways:
rginstead ofgrepThis gives about a 9x benefit locally for me, combined. The actual benefit may vary: e.g., if the search is IO limited then the rg change may not help and even the parallel search may not help on docker where one underlying device is used. However on scale tests we use distinct nodes so we are guaranteed a speedup there (and we often use more than 3 nodes as well, giving a better than 3x speedup).
Individual changes have details.
Backports Required
Release Notes