BadLogLines speed ups #28692

travisdowns · 2025-11-21T18:04:35Z

We have observed timeouts, especially in scale tests, while doing the BadLogLines (BLL) search after the test finishes. Speed this up in two ways:

Use rg instead of grep
Search the nodes in parallel

This gives about a 9x benefit locally for me, combined. The actual benefit may vary: e.g., if the search is IO limited then the rg change may not help and even the parallel search may not help on docker where one underlying device is used. However on scale tests we use distinct nodes so we are guaranteed a speedup there (and we often use more than 3 nodes as well, giving a better than 3x speedup).

Individual changes have details.

Backports Required

Release Notes

none

Copilot

Pull request overview

This PR optimizes the BadLogLines (BLL) search mechanism that runs after tests to identify problematic log entries, addressing timeout issues observed in scale tests. The optimization employs two key strategies: replacing grep with the faster ripgrep tool and parallelizing log searches across nodes using Python's ThreadPoolExecutor.

Key changes:

Switch from grep to ripgrep (rg) for faster log searching
Implement parallel log scanning across cluster nodes
Add comprehensive type hints to the affected Python modules

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/rptest/services/utils.py	Core BLL logic refactored with parallel scanning, type hints, and ripgrep support
tests/rptest/services/redpanda.py	Updated function signature to use non-nullable default for test_start_time
tests/rptest/services/openmessaging_benchmark.py	Updated type hint to use new NodeToLines type
tests/rptest/tests/services_self_test.py	Added tests for BLL functionality and LogMessage admin API
tests/docker/ducktape-deps/tool-pkgs	Added ripgrep installation to Docker environment
tools/type-checking/type-check-strictness.json	Removed utils.py from type-checking exclusion list
proto/redpanda/core/admin/internal/v1/debug.proto	Added LogMessage RPC and LogLevel enum definitions
src/v/redpanda/admin/services/internal/debug.{h,cc}	Implemented LogMessage RPC handler
tests/rptest/clients/admin/proto/redpanda/core/admin/internal/v1/debug_pb2*	Generated protobuf bindings for LogMessage API

tests/rptest/clients/admin/proto/redpanda/core/admin/internal/v1/debug_pb2.pyi

tests/rptest/services/utils.py

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

tests/rptest/services/openmessaging_benchmark.py

tests/rptest/services/utils.py

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Copilot · 2025-11-21T18:17:04Z

tests/rptest/services/utils.py

        "UndefinedBehaviorSanitizer",
        "Aborting on shard",
-        "libc++abi: terminating due to uncaught exception",
+        "terminating due to uncaught exception",


[nitpick] The pattern was changed from 'libc++abi: terminating due to uncaught exception' to just 'terminating due to uncaught exception', which may match unintended log lines. Consider whether this broader pattern could introduce false positives in error detection.

Suggested change

"terminating due to uncaught exception",

"libc++abi: terminating due to uncaught exception",

"terminate called after throwing an instance of",

This was by design. BRE and ERE differ in the treatment of + (both can use it as a quantifier, but BRE wants it escaped to do so, ERE the reverse) so I just removed it to avoid any possible issues.

Pretty sure the remaining string is good enough!

travisdowns · 2025-11-21T18:17:43Z

tests/rptest/services/utils.py


        # Prepare matching terms
-        self.match_terms = self.DEFAULT_MATCH_TERMS
+        self.match_terms: list[str] = list(self.DEFAULT_MATCH_TERMS)


before this copy, we were actually modifying DEFAULT_MATCH_TERMS (an class variable) every time we invoked the constructor, this this was growing over time with more ^ERROR terms. Doesn't affect the match and I don't even think it affects the perf since the algos used essentially remove redundant strings as a side-effect.

vbotbuildovich · 2025-11-21T21:06:28Z

Retry command for Build#76817

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/services_self_test.py::RedpandaServiceSelfTest.test_asan_backtrace
tests/rptest/tests/services_self_test.py::RedpandaServiceSelfTest.test_ubsan_backtrace

vbotbuildovich · 2025-11-21T21:39:05Z

CI test results

test results on build#76817

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
ClusterConfigLegacyDefaultTest	test_legacy_default	{"wipe_cache": true}	integration	https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigLegacyDefaultTest&test_method=test_legacy_default
ControlCharacterPermittedAfterUpgrade	test_upgrade_from_pre_v23_2	{"initial_version_as_list": [22, 3, 11]}	integration	https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControlCharacterPermittedAfterUpgrade&test_method=test_upgrade_from_pre_v23_2
IcebergTogglingTest	test_iceberg_mode_toggling	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=IcebergTogglingTest&test_method=test_iceberg_mode_toggling
SchemaEvolutionE2ETests	test_reorder_columns	{"cloud_storage_type": 1, "qe_and_cat": ["trino", "rest_hadoop"]}	integration	https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaEvolutionE2ETests&test_method=test_reorder_columns
EndToEndShadowIndexingTestWithDisruptions	test_write_with_node_failures	{"cloud_storage_type": 2}	integration	https://buildkite.com/redpanda/redpanda/builds/76817#019aa7dd-096b-4979-abd9-7cf8c3035528	FLAKY	20/21	upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndShadowIndexingTestWithDisruptions&test_method=test_write_with_node_failures
MaintenanceTest	test_exclusive_maintenance	{"use_rpk": true}	integration	https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MaintenanceTest&test_method=test_exclusive_maintenance
RemoteLabelsTest	test_share_bucket_delete_topic	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RemoteLabelsTest&test_method=test_share_bucket_delete_topic
RedpandaServiceSelfTest	test_asan_backtrace	null	integration	https://buildkite.com/redpanda/redpanda/builds/76817#019aa7dd-095d-4778-af5f-f11b7dace52c	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaServiceSelfTest&test_method=test_asan_backtrace
RedpandaServiceSelfTest	test_ubsan_backtrace	null	integration	https://buildkite.com/redpanda/redpanda/builds/76817#019aa7dd-0969-4281-8cbe-ebca1a3b3098	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaServiceSelfTest&test_method=test_ubsan_backtrace

test results on build#76910

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
ClusterConfigLegacyDefaultTest	test_removal_of_legacy_default_overriden	{"wipe_cache": true}	integration	https://buildkite.com/redpanda/redpanda/builds/76910#019ab7cc-e238-4d9a-9dad-210368459140	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigLegacyDefaultTest&test_method=test_removal_of_legacy_default_overriden
CreateTopicsTest	test_case_insensitive_boolean_property	null	integration	https://buildkite.com/redpanda/redpanda/builds/76910#019ab7cc-e238-4d9a-9dad-210368459140	FAIL	0/1		https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CreateTopicsTest&test_method=test_case_insensitive_boolean_property
WriteCachingFailureInjectionE2ETest	test_crash_all	{"use_transactions": false}	integration	https://buildkite.com/redpanda/redpanda/builds/76910#019ab7ce-94c4-4ce8-bcd8-0d78a5a09bec	FLAKY	18/21	upstream reliability is '91.51047409040794'. current run reliability is '85.71428571428571'. drift is 5.79619 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

StephanDollberg

parallel search may not help on docker where one underlying device is used

Are you saying we are disk bound on those?

travisdowns · 2025-11-24T16:50:00Z

parallel search may not help on docker where one underlying device is used

Are you saying we are disk bound on those?

No, only that if we are disk bound (or close to it, i.e., using more than 1/Nth of the disk BW for N redpanda node log search) then this won't help (or will get less N speedup if we are close).

m5d.12xlarge has at least 1.4 GB/s thoughput on its local disks, while grep CPU side seems like it will be < 500 MB/s, so I don't think we are disk limited in CI (plus page caching, which is hard to evaluate).

travisdowns · 2025-11-24T20:58:06Z

f3f2bc9 is a pure rebase

travisdowns · 2025-11-24T20:58:30Z

7665019 converts the expression from GNU BRE to ERE, which will fix the two self test failures.

Adds ability to log a message to the Redpanda server logs via an internal admin API. Use for DT tests.

I am going to make some changes here, so add a self-test. It uses the new log line API to send log lines to one Redpanda node and checks that raise on bad logs catches them.

Fully types at strict. This required a couple of touch-up changes in other files, and exposed in a bug on the cloud tests side where we passed tuples instead of list of tuples.

Install ripgrep from github. In my testing ripgrep is close to 5x faster when searching large logfiles, though it needs to be ripgrep 15, not <= 13 that we'd get on older Ubuntu distros. So install 15.1 from github releases.

@ignored

Use ripgrep (rg) instead of grep for log searching in rptest. In my benchmark this results in about a 3x speedup when searching through 3 logs in parallel resulting from creating 10 topics of 1000 partitions. The benchmark is also included in this change, though @ignored so it does not run CI (it takes a minute or so). This requires translating GNU BRE (used by grep by default) to ERE (used by grep -E, and ripgrep). It was probably a mistake to use BRE in the first place, but it is what it is.

We query each node one by one to check the log for bad log lines, but we can do this in parallel: it puts almost no load on the runner, and the work happens on the nodes, so let's speed it up. This results in about a 3x speed in my benchmark, on top of the 3x speedup from switching to rg (so about 9x overall).

travisdowns · 2025-11-25T12:59:46Z

e62fe62 is a pure rebase in an attempt to get green CI.

travisdowns requested a review from StephanDollberg November 21, 2025 18:04

travisdowns requested a review from a team as a code owner November 21, 2025 18:04

travisdowns requested review from cjayani and Copilot and removed request for a team November 21, 2025 18:04

github-actions bot added the area/redpanda label Nov 21, 2025

travisdowns changed the title ~~BadLogLines speeds ups~~ BadLogLines speed ups Nov 21, 2025

Copilot AI reviewed Nov 21, 2025

View reviewed changes

travisdowns force-pushed the td-BLL-speedup branch from 7b83f1e to a2fb898 Compare November 21, 2025 18:07

travisdowns requested a review from Copilot November 21, 2025 18:08

Copilot AI reviewed Nov 21, 2025

View reviewed changes

tests/rptest/services/openmessaging_benchmark.py Show resolved Hide resolved

tests/rptest/services/openmessaging_benchmark.py Show resolved Hide resolved

tests/rptest/services/utils.py Show resolved Hide resolved

travisdowns force-pushed the td-BLL-speedup branch from a2fb898 to c6357d8 Compare November 21, 2025 18:15

travisdowns requested a review from Copilot November 21, 2025 18:16

Copilot AI reviewed Nov 21, 2025

View reviewed changes

travisdowns commented Nov 21, 2025

View reviewed changes

travisdowns force-pushed the td-BLL-speedup branch from c6357d8 to a359d5a Compare November 21, 2025 18:40

StephanDollberg previously approved these changes Nov 21, 2025

View reviewed changes

travisdowns dismissed StephanDollberg’s stale review via 7665019 November 24, 2025 20:56

travisdowns force-pushed the td-BLL-speedup branch 2 times, most recently from 7665019 to f3f2bc9 Compare November 24, 2025 20:57

StephanDollberg approved these changes Nov 24, 2025

View reviewed changes

travisdowns added 4 commits November 25, 2025 09:59

admin: add internal v2 API to log a message

56ebcda

Adds ability to log a message to the Redpanda server logs via an internal admin API. Use for DT tests.

rptest: self-test for raise_on_bad_logs

1469251

I am going to make some changes here, so add a self-test. It uses the new log line API to send log lines to one Redpanda node and checks that raise on bad logs catches them.

rptest: type services/utils.py

c815300

Fully types at strict. This required a couple of touch-up changes in other files, and exposed in a bug on the cloud tests side where we passed tuples instead of list of tuples.

rptest/docker-deps: install ripgrep

452b00b

Install ripgrep from github. In my testing ripgrep is close to 5x faster when searching large logfiles, though it needs to be ripgrep 15, not <= 13 that we'd get on older Ubuntu distros. So install 15.1 from github releases.

travisdowns added 2 commits November 25, 2025 09:59

travisdowns force-pushed the td-BLL-speedup branch from f3f2bc9 to e62fe62 Compare November 25, 2025 12:59

StephanDollberg approved these changes Nov 25, 2025

View reviewed changes

travisdowns merged commit 81a497f into redpanda-data:dev Nov 25, 2025
34 checks passed

	"terminating due to uncaught exception",
	"libc++abi: terminating due to uncaught exception",
	"terminate called after throwing an instance of",

BadLogLines speed ups #28692

BadLogLines speed ups #28692

Uh oh!

Conversation

travisdowns commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Backports Required

Release Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

travisdowns Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

travisdowns Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

vbotbuildovich commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Retry command for Build#76817

Uh oh!

vbotbuildovich commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI test results

Uh oh!

StephanDollberg left a comment

Choose a reason for hiding this comment

Uh oh!

travisdowns commented Nov 24, 2025

Uh oh!

travisdowns commented Nov 24, 2025

Uh oh!

travisdowns commented Nov 24, 2025

Uh oh!

travisdowns commented Nov 25, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

travisdowns commented Nov 21, 2025 •

edited

Loading

vbotbuildovich commented Nov 21, 2025 •

edited

Loading

vbotbuildovich commented Nov 21, 2025 •

edited

Loading