Skip to content

Conversation

@travisdowns
Copy link
Member

@travisdowns travisdowns commented Nov 21, 2025

We have observed timeouts, especially in scale tests, while doing the BadLogLines (BLL) search after the test finishes. Speed this up in two ways:

  • Use rg instead of grep
  • Search the nodes in parallel

This gives about a 9x benefit locally for me, combined. The actual benefit may vary: e.g., if the search is IO limited then the rg change may not help and even the parallel search may not help on docker where one underlying device is used. However on scale tests we use distinct nodes so we are guaranteed a speedup there (and we often use more than 3 nodes as well, giving a better than 3x speedup).

Individual changes have details.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

@travisdowns travisdowns requested a review from a team as a code owner November 21, 2025 18:04
@travisdowns travisdowns requested review from cjayani and Copilot and removed request for a team November 21, 2025 18:04
@travisdowns travisdowns changed the title BadLogLines speeds ups BadLogLines speed ups Nov 21, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the BadLogLines (BLL) search mechanism that runs after tests to identify problematic log entries, addressing timeout issues observed in scale tests. The optimization employs two key strategies: replacing grep with the faster ripgrep tool and parallelizing log searches across nodes using Python's ThreadPoolExecutor.

Key changes:

  • Switch from grep to ripgrep (rg) for faster log searching
  • Implement parallel log scanning across cluster nodes
  • Add comprehensive type hints to the affected Python modules

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/rptest/services/utils.py Core BLL logic refactored with parallel scanning, type hints, and ripgrep support
tests/rptest/services/redpanda.py Updated function signature to use non-nullable default for test_start_time
tests/rptest/services/openmessaging_benchmark.py Updated type hint to use new NodeToLines type
tests/rptest/tests/services_self_test.py Added tests for BLL functionality and LogMessage admin API
tests/docker/ducktape-deps/tool-pkgs Added ripgrep installation to Docker environment
tools/type-checking/type-check-strictness.json Removed utils.py from type-checking exclusion list
proto/redpanda/core/admin/internal/v1/debug.proto Added LogMessage RPC and LogLevel enum definitions
src/v/redpanda/admin/services/internal/debug.{h,cc} Implemented LogMessage RPC handler
tests/rptest/clients/admin/proto/redpanda/core/admin/internal/v1/debug_pb2* Generated protobuf bindings for LogMessage API

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

"UndefinedBehaviorSanitizer",
"Aborting on shard",
"libc++abi: terminating due to uncaught exception",
"terminating due to uncaught exception",
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The pattern was changed from 'libc++abi: terminating due to uncaught exception' to just 'terminating due to uncaught exception', which may match unintended log lines. Consider whether this broader pattern could introduce false positives in error detection.

Suggested change
"terminating due to uncaught exception",
"libc++abi: terminating due to uncaught exception",
"terminate called after throwing an instance of",

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was by design. BRE and ERE differ in the treatment of + (both can use it as a quantifier, but BRE wants it escaped to do so, ERE the reverse) so I just removed it to avoid any possible issues.

Pretty sure the remaining string is good enough!


# Prepare matching terms
self.match_terms = self.DEFAULT_MATCH_TERMS
self.match_terms: list[str] = list(self.DEFAULT_MATCH_TERMS)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before this copy, we were actually modifying DEFAULT_MATCH_TERMS (an class variable) every time we invoked the constructor, this this was growing over time with more ^ERROR terms. Doesn't affect the match and I don't even think it affects the perf since the algos used essentially remove redundant strings as a side-effect.

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Nov 21, 2025

Retry command for Build#76817

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/services_self_test.py::RedpandaServiceSelfTest.test_asan_backtrace
tests/rptest/tests/services_self_test.py::RedpandaServiceSelfTest.test_ubsan_backtrace

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Nov 21, 2025

CI test results

test results on build#76817
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ClusterConfigLegacyDefaultTest test_legacy_default {"wipe_cache": true} integration https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigLegacyDefaultTest&test_method=test_legacy_default
ControlCharacterPermittedAfterUpgrade test_upgrade_from_pre_v23_2 {"initial_version_as_list": [22, 3, 11]} integration https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControlCharacterPermittedAfterUpgrade&test_method=test_upgrade_from_pre_v23_2
IcebergTogglingTest test_iceberg_mode_toggling {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=IcebergTogglingTest&test_method=test_iceberg_mode_toggling
SchemaEvolutionE2ETests test_reorder_columns {"cloud_storage_type": 1, "qe_and_cat": ["trino", "rest_hadoop"]} integration https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaEvolutionE2ETests&test_method=test_reorder_columns
EndToEndShadowIndexingTestWithDisruptions test_write_with_node_failures {"cloud_storage_type": 2} integration https://buildkite.com/redpanda/redpanda/builds/76817#019aa7dd-096b-4979-abd9-7cf8c3035528 FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndShadowIndexingTestWithDisruptions&test_method=test_write_with_node_failures
MaintenanceTest test_exclusive_maintenance {"use_rpk": true} integration https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MaintenanceTest&test_method=test_exclusive_maintenance
RemoteLabelsTest test_share_bucket_delete_topic {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/76817#019aa7c9-f6af-44e0-a922-2dbb99805bd7 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RemoteLabelsTest&test_method=test_share_bucket_delete_topic
RedpandaServiceSelfTest test_asan_backtrace null integration https://buildkite.com/redpanda/redpanda/builds/76817#019aa7dd-095d-4778-af5f-f11b7dace52c FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaServiceSelfTest&test_method=test_asan_backtrace
RedpandaServiceSelfTest test_ubsan_backtrace null integration https://buildkite.com/redpanda/redpanda/builds/76817#019aa7dd-0969-4281-8cbe-ebca1a3b3098 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaServiceSelfTest&test_method=test_ubsan_backtrace
test results on build#76910
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ClusterConfigLegacyDefaultTest test_removal_of_legacy_default_overriden {"wipe_cache": true} integration https://buildkite.com/redpanda/redpanda/builds/76910#019ab7cc-e238-4d9a-9dad-210368459140 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigLegacyDefaultTest&test_method=test_removal_of_legacy_default_overriden
CreateTopicsTest test_case_insensitive_boolean_property null integration https://buildkite.com/redpanda/redpanda/builds/76910#019ab7cc-e238-4d9a-9dad-210368459140 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CreateTopicsTest&test_method=test_case_insensitive_boolean_property
WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/76910#019ab7ce-94c4-4ce8-bcd8-0d78a5a09bec FLAKY 18/21 upstream reliability is '91.51047409040794'. current run reliability is '85.71428571428571'. drift is 5.79619 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

Copy link
Member

@StephanDollberg StephanDollberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parallel search may not help on docker where one underlying device is used

Are you saying we are disk bound on those?

@travisdowns
Copy link
Member Author

parallel search may not help on docker where one underlying device is used

Are you saying we are disk bound on those?

No, only that if we are disk bound (or close to it, i.e., using more than 1/Nth of the disk BW for N redpanda node log search) then this won't help (or will get less N speedup if we are close).

m5d.12xlarge has at least 1.4 GB/s thoughput on its local disks, while grep CPU side seems like it will be < 500 MB/s, so I don't think we are disk limited in CI (plus page caching, which is hard to evaluate).

@travisdowns
Copy link
Member Author

f3f2bc9 is a pure rebase

@travisdowns
Copy link
Member Author

7665019 converts the expression from GNU BRE to ERE, which will fix the two self test failures.

Adds ability to log a message to the Redpanda server logs via an
internal admin API. Use for DT tests.
I am going to make some changes here, so add a self-test. It uses the
new log line API to send log lines to one Redpanda node and checks that
raise on bad logs catches them.
Fully types at strict. This required a couple of touch-up changes in
other files, and exposed in a bug on the cloud tests side where we
passed tuples instead of list of tuples.
Install ripgrep from github.

In my testing ripgrep is close to 5x faster when searching large
logfiles, though it needs to be ripgrep 15, not <= 13 that we'd get
on older Ubuntu distros. So install 15.1 from github releases.
Use ripgrep (rg) instead of grep for log searching in rptest.

In my benchmark this results in about a 3x speedup when searching
through 3 logs in parallel resulting from creating 10 topics of 1000
partitions.

The benchmark is also included in this change, though @ignored so it
does not run CI (it takes a minute or so).

This requires translating GNU BRE (used by grep by default) to ERE
(used by grep -E, and ripgrep). It was probably a mistake to use BRE
in the first place, but it is what it is.
We query each node one by one to check the log for bad log lines, but
we can do this in parallel: it puts almost no load on the runner, and
the work happens on the nodes, so let's speed it up.

This results in about a 3x speed in my benchmark, on top of the 3x
speedup from switching to rg (so about 9x overall).
@travisdowns
Copy link
Member Author

e62fe62 is a pure rebase in an attempt to get green CI.

@travisdowns travisdowns merged commit 81a497f into redpanda-data:dev Nov 25, 2025
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants