Skip to content

Conversation

@thetumbled
Copy link
Member

@thetumbled thetumbled commented Jul 30, 2025

Motivation

Each time the bookie gc is triggered, the read latency of zookeeper soars high to tens of seconds, threatening the stability of the cluster.
image
The unlimited zookeeper read operation in gc is responsible for the issue.
image

Changes

Add config gcMetadataOpRateLimit to limit the read operation rate in gc.

@thetumbled thetumbled changed the title add rate limit for gc. add rate limit for zk read rate in gc. Jul 30, 2025
@thetumbled
Copy link
Member Author

PTAL, thanks. @StevenLuMT @hezhangjian @hangc0276 @nodece @zymap

Copy link
Member

@StevenLuMT StevenLuMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a gcRateLimiter doesn't fundamentally solve the problem, nor is it practically feasible to implement or limit the number of calls.

I think it could be more practical to break down the rate limits into several categories: rate limits for GC calls to ZooKeeper, rate limits for GC file reads, rate limits for GC file writes, or rate limits for other GC resources.

@thetumbled
Copy link
Member Author

I think adding a gcRateLimiter doesn't fundamentally solve the problem, nor is it practically feasible to implement or limit the number of calls.

Why? I believe adding a rate limiter and acquiring it prior to each zk operation would suffice.

I think it could be more practical to break down the rate limits into several categories: rate limits for GC calls to ZooKeeper, rate limits for GC file reads, rate limits for GC file writes, or rate limits for other GC resources.

We haven't encountered any issues with GC file reads or writes. The only thing I need to limit is the rate of zk reads.

@thetumbled thetumbled requested a review from StevenLuMT August 1, 2025 09:58
Copy link
Member

@nodece nodece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM.

Copy link
Member

@StevenLuMT StevenLuMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM: The previous definition was a bit broad. Now you have sunk to MetadataOpRateLimit, which I understand is still useful and configurable.

@StevenLuMT
Copy link
Member

By the way, is it possible to add a testcase? If it is possible, please add a testcase to verify the correctness of the function. Thank you @thetumbled

@StevenLuMT
Copy link
Member

rerun failure checks

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds rate limiting for ZooKeeper read operations during garbage collection to prevent ZooKeeper latency spikes that threaten cluster stability. The unlimited ZooKeeper operations during GC were causing read latencies to reach tens of seconds.

  • Adds a new configuration parameter gcZkOpRateLimit with a default value of 1000 operations per second
  • Implements rate limiting using Google Guava's RateLimiter in the ScanAndCompareGarbageCollector
  • Applies rate limiting to ZooKeeper read operations during ledger metadata scanning and overreplicated ledger processing

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
ServerConfiguration.java Adds configuration parameter and getter/setter methods for GC ZooKeeper operation rate limiting
ScanAndCompareGarbageCollector.java Implements rate limiting using RateLimiter for ZooKeeper operations during garbage collection
Comments suppressed due to low confidence (1)

bookkeeper-server/src/main/java/org/apache/bookkeeper/conf/ServerConfiguration.java:495

  • The parameter name gcRateLimit in the javadoc doesn't match the actual parameter name gcRateLimit used in the method signature. Consider using a more descriptive name like gcZkOpRateLimit to match the configuration constant and method name.
     * @param gcRateLimit

@thetumbled
Copy link
Member Author

By the way, is it possible to add a testcase? If it is possible, please add a testcase to verify the correctness of the function. Thank you @thetumbled

A test case has been added to verify the correctness of the rate limit feature, PTAL, thanks.

@thetumbled
Copy link
Member Author

rerun failure checks

Copy link
Member

@StevenLuMT StevenLuMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thetumbled thetumbled requested a review from hangc0276 August 11, 2025 02:17
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Names should be thought through from the point of view of the person using the config.
  • The default values should be thought as best value for people who runs the program without tuning parameters.
  • All configuration settings should be added to default configuration file and documented.
  • The new configuration should be cross-referenced with the existing "compaction" rate limiting settings since it will also impact the rate.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thetumbled thetumbled requested a review from lhotari August 12, 2025 06:58
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lhotari lhotari merged commit 417ff16 into apache:master Aug 12, 2025
25 checks passed
@lhotari lhotari added this to the 4.18.0 milestone Aug 12, 2025
zymap pushed a commit that referenced this pull request Jan 12, 2026
* add rate limit for gc.

* fix checkstyle.

* fix conf name and acqurie.

* rename gcZkOpRateLimit to gcMetadataOpRateLimit.

* rename gcZkOpRateLimit to gcMetadataOpRateLimit.

* add return label.

* add test code.

* document conf.

---------

Co-authored-by: fengwenzhi <fengwenzhi.max@bigo.sg>
(cherry picked from commit 417ff16)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants