Skip to content

Conversation

@swuferhong
Copy link
Contributor

Purpose

Linked issue: #616

When I creating a partitioned kv table with a large number of buckets like 512. I found that the CoordinatorServe heap memory kept increasing, eventually reaching around 15 GB. Further inspection of the memory usage revealed that most of it was occupied by KvFileHandler and KvSnapshotHandle.

The main reason for the accumulation of KvFileHandler and KvSnapshotHandle is that the speed of deleting remote snapshot files is very slow (the remote storage is OSS). Additionally, since the current concurrency level for deletion threads is only 1, the backlog becomes increasingly severe, eventually leading to frequent full GCs and even OOM. The preliminary solution is to increase the 'coordinator.io-pool.size' to 10, allowing 10 concurrent threads to handle the deletion tasks. From the test results, CPU usage increases by only about 3%, but the accumulation of KvFileHandler and KvSnapshotHandle no longer occurs.

In the long term, we need to further optimize the memory usage of KvFileHandler and KvSnapshotHandle. Issue #528 will continue to track this.

Brief change log

Tests

API and Format

Documentation

@swuferhong swuferhong requested review from luoyuxia and wuchong March 17, 2025 03:13
@swuferhong swuferhong force-pushed the coordinator-io branch 3 times, most recently from ca67135 to 1cc8f5e Compare March 20, 2025 05:24
@swuferhong swuferhong changed the title [server] Change the default value of 'coordinator.io-pool.size' from 1 to 10 [server] Coordinator use 'server.background.threads' instead of 'coordinator.io-pool.size' as the default executor config Mar 20, 2025
@swuferhong swuferhong linked an issue Mar 20, 2025 that may be closed by this pull request
2 tasks
@swuferhong
Copy link
Contributor Author

@wuchong @luoyuxia comments addressed, pr ready.

@wuchong
Copy link
Member

wuchong commented Mar 22, 2025

Upon further review, I realized that all server.* configurations pertain to the TabletServer. Therefore, using server.background.threads for Coordinator is misleading. The original configuration, coordinator.io-pool.size, is more appropriate and self-explanatory.

Additionally, I found that Flink uses a similar configuration, cluster.io-pool.size, which defaults to 4 * CPU cores for both the Master and TaskExecutor. Given this, I suggest simply increasing the coordinator.io-pool.size to 10. This adjustment is straightforward and aligns with common practices.

Sorry for the back-and-forth on this matter. @swuferhong

@swuferhong
Copy link
Contributor Author

Upon further review, I realized that all server.* configurations pertain to the TabletServer. Therefore, using server.background.threads for Coordinator is misleading. The original configuration, coordinator.io-pool.size, is more appropriate and self-explanatory.

Additionally, I found that Flink uses a similar configuration, cluster.io-pool.size, which defaults to 4 * CPU cores for both the Master and TaskExecutor. Given this, I suggest simply increasing the coordinator.io-pool.size to 10. This adjustment is straightforward and aligns with common practices.

Sorry for the back-and-forth on this matter. @swuferhong

Upon further review, I realized that all server.* configurations pertain to the TabletServer. Therefore, using server.background.threads for Coordinator is misleading. The original configuration, coordinator.io-pool.size, is more appropriate and self-explanatory.

Additionally, I found that Flink uses a similar configuration, cluster.io-pool.size, which defaults to 4 * CPU cores for both the Master and TaskExecutor. Given this, I suggest simply increasing the coordinator.io-pool.size to 10. This adjustment is straightforward and aligns with common practices.

Sorry for the back-and-forth on this matter. @swuferhong

Ok, I will change it back, but I feel that the ioExecutor can still be moved to be managed within the CoordinatorServer. I will keep this part of the code.

@swuferhong swuferhong changed the title [server] Coordinator use 'server.background.threads' instead of 'coordinator.io-pool.size' as the default executor config [server] Change the default value of 'coordinator.io-pool.size' from 1 to 10 Mar 24, 2025
@swuferhong
Copy link
Contributor Author

@wuchong comments addressed.

ConfigOptions.TABLET_SERVER_ID.key()));
}

if (conf.get(ConfigOptions.BACKGROUND_THREADS) < 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COORDINATOR_IO_POOL_SIZE ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TabletServer use BACKGROUND_THREADS as io param. Here I just noticed that there was no such check before, so I added it

@wuchong wuchong merged commit 6b487a7 into apache:main Mar 24, 2025
4 checks passed
ZmmBigdata pushed a commit to ZmmBigdata/fluss that referenced this pull request Jun 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KvFileHandler and KvSnapshotHandle cost too much memory in CoordinatorServer

3 participants