Skip to content

Commit cb73e4c

Browse files
committed
perf(chunkserver): Protect IO performance during massive delete
This PR further reduces trash-GC impact on foreground I/O by dynamically throttling expired-file deletion based on observed disk throughput, and improves delete wall-clock time by running removals per-disk in parallel. Key changes: - I/O-aware GC throttling (with floor guard) - Adds GC-specific I/O counters in HddStats and increments them on every read/write accounting path. - In ChunkTrashManagerImpl::collectGarbage(): - Samples and resets GC I/O counters each cycle (exchange(0)). - Normalizes by disk count (gDisks.size() under gDisksMutex) to estimate per-disk pressure. - Maintains max observed per-disk bytes (read/write) to create a stable baseline. - Computes totalIOPercentage and feeds it through an inverted sigmoid (steepness=12, center=0.3) to scale down trashGarbageCollectorBulkSize under load. - Adds a minimum threshold: only perform expired-file deletion when the scaled bulk size is >= 5, preventing “death by a thousand tiny deletes” during heavy I/O. - Parallel trash removal across disks -removeTrashFiles() now spawns a std::jthread per disk and delegates actual deletion/index updates to removeTrashFilesFromDisk() allowing independent disks to progress concurrently. - Behavioral defaults - CHUNK_TRASH_ENABLED default flips 0 → 1 (trash manager enabled unless explicitly disabled). - Default retention (kDefaultTrashTimeLimitSeconds) flips 259200 → 0 (immediate expiry unless configured). - Changed documentation accordingly. - Resource thread pacing - hddFreeResourcesThread() loop period increases 2s → 3s, replacing sleep() with Timeout + usleep(timeout.remaining_us()) for more precise loop cadence. - Changed the functionality of implMutex for protecting only the getter and setter of the current GC implementation. - Test dependency: - Disabled GC for rebalancing test, mainly because chunks in trash affect disk usage statistics. -Added more sleep time to sfschunkserver_check_no_buffer_in_use() given that GC cycles now take more time. Rationale: Large delete storms can saturate disks and degrade chunkserver latency. This patch makes the garbage collector self-throttling using real I/O signals and avoids thrashing by skipping deletion work when the system is already busy, while parallelizing deletes per disk to reclaim space faster when bandwidth is available. Signed-off-by: GigaCronos <jorge.cabrera@leil.io>
1 parent 6d0fe4e commit cb73e4c

16 files changed

+134
-42
lines changed

doc/sfschunkserver.cfg.5.adoc

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -186,7 +186,7 @@ defined at build time), /usr/local/lib/saunafs/plugins/chunkserver, and
186186

187187
*CHUNK_TRASH_ENABLED (EXPERIMENTAL)*:: enables or disables the chunk trash
188188
feature. When enabled, deleted chunks are moved to a trash directory instead of
189-
being immediately removed. (Default: 0)
189+
being immediately removed. (Default: 1)
190190

191191
*CHUNK_TRASH_EXPIRATION_SECONDS (EXPERIMENTAL)*:: specifies the timeout in
192192
seconds for chunks to remain in the trash before being permanently deleted.
@@ -198,9 +198,11 @@ threshold, the system will start deleting older chunks from the trash to free
198198
up space. A suggested value is about the 15% of the capacity of the smaller
199199
configured disk. (Default: 0)
200200

201-
*CHUNK_TRASH_GC_BATCH_SIZE (EXPERIMENTAL)*:: defines the bulk size for the
202-
garbage collector when processing chunks in the trash. This determines how many
203-
files are processed in each garbage collection cycle. (Default: 1000)
201+
*CHUNK_TRASH_GC_BATCH_SIZE (EXPERIMENTAL)*:: defines the maximum bulk size
202+
per disk for the garbage collector when processing chunks in the trash.
203+
This value is used when no I/O pressure is detected and is dynamically
204+
scaled down based on current disk load.
205+
(Default: 500)
204206

205207
*CHUNK_TRASH_GC_SPACE_RECOVERY_BATCH_SIZE (EXPERIMENTAL)*:: [ADVANCED] The
206208
number of files to remove from the trash in a single GC cycle, in case the disk

src/admin/dump_config_command.cc

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -182,10 +182,10 @@ const static std::unordered_map<std::string, std::string> defaultOptionsCS = {
182182
{"REPLICATION_CONNECTION_TIMEOUT_MS", "1000"},
183183
{"REPLICATION_WAVE_TIMEOUT_MS", "500"},
184184
{"PLUGINS_DIR", ""},
185-
{"CHUNK_TRASH_ENABLED", "0"},
185+
{"CHUNK_TRASH_ENABLED", "1"},
186186
{"CHUNK_TRASH_EXPIRATION_SECONDS", "259200"},
187187
{"CHUNK_TRASH_FREE_SPACE_THRESHOLD_GB", "0"},
188-
{"CHUNK_TRASH_GC_BATCH_SIZE", "1000"},
188+
{"CHUNK_TRASH_GC_BATCH_SIZE", "500"},
189189
{"CHUNK_TRASH_GC_SPACE_RECOVERY_BATCH_SIZE", "100"},
190190
};
191191

src/chunkserver/chunkserver-common/chunk_trash_index.cc

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -80,10 +80,8 @@ ChunkTrashIndex::TrashIndexDiskEntries ChunkTrashIndex::getExpiredFilesInternal(
8080
const time_t &timeLimit, size_t bulkSize) {
8181
TrashIndexDiskEntries expiredFiles;
8282

83-
size_t count = 0;
8483
for (const auto &diskEntry : trashIndex) {
85-
if (bulkSize != 0 && count >= bulkSize) { break; }
86-
count += getExpiredFilesInternal(diskEntry.first, timeLimit, expiredFiles, bulkSize);
84+
getExpiredFilesInternal(diskEntry.first, timeLimit, expiredFiles, bulkSize);
8785
}
8886

8987
return expiredFiles;

src/chunkserver/chunkserver-common/chunk_trash_manager.cc

Lines changed: 22 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -54,9 +54,11 @@ int ChunkTrashManager::moveToTrash(const std::filesystem::path &filePath,
5454
if (!isEnabled) { return 0; }
5555

5656
// Protect against concurrent access
57-
std::lock_guard<std::mutex> lock(implMutex);
58-
59-
auto &impl = getImpl();
57+
ImplementationPtr impl;
58+
{
59+
std::lock_guard<std::mutex> lock(implMutex);
60+
impl = getImpl();
61+
}
6062
if (!impl) {
6163
safs::log_error_code(SAUNAFS_ERROR_EINVAL,
6264
"ChunkTrashManager implementation not initialized");
@@ -69,9 +71,11 @@ int ChunkTrashManager::init(const std::string &diskPath) {
6971
reloadConfig();
7072

7173
// Protect against concurrent access
72-
std::lock_guard<std::mutex> lock(implMutex);
73-
74-
auto &impl = getImpl();
74+
ImplementationPtr impl;
75+
{
76+
std::lock_guard<std::mutex> lock(implMutex);
77+
impl = getImpl();
78+
}
7579
if (!impl) {
7680
safs::log_error_code(SAUNAFS_ERROR_EINVAL,
7781
"ChunkTrashManager implementation not initialized");
@@ -84,9 +88,11 @@ void ChunkTrashManager::collectGarbage() {
8488
if (!isEnabled) { return; }
8589

8690
// Protect against concurrent access
87-
std::lock_guard<std::mutex> lock(implMutex);
88-
89-
auto &impl = getImpl();
91+
ImplementationPtr impl;
92+
{
93+
std::lock_guard<std::mutex> lock(implMutex);
94+
impl = getImpl();
95+
}
9096
if (!impl) {
9197
safs::log_error_code(SAUNAFS_ERROR_EINVAL,
9298
"ChunkTrashManager implementation not initialized");
@@ -96,17 +102,19 @@ void ChunkTrashManager::collectGarbage() {
96102
}
97103

98104
void ChunkTrashManager::reloadConfig() {
99-
// Protect against concurrent access
100-
std::lock_guard<std::mutex> lock(implMutex);
101-
102-
auto &impl = getImpl();
105+
// Protect against concurrent access
106+
ImplementationPtr impl;
107+
{
108+
std::lock_guard<std::mutex> lock(implMutex);
109+
impl = getImpl();
110+
}
103111
if (!impl) {
104112
safs::log_error_code(SAUNAFS_ERROR_EINVAL,
105113
"ChunkTrashManager implementation not initialized");
106114
return;
107115
}
108116

109-
isEnabled = cfg_get("CHUNK_TRASH_ENABLED", static_cast<u_short>(0));
117+
isEnabled = cfg_get("CHUNK_TRASH_ENABLED", static_cast<u_short>(1));
110118
safs::log_info("Chunk trash manager is {}", isEnabled ? "enabled" : "disabled");
111119
impl->reloadConfig();
112120
}

src/chunkserver/chunkserver-common/chunk_trash_manager_impl.cc

Lines changed: 63 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -20,11 +20,13 @@
2020

2121
#include <sys/statvfs.h>
2222
#include <algorithm>
23+
#include <cstdint>
2324
#include <ctime>
2425

2526
#include "chunkserver-common/chunk_trash_manager_impl.h"
2627
#include "config/cfg.h"
2728
#include "errors/saunafs_error_codes.h"
29+
#include "global_shared_resources.h"
2830
#include "hdd_stats.h"
2931
#include "slogger/slogger.h"
3032

@@ -36,6 +38,12 @@ size_t ChunkTrashManagerImpl::trashGarbageCollectorBulkSize = kDefaultTrashGarba
3638
size_t ChunkTrashManagerImpl::garbageCollectorSpaceRecoveryStep =
3739
kDefaultGarbageCollectorSpaceRecoveryStep;
3840

41+
uint64_t ChunkTrashManagerImpl::maxBytesReadPerDisk = 1024 * 1024; //1MiB
42+
uint64_t ChunkTrashManagerImpl::maxBytesWritePerDisk = 1024 * 1024; //1MiB
43+
44+
uint64_t ChunkTrashManagerImpl::previousBytesReadPerDisk = 1024 * 1024; //1MiB
45+
uint64_t ChunkTrashManagerImpl::previousBytesWritePerDisk = 1024 * 1024; //1MiB
46+
3947
const std::string ChunkTrashManagerImpl::kTrashGuardString =
4048
std::string("/") + ChunkTrashManager::kTrashDirname + "/";
4149

@@ -159,12 +167,19 @@ int ChunkTrashManagerImpl::moveToTrash(const fs::path &filePath, const fs::path
159167

160168
void ChunkTrashManagerImpl::removeTrashFiles(
161169
const ChunkTrashIndex::TrashIndexDiskEntries &filesToRemove) const {
170+
std::vector<std::jthread> parallelRemovers;
171+
162172
for (const auto &[diskPath, fileEntries] : filesToRemove) {
163-
for (const auto &fileEntry : fileEntries) {
164-
if (removeFileFromTrash(fileEntry.second) != SAUNAFS_STATUS_OK) { continue; }
173+
parallelRemovers.emplace_back(removeTrashFilesFromDisk, fileEntries, diskPath);
174+
}
175+
}
176+
177+
void ChunkTrashManagerImpl::removeTrashFilesFromDisk(
178+
const ChunkTrashIndex::TrashIndexFileEntries &filesToRemove, const std::string &diskPath) {
179+
for (const auto &fileEntry : filesToRemove) {
180+
if (removeFileFromTrash(fileEntry.second) != SAUNAFS_STATUS_OK) { continue; }
165181
HddStats::gStatsOperationsGCPurge++;
166-
getTrashIndex().remove(fileEntry.first, fileEntry.second, diskPath);
167-
}
182+
getTrashIndex().remove(fileEntry.first, fileEntry.second, diskPath);
168183
}
169184
}
170185

@@ -256,8 +271,51 @@ void ChunkTrashManagerImpl::collectGarbage() {
256271
if (!ChunkTrashManager::isEnabled) { return; }
257272
std::time_t const currentTime = std::time(nullptr);
258273
std::time_t const expirationTime = currentTime - trashTimeLimitSeconds;
259-
removeExpiredFiles(expirationTime, trashGarbageCollectorBulkSize);
274+
275+
uint64_t currentBytesWrite = HddStats::gBytesWrittenSinceLastGCSweep.exchange(0);
276+
uint64_t currentBytesRead = HddStats::gBytesReadSinceLastGCSweep.exchange(0);
277+
uint64_t currentDiskCount = 1;
278+
{
279+
std::lock_guard disksLockGuard(gDisksMutex);
280+
currentDiskCount = std::max(currentDiskCount, gDisks.size());
281+
}
282+
currentBytesRead /= currentDiskCount;
283+
currentBytesWrite /= currentDiskCount;
284+
285+
// 0.99997 ^ (30 cycles/min * 60 minutes an hour * 72 hours a day) = 0.02 (2%)
286+
maxBytesReadPerDisk =
287+
std::max({uint64_t(maxBytesReadPerDisk * 0.99997), currentBytesRead, uint64_t(1'000'000)});
288+
maxBytesWritePerDisk = std::max(
289+
{uint64_t(maxBytesWritePerDisk * 0.99997), currentBytesWrite, uint64_t(1'000'000)});
290+
291+
double totalIOPercentage =
292+
static_cast<double>(currentBytesRead) * 100.0 / maxBytesReadPerDisk +
293+
static_cast<double>(currentBytesWrite) * 100.0 / maxBytesWritePerDisk;
294+
295+
auto invertedSigmoid = [](double val) -> double {
296+
const double steepness = 10.0; // steepness
297+
const double center = 0.15; // center in [0,1]
298+
const double valnorm = val / 100.0; // normalize so that 100% total I/O maps to 1.0
299+
300+
const double res = 1.0 / (1.0 + std::exp(-steepness * (valnorm - center)));
301+
return 1.0 - res;
302+
};
303+
304+
uint64_t bulksizeScaled = trashGarbageCollectorBulkSize * invertedSigmoid(totalIOPercentage);
305+
if ((currentBytesRead + 1) / (previousBytesReadPerDisk + 1) +
306+
(currentBytesWrite + 1) / (previousBytesWritePerDisk + 1) >=
307+
10) {
308+
bulksizeScaled = 0;
309+
}
310+
311+
static constexpr uint64_t kMinGCBulkSizeForActivation = 5;
312+
313+
if (bulksizeScaled >= kMinGCBulkSizeForActivation) {
314+
removeExpiredFiles(expirationTime, bulksizeScaled);
315+
}
260316
makeSpace(availableThresholdGB, garbageCollectorSpaceRecoveryStep);
317+
previousBytesReadPerDisk = currentBytesRead;
318+
previousBytesWritePerDisk = currentBytesWrite;
261319
}
262320

263321
bool ChunkTrashManagerImpl::isTrashPath(const std::string &filePath) {

src/chunkserver/chunkserver-common/chunk_trash_manager_impl.h

Lines changed: 18 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,14 @@ class ChunkTrashManagerImpl : public IChunkTrashManagerImpl {
8080
*/
8181
void removeTrashFiles(const ChunkTrashIndex::TrashIndexDiskEntries &filesToRemove) const;
8282

83+
/**
84+
* @brief Removes a set of specified files from the trash for an specific FileEntries.
85+
* @param filesToRemove The list of files to be permanently deleted.
86+
* @param diskPath The path of the disk of the files.
87+
*/
88+
static void removeTrashFilesFromDisk(
89+
const ChunkTrashIndex::TrashIndexFileEntries &filesToRemove, const std::string &diskPath);
90+
8391
/**
8492
* @brief Checks if a given timestamp string matches the expected format.
8593
* @param timestamp The timestamp string to validate.
@@ -137,6 +145,14 @@ class ChunkTrashManagerImpl : public IChunkTrashManagerImpl {
137145
void reloadConfig() override;
138146

139147
private:
148+
/// Maximum recorded BytesReadPerDisk and BytesWritePerDisk in a GC cycle
149+
static uint64_t maxBytesReadPerDisk;
150+
static uint64_t maxBytesWritePerDisk;
151+
152+
/// Previous recorded BytesReadPerDisk and BytesWritePerDisk in a GC cycle
153+
static uint64_t previousBytesReadPerDisk; //1MiB
154+
static uint64_t previousBytesWritePerDisk; //1MiB
155+
140156
/// Minimum available space threshold (in GB) before triggering garbage
141157
/// collection.
142158
static size_t availableThresholdGB;
@@ -145,12 +161,12 @@ class ChunkTrashManagerImpl : public IChunkTrashManagerImpl {
145161
/// Time limit (in seconds) for files to be considered eligible for
146162
/// deletion.
147163
static size_t trashTimeLimitSeconds;
148-
static constexpr size_t kDefaultTrashTimeLimitSeconds = 259200;
164+
static constexpr size_t kDefaultTrashTimeLimitSeconds = 0;
149165

150166
/// Number of files processed in each bulk operation during garbage
151167
/// collection.
152168
static size_t trashGarbageCollectorBulkSize;
153-
static constexpr size_t kDefaultTrashGarbageCollectorBulkSize = 1000;
169+
static constexpr size_t kDefaultTrashGarbageCollectorBulkSize = 500;
154170

155171
/// Number of files to remove in each step to free up space when required.
156172
static size_t garbageCollectorSpaceRecoveryStep;

src/chunkserver/chunkserver-common/hdd_stats.cc

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,7 @@ static inline void totalRead(IDisk *disk, uint64_t size, MicroSeconds duration)
4242

4343
gStatsTotalOperationsRead++;
4444
gStatsTotalBytesRead += size;
45+
gBytesReadSinceLastGCSweep += size;
4546
gStatsTotalTimeRead += duration;
4647

4748
auto &diskStats = disk->getCurrentStats();
@@ -61,6 +62,7 @@ static inline void totalWrite(IDisk *disk, uint64_t size,
6162

6263
gStatsTotalOperationsWrite++;
6364
gStatsTotalBytesWrite += size;
65+
gBytesWrittenSinceLastGCSweep += size;
6466
gStatsTotalTimeWrite += duration;
6567

6668
auto &diskStats = disk->getCurrentStats();

src/chunkserver/chunkserver-common/hdd_stats.h

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,12 @@ inline std::atomic<uint32_t> gStatsOperationsDupTrunc(0);
5858

5959
inline std::atomic<uint32_t> gStatsOperationsGCPurge(0);
6060

61+
62+
// This is for internal use of GarbageCollector
63+
inline std::atomic<uint64_t> gBytesWrittenSinceLastGCSweep(0);
64+
inline std::atomic<uint64_t> gBytesReadSinceLastGCSweep(0);
65+
66+
6167
struct statsReport {
6268
statsReport(uint64_t *overBytesRead, uint64_t *overBytesWrite,
6369
uint32_t *overOpsRead, uint32_t *overOpsWrite,

src/chunkserver/hddspacemgr.cc

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2532,14 +2532,14 @@ void hddFreeResourcesThread() {
25322532
pthread_setname_np(pthread_self(), "freeResThread");
25332533

25342534
while (!gTerminate) {
2535+
Timeout timeout(std::chrono::microseconds(kDelayedStep * 1'000'000));
25352536
gOpenChunks.freeUnused(eventloop_time(), gChunksMapMutex,
25362537
kMaxFreeUnused);
25372538
ChunkTrashManager::collectGarbage();
25382539
hddReleaseDisksToBeDeleted();
25392540
/// Release buffers older than kDelayedStep seconds
25402541
releaseOldIoBuffers(kOldIoBuffersExpirationTimeMs);
2541-
2542-
sleep(kDelayedStep);
2542+
usleep(timeout.remaining_us());
25432543
}
25442544
}
25452545

src/data/sfschunkserver.cfg.in

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -253,10 +253,11 @@
253253
## the capacity of the smaller configured disk. (Default: 0)
254254
# CHUNK_TRASH_FREE_SPACE_THRESHOLD_GB = 0
255255

256-
## Defines the bulk size for the garbage collector when processing chunks in the
257-
## trash. This determines how many files are processed in each garbage
258-
## collection cycle. (Default: 1000)
259-
# CHUNK_TRASH_GC_BATCH_SIZE = 1000
256+
## Defines the maximum bulk size per disk for the garbage
257+
## collector when processing chunks in the trash.
258+
## This value is used when no I/O pressure is detected and is dynamically
259+
## scaled down based on current disk load. (Default: 1000)
260+
# CHUNK_TRASH_GC_BATCH_SIZE = 500
260261

261262
## [ADVANCED] The number of files to remove from the trash in a single GC cycle,
262263
## in case the disk is full and space needs to be recovered. (Default: 100)

0 commit comments

Comments
 (0)