
fix: fix CS kill/crash when writing data#755

Merged
dmga44 merged 1 commit into dev from fix-cs-kill-when-writing-data on Feb 20, 2026
Conversation

Collaborator

@dmga44 dmga44 commented Feb 18, 2026

Recent tests show chunks unavailable when performing the following test:

  • start writing small files in ec(6,2) in the background.
  • kill two chunkservers.
  • wait for the writes of the files to finish.
  • bring the two chunkservers back.
  • wait for the data to be replicated.
  • stop some other two chunkservers.
  • validate data is available.

In the last step, there are six chunkservers available and no chunk parts missing, so no chunks should be unavailable. The error observed was a CRC error in the killed and restarted chunkservers.

The issue found is the following:

  • some chunk gets its data parts successfully written to the drive.
  • the client gets to know this (chunk write finished OK) and sends a
    WRITE_END packet to the CSs.
  • the CS gets killed after receiving the WRITE_END but before doing
    the job_close (hddClose), which is the function responsible for syncing the metadata parts to the drive. Therefore, the data parts of those chunks are fine, but the on-disk CRCs of the blocks are incorrect.
  • the client unlocks the chunk in the master side (WRITE_END packet)
    without noticing any issue and without retrying the write (since it finished everything it had to write).
  • there is no version increase in the other chunk parts and after the
    CS is restarted, its chunk parts are registered as good ones, despite the previously mentioned CRC error (which no component knows about).
  • after stopping other CSs and trying to read, the issue emerges.

The solution so far is to move the endChunkLock call to after the job_close is processed and to increase the priority of the close operations. This way we make sure that the master receives notice of the write end only after all operations on that chunk part are completed.

This solution does not cover the case when the USE_CHUNKSERVER_SIDE_CHUNK_LOCK option is disabled.

A test was added to check the previously mentioned scenario.

Related to LS-351.

Signed-off-by: Dave <dave@leil.io>

@gemini-code-assist

Summary of Changes

Hello @dmga44, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical data integrity issue where chunks could become unavailable due to CRC errors after chunkservers were killed and restarted during write operations. The problem stemmed from the chunk lock being released before metadata synchronization (job_close) was guaranteed to complete. The fix ensures that the chunk lock is only released after the metadata has been successfully synced to disk, preventing inconsistencies. Additionally, the priority of Close operations has been elevated to expedite this critical synchronization. A new long system test has been introduced to validate the robustness of this solution under stress conditions.

Highlights

  • Prioritized Close Operations: Elevated the priority of ChunkOperation::Close jobs within the JobPool to ensure metadata synchronization completes more quickly.
  • Synchronized Chunk Lock Release: Modified the WriteHighLevelOp::cleanup logic to delay the release of chunk locks (endChunkLock) until after the job_close operation, which syncs metadata, has successfully completed.
  • Introduced Callback for Close Operations: Implemented jobCloseWriteCallback to facilitate the asynchronous execution of job_close and ensure the chunk lock is released with the correct status post-synchronization.
  • Added System Test: Included a new long system test (test_kill_cs_while_writing_small_files.sh) to reliably reproduce the original issue and verify the effectiveness of the fix under scenarios involving chunkserver failures during writes.
Changelog
  • src/chunkserver/bgjobs.cc
    • Added ChunkOperation::Close to the list of high-priority operations in the JobPool.
  • src/chunkserver/chunk_high_level_ops.cc
    • Modified WriteHighLevelOp::cleanup to conditionally use jobCloseWriteCallback for job_close when a chunk is locked, ensuring endChunkLock is called after metadata sync.
    • Implemented jobCloseWriteCallback to handle the status propagation and chunk lock release after a job_close operation.
  • src/chunkserver/chunk_high_level_ops.h
    • Declared the jobCloseWriteCallback function.
  • tests/test_suites/LongSystemTests/test_kill_cs_while_writing_small_files.sh
    • Added a new long system test script to simulate chunkserver kills during small file writes and validate data integrity.
Activity
  • No human activity (comments, reviews, etc.) has been recorded on this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the critical issue of data inconsistency and CRC errors observed after chunkserver crashes during write operations. The core of the fix involves ensuring that the endChunkLock call, which notifies the master about the completion of a write, is deferred until after the job_close operation (responsible for syncing metadata to disk) has successfully completed. This is achieved by introducing a callback mechanism for job_close in WriteHighLevelOp::cleanup.

Additionally, the priority of ChunkOperation::Close jobs has been elevated in the JobPool, which is a sensible change to ensure that metadata synchronization operations are processed promptly. The inclusion of a comprehensive long system test (test_kill_cs_while_writing_small_files.sh) is highly commendable. This test rigorously simulates the failure scenario, providing strong validation for the fix and helping prevent future regressions. Overall, the changes are well-reasoned, correctly implemented, and significantly improve the robustness of the system under failure conditions.


Copilot AI left a comment


Pull request overview

This PR fixes a critical data integrity issue where chunkservers that are killed during write operations can leave chunks with correct data but incorrect CRC metadata, leading to chunks becoming unavailable when other chunkservers are stopped.

Changes:

  • Modified write operation cleanup to ensure metadata sync completes before releasing chunk locks
  • Increased priority of close operations to match open and getBlocks operations
  • Added comprehensive test to validate the fix by simulating the kill/restart scenario

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
tests/test_suites/LongSystemTests/test_kill_cs_while_writing_small_files.sh New test that reproduces the bug scenario by writing files, killing chunkservers mid-write, and validating data integrity
src/chunkserver/chunk_high_level_ops.h Added declaration for new callback function that handles chunk lock release after close completion
src/chunkserver/chunk_high_level_ops.cc Implemented deferred chunk lock release using callback to ensure metadata sync completes first
src/chunkserver/bgjobs.cc Updated priority logic to give close operations the same high priority as open and getBlocks

@dmga44 dmga44 force-pushed the fix-cs-kill-when-writing-data branch 3 times, most recently from 639354c to aa85e82 Compare February 19, 2026 14:26
Contributor

@lgsilva3087 lgsilva3087 left a comment


LGTM

Contributor

@ralcolea ralcolea left a comment


Nice job @dmga44! 👍

Collaborator

@rolysr rolysr left a comment


Very nice catch! LGTM.

@dmga44 dmga44 force-pushed the fix-cs-kill-when-writing-data branch 3 times, most recently from 593c5e5 to 196e86f Compare February 20, 2026 05:27
@dmga44 dmga44 requested a review from Copilot February 20, 2026 05:28
@dmga44
Collaborator Author

dmga44 commented Feb 20, 2026

@gemini-code-assist Please review again.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical issue where a chunkserver crash during a write operation could lead to data corruption. The solution, which involves ensuring metadata is synced before releasing the chunk lock via a callback and prioritizing close operations, appears to be correct and well-implemented. The addition of a new system test to validate the fix is also a great step.

I've identified a critical data race in the accompanying termination logic refactoring that needs to be addressed. I've also provided a suggestion to improve the robustness of the new test script.


Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

@dmga44 dmga44 force-pushed the fix-cs-kill-when-writing-data branch from 196e86f to de7104e Compare February 20, 2026 05:45
@dmga44 dmga44 merged commit 642aaf6 into dev Feb 20, 2026
11 checks passed
@dmga44 dmga44 deleted the fix-cs-kill-when-writing-data branch February 20, 2026 08:43