
fix: fix CS kill/crash when writing data#755

Merged
dmga44 merged 1 commit into dev from fix-cs-kill-when-writing-data on Feb 20, 2026
Conversation

Collaborator

@dmga44 dmga44 commented Feb 18, 2026

Recent tests show chunks unavailable when performing the following test:

  • start writing small files in ec(6,2) in the background.
  • kill two chunkservers.
  • wait for the writes of the files to finish.
  • bring the two chunkservers back.
  • wait for the data to be replicated.
  • stop some other two chunkservers.
  • validate data is available.

In the last step, there are six chunkservers available and no chunk parts missing, so no chunks should be unavailable. The error observed was a CRC error in the killed and restarted chunkservers.

The issue found is the following:

  • some chunk gets its data parts successfully written to the drive.
  • the client gets to know this (chunk write finished OK) and sends a
    WRITE_END packet to the CSs.
  • the CS gets killed after receiving the WRITE_END but before doing
    the job_close (hddClose), which is the function responsible for syncing the metadata parts to the drive. Therefore, the data parts of those chunks are fine, but the on-disk CRCs of the blocks are incorrect.
  • the client unlocks the chunk in the master side (WRITE_END packet)
    without noticing any issue and without retrying the write (since it finished everything it had to write).
  • there is no version increase in the other chunk parts and after the
    CS is restarted, its chunk parts are registered as good ones, despite the previously mentioned CRC error (which no component knows about).
  • after stopping other CSs and trying to read, the issue emerges.

The solution so far is to move the endChunkLock call to after the job_close is processed and to increase the priority of the close operations. This way we make sure that the master receives notice of the write end only after all operations on that chunk part are completed.

This solution does not cover the case when the USE_CHUNKSERVER_SIDE_CHUNK_LOCK option is disabled.

A test was added to check the previously mentioned scenario.

Related to LS-351.

Signed-off-by: Dave <dave@leil.io>

@gemini-code-assist

Summary of Changes

Hello @dmga44, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical data integrity issue where chunks could become unavailable due to CRC errors after chunkservers were killed and restarted during write operations. The problem stemmed from the chunk lock being released before metadata synchronization (job_close) was guaranteed to complete. The fix ensures that the chunk lock is only released after the metadata has been successfully synced to disk, preventing inconsistencies. Additionally, the priority of Close operations has been elevated to expedite this critical synchronization. A new long system test has been introduced to validate the robustness of this solution under stress conditions.

Highlights

  • Prioritized Close Operations: Elevated the priority of ChunkOperation::Close jobs within the JobPool to ensure metadata synchronization completes more quickly.
  • Synchronized Chunk Lock Release: Modified the WriteHighLevelOp::cleanup logic to delay the release of chunk locks (endChunkLock) until after the job_close operation, which syncs metadata, has successfully completed.
  • Introduced Callback for Close Operations: Implemented jobCloseWriteCallback to facilitate the asynchronous execution of job_close and ensure the chunk lock is released with the correct status post-synchronization.
  • Added System Test: Included a new long system test (test_kill_cs_while_writing_small_files.sh) to reliably reproduce the original issue and verify the effectiveness of the fix under scenarios involving chunkserver failures during writes.
Changelog
  • src/chunkserver/bgjobs.cc
    • Added ChunkOperation::Close to the list of high-priority operations in the JobPool.
  • src/chunkserver/chunk_high_level_ops.cc
    • Modified WriteHighLevelOp::cleanup to conditionally use jobCloseWriteCallback for job_close when a chunk is locked, ensuring endChunkLock is called after metadata sync.
    • Implemented jobCloseWriteCallback to handle the status propagation and chunk lock release after a job_close operation.
  • src/chunkserver/chunk_high_level_ops.h
    • Declared the jobCloseWriteCallback function.
  • tests/test_suites/LongSystemTests/test_kill_cs_while_writing_small_files.sh
    • Added a new long system test script to simulate chunkserver kills during small file writes and validate data integrity.
Activity
  • No human activity (comments, reviews, etc.) has been recorded on this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the critical issue of data inconsistency and CRC errors observed after chunkserver crashes during write operations. The core of the fix involves ensuring that the endChunkLock call, which notifies the master about the completion of a write, is deferred until after the job_close operation (responsible for syncing metadata to disk) has successfully completed. This is achieved by introducing a callback mechanism for job_close in WriteHighLevelOp::cleanup.

Additionally, the priority of ChunkOperation::Close jobs has been elevated in the JobPool, which is a sensible change to ensure that metadata synchronization operations are processed promptly. The inclusion of a comprehensive long system test (test_kill_cs_while_writing_small_files.sh) is highly commendable. This test rigorously simulates the failure scenario, providing strong validation for the fix and helping prevent future regressions. Overall, the changes are well-reasoned, correctly implemented, and significantly improve the robustness of the system under failure conditions.


Copilot AI left a comment


Pull request overview

This PR fixes a critical data integrity issue where chunkservers that are killed during write operations can leave chunks with correct data but incorrect CRC metadata, leading to chunks becoming unavailable when other chunkservers are stopped.

Changes:

  • Modified write operation cleanup to ensure metadata sync completes before releasing chunk locks
  • Increased priority of close operations to match open and getBlocks operations
  • Added comprehensive test to validate the fix by simulating the kill/restart scenario

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
tests/test_suites/LongSystemTests/test_kill_cs_while_writing_small_files.sh New test that reproduces the bug scenario by writing files, killing chunkservers mid-write, and validating data integrity
src/chunkserver/chunk_high_level_ops.h Added declaration for new callback function that handles chunk lock release after close completion
src/chunkserver/chunk_high_level_ops.cc Implemented deferred chunk lock release using callback to ensure metadata sync completes first
src/chunkserver/bgjobs.cc Updated priority logic to give close operations the same high priority as open and getBlocks

@dmga44 dmga44 force-pushed the fix-cs-kill-when-writing-data branch 3 times, most recently from 639354c to aa85e82 Compare February 19, 2026 14:26
Contributor

@lgsilva3087 lgsilva3087 left a comment


LGTM

Contributor

@ralcolea ralcolea left a comment


Nice job @dmga44! 👍

Collaborator

@rolysr rolysr left a comment


Very nice catch! LGTM.

@dmga44 dmga44 force-pushed the fix-cs-kill-when-writing-data branch 3 times, most recently from 593c5e5 to 196e86f Compare February 20, 2026 05:27
@dmga44 dmga44 requested a review from Copilot February 20, 2026 05:28
@dmga44
Collaborator Author

dmga44 commented Feb 20, 2026

@gemini-code-assist Please review again.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical issue where a chunkserver crash during a write operation could lead to data corruption. The solution, which involves ensuring metadata is synced before releasing the chunk lock via a callback and prioritizing close operations, appears to be correct and well-implemented. The addition of a new system test to validate the fix is also a great step.

I've identified a critical data race in the accompanying termination logic refactoring that needs to be addressed. I've also provided a suggestion to improve the robustness of the new test script.


Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

@dmga44 dmga44 force-pushed the fix-cs-kill-when-writing-data branch from 196e86f to de7104e Compare February 20, 2026 05:45
@dmga44 dmga44 merged commit 642aaf6 into dev Feb 20, 2026
11 checks passed
@dmga44 dmga44 deleted the fix-cs-kill-when-writing-data branch February 20, 2026 08:43