**Summary of Changes**

Hello @dmga44, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request addresses a critical data integrity issue where chunks could become unavailable due to CRC errors after chunkservers were killed and restarted during write operations. The problem stemmed from the chunk lock being released before metadata synchronization (job_close) was guaranteed to complete. The fix ensures that the chunk lock is only released after the metadata has been successfully synced to disk, preventing inconsistencies. Additionally, the priority of Close operations has been elevated to expedite this critical synchronization. A new long system test has been introduced to validate the robustness of this solution under stress conditions.
Code Review
This pull request effectively addresses the critical issue of data inconsistency and CRC errors observed after chunkserver crashes during write operations. The core of the fix involves ensuring that the endChunkLock call, which notifies the master about the completion of a write, is deferred until after the job_close operation (responsible for syncing metadata to disk) has successfully completed. This is achieved by introducing a callback mechanism for job_close in WriteHighLevelOp::cleanup.
Additionally, the priority of ChunkOperation::Close jobs has been elevated in the JobPool, which is a sensible change to ensure that metadata synchronization operations are processed promptly. The inclusion of a comprehensive long system test (test_kill_cs_while_writing_small_files.sh) is highly commendable. This test rigorously simulates the failure scenario, providing strong validation for the fix and helping prevent future regressions. Overall, the changes are well-reasoned, correctly implemented, and significantly improve the robustness of the system under failure conditions.
Pull request overview
This PR fixes a critical data integrity issue where chunkservers that are killed during write operations can leave chunks with correct data but incorrect CRC metadata, leading to chunks becoming unavailable when other chunkservers are stopped.
Changes:
- Modified write operation cleanup to ensure metadata sync completes before releasing chunk locks
- Increased priority of close operations to match open and getBlocks operations
- Added comprehensive test to validate the fix by simulating the kill/restart scenario
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_suites/LongSystemTests/test_kill_cs_while_writing_small_files.sh | New test that reproduces the bug scenario by writing files, killing chunkservers mid-write, and validating data integrity |
| src/chunkserver/chunk_high_level_ops.h | Added declaration for new callback function that handles chunk lock release after close completion |
| src/chunkserver/chunk_high_level_ops.cc | Implemented deferred chunk lock release using callback to ensure metadata sync completes first |
| src/chunkserver/bgjobs.cc | Updated priority logic to give close operations the same high priority as open and getBlocks |
Force-pushed from 639354c to aa85e82.
Force-pushed from 593c5e5 to 196e86f.
@gemini-code-assist Please review again.
Code Review
This pull request addresses a critical issue where a chunkserver crash during a write operation could lead to data corruption. The solution, which involves ensuring metadata is synced before releasing the chunk lock via a callback and prioritizing close operations, appears to be correct and well-implemented. The addition of a new system test to validate the fix is also a great step.
I've identified a critical data race in the accompanying termination logic refactoring that needs to be addressed. I've also provided a suggestion to improve the robustness of the new test script.
Recent tests show chunks becoming unavailable in the following scenario:

- Start writing small files in EC(6,2) in the background.
- Kill two chunkservers.
- Wait for the writes of the files to finish.
- Bring the two chunkservers back.
- Wait for the data to be replicated.
- Stop two other chunkservers.
- Validate that the data is available.

In the last step there are six chunkservers available and no chunk parts missing, so no chunks should be unavailable. The error observed was a CRC error on the killed and restarted chunkservers.

The issue found is the following:

- Some chunk gets its data parts successfully written to the drive.
- The client learns this (chunk write finished OK) and sends a WRITE_END packet to the chunkservers.
- A chunkserver is killed after receiving the WRITE_END but before running job_close (hddClose), the function responsible for syncing the metadata parts to the drive. As a result, the data parts of those chunks are fine, but the CRC of the blocks is incorrect.
- The client unlocks the chunk on the master side (WRITE_END packet) without noticing any issue and without retrying the write (since it finished everything it had to write).
- There is no version increase in the other chunk parts, so after the chunkserver is restarted its chunk parts are registered as good ones, despite the CRC error (which no component knows about).
- After stopping two other chunkservers and trying to write, the issue emerges.

The solution so far is to move the endChunkLock call to after job_close is processed and to increase the priority of close operations. This way the master is notified about the write end only after all chunk-part-related operations have completed.

This solution does not cover the case where the USE_CHUNKSERVER_SIDE_CHUNK_LOCK option is disabled.

A test was added to check the scenario described above.

Signed-off-by: Dave <dave@leil.io>
Force-pushed from 196e86f to de7104e.
Related to LS-351.