FreeBSD: zfs_getpages: Don't zero freshly allocated pages #17851
Conversation
Initially, `zfs_getpages()` is provided with an array of busy pages by the vnode pager. It then tries to acquire the range lock, but if a concurrent `zfs_write()` is running and it fails to acquire that range lock, it "unbusies" the pages to avoid a deadlock with `zfs_write()`. After that, it grabs the pages again and retries to acquire the range lock, and so on. Once it has the range lock, it filters out the valid pages, then copies DMU data to the remaining invalid pages.

The problem is that the freshly allocated, zero'd pages it grabbed itself are marked as valid. Therefore, they are skipped by the second part of the function and DMU data is never copied to them. This causes mapped pages to contain zeros instead of the expected file content.

This was discovered while working on RabbitMQ on FreeBSD. I could reproduce the problem easily with the following commands:

    git clone https://github.com/rabbitmq/rabbitmq-server.git
    cd rabbitmq-server/deps/rabbit
    gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \
        ct-amqp_client t=cluster_size_3:leader_transfer_stream_send

The testsuite fails because a sendfile(2) can happen concurrently with a write(2) on the same file. This leads to sendfile(2), or a read(2) issued after the sendfile, sending/returning data with zeros, which causes a function to crash.

The patch consists of not setting the `VM_ALLOC_ZERO` flag when `zfs_getpages()` grabs pages again. Then, the last page is zero'd if it is still invalid, in case it is only partially filled with the end of the file content. The other pages are either valid (and will be skipped) or entirely overwritten by the file content.

Signed-off-by: Jean-Sébastien Pédron <[email protected]>
Force-pushed from ebc5a1e to 8a3533a.
Thank you. We should try to land this change in 15.0; please let me know if I can help with that.
I don’t know if the question was for me, but I don’t know the process for porting this to FreeBSD. I have this commit in my local FreeBSD branch that I’m running on a daily basis, but I suppose it’s not just a matter of pushing this commit to the main branch. Is it documented somewhere?
I presume releng/15.0 isn't going to receive any more ZFS merges. If so, then assuming there is a merge to main in the next week or two, we can cherry-pick the commit into stable/15 -> releng/15.0. If there is going to be another ZFS merge into 15.0, then there is nothing for us to do. The bug is tracked here, so we'll catch it one way or another.
I see. I can submit a review to Phabricator for this specific patch tonight to get the ball rolling, and if a ZFS merge happens in the meantime, I will abandon it.
Proposed FreeBSD commit message (same body as the commit message above), with these additional trailers:

This patch was submitted to OpenZFS as openzfs/zfs#17851, where it was approved.

Obtained from: OpenZFS
OpenZFS commit: 8a3533a366e6df2ea770ad7d80b7b68a94a81023
MFC after: 3 days
Differential revision:
Review submitted:
Final FreeBSD commit message (same body as above), with the trailers:

This patch was submitted to OpenZFS as openzfs/zfs#17851, where it was approved.

Reviewed by: avg, mav
Obtained from: OpenZFS
OpenZFS commit: 8a3533a366e6df2ea770ad7d80b7b68a94a81023
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D53219
Final OpenZFS commit message (same body as above), with the trailers:

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Mark Johnston <[email protected]>
Signed-off-by: Jean-Sébastien Pédron <[email protected]>
Closes openzfs#17851
Motivation and Context
I hit a failing testsuite while working on RabbitMQ on FreeBSD: a read(2) returned unexpected (zero'd) data even though the data was successfully written to the file.
Description
Initially, `zfs_getpages()` is provided with an array of busy pages by the vnode pager. It then tries to acquire the range lock, but if a concurrent `zfs_write()` is running and it fails to acquire that range lock, it "unbusies" the pages to avoid a deadlock with `zfs_write()`. After that, it grabs the pages again and retries to acquire the range lock, and so on. Once it has the range lock, it filters out the valid pages, then copies DMU data to the remaining invalid pages.

The problem is that the freshly allocated, zero'd pages it grabbed itself are marked as valid. Therefore, they are skipped by the second part of the function and DMU data is never copied to them. This causes mapped pages to contain zeros instead of the expected file content.

This was discovered while working on RabbitMQ on FreeBSD. The RabbitMQ testsuite fails because a sendfile(2) can happen concurrently with a write(2) on the same file. This leads to sendfile(2), or a read(2) issued after the sendfile, sending/returning data with zeros, which causes a function to crash.
The patch consists of not setting the `VM_ALLOC_ZERO` flag when `zfs_getpages()` grabs pages again. Then, the last page is zero'd if it is still invalid, in case it is only partially filled with the end of the file content. The other pages are either valid (and will be skipped) or entirely overwritten by the file content.
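For illustration only, here is a rough C sketch of the shape of that change. It is not the actual diff: the variable names (`obj`, `ma`, `count`, `startidx`) and the exact allocation flags are assumptions, and the surrounding `zfs_getpages()` logic (busying, the range-lock retry loop, the DMU copy itself) is elided.

```c
/*
 * Sketch only; names and flags are assumptions, not the committed code.
 * Re-grab the pages WITHOUT VM_ALLOC_ZERO, so that freshly allocated
 * pages stay invalid and are later filled from the DMU instead of
 * being treated as already-valid zero'd pages.
 */
for (int i = 0; i < count; i++) {
	ma[i] = vm_page_grab(obj, startidx + i,
	    VM_ALLOC_NORMAL | VM_ALLOC_WAITFAIL);
}

/*
 * The last page may extend past the end of the file and thus only be
 * partially covered by file data, so zero it first if it is still
 * invalid. The other pages are either valid (and skipped) or entirely
 * overwritten with file content.
 */
if (!vm_page_all_valid(ma[count - 1]))
	pmap_zero_page(ma[count - 1]);

/* ... then copy DMU data into the remaining invalid pages ... */
```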
How Has This Been Tested?

I could reproduce the problem easily with the following commands:
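```sh
git clone https://github.com/rabbitmq/rabbitmq-server.git
cd rabbitmq-server/deps/rabbit
gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \
    ct-amqp_client t=cluster_size_3:leader_transfer_stream_send
```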
I’m running FreeBSD 16-CURRENT from a couple of weeks ago. I have been able to reproduce the problem for the past 6 months; I just didn’t have the time to work on it.
I ran the testsuite in a loop. On my laptop, it took about 3-4 minutes to hit the problem without the patch. With the patch, the testsuite ran successfully for 4 hours.
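A loop along these lines is enough to repeat the testsuite until it fails (a sketch; the exact invocation is not recorded in this PR):

```sh
# Repeat the testsuite until a run fails; the gmake invocation is the
# reproduction command shown above.
while gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \
    ct-amqp_client t=cluster_size_3:leader_transfer_stream_send; do
    :
done
```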
Types of changes
Checklist:
All commit messages are properly formatted and contain `Signed-off-by`.