Conversation

dumbbell
Contributor

Motivation and Context

I hit a failing testsuite while working on RabbitMQ on FreeBSD: a read(2) returned unexpected (zeroed) data even though the data had been successfully written to the file.

Description

Initially, zfs_getpages() is provided with an array of busy pages by the vnode pager. It then tries to acquire the range lock; if a concurrent zfs_write() is running and the attempt fails, zfs_getpages() "unbusies" the pages to avoid a deadlock with zfs_write(). After that, it grabs the pages again, retries the range lock, and so on.
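
For readers unfamiliar with this code path, here is a minimal illustrative sketch of that retry loop. This is not the actual `zfs_getpages()` source: the variable names (`ma`, `count`, `off`, `len`, `i`) and the exact allocation flags are assumptions.

    /* Illustrative sketch only -- simplified from the description
     * above, not the real OpenZFS code. */
    for (;;) {
        lr = zfs_rangelock_tryenter(&zp->z_rangelock, off, len,
            RL_READER);
        if (lr != NULL)
            break;          /* Got the range lock; proceed. */

        /* A concurrent zfs_write() holds the lock: unbusy our pages
         * so the writer cannot deadlock against us. */
        for (i = 0; i < count; i++)
            vm_page_sunbusy(ma[i]);

        /* Wait until the writer releases the range, then grab (and
         * busy) the pages again before retrying. Note VM_ALLOC_ZERO:
         * this is the flag the patch removes. */
        lr = zfs_rangelock_enter(&zp->z_rangelock, off, len, RL_READER);
        zfs_rangelock_exit(lr);
        vm_page_grab_pages(object, OFF_TO_IDX(off),
            VM_ALLOC_NORMAL | VM_ALLOC_ZERO, ma, count);
    }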

Once it has the range lock, it filters out the pages that are already valid, then copies DMU data into the remaining invalid pages.
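
Sketched under the same assumptions (illustrative only; `rbehind`, `rahead` and `last_size` are assumed locals, and the real code also handles read-ahead/read-behind and a partial last page):

    /* Illustrative: trim already-valid pages off both ends of the
     * array, then ask the DMU to fill the rest from the file. */
    while (count > 0 && vm_page_all_valid(ma[0])) {
        ma++;
        off += PAGE_SIZE;
        count--;
    }
    while (count > 0 && vm_page_all_valid(ma[count - 1]))
        count--;
    if (count > 0)
        (void)dmu_read_pages(zfsvfs->z_os, zp->z_id, ma, count,
            &rbehind, &rahead, last_size);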

The problem is that the freshly allocated, zeroed pages that zfs_getpages() grabbed itself are marked valid. They are therefore skipped by the second part of the function, and DMU data is never copied into them. As a result, mapped pages contain zeros instead of the expected file content.

This was discovered while working on RabbitMQ on FreeBSD. The RabbitMQ testsuite fails because a sendfile(2) can run concurrently with a write(2) on the same file. sendfile(2), or a read(2) issued after it, then sends or returns data containing zeros, which makes a function in the testsuite crash.

The patch stops setting the VM_ALLOC_ZERO flag when zfs_getpages() grabs the pages again. The last page is then zeroed if it is invalid, in case it is only partially covered by the end of the file content. The other pages are either valid (and will be skipped) or will be entirely overwritten with the file content.
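
A minimal sketch of the fix under the same assumptions (not the actual patch hunks; the real flag set and helper calls may differ):

    /* 1. Re-grab WITHOUT VM_ALLOC_ZERO, so pages that were invalid
     *    before stay invalid instead of coming back zeroed and marked
     *    valid. */
    vm_page_grab_pages(object, OFF_TO_IDX(off), VM_ALLOC_NORMAL,
        ma, count);

    /* 2. Only the last page can be partially covered by the end of
     *    the file, so zero it explicitly if it is still invalid. */
    if (!vm_page_all_valid(ma[count - 1]))
        pmap_zero_page(ma[count - 1]);

    /* 3. Every other invalid page is entirely overwritten when the
     *    DMU copies the file content in, so it needs no pre-zeroing;
     *    valid pages are skipped as before. */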

How Has This Been Tested?

I could reproduce the problem easily with the following commands:

git clone https://github.com/rabbitmq/rabbitmq-server.git
cd rabbitmq-server/deps/rabbit

gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \
  ct-amqp_client t=cluster_size_3:leader_transfer_stream_send

I’m running FreeBSD 16-CURRENT from a couple of weeks ago. I have been able to reproduce the problem for the past six months; I just didn’t have the time to work on it until now.

I ran the testsuite in a loop. On my laptop, it took about 3-4 minutes to hit the problem without the patch; with the patch, the testsuite ran successfully for 4 hours.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@robn
Member

robn commented Oct 16, 2025

@dumbbell dang, that's unpleasant! Thanks for the patch and clear description.

@markjdb would appreciate your eye on this. Seems plausible?

@dumbbell dumbbell force-pushed the fix-zfs_getpages-skipping-zeroed-pages branch from ebc5a1e to 8a3533a on October 19, 2025 14:47
@amotin amotin added the Status: Code Review Needed Ready for review and testing label Oct 20, 2025
Contributor

@markjdb markjdb left a comment

Thank you. We should try to land this change in 15.0. Please let me know if I can help with that.

@amotin amotin added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Oct 20, 2025
@dumbbell
Contributor Author

I’m not sure the question was meant for me, but I don’t know the process for porting this to FreeBSD. I have this commit in my local FreeBSD branch, which I run on a daily basis, but I suppose it’s not just a matter of pushing it to the main branch.

Is it documented somewhere?

@amotin
Member

amotin commented Oct 20, 2025

@dumbbell It is not unusual to see direct commits to FreeBSD when something is needed faster, as long as upstream is kept in sync. But generally @mmatuska merges quite regularly. We’ll need to prepare the ZFS 2.4.0-RC3 release soon, though, for that to happen.

@markjdb
Contributor

markjdb commented Oct 20, 2025

I presume releng/15.0 isn't going to receive any more ZFS merges. If so, then assuming that there is a merge to main in the next week or two, we can cherry-pick the commit into stable/15 -> releng/15.0. If there will be another ZFS merge into 15.0, then there is nothing for us to do.

The bug is tracked here so we'll catch it one way or another.

@dumbbell
Contributor Author

I see. I can submit a review to Phabricator for this specific patch tonight to get the ball rolling, and if a ZFS merge happens in the meantime, I will abandon it.

dumbbell added a commit to dumbbell/freebsd-src that referenced this pull request Oct 20, 2025
@dumbbell
Contributor Author

Review submitted:
https://reviews.freebsd.org/D53219

freebsd-git pushed a commit to freebsd/freebsd-src that referenced this pull request Oct 20, 2025
This patch was submitted to OpenZFS as openzfs/zfs#17851 which was
approved.

Reviewed by:	avg, mav
Obtained from:	OpenZFS
OpenZFS commit:	8a3533a366e6df2ea770ad7d80b7b68a94a81023
MFC after:	3 days
Differential revision: https://reviews.freebsd.org/D53219
@behlendorf behlendorf merged commit 3a55e76 into openzfs:master Oct 21, 2025
23 of 25 checks passed
@dumbbell dumbbell deleted the fix-zfs_getpages-skipping-zeroed-pages branch October 21, 2025 08:22
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Oct 21, 2025
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Mark Johnston <[email protected]>
Signed-off-by: Jean-Sébastien Pédron <[email protected]>
Closes openzfs#17851