FreeBSD: zfs_getpages: Don't zero freshly allocated pages #17851

dumbbell · 2025-10-16T08:51:33Z

Motivation and Context

I hit a failing testsuite while working on RabbitMQ on FreeBSD: a read(2) returned unexpected (zero'd) data even though the data was successfully written to the file.

Description

Initially, zfs_getpages() is provided with an array of busy pages by the vnode pager. It then tries to acquire the range lock. If a concurrent zfs_write() is running and zfs_getpages() fails to acquire that range lock, it "unbusies" the pages to avoid a deadlock with zfs_write(). After that, it grabs the pages again and tries to acquire the range lock again, and so on.

Once it has the range lock, it filters out valid pages, then copies DMU data to the remaining invalid pages.

The problem is that freshly allocated zero'd pages it grabbed itself are marked as valid. Therefore they are skipped by the second part of the function and DMU data is never copied to these pages. This causes mapped pages to contain zeros instead of the expected file content.
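The failure mode can be illustrated with a toy user-space model in Python. This is only a sketch: `Page`, `grab_pages()` and `fill_from_dmu()` are hypothetical stand-ins, not the FreeBSD VM or OpenZFS API, and the assumption that a zeroing grab leaves the page marked valid is taken from the description above.

```python
from dataclasses import dataclass

PAGE_SIZE = 4  # tiny pages keep the example readable


@dataclass
class Page:
    data: bytes = b"?" * PAGE_SIZE  # stale contents of a fresh page
    valid: bool = False


def grab_pages(count, zero):
    """Hypothetical stand-in for re-grabbing pages after unbusying them."""
    pages = [Page() for _ in range(count)]
    if zero:
        for page in pages:
            # Modelled on the description: the zeroed page ends up valid.
            page.data = bytes(PAGE_SIZE)
            page.valid = True
    return pages


def fill_from_dmu(pages, file_data):
    """Copy file content into invalid pages only; valid pages are skipped."""
    for i, page in enumerate(pages):
        if page.valid:
            continue
        chunk = file_data[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
        page.data = chunk.ljust(PAGE_SIZE, b"\0")
        page.valid = True


file_data = b"ABCDEFGH"
pages = grab_pages(2, zero=True)
fill_from_dmu(pages, file_data)
mapped = b"".join(page.data for page in pages)
print(mapped)  # all zeros, not the file content
```

In this model every re-grabbed page is already valid when the copy step runs, so the copy loop never touches it and the reader sees zeros.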

This was discovered while working on RabbitMQ on FreeBSD. The RabbitMQ testsuite fails because a sendfile(2) can happen concurrently with a write(2) on the same file. This leads to sendfile(2) or read(2) (after the sendfile) sending/returning data with zeros, which causes a function to crash.

The patch consists of not setting the VM_ALLOC_ZERO flag when zfs_getpages() grabs pages again. Then, the last page is zero'd if it is invalid, in case it is only partially filled by the end of the file content. Other pages are either valid (and will be skipped) or they will be entirely overwritten by the file content.
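The patched flow can be sketched with a small Python model. It is not the actual patch: `read_pages()` and the `[contents, valid]` page representation are hypothetical, and the model assumes every re-grabbed page starts out invalid with stale contents.

```python
PAGE_SIZE = 4  # tiny pages keep the example readable


def read_pages(file_data, npages):
    """Toy model of the patched flow: pages are re-grabbed without zeroing."""
    # Each page is [contents, valid]; fresh pages hold stale bytes and are invalid.
    pages = [[bytearray(b"?" * PAGE_SIZE), False] for _ in range(npages)]

    # Zero only the last page if it is invalid, since the file may end inside it.
    last = pages[-1]
    if not last[1]:
        last[0][:] = bytes(PAGE_SIZE)

    # Copy file content into every invalid page; valid pages would be skipped.
    for i, page in enumerate(pages):
        if page[1]:
            continue
        chunk = file_data[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
        page[0][:len(chunk)] = chunk
        page[1] = True

    return b"".join(bytes(page[0]) for page in pages)


result = read_pages(b"ABCDEF", 2)
print(result)  # b'ABCDEF\x00\x00'
```

The first page is fully overwritten by file content, so its stale bytes never reach the reader, while the tail of the last page past the end of the file stays zeroed.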

How Has This Been Tested?

I could reproduce the problem easily with the following commands:

git clone https://github.com/rabbitmq/rabbitmq-server.git
cd rabbitmq-server/deps/rabbit

gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \
  ct-amqp_client t=cluster_size_3:leader_transfer_stream_send

I’m running FreeBSD 16-CURRENT from a couple of weeks ago. I have been able to reproduce the problem for the past 6 months; I just didn’t have the time to work on it.

I ran the testsuite in a loop. On my laptop, it took about 3-4 minutes to hit the problem without the patch. With the patch, the testsuite ran successfully for 4 hours.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

Initially, `zfs_getpages()` is provided with an array of busy pages by the vnode pager. It then tries to acquire the range lock. If a concurrent `zfs_write()` is running and `zfs_getpages()` fails to acquire that range lock, it "unbusies" the pages to avoid a deadlock with `zfs_write()`. After that, it grabs the pages again and tries to acquire the range lock again, and so on.

Once it has the range lock, it filters out valid pages, then copies DMU data to the remaining invalid pages.

The problem is that freshly allocated zero'd pages it grabbed itself are marked as valid. Therefore they are skipped by the second part of the function and DMU data is never copied to these pages. This causes mapped pages to contain zeros instead of the expected file content.

This was discovered while working on RabbitMQ on FreeBSD. I could reproduce the problem easily with the following commands:

    git clone https://github.com/rabbitmq/rabbitmq-server.git
    cd rabbitmq-server/deps/rabbit

    gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \
      ct-amqp_client t=cluster_size_3:leader_transfer_stream_send

The testsuite fails because a sendfile(2) can happen concurrently with a write(2) on the same file. This leads to sendfile(2) or read(2) (after the sendfile) sending/returning data with zeros, which causes a function to crash.

The patch consists of not setting the `VM_ALLOC_ZERO` flag when `zfs_getpages()` grabs pages again. Then, the last page is zero'd if it is invalid, in case it is only partially filled by the end of the file content. Other pages are either valid (and will be skipped) or they will be entirely overwritten by the file content.

Signed-off-by: Jean-Sébastien Pédron <[email protected]>

robn · 2025-10-16T23:48:08Z

@dumbbell dang, that's unpleasant! Thanks for the patch and clear description.

@markjdb would appreciate your eye on this. Seems plausible?
