
@chucksilvers

NOTE: this pull request is only to make these changes available for review, I don't intend to merge them in their current state. Also it is probably not worthwhile to examine the individual commits, only the cumulative change, since the individual commits contain a lot of noise that I will rebase away before submitting this for real.

This branch adds support for defining PAGE_SIZE on amd64 to values larger than the base x86 hardware page size of 4k. This reduces CPU consumption for some workloads; in particular, a 16k-page kernel uses about 12% fewer CPU cycles for the Netflix streaming-video workload than a traditional 4k-page kernel. This is accomplished by adding an abstraction layer that makes PTE access and TLB invalidation (mostly) independent of the kernel's definition of PAGE_SIZE, using new "data page" ("datapg") terminology for mappings of whole vm_page_t's, and by defining page table pages ("ptpage_t") as a separate type from the VM system's vm_page_t.

Two implementations of this new abstraction layer are provided, one where PAGE_SIZE equals the hardware 4k page size and another where PAGE_SIZE can be larger than 4k. For the PAGE_SIZE=4096 implementation, ptpage_t is implemented as the existing vm_page_t, and the new pte_datapg functions are implemented as the existing pte functions, so basically everything works exactly the same way as in the existing code. For the larger-pages version, multi-PTE datapg mappings are handled by looping over the individual PTEs as needed.

Not all features of the existing code are supported yet for larger-page kernels, notably these:

  • la57
  • nested page tables
  • iommu
  • kasan
  • kmsan
  • xenhvm
  • suspend/resume
  • pti
  • pmap_large_* (only used by nvdimm)

All of these could be supported together with larger pages; we just don't use them here at Netflix, so I didn't do the work to make them co-exist.

One obvious optimization that is missing in this branch is to use less than a full vm_page_t page to store a page table page. I intend to implement this before the feature is merged upstream; it just has not been a priority for us, and it should not hold up review of the rest of the code.

Also note that enabling invlpgb in this branch causes the kernel to crash very early in boot on CPUs that support invlpgb, so invlpgb is disabled for now until I can figure out this bug.

There is one other bug still lurking in this branch: process anonymous memory becomes corrupted in some extremely rare circumstance. Typically it takes around 2 weeks of our production workload to trigger this corruption, and we have not found any way to reproduce the problem more quickly. I would welcome any help in figuring out this problem.

Any feedback on these changes would be greatly appreciated.

Change code dealing with page table pages from manipulating vm_page_t directly
to using a new ptpage_t abstraction to hide the implementation of a page table page.
Initially support PAGE_SIZE=4096, support for larger page sizes to come later.
This is work-in-progress.  It works pretty well in a bhyve VM and on
a physical box with an AMD CPU, but crashes while running tests on
an Intel CPU.
Use "options OS_PAGE_SHIFT=14" for a 16k-page kernel, for example.
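As a concrete (illustrative) kernel config fragment, only the OS_PAGE_SHIFT option comes from the text above; the ident name is made up:

```
include		GENERIC
ident		GENERIC-16K
options 	OS_PAGE_SHIFT=14	# 16k PAGE_SIZE (2^14)
```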
Fix the assertion in pmap_init() about kernel ptps being in the range that has ptpage_t structures.
When initializing the vm_page memattr mode for efirt pages,
if the page is already initialized then assert that the existing mode
is the same as the new mode we want to set for this efirt page.
This requires that efirt be able to tell when a vm_page structure has
been initialized already, but nothing was zeroing those structures,
so zero them now when we allocate them.
Fix pmap_advise() to check all PTEs of a vm_page rather than just the first.
More cleanup of comments and debug code.
Don't trunc_page() the va given to smp_masked_invlpg();
assert that the va is already aligned correctly.
Fix the stride for the TLB range invalidation "invlrng" IPI handlers.
The "base" argument to vfs_bio_bzero_buf() is the offset within
the buf, but when the page size is larger than the buf size,
the buf might not start at the beginning of its page.
Add the offset of the buf within the page to account for this.
In kmem_bootstrap_free() we round the start and end of the range to free,
to avoid freeing unrelated records that might share the first or last pages
of the range we are freeing.  This rounding can result in a range
of zero or negative size (though negative becomes large positive
because the types are unsigned).  In this case there is nothing that
can actually be freed, so just return early.
This value is where PIE executables are mapped when ASLR is disabled,
so it needs to be a multiple of PAGE_SIZE for the mappings to work right.
Update the larger-pages version of pte_load_datapg() to:
 - assert that the PG_FRAME bits describe consecutive 4k pages.
 - assert that all bits other than PG_FRAME and PG_M and PG_A
   are the same in each pte.
 - merge the PG_M and PG_A bits by or'ing together the values from
   all the ptes.

Use pte_load_datapg() in pmap_page_test_mappings() and pmap_ts_referenced()
so that the PG_M and PG_A bits from all PTEs are detected properly.
Use pte_load_datapg() in pmap_page_wired_mappings(), mainly for the assertions
that bits other than PG_M and PG_A (such as PG_W) should match between the PTEs.
@github-actions

Thank you for taking the time to contribute to FreeBSD!
There is an issue that needs to be fixed:

Please review CONTRIBUTING.md, then update and push your branch again.
