
@chucksilvers

NOTE: this pull request is only to make these changes available for review, I don't intend to merge them in their current state. Also it is probably not worthwhile to examine the individual commits, only the cumulative change, since the individual commits contain a lot of noise that I will rebase away before submitting this for real.

This branch adds support for defining PAGE_SIZE on amd64 to values larger than the base x86 hardware page size of 4k. This reduces CPU consumption for some workloads; in particular, a 16k-page kernel uses about 12% fewer CPU cycles for the Netflix streaming-video workload than a traditional 4k-page kernel. This is accomplished by adding an abstraction layer that makes PTE access and TLB invalidation (mostly) independent of the kernel's definition of PAGE_SIZE, using new "data page" ("datapg") terminology for mappings of whole vm_page_t's, and by defining page table pages ("ptpage_t") as a separate type from the VM system's vm_page_t.

Two implementations of this new abstraction layer are provided, one where PAGE_SIZE equals the hardware 4k page size and another where PAGE_SIZE can be larger than 4k. For the PAGE_SIZE=4096 implementation, ptpage_t is implemented as the existing vm_page_t, and the new pte_datapg functions are implemented as the existing pte functions, so basically everything works exactly the same way as in the existing code. For the larger-pages version, multi-PTE datapg mappings are handled by looping over the individual PTEs as needed.

Not all features of the existing code are supported yet for larger-page kernels, notably these:

  • la57
  • nested page tables
  • iommu
  • kasan
  • kmsan
  • xenhvm
  • suspend/resume
  • pti
  • pmap_large_* (only used by nvdimm)

All of these could be supported together with larger pages; we just don't use them here at Netflix, so I didn't do the work to make them co-exist.

One obvious optimization that is missing in this branch is to use less than a full vm_page_t page to store a page table page. I intend to implement this before the feature is merged upstream; it just has not been a priority for us, and it should not hold up review of the rest of the code.

Also note that enabling invlpgb in this branch causes the kernel to crash very early in boot on CPUs that support invlpgb, so invlpgb is disabled for now until I can figure out this bug.

There is one other bug still lurking in this branch: process anonymous memory becomes corrupted in some extremely rare circumstance. Typically it takes around 2 weeks of our production workload to trigger this corruption, and we have not found any way to reproduce the problem more quickly. I would welcome any help in figuring out this problem.

Any feedback on these changes would be greatly appreciated.

Change code dealing with page table pages from manipulating vm_page_t directly
to using a new ptpage_t abstraction to hide the implementation of a page table page.
Initially support PAGE_SIZE=4096, support for larger page sizes to come later.
This is work-in-progress.  It works pretty well in a bhyve VM and on
a physical box with an AMD CPU, but crashes while running tests on
an Intel CPU.
Use "options OS_PAGE_SHIFT=14" for a 16k-page kernel, for example.
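As a concrete (illustrative) kernel config fragment, only the OS_PAGE_SHIFT option comes from the text above; the ident name is made up:

```
include		GENERIC
ident		GENERIC-16K
options 	OS_PAGE_SHIFT=14	# 16k PAGE_SIZE (2^14)
```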
Fix the assertion in pmap_init() about kernel ptps being in the range that has ptpage_t structures.
When initializing the vm_page memattr mode for efirt pages,
if the page is already initialized then assert that the existing mode
is the same as the new mode we want to set for this efirt page.
This requires that efirt be able to tell when a vm_page structure has
been initialized already, but nothing was zeroing those structures,
so zero them now when we allocate them.
Fix pmap_advise() to check all PTEs of a vm_page rather than just the first.
More cleanup of comments and debug code.
Don't trunc_page() the va given to smp_masked_invlpg();
assert that the va is already aligned correctly.
Fix the stride for the TLB range invalidation "invlrng" IPI handlers.
The "base" argument to vfs_bio_bzero_buf() is the offset within
the buf, but when the page size is larger than the buf size,
the buf might not start at the beginning of its page.
Add the offset of the buf within the page to account for this.
In kmem_bootstrap_free() we round the start and end of the range to free,
to avoid freeing unrelated records that might share the first or last pages
of the range we are freeing.  This rounding can result in a range
of zero or negative size (though negative becomes large positive
because the types are unsigned).  In this case there is nothing that
can actually be freed, so just return early.
This value is where PIE executables are mapped when ASLR is disabled,
so it needs to be a multiple of PAGE_SIZE for the mappings to work right.
Update the larger-pages version of pte_load_datapg() to:
 - assert that the PG_FRAME bits describe consecutive 4k pages.
 - assert that all bits other than PG_FRAME and PG_M and PG_A
   are the same in each pte.
 - merge the PG_M and PG_A bits by or'ing together the values from
   all the ptes.

Use pte_load_datapg() in pmap_page_test_mappings() and pmap_ts_referenced()
so that the PG_M and PG_A bits from all PTEs are detected properly.
Use pte_load_datapg() in pmap_page_wired_mappings(), mainly for the assertions
that bits other than PG_M and PG_A (such as PG_W) should match between the PTEs.
@github-actions

Thank you for taking the time to contribute to FreeBSD!
There is an issue that needs to be fixed:

Please review CONTRIBUTING.md, then update and push your branch again.
