add support for larger software page sizes on amd64 #1852
Open
chucksilvers wants to merge 33 commits into freebsd:main from chucksilvers:chs-larger-pages
Change code dealing with page table pages from manipulating vm_page_t directly to using a new ptpage_t abstraction to hide the implementation of a page table page. Initially support PAGE_SIZE=4096, support for larger page sizes to come later.
This is a work in progress. It works pretty well in a bhyve VM and on a physical box with an AMD CPU, but crashes while running tests on an Intel CPU.
For example, use "options OS_PAGE_SHIFT=14" to build a 16k-page kernel.
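As a rough illustration of the arithmetic behind that option, the sketch below assumes that OS_PAGE_SHIFT is the base-2 logarithm of the kernel's PAGE_SIZE and that each hardware PTE still maps 4k. HW_PAGE_SHIFT and both function names are hypothetical and are not taken from the branch.

```c
#include <assert.h>
#include <stddef.h>

#define HW_PAGE_SHIFT 12	/* x86 base hardware page size is 4k */

/*
 * Software page size in bytes for a given shift.
 * Assumption: the shift is log2 of the kernel's PAGE_SIZE.
 */
static size_t
os_page_size(unsigned os_page_shift)
{
	assert(os_page_shift >= HW_PAGE_SHIFT);
	return ((size_t)1 << os_page_shift);
}

/* Number of 4k hardware PTEs needed to map one software page. */
static size_t
ptes_per_os_page(unsigned os_page_shift)
{
	assert(os_page_shift >= HW_PAGE_SHIFT);
	return ((size_t)1 << (os_page_shift - HW_PAGE_SHIFT));
}
```

Under these assumptions a shift of 14 gives a 16384-byte page backed by four 4k PTEs, and a shift of 12 degenerates to the traditional one-PTE case.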
fix the assertion in pmap_init() about kernel ptps being in the range to have pgpage_t structures.
When initializing the vm_page memattr mode for efirt pages, if the page is already initialized then assert that the existing mode is the same as the new mode we want to set for this efirt page. This requires that efirt be able to tell when a vm_page structure has been initialized already, but nothing was zeroing those structures, so zero them now when we allocate them.
fix pmap_advise() to check all ptes of a vm_page rather than just the first. more cleanup of comments and debug code.
don't trunc_page() the va given to smp_masked_invlpg(). assert that the va is already aligned correctly. fix stride for TLB range invalidation "invlrng" IPI handlers.
The "base" argument to vfs_bio_bzero_buf() is the offset within the buf, but when the page size is larger than the buf size then the buf might not start at the beginning of its page. Add the offset of the buf within the page to account for this.
in kmem_bootstrap_free() we round the start and end of the range to free, to avoid freeing unrelated records that might share the first or last pages of the range we are freeing. this rounding can result in a range that is zero or negative size (though negative becomes large positive because the types are unsigned). in this case there is nothing that can actually be freed, so just return early.
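The early return can be pictured with a small user-space sketch. It assumes the start of the range is rounded up and the end is rounded down to a page boundary; the function name and signature are hypothetical and this is not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Trim [start, end) inward to whole pages.  Returns the number of bytes
 * that can be freed and stores the trimmed bounds, or returns 0 (leaving
 * the outputs untouched) when no whole page lies inside the range.
 */
static uintptr_t
trim_to_whole_pages(uintptr_t start, uintptr_t end, uintptr_t page_size,
    uintptr_t *startp, uintptr_t *endp)
{
	/* page_size must be a nonzero power of two for the masks to work. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	start = (start + page_size - 1) & ~(page_size - 1);	/* round up */
	end = end & ~(page_size - 1);				/* round down */

	/*
	 * With unsigned types, end - start would wrap to a huge value if
	 * start > end, so compare before subtracting.
	 */
	if (start >= end)
		return (0);

	*startp = start;
	*endp = end;
	return (end - start);
}
```

With a 16k page size, a range such as 0x1100..0x1f00 lies entirely inside one page shared with other records, so the trimmed range is empty and nothing is freed.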
This value is where PIE executables are mapped when ASLR is disabled, so it needs to be a multiple of PAGE_SIZE for the mappings to work right.
update the larger-pages version of pte_load_datapg() to
- assert that the PG_FRAME bits describe consecutive 4k pages.
- assert that all bits other than PG_FRAME and PG_M and PG_A are the same in each pte.
- merge the PG_M and PG_A bits by or'ing together the values from all the ptes.

use pte_load_datapg() in pmap_page_test_mappings() and pmap_ts_referenced() so that PG_M and PG_A bits from all ptes are detected properly.

use pte_load_datapg() in pmap_page_wired_mappings() mainly for the assertions that bits other than PG_M and PG_A (such as PG_W) should match between the ptes.
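The three rules in that commit message can be sketched in user-space C as follows. This is not the code from the branch: the function name, its signature, and HW_PAGE_SIZE are hypothetical, and plain array reads stand in for whatever atomic loads the kernel uses. The bit values for PG_A, PG_M and PG_FRAME follow the x86-64 PTE layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pt_entry_t;

#define PG_A		0x020ULL		/* accessed, set by hardware */
#define PG_M		0x040ULL		/* modified, set by hardware */
#define PG_FRAME	0x000ffffffffff000ULL	/* physical frame bits */
#define HW_PAGE_SIZE	4096ULL			/* size mapped by one PTE */

/*
 * Load the nptes PTEs that together map one software page and return a
 * single value: the first PTE with PG_A and PG_M merged from all of them.
 */
static pt_entry_t
load_datapg_sketch(const pt_entry_t *ptes, size_t nptes)
{
	pt_entry_t first, merged, pte;
	size_t i;

	assert(nptes > 0);
	first = ptes[0];
	merged = first;
	for (i = 1; i < nptes; i++) {
		pte = ptes[i];
		/* The frames must describe consecutive 4k pages. */
		assert((pte & PG_FRAME) ==
		    (first & PG_FRAME) + i * HW_PAGE_SIZE);
		/* Everything except frame, PG_M and PG_A must match. */
		assert(((pte ^ first) & ~(PG_FRAME | PG_M | PG_A)) == 0);
		/* Hardware may set A/M in any one PTE, so OR them together. */
		merged |= pte & (PG_M | PG_A);
	}
	return (merged);
}
```

Returning the first PTE's frame together with the merged bits lets callers treat the result like a single PTE for the whole software page, which is presumably why the referenced/modified checks benefit from going through one helper.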
Thank you for taking the time to contribute to FreeBSD!
Please review CONTRIBUTING.md, then update and push your branch again.
NOTE: this pull request is only to make these changes available for review, I don't intend to merge them in their current state. Also it is probably not worthwhile to examine the individual commits, only the cumulative change, since the individual commits contain a lot of noise that I will rebase away before submitting this for real.
This branch adds support for defining PAGE_SIZE on amd64 to values larger than the base x86 hardware page size of 4k. This has the benefit of reduced CPU consumption for some workloads; in particular, a 16k-page kernel uses about 12% fewer CPU cycles for the Netflix streaming-video workload than a traditional 4k-page kernel. This is accomplished by adding a layer abstracting PTE access and TLB invalidation to be (mostly) independent of the kernel's definition of PAGE_SIZE using new "data page" ("datapg") terminology for mappings of whole vm_page_t's, and defining page table pages ("ptpage_t") to be a separate type from the VM system's "vm_page_t".
Two implementations of this new abstraction layer are provided, one where PAGE_SIZE equals the hardware 4k page size and another where PAGE_SIZE can be larger than 4k. For the PAGE_SIZE=4096 implementation, ptpage_t is implemented as the existing vm_page_t, and the new pte_datapg functions are implemented as the existing pte functions, so basically everything works exactly the same way as in the existing code. For the larger-pages version, multi-PTE datapg mappings are handled by looping over the individual PTEs as needed.
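To make "looping over the individual PTEs" concrete, here is a minimal user-space sketch of installing one multi-PTE data page mapping. The function name, its signature, and HW_PAGE_SIZE are hypothetical, and a real pmap would also need atomic stores and TLB invalidation, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pt_entry_t;

#define PG_FRAME	0x000ffffffffff000ULL	/* physical frame bits */
#define HW_PAGE_SIZE	4096ULL			/* size mapped by one PTE */

/*
 * Map one software page of nptes * 4k bytes at physical address pa by
 * filling nptes consecutive PTEs, all with the same attribute bits.
 */
static void
store_datapg_sketch(pt_entry_t *ptes, size_t nptes, uint64_t pa,
    pt_entry_t attrs)
{
	size_t i;

	/* pa must be aligned to the software page size. */
	assert(nptes > 0 && pa % (nptes * HW_PAGE_SIZE) == 0);
	/* Attribute bits must not overlap the frame field. */
	assert((attrs & PG_FRAME) == 0);

	for (i = 0; i < nptes; i++)
		ptes[i] = ((pa + i * HW_PAGE_SIZE) & PG_FRAME) | attrs;
}
```

For a 16k-page kernel nptes would be 4, and with nptes equal to 1 the loop collapses to a single store, which mirrors the description of the PAGE_SIZE=4096 case behaving like the existing code.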
Not all features of the existing code are supported yet for larger-page kernels, notably these:
All of these could be supported together with larger pages, we just don't use these here at Netflix so I didn't do the work to make them co-exist.
One obvious optimization that is missing in this branch is to use less than a full vm_page_t page to store a page table page. I intend to implement this before this feature is merged upstream, it just has not been a priority for us, and should not hold up reviewing the rest of the code.
Also note that enabling invlpgb in this branch causes the kernel to crash very early in boot on CPUs which support invlpgb, so we just have invlpgb disabled for now until I can figure out this bug.
There is one other bug still lurking in this branch, which is that process anonymous memory becomes corrupted in some extremely rare circumstance. Typically it takes around 2 weeks of our production workload to trigger this corruption, and we have not found any way to reproduce the problem more quickly. I'm also looking for any help in figuring out this problem.
Any feedback on these changes would be greatly appreciated.