In the early days of Intel- and AMD-based 32-bit personal computers and servers, machines often had less than 16MB of memory. Nowadays, high-end servers used for high-performance computing and big data applications may have multiple terabytes of memory. While the total amount of memory available has increased by up to a factor of a million, current operating systems still manage memory with the same 4KB granularity used in those early days. This makes memory management a significant source of overhead for many workloads, since all of this memory must be mapped through very large page tables that the OS has to maintain. Another source of overhead is that using 4KB pages to map large amounts of memory increases the chance of TLB misses (https://lwn.net/Articles/379748/).
This project aims to evaluate the potential performance benefits of using a larger page size supported by the x86-64 architecture (ideally 2MB) as the default allocation unit for managing memory on systems with very large amounts of memory. It is NOT a solution for improving performance on every system.
Default page sizes other than 4KB are already in use on platforms such as Alpha, ARM64 and IA-64, so bugs caused by application, file system or device driver developers assuming that memory pages are always 4KB have probably already been identified and fixed. This does not mean, however, that adding support for a different default page size to x86-64 Linux is any less daunting: problems already fixed in code specific to those other platforms may still be present in x86-64 specific software components.
While some gain in execution performance is almost certain (https://www.kernel.org/doc/Documentation/vm/transhuge.txt), the exact numbers are difficult to determine without testing an actual implementation with real workloads. Expected benefits include smaller page tables to be managed and kept in memory, faster memory mapping operations, faster process forking, fewer TLB misses during code execution and data access, and a reduction in the number of page table levels from 4 to 3 (resulting in ~25% faster handling of the TLB misses that still occur), among others.
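As a rough back-of-the-envelope illustration (the 1TB figure and the focus on the last page table level are assumptions chosen for the example, not measurements), the expected page table savings look like this:

    # Back-of-the-envelope estimate of the last-level page table size needed
    # to map a given amount of memory with 4KB vs. 2MB pages on x86-64.
    # Upper-level tables add a little more, but the last level dominates.

    PTE_SIZE = 8                      # bytes per page table entry on x86-64

    def last_level_table_bytes(mapped_bytes, page_size):
        return (mapped_bytes // page_size) * PTE_SIZE

    mapped = 1 << 40                  # 1TB of mapped memory (illustrative)
    print("4KB pages:", last_level_table_bytes(mapped, 4 << 10) >> 20, "MB of PTEs")
    print("2MB pages:", last_level_table_bytes(mapped, 2 << 20) >> 10, "KB of PMD entries")
    # -> roughly 2048MB of PTEs vs. 4096KB of PMD entries: a 512x reduction.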
There are also trade-offs, such as higher memory consumption for small processes due to increased memory fragmentation (especially internal fragmentation). Copy-on-write operations will incur the cost of copying larger pages, and demand paging will read larger amounts of binary code, not all of which may actually be executed by the corresponding process.
Systems with huge amounts of memory (especially those with persistent byte-addressable memory) might want to use the new DAX enhancements (https://www.kernel.org/doc/Documentation/filesystems/dax.txt), which require the file system block size to be equal to the kernel's default page size. File systems stored in byte-addressable memory (either volatile or non-volatile) will therefore need to use 2MB allocation units in order to use DAX.
With so many unknowns, the first task is to get better estimates of the potential gains and trade-offs. The lowest hanging fruit (and suggested first step) is to estimate the actual impact on memory fragmentation, which can be calculated by taking snapshots of all the memory segments present on a system running typical workloads. Every segment created with 4KB pages would have its beginning and ending addresses rounded to 2MB alignment, and the total padding inserted would tell us how much memory would be wasted due to increased internal fragmentation (see the sketch below). We should also compute the reduction in the size of the page tables, which will counterbalance part of the losses from fragmentation.
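A minimal sketch of such an estimate, assuming the snapshot is taken from /proc/<pid>/maps and that every mapping is rounded independently (shared mappings are counted once per process, so the result overestimates the waste):

    #!/usr/bin/env python3
    # Round every mapping in /proc/<pid>/maps out to 2MB boundaries and sum
    # the padding that would be introduced.  Run as root to cover all
    # processes; mappings of processes that exit mid-scan are skipped.
    import glob

    HUGE = 2 << 20   # 2MB

    def align_down(x): return x & ~(HUGE - 1)
    def align_up(x):   return (x + HUGE - 1) & ~(HUGE - 1)

    total_mapped = total_padding = 0
    for maps in glob.glob('/proc/[0-9]*/maps'):
        try:
            with open(maps) as f:
                for line in f:
                    start, end = (int(a, 16) for a in line.split()[0].split('-'))
                    total_mapped  += end - start
                    total_padding += (align_up(end) - align_down(start)) - (end - start)
        except OSError:
            continue

    print(f"mapped:  {total_mapped >> 20} MB")
    print(f"padding: {total_padding >> 20} MB extra if every mapping were 2MB aligned")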
Once we are confident that the increase in memory consumption lies within acceptable limits, we can start to evaluate performance through a basic implementation of an all-huge-page kernel. The kernel already contains most of the low-level code needed to manipulate huge pages, and making it default to 2MB allocation granularity should be reasonably straightforward. However, this could have unpredictable adverse side effects in other parts of the x86-64 platform-specific code (or even in the platform-independent code, though that is less likely).
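As a small sanity check on such a hypothetical build, the page size the kernel exposes to userspace can be queried through existing interfaces; an all-huge-page kernel would be expected to report 2097152 (2MB) instead of the usual 4096:

    import os, mmap

    # The page size seen by userspace comes from the kernel (sysconf / the
    # ELF auxiliary vector).  Today both lines print 4096 on x86-64.
    print(os.sysconf('SC_PAGE_SIZE'))
    print(mmap.PAGESIZE)

    # /proc/meminfo already reports the huge page size supported today.
    with open('/proc/meminfo') as f:
        print(next((l.strip() for l in f if l.startswith('Hugepagesize')), 'n/a'))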
Getting the 2MB default page implementation correct might not be enough to successfully boot Linux. Any executable that requires a memory segment to be created at a specific address that is not 2MB aligned will potentially (if not certainly) crash. If address space layout randomization (https://pax.grsecurity.net/docs/aslr.txt) is enabled in the kernel, it will probably be running only position-independent executables, which should avoid the problem. However, if there are executables that require segments at fixed addresses not aligned to 2MB (and recompiling them is not possible), we may need to modify the kernel to add padding before these segments during their creation to make them 2MB aligned (the impact of this throughout the kernel has yet to be investigated: demand paging? swapping? Do all these procedures need to become aware that the segment was not originally 2MB aligned?).
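To estimate how common such executables are on a given system, their ELF program headers can be inspected. The following is an illustrative sketch (it assumes little-endian ELF64 binaries and only flags fixed-address, non-PIE executables whose loadable segments are not 2MB aligned):

    #!/usr/bin/env python3
    # Flag fixed-address (non-PIE) ELF64 executables whose PT_LOAD segments
    # are not 2MB aligned.  Position-independent executables can simply be
    # relocated by the kernel, so only ET_EXEC binaries are checked.
    import struct, sys

    HUGE = 2 << 20          # 2MB
    ET_EXEC, PT_LOAD = 2, 1

    def check(path):
        with open(path, 'rb') as f:
            ident = f.read(16)
            if ident[:4] != b'\x7fELF' or ident[4] != 2 or ident[5] != 1:
                return                        # not a little-endian ELF64 file
            e_type, = struct.unpack('<H', f.read(2))
            if e_type != ET_EXEC:
                return                        # PIE / shared object: relocatable
            f.seek(0x20); e_phoff, = struct.unpack('<Q', f.read(8))
            f.seek(0x36); e_phentsize, e_phnum = struct.unpack('<HH', f.read(4))
            for i in range(e_phnum):
                base = e_phoff + i * e_phentsize
                f.seek(base); p_type, = struct.unpack('<I', f.read(4))
                if p_type != PT_LOAD:
                    continue
                f.seek(base + 0x10); p_vaddr, = struct.unpack('<Q', f.read(8))
                if p_vaddr % HUGE:
                    print(f"{path}: PT_LOAD segment at {p_vaddr:#x} is not 2MB aligned")

    for path in sys.argv[1:]:
        check(path)

Running this over the binaries in /usr/bin gives a quick idea of how many installed executables would need padding or recompilation.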
Don't transparent huge pages (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html) already provide a good enough solution?
THP is a feature that causes some debate. Transparent huge pages hide the complexity of using larger pages from developers and system administrators, but the implementation also incurs additional overhead due to the continuous scanning for pages that could be merged into larger pages (the khugepaged kernel thread) and the splitting of huge pages back into smaller pages in certain situations (such as copy-on-write). THP also makes the kernel code more complex. Moreover, THP is not well suited to database workloads and currently only maps anonymous memory regions such as heap and stack space. THP is currently disabled by default in order to avoid the risk of increasing the memory footprint of applications without a guaranteed benefit.
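For reference, the current THP mode and the amount of anonymous memory actually backed by transparent huge pages can be read from existing kernel interfaces:

    # Current THP mode, e.g. "always [madvise] never", and how much anonymous
    # memory is backed by transparent huge pages right now.
    with open('/sys/kernel/mm/transparent_hugepage/enabled') as f:
        print('THP mode:', f.read().strip())

    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('AnonHugePages'):
                print(line.strip())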
This can only be determined precisely by performing tests with typical workloads on real hardware. Indeed, on modern chips such as Intel's Kaby Lake (https://en.wikichip.org/wiki/intel/microarchitectures/kaby_lake), the instruction TLB has 128 entries for 4KB pages versus only 8 entries for 2MB pages (the 4KB TLB is 16 times larger than the 2MB TLB!). For the data TLB, there are 64 entries for 4KB pages versus 32 entries for 2MB pages. While there are more entries for 4KB pages, each 2MB page replaces 512 4KB pages, so the overall number of TLB misses might still be reduced, even with the higher pressure on the 2MB TLBs. We also need to take into account that page tables describing 2MB mappings use only 3 page table levels, resulting in faster handling of each TLB miss.
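Using the Kaby Lake figures above, the address space each first-level TLB can cover without missing (its "reach") can be compared directly; this simple arithmetic ignores associativity and the shared second-level TLB:

    # TLB reach for the Kaby Lake first-level TLB sizes quoted above.
    KB, MB = 1 << 10, 1 << 20

    itlb_4k = 128 * 4 * KB    # 512KB of code reachable via 4KB entries
    itlb_2m =   8 * 2 * MB    #  16MB of code reachable via 2MB entries
    dtlb_4k =  64 * 4 * KB    # 256KB of data reachable via 4KB entries
    dtlb_2m =  32 * 2 * MB    #  64MB of data reachable via 2MB entries

    print(f"ITLB reach: {itlb_4k // KB} KB (4KB pages) vs {itlb_2m // MB} MB (2MB pages)")
    print(f"DTLB reach: {dtlb_4k // KB} KB (4KB pages) vs {dtlb_2m // MB} MB (2MB pages)")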
That is true. The TLBs are certainly among the most expensive components of a CPU. However, if the resulting overall system performance actually increases, why would anyone care? Moreover, processors evolve each year: if using 2MB pages as the minimum memory allocation granularity proves to be worthwhile, future x86-64 processors may include flags in their control registers to reconfigure or repurpose the 4KB TLBs to handle 2MB pages (and then we would see even larger leaps in performance).
How does this effort relate to the 5-level page tables coming to Intel's latest CPUs and to Linux? (https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf)
The new 5-level page tables will enable newer Intel CPUs to address up to 4 petabytes of physical memory and manage up to 128PB of virtual memory. This ensures that the trend towards larger amounts of memory on newer systems will continue. The downside is that an additional page table level also increases the cost of TLB misses and the size of the page tables that have to be managed. Increasing the default page size to 2MB will make even more sense on such systems, bringing the number of page table levels back down to four and significantly reducing the size of the page tables that the OS has to maintain to map all this memory. Support for 5-level page tables on Linux is already progressing; see http://lkml.iu.edu/hypermail/linux/kernel/1612.1/00383.html.
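To make the "one fewer level" point concrete, the sketch below shows how a 57-bit virtual address is split across page table levels with 4KB pages under 5-level paging versus 2MB pages (which drop the final PTE level):

    # Split of a 57-bit virtual address across page table levels: five 9-bit
    # indices + 12-bit offset with 4KB pages, versus four 9-bit indices +
    # 21-bit offset with 2MB pages.
    def split(vaddr, levels, offset_bits):
        parts = [vaddr & ((1 << offset_bits) - 1)]    # page offset
        vaddr >>= offset_bits
        for _ in range(levels):                       # one 9-bit index per level
            parts.append(vaddr & 0x1ff)
            vaddr >>= 9
        return list(reversed(parts))

    addr = 0x01ff8000deadbeef                         # arbitrary example address
    print("4KB pages, 5 levels:", split(addr, 5, 12)) # PGD, P4D, PUD, PMD, PTE, offset
    print("2MB pages, 4 levels:", split(addr, 4, 21)) # PGD, P4D, PUD, PMD, offset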
P. Weisberg and Y. Wiseman at Bar-Ilan University (Israel) investigated this topic in 2009; see "Using 4KB page size for Virtual Memory is obsolete" (https://ieeexplore.ieee.org/document/5211562). They concluded that 16KB would have been a better default size at that time, but memory sizes have continued to grow since then. Also, we are not targeting the average Linux machine, but systems containing huge amounts of memory used for specific, memory-intensive workloads.