Skip to content

Conversation

@kg
Copy link
Member

@kg kg commented May 4, 2024

Introduction

The biggest hotspot in many of my startup measurements is memset. Out of memset's various callers in the runtime, the main culprits here are sgen_alloc_os_memory and sgen_alloc_os_memory_aligned, and memset is being performed on their behalf by the mmap implementation in emscripten's libc:

image

emscripten libc implements mmap/munmap as a broken adapter on top of malloc (not calloc), which means it has no choice but to invoke memset on every allocation in order to provide properly zeroed bytes for our allocation requests. this is theoretically fixable on the emscripten end (I'm still looking into it, but it's not trivial), but in the process of testing my theory that the memset isn't necessary, I noticed some other problems and decided to try and put together a complete implementation of mmap/munmap for our needs that can skip calling memset. The implementation is controlled by a runtime option, so you can turn it off to get emscripten mmap instead.

Some more details in no particular order:

  • as mentioned, emscripten mmap has to memset() to zero every allocation, even if it's already zero and even if the caller doesn't need it to be zeroed. Since it's a shim over malloc, it could theoretically use calloc (which allocates zeroed memory), but...
  • emscripten offers two production-quality allocators, dlmalloc and mimalloc. both of them have calloc implementations that could theoretically skip memset, but due to how emscripten configures them, that optimization doesn't engage so they still call memset. fixing this in mimalloc looks possible (but nontrivial), but dlmalloc's version of the optimization requires access to mmap (which it can't use, since mmap is implemented using malloc here).
  • emscripten's version of sbrk works by growing the webassembly memory, and that produces pre-zeroed pages. so memory allocated via sbrk is pre-zeroed, we shouldn't call memset. emscripten's dlmalloc and mimalloc are configured to get their memory from sbrk.
  • according to spec, munmap should attempt to unmap any pages in the specified region, and silently fail if the region includes pages that aren't mapped. emscripten's version of munmap fails if you attempt to do a partial unmap or unmap multiple mappings in one call. it also locates mappings by doing a full scan of an unordered list of all mappings, so the cost of munmap goes up as you allocate more memory. my testing shows that we have code that expects munmap to behave correctly, and it's just not checking for the error code emscripten is returning.
  • according to spec, mmap and munmap operate on whole pages, and in some cases should fail if passed addresses that aren't page-aligned. emscripten's implementation ignores the existence of pages entirely, so you can ask for 96 bytes and get back 96 bytes. (it does try to align the starting address, though!)
  • sgen_init_internal_allocator has a bunch of safety checks that haven't been running on wasm, because wasm's page size is so big that the termination condition on the loop is met immediately. If you enable the checks, the assertions fail. It seems like this is harmless, but I'm not certain.
  • the lock-free allocator has a hard limit of 16KB for its allocations, and those allocations are serviced via mmap, which means each time it allocates a block on wasm it throws 3/4 of a page away (in theory). it seems to work if I raise the limit to 64KB, but only with a real implementation of mmap - a limit of 64KB with emscripten's mmap causes crashes.
  • major_free_swept_blocks in sgen is hard-coded to not free pages on wasm. it appears this is because we get random crashes when freeing pages with emscripten's munmap; with my implementation we can turn freeing pages back on without issues. (I don't know if we actually want to.)
  • our aligned valloc implementation on wasm adds some overhead and memory waste, but with a real mmap that provides page size and alignment guarantees, we can usually skip running the pad-and-align logic and just allocate a page directly with knowledge that it will be aligned.
  • emscripten sbrk doesn't return aligned addresses or round allocation sizes up to the wasm page size, so it's not possible to use it to allocate a single page directly; we have to pad our allocations as a result and try to clean up the waste somehow.

Design overview

mono_wasm_page_manager (mwpm) is an extremely simple mmap implementation that has a statically-allocated page table for the whole 32-bit wasm address space. it manages fixed-size pages that match the native WASM page size (64KB), though it can be configured for smaller pages (my testing shows no real advantage from doing this.)

to service allocations, mwpm uses sbrk to grab large chunks of pre-zeroed memory, and updates its page table to record that those pages are currently free and zeroed. as a result when a caller asks for zeroed memory, there is no need to call memset to zero it. pages controlled by other callers of sbrk remain as dead zones in the page table and will not be used. in order to get aligned pages out of sbrk, it allocates an extra page worth of memory and discards the excess on either side of the allocated pages. if mwpm discovers that it has called sbrk twice in a row without another allocator (i.e. dlmalloc) getting to it, it welds the two allocations together, recovering most of the alignment waste bytes that were lost due to sbrk's lack of alignment.

Finding available pages to service allocations is done with a naive linear scan - I tested more complex strategies and they produced worse fragmentation, though it's possible I missed the perfect solution. the first suitable spot for an allocation is the one that gets used.

When pages are unmapped, they are returned to the page table in a 'free and dirty' state. Requests to map zeroed memory will eagerly use this memory but call memset to prepare it before returning it, while if the caller doesn't request zeroed memory we will skip the memset.

Performance

From an algorithmic perspective this is a bad implementation, but in testing the performance is on par with emscripten's mmap.

S.T.J.Tests finishes with a memory size of 1.409GB using emscripten libc, and a memory size of 1.410GB using mwpm. mwpm finishes the test run with 76.6% of its allocated pages in use, and enough space to allocate 461 sequential pages, so fragmentation is mostly under control. the mwpm version of the test run completed in 3:09, vs emscripten at 2:57, so it's not catastrophically slower (if it's slower at all... I still have to figure out a good way to do realistic benchmarking.) As one would expect, the memset startup bottleneck isn't present with mwpm active.

We could probably optimize page table searches and modifications using WASM SIMD once it's enabled for the runtime.

TODOs

  • Locking for multithreaded builds
  • Wire up a way for mono to communicate that it doesn't actually need zeroed memory when allocating pages
  • Instead of memsetting the entire allocation, memset only the pages that need to be zeroed
  • Find a way to make LOCK_FREE_ALLOC_SB_MAX_SIZE dynamic based on whether mwpm is active
  • Use a more complex page table representation so we don't ever have to scan pages we don't control
  • Realistic benchmark measurements other than 'is startup faster' and 'did tests get way slower'
  • Separate arenas of some sort for 'small' (1-2 page?) and 'large' allocations to control fragmentation?

Related emscripten-core/emscripten#21620

@kg kg added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) arch-wasm WebAssembly architecture labels May 4, 2024
@ghost ghost added the area-VM-meta-mono label May 4, 2024
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

@kg kg marked this pull request as ready for review May 7, 2024 15:06
@kg kg requested review from lambdageek and steveisok as code owners May 7, 2024 15:06
@kg kg requested a review from BrzVlad as a code owner May 7, 2024 15:06
@kg
Copy link
Member Author

kg commented May 7, 2024

There's more to be done to improve on this, but it feels mergeable so it would be great to at least get initial feedback.

@lambdageek lambdageek requested a review from pavelsavara May 7, 2024 15:08
Copy link
Member

@lambdageek lambdageek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good

@lambdageek
Copy link
Member

Should we turn this off for threaded wasm? Is it already off for threaded wasm? Is it off for WASI?

@kg
Copy link
Member Author

kg commented May 7, 2024

Should we turn this off for threaded wasm? Is it already off for threaded wasm? Is it off for WASI?

Our current implementation is actually subtly broken for threaded wasm. I'm open to disabling this for MT anyway, but it should be thread-safe thanks to the mutex. It should be enabled for WASI I think.

@kg kg removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label May 7, 2024
@kg
Copy link
Member Author

kg commented May 8, 2024

After thinking some more about the skip structure I realized it has a failure mode if you perform partial unmappings.

For full unmapping, it returns the page table into a fully correct state, since all of the skip pages created by the original mapping get reset. But if you unmap the middle of an allocation you can end up with left-over skip pages that jump past free space, like so:

| alloced | free |
a98765432100000000

|
v

|alloc| free     |
a98765400000000000

It's also the case that if you have a big section of free blocks with skip data, you need to allocate from the front of it for the skip data to stay valid - in the same way otherwise you could end up skipping over free space because you haven't updated the preceding skip data.

I'm not sure how to fix this yet - maybe since skip values only go up to 64, I can just recalculate the skip values for the preceding 64 pages any time I modify pages? EDIT: That is what I did

Address PR feedback
kg added 2 commits May 8, 2024 11:35
After freeing pages, repair the skip values for the preceding 64 pages so they don't erroneously skip over newly free pages
Copy link
Member

@lambdageek lambdageek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Although we should make a follow-up work item to figure out why MT was unhappy

@kg kg merged commit 6b03ebd into dotnet:main May 10, 2024
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
…otnet#101871)

* emscripten libc implements mmap/munmap as a broken adapter on top of malloc (not calloc), which means it has no choice but to invoke memset on every allocation in order to provide properly zeroed bytes for our allocation requests. this commit adds a custom mmap/munmap implementation that can skip zeroing already-zeroed pages
* re-enable freeing of pages in sgen on wasm if custom mmap is active
* add runtime option for custom mmap
* add warning switches to fix build on debian
@github-actions github-actions bot locked and limited conversation to collaborators Jun 9, 2024
@pavelsavara pavelsavara added this to the 9.0.0 milestone Oct 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

arch-wasm WebAssembly architecture area-VM-meta-mono

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants