Description
Feature
We'd like to propose an option in the WAMR AOT compiler that will reduce the size of the generated code while still keeping track of some information about the existing callstack.
Currently, when we compile our file (~6 MB, over 100k call instructions and ~15k functions) to AOT with call stack tracking enabled (the --enable-dump-call-stack option), the resulting file is 25 MB (and even larger for ARM). With call stack tracking disabled, the size is ~18 MB, but we can't turn it off in production because we need it to analyze crashes. Our experiments show that with our settings (no GC, no EH, which is not supported yet but will be in the future) we could generate a ~20 MB bundle while keeping the basic information (just function call stacks).
We're not targeting the GC, JIT and EH (once it's ready) modes, although the solution could potentially be extended to those features too (we didn't dig deep into the implementation).
Benefit
The main benefit of the feature is reduced AOT code size (perhaps at the cost of some additional features), and therefore lower memory and disk usage (or network usage if the bundle is downloaded).
Implementation
We considered two main approaches to implementing this functionality. We're leaning towards option 1 given the concerns around the other option, but we'd like to open the discussion to the community and hear other ideas.
Option 1: keep wrapping call instructions
Currently, stack traces are implemented by adding extra code around each call instruction: the frame is pushed onto the stack before the instruction and popped after the instruction has executed. This option re-uses the existing mechanism, but the code for pushing the frame is simplified and configurable.
Right now --enable-dump-call-stack generates code for:
- keeping track of the callstack, including:
  - checking if there's enough space to allocate a new frame
  - previous frame
  - function index
  - instruction offset
  - imported function's parameters
  - operand stack (GC only)
  - frame ref flags (GC only)
- performance info (only when profiling is enabled)
We'd like to introduce a new flag (the name is yet to be decided, perhaps something like --dump-call-stack-detail-level) which will let users define what is tracked as part of each call and the level of validation. Each frame will include the following information:
- function number (always present)
- instruction offset (optional, configured through wamrc flag)
For now we'll completely disable validation for stack overflows, but if required, this can be added as an option too in the future.
This feature will not allow any other information to be included; the --enable-dump-call-stack flag should be used instead if additional data is required.
To avoid additional instructions, we'll do the following:
- we won't populate the field for the previous frame; instead, each frame will have a known, fixed size, so the prev/next frame can be found by just subtracting or adding the frame size
- bounds-checking code will not be added
- populating instruction offset will be optionally enabled
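The fixed-size frame idea above can be illustrated with a minimal C sketch. This is not WAMR's actual `AOTFrame` layout or runtime code: the names `TinyFrame`, `TinyStack`, `tiny_push`, `tiny_pop` and `tiny_dump`, and the capacity of 64 frames, are all hypothetical and chosen only for illustration. The point is that no `prev` pointer is stored; the previous frame is found by stepping back one frame size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical compact frame; NOT WAMR's real AOTFrame layout. */
typedef struct {
    uint32_t func_idx;  /* always recorded */
    uint32_t ip_offset; /* would only be written when offsets are enabled */
} TinyFrame;

/* Hypothetical frame stack with an arbitrary, illustrative capacity. */
typedef struct {
    TinyFrame frames[64];
    TinyFrame *top; /* points one past the most recent frame */
} TinyStack;

static void tiny_init(TinyStack *s) { s->top = s->frames; }

/* Push without any space check, mirroring the "no bounds checks" idea. */
static void tiny_push(TinyStack *s, uint32_t func_idx, uint32_t ip_offset)
{
    s->top->func_idx = func_idx;
    s->top->ip_offset = ip_offset;
    s->top++;
}

static void tiny_pop(TinyStack *s)
{
    assert(s->top > s->frames); /* debug-only check, compiled out by NDEBUG */
    s->top--;
}

/* Walk from the newest frame to the oldest using pointer arithmetic only:
 * the previous frame is always exactly sizeof(TinyFrame) bytes below.
 * Writes up to max function indexes into out and returns the count. */
static int tiny_dump(const TinyStack *s, uint32_t *out, int max)
{
    int n = 0;
    const TinyFrame *f = s->top;
    while (f != s->frames && n < max) {
        f--;
        out[n++] = f->func_idx;
    }
    return n;
}
```

In this sketch a push is two stores and a pointer increment, and a pop is a single decrement, which is where the code size saving per call site would come from.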
The benefit of this approach is the simplicity of the implementation, as we can rely on the existing mechanism; we already implemented a proof of concept and tested it on a few examples. The major disadvantage (compared to option 2) is the need to instrument every single call instruction; in the few programs we looked at, there are usually 5-10x more call instructions than function definitions, so option 2 might result in smaller code. Another disadvantage is that we'll need to make changes in the runtime (when generating a stack trace, the runtime currently relies on the prev field of the AOTFrame; we'll use AOT feature flags to detect which option is used), so users would have to update both the AOT code and the runtime itself.
Option 2
In this option, instead of instrumenting every call instruction, we'll add code for pushing the frame at the beginning of the function, and code for popping the frame at each point where the function terminates. Given that, based on our experiments, there are usually significantly more calls than function definitions, there should be a visible code size reduction too, which is what we're trying to achieve here. Another benefit of this approach is that code compiled with the new compiler can work out of the box with older runtimes (at least 2.x, due to ABI compatibility). There are a few concerns though:
- This option will not work for calls to imported functions, so for those we'll need to instrument the `call` instruction just like we do today.
- This option won't allow us to track the instruction offset, which might be a blocker for a lot of users (our team currently only uses function indexes, but we anticipate using instruction offsets in the future too for crash analysis).
- There can be multiple terminators within a function, which can negatively affect the code size. It's possible to detect all those terminators, but then either all of them would have to include the code for popping the frame, or they'd have to jump to shared code that does this (I haven't explored this too much though; maybe it's enough to add the code before the last `end` in the function and before each `return`).
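To make the terminator concern more concrete, here is a small C analogy of option 2, with the frame pushed once at function entry and every early exit routed through one shared epilogue. It is only a sketch under assumptions: `frame_push`, `frame_pop`, the global `g_stack` and the function indexes 1 and 2 are hypothetical, and real generated code would be LLVM IR emitted by wamrc rather than hand-written C.

```c
#include <stdint.h>

/* Hypothetical minimal call-stack bookkeeping; not WAMR code. */
static uint32_t g_stack[64];
static int g_depth = 0;
static int g_max_depth = 0; /* deepest stack seen, for inspection only */

static void frame_push(uint32_t func_idx)
{
    g_stack[g_depth++] = func_idx;
    if (g_depth > g_max_depth)
        g_max_depth = g_depth;
}

static void frame_pop(void) { g_depth--; }

/* "Function index 2": a leaf function with a single exit. */
static int leaf(int x)
{
    frame_push(2);
    int r = x * 2;
    frame_pop();
    return r;
}

/* "Function index 1": several early returns share one epilogue, so the
 * pop code exists once instead of being copied before every return. */
static int classify(int x)
{
    int result;
    frame_push(1); /* once at entry, however many calls follow */
    if (x < 0) {
        result = -1;
        goto done;
    }
    if (x == 0) {
        result = 0;
        goto done;
    }
    result = leaf(x) + leaf(x + 1); /* two calls, no per-call wrapping */
done:
    frame_pop(); /* the single shared copy of the pop code */
    return result;
}
```

The trade-off shown here is a jump per early exit versus a duplicated pop sequence per exit; which one is smaller depends on how large the pop sequence is on the target architecture.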
### Tasks
- [x] Add parameters to wamrc to decide what callstack features should be enabled (https://github.com/bytecodealliance/wasm-micro-runtime/pull/3763)
- [x] Implement small frames that only contain ip and function index https://github.com/bytecodealliance/wasm-micro-runtime/pull/3773
- [x] Implement mode where frames are allocated at the beginning of the function, and not for each function call https://github.com/bytecodealliance/wasm-micro-runtime/pull/3773
- [ ] Expose option to explicitly enable/disable frame-per-function mode
- [ ] Implement frame-per-function mode for STANDARD frames
- [ ] add support for GC mode
- [ ] optimize `return` opcode so the code for freeing a frame is not duplicated: https://github.com/bytecodealliance/wasm-micro-runtime/pull/3773/files#r1749488852
- [x] Add func-idx parameter to the `--call-stack-features` option https://github.com/bytecodealliance/wasm-micro-runtime/pull/3785