This repository was archived by the owner on Oct 3, 2025. It is now read-only.

Commit f6dba82

perf: small performance improvements
Signed-off-by: Henry Gressmann <[email protected]>
1 parent 6850e8b commit f6dba82

6 files changed

+104
-70
lines changed


BENCHMARKS.md

Lines changed: 4 additions & 2 deletions
@@ -33,7 +33,7 @@ All runtimes are compiled with the following settings:
3333
| `fib` | 6ns | 44.76µs | 48.96µs | 52µs |
3434
| `fib-rec` | 284ns | 25.565ms | 5.11ms | 0.50ms |
3535
| `argon2id` | 0.52ms | 110.08ms | 44.408ms | 4.76ms |
36-
| `selfhosted` | 45µs | 2.18ms | 4.25ms | 258.87ms |
36+
| `selfhosted` | 45µs | 2.08ms | 4.25ms | 258.87ms |
3737

3838
### Fib
3939

@@ -49,7 +49,7 @@ TinyWasm is a lot slower here, but that's because there's currently no way to re
4949

5050
This benchmark runs the Argon2id hashing algorithm, with 2 iterations, 1KB of memory, and 1 parallel lane.
5151
I had to decrease the memory usage from the default to 1KB, because especially the interpreters were struggling to finish in a reasonable amount of time.
52-
This is where `simd` instructions would be really useful, and it also highlights some of the issues with the current implementation of TinyWasm's Value Stack and Memory Instances.
52+
This is where `simd` instructions would be really useful, and it also highlights some of the issues with the current implementation of TinyWasm's Value Stack and Memory Instances. These spend a lot of time on `Vec` operations, so they might be a good place to start experimenting with Arena Allocation.
5353

5454
### Selfhosted
5555

@@ -62,6 +62,8 @@ Wasmer also offers a pre-parsed module format, so keep in mind that this number
6262

6363
After profiling and fixing some low-hanging fruit, I found the biggest bottlenecks to be Vector operations, especially for the Value Stack, and shared access to Memory Instances through RefCell. These are the two areas I will focus on improving in the future, trying out Arena Allocation and other data structures to improve performance. Additionally, typed FuncHandles have a significant overhead over the untyped ones, so I will be looking into improving that as well. Still, I'm quite happy with the results, especially considering the use of standard Rust data structures.
6464

65+
Giving the compiler hints about cold paths and forcing the inlining of some functions made a much bigger difference than I expected: some benchmarks became more than 30% faster. Many places in the codebase have comments describing the optimizations that were made.
66+
6567
# Running benchmarks
6668

6769
Benchmarks are run using [Criterion.rs](https://github.com/bheisler/criterion.rs). To run a benchmark, use the following command:
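
The cold-path hints mentioned above can be expressed on stable Rust with a small helper pair. The following is a minimal sketch, assuming that `cold()` and `unlikely()` are defined roughly like this; their definitions are not part of this diff, and `fetch` is a hypothetical function used only for illustration.

```rust
/// Marker for a rarely executed path. Calling a `#[cold]` function hints to
/// the optimizer that the branch containing the call is unlikely to be taken.
#[cold]
fn cold() {}

/// Stable-Rust stand-in for a branch hint: returns `b` unchanged, but marks
/// the `true` branch as cold. (Assumed shape, not taken from the diff.)
#[inline(always)]
fn unlikely(b: bool) -> bool {
    if b {
        cold();
    }
    b
}

/// Hypothetical hot-path helper: fetch the instruction at `ptr`, treating an
/// out-of-range pointer as the rare case.
#[inline(always)]
fn fetch(instrs: &[u8], ptr: usize) -> Option<u8> {
    if unlikely(ptr >= instrs.len()) {
        return None;
    }
    Some(instrs[ptr])
}
```

Whether such hints pay off depends on the code and the compiler version, so they are best checked against the benchmarks rather than applied blindly.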

crates/tinywasm/src/instance.rs

Lines changed: 16 additions & 4 deletions
@@ -36,15 +36,18 @@ pub(crate) struct ModuleInstanceInner {
3636

3737
impl ModuleInstance {
3838
// drop the module instance reference and swap it with another one
39+
#[inline]
3940
pub(crate) fn swap(&mut self, other: Self) {
4041
self.0 = other.0;
4142
}
4243

44+
#[inline]
4345
pub(crate) fn swap_with(&mut self, other_addr: ModuleInstanceAddr, store: &mut Store) {
4446
self.swap(store.get_module_instance_raw(other_addr))
4547
}
4648

4749
/// Get the module instance's address
50+
#[inline]
4851
pub fn id(&self) -> ModuleInstanceAddr {
4952
self.0.idx
5053
}
@@ -118,44 +121,53 @@ impl ModuleInstance {
118121
Some(ExternVal::new(kind, *addr))
119122
}
120123

121-
pub(crate) fn func_addrs(&self) -> &[FuncAddr] {
122-
&self.0.func_addrs
123-
}
124-
124+
#[inline]
125125
pub(crate) fn new(inner: ModuleInstanceInner) -> Self {
126126
Self(Rc::new(inner))
127127
}
128128

129+
#[inline]
129130
pub(crate) fn func_ty(&self, addr: FuncAddr) -> &FuncType {
130131
self.0.types.get(addr as usize).expect("No func type for func, this is a bug")
131132
}
132133

134+
#[inline]
135+
pub(crate) fn func_addrs(&self) -> &[FuncAddr] {
136+
&self.0.func_addrs
137+
}
138+
133139
// resolve a function address to the global store address
140+
#[inline]
134141
pub(crate) fn resolve_func_addr(&self, addr: FuncAddr) -> FuncAddr {
135142
*self.0.func_addrs.get(addr as usize).expect("No func addr for func, this is a bug")
136143
}
137144

138145
// resolve a table address to the global store address
146+
#[inline]
139147
pub(crate) fn resolve_table_addr(&self, addr: TableAddr) -> TableAddr {
140148
*self.0.table_addrs.get(addr as usize).expect("No table addr for table, this is a bug")
141149
}
142150

143151
// resolve a memory address to the global store address
152+
#[inline]
144153
pub(crate) fn resolve_mem_addr(&self, addr: MemAddr) -> MemAddr {
145154
*self.0.mem_addrs.get(addr as usize).expect("No mem addr for mem, this is a bug")
146155
}
147156

148157
// resolve a data address to the global store address
158+
#[inline]
149159
pub(crate) fn resolve_data_addr(&self, addr: DataAddr) -> MemAddr {
150160
*self.0.data_addrs.get(addr as usize).expect("No data addr for data, this is a bug")
151161
}
152162

153163
// resolve an element address to the global store address
164+
#[inline]
154165
pub(crate) fn resolve_elem_addr(&self, addr: ElemAddr) -> ElemAddr {
155166
*self.0.elem_addrs.get(addr as usize).expect("No elem addr for elem, this is a bug")
156167
}
157168

158169
// resolve a global address to the global store address
170+
#[inline]
159171
pub(crate) fn resolve_global_addr(&self, addr: GlobalAddr) -> GlobalAddr {
160172
self.0.global_addrs[addr as usize]
161173
}
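
The accessors in this file map module-local indices to addresses in the global store, and `swap` replaces the shared handle to the current module. The sketch below illustrates that pattern under simplified assumptions: the struct fields and the `u32` address type are stand-ins, not the crate's real definitions.

```rust
use std::rc::Rc;

// Simplified stand-in for the crate's inner module data (assumed fields).
struct ModuleInstanceInner {
    idx: u32,
    func_addrs: Vec<u32>,
}

// A cheap-to-clone handle: cloning only bumps the reference count.
#[derive(Clone)]
struct ModuleInstance(Rc<ModuleInstanceInner>);

impl ModuleInstance {
    /// Address of this module instance in the store.
    #[inline]
    fn id(&self) -> u32 {
        self.0.idx
    }

    /// Drop the current reference and point at another module instance.
    #[inline]
    fn swap(&mut self, other: Self) {
        self.0 = other.0;
    }

    /// Translate a module-local function index into a store-global address.
    #[inline]
    fn resolve_func_addr(&self, addr: u32) -> u32 {
        *self.0.func_addrs.get(addr as usize).expect("No func addr for func, this is a bug")
    }
}

/// Hypothetical helper that builds an instance for the examples.
fn make_instance(idx: u32, func_addrs: Vec<u32>) -> ModuleInstance {
    ModuleInstance(Rc::new(ModuleInstanceInner { idx, func_addrs }))
}
```

These accessors are tiny, which is why marking them `#[inline]` is cheap: the attribute is only a hint, and the compiler remains free to ignore it.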

crates/tinywasm/src/runtime/interpreter/macros.rs

Lines changed: 2 additions & 1 deletion
@@ -28,13 +28,13 @@ macro_rules! mem_load {
2828
}};
2929

3030
($load_type:ty, $target_type:ty, $arg:ident, $stack:ident, $store:ident, $module:ident) => {{
31-
// TODO: there could be a lot of performance improvements here
3231
let mem_idx = $module.resolve_mem_addr($arg.mem_addr);
3332
let mem = $store.get_mem(mem_idx as usize)?;
3433
let mem_ref = mem.borrow_mut();
3534

3635
let addr = $stack.values.pop()?.raw_value();
3736
let addr = $arg.offset.checked_add(addr).ok_or_else(|| {
37+
cold();
3838
Error::Trap(crate::Trap::MemoryOutOfBounds {
3939
offset: $arg.offset as usize,
4040
len: core::mem::size_of::<$load_type>(),
@@ -43,6 +43,7 @@ macro_rules! mem_load {
4343
})?;
4444

4545
let addr: usize = addr.try_into().ok().ok_or_else(|| {
46+
cold();
4647
Error::Trap(crate::Trap::MemoryOutOfBounds {
4748
offset: $arg.offset as usize,
4849
len: core::mem::size_of::<$load_type>(),
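
The address computation in `mem_load!` can be read as two checked steps, each with a cold error closure. Here is a minimal standalone sketch of that idea, assuming a simplified `Trap` type with only `offset` and `len` fields; `effective_addr` and `load_u32_le` are hypothetical names, not functions from the crate.

```rust
use std::convert::TryFrom;

// Simplified stand-in for the crate's trap type (assumed fields).
#[derive(Debug, PartialEq)]
enum Trap {
    MemoryOutOfBounds { offset: usize, len: usize },
}

/// Marks the calling branch as rarely taken.
#[cold]
fn cold() {}

/// Add the static offset to the dynamic base address, trapping on overflow
/// or when the result does not fit into `usize`.
fn effective_addr(offset: u64, base: u64, len: usize) -> Result<usize, Trap> {
    let addr = offset.checked_add(base).ok_or_else(|| {
        cold();
        Trap::MemoryOutOfBounds { offset: offset as usize, len }
    })?;
    usize::try_from(addr).map_err(|_| {
        cold();
        Trap::MemoryOutOfBounds { offset: offset as usize, len }
    })
}

/// Load a little-endian `u32` from `mem` with full bounds checking.
fn load_u32_le(mem: &[u8], offset: u64, base: u64) -> Result<u32, Trap> {
    let addr = effective_addr(offset, base, 4)?;
    let oob = Trap::MemoryOutOfBounds { offset: addr, len: 4 };
    let end = match addr.checked_add(4) {
        Some(end) => end,
        None => return Err(oob),
    };
    let bytes = match mem.get(addr..end) {
        Some(bytes) => bytes,
        None => return Err(oob),
    };
    let mut buf = [0u8; 4];
    buf.copy_from_slice(bytes);
    Ok(u32::from_le_bytes(buf))
}
```

Keeping the error construction inside the closures means the trap value is only built on the failure path, which is the same reason the diff places `cold()` there.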

crates/tinywasm/src/runtime/interpreter/mod.rs

Lines changed: 13 additions & 9 deletions
@@ -29,28 +29,31 @@ impl InterpreterRuntime {
2929
let mut current_module = store.get_module_instance_raw(cf.func_instance.1);
3030

3131
loop {
32-
match exec_one(&mut cf, stack, store, &current_module)? {
32+
match exec_one(&mut cf, stack, store, &current_module) {
3333
// Continue execution at the new top of the call stack
34-
ExecResult::Call => {
34+
Ok(ExecResult::Call) => {
3535
cf = stack.call_stack.pop()?;
36+
37+
// keeping the pointer separate from the call frame is about 2% faster
38+
// than storing it in the call frame
3639
if cf.func_instance.1 != current_module.id() {
3740
current_module.swap_with(cf.func_instance.1, store);
3841
}
3942
}
4043

4144
// return from the function
42-
ExecResult::Return => return Ok(()),
45+
Ok(ExecResult::Return) => return Ok(()),
4346

4447
// continue to the next instruction and increment the instruction pointer
45-
ExecResult::Ok => cf.instr_ptr += 1,
48+
Ok(ExecResult::Ok) => cf.instr_ptr += 1,
4649

4750
// trap the program
48-
ExecResult::Trap(trap) => {
51+
Err(error) => {
4952
cf.instr_ptr += 1;
5053
// push the call frame back onto the stack so that it can be resumed
5154
// if the trap can be handled
5255
stack.call_stack.push(cf)?;
53-
return Err(Error::Trap(trap));
56+
return Err(error);
5457
}
5558
}
5659
}
@@ -61,13 +64,14 @@ enum ExecResult {
6164
Ok,
6265
Return,
6366
Call,
64-
Trap(crate::Trap),
6567
}
6668

6769
/// Run a single step of the interpreter
6870
/// A separate function is used so that we can later more easily implement
6971
/// a step-by-step debugger (using generators once they're stable?)
70-
#[inline(always)] // this improves performance by more than 20% in some cases
72+
// we want this to always be part of the loop; rust just doesn't inline it on its own as it's too big
73+
// this can be a 30%+ performance difference in some cases
74+
#[inline(always)]
7175
fn exec_one(cf: &mut CallFrame, stack: &mut Stack, store: &mut Store, module: &ModuleInstance) -> Result<ExecResult> {
7276
let instrs = &cf.func_instance.0.instructions;
7377
if unlikely(cf.instr_ptr >= instrs.len() || instrs.is_empty()) {
@@ -84,7 +88,7 @@ fn exec_one(cf: &mut CallFrame, stack: &mut Stack, store: &mut Store, module: &M
8488
Nop => { /* do nothing */ }
8589
Unreachable => {
8690
cold();
87-
return Ok(ExecResult::Trap(crate::Trap::Unreachable));
91+
return Err(crate::Trap::Unreachable.into());
8892
} // we don't need to include the call frame here because it's already on the stack
8993
Drop => stack.values.pop().map(|_| ())?,
9094
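
The change in this file removes the `Trap` variant from `ExecResult` and reports traps through the `Err` side of `Result` instead. The sketch below shows that shape in isolation, assuming a `From<Trap> for Error` conversion exists (the diff relies on `.into()`, but the impl itself is not shown); `Instr` and `run` are hypothetical and much smaller than the real interpreter.

```rust
#[derive(Debug, PartialEq)]
enum Trap {
    Unreachable,
}

#[derive(Debug, PartialEq)]
enum Error {
    Trap(Trap),
}

// Assumed conversion that makes `Trap::Unreachable.into()` work.
impl From<Trap> for Error {
    fn from(trap: Trap) -> Self {
        Error::Trap(trap)
    }
}

// Control-flow outcomes only; failures travel through `Result::Err`.
enum ExecResult {
    Ok,
    Return,
}

// Hypothetical three-instruction set for illustration.
#[derive(Clone, Copy)]
enum Instr {
    Nop,
    Unreachable,
    Return,
}

#[inline(always)]
fn exec_one(instr: Instr) -> Result<ExecResult, Error> {
    match instr {
        Instr::Nop => Ok(ExecResult::Ok),
        Instr::Return => Ok(ExecResult::Return),
        Instr::Unreachable => Err(Trap::Unreachable.into()),
    }
}

/// Run until `Return` or the end of the code, yielding the final
/// instruction pointer, or the error of the first trapping instruction.
fn run(instrs: &[Instr]) -> Result<usize, Error> {
    let mut ptr = 0;
    loop {
        let instr = match instrs.get(ptr) {
            Some(instr) => *instr,
            None => return Ok(ptr),
        };
        match exec_one(instr) {
            Ok(ExecResult::Ok) => ptr += 1,
            Ok(ExecResult::Return) => return Ok(ptr),
            Err(error) => return Err(error),
        }
    }
}
```

Matching on the `Result` directly, rather than using `?`, keeps the error arm available for bookkeeping before returning, as the diff does when it pushes the call frame back onto the stack.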

crates/tinywasm/src/store/memory.rs

Lines changed: 3 additions & 0 deletions
@@ -100,6 +100,7 @@ impl MemoryInstance {
100100
Ok(val)
101101
}
102102

103+
#[inline]
103104
pub(crate) fn page_count(&self) -> usize {
104105
self.page_count
105106
}
@@ -186,10 +187,12 @@ macro_rules! impl_mem_loadable_for_primitive {
186187
$(
187188
#[allow(unsafe_code)]
188189
unsafe impl MemLoadable<$size> for $type {
190+
#[inline]
189191
fn from_le_bytes(bytes: [u8; $size]) -> Self {
190192
<$type>::from_le_bytes(bytes)
191193
}
192194

195+
#[inline]
193196
fn from_be_bytes(bytes: [u8; $size]) -> Self {
194197
<$type>::from_be_bytes(bytes)
195198
}

crates/tinywasm/src/store/mod.rs

Lines changed: 66 additions & 54 deletions
@@ -116,6 +116,72 @@ impl Store {
116116
Ok(())
117117
}
118118

119+
#[cold]
120+
fn not_found_error(name: &str) -> Error {
121+
Error::Other(format!("{} not found", name))
122+
}
123+
124+
/// Get the function at the actual index in the store
125+
#[inline]
126+
pub(crate) fn get_func(&self, addr: usize) -> Result<&FunctionInstance> {
127+
self.data.funcs.get(addr).ok_or_else(|| Self::not_found_error("function"))
128+
}
129+
130+
/// Get the memory at the actual index in the store
131+
#[inline]
132+
pub(crate) fn get_mem(&self, addr: usize) -> Result<&Rc<RefCell<MemoryInstance>>> {
133+
self.data.memories.get(addr).ok_or_else(|| Self::not_found_error("memory"))
134+
}
135+
136+
/// Get the table at the actual index in the store
137+
#[inline]
138+
pub(crate) fn get_table(&self, addr: usize) -> Result<&Rc<RefCell<TableInstance>>> {
139+
self.data.tables.get(addr).ok_or_else(|| Self::not_found_error("table"))
140+
}
141+
142+
/// Get the data at the actual index in the store
143+
#[inline]
144+
pub(crate) fn get_data(&self, addr: usize) -> Result<&DataInstance> {
145+
self.data.datas.get(addr).ok_or_else(|| Self::not_found_error("data"))
146+
}
147+
148+
/// Get the data at the actual index in the store
149+
#[inline]
150+
pub(crate) fn get_data_mut(&mut self, addr: usize) -> Result<&mut DataInstance> {
151+
self.data.datas.get_mut(addr).ok_or_else(|| Self::not_found_error("data"))
152+
}
153+
154+
/// Get the element at the actual index in the store
155+
#[inline]
156+
pub(crate) fn get_elem(&self, addr: usize) -> Result<&ElementInstance> {
157+
self.data.elements.get(addr).ok_or_else(|| Self::not_found_error("element"))
158+
}
159+
160+
/// Get the global at the actual index in the store
161+
#[inline]
162+
pub(crate) fn get_global(&self, addr: usize) -> Result<&Rc<RefCell<GlobalInstance>>> {
163+
self.data.globals.get(addr).ok_or_else(|| Self::not_found_error("global"))
164+
}
165+
166+
/// Get the global at the actual index in the store
167+
#[inline]
168+
pub fn get_global_val(&self, addr: usize) -> Result<RawWasmValue> {
169+
self.data.globals.get(addr).ok_or_else(|| Self::not_found_error("global")).map(|global| global.borrow().value)
170+
}
171+
172+
/// Set the global at the actual index in the store
173+
#[inline]
174+
pub(crate) fn set_global_val(&mut self, addr: usize, value: RawWasmValue) -> Result<()> {
175+
self.data
176+
.globals
177+
.get(addr)
178+
.ok_or_else(|| Self::not_found_error("global"))
179+
.map(|global| global.borrow_mut().value = value)
180+
}
181+
}
182+
183+
// Linking related functions
184+
impl Store {
119185
/// Add functions to the store, returning their addresses in the store
120186
pub(crate) fn init_funcs(
121187
&mut self,
@@ -391,58 +457,4 @@ impl Store {
391457
};
392458
Ok(val)
393459
}
394-
395-
#[cold]
396-
fn not_found_error(name: &str) -> Error {
397-
Error::Other(format!("{} not found", name))
398-
}
399-
400-
/// Get the function at the actual index in the store
401-
pub(crate) fn get_func(&self, addr: usize) -> Result<&FunctionInstance> {
402-
self.data.funcs.get(addr).ok_or_else(|| Self::not_found_error("function"))
403-
}
404-
405-
/// Get the memory at the actual index in the store
406-
pub(crate) fn get_mem(&self, addr: usize) -> Result<&Rc<RefCell<MemoryInstance>>> {
407-
self.data.memories.get(addr).ok_or_else(|| Self::not_found_error("memory"))
408-
}
409-
410-
/// Get the table at the actual index in the store
411-
pub(crate) fn get_table(&self, addr: usize) -> Result<&Rc<RefCell<TableInstance>>> {
412-
self.data.tables.get(addr).ok_or_else(|| Self::not_found_error("table"))
413-
}
414-
415-
/// Get the data at the actual index in the store
416-
pub(crate) fn get_data(&self, addr: usize) -> Result<&DataInstance> {
417-
self.data.datas.get(addr).ok_or_else(|| Self::not_found_error("data"))
418-
}
419-
420-
/// Get the data at the actual index in the store
421-
pub(crate) fn get_data_mut(&mut self, addr: usize) -> Result<&mut DataInstance> {
422-
self.data.datas.get_mut(addr).ok_or_else(|| Self::not_found_error("data"))
423-
}
424-
425-
/// Get the element at the actual index in the store
426-
pub(crate) fn get_elem(&self, addr: usize) -> Result<&ElementInstance> {
427-
self.data.elements.get(addr).ok_or_else(|| Self::not_found_error("element"))
428-
}
429-
430-
/// Get the global at the actual index in the store
431-
pub(crate) fn get_global(&self, addr: usize) -> Result<&Rc<RefCell<GlobalInstance>>> {
432-
self.data.globals.get(addr).ok_or_else(|| Self::not_found_error("global"))
433-
}
434-
435-
/// Get the global at the actual index in the store
436-
pub fn get_global_val(&self, addr: usize) -> Result<RawWasmValue> {
437-
self.data.globals.get(addr).ok_or_else(|| Self::not_found_error("global")).map(|global| global.borrow().value)
438-
}
439-
440-
/// Set the global at the actual index in the store
441-
pub(crate) fn set_global_val(&mut self, addr: usize, value: RawWasmValue) -> Result<()> {
442-
self.data
443-
.globals
444-
.get(addr)
445-
.ok_or_else(|| Self::not_found_error("global"))
446-
.map(|global| global.borrow_mut().value = value)
447-
}
448460
}
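
The getters moved in this file share one pattern: an inlined lookup whose failure path is delegated to a `#[cold]` error constructor. A minimal sketch of that pattern follows, using a simplified `Store` that holds function names as `String`s; the field layout and `Error` type are assumptions for illustration only.

```rust
#[derive(Debug, PartialEq)]
enum Error {
    Other(String),
}

// Simplified stand-in for the real store (assumed field).
struct Store {
    funcs: Vec<String>,
}

impl Store {
    /// Error construction allocates and formats, so it is kept out of the
    /// hot path and marked as unlikely to run.
    #[cold]
    fn not_found_error(name: &str) -> Error {
        Error::Other(format!("{} not found", name))
    }

    /// Get the function at the actual index in the store.
    #[inline]
    fn get_func(&self, addr: usize) -> Result<&str, Error> {
        self.funcs
            .get(addr)
            .map(String::as_str)
            .ok_or_else(|| Self::not_found_error("function"))
    }
}
```

Using `ok_or_else` rather than `ok_or` matters here: the closure defers the `format!` call until a lookup actually fails.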
