-
Notifications
You must be signed in to change notification settings - Fork 13.9k
Only box the function once when creating threads. #31621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
893ff0c
to
6e3582b
Compare
Personally I don't necessarily feel that this is worth the cost. The farther down we push the generics the more code we have to monomorphize for any vanilla call to Without performance numbers one way or another (and I would be quite skeptical that this could show up in any profile), I'd prefer to stick to concrete types instead of generics wherever possible. |
I had anticipated that as a possible downside, and the cost of any allocation would definitely be dwarfed by spawning the thread. In that sense, this change is unsubstantiated. However, I personally like to have very fine-grained control over how memory is used in my programs, and it's always a little disheartening to look into Another approach would be to violate parametricity in thread::spawn, with a different path for ZSTs. The only way to make that work would be to use "raw" in the ZST version to pass the vtable as the thread args, which feels very hacky. |
|
@arielb1 Yes, but the old code boxes the box, which does allocate. The core of this issue is that you need a pointer-width argument to pass to the system thread creation function. You can't get that from an *mut Trait, so the old code boxed it again to get an *mut Box<FnBox()>. What I meant by using |
Ah right. I would have pub unsafe fn new<F: FnOnce()>(stack: usize, p: F) -> io::Result<Thread> {
let p = box p;
match Self::new_inner(stack, &*p as *const F as *const libc::c_void, thread_start::<F>) {
Ok(o) => { mem::forget(p); Ok(o) }
Err(e) => Err(e)
}
extern fn thread_start<F: FnOnce()>(main: *mut libc::c_void)
-> *mut libc::c_void {
unsafe {
let main = Box::from_raw(main as *mut F);
start_thread(main);
}
ptr::null_mut()
}
}
unsafe fn new_inner<'a>(stack: usize, p: *const libc::c_void,
f: extern "C" fn(*mut libc::c_void) -> *mut libc::c_void)
-> io::Result<Thread> {
// ..
} There's a potential use-after-free if assert_eq!(libc::pthread_attr_destroy(&mut attr), 0); line - if it panics, we get a double-free of This way we also avoid creating a monomorphzied vtable. |
Making new a shim is probably a good idea -- it'll reduce the volume of monomorphized code. w.r.t. the destruction of the thread attributes, I really doubt that would ever fail if the pthreads implementation is compliant. |
Having concrete data showing that we pay for this change without showing how the tradeoff is worth it seems bad to me though? |
@alexcrichton I'll try to drum up a benchmark to see how much this really hurts compile times. It's not as though we don't already do some monomorphization for |
I did a for each of:
which yielded the following times after two runs:
The std tests use thread::spawn extensively in the I consider this level of regression acceptable, especially considering the sheer volume of thread::spawn calls in the tests, 199 in libstd alone (git grep thread::spawn | awk '/libstd/ { print $0 }' | wc -l). |
b4e9419
to
2543cd7
Compare
I figure that you might want to manually inline |
2543cd7
to
edd679a
Compare
edd679a
to
ee7a85c
Compare
@arielb1 I tried doing the inlined variety here: rphmeier@9fd687b But it seems to have slower compile times than the previous shim, but not as bad as my first attempt. |
thats... a bit odd, but go figure. What measurement method were you using? |
Just a If you have any other suggestions for benchmarking, I'd be glad to hear them. I don't know very much about the Rust build system, so I'm sure there are better ways to perform this test. |
You can cut out the middleman
in opt mode or in no-opt mode. |
I was curious what this did to compile times in a more extreme case, as stdtest may not give us great insight into compiling this function specifically. The test case I had was spawning 1000 threads (each returning a new integer) and then just dropping all the handles (no join). An optimized compile of this program went from 9:06.92 to 14:18.06, a 36% slowdown. Have you measured any runtime benefit to this? I'm not super eager to make the standard library more generic without any runtime benefit. |
I did the same test as you're proposing, but with only 200 threads as opposed to 1000, since 1000 ran on my laptop for about half an hour in LLVM with no signs of finishing. With 200 threads, I measured a 1.8% compilation slowdown with my changes (47.277s to 48.148s). Running Actually running the benchmark: master:
thread_box_once:
So it seems that the runtimes are just about even. The 1.8% slowdown amounted to a 0.871s increase in compile times, or about 5ms per thread::spawn call. I'm not going to dismiss it, but I think that this is definitely a tolerable increase. |
Maybe try a strategically placed |
What I mean to point out is that making these functions more generic comes at a cost, and that cost is inlining more code downstream. This is minimized here, but the cost is not zero.
While I agree with the conclusion the method here doesn't look like it's producing accurate results. The variance in those measurements is massive, so we can't really gain any information from that. I suspect, however, that it's basically absolutely guaranteed that there is no difference in perf from one more allocation when spawning a thread |
☔ The latest upstream changes (presumably #32390) made this pull request unmergeable. Please resolve the merge conflicts. |
Closing due to inactivity, but feel free to resubmit with a rebase! |
I felt like non-capturing thread functions really shouldn't require any heap allocations.