-
-
Notifications
You must be signed in to change notification settings - Fork 5.7k
Integrate interpreter frames into backtraces #23973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
yuyichao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| Get the backtrace of the current exception, for use within `catch` blocks. | ||
| """ | ||
| catch_backtrace() = ccall(:jl_get_backtrace, Array{Ptr{Void},1}, ()) | ||
| function catch_backtrace() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to just avoid exposing the bare pointer information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can probably return two arrays here and keep a sentinel value to decide where to insert them.
src/interpreter.c
Outdated
| extern uintptr_t __stop_jl_interpreter_frame; | ||
|
|
||
| #define SECT_INTERP JL_SECTION("jl_interpreter_frame") | ||
| #define SECT_ENTER_INTERP JL_SECTION("jl_enter_interpreter_frame") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... I got confused what the linux version is doing here. I assume this is the started but not finished version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, haven't tried on linux yet.
src/interpreter.c
Outdated
| #define SECT_INTERP JL_SECTION("__TEXT,__jif") | ||
| #define SECT_ENTER_INTERP JL_SECTION("__TEXT,__jeif") | ||
| #else | ||
| JL_DLLEXPORT int jl_is_interpreter_frame(uintptr_t ip) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function used anywhere? Is this used for removing C frames in the interpreter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the idea, yes. Not implemented yet, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we do that already for all functions in libjulia?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a mode where we display julia and C frames. I wanted to hide the interpreter frames in those traces, since we insert the reconstructed julia frame.
src/interpreter.c
Outdated
|
|
||
| JL_DLLEXPORT size_t jl_capture_interp_frame(uintptr_t *data, uintptr_t sp, size_t space_remaining) | ||
| { | ||
| interpreter_state *s = (interpreter_state *)sp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checked that this may not work all the time. This is actually architecture dependent and AArch64 puts frame pointer and link register below (lower address) local variables. Stack protector also changes this on aarch64 though it seems to not affect x86....
Anyway, I think an asm version will be much better. I'll write the arm and aarch64 version when you have the x86 version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it seems that on aarch64, gcc follows the aapcs64 recommendation and put local variable above fp and lr while clang/llvm 5.0 put the fp and lr above local variables. Both put stack protection slot at the bottom though..
src/interpreter.c
Outdated
| // before the interpreter state. If that's not the case on some platform, that platform | ||
| // will need this function written in assembly. | ||
| void *NOINLINE SECT_ENTER_INTERP enter_interpreter_frame(void *(*callback)(interpreter_state *, void *), void *arg) { | ||
| interpreter_state state = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be a lot faster to represent this explicitly, and teach jl_throw how to unwind it?
struct interpreter_state_chain {
uintptr_t sp; // = __builtin_frame_address(0);
jl_code_info_t src;
uint32_t ip;
struct interpreter_state_chain *next;
} global;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't even need the sp field? (&global would be good enough.) I've thought about this before but don't know how reliable sp is. This PR also rely on that of course....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JeffBezanson and I thought about it, but it would mean more state to manage everywhere, which didn't seem worth the trade off.
src/interpreter.c
Outdated
|
|
||
| SECT_INTERP jl_value_t *jl_interpret_toplevel_thunk(jl_module_t *m, jl_code_info_t *src) | ||
| { | ||
| struct jl_interpret_toplevel_thunk_args args = {.m = m, .src = src}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This C99 syntax is not present in C++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we still trying to write code that can be compiled by MSVC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Also supports the JULIA_TIMING option.
6aece2d to
4a1cf49
Compare
src/interpreter.c
Outdated
| ".type enter_interpreter_frame,@function\n" | ||
| "enter_interpreter_frame:\n" | ||
| ".cfi_startproc\n" | ||
| // sizeof() is 44, but we need to be 8 byte aligned, so subtract 48 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sizeof(struct interpreter_state)?
src/interpreter.c
Outdated
| "\tsubq $8, %rsp\n" | ||
| ".cfi_def_cfa_offset 64\n" | ||
| "enter_interpreter_frame_start_val:\n" | ||
| "\tcallq *%rax\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of avoiding a (fast, simple) shadow stack in favor of using this (very slow, complicated) real stack design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concern is that you need to add code to maintain the shadow stack in task switching, exception handling, etc. Also, I don't see why this would be slower than the shadow stack mechanism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One register to load/store instead of 10-20. No jmps, vs. 2 (where one is indirect).
The load/store cost is (nearly) zero (likely to be speculated executed). The jmp cost is high or very high depending on the quality of the processor (Recent Intel processors have optimizations so that this callq *%raq no longer causes a pipeline stall always)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really believe that the cost of one extra jmp is anything but negligible compared to the rest of the cost of function entry in the interpreter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two jmps, and both unnecessary and highly complex to maintain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as the function being called is not inlined (which is important for predictable value of $sp) that doesn't really save instructions. Calling functions with an address in the register is also in general easier to write....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would win you this argument is to convince me that the shadow stack solution is simpler
When I implemented this in the past, I used the shadow stack we need for #12205 anyways (e.g. specially marked gc frames). It only took a few minutes to implement, and I hadn't done anything with it (because you and Oscar had Gallium working), so I never made any PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make you a deal. Since this is almost good to go, I'll fix this up and get it merged. Then I'll put up a PR to replace it with the shadow stack approach at which point we can debate which is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still failing on several systems and needs a reviewer, so it appears there's non-trivial work left.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, but the backtracing mechanism itself (which is the part that a shadow stack would replace) is implemented and working on all CI tested architectures. Something else still has a bug I believe (though I haven't seen it locally yet), but that part seems to work fine.
8c71c0b to
4476c1b
Compare
|
I understand the win32 failure here (though the lack of rr made this take an unfortunate amount of time). On win32, we set |
|
This is done from my perspective. Once CI finishes running through, I'll push one more commit to fix the C++ compatibility concern. Review away. |
6e30283 to
3a27f8c
Compare
ddd9b06 to
fffc366
Compare
|
The Win64 failure looks unrelated to me. @JeffBezanson I think this is good to go at your convenience. |
Before:
```
julia> let
error()
end
ERROR:
Stacktrace:
[1] error() at ./error.jl:44
```
After:
```
julia> let
error()
end
ERROR:
Stacktrace:
[1] error() at ./error.jl:44
[2] macro expansion at REPL[0]:2 [inlined]
[3] In toplevel scope
```
The extra `macro expansion` frame is a pre-existing julia bug (#23971)
that this PR doesn't address.
The mechanism used here is to add a no-inline enter_interpreter stack frame that has
a known stack layout (since it only has one local variable) and can thus be used
to retrieve the interpreter state. I do not believe that this is guaranteed by the
C standard, so depending on the whims of the compiler we may have to eventually write
this function in assembly, but it seems to work well enough for now.
One significant complication is that the backtrace buffer may now contain pointers
to gc-managed data, so we need to make the gc aware that and can't be as non-chalant
about copying the buffer around anymore.
fffc366 to
fa92f86
Compare
Before:
After:
The extra
macro expansionframe is a pre-existing julia bug (#23971)that this PR doesn't address.
The mechanism used here is to add a no-inline enter_interpreter stack frame that has
a known stack layout (since it only has one local variable) and can thus be used
to retrieve the interpreter state. I do not believe that this is guaranteed by the
C standard, so depending on the whims of the compiler we may have to eventually write
this function in assembly, but it seems to work well enough for now.
One significant complication is that the backtrace buffer may now contain pointers
to gc-managed data, so we need to make the gc aware that and can't be as non-chalant
about copying the buffer around anymore.