Skip to content

Conversation

@Keno
Copy link
Member

@Keno Keno commented Oct 2, 2017

Before:

julia> let
       error()
       end
ERROR:
Stacktrace:
 [1] error() at ./error.jl:44

After:

julia> let
       error()
       end
ERROR:
Stacktrace:
 [1] error() at ./error.jl:44
 [2] macro expansion at REPL[0]:2 [inlined]
 [3] In toplevel scope

The extra macro expansion frame is a pre-existing julia bug (#23971)
that this PR doesn't address.

The mechanism used here is to add a no-inline enter_interpreter stack frame that has
a known stack layout (since it only has one local variable) and can thus be used
to retrieve the interpreter state. I do not believe that this is guaranteed by the
C standard, so depending on the whims of the compiler we may have to eventually write
this function in assembly, but it seems to work well enough for now.

One significant complication is that the backtrace buffer may now contain pointers
to gc-managed data, so we need to make the gc aware that and can't be as non-chalant
about copying the buffer around anymore.

@Keno Keno requested review from JeffBezanson and yuyichao October 2, 2017 22:10
Copy link
Contributor

@yuyichao yuyichao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Get the backtrace of the current exception, for use within `catch` blocks.
"""
catch_backtrace() = ccall(:jl_get_backtrace, Array{Ptr{Void},1}, ())
function catch_backtrace()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to just avoid exposing the bare pointer information?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can probably return two arrays here and keep a sentinel value to decide where to insert them.

extern uintptr_t __stop_jl_interpreter_frame;

#define SECT_INTERP JL_SECTION("jl_interpreter_frame")
#define SECT_ENTER_INTERP JL_SECTION("jl_enter_interpreter_frame")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I got confused what the linux version is doing here. I assume this is the started but not finished version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, haven't tried on linux yet.

#define SECT_INTERP JL_SECTION("__TEXT,__jif")
#define SECT_ENTER_INTERP JL_SECTION("__TEXT,__jeif")
#else
JL_DLLEXPORT int jl_is_interpreter_frame(uintptr_t ip)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function used anywhere? Is this used for removing C frames in the interpreter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the idea, yes. Not implemented yet, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we do that already for all functions in libjulia?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a mode where we display julia and C frames. I wanted to hide the interpreter frames in those traces, since we insert the reconstructed julia frame.


JL_DLLEXPORT size_t jl_capture_interp_frame(uintptr_t *data, uintptr_t sp, size_t space_remaining)
{
interpreter_state *s = (interpreter_state *)sp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked that this may not work all the time. This is actually architecture dependent and AArch64 puts frame pointer and link register below (lower address) local variables. Stack protector also changes this on aarch64 though it seems to not affect x86....

Anyway, I think an asm version will be much better. I'll write the arm and aarch64 version when you have the x86 version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it seems that on aarch64, gcc follows the aapcs64 recommendation and put local variable above fp and lr while clang/llvm 5.0 put the fp and lr above local variables. Both put stack protection slot at the bottom though..

// before the interpreter state. If that's not the case on some platform, that platform
// will need this function written in assembly.
void *NOINLINE SECT_ENTER_INTERP enter_interpreter_frame(void *(*callback)(interpreter_state *, void *), void *arg) {
interpreter_state state = {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be a lot faster to represent this explicitly, and teach jl_throw how to unwind it?

struct interpreter_state_chain {
    uintptr_t sp; // = __builtin_frame_address(0);
    jl_code_info_t src;
    uint32_t ip;
    struct interpreter_state_chain *next;
} global;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't even need the sp field? (&global would be good enough.) I've thought about this before but don't know how reliable sp is. This PR also rely on that of course....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JeffBezanson and I thought about it, but it would mean more state to manage everywhere, which didn't seem worth the trade off.


SECT_INTERP jl_value_t *jl_interpret_toplevel_thunk(jl_module_t *m, jl_code_info_t *src)
{
struct jl_interpret_toplevel_thunk_args args = {.m = m, .src = src};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This C99 syntax is not present in C++

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we still trying to write code that can be compiled by MSVC?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Also supports the JULIA_TIMING option.

@Keno Keno force-pushed the kf/interpretertraces branch from 6aece2d to 4a1cf49 Compare October 3, 2017 20:40
".type enter_interpreter_frame,@function\n"
"enter_interpreter_frame:\n"
".cfi_startproc\n"
// sizeof() is 44, but we need to be 8 byte aligned, so subtract 48
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizeof(struct interpreter_state)?

"\tsubq $8, %rsp\n"
".cfi_def_cfa_offset 64\n"
"enter_interpreter_frame_start_val:\n"
"\tcallq *%rax\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of avoiding a (fast, simple) shadow stack in favor of using this (very slow, complicated) real stack design.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern is that you need to add code to maintain the shadow stack in task switching, exception handling, etc. Also, I don't see why this would be slower than the shadow stack mechanism?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One register to load/store instead of 10-20. No jmps, vs. 2 (where one is indirect).

The load/store cost is (nearly) zero (likely to be speculated executed). The jmp cost is high or very high depending on the quality of the processor (Recent Intel processors have optimizations so that this callq *%raq no longer causes a pipeline stall always)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really believe that the cost of one extra jmp is anything but negligible compared to the rest of the cost of function entry in the interpreter?

Copy link
Member

@vtjnash vtjnash Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two jmps, and both unnecessary and highly complex to maintain

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as the function being called is not inlined (which is important for predictable value of $sp) that doesn't really save instructions. Calling functions with an address in the register is also in general easier to write....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would win you this argument is to convince me that the shadow stack solution is simpler

When I implemented this in the past, I used the shadow stack we need for #12205 anyways (e.g. specially marked gc frames). It only took a few minutes to implement, and I hadn't done anything with it (because you and Oscar had Gallium working), so I never made any PRs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make you a deal. Since this is almost good to go, I'll fix this up and get it merged. Then I'll put up a PR to replace it with the shadow stack approach at which point we can debate which is better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still failing on several systems and needs a reviewer, so it appears there's non-trivial work left.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, but the backtracing mechanism itself (which is the part that a shadow stack would replace) is implemented and working on all CI tested architectures. Something else still has a bug I believe (though I haven't seen it locally yet), but that part seems to work fine.

@Keno Keno force-pushed the kf/interpretertraces branch 5 times, most recently from 8c71c0b to 4476c1b Compare October 9, 2017 19:21
@Keno
Copy link
Member Author

Keno commented Oct 10, 2017

I understand the win32 failure here (though the lack of rr made this take an unfortunate amount of time). On win32, we set -mincoming-stack-boundary=2 which makes gcc emit a stack realignment prologue. However, under certain circumstances (which may be compiler version specific), it performs the stack realignment before the main prologue [1], which throws off getting the interpreter struct from the stack. What made debugging this worse is that most of the callback do not have the pre-prologue realignment and GCC 6.2.0 doesn't seem to emit it at all for this code. Anyway, should be fairly straightforward to fix.

[1] https://gcc.gnu.org/ml/gcc/2007-12/msg00503.html

@Keno Keno changed the title WIP: Integrate interpreter frames into backtraces Integrate interpreter frames into backtraces Oct 10, 2017
@Keno
Copy link
Member Author

Keno commented Oct 10, 2017

This is done from my perspective. Once CI finishes running through, I'll push one more commit to fix the C++ compatibility concern. Review away.

@Keno Keno force-pushed the kf/interpretertraces branch from 6e30283 to 3a27f8c Compare October 12, 2017 21:48
@Keno Keno force-pushed the kf/interpretertraces branch 4 times, most recently from ddd9b06 to fffc366 Compare November 3, 2017 19:22
@Keno
Copy link
Member Author

Keno commented Nov 4, 2017

The Win64 failure looks unrelated to me. @JeffBezanson I think this is good to go at your convenience.

Before:
```
julia> let
       error()
       end
ERROR:
Stacktrace:
 [1] error() at ./error.jl:44
```

After:
```
julia> let
       error()
       end
ERROR:
Stacktrace:
 [1] error() at ./error.jl:44
 [2] macro expansion at REPL[0]:2 [inlined]
 [3] In toplevel scope
```

The extra `macro expansion` frame is a pre-existing julia bug (#23971)
that this PR doesn't address.

The mechanism used here is to add a no-inline enter_interpreter stack frame that has
a known stack layout (since it only has one local variable) and can thus be used
to retrieve the interpreter state. I do not believe that this is guaranteed by the
C standard, so depending on the whims of the compiler we may have to eventually write
this function in assembly, but it seems to work well enough for now.

One significant complication is that the backtrace buffer may now contain pointers
to gc-managed data, so we need to make the gc aware that and can't be as non-chalant
about copying the buffer around anymore.
@JeffBezanson JeffBezanson merged commit 4fe309a into master Nov 7, 2017
@JeffBezanson JeffBezanson deleted the kf/interpretertraces branch November 7, 2017 05:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants