diff --git a/docs/design/features/OSRX64EpilogRedesign.md b/docs/design/features/OSRX64EpilogRedesign.md
new file mode 100644
index 00000000000000..d710083e089965
--- /dev/null
+++ b/docs/design/features/OSRX64EpilogRedesign.md
@@ -0,0 +1,247 @@
+## OSR x64 Epilog Redesign
+
+### Problem
+
+The current x64 OSR epilog generation creates "non-canonical"
+epilogs. While the code sequences are correct, the Windows x64
+unwinder depends on code generators to produce canonical epilogs, so
+that the unwinder can reliably detect when an IP is within an epilog.
+
+The Windows x64 unwind info has no data whatsoever on epilogs, so this
+sort of implicit epilog detection is necessary. The unwinder
+disassembles the code starting at the IP to deduce if the IP is
+within an epilog. Only very specific sequences of instructions are
+expected, and anything unexpected causes the unwinder to deduce
+that the IP is not in an epilog.
+
+The canonical epilog is a single RSP adjust followed by some number of
+non-volatile integer register POPs, and then a RET or JMP. Non-volatile float
+registers are restored outside the epilog via MOVs.
+
+OSR methods currently generate the following kind of epilog. It is
+non-canonical because of the second RSP adjustment, whose purpose is
+to remove the Tier0 frame from the stack.
+
+```asm
+       add      rsp, 120   ;; pop OSR contribution to frame
+       pop      rbx        ;; restore non-volatile regs (callee-saves)
+       pop      rsi
+       pop      rdi
+       pop      r12
+       pop      r13
+       pop      r14
+       pop      r15
+       pop      rbp
+       add      rsp, 472   ;; pop Tier0 contribution to frame
+       pop      rbp        ;; second RBP restore (see below)
+       ret
+```
+
+These non-canonical OSR epilogs break the x64 unwinder's "in epilog"
+detection and also break epilog unwind. This leads to assertions and
+bugs during thread suspension, when suspended threads are in the
+middle of OSR epilogs, and to broken stack traces when walking the
+stack for diagnostic purposes (debugging or sampling).
+
+The CLR (mostly?) tries to avoid suspending threads in epilogs, but it
+does this by suspending the thread and then calling into the OS
+unwinder to determine if a thread is in an epilog. The non-canonical
+OSR epilogs break thread suspension.
+
+So it is imperative that the x64 OSR epilog sequence be one that the
+OS unwinder can reliably recognize as an epilog. It is also beneficial
+(though perhaps not mandatory) to be able to unwind from such epilogs; this
+improves diagnostic stackwalking accuracy and allows hijacking to
+work normally during epilogs, if needed.
+
+Arm64 unwind codes are more flexible, and the OSR epilogs we generate
+today do not cause any known problems.
+
+### Solution
+
+If the OSR method is required to have a canonical epilog, a single
+RSP adjust must remove both the OSR and Tier0 frames. This implies any
+and all nonvolatile integer register saves must be stored at the root of the
+Tier0 frame so that they can be properly restored by the OSR epilog
+via POPs after the single RSP adjustment.
+
+Generally speaking, the Tier0 and OSR methods will not save the same
+set of non-volatile registers, and there is no way for the Tier0
+method to know which registers the OSR method might want to save.
+
+Thus we will require that any Tier0 method with patchpoints reserve
+the maximum-sized area for integer registers (8 regs * 8 bytes
+on Windows, 64 bytes). The Tier0 method will only use the part it
+needs. The rest will be unused unless we end up creating an OSR
+method. OSR methods will save any additional nonvolatile registers
+they use in this area in their prologs.
+
+OSR method epilogs will then adjust the SP to remove both the OSR and
+Tier0 frames, setting RSP to the appropriate offset into the save
+area, so that the epilog can pop all the saved nonvolatile registers and
+return. This gives OSR methods a canonical epilog.
+
+That fixes the epilogs. But we must now also ensure that all this can
+be handled properly in the OSR prolog, so that in-prolog and in-body
+unwind are still viable.
+
+A typical prolog would PUSH the non-volatiles it wants to save, but
+on entry, the OSR method's RSP is pointing below the Tier0 frame,
+and so is located well below the save area. So PUSHing is not possible.
+
+Instead, the OSR method will use MOVs to save nonvolatile
+registers. Luckily, the x64 unwind format has support for describing saves
+done via MOVs instead of PUSHes: `UWOP_SAVE_NONVOL` (added to support
+shrink-wrapping). We will use these codes to describe the callee save actions
+in the OSR prolog.
+
+This unwind code uses the established frame pointer (for x64 OSR this
+is always RSP), and so integer callee saves must be saved only after any
+RSP adjustments are made. This means in an OSR frame prolog the SP adjustment
+happens first, then the (additional) callee saves are saved. We need
+to take some care to ensure no callee save is trashed during the SP
+adjustment (which may be more than just an add, say if stack probing is needed).
+
+### Work Needed
+
+* Update the Tier0 method to allocate a maximally sized integer save area.
+
+* OSR method prolog and unwind fixes
+  * To express the fact that some callee saves were saved by the Tier0
+method, the OSR method will first issue a phantom (unwind only, offset 0)
+series of pushes for those callee saves.
+  * Next the OSR method will do a phantom SP adjust to account for the
+remainder of the Tier0 frame and any SP adjustment done by the patchpoint
+transition code.
+  * Since the Tier0 method is always an RBP frame and always saves RBP at the
+    top of the register save area, the OSR method does not need to save RBP, and
+    RBP can be restored from the Tier0 save. But (for RBP OSR frames) the x64
+    OSR prolog must still set up a proper frame chain. So it will load from RBP
+    (into a scratch register) and push the result to establish a proper value
+    for RBP-based frame chaining. The OSR method is invoked with the Tier0 RBP,
+    so this load/push fetches the Tier0 caller RBP and stores it in a slot on
+    the OSR frame. This sets up a redundant copy of the saved RBP that does not
+    need to be undone on method exit.
+  * Next the OSR prolog will establish its final RSP.
+  * Finally the OSR method will save any remaining callee saves, using MOV
+    instructions and `UWOP_SAVE_NONVOL` unwind records.
+  * Nonvolatile float (xmm) registers continue to be stored via MOVs
+    done after the int callee saves and RSP adjust -- their save area can be
+    disjoint from the integer save area. Thus XMM registers can be saved to and
+    restored from space on the OSR frame (otherwise the Tier0 frame would
+    need to reserve another 160 bytes (Windows) to hold possible OSR XMM
+    saves). We do not yet take advantage of the fact that Tier0 methods
+    may have also saved XMMs so that the OSR method may only need to save
+    a subset.
+
+### Example
+
+Here is an example contrasting the new and old approaches on a test case.
+ +#### Old Approach +```asm +;; Tier0 prolog + + 55 push rbp + 56 push rsi + 4883EC38 sub rsp, 56 + 488D6C2440 lea rbp, [rsp+40H] + +;; Tier0 epilog + + 4883C438 add rsp, 56 + 5E pop rsi + 5D pop rbp + C3 ret + +;; Tier0 unwind + + CodeOffset: 0x06 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 6 * 8 + 8 = 56 = 0x38 + CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6) + CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5) + +;; OSR prolog + + 57 push rdi + 56 push rsi // redundant + 4883EC28 sub rsp, 40 + +;; OSR epilog (non-standard) + + 4883C428 add rsp, 40 + 5E pop rsi + 5F pop rdi + 4883C448 add rsp, 72 + 5D pop rbp + C3 ret + +;; OSR unwind + + CodeOffset: 0x06 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 4 * 8 + 8 = 40 = 0x28 + CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6) + CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rdi (7) + + ;; "phantom unwind" records at offset 0 (Tier0 actions) + + CodeOffset: 0x00 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 8 * 8 + 8 = 72 = 0x48 + CodeOffset: 0x00 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5) +``` + +#### New Approach + +Note how the OSR method only saves RDI in its prolog, as RSI was already saved. +And this save happens *after* RSP is updated in the OSR frame. +Restore of RDI in unwind uses `UWOP_SAVE_NONVOL`. 
+```asm +;; Tier0 prolog + + 55 push rbp + 56 push rsi + 4883EC68 sub rsp, 104 // leave room for OSR + 488D6C2470 lea rbp, [rsp+70H] + +;; Tier0 epilog + + 4883C468 add rsp, 104 + 5E pop rsi + 5D pop rbp + C3 ret + +;; Tier0 unwind + + CodeOffset: 0x06 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 12 * 8 + 8 = 104 = 0x68 + CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6) + CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5) + +;; OSR prolog + + 4883EC38 sub rsp, 56 + 4889BC24A0000000 mov qword ptr [rsp+A0H], rdi + +;; OSR epilog (standard) + + 4881C4A0000000 add rsp, 160 + 5F pop rdi + 5E pop rsi + 5D pop rbp + C3 ret + +;; OSR unwind + + CodeOffset: 0x0C UnwindOp: UWOP_SAVE_NONVOL (4) OpInfo: rdi (7) + Scaled Small Offset: 20 * 8 = 160 = 0x000A0 + CodeOffset: 0x04 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 6 * 8 + 8 = 56 = 0x38 + + ;; "phantom unwind" records at offset 0 (Tier0 actions) + + CodeOffset: 0x00 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 13 * 8 + 8 = 112 = 0x70 + CodeOffset: 0x00 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6) + CodeOffset: 0x00 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5) +``` + +### Notes + +* We are not changing arm64 OSR at this time, it still uses the "old plan". Non-standard epilogs are handled on arm64 via epilog unwind codes. + +* The OSR frame still reserves space for callee saves on its frame, despite +not saving them there. 
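As a cross-check on the example above, the single `add rsp, imm` in the new canonical epilog can be modeled as: the OSR prolog's own RSP subtraction, plus the Tier0 frame (its `TotalFrameSize` plus one slot for Tier0's RBP push), minus one slot per integer callee save that the epilog pops individually, plus one slot if the OSR prolog pushed a frame-chain link. A minimal sketch with hypothetical names (this is a model, not the JIT's actual code):

```cpp
// Model of the single SP adjustment in the new canonical OSR epilog.
// osrSubRsp:       amount the OSR prolog subtracted from RSP
// tier0TotalFrame: patchpoint SP-to-FP delta plus the simulated call slot
// savedIntRegs:    integer callee saves pushed by Tier0 or saved by OSR (incl. RBP)
// osrFpFrame:      whether the OSR prolog pushed a frame-chain slot
int osrEpilogSpAdd(int osrSubRsp, int tier0TotalFrame, int savedIntRegs, bool osrFpFrame)
{
    const int REGSIZE_BYTES = 8;
    int adjust = (tier0TotalFrame + REGSIZE_BYTES)     // Tier0 frame incl. its RBP push
                 - savedIntRegs * REGSIZE_BYTES        // these are popped individually
                 + (osrFpFrame ? REGSIZE_BYTES : 0);   // OSR frame-chain slot
    return osrSubRsp + adjust;
}
```

Plugging in the new-approach example (OSR `sub rsp, 56`; Tier0 `TotalFrameSize` of 120; RBP, RSI, and RDI saved between the two frames; no OSR frame-chain push) reproduces the `add rsp, 160` shown above.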
\ No newline at end of file
diff --git a/src/coreclr/jit/codegen.h b/src/coreclr/jit/codegen.h
index d0ce7bfe45a285..8bd597360a1318 100644
--- a/src/coreclr/jit/codegen.h
+++ b/src/coreclr/jit/codegen.h
@@ -327,6 +327,11 @@ class CodeGen final : public CodeGenInterface
     void genPushCalleeSavedRegisters();
 #endif
 
+#if defined(TARGET_AMD64)
+    void genOSRRecordTier0CalleeSavedRegistersAndFrame();
+    void genOSRSaveRemainingCalleeSavedRegisters();
+#endif // TARGET_AMD64
+
     void genAllocLclFrame(unsigned frameSize, regNumber initReg, bool* pInitRegZeroed, regMaskTP maskArgRegsLiveIn);
 
     void genPoisonFrame(regMaskTP bbRegLiveIn);
@@ -475,6 +480,10 @@ class CodeGen final : public CodeGenInterface
 
     void genPopCalleeSavedRegisters(bool jmpEpilog = false);
 
+#if defined(TARGET_XARCH)
+    unsigned genPopCalleeSavedRegistersFromMask(regMaskTP rsPopRegs);
+#endif // defined(TARGET_XARCH)
+
 #endif // !defined(TARGET_ARM64)
 
     //
diff --git a/src/coreclr/jit/codegencommon.cpp b/src/coreclr/jit/codegencommon.cpp
index 17133f6e65540e..1b61d5b7a01211 100644
--- a/src/coreclr/jit/codegencommon.cpp
+++ b/src/coreclr/jit/codegencommon.cpp
@@ -5515,22 +5515,22 @@ void CodeGen::genFnProlog()
         psiBegProlog();
     }
 
-#if defined(TARGET_AMD64) || defined(TARGET_ARM64)
-    // For OSR there is a "phantom prolog" to account for the actions taken
+#if defined(TARGET_ARM64)
+    // For arm64 OSR, emit a "phantom prolog" to account for the actions taken
     // in the tier0 frame that impact FP and SP on entry to the OSR method.
+    //
+    // x64 handles this differently; the phantom prolog unwind is emitted in
+    // genOSRRecordTier0CalleeSavedRegistersAndFrame.
+    //
     if (compiler->opts.IsOSR())
     {
         PatchpointInfo* patchpointInfo = compiler->info.compPatchpointInfo;
         const int       tier0FrameSize = patchpointInfo->TotalFrameSize();
 
-#if defined(TARGET_AMD64)
-        // FP is tier0 method's FP.
-        compiler->unwindPush(REG_FPBASE);
-#endif
-
         // SP is tier0 method's SP.
         compiler->unwindAllocStack(tier0FrameSize);
     }
-#endif // defined(TARGET_AMD64) || defined(TARGET_ARM64)
+#endif // defined(TARGET_ARM64)
 
 #ifdef DEBUG
@@ -5807,6 +5807,14 @@ void CodeGen::genFnProlog()
     }
 #endif // TARGET_ARM
 
+    const bool isRoot = (compiler->funCurrentFunc()->funKind == FuncKind::FUNC_ROOT);
+
+#ifdef TARGET_AMD64
+    const bool isOSRx64Root = isRoot && compiler->opts.IsOSR();
+#else
+    const bool isOSRx64Root = false;
+#endif // TARGET_AMD64
+
     tempMask = initRegs & ~excludeMask & ~regSet.rsMaskResvd;
 
     if (tempMask != RBM_NONE)
@@ -5829,6 +5837,24 @@ void CodeGen::genFnProlog()
         }
     }
 
+#if defined(TARGET_AMD64)
+    // For x64 OSR root frames, we can't use as initReg any callee save
+    // that has not yet been saved, as we defer saving these until later
+    // in the prolog, and we don't have normal arg regs.
+    if (isOSRx64Root)
+    {
+        initReg = REG_SCRATCH; // REG_EAX
+    }
+#elif defined(TARGET_ARM64)
+    // For arm64 OSR root frames, we may need a scratch register for large
+    // offset addresses. Use a register that won't be allocated.
+    //
+    if (isRoot && compiler->opts.IsOSR())
+    {
+        initReg = REG_IP1;
+    }
+#endif
+
     noway_assert(!compiler->compMethodRequiresPInvokeFrame() || (initReg != REG_PINVOKE_FRAME));
 
 #if defined(TARGET_AMD64)
@@ -5863,11 +5889,49 @@ void CodeGen::genFnProlog()
     }
 #endif // TARGET_ARM
 
+    unsigned extraFrameSize = 0;
+
 #ifdef TARGET_XARCH
+
+#ifdef TARGET_AMD64
+    if (isOSRx64Root)
+    {
+        // Account for the Tier0 callee saves
+        //
+        genOSRRecordTier0CalleeSavedRegistersAndFrame();
+
+        // We don't actually push any callee saves on the OSR frame,
+        // but we still reserve space, so account for this when
+        // allocating the local frame.
+        //
+        extraFrameSize = compiler->compCalleeRegsPushed * REGSIZE_BYTES;
+    }
+#endif // TARGET_AMD64
+
     if (doubleAlignOrFramePointerUsed())
     {
-        inst_RV(INS_push, REG_FPBASE, TYP_REF);
-        compiler->unwindPush(REG_FPBASE);
+        // OSR methods handle "saving" FP specially.
+ // + // For epilog and unwind, we restore the RBP saved by the + // Tier0 method. The save we do here is just to set up a + // proper RBP-based frame chain link. + // + if (isOSRx64Root && isFramePointerUsed()) + { + GetEmitter()->emitIns_R_AR(INS_mov, EA_8BYTE, initReg, REG_FPBASE, 0); + inst_RV(INS_push, initReg, TYP_REF); + initRegZeroed = false; + + // We account for the SP movement in unwind, but not for + // the "save" of RBP. + // + compiler->unwindAllocStack(REGSIZE_BYTES); + } + else + { + inst_RV(INS_push, REG_FPBASE, TYP_REF); + compiler->unwindPush(REG_FPBASE); + } #ifdef USING_SCOPE_INFO psiAdjustStackLevel(REGSIZE_BYTES); #endif // USING_SCOPE_INFO @@ -5890,7 +5954,10 @@ void CodeGen::genFnProlog() #ifdef TARGET_ARM64 genPushCalleeSavedRegisters(initReg, &initRegZeroed); #else // !TARGET_ARM64 - genPushCalleeSavedRegisters(); + if (!isOSRx64Root) + { + genPushCalleeSavedRegisters(); + } #endif // !TARGET_ARM64 #ifdef TARGET_ARM @@ -5926,16 +5993,26 @@ void CodeGen::genFnProlog() regMaskTP maskStackAlloc = RBM_NONE; #ifdef TARGET_ARM - maskStackAlloc = - genStackAllocRegisterMask(compiler->compLclFrameSize, regSet.rsGetModifiedRegsMask() & RBM_FLT_CALLEE_SAVED); + maskStackAlloc = genStackAllocRegisterMask(compiler->compLclFrameSize + extraFrameSize, + regSet.rsGetModifiedRegsMask() & RBM_FLT_CALLEE_SAVED); #endif // TARGET_ARM if (maskStackAlloc == RBM_NONE) { - genAllocLclFrame(compiler->compLclFrameSize, initReg, &initRegZeroed, intRegState.rsCalleeRegArgMaskLiveIn); + genAllocLclFrame(compiler->compLclFrameSize + extraFrameSize, initReg, &initRegZeroed, + intRegState.rsCalleeRegArgMaskLiveIn); } #endif // !TARGET_ARM64 +#ifdef TARGET_AMD64 + // For x64 OSR we have to finish saving int callee saves. 
+    //
+    if (isOSRx64Root)
+    {
+        genOSRSaveRemainingCalleeSavedRegisters();
+    }
+#endif // TARGET_AMD64
+
     //-------------------------------------------------------------------------
 
 #ifdef TARGET_ARM
diff --git a/src/coreclr/jit/codegenxarch.cpp b/src/coreclr/jit/codegenxarch.cpp
index 120ddbb29a060b..d06fb32d564dd0 100644
--- a/src/coreclr/jit/codegenxarch.cpp
+++ b/src/coreclr/jit/codegenxarch.cpp
@@ -9091,6 +9091,158 @@ void CodeGen::genProfilingLeaveCallback(unsigned helper)
 
 #endif // PROFILING_SUPPORTED
 
+#ifdef TARGET_AMD64
+
+//------------------------------------------------------------------------
+// genOSRRecordTier0CalleeSavedRegistersAndFrame: for OSR methods, record the
+// subset of callee saves already saved by the Tier0 method, and the frame
+// created by Tier0.
+//
+void CodeGen::genOSRRecordTier0CalleeSavedRegistersAndFrame()
+{
+    assert(compiler->compGeneratingProlog);
+    assert(compiler->opts.IsOSR());
+    assert(compiler->funCurrentFunc()->funKind == FuncKind::FUNC_ROOT);
+
+#if ETW_EBP_FRAMED
+    if (!isFramePointerUsed() && regSet.rsRegsModified(RBM_FPBASE))
+    {
+        noway_assert(!"Used register RBM_FPBASE as a scratch register!");
+    }
+#endif
+
+    // Figure out which set of int callee saves was already saved by Tier0.
+    // Emit appropriate unwind.
+    //
+    PatchpointInfo* const patchpointInfo = compiler->info.compPatchpointInfo;
+    regMaskTP const tier0CalleeSaves     = (regMaskTP)patchpointInfo->CalleeSaveRegisters();
+    regMaskTP tier0IntCalleeSaves        = tier0CalleeSaves & RBM_OSR_INT_CALLEE_SAVED;
+    int const tier0IntCalleeSaveUsedSize = genCountBits(tier0IntCalleeSaves) * REGSIZE_BYTES;
+
+    JITDUMP("---OSR--- tier0 has already saved ");
+    JITDUMPEXEC(dspRegMask(tier0IntCalleeSaves));
+    JITDUMP("\n");
+
+    // We must account for the Tier0 callee saves.
+    //
+    // These have already happened at method entry; all these
+    // unwind records should be at offset 0.
+    //
+    // RBP is always saved by Tier0 and always pushed first.
+ // + assert((tier0IntCalleeSaves & RBM_FPBASE) == RBM_FPBASE); + compiler->unwindPush(REG_RBP); + tier0IntCalleeSaves &= ~RBM_FPBASE; + + // Now the rest of the Tier0 callee saves. + // + for (regNumber reg = REG_INT_LAST; tier0IntCalleeSaves != RBM_NONE; reg = REG_PREV(reg)) + { + regMaskTP regBit = genRegMask(reg); + + if ((regBit & tier0IntCalleeSaves) != 0) + { + compiler->unwindPush(reg); + } + tier0IntCalleeSaves &= ~regBit; + } + + // We must account for the post-callee-saves push SP movement + // done by the Tier0 frame and by the OSR transition. + // + // tier0FrameSize is the Tier0 FP-SP delta plus the fake call slot added by + // JIT_Patchpoint. We add one slot to account for the saved FP. + // + // We then need to subtract off the size the Tier0 callee saves as SP + // adjusts for those will have been modelled by the unwind pushes above. + // + int const tier0FrameSize = patchpointInfo->TotalFrameSize() + REGSIZE_BYTES; + int const tier0NetSize = tier0FrameSize - tier0IntCalleeSaveUsedSize; + compiler->unwindAllocStack(tier0NetSize); +} + +//------------------------------------------------------------------------ +// genOSRSaveRemainingCalleeSavedRegisters: save any callee save registers +// that Tier0 didn't save. +// +// Notes: +// This must be invoked after SP has been adjusted to allocate the local +// frame, because of how the UnwindSave records are interpreted. +// +// We rely on the fact that other "local frame" allocation actions (like +// stack probing) will not trash callee saves registers. +// +void CodeGen::genOSRSaveRemainingCalleeSavedRegisters() +{ + // We should be generating the prolog of an OSR root frame. + // + assert(compiler->compGeneratingProlog); + assert(compiler->opts.IsOSR()); + assert(compiler->funCurrentFunc()->funKind == FuncKind::FUNC_ROOT); + + // x86/x64 doesn't support push of xmm/ymm regs, therefore consider only integer registers for pushing onto stack + // here. 
Space for float registers to be preserved is stack allocated and saved as part of prolog sequence and not + // here. + regMaskTP rsPushRegs = regSet.rsGetModifiedRegsMask() & RBM_OSR_INT_CALLEE_SAVED; + +#if ETW_EBP_FRAMED + if (!isFramePointerUsed() && regSet.rsRegsModified(RBM_FPBASE)) + { + noway_assert(!"Used register RBM_FPBASE as a scratch register!"); + } +#endif + + // Figure out which set of int callee saves still needs saving. + // + PatchpointInfo* const patchpointInfo = compiler->info.compPatchpointInfo; + regMaskTP const tier0CalleeSaves = (regMaskTP)patchpointInfo->CalleeSaveRegisters(); + regMaskTP tier0IntCalleeSaves = tier0CalleeSaves & RBM_OSR_INT_CALLEE_SAVED; + unsigned const tier0IntCalleeSaveUsedSize = genCountBits(tier0IntCalleeSaves) * REGSIZE_BYTES; + regMaskTP const osrIntCalleeSaves = rsPushRegs & RBM_OSR_INT_CALLEE_SAVED; + regMaskTP osrAdditionalIntCalleeSaves = osrIntCalleeSaves & ~tier0IntCalleeSaves; + + JITDUMP("---OSR--- int callee saves are "); + JITDUMPEXEC(dspRegMask(osrIntCalleeSaves)); + JITDUMP("; tier0 already saved "); + JITDUMPEXEC(dspRegMask(tier0IntCalleeSaves)); + JITDUMP("; so only saving "); + JITDUMPEXEC(dspRegMask(osrAdditionalIntCalleeSaves)); + JITDUMP("\n"); + + // These remaining callee saves will be stored in the Tier0 callee save area + // below any saves already done by Tier0. Compute the offset. + // + // The OSR method doesn't actually use its callee save area. + // + int const osrFrameSize = compiler->compLclFrameSize; + int const tier0FrameSize = patchpointInfo->TotalFrameSize(); + int const osrCalleeSaveSize = compiler->compCalleeRegsPushed * REGSIZE_BYTES; + int const osrFramePointerSize = isFramePointerUsed() ? REGSIZE_BYTES : 0; + int offset = osrFrameSize + osrCalleeSaveSize + osrFramePointerSize + tier0FrameSize - tier0IntCalleeSaveUsedSize; + + // The tier0 frame is always an RBP frame, so the OSR method should never need to save RBP. 
+    //
+    assert((tier0CalleeSaves & RBM_FPBASE) == RBM_FPBASE);
+    assert((osrAdditionalIntCalleeSaves & RBM_FPBASE) == RBM_NONE);
+
+    // The OSR method must use MOVs to save additional callee saves.
+    //
+    for (regNumber reg = REG_INT_LAST; osrAdditionalIntCalleeSaves != RBM_NONE; reg = REG_PREV(reg))
+    {
+        regMaskTP regBit = genRegMask(reg);
+
+        if ((regBit & osrAdditionalIntCalleeSaves) != 0)
+        {
+            GetEmitter()->emitIns_AR_R(INS_mov, EA_8BYTE, reg, REG_SPBASE, offset);
+            compiler->unwindSaveReg(reg, offset);
+            offset -= REGSIZE_BYTES;
+        }
+        osrAdditionalIntCalleeSaves &= ~regBit;
+    }
+}
+
+#endif // TARGET_AMD64
+
 //------------------------------------------------------------------------
 // genPushCalleeSavedRegisters: Push any callee-saved registers we have used.
 //
@@ -9098,6 +9250,17 @@ void CodeGen::genPushCalleeSavedRegisters()
 {
     assert(compiler->compGeneratingProlog);
 
+#if DEBUG
+    // OSR root frames must handle this differently. See
+    //   genOSRRecordTier0CalleeSavedRegistersAndFrame()
+    //   genOSRSaveRemainingCalleeSavedRegisters()
+    //
+    if (compiler->opts.IsOSR())
+    {
+        assert(compiler->funCurrentFunc()->funKind != FuncKind::FUNC_ROOT);
+    }
+#endif
+
     // x86/x64 doesn't support push of xmm/ymm regs, therefore consider only integer registers for pushing onto stack
     // here. Space for float registers to be preserved is stack allocated and saved as part of prolog sequence and not
     // here.
@@ -9152,13 +9315,57 @@ void CodeGen::genPopCalleeSavedRegisters(bool jmpEpilog)
 {
     assert(compiler->compGeneratingEpilog);
 
+#ifdef TARGET_AMD64
+
+    const bool isFunclet                = compiler->funCurrentFunc()->funKind != FuncKind::FUNC_ROOT;
+    const bool doesSupersetOfNormalPops = compiler->opts.IsOSR() && !isFunclet;
+
+    // OSR methods must restore all registers saved by either the OSR or
+    // the Tier0 method. First restore any callee save not saved by
+    // Tier0, then the callee saves done by Tier0.
+    //
+    // OSR funclets do normal restores.
+ // + if (doesSupersetOfNormalPops) + { + regMaskTP rsPopRegs = regSet.rsGetModifiedRegsMask() & RBM_OSR_INT_CALLEE_SAVED; + regMaskTP tier0CalleeSaves = + ((regMaskTP)compiler->info.compPatchpointInfo->CalleeSaveRegisters()) & RBM_OSR_INT_CALLEE_SAVED; + regMaskTP additionalCalleeSaves = rsPopRegs & ~tier0CalleeSaves; + + // Registers saved by the OSR prolog. + // + genPopCalleeSavedRegistersFromMask(additionalCalleeSaves); + + // Registers saved by the Tier0 prolog. + // Tier0 frame pointer will be restored separately. + // + genPopCalleeSavedRegistersFromMask(tier0CalleeSaves & ~RBM_FPBASE); + return; + } + +#endif // TARGET_AMD64 + + // Registers saved by a normal prolog + // + regMaskTP rsPopRegs = regSet.rsGetModifiedRegsMask() & RBM_INT_CALLEE_SAVED; + const unsigned popCount = genPopCalleeSavedRegistersFromMask(rsPopRegs); + noway_assert(compiler->compCalleeRegsPushed == popCount); +} + +//------------------------------------------------------------------------ +// genPopCalleeSavedRegistersFromMask: pop specified set of callee saves +// in the "standard" order +// +unsigned CodeGen::genPopCalleeSavedRegistersFromMask(regMaskTP rsPopRegs) +{ unsigned popCount = 0; - if (regSet.rsRegsModified(RBM_EBX)) + if ((rsPopRegs & RBM_EBX) != 0) { popCount++; inst_RV(INS_pop, REG_EBX, TYP_I_IMPL); } - if (regSet.rsRegsModified(RBM_FPBASE)) + if ((rsPopRegs & RBM_FPBASE) != 0) { // EBP cannot be directly modified for EBP frame and double-aligned frames assert(!doubleAlignOrFramePointerUsed()); @@ -9169,12 +9376,12 @@ void CodeGen::genPopCalleeSavedRegisters(bool jmpEpilog) #ifndef UNIX_AMD64_ABI // For System V AMD64 calling convention ESI and EDI are volatile registers. 
- if (regSet.rsRegsModified(RBM_ESI)) + if ((rsPopRegs & RBM_ESI) != 0) { popCount++; inst_RV(INS_pop, REG_ESI, TYP_I_IMPL); } - if (regSet.rsRegsModified(RBM_EDI)) + if ((rsPopRegs & RBM_EDI) != 0) { popCount++; inst_RV(INS_pop, REG_EDI, TYP_I_IMPL); @@ -9182,22 +9389,22 @@ void CodeGen::genPopCalleeSavedRegisters(bool jmpEpilog) #endif // !defined(UNIX_AMD64_ABI) #ifdef TARGET_AMD64 - if (regSet.rsRegsModified(RBM_R12)) + if ((rsPopRegs & RBM_R12) != 0) { popCount++; inst_RV(INS_pop, REG_R12, TYP_I_IMPL); } - if (regSet.rsRegsModified(RBM_R13)) + if ((rsPopRegs & RBM_R13) != 0) { popCount++; inst_RV(INS_pop, REG_R13, TYP_I_IMPL); } - if (regSet.rsRegsModified(RBM_R14)) + if ((rsPopRegs & RBM_R14) != 0) { popCount++; inst_RV(INS_pop, REG_R14, TYP_I_IMPL); } - if (regSet.rsRegsModified(RBM_R15)) + if ((rsPopRegs & RBM_R15) != 0) { popCount++; inst_RV(INS_pop, REG_R15, TYP_I_IMPL); @@ -9209,7 +9416,7 @@ void CodeGen::genPopCalleeSavedRegisters(bool jmpEpilog) // space on stack in prolog sequence. PopCount is essentially // tracking the count of integer registers pushed. - noway_assert(compiler->compCalleeRegsPushed == popCount); + return popCount; } /***************************************************************************** @@ -9310,16 +9517,49 @@ void CodeGen::genFnEpilog(BasicBlock* block) // We have an ESP frame */ noway_assert(compiler->compLocallocUsed == false); // Only used with frame-pointer - /* Get rid of our local variables */ + unsigned int frameSize = compiler->compLclFrameSize; + +#ifdef TARGET_AMD64 + + // OSR must remove the entire OSR frame and the Tier0 frame down to the bottom + // of the used part of the Tier0 callee save area. + // + if (compiler->opts.IsOSR()) + { + // The patchpoint TotalFrameSize is SP-FP delta (plus "call" slot added by JIT_Patchpoint) + // so does not account for the Tier0 push of FP, so we add in an extra stack slot to get the + // offset to the top of the Tier0 callee saves area. 
+ // + PatchpointInfo* const patchpointInfo = compiler->info.compPatchpointInfo; + + regMaskTP const tier0CalleeSaves = (regMaskTP)patchpointInfo->CalleeSaveRegisters(); + regMaskTP const tier0IntCalleeSaves = tier0CalleeSaves & RBM_OSR_INT_CALLEE_SAVED; + regMaskTP const osrIntCalleeSaves = regSet.rsGetModifiedRegsMask() & RBM_OSR_INT_CALLEE_SAVED; + regMaskTP const allIntCalleeSaves = osrIntCalleeSaves | tier0IntCalleeSaves; + unsigned const tier0FrameSize = patchpointInfo->TotalFrameSize() + REGSIZE_BYTES; + unsigned const tier0IntCalleeSaveUsedSize = genCountBits(allIntCalleeSaves) * REGSIZE_BYTES; + unsigned const osrCalleeSaveSize = compiler->compCalleeRegsPushed * REGSIZE_BYTES; + unsigned const osrFramePointerSize = isFramePointerUsed() ? REGSIZE_BYTES : 0; + unsigned const osrAdjust = + tier0FrameSize - tier0IntCalleeSaveUsedSize + osrCalleeSaveSize + osrFramePointerSize; + + JITDUMP("OSR epilog adjust factors: tier0 frame %u, tier0 callee saves -%u, osr callee saves %u, osr " + "framePointer %u\n", + tier0FrameSize, tier0IntCalleeSaveUsedSize, osrCalleeSaveSize, osrFramePointerSize); + JITDUMP(" OSR frame size %u; net osr adjust %u, result %u\n", frameSize, osrAdjust, + frameSize + osrAdjust); + frameSize += osrAdjust; + } +#endif // TARGET_AMD64 - if (compiler->compLclFrameSize) + if (frameSize > 0) { #ifdef TARGET_X86 /* Add 'compiler->compLclFrameSize' to ESP */ /* Use pop ECX to increment ESP by 4, unless compiler->compJmpOpUsed is true */ - if ((compiler->compLclFrameSize == TARGET_POINTER_SIZE) && !compiler->compJmpOpUsed) + if ((frameSize == TARGET_POINTER_SIZE) && !compiler->compJmpOpUsed) { inst_RV(INS_pop, REG_ECX, TYP_I_IMPL); regSet.verifyRegUsed(REG_ECX); @@ -9329,7 +9569,7 @@ void CodeGen::genFnEpilog(BasicBlock* block) { /* Add 'compiler->compLclFrameSize' to ESP */ /* Generate "add esp, " */ - inst_RV_IV(INS_add, REG_SPBASE, compiler->compLclFrameSize, EA_PTRSIZE); + inst_RV_IV(INS_add, REG_SPBASE, frameSize, EA_PTRSIZE); } } @@ -9338,34 
+9578,22 @@ void CodeGen::genFnEpilog(BasicBlock* block) #ifdef TARGET_AMD64 // In the case where we have an RSP frame, and no frame pointer reported in the OS unwind info, // but we do have a pushed frame pointer and established frame chain, we do need to pop RBP. - if (doubleAlignOrFramePointerUsed()) - { - inst_RV(INS_pop, REG_EBP, TYP_I_IMPL); - } -#endif // TARGET_AMD64 - - // Extra OSR adjust to get to where RBP was saved by the tier0 frame to restore RBP. // - // Note the other callee saves made in that frame are dead, the OSR method - // will save and restore what it needs. - if (compiler->opts.IsOSR()) + // OSR methods must always pop RBP (pushed by Tier0 frame) + if (doubleAlignOrFramePointerUsed() || compiler->opts.IsOSR()) { - PatchpointInfo* const patchpointInfo = compiler->info.compPatchpointInfo; - const int tier0FrameSize = patchpointInfo->TotalFrameSize(); - - // Simply add since we know frame size is the SP-to-FP delta of the tier0 method plus - // the extra slot pushed by the runtime when we simulate calling the OSR method. - // - // If we ever support OSR from tier0 methods with localloc, this will need to change. - // - inst_RV_IV(INS_add, REG_SPBASE, tier0FrameSize, EA_PTRSIZE); inst_RV(INS_pop, REG_EBP, TYP_I_IMPL); } +#endif // TARGET_AMD64 } else { noway_assert(doubleAlignOrFramePointerUsed()); + // We don't support OSR for methods that must report an FP in unwind. 
+        //
+        assert(!compiler->opts.IsOSR());
+
         /* Tear down the stack frame */
 
         bool needMovEspEbp = false;
diff --git a/src/coreclr/jit/lclvars.cpp b/src/coreclr/jit/lclvars.cpp
index 40fc77c92d8628..87d50a0fc82cf5 100644
--- a/src/coreclr/jit/lclvars.cpp
+++ b/src/coreclr/jit/lclvars.cpp
@@ -5947,13 +5947,18 @@ int Compiler::lvaAssignVirtualFrameOffsetToArg(unsigned lclNum,
 }
 #endif // !UNIX_AMD64_ABI
 
-/*****************************************************************************
- *  lvaAssignVirtualFrameOffsetsToLocals() : Assign virtual stack offsets to
- *  locals, temps, and anything else. These will all be negative offsets
- *  (stack grows down) relative to the virtual '0'/return address
- */
+//-----------------------------------------------------------------------------
+// lvaAssignVirtualFrameOffsetsToLocals: compute the virtual stack offsets for
+// all elements on the stackframe.
+//
+// Notes:
+//   Can be called multiple times. Early calls can be used to estimate various
+//   frame offsets, but details may change.
+//
 void Compiler::lvaAssignVirtualFrameOffsetsToLocals()
 {
+    // (1) Account for things that are set up by the prolog and undone by the epilog.
+    //
     int stkOffs              = 0;
     int originalFrameStkOffs = 0;
     int originalFrameSize    = 0;
@@ -6069,10 +6074,42 @@ void Compiler::lvaAssignVirtualFrameOffsetsToLocals()
     stkOffs -= compCalleeRegsPushed * REGSIZE_BYTES;
 #endif // !TARGET_ARM64
 
+    // (2) Account for the remainder of the frame
+    //
+    // From this point on the code must generally adjust both
+    // stkOffs and the local frame size. The latter is done via:
+    //
+    //   lvaIncrementFrameSize -- for space not associated with a local var
+    //   lvaAllocLocalAndSetVirtualOffset -- for space associated with a local var
+    //
+    // One exception to the above: OSR locals that have offsets within the Tier0
+    // portion of the frame.
+ // compLclFrameSize = 0; #ifdef TARGET_AMD64 - // In case of Amd64 compCalleeRegsPushed includes float regs (Xmm6-xmm15) that + // For methods with patchpoints, the Tier0 method must reserve + // space for all the callee saves, as this area is shared with the + // OSR method, and we have to anticipate that collectively the + // Tier0 and OSR methods end up saving all callee saves. + // + // Currently this is x64 only. + // + if (doesMethodHavePatchpoints() || doesMethodHavePartialCompilationPatchpoints()) + { + const unsigned regsPushed = compCalleeRegsPushed + (codeGen->isFramePointerUsed() ? 1 : 0); + const unsigned extraSlots = genCountBits(RBM_OSR_INT_CALLEE_SAVED) - regsPushed; + const unsigned extraSlotSize = extraSlots * REGSIZE_BYTES; + + JITDUMP("\nMethod has patchpoints and has %u callee saves.\n" + "Reserving %u extra slots (%u bytes) for potential OSR method callee saves\n", + regsPushed, extraSlots, extraSlotSize); + + stkOffs -= extraSlotSize; + lvaIncrementFrameSize(extraSlotSize); + } + + // In case of Amd64 compCalleeRegsPushed does not include float regs (Xmm6-xmm15) that // need to be pushed. But Amd64 doesn't support push/pop of xmm registers. // Instead we need to allocate space for them on the stack and save them in prolog. // Therefore, we consider xmm registers being saved while computing stack offsets @@ -6087,7 +6124,7 @@ void Compiler::lvaAssignVirtualFrameOffsetsToLocals() // xmm registers to/from stack to match Jit64 codegen. Without the aligning on 16-byte // boundary we would have to use movups when offset turns out unaligned. Movaps is more // performant than movups. - unsigned calleeFPRegsSavedSize = genCountBits(compCalleeFPRegsSavedMask) * XMM_REGSIZE_BYTES; + const unsigned calleeFPRegsSavedSize = genCountBits(compCalleeFPRegsSavedMask) * XMM_REGSIZE_BYTES; // For OSR the alignment pad computation should not take the original frame into account. // Original frame size includes the pseudo-saved RA and so is always = 8 mod 16. 
diff --git a/src/coreclr/jit/targetamd64.h b/src/coreclr/jit/targetamd64.h index 50ced88dcb88a3..804dc4be814fc7 100644 --- a/src/coreclr/jit/targetamd64.h +++ b/src/coreclr/jit/targetamd64.h @@ -133,6 +133,8 @@ #define RBM_FLT_CALLEE_TRASH (RBM_XMM0|RBM_XMM1|RBM_XMM2|RBM_XMM3|RBM_XMM4|RBM_XMM5) #endif // !UNIX_AMD64_ABI + #define RBM_OSR_INT_CALLEE_SAVED (RBM_INT_CALLEE_SAVED | RBM_EBP) + #define REG_FLT_CALLEE_SAVED_FIRST REG_XMM6 #define REG_FLT_CALLEE_SAVED_LAST REG_XMM15
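For reference, `RBM_OSR_INT_CALLEE_SAVED` above is the integer callee-save set an OSR method may need restored: `RBM_INT_CALLEE_SAVED` plus RBP. On Windows x64 that is RBX, RBP, RSI, RDI, and R12-R15, i.e. 8 registers, which is where the maximal 64-byte Tier0 save-area reservation comes from. The extra space a Tier0 method with patchpoints reserves (per the lclvars.cpp change above) can be sketched as follows, using hypothetical stand-in names rather than the JIT's actual code:

```cpp
#include <cstdint>

// Count set bits in a register mask (stand-in for genCountBits).
unsigned popCount(uint64_t mask)
{
    unsigned n = 0;
    for (; mask != 0; mask &= mask - 1)
    {
        n++;
    }
    return n;
}

// Extra bytes a Tier0 method with patchpoints reserves so its save area can
// hold the maximal OSR-relevant integer callee-save set.
// osrIntCalleeSavedMask: illustrative stand-in for RBM_OSR_INT_CALLEE_SAVED.
// calleeRegsPushed:      callee saves the Tier0 prolog actually pushes
//                        (not counting the frame pointer).
unsigned tier0ExtraSaveAreaBytes(uint64_t osrIntCalleeSavedMask, unsigned calleeRegsPushed, bool framePointerUsed)
{
    const unsigned REGSIZE_BYTES = 8;
    unsigned regsPushed = calleeRegsPushed + (framePointerUsed ? 1 : 0);
    unsigned extraSlots = popCount(osrIntCalleeSavedMask) - regsPushed;
    return extraSlots * REGSIZE_BYTES;
}
```

With an 8-register mask and a Tier0 frame that pushed RBP plus one callee save (RSI, as in the example), this reserves 48 extra bytes -- matching the Tier0 prolog's `sub rsp` growing from 56 to 104 in the new-approach example.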