Skip to content

Conversation

@tannergooding
Copy link
Member

instrDesc::idOpSize() and instrDesc::idOpSize(emitAttr opsz) were using an array lookup when a simple shift or bit scan would have sufficed since all inputs are powers of two.

On MSVC, the codegen changes from:

?idOpSize@instrDesc@emitter@@QEBA?AW4emitAttr@@XZ: ; instrDesc::idOpSize()
  00000001800F0AA0: 8B 01              mov         eax,dword ptr [rcx]
  00000001800F0AA2: 48 8D 0D F7 8C 05  lea         rcx,[?emitSizeDecode@emitter@@1QBW4emitAttr@@B]
                    00
  00000001800F0AA9: 48 C1 E8 15        shr         rax,15h
  00000001800F0AAD: 83 E0 07           and         eax,7
  00000001800F0AB0: 8B 04 81           mov         eax,dword ptr [rcx+rax*4]
  00000001800F0AB3: C3                 ret

?idOpSize@instrDesc@emitter@@QEAAXW4emitAttr@@@Z: ; instrDesc::idOpSize(emitAttr opsz)
  0000000180114EC0: 81 21 FF FF 1F FF  and         dword ptr [rcx],0FF1FFFFFh
  0000000180114EC6: 48 63 C2           movsxd      rax,edx
  0000000180114EC9: 48 8D 15 50 48 03  lea         rdx,[?emitSizeEncode@emitter@@1QBW4opSize@1@B]
                    00
  0000000180114ED0: 8B 54 82 FC        mov         edx,dword ptr [rdx+rax*4-4]
  0000000180114ED4: 83 E2 07           and         edx,7
  0000000180114ED7: C1 E2 15           shl         edx,15h
  0000000180114EDA: 09 11              or          dword ptr [rcx],edx
  0000000180114EDC: C3                 ret

To:

?idOpSize@instrDesc@emitter@@QEBA?AW4emitAttr@@XZ: ; instrDesc::idOpSize()
  00000001800F0A70: 8B 09              mov         ecx,dword ptr [rcx]
  00000001800F0A72: B8 01 00 00 00     mov         eax,1
  00000001800F0A77: C1 E9 15           shr         ecx,15h
  00000001800F0A7A: 83 E1 07           and         ecx,7
  00000001800F0A7D: D3 E0              shl         eax,cl
  00000001800F0A7F: C3                 ret

?idOpSize@instrDesc@emitter@@QEAAXW4emitAttr@@@Z: ; instrDesc::idOpSize(emitAttr opsz)
  0000000180114E70: 81 21 FF FF 1F FF  and         dword ptr [rcx],0FF1FFFFFh
  0000000180114E76: 0F BC C2           bsf         eax,edx
  0000000180114E79: 83 E0 07           and         eax,7
  0000000180114E7C: C1 E0 15           shl         eax,15h
  0000000180114E7F: 09 01              or          dword ptr [rcx],eax
  0000000180114E81: C3                 ret

As you can see, for idOpSize rather than doing 3 memory lookups we now just do the one lookup and a shl. Likewise, for idOpSize(emitAttr) rather than doing 4 memory accesses we now just do 2 with a simple bsf (effectively a lzcnt).

The memory accesses, even in the best case scenario where they were cached, ended up being fairly complex both in terms of the addressing modes required to resolve them but also in the number of indirections required to access the data. Each indirection, even in the case of L1 data would take approx 4 cycles to resolve. shl and bsf in comparison are both highly optimized instructions that typically take 1-4 cycles to compute (depending on target CPU).

This also marks idOpSize() as const so the compiler can understand it is non-mutating and only reading a field.

@ghost ghost assigned tannergooding Dec 10, 2022
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 10, 2022
@ghost
Copy link

ghost commented Dec 10, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

instrDesc::idOpSize() and instrDesc::idOpSize(emitAttr opsz) were using an array lookup when a simple shift or bit scan would have sufficed since all inputs are powers of two.

On MSVC, the codegen changes from:

?idOpSize@instrDesc@emitter@@QEBA?AW4emitAttr@@XZ: ; instrDesc::idOpSize()
  00000001800F0AA0: 8B 01              mov         eax,dword ptr [rcx]
  00000001800F0AA2: 48 8D 0D F7 8C 05  lea         rcx,[?emitSizeDecode@emitter@@1QBW4emitAttr@@B]
                    00
  00000001800F0AA9: 48 C1 E8 15        shr         rax,15h
  00000001800F0AAD: 83 E0 07           and         eax,7
  00000001800F0AB0: 8B 04 81           mov         eax,dword ptr [rcx+rax*4]
  00000001800F0AB3: C3                 ret

?idOpSize@instrDesc@emitter@@QEAAXW4emitAttr@@@Z: ; instrDesc::idOpSize(emitAttr opsz)
  0000000180114EC0: 81 21 FF FF 1F FF  and         dword ptr [rcx],0FF1FFFFFh
  0000000180114EC6: 48 63 C2           movsxd      rax,edx
  0000000180114EC9: 48 8D 15 50 48 03  lea         rdx,[?emitSizeEncode@emitter@@1QBW4opSize@1@B]
                    00
  0000000180114ED0: 8B 54 82 FC        mov         edx,dword ptr [rdx+rax*4-4]
  0000000180114ED4: 83 E2 07           and         edx,7
  0000000180114ED7: C1 E2 15           shl         edx,15h
  0000000180114EDA: 09 11              or          dword ptr [rcx],edx
  0000000180114EDC: C3                 ret

To:

?idOpSize@instrDesc@emitter@@QEBA?AW4emitAttr@@XZ: ; instrDesc::idOpSize()
  00000001800F0A70: 8B 09              mov         ecx,dword ptr [rcx]
  00000001800F0A72: B8 01 00 00 00     mov         eax,1
  00000001800F0A77: C1 E9 15           shr         ecx,15h
  00000001800F0A7A: 83 E1 07           and         ecx,7
  00000001800F0A7D: D3 E0              shl         eax,cl
  00000001800F0A7F: C3                 ret

?idOpSize@instrDesc@emitter@@QEAAXW4emitAttr@@@Z: ; instrDesc::idOpSize(emitAttr opsz)
  0000000180114E70: 81 21 FF FF 1F FF  and         dword ptr [rcx],0FF1FFFFFh
  0000000180114E76: 0F BC C2           bsf         eax,edx
  0000000180114E79: 83 E0 07           and         eax,7
  0000000180114E7C: C1 E0 15           shl         eax,15h
  0000000180114E7F: 09 01              or          dword ptr [rcx],eax
  0000000180114E81: C3                 ret

As you can see, for idOpSize rather than doing 3 memory lookups we now just do the one lookup and a shl. Likewise, for idOpSize(emitAttr) rather than doing 4 memory accesses we now just do 2 with a simple bsf (effectively a lzcnt).

The memory accesses, even in the best case scenario where they were cached, ended up being fairly complex both in terms of the addressing modes required to resolve them but also in the number of indirections required to access the data. Each indirection, even in the case of L1 data would take approx 4 cycles to resolve. shl and bsf in comparison are both highly optimized instructions that typically take 1-4 cycles to compute (depending on target CPU).

This also marks idOpSize() as const so the compiler can understand it is non-mutating and only reading a field.

Author: tannergooding
Assignees: tannergooding
Labels:

area-CodeGen-coreclr

Milestone: -

@tannergooding
Copy link
Member Author

Hmmm, this helped Linux x86 but hurt other platforms. Was unexpected based on local perf checks.

@ghost ghost locked as resolved and limited conversation to collaborators Jan 9, 2023
@tannergooding tannergooding deleted the idOpSize branch July 1, 2025 14:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant