Don't use an array when a simple shift/bit scan will do #79493

tannergooding · 2022-12-10T16:20:10Z

instrDesc::idOpSize() and instrDesc::idOpSize(emitAttr opsz) were using an array lookup when a simple shift or bit scan would have sufficed since all inputs are powers of two.

On MSVC, the codegen changes from:

?idOpSize@instrDesc@emitter@@QEBA?AW4emitAttr@@XZ: ; instrDesc::idOpSize()
  00000001800F0AA0: 8B 01              mov         eax,dword ptr [rcx]
  00000001800F0AA2: 48 8D 0D F7 8C 05  lea         rcx,[?emitSizeDecode@emitter@@1QBW4emitAttr@@B]
                    00
  00000001800F0AA9: 48 C1 E8 15        shr         rax,15h
  00000001800F0AAD: 83 E0 07           and         eax,7
  00000001800F0AB0: 8B 04 81           mov         eax,dword ptr [rcx+rax*4]
  00000001800F0AB3: C3                 ret

?idOpSize@instrDesc@emitter@@QEAAXW4emitAttr@@@Z: ; instrDesc::idOpSize(emitAttr opsz)
  0000000180114EC0: 81 21 FF FF 1F FF  and         dword ptr [rcx],0FF1FFFFFh
  0000000180114EC6: 48 63 C2           movsxd      rax,edx
  0000000180114EC9: 48 8D 15 50 48 03  lea         rdx,[?emitSizeEncode@emitter@@1QBW4opSize@1@B]
                    00
  0000000180114ED0: 8B 54 82 FC        mov         edx,dword ptr [rdx+rax*4-4]
  0000000180114ED4: 83 E2 07           and         edx,7
  0000000180114ED7: C1 E2 15           shl         edx,15h
  0000000180114EDA: 09 11              or          dword ptr [rcx],edx
  0000000180114EDC: C3                 ret

To:

?idOpSize@instrDesc@emitter@@QEBA?AW4emitAttr@@XZ: ; instrDesc::idOpSize()
  00000001800F0A70: 8B 09              mov         ecx,dword ptr [rcx]
  00000001800F0A72: B8 01 00 00 00     mov         eax,1
  00000001800F0A77: C1 E9 15           shr         ecx,15h
  00000001800F0A7A: 83 E1 07           and         ecx,7
  00000001800F0A7D: D3 E0              shl         eax,cl
  00000001800F0A7F: C3                 ret

?idOpSize@instrDesc@emitter@@QEAAXW4emitAttr@@@Z: ; instrDesc::idOpSize(emitAttr opsz)
  0000000180114E70: 81 21 FF FF 1F FF  and         dword ptr [rcx],0FF1FFFFFh
  0000000180114E76: 0F BC C2           bsf         eax,edx
  0000000180114E79: 83 E0 07           and         eax,7
  0000000180114E7C: C1 E0 15           shl         eax,15h
  0000000180114E7F: 09 01              or          dword ptr [rcx],eax
  0000000180114E81: C3                 ret

As you can see, for idOpSize rather than doing 3 memory lookups we now just do the one lookup and a shl. Likewise, for idOpSize(emitAttr) rather than doing 4 memory accesses we now just do 2 with a simple bsf (effectively a lzcnt).

The memory accesses, even in the best case scenario where they were cached, ended up being fairly complex both in terms of the addressing modes required to resolve them but also in the number of indirections required to access the data. Each indirection, even in the case of L1 data would take approx 4 cycles to resolve. shl and bsf in comparison are both highly optimized instructions that typically take 1-4 cycles to compute (depending on target CPU).

This also marks idOpSize() as const so the compiler can understand it is non-mutating and only reading a field.

ghost · 2022-12-10T16:20:21Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

instrDesc::idOpSize() and instrDesc::idOpSize(emitAttr opsz) were using an array lookup when a simple shift or bit scan would have sufficed since all inputs are powers of two.

On MSVC, the codegen changes from:

?idOpSize@instrDesc@emitter@@QEBA?AW4emitAttr@@XZ: ; instrDesc::idOpSize()
  00000001800F0AA0: 8B 01              mov         eax,dword ptr [rcx]
  00000001800F0AA2: 48 8D 0D F7 8C 05  lea         rcx,[?emitSizeDecode@emitter@@1QBW4emitAttr@@B]
                    00
  00000001800F0AA9: 48 C1 E8 15        shr         rax,15h
  00000001800F0AAD: 83 E0 07           and         eax,7
  00000001800F0AB0: 8B 04 81           mov         eax,dword ptr [rcx+rax*4]
  00000001800F0AB3: C3                 ret

?idOpSize@instrDesc@emitter@@QEAAXW4emitAttr@@@Z: ; instrDesc::idOpSize(emitAttr opsz)
  0000000180114EC0: 81 21 FF FF 1F FF  and         dword ptr [rcx],0FF1FFFFFh
  0000000180114EC6: 48 63 C2           movsxd      rax,edx
  0000000180114EC9: 48 8D 15 50 48 03  lea         rdx,[?emitSizeEncode@emitter@@1QBW4opSize@1@B]
                    00
  0000000180114ED0: 8B 54 82 FC        mov         edx,dword ptr [rdx+rax*4-4]
  0000000180114ED4: 83 E2 07           and         edx,7
  0000000180114ED7: C1 E2 15           shl         edx,15h
  0000000180114EDA: 09 11              or          dword ptr [rcx],edx
  0000000180114EDC: C3                 ret

To:

?idOpSize@instrDesc@emitter@@QEBA?AW4emitAttr@@XZ: ; instrDesc::idOpSize()
  00000001800F0A70: 8B 09              mov         ecx,dword ptr [rcx]
  00000001800F0A72: B8 01 00 00 00     mov         eax,1
  00000001800F0A77: C1 E9 15           shr         ecx,15h
  00000001800F0A7A: 83 E1 07           and         ecx,7
  00000001800F0A7D: D3 E0              shl         eax,cl
  00000001800F0A7F: C3                 ret

?idOpSize@instrDesc@emitter@@QEAAXW4emitAttr@@@Z: ; instrDesc::idOpSize(emitAttr opsz)
  0000000180114E70: 81 21 FF FF 1F FF  and         dword ptr [rcx],0FF1FFFFFh
  0000000180114E76: 0F BC C2           bsf         eax,edx
  0000000180114E79: 83 E0 07           and         eax,7
  0000000180114E7C: C1 E0 15           shl         eax,15h
  0000000180114E7F: 09 01              or          dword ptr [rcx],eax
  0000000180114E81: C3                 ret

As you can see, for idOpSize rather than doing 3 memory lookups we now just do the one lookup and a shl. Likewise, for idOpSize(emitAttr) rather than doing 4 memory accesses we now just do 2 with a simple bsf (effectively a lzcnt).

The memory accesses, even in the best case scenario where they were cached, ended up being fairly complex both in terms of the addressing modes required to resolve them but also in the number of indirections required to access the data. Each indirection, even in the case of L1 data would take approx 4 cycles to resolve. shl and bsf in comparison are both highly optimized instructions that typically take 1-4 cycles to compute (depending on target CPU).

This also marks idOpSize() as const so the compiler can understand it is non-mutating and only reading a field.

Author:	tannergooding
Assignees:	tannergooding
Labels:	`area-CodeGen-coreclr`
Milestone:	-

tannergooding · 2022-12-10T19:00:24Z

Hmmm, this helped Linux x86 but hurt other platforms. Was unexpected based on local perf checks.

Don't use an array when a simple shift/bit scan will do

a421f3e

ghost assigned tannergooding Dec 10, 2022

ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 10, 2022

tannergooding mentioned this pull request Dec 10, 2022

This fixes emitGetVexPrefixSize to support detecting 2-byte prefix support #79478

Merged

tannergooding closed this Dec 10, 2022

build-analysis bot mentioned this pull request Dec 10, 2022

Precondition failure: File has not had execution verified #79439

Closed

ghost locked as resolved and limited conversation to collaborators Jan 9, 2023

tannergooding deleted the idOpSize branch July 1, 2025 14:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Don't use an array when a simple shift/bit scan will do #79493

Don't use an array when a simple shift/bit scan will do #79493

Uh oh!

tannergooding commented Dec 10, 2022

Uh oh!

ghost commented Dec 10, 2022

Uh oh!

tannergooding commented Dec 10, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Don't use an array when a simple shift/bit scan will do #79493

Don't use an array when a simple shift/bit scan will do #79493

Uh oh!

Conversation

tannergooding commented Dec 10, 2022

Uh oh!

ghost commented Dec 10, 2022

Uh oh!

tannergooding commented Dec 10, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant