Skip to content

Conversation

@kunalspathak
Copy link
Contributor

Noticed that we use slow way of calculating bits set in a mask instead of PopCount. Unix TP should improve with this, but need to see how Windows react, because in windows, we don't have popcount and do bit twiddling that involves multiplication. This might prove slightly expensive for inputs that has smaller number of bits set.

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 20, 2023
@ghost ghost assigned kunalspathak Mar 20, 2023
@ghost
Copy link

ghost commented Mar 20, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Noticed that we use slow way of calculating bits set in a mask instead of PopCount. Unix TP should improve with this, but need to see how Windows react, because in windows, we don't have popcount and do bit twiddling that involves multiplication. This might prove slightly expensive for inputs that has smaller number of bits set.

Author: kunalspathak
Assignees: kunalspathak
Labels:

area-CodeGen-coreclr

Milestone: -

@kunalspathak
Copy link
Contributor Author

Running E:\git\runtime\dotnet.cmd E:\git\runtime\artifacts\bin\coreclr\windows.x64.Release\crossgen2\crossgen2.dll -o:E:\git\runtime\artifacts\bin\coreclr\windows.x64.Release\System.Private.CoreLib.dll -r:E:\git\runtime\artifacts\bin\coreclr\windows.x64.Release\IL\*.dll --targetarch:x64 --targetos:windows -m:E:\git\runtime\artifacts\bin\coreclr\windows.x64.Release\StandardOptimizationData.mibc --embed-pgo-data -O E:\git\runtime\artifacts\bin\coreclr\windows.x64.Release\IL\System.Private.CoreLib.dll --pdb --pdb-path:E:\git\runtime\artifacts\bin\coreclr\windows.x64.Release\PDB , I see some difference, although since this is on my dev machine, it can have errors:

base: 4.97 secs
diff: 4.601 secs

@tannergooding - what do you think?

@tannergooding
Copy link
Member

I expect it will be more performant on average.

Were you able to get a profile via Intel VTune or AMD uProf which could help highlight the execution time percentage change for functions using this API?

*/

return cnt;
inline unsigned genCountBits(uint32_t bits)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not just use BitOperations::PopCount directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting some compilation errors but didn't explore other options. I will double check before merging.

@kunalspathak
Copy link
Contributor Author

Were you able to get a profile via Intel VTune or AMD uProf which could help highlight the execution time percentage change for functions using this API?

Overall I see decent numbers (left = main, right = PR)
image

But when I compare a caller, it doesn't look great and they are greyed too.

image

However, I also see the difference for something that is untouched like BuildCall which has call to genCountBits() in an assert, so technically not present in Release mode.

image

@kunalspathak kunalspathak marked this pull request as ready for review March 21, 2023 02:08
@kunalspathak kunalspathak merged commit beab6cd into dotnet:main Mar 21, 2023
@kunalspathak kunalspathak deleted the lsra_popcount branch March 21, 2023 02:08
@ghost ghost locked as resolved and limited conversation to collaborators Apr 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants