LSRA: Add support to track more than 64 registers #99658

@kunalspathak

Description

Currently, LSRA supports the following number of registers on each platform:

| Architecture | # of General purpose | # of Float/Vector | # of Predicate/Mask | Total registers |
|---|---|---|---|---|
| x86 | 8 | 8 | 8 | 24 |
| x64 | 16 | 32 | 8 | **56 |
| arm | 16 | 32 | 0 | 48 |
| arm64 | 32 | 32 | *16 | 80 |

*16 new predicate registers for SVE
** APX will add 16 new GPR registers, which will bring the x64 total to 72

Until now, no platform had more than 64 registers, so we used the regMaskTP data structure (typedef unsigned __int64) to represent them and pass them around. Throughout the RyuJIT codebase, whenever we have to pass, return or track a pool of registers, we use regMaskTP. It is 64 bits long and each bit represents one register: the lower bits represent GPR registers, while the higher bits represent float/vector registers. However, with #93095 adding SVE support for arm64, we need to add 16 predicate/mask registers, which raises the number of registers to be tracked from 64 to 80. They no longer fit in regMaskTP, so we need an alternate representation that can track not only the 16 new predicate registers needed for the SVE/arm64 work, but also the 16 new GPR registers that will be added for the APX/x64 work.
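To make the one-bit-per-register scheme concrete, here is a minimal sketch. The register numbering (0..31 GPR, 32..63 float/vector) and the helpers regBit, isFloatReg and fitsInMask are assumptions for illustration only; they are not the JIT's actual definitions, and the real layout differs per target.

```cpp
#include <cassert>
#include <cstdint>

// One bit per register, as described above.
using regMaskTP = std::uint64_t;

// Assumed layout for this sketch: registers 0..31 are GPR,
// registers 32..63 are float/vector.
constexpr unsigned kFirstFloatReg  = 32;
constexpr unsigned kMaxTrackedRegs = 64;

// Single-bit mask for a register number (valid only for regNum < 64).
constexpr regMaskTP regBit(unsigned regNum)
{
    return regMaskTP{1} << regNum;
}

// With the assumed layout, the higher bits belong to float/vector registers.
constexpr bool isFloatReg(unsigned regNum)
{
    return regNum >= kFirstFloatReg;
}

// True when every register of a target fits in a single 64-bit mask.
constexpr bool fitsInMask(unsigned totalRegs)
{
    return totalRegs <= kMaxTrackedRegs;
}

static_assert(fitsInMask(56), "x64 today: 16 + 32 + 8");
static_assert(!fitsInMask(72), "x64 once APX adds 16 GPRs");
static_assert(!fitsInMask(80), "arm64 with 16 SVE predicate registers");
```

A pool of registers is then just the bitwise OR of the individual bits, for example regBit(3) | regBit(35), which is why a 64-bit integer stops being enough as soon as a target has more than 64 registers.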

Here are a few options to solve this problem:

1. Convert regMaskTP to struct

The first option that we tried in #96196 was to just convert the regMaskTP to a struct that looks something like this:

typedef struct __regMaskTP
{
  unsigned __int64 low;
  unsigned __int64 high;
} regMaskTP;

To avoid refactoring all the code paths that use regMaskTP, we overloaded various operators for this struct. We found that this regressed throughput by around 10% for MinOpts and 7% for FullOpts, as seen here.
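For illustration, the operator overloading could look roughly like the sketch below. The particular set of operators and the helpers regBit and isEmpty are assumptions, not the code from #96196.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a two-word mask; field names follow the struct above.
struct regMaskTP
{
    std::uint64_t low;
    std::uint64_t high;
};

// Each operator applies the 64-bit operation to both halves, which is
// roughly twice the work of the plain integer version.
constexpr regMaskTP operator|(regMaskTP a, regMaskTP b)
{
    return {a.low | b.low, a.high | b.high};
}

constexpr regMaskTP operator&(regMaskTP a, regMaskTP b)
{
    return {a.low & b.low, a.high & b.high};
}

constexpr regMaskTP operator~(regMaskTP a)
{
    return {~a.low, ~a.high};
}

constexpr bool operator==(regMaskTP a, regMaskTP b)
{
    return a.low == b.low && a.high == b.high;
}

inline regMaskTP& operator|=(regMaskTP& a, regMaskTP b)
{
    a.low |= b.low;
    a.high |= b.high;
    return a;
}

// Hypothetical helper: single-bit mask for a register number below 128.
constexpr regMaskTP regBit(unsigned regNum)
{
    return (regNum < 64) ? regMaskTP{std::uint64_t{1} << regNum, 0}
                         : regMaskTP{0, std::uint64_t{1} << (regNum - 64)};
}

// Hypothetical helper: true when no register is in the mask.
constexpr bool isEmpty(regMaskTP m)
{
    return (m.low | m.high) == 0;
}
```

Existing call sites such as mask1 | mask2 keep compiling unchanged, but every operation now touches two words, which is one plausible contributor to the regression reported above.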

2. Convert regMaskTP to intrinsic vector128

The next option, which we explored in #94589, was to use unsigned __int128 (only supported by clang on linux), which uses vector128 under the hood. Our assumption was that the compiler could optimize the access pattern of regMaskTP_128 and that we would see a smaller throughput (TP) impact than with option 1. However, when we cross-compiled this change on linux/x64, we started seeing a lot of segfaults in places where regMaskTP was initialized. The problem, as mentioned here, was that __int128 assumed the regMaskTP field to be 16-byte aligned and would segfault whenever that was not the case. So, we had to give up on this option.
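The alignment requirement can be checked without ever touching misaligned memory. The sketch below assumes a GCC/Clang toolchain (unsigned __int128 is a compiler extension) and uses a hypothetical helper, isSuitablyAligned. How the misaligned storage arose in the JIT is not stated here, so the 8-byte offset in the usage note is only an example.

```cpp
#include <cassert>
#include <cstdint>

// __extension__ keeps pedantic builds quiet about the non-standard type.
__extension__ typedef unsigned __int128 regMaskTP_128;

// Hypothetical helper: true when p satisfies the alignment the compiler
// assumes for regMaskTP_128 (16 bytes on linux/x64).
inline bool isSuitablyAligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(regMaskTP_128) == 0;
}
```

Given a 16-byte aligned buffer, isSuitablyAligned(buffer) holds while isSuitablyAligned(buffer + 8) does not. Storage that is only 8-byte aligned is fine for a 64-bit regMaskTP, but the compiler is free to emit aligned vector loads and stores for the 128-bit type, and those fault on such an address.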

3. Segregate the gpr/float/predicate registers usage

This is a WIP that I am currently working on in #98258. I created regMaskGpr, regMaskFloat and regMaskPredicate, then went through the code base and used the relevant type in each place. For places where any register pool is accessed, I created a struct AllRegMask that looks like this:

struct AllRegMask
{
  regMaskGpr gprRegs;
  regMaskFloat floatRegs;
  regMaskPredicate predicateRegs;
};

In the code, I then pass AllRegMask around, and whenever we have to update (add/remove/track) a register in the mask, I check whether the register in question is GPR/float/predicate and update the relevant field accordingly. Currently, the TP impact of this is a regression of around 6% in MinOpts and 2% in FullOpts. This is much better than option 1, but still not acceptable.
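The per-update check described above might look like the following sketch. The register numbering (0..31 GPR, 32..63 float, 64..79 predicate), the RegKind enum, and the add/remove/contains helpers are all assumptions for illustration; they are not taken from #98258.

```cpp
#include <cassert>
#include <cstdint>

using regMaskGpr       = std::uint64_t;
using regMaskFloat     = std::uint64_t;
using regMaskPredicate = std::uint64_t;

enum class RegKind { Gpr, Float, Predicate };

// Assumed numbering: 0..31 GPR, 32..63 float/vector, 64..79 predicate.
constexpr RegKind kindOf(unsigned regNum)
{
    return regNum < 32 ? RegKind::Gpr
         : regNum < 64 ? RegKind::Float
                       : RegKind::Predicate;
}

// Bit of the register within the mask of its own class.
constexpr std::uint64_t bitInClass(unsigned regNum)
{
    return std::uint64_t{1} << (regNum < 32 ? regNum
                              : regNum < 64 ? regNum - 32
                                            : regNum - 64);
}

struct AllRegMask
{
    regMaskGpr       gprRegs       = 0;
    regMaskFloat     floatRegs     = 0;
    regMaskPredicate predicateRegs = 0;

    // Every update first decides which field the register belongs to.
    std::uint64_t& fieldFor(unsigned regNum)
    {
        switch (kindOf(regNum))
        {
            case RegKind::Gpr:   return gprRegs;
            case RegKind::Float: return floatRegs;
            default:             return predicateRegs;
        }
    }

    void add(unsigned regNum)      { fieldFor(regNum) |= bitInClass(regNum); }
    void remove(unsigned regNum)   { fieldFor(regNum) &= ~bitInClass(regNum); }
    bool contains(unsigned regNum) { return (fieldFor(regNum) & bitInClass(regNum)) != 0; }
};
```

The extra branch on the register kind for each update is the kind of overhead that could account for part of the measured regression, although the issue does not break the cost down.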

4. Track predicate registers separately

Another option is to track just the predicate registers separately and pass them around. There are not many places where we need to track them. The GPR/float registers will continue to be represented as the 64-bit regMaskTP, and predicate registers will be tracked separately on platforms that have more than 64 registers (SVE/arm64 and, in future, APX/x64). The downside is that when the number of GPR + float registers goes beyond 64, we will have to fall back to option 3. The other drawback of this approach is that many data structures, most notably GenTree*, RefPosition and Interval, have regMaskTP as a field. Adding another field for "predicate registers" will consume more memory in these data structures, so a union plus a bit indicating whether the regMaskTP holds gpr+float or predicate registers might do the trick.
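The union-plus-bit idea could be sketched as follows. TaggedRegMask, its factory functions and its accessors are hypothetical names, and the predicate mask width is an assumption.

```cpp
#include <cassert>
#include <cstdint>

using regMaskTP        = std::uint64_t; // gpr + float, unchanged
using regMaskPredicate = std::uint64_t; // assumed width for this sketch

// One storage slot that holds either kind of mask, plus a tag that says
// which one is currently stored.
struct TaggedRegMask
{
    union
    {
        regMaskTP        gprFloatRegs;
        regMaskPredicate predicateRegs;
    };
    bool isPredicate;

    static TaggedRegMask forGprFloat(regMaskTP mask)
    {
        TaggedRegMask t;
        t.gprFloatRegs = mask;
        t.isPredicate  = false;
        return t;
    }

    static TaggedRegMask forPredicate(regMaskPredicate mask)
    {
        TaggedRegMask t;
        t.predicateRegs = mask;
        t.isPredicate   = true;
        return t;
    }

    // Accessors only read the member that is active and return an empty
    // mask for the other kind.
    regMaskTP gprFloat() const
    {
        return isPredicate ? 0 : gprFloatRegs;
    }

    regMaskPredicate predicate() const
    {
        return isPredicate ? predicateRegs : 0;
    }
};
```

Note that a standalone bool is typically padded, so this struct on its own is no smaller than two 64-bit fields. The memory saving only materializes if the tag bit can be folded into spare bits that the enclosing data structure already has.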

Labels: Priority:2, area-CodeGen-coreclr
