@tannergooding tannergooding commented Jun 10, 2023

This makes idSmallCns signed on all platforms so that it can include -1. It also means that all 8-bit values (signed or unsigned) can now be tracked as small constants on every platform. Doing this covers an additional 2% of constants, bringing the share of constants classified as small up to 94.33%.
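To make the range change concrete, here is a minimal sketch of the "fits in a small constant" check for an unsigned versus a signed field. The 10-bit width is an assumption taken from the ranges printed in the stats headers for this run (0..1023 before, -512..511 after); the real width varies by platform, and the helper names are hypothetical, not the JIT's own.

```cpp
#include <cassert>
#include <cstdint>

// Assumed field width for illustration only; the actual idSmallCns width
// depends on the target platform.
constexpr unsigned kSmallCnsBits = 10;

// True if v is representable in an unsigned field of `bits` bits: [0, 2^bits - 1].
inline bool fitsInUnsignedField(int64_t v, unsigned bits)
{
    assert(bits >= 1 && bits < 63);
    return v >= 0 && v < (int64_t{1} << bits);
}

// True if v is representable in a signed (two's complement) field of `bits` bits:
// [-2^(bits-1), 2^(bits-1) - 1].
inline bool fitsInSignedField(int64_t v, unsigned bits)
{
    assert(bits >= 1 && bits < 63);
    const int64_t half = int64_t{1} << (bits - 1);
    return v >= -half && v < half;
}
```

Under this assumed width, -1 and every int8_t or uint8_t value pass the signed check, while values in 512..1023 that used to fit the unsigned field no longer do. That is the trade the percentages above describe.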

The below comment gives a comparison of the before vs after for crossgen of S.P.Corelib. There are various bits of interesting information that can be extrapolated, but the clearest is that 92% of all constants fit into an int8_t. There are then 4.5% of values that will never fit into a small constant (they take more than 16 bits) and so there's only about 1.2% of values that could fit into a small constant if we were given between [11, 16] bits to use.
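As a rough illustration of the width breakdown described above, the sketch below buckets a constant by the narrowest signed integer type that can hold it. This is an assumption about how such a tally could be computed, not the actual EMITTER_STATS code; the enum and function names are hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical bucket names for illustration.
enum class CnsWidth { Int8, Int16, Int32, Int64 };

// Returns the narrowest signed width whose range contains v.
inline CnsWidth narrowestSignedWidth(int64_t v)
{
    if (v >= std::numeric_limits<int8_t>::min() && v <= std::numeric_limits<int8_t>::max())
        return CnsWidth::Int8;
    if (v >= std::numeric_limits<int16_t>::min() && v <= std::numeric_limits<int16_t>::max())
        return CnsWidth::Int16;
    if (v >= std::numeric_limits<int32_t>::min() && v <= std::numeric_limits<int32_t>::max())
        return CnsWidth::Int32;
    return CnsWidth::Int64;
}
```

Counting constants per bucket this way gives the kind of split quoted above: most values land in the Int8 bucket, and only the Int16 bucket contains values that a wider small-constant field could still pick up.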

This also includes some cleanup to the EMITTER_STATS handling/printing to ensure that scenarios are all being tracked.

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 10, 2023
@ghost ghost assigned tannergooding Jun 10, 2023
ghost commented Jun 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This makes idSmallCns signed on xarch to better take advantage of how immediate values are handled on the platform.

Author: tannergooding
Assignees: tannergooding
Labels:

area-CodeGen-coreclr

Milestone: -

tannergooding commented Jun 11, 2023

The negative constants are heavily weighted towards -1, but this change lets us classify around 2% more of the total constant count as small and saves around 198KB of allocations in the emitter for S.P.Corelib.

Before

  Total allocated code size = 3776966
  Total generated code size = 3772607  (0.116% waste)
  Average of 3.51 bytes of code generated per instruction

  Instruction format frequency table:

            IF_NONE           59954 ( 5.52%)
            IF_LABEL          91766 ( 8.44%)
            IF_METHPTR        97375 ( 8.96%)
            IF_RRD            60569 ( 5.57%)
            IF_RWR            86024 ( 7.92%)
            IF_RRW             7808 ( 0.72%)
            IF_RRD_CNS        13167 ( 1.21%)
            IF_RWR_CNS        18190 ( 1.67%)
            IF_RRW_CNS        59335 ( 5.46%)
            IF_RRW_SHF         4975 ( 0.46%)
            IF_RRD_RRD        43262 ( 3.98%)
            IF_RWR_RRD       172507 (15.87%)
            IF_RRW_RRD        39381 ( 3.62%)
            IF_RWR_RRD_CNS     1320 ( 0.12%)
            IF_RWR_MRD         1971 ( 0.18%)
            IF_SRD_CNS         3984 ( 0.37%)
            IF_SWR_CNS         2643 ( 0.24%)
            IF_SWR_RRD        45423 ( 4.18%)
            IF_RWR_SRD        51804 ( 4.77%)
            IF_ARD            17378 ( 1.60%)
            IF_ARD_CNS         7840 ( 0.72%)
            IF_AWR_CNS         6169 ( 0.57%)
            IF_ARD_RRD        12328 ( 1.13%)
            IF_AWR_RRD        33768 ( 3.11%)
            IF_RRD_ARD         4947 ( 0.46%)
            IF_RWR_ARD       135912 (12.51%)
            IF_RRW_ARD         1465 ( 0.13%)
           ---------------------------------
            Total shown     1081265 (99.49%)

  Total of    34510 methods
  Total of   209371 insGroup
  Total of    45254 insPlaceholderGroupData
  Total of    42517 extend insGroup
  Total of  1073827 instructions
  Total of    92699 jumps
  Total of    26189 GC livesets

  Max prolog instrDesc count:       87
  Max prolog insGroup size  :      992

  Average of      6.1 insGroup     per method
  Average of      1.3 insPhGroup   per method
  Average of      1.2 extend IG    per method
  Average of     31.1 instructions per method
  Average of    954.0 desc.  bytes per method
  Average of      2.7 jumps        per method
  Average of      0.8 GC livesets  per method

  Average of      5.1 instructions per group
  Average of    157.2 desc.  bytes per group
  Average of      0.4 jumps        per group

  Average of     30.7 bytes        per instrDesc

  A total of 32922768 desc.  bytes

  Total instructions:     1073847
  Total small instrDesc:   494675 (46.07%)
  Total instrDesc:         412677 (38.43%)
  Total instrDescJmp:       92799 ( 8.64%)
  Total instrDescLbl:           0 ( 0.00%)
  Total instrDescCns:       15202 ( 1.42%)
  Total instrDescDsp:           0 ( 0.00%)
  Total instrDescCnsDsp:        0 ( 0.00%)
  Total instrDescAmd:       29609 ( 2.76%)
  Total instrDescCnsAmd:        0 ( 0.00%)
  Total instrDescCGCA:      39287 ( 3.66%)
  Total instrDescAlign:         0 ( 0.00%)

  Descriptor size distribution:
       <=        100 ===>     327 count (  0% of total)
      101 ..    1024 ===>   23392 count ( 69% of total)
     1025 ..    2048 ===>    6076 count ( 87% of total)
     2049 ..    3072 ===>    2058 count ( 93% of total)
     3073 ..    4096 ===>     982 count ( 96% of total)
     4097 ..    5120 ===>     517 count ( 98% of total)
     5121 ..   10240 ===>     675 count (100% of total)
        >      10240 ===>      84 count (100% of total)

  GC ref frame variable counts:
       <=          0 ===>   32528 count ( 95% of total)
        1 ..       1 ===>    1066 count ( 98% of total)
        2 ..       2 ===>     142 count ( 98% of total)
        3 ..       5 ===>     192 count ( 99% of total)
        6 ..      10 ===>      78 count ( 99% of total)
       11 ..      20 ===>     111 count ( 99% of total)
       21 ..      50 ===>      38 count ( 99% of total)
       51 ..     128 ===>       1 count (100% of total)
      129 ..     256 ===>       0 count (100% of total)
      257 ..     512 ===>       0 count (100% of total)
      513 ..    1024 ===>       0 count (100% of total)

  Max. stack depth distribution:
       <=          0 ===>   34065 count (100% of total)
        1 ..       1 ===>       0 count (100% of total)
        2 ..       2 ===>       0 count (100% of total)
        3 ..       5 ===>       0 count (100% of total)
        6 ..      10 ===>       0 count (100% of total)
       11 ..      16 ===>       0 count (100% of total)
       17 ..      32 ===>       0 count (100% of total)
       33 ..     128 ===>       0 count (100% of total)
      129 ..    1024 ===>       0 count (100% of total)

  SmallCnsCnt = 178334
  LargeCnsCnt =  15190 (  7 % of total)


  Common small constants >=  0, <= 1023
  cns[   0] = 85926
  cns[   1] = 9036
  cns[   2] = 3865
  cns[   3] = 1778
  cns[   4] = 2720
  cns[   5] = 943
  cns[   6] = 580
  cns[   7] = 793
  cns[   8] = 2459
  cns[   9] = 638
  cns[  10] = 516
  cns[  11] = 549
  cns[  12] = 1713
  cns[  13] = 494
  cns[  15] = 592
  cns[  16] = 1630
  cns[  17] = 265
  cns[  20] = 263
  cns[  23] = 203
  cns[  24] = 695
  cns[  27] = 256
  cns[  28] = 206
  cns[  31] = 597
  cns[  32] = 11178
  cns[  40] = 20774
  cns[  48] = 5708
  cns[  56] = 4287
  cns[  60] = 370
  cns[  62] = 213
  cns[  63] = 327
  cns[  64] = 2900
  cns[  72] = 2919
  cns[  80] = 999
  cns[  88] = 1351
  cns[  92] = 185
  cns[  96] = 446
  cns[ 104] = 520
  cns[ 112] = 238
  cns[ 120] = 329
  cns[ 127] = 179
  cns[ 128] = 527
  cns[ 136] = 439
  cns[ 152] = 337
  cns[ 168] = 349
  cns[ 184] = 313
  cns[ 200] = 227
  cns[>= 255] = 1787
  64635788 bytes allocated in the emitter

After

  Total allocated code size = 3777086
  Total generated code size = 3776958  (0.003% waste)
  Average of 3.51 bytes of code generated per instruction

  Instruction format frequency table:

            IF_NONE           60014 ( 5.52%)
            IF_LABEL          91897 ( 8.45%)
            IF_METHPTR        97465 ( 8.96%)
            IF_RRD            60631 ( 5.58%)
            IF_RWR            86188 ( 7.93%)
            IF_RRW             7809 ( 0.72%)
            IF_RRD_CNS        13167 ( 1.21%)
            IF_RWR_CNS        18190 ( 1.67%)
            IF_RRW_CNS        59363 ( 5.46%)
            IF_RRW_SHF         4975 ( 0.46%)
            IF_RRD_RRD        43262 ( 3.98%)
            IF_RWR_RRD       172607 (15.87%)
            IF_RRW_RRD        39385 ( 3.62%)
            IF_RWR_RRD_CNS     1319 ( 0.12%)
            IF_RWR_MRD         1970 ( 0.18%)
            IF_SRD_CNS         3980 ( 0.37%)
            IF_SWR_CNS         2645 ( 0.24%)
            IF_SWR_RRD        45433 ( 4.18%)
            IF_RWR_SRD        51818 ( 4.76%)
            IF_ARD            17380 ( 1.60%)
            IF_ARD_CNS         7843 ( 0.72%)
            IF_AWR_CNS         6168 ( 0.57%)
            IF_ARD_RRD        12327 ( 1.13%)
            IF_AWR_RRD        33790 ( 3.11%)
            IF_RRD_ARD         4947 ( 0.45%)
            IF_RWR_ARD       135936 (12.50%)
            IF_RRW_ARD         1466 ( 0.13%)
           ---------------------------------
            Total shown     1081975 (99.49%)

  Total of    34508 methods
  Total of   210168 insGroup
  Total of    45334 insPlaceholderGroupData
  Total of    42570 extend insGroup
  Total of  1076827 instructions
  Total of    92753 jumps
  Total of    26170 GC livesets

  Max prolog instrDesc count:       86
  Max prolog insGroup size  :     1016

  Average of      6.1 insGroup     per method
  Average of      1.3 insPhGroup   per method
  Average of      1.2 extend IG    per method
  Average of     31.2 instructions per method
  Average of    957.8 desc.  bytes per method
  Average of      2.7 jumps        per method
  Average of      0.8 GC livesets  per method

  Average of      5.1 instructions per group
  Average of    157.3 desc.  bytes per group
  Average of      0.4 jumps        per group

  Average of     30.7 bytes        per instrDesc

  A total of 33050432 desc.  bytes

  Total instructions:     1076572
  Total small instrDesc:   498004 (46.26%)
  Total instrDesc:         414593 (38.51%)
  Total instrDescJmp:       92830 ( 8.62%)
  Total instrDescLbl:           0 ( 0.00%)
  Total instrDescCns:       10972 ( 1.02%)
  Total instrDescDsp:           0 ( 0.00%)
  Total instrDescCnsDsp:        0 ( 0.00%)
  Total instrDescAmd:       29601 ( 2.75%)
  Total instrDescCnsAmd:        0 ( 0.00%)
  Total instrDescCGCA:      39292 ( 3.65%)
  Total instrDescAlign:         0 ( 0.00%)

  Descriptor size distribution:
       <=        100 ===>     245 count (  0% of total)
      101 ..    1024 ===>   23659 count ( 70% of total)
     1025 ..    2048 ===>    5962 count ( 87% of total)
     2049 ..    3072 ===>    2093 count ( 93% of total)
     3073 ..    4096 ===>     923 count ( 96% of total)
     4097 ..    5120 ===>     528 count ( 97% of total)
     5121 ..   10240 ===>     685 count (100% of total)
        >      10240 ===>      91 count (100% of total)

  GC ref frame variable counts:
       <=          0 ===>   32632 count ( 95% of total)
        1 ..       1 ===>    1058 count ( 98% of total)
        2 ..       2 ===>     143 count ( 98% of total)
        3 ..       5 ===>     192 count ( 99% of total)
        6 ..      10 ===>      77 count ( 99% of total)
       11 ..      20 ===>     111 count ( 99% of total)
       21 ..      50 ===>      38 count ( 99% of total)
       51 ..     128 ===>       1 count (100% of total)
      129 ..     256 ===>       0 count (100% of total)
      257 ..     512 ===>       0 count (100% of total)
      513 ..    1024 ===>       0 count (100% of total)

  Max. stack depth distribution:
       <=          0 ===>   34136 count (100% of total)
        1 ..       1 ===>       0 count (100% of total)
        2 ..       2 ===>       0 count (100% of total)
        3 ..       5 ===>       0 count (100% of total)
        6 ..      10 ===>       0 count (100% of total)
       11 ..      16 ===>       0 count (100% of total)
       17 ..      32 ===>       0 count (100% of total)
       33 ..     128 ===>       0 count (100% of total)
      129 ..    1024 ===>       0 count (100% of total)

  SmallCnsCnt =   182626 (94.33%)
  LargeCnsCnt =    10968 ( 5.67%)
  Int8CnsCnt  =   178046 (91.97%)
  Int16CnsCnt =     6791 ( 3.51%)
  Int32CnsCnt =     6965 ( 3.60%)
  NegCnsCnt   =     8084 ( 4.18%)
  Pow2CnsCnt  =    36695 (18.95%)


  Common small constants >= -512, <= 511
  cns[<=-128] =      221 ( 0.11%)
  cns[    -2] =      391 ( 0.20%)
  cns[    -1] =     2956 ( 1.53%)
  cns[     0] =    85902 (44.37%)
  cns[     1] =     9040 ( 4.67%)
  cns[     2] =     3866 ( 2.00%)
  cns[     3] =     1778 ( 0.92%)
  cns[     4] =     2721 ( 1.41%)
  cns[     5] =      943 ( 0.49%)
  cns[     6] =      580 ( 0.30%)
  cns[     7] =      795 ( 0.41%)
  cns[     8] =     2460 ( 1.27%)
  cns[     9] =      638 ( 0.33%)
  cns[    10] =      516 ( 0.27%)
  cns[    11] =      547 ( 0.28%)
  cns[    12] =     1713 ( 0.88%)
  cns[    13] =      494 ( 0.26%)
  cns[    15] =      592 ( 0.31%)
  cns[    16] =     1631 ( 0.84%)
  cns[    17] =      264 ( 0.14%)
  cns[    20] =      263 ( 0.14%)
  cns[    23] =      203 ( 0.10%)
  cns[    24] =      696 ( 0.36%)
  cns[    27] =      256 ( 0.13%)
  cns[    28] =      206 ( 0.11%)
  cns[    31] =      598 ( 0.31%)
  cns[    32] =    11176 ( 5.77%)
  cns[    40] =    20786 (10.74%)
  cns[    48] =     5708 ( 2.95%)
  cns[    56] =     4286 ( 2.21%)
  cns[    60] =      370 ( 0.19%)
  cns[    62] =      213 ( 0.11%)
  cns[    63] =      327 ( 0.17%)
  cns[    64] =     2902 ( 1.50%)
  cns[    72] =     2918 ( 1.51%)
  cns[    80] =      999 ( 0.52%)
  cns[    88] =     1351 ( 0.70%)
  cns[    92] =      185 ( 0.10%)
  cns[    96] =      446 ( 0.23%)
  cns[   104] =      520 ( 0.27%)
  cns[   112] =      238 ( 0.12%)
  cns[   120] =      329 ( 0.17%)
  cns[>= 127] =     4447 ( 2.30%)
  64838740 bytes allocated in the emitter

@tannergooding tannergooding changed the title [Prototype] Make idSmallCns signed on xarch [Prototype] Make idSmallCns signed Jun 12, 2023
@tannergooding tannergooding changed the title [Prototype] Make idSmallCns signed Make idSmallCns signed to allow including -1 Jun 12, 2023
@tannergooding tannergooding marked this pull request as ready for review June 12, 2023 12:45
@tannergooding
Member Author

CC. @dotnet/jit-contrib. This is ready for review.

As per the summary above, this updates small constants to be tracked as signed, which reduces the total emitter allocations for crossgen2 of S.P.Corelib by around 198KB. It allows 94.33% of constants (compared to 92.15% previously) to be classified as small.
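The two percentages can be reproduced from the SmallCnsCnt and LargeCnsCnt counters in the stats output above. A small sketch of that arithmetic follows; the function name is hypothetical.

```cpp
#include <cstdint>

// Share of constants classified as small, as a percentage of all constants.
inline double smallConstantSharePercent(uint64_t smallCount, uint64_t largeCount)
{
    const uint64_t total = smallCount + largeCount;
    if (total == 0)
    {
        return 0.0; // no constants recorded
    }
    return 100.0 * static_cast<double>(smallCount) / static_cast<double>(total);
}
```

Calling it with the "Before" counters (178334 small, 15190 large) gives roughly 92.15, and with the "After" counters (182626 small, 10968 large) roughly 94.33.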

This also cleaned up the EMITTER_STATS output a bit and added some additional information so we could have more context as to the types of constants encountered and whether they could be classified as small constants at all.

There are no diffs. There are some TP (throughput) diffs, ranging from minor improvements to minor regressions, which depend on the exact number of bits being used and on the codegen required to sign-extend the value rather than zero-extend it (this is cheap, but it is an extra instruction).
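For readers unfamiliar with the distinction, here is a portable sketch of reading an N-bit field back with zero extension versus sign extension. The 10-bit width and all names are assumptions for illustration; the JIT itself would use a bitfield and let the compiler emit the extension.

```cpp
#include <cstdint>

// Assumed field width for illustration only.
constexpr unsigned kFieldBits = 10;
constexpr uint32_t kFieldMask = (1u << kFieldBits) - 1;
constexpr uint32_t kFieldSignBit = 1u << (kFieldBits - 1);

// Keep only the low kFieldBits bits of v (two's complement truncation).
inline uint32_t encodeField(int32_t v)
{
    return static_cast<uint32_t>(v) & kFieldMask;
}

// Zero extension: the stored bits are interpreted as a non-negative value.
inline int32_t decodeZeroExtended(uint32_t raw)
{
    return static_cast<int32_t>(raw & kFieldMask);
}

// Sign extension: the top bit of the field is replicated into the upper bits.
// The xor-then-subtract step is the extra work compared to zero extension.
inline int32_t decodeSignExtended(uint32_t raw)
{
    const uint32_t low = raw & kFieldMask;
    return static_cast<int32_t>(low ^ kFieldSignBit) - static_cast<int32_t>(kFieldSignBit);
}
```

With these definitions, storing -1 and reading it back sign-extended yields -1, whereas the zero-extended read yields 1023. That is why an unsigned field could not represent -1 as a small constant.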


This was done as part of investigating #87016. Not having this makes resolving #87016 much more complex and would cause it to introduce TP regressions. This is because we would have to introduce additional checks/branches to handle the combinations of new EVEX bits alongside the different sized constants (even though SIMD only ever needs 8 bits). Such logic would run as part of allocating the instrDesc, in addition to the checks that already exist.

With this, we can instead rely on SIMD instructions always having a small constant. This also lets us make use of bits that are otherwise unused, which keeps the change very pay-for-play: the handling is entirely relegated to the EVEX code paths and only has to be touched on the path that adds the EVEX prefix.

@BruceForstall BruceForstall left a comment

LGTM. Thanks for updating the emitter stats.

@tannergooding tannergooding merged commit 39fe6ee into dotnet:main Jun 20, 2023
@tannergooding tannergooding deleted the xarch-fitsInSmallCns branch June 20, 2023 00:20
@ghost ghost locked as resolved and limited conversation to collaborators Jul 20, 2023