Skip to content

ARM64 GC: Use SVE when sorting the mark list #108473

@a74nh

Description

@a74nh

Background

  • AIUI, in the coreclr GC, the mark list is an optimisation for post-marking stages where rather than traversing the heap, object by object, it allows the plan phase to skip over objects that are not live (gaps) and just traverse live objects (plugs). To do this, it sorts the object addresses of the live objects.
  • For most targets, introsort::sort is used
  • For AMD64, vxsort is used. This is a modified copy of damageboy/vxsort-cpp
  • vxsort is Quicksort with a vectorised partition routine.
  • Traditionally, the partition is not vectorizable. However, AVX can use masks and the permute instruction to split a vector into two parts.
  • ARM64 SVE has similar concepts and can split a vector using predicates and the compact instruction.

Suggestion

  • Extend the vxsort in coreclr to also work using SVE. I'm not sure how easy this would be to implement.

Alternative Suggestion

  • Highway provides a highly optimised version of Quicksort, which does not rely on the special AVX/SVE instructions and instead can vectorise across all targets. I don't know if use of this would be suitable for coreclr

Reference

Here is an example partition routine implemented in C# SVE. All elements less than the first element get written to left, all other elements get written to right

    public static unsafe void splitArray(ref uint* input, ref uint* left, ref uint* right, int length)
    {
      long i = 0;

      // Position within the output arrays.
      long index_left = 0;
      long index_right = 0;

      // Create a vector filled with the first element from input.
      Vector<uint> firstelemvec = Sve.DuplicateSelectedScalarToVector(Sve.LoadVector(Sve.CreateTrueMaskUInt32(), input), 0);

      // Create a predicate for the loop.
      Vector<uint> ploop = Sve.CreateWhileLessThanMask32Bit(i, length);

      while(Sve.TestAnyTrue(Sve.CreateTrueMaskUInt32(), ploop))
      {
        // Load from the input array based on the loop predicate.
        Vector<uint> data = Sve.LoadVector(ploop, input + i);

        // Find all elements in input array less than the first element.
        Vector<uint> pinner = Sve.ConditionalSelect(ploop, Sve.CompareLessThan(data, firstelemvec), Vector<uint>.Zero);

        // Squash all found elements to the lower lanes of the vector
        Vector<uint> compacted = Sve.Compact(pinner, data);
        
        // Store the squashed elements to the first output array. (This uses the loop predicate, so some additional
        // zeros may be stored)
        Sve.StoreAndZip(ploop, left + index_left, compacted);

        // Increment the position in the first output array by the number of elements found.
        index_left = Sve.SaturatingIncrementByActiveElementCount(index_left, pinner);

        // Do the same for the elements greater than or equal, storing into the second output array.
        pinner = Sve.ConditionalSelect(ploop, Sve.CompareGreaterThanOrEqual(data, firstelemvec), Vector<uint>.Zero);
        compacted = Sve.Compact(pinner, data);
        Sve.StoreAndZip(ploop, right + index_right, compacted);
        index_right = Sve.SaturatingIncrementByActiveElementCount(index_right, pinner);

        // Handle the loop.
        i = Sve.SaturatingIncrementBy32BitElementCount(i, 1);
        ploop = Sve.CreateWhileLessThanMask32Bit(i, length);
      }
    }

Metadata

Metadata

Assignees

Type

No type

Projects

Status

No status

Relationships

None yet

Development

No branches or pull requests

Issue actions