-
Couldn't load subscription status.
- Fork 5.2k
Open
Labels
arch-arm64area-GC-coreclrarm-sveWork related to arm64 SVE/SVE2 supportWork related to arm64 SVE/SVE2 support
Milestone
Description
Background
- AIUI, in the coreclr GC, the mark list is an optimisation for post-marking stages where rather than traversing the heap, object by object, it allows the plan phase to skip over objects that are not live (gaps) and just traverse live objects (plugs). To do this, it sorts the object addresses of the live objects.
- For most targets, introsort::sort is used
- For AMD64, vxsort is used. This is a modified copy of damageboy/vxsort-cpp
- vxsort is Quicksort with a vectorised partition routine.
- Traditionally, the partition is not vectorizable. However, AVX can use masks and the permute instruction to split a vector into two parts.
- ARM64 SVE has similar concepts and can split a vector using predicates and the compact instruction.
Suggestion
- Extend the vxsort in coreclr to also work using SVE. I'm not sure how easy this would be to implement.
Alternative Suggestion
- Highway provides a highly optimised version of Quicksort, which does not rely on the special AVX/SVE instructions and instead can vectorise across all targets. I don't know if use of this would be suitable for coreclr
Reference
Here is an example partition routine implemented in C# SVE. All elements less than the first element get written to left, all other elements get written to right
public static unsafe void splitArray(ref uint* input, ref uint* left, ref uint* right, int length)
{
long i = 0;
// Position within the output arrays.
long index_left = 0;
long index_right = 0;
// Create a vector filled with the first element from input.
Vector<uint> firstelemvec = Sve.DuplicateSelectedScalarToVector(Sve.LoadVector(Sve.CreateTrueMaskUInt32(), input), 0);
// Create a predicate for the loop.
Vector<uint> ploop = Sve.CreateWhileLessThanMask32Bit(i, length);
while(Sve.TestAnyTrue(Sve.CreateTrueMaskUInt32(), ploop))
{
// Load from the input array based on the loop predicate.
Vector<uint> data = Sve.LoadVector(ploop, input + i);
// Find all elements in input array less than the first element.
Vector<uint> pinner = Sve.ConditionalSelect(ploop, Sve.CompareLessThan(data, firstelemvec), Vector<uint>.Zero);
// Squash all found elements to the lower lanes of the vector
Vector<uint> compacted = Sve.Compact(pinner, data);
// Store the squashed elements to the first output array. (This uses the loop predicate, so some additional
// zeros may be stored)
Sve.StoreAndZip(ploop, left + index_left, compacted);
// Increment the position in the first output array by the number of elements found.
index_left = Sve.SaturatingIncrementByActiveElementCount(index_left, pinner);
// Do the same for the elements greater than or equal, storing into the second output array.
pinner = Sve.ConditionalSelect(ploop, Sve.CompareGreaterThanOrEqual(data, firstelemvec), Vector<uint>.Zero);
compacted = Sve.Compact(pinner, data);
Sve.StoreAndZip(ploop, right + index_right, compacted);
index_right = Sve.SaturatingIncrementByActiveElementCount(index_right, pinner);
// Handle the loop.
i = Sve.SaturatingIncrementBy32BitElementCount(i, 1);
ploop = Sve.CreateWhileLessThanMask32Bit(i, length);
}
}Metadata
Metadata
Assignees
Labels
arch-arm64area-GC-coreclrarm-sveWork related to arm64 SVE/SVE2 supportWork related to arm64 SVE/SVE2 support
Type
Projects
Status
No status