-
Notifications
You must be signed in to change notification settings - Fork 15
Add RadixSort as default sorting algorithm for InlineString types #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fixes #3. A few notes on what's proposed in this PR: * It defines a `Base.Sort.defalg` on `AbstractVector{<:Union{InlineString, Missing}}` to be `RadixSort`; this means when someone calls `sort(x)` where `x` is some kind of `InlineString` array, it will use the `RadixSort` implementation also included in this PR * The implementation is based on the one defined in SortingAlgorithms.jl, but updated to be a bit more modern stylistically, improved performance, and more generically work on primitive types so that `InlineString`s can be sorted appropriately * I've also added quite a few code comments to try and explain what exactly is going on in various parts of the code to the best of my knowledge Still to do in this PR: * Add tests; I've done initial testing locally and it seems to be working, but obviously we need actual rigorous testing here * Performance tuning; I think there are several knobs we can leverage to help fine-tune performance for various cases. The current implementation does show big speedups for large arrays, but is dramatically slower on very small arrays (< 20 elements). I'd like to play around with ways we can gain back some of that performance. I'd also like to experiment with different radix sizes and some short-circuiting ideas that may help speed things up even further.
Codecov Report
@@ Coverage Diff @@
## main #10 +/- ##
==========================================
+ Coverage 89.21% 89.88% +0.66%
==========================================
Files 1 1
Lines 371 435 +64
==========================================
+ Hits 331 391 +60
- Misses 40 44 +4
Continue to review full report at Codecov.
|
cc: @evalparse, @nalimilan @bkamins Quick update: I've pushed an implementation of I did some more benchmarking; RadixSort indeed has some very nice speedups, but not in all cases. The slow cases I could find is when there are < 500 elements in an array, or when the InlineString type is String31 or larger when the # of elements is even up to 50_000 or so. I'm trying to dig into why the performance penalty for String31 and larger. For the small array case, I'm wondering if we should just do a quick check at the start of the RadixSort |
Regarding performance - I guess we can use some heuristic thresholds to switch between algorithms. Right? |
Cool, thanks! The fact that radix sort is slow for small vectors has also been spotted at JuliaCollections/SortingAlgorithms.jl#50. It would make sense to adopt a threshold like JuliaCollections/SortingAlgorithms.jl#51 does (comments welcome there BTW). Have you considered using the |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
I realized we can speed RadixSort is still way faster for larger arrays, except for large InlineString types, like String63, String127, and String255. I want to explore if there are some ways we could optimize radixsort for those types, but we might just bail to MergeSort for those as well. |
Ok, I think this is ready to go. I played around with different radix sizes for all InlineString types, but the radix size surprisingly didn't make that much of a difference; or at least a dramatic and consistent enough difference to fiddle around with. I was hoping it would make a bigger difference for larger InlineString types, but for those (String31 and larger), MergeSort is consistently the fastest, so I've updated the radix code to only operate on smaller InlineString types. I also updated the MergeSort thresholds, since I was doing so much empirical testing on stuff anyway. |
Fixes #3. A few notes on what's proposed in this PR:
Base.Sort.defalg
onAbstractVector{<:Union{InlineString, Missing}}
to beRadixSort
;this means when someone calls
sort(x)
wherex
is some kind ofInlineString
array, it will use theRadixSort
implementation alsoincluded in this PR
SortingAlgorithms.jl, but updated to be a bit more modern
stylistically, improved performance, and more generically work on
primitive types so that
InlineString
s can be sorted appropriatelyexactly is going on in various parts of the code to the best of my
knowledge
Still to do in this PR:
working, but obviously we need actual rigorous testing here
to help fine-tune performance for various cases. The current
implementation does show big speedups for large arrays, but is
dramatically slower on very small arrays (< 20 elements). I'd like to
play around with ways we can gain back some of that performance. I'd
also like to experiment with different radix sizes and some
short-circuiting ideas that may help speed things up even further.