@alamb
Contributor

@alamb alamb commented Jul 23, 2025

This PR is designed to test @jhorstmann's question on @zhuqi-lucas's PR #7962, raised in #7962 (comment):

Is this implementation somehow more performant than using the existing BitIndexIterator and casting its items to u32? The only difference I see is in the masking of the lowest bit, ^= 1 << bit_pos vs &= self.curr - 1, but I think llvm would know that those are equivalent. If it makes a difference, then we should adjust BitIndexIterator the same way.

I made a PR based on #7962, changed the code to use the existing BitIndexIterator, and will run the same benchmarks on it to see how it compares.
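
For readers following along, here is a minimal sketch of the two masking forms mentioned in the quoted question, `^= 1 << bit_pos` and `&= curr - 1`. It is not the arrow-rs code: the function names are hypothetical and the sketch works on a single `u64` word rather than a full bitmap. It only illustrates that both forms clear the lowest set bit and therefore yield the same indices.

```rust
/// Clears the lowest set bit by XOR-ing with a mask built from its position.
/// (Hypothetical helper for illustration; not part of arrow-rs.)
fn clear_lowest_xor(word: u64) -> u64 {
    if word == 0 {
        // No bit to clear; also avoids shifting by 64.
        return 0;
    }
    let bit_pos = word.trailing_zeros();
    word ^ (1u64 << bit_pos)
}

/// Clears the lowest set bit with the classic `x & (x - 1)` form.
/// (Hypothetical helper for illustration; not part of arrow-rs.)
fn clear_lowest_and(word: u64) -> u64 {
    word & word.wrapping_sub(1)
}

/// Collects the indices of all set bits in `word`, lowest first,
/// using the supplied function to clear each bit once it is recorded.
fn set_bit_indices(mut word: u64, clear: fn(u64) -> u64) -> Vec<u32> {
    let mut indices = Vec::new();
    while word != 0 {
        indices.push(word.trailing_zeros());
        word = clear(word);
    }
    indices
}

fn main() {
    let word: u64 = 0b1011_0100;
    let via_xor = set_bit_indices(word, clear_lowest_xor);
    let via_and = set_bit_indices(word, clear_lowest_and);
    // Both masking forms visit the same bit positions.
    assert_eq!(via_xor, via_and);
    println!("{:?}", via_xor); // prints [2, 4, 5, 7]
}
```

Whether the compiler emits identical machine code for the two forms is exactly what the benchmark run is meant to check; the sketch only shows that they are functionally equivalent.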

@alamb
Contributor Author

alamb commented Jul 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/test_with_bit_iterator (162b2a8) to 82821e5 diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_test_with_bit_iterator
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jul 23, 2025

🤖: Benchmark completed

Details

group                                                   alamb_test_with_bit_iterator           main
-----                                                   ----------------------------           ----
lexsort (bool, bool) 2^12                               1.00    115.8±0.41µs        ? ?/sec    1.01    117.3±0.32µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    154.0±0.24µs        ? ?/sec    1.03    157.9±0.21µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     44.9±0.06µs        ? ?/sec    1.00     45.0±0.08µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.02    213.5±0.40µs        ? ?/sec    1.00    209.8±0.37µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.8±0.08µs        ? ?/sec    1.02     39.6±0.11µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.00     41.1±0.09µs        ? ?/sec    1.00     41.2±0.10µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     78.6±0.11µs        ? ?/sec    1.00     78.4±0.10µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    210.8±0.43µs        ? ?/sec    1.00    210.4±0.64µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     52.7±0.15µs        ? ?/sec    1.00     52.4±0.18µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.01    247.5±0.44µs        ? ?/sec    1.00    245.1±0.43µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.02     86.4±0.19µs        ? ?/sec    1.00     84.6±0.20µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.01     86.8±0.14µs        ? ?/sec    1.00     85.6±0.18µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.02     96.5±0.29µs        ? ?/sec    1.00     94.8±0.16µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.01    247.6±0.69µs        ? ?/sec    1.00    245.3±0.50µs        ? ?/sec
rank f32 2^12                                           1.07     72.4±0.43µs        ? ?/sec    1.00     67.9±0.30µs        ? ?/sec
rank f32 nulls 2^12                                     1.05     37.1±0.17µs        ? ?/sec    1.00     35.2±0.07µs        ? ?/sec
rank string[10] 2^12                                    1.00    252.4±0.38µs        ? ?/sec    1.01    255.9±1.73µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    121.1±0.23µs        ? ?/sec    1.01    122.2±0.42µs        ? ?/sec
sort f32 2^12                                           1.01     60.8±0.87µs        ? ?/sec    1.00     60.1±0.44µs        ? ?/sec
sort f32 nulls 2^12                                     1.01     29.2±0.08µs        ? ?/sec    1.00     28.8±0.10µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     40.5±0.24µs        ? ?/sec    1.34     54.5±0.14µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.2±0.27µs        ? ?/sec    1.05     76.1±0.50µs        ? ?/sec
sort i32 2^10                                           1.02      7.5±0.02µs        ? ?/sec    1.00      7.3±0.02µs        ? ?/sec
sort i32 2^12                                           1.02     36.4±0.16µs        ? ?/sec    1.00     35.8±0.12µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      4.8±0.01µs        ? ?/sec    1.00      4.8±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     20.1±0.06µs        ? ?/sec    1.01     20.3±0.05µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.05      8.2±0.02µs        ? ?/sec    1.00      7.8±0.02µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     35.3±0.11µs        ? ?/sec    1.25     44.0±0.22µs        ? ?/sec
sort i32 to indices 2^10                                1.12     12.8±0.02µs        ? ?/sec    1.00     11.4±0.04µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.1±0.18µs        ? ?/sec    1.00     55.0±0.89µs        ? ?/sec
sort primitive run 2^12                                 1.00      7.0±0.02µs        ? ?/sec    1.00      7.1±0.01µs        ? ?/sec
sort primitive run to indices 2^12                      1.06      9.4±0.02µs        ? ?/sec    1.00      8.9±0.02µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00    153.2±0.43µs        ? ?/sec    1.10    168.3±0.33µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    333.3±0.64µs        ? ?/sec    1.00    332.2±0.49µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00    122.3±0.31µs        ? ?/sec    1.15    140.2±0.30µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    260.1±0.74µs        ? ?/sec    1.01    262.3±0.66µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00    133.5±0.60µs        ? ?/sec    1.11    148.4±0.47µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00    282.9±1.65µs        ? ?/sec    1.00    283.0±0.80µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00    125.7±0.36µs        ? ?/sec    1.11    138.9±0.35µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00    251.3±1.36µs        ? ?/sec    1.00    250.5±1.39µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00    119.5±0.45µs        ? ?/sec    1.13    135.0±0.53µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    248.2±1.01µs        ? ?/sec    1.00    248.1±0.98µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    154.1±0.33µs        ? ?/sec    1.13    174.4±0.47µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    316.4±1.03µs        ? ?/sec    1.01    320.8±0.79µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00    122.0±0.23µs        ? ?/sec    1.12    136.4±0.21µs        ? ?/sec
sort string[10] to indices 2^12                         1.00    244.7±0.38µs        ? ?/sec    1.01    247.7±0.64µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.00     64.4±0.34µs        ? ?/sec    1.25     80.5±0.16µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    134.1±0.20µs        ? ?/sec    1.00    134.6±0.62µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     48.0±0.32µs        ? ?/sec    1.29     61.8±0.59µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.3±0.33µs        ? ?/sec    1.00    104.1±0.19µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     45.3±0.29µs        ? ?/sec    1.29     58.4±0.38µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     94.9±0.31µs        ? ?/sec    1.00     95.3±0.45µs        ? ?/sec

@zhuqi-lucas
Contributor

zhuqi-lucas commented Jul 23, 2025

The result for the u32 implementation is:

sort f32 nulls to indices 2^12                          1.00     39.7±0.10µs        ? ?/sec    1.37     54.5±0.17µs        ? ?/sec

The result for this PR is:

sort f32 nulls to indices 2^12                          1.00     40.5±0.24µs        ? ?/sec    1.34     54.5±0.14µs        ? ?/sec

That is about a 3% improvement for the u32 implementation, going by the speedup over main (1.37x vs. 1.34x).

In my local testing, the difference is sometimes 5% or more.
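
As a rough check on the figures quoted in this comment, the sketch below recomputes the comparison from the rounded mean times in the two benchmark rows (39.7µs, 40.5µs, and 54.5µs for main). The helper names are hypothetical, and because the inputs are rounded, the printed ratios may differ slightly from the ratio columns in the benchmark output.

```rust
/// Speedup of `candidate_us` relative to `baseline_us` (both in microseconds).
/// (Hypothetical helper for illustration.)
fn speedup(baseline_us: f64, candidate_us: f64) -> f64 {
    baseline_us / candidate_us
}

/// How much faster `faster_us` is than `slower_us`, as a percentage of `slower_us`.
/// (Hypothetical helper for illustration.)
fn percent_faster(slower_us: f64, faster_us: f64) -> f64 {
    (slower_us - faster_us) / slower_us * 100.0
}

fn main() {
    // Rounded mean times for "sort f32 nulls to indices 2^12".
    let main_us = 54.5;
    let u32_impl_us = 39.7; // the u32 implementation
    let bit_index_us = 40.5; // this PR, using BitIndexIterator

    let s_u32 = speedup(main_us, u32_impl_us);
    let s_bit = speedup(main_us, bit_index_us);
    println!("u32 implementation vs main: {:.2}x", s_u32);
    println!("BitIndexIterator vs main:   {:.2}x", s_bit);

    // Gap between the two speedups, in percentage points of main's time.
    println!("speedup gap: {:.1} points", (s_u32 - s_bit) * 100.0);

    // Direct comparison of the two candidate times.
    println!(
        "u32 implementation is {:.1}% faster than BitIndexIterator",
        percent_faster(bit_index_us, u32_impl_us)
    );
}
```

Measured directly between the two candidates, the gap on this single run comes out closer to 2%; the roughly 3% figure corresponds to the difference between the two speedups over main.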

@alamb
Contributor Author

alamb commented Jul 24, 2025

Test complete, so closing this PR

@alamb alamb closed this Jul 24, 2025

Labels

arrow Changes to the arrow crate
