Speed up `collect_bool` and remove `unsafe`, optimize `take_bits`, `take_native` for null values #8849

Dandandan · 2025-11-14T19:44:19Z

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Closes Speed up collect_bool and remove unsafe #8848

Rationale for this change

Some more performance
Remove unsafe

take bool 512           time:   [450.90 ns 455.52 ns 463.35 ns]
                        change: [−13.852% −12.345% −10.584%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

take bool 1024          time:   [810.17 ns 825.99 ns 857.14 ns]
                        change: [−7.4751% −5.4268% −2.8132%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

take bool null indices 1024
                        time:   [963.37 ns 967.58 ns 973.34 ns]
                        change: [−5.1713% −4.3630% −3.5176%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

take bool null values 1024
                        time:   [1.5872 µs 1.5936 µs 1.6034 µs]
                        change: [−8.4184% −7.6316% −6.9030%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

take bool null values null indices 1024
                        time:   [1.9172 µs 1.9262 µs 1.9403 µs]
                        change: [−4.4966% −3.9076% −3.3188%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
take i32 null indices 1024
                        time:   [539.55 ns 541.72 ns 545.28 ns]
                        change: [−6.8747% −6.5190% −6.0826%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

take i32 null values 1024
                        time:   [1.2699 µs 1.2750 µs 1.2823 µs]
                        change: [−4.8767% −4.4049% −3.8817%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

take i32 null values null indices 1024
                        time:   [1.4711 µs 1.4781 µs 1.4874 µs]
                        change: [−6.3342% −4.7819% −3.6249%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.

What changes are included in this PR?

Changes to use Vec API (extend, push) rather than MutableBuffer and unsafe code.

Are these changes tested?

Existing tests

Are there any user-facing changes?

No

Remove truncation of buffer to align with byte size.

alamb · 2025-11-16T11:45:11Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing collect_bool (f93da94) to 9d7dde2 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=collect_bool
Results will be posted here when complete

alamb · 2025-11-16T11:56:17Z

🤖: Benchmark completed

Details

group                                                                     collect_bool                           main
-----                                                                     ------------                           ----
take bool 1024                                                            1.00   1331.6±2.76ns        ? ?/sec    1.03   1368.9±1.62ns        ? ?/sec
take bool 512                                                             1.00    730.4±1.08ns        ? ?/sec    1.10    805.7±1.45ns        ? ?/sec
take bool null indices 1024                                               1.00   1645.3±7.27ns        ? ?/sec    1.13   1853.0±8.36ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.03µs        ? ?/sec    1.05      2.7±0.00µs        ? ?/sec
take bool null values null indices 1024                                   1.00      3.1±0.05µs        ? ?/sec    1.12      3.5±0.02µs        ? ?/sec
take check bounds i32 1024                                                1.00   1620.9±2.39ns        ? ?/sec    1.00   1621.8±2.39ns        ? ?/sec
take check bounds i32 512                                                 1.16   942.9±10.10ns        ? ?/sec    1.00    811.3±1.10ns        ? ?/sec
take i32 1024                                                             1.00    712.8±8.88ns        ? ?/sec    1.00    711.4±1.12ns        ? ?/sec
take i32 512                                                              1.14    441.2±0.77ns        ? ?/sec    1.00    388.6±0.75ns        ? ?/sec
take i32 null indices 1024                                                1.21    991.9±8.09ns        ? ?/sec    1.00    816.6±1.93ns        ? ?/sec
take i32 null values 1024                                                 1.00      2.0±0.08µs        ? ?/sec    1.02      2.1±0.00µs        ? ?/sec
take i32 null values null indices 1024                                    1.03      2.7±0.01µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.03      8.3±0.10µs        ? ?/sec    1.00      8.0±0.02µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      9.1±0.11µs        ? ?/sec    1.09      9.9±0.10µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.03     21.1±0.10µs        ? ?/sec    1.00     20.6±0.06µs        ? ?/sec
take str 1024                                                             1.01     11.3±0.05µs        ? ?/sec    1.00     11.2±0.04µs        ? ?/sec
take str 512                                                              1.00      5.4±0.02µs        ? ?/sec    1.00      5.4±0.02µs        ? ?/sec
take str null indices 1024                                                1.00      7.8±0.09µs        ? ?/sec    1.01      7.9±0.03µs        ? ?/sec
take str null indices 512                                                 1.02      3.9±0.04µs        ? ?/sec    1.00      3.8±0.02µs        ? ?/sec
take str null values 1024                                                 1.04      8.9±0.12µs        ? ?/sec    1.00      8.5±0.03µs        ? ?/sec
take str null values null indices 1024                                    1.00      7.4±0.10µs        ? ?/sec    1.04      7.7±0.04µs        ? ?/sec
take stringview 1024                                                      1.01   857.7±11.93ns        ? ?/sec    1.00    851.8±2.01ns        ? ?/sec
take stringview 512                                                       1.01    555.7±7.08ns        ? ?/sec    1.00    552.1±1.83ns        ? ?/sec
take stringview null indices 1024                                         1.00  1403.4±23.00ns        ? ?/sec    1.01   1414.0±2.00ns        ? ?/sec
take stringview null indices 512                                          1.00    687.7±8.99ns        ? ?/sec    1.01    692.0±1.12ns        ? ?/sec
take stringview null values 1024                                          1.02      2.1±0.00µs        ? ?/sec    1.00      2.0±0.00µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.9±0.02µs        ? ?/sec    1.05      3.0±0.02µs        ? ?/sec

alamb

Thank you @Dandandan -- I think this looks great to me. The benchmarks look good and I reviewed the code carefully.

I left some other suggestions, but I don't think they are required

🚢 🇮🇹

alamb · 2025-11-16T11:50:52Z

arrow-ord/src/cmp.rs

 /// the option to efficiently negate the result
 fn collect_bool(len: usize, neg: bool, f: impl Fn(usize) -> bool) -> BooleanBuffer {
-    let mut buffer = MutableBuffer::new(ceil(len, 64) * 8);
+    let mut buffer = Vec::with_capacity(ceil(len, 64));


Maybe we could make the neg a generic argument on MutableBuffer::collect_bool and then avoid the duplication (as a follow on PR)

Or maybe make a collect_bool function in bit_util that returns a Vec and have the mutable buffer and this one call it 🤔

That makes sense, I was thinking about it as well.

alamb · 2025-11-16T11:56:28Z

arrow-select/src/take.rs

-                None => match n.is_null(idx) {
-                    true => T::default(),
-                    false => panic!("Out-of-bounds index {index:?}"),
+                // SAFETY: idx<indices.len()


I had to read this several times to convince myself it is correct -- namely that idx i doesn't come from indices (provided by the user) but instead comes from iterating indices

I found this whole method actually pretty confusing as there are multiple things called values and indices (and indices.values()..)

I also double checked that there is a test for out of bounds indexes here:
https://github.com/apache/arrow-rs/blob/f93da94e61e731344ce84146dee946a94fe36602/arrow-select/src/take.rs#L2084-L2083

alamb · 2025-11-16T12:00:33Z

arrow-select/src/take.rs

-                    false => panic!("Out-of-bounds index {index:?}"),
+                // SAFETY: idx<indices.len()
+                None => match unsafe { n.inner().value_unchecked(idx) } {
+                    false => T::default(),


@Dandandan do you know why this even bothers to look at the null buffer again? I realize you didn't change this code, but it seems to me like checking n.inner() (the nulls) is unecessary - it was already implicitly checked by calling values.get() (which returns Some/None).

It seems like all this is doing is re-checking that value() and the nulls match up.

so, TLDR I suggest we remove this clause entirely (could be a follow on PR)

I think we can't remove it - it checks the indices value is null as well to make sure out of bounds on a non-null value leads to a panic.

So currently:

out of bound with a null index value => default value (+ null in the output)

out of bounds with a non-null value => panic

We could consider out of bounds either always panics or always gives a default (0) value but the current API (and tests) requires it to be this way

Logic looks correct to me, but is not really intuitive. In our microbenchmarks the current code might faster because it avoids a branch on the validity bit in the happy case, but I'm not sure that will still be faster on larger inputs, or if a larger amount of indices is null.

I would find the following more intuitive, and hopefully not much slower (slice, range and zip iteration should all be TrustedLen):

indices .values() .iter() .zip((0..n.len()).map(move |i| unsafe { n.inner().value_unchecked(i) })) .map(|(index, valid)| if valid { values[index.as_usize()] } else { T::default() }) .collect()

I made a PR with this change:

Clean up take_native to avoid a panic #8861

arrow-select/src/take.rs

Co-authored-by: Andrew Lamb <[email protected]>

Dandandan · 2025-11-16T12:22:06Z

take i32 null indices 1024                                                1.21    991.9±8.09ns        ? ?/sec    1.00    816.6±1.93ns        ? ?/sec

This is the only one I think that could be a regression I think but AFAIK the take kernel benches were a bit noisy on the instance before as well. On my (apple m2) they are consistently somewhat better.

alamb · 2025-11-16T22:37:58Z

take i32 null indices 1024 1.21 991.9±8.09ns ? ?/sec 1.00 816.6±1.93ns ? ?/sec

given the overall time is measured in ns (nanoseconds!) I agree this is likely noise

Maybe we should crank up the size of those batches to 8192 or something to better match typical usecases

jhorstmann · 2025-11-17T10:54:43Z

Looks good to me, nice performance improvement!

@Dandandan

# Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Follow on to #8849 # Rationale for this change While reviewing #8849 from @Dandandan I realized we don't have good examples for going from MutableBuffer to Vec, so it may not be clear to users how to use it. So let's add some examples in the docs for MutableBuffer # What changes are included in this PR? Add examples / update docs # Are these changes tested? By CI # Are there any user-facing changes? New docs, no functional changes --------- Co-authored-by: Ed Seidl <[email protected]>

Speed up collect_bool

e3807c7

github-actions bot added the arrow Changes to the arrow crate label Nov 14, 2025

Dandandan added 2 commits November 14, 2025 20:48

Second place

b0086f4

fix

de5a584

Dandandan marked this pull request as draft November 14, 2025 20:01

Dandandan added 3 commits November 14, 2025 21:03

Remove buffer truncation

dcb4f96

Remove truncation of buffer to align with byte size.

Truncate

6a39497

Fix doc

507a9fb

Dandandan marked this pull request as ready for review November 14, 2025 23:02

Dandandan added 2 commits November 15, 2025 14:28

optimize

b1a87ea

comment

f9004a2

Dandandan changed the title ~~Speed up collect_bool and remove unsafe~~ Speed up collect_bool and remove unsafe, optimize take_bits a bit more Nov 15, 2025

Dandandan changed the title ~~Speed up collect_bool and remove unsafe, optimize take_bits a bit more~~ Speed up collect_bool and remove unsafe, optimize take_bits a bit more Nov 15, 2025

optimize take_native with nulls

964a076

Dandandan changed the title ~~Speed up collect_bool and remove unsafe, optimize take_bits a bit more~~ Speed up collect_bool and remove unsafe, optimize take_bits, take_native a bit more for null values Nov 15, 2025

Fix

f93da94

Dandandan changed the title ~~Speed up collect_bool and remove unsafe, optimize take_bits, take_native a bit more for null values~~ Speed up collect_bool and remove unsafe, optimize take_bits, take_native for null values Nov 15, 2025

alamb added the performance label Nov 16, 2025

alamb mentioned this pull request Nov 16, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-17 apache/datafusion#18711

Open

44 tasks

alamb approved these changes Nov 16, 2025

View reviewed changes

Dandandan and others added 2 commits November 16, 2025 13:16

Apply suggestion from @alamb

973121c

Co-authored-by: Andrew Lamb <[email protected]>

Apply suggestion from @alamb

07e7a53

Co-authored-by: Andrew Lamb <[email protected]>

alamb mentioned this pull request Nov 16, 2025

docs: Add examples for creating MutableBuffer from Vec #8852

Merged

alamb merged commit d13e46a into apache:main Nov 17, 2025
26 checks passed

alamb mentioned this pull request Nov 17, 2025

Clean up take_native to avoid a panic #8861

Closed

alamb mentioned this pull request Nov 19, 2025

Speed up collect_bool and remove unsafe #8848

Closed

Speed up collect_bool and remove unsafe, optimize take_bits, take_native for null values #8849

Speed up collect_bool and remove unsafe, optimize take_bits, take_native for null values #8849

Uh oh!

Conversation

Dandandan commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

alamb commented Nov 16, 2025

Uh oh!

alamb commented Nov 16, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Nov 16, 2025

Choose a reason for hiding this comment

Uh oh!

Dandandan Nov 16, 2025

Choose a reason for hiding this comment

Uh oh!

alamb Nov 16, 2025

Choose a reason for hiding this comment

Uh oh!

alamb Nov 16, 2025

Choose a reason for hiding this comment

Uh oh!

Dandandan Nov 16, 2025

Choose a reason for hiding this comment

Uh oh!

jhorstmann Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

alamb Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Dandandan commented Nov 16, 2025

Uh oh!

alamb commented Nov 16, 2025

Uh oh!

jhorstmann commented Nov 17, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Speed up `collect_bool` and remove `unsafe`, optimize `take_bits`, `take_native` for null values #8849

Speed up `collect_bool` and remove `unsafe`, optimize `take_bits`, `take_native` for null values #8849

Dandandan commented Nov 14, 2025 •

edited

Loading