Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Nov 14, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

  • Some more performance
  • Remove unsafe
take bool 512           time:   [450.90 ns 455.52 ns 463.35 ns]
                        change: [−13.852% −12.345% −10.584%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

take bool 1024          time:   [810.17 ns 825.99 ns 857.14 ns]
                        change: [−7.4751% −5.4268% −2.8132%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

take bool null indices 1024
                        time:   [963.37 ns 967.58 ns 973.34 ns]
                        change: [−5.1713% −4.3630% −3.5176%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

take bool null values 1024
                        time:   [1.5872 µs 1.5936 µs 1.6034 µs]
                        change: [−8.4184% −7.6316% −6.9030%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

take bool null values null indices 1024
                        time:   [1.9172 µs 1.9262 µs 1.9403 µs]
                        change: [−4.4966% −3.9076% −3.3188%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
take i32 null indices 1024
                        time:   [539.55 ns 541.72 ns 545.28 ns]
                        change: [−6.8747% −6.5190% −6.0826%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

take i32 null values 1024
                        time:   [1.2699 µs 1.2750 µs 1.2823 µs]
                        change: [−4.8767% −4.4049% −3.8817%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

take i32 null values null indices 1024
                        time:   [1.4711 µs 1.4781 µs 1.4874 µs]
                        change: [−6.3342% −4.7819% −3.6249%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.

What changes are included in this PR?

Changes to use Vec API (extend, push) rather than MutableBuffer and unsafe code.

Are these changes tested?

Existing tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 14, 2025
@Dandandan Dandandan marked this pull request as draft November 14, 2025 20:01
Remove truncation of buffer to align with byte size.
@Dandandan Dandandan marked this pull request as ready for review November 14, 2025 23:02
@Dandandan Dandandan changed the title Speed up collect_bool and remove unsafe Speed up collect_bool and remove unsafe, optimize take_bits a bit more Nov 15, 2025
@Dandandan Dandandan changed the title Speed up collect_bool and remove unsafe, optimize take_bits a bit more Speed up collect_bool and remove unsafe, optimize take_bits a bit more Nov 15, 2025
@Dandandan Dandandan changed the title Speed up collect_bool and remove unsafe, optimize take_bits a bit more Speed up collect_bool and remove unsafe, optimize take_bits, take_native a bit more for null values Nov 15, 2025
@Dandandan Dandandan changed the title Speed up collect_bool and remove unsafe, optimize take_bits, take_native a bit more for null values Speed up collect_bool and remove unsafe, optimize take_bits, take_native for null values Nov 15, 2025
@alamb
Copy link
Contributor

alamb commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing collect_bool (f93da94) to 9d7dde2 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=collect_bool
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Nov 16, 2025

🤖: Benchmark completed

Details

group                                                                     collect_bool                           main
-----                                                                     ------------                           ----
take bool 1024                                                            1.00   1331.6±2.76ns        ? ?/sec    1.03   1368.9±1.62ns        ? ?/sec
take bool 512                                                             1.00    730.4±1.08ns        ? ?/sec    1.10    805.7±1.45ns        ? ?/sec
take bool null indices 1024                                               1.00   1645.3±7.27ns        ? ?/sec    1.13   1853.0±8.36ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.03µs        ? ?/sec    1.05      2.7±0.00µs        ? ?/sec
take bool null values null indices 1024                                   1.00      3.1±0.05µs        ? ?/sec    1.12      3.5±0.02µs        ? ?/sec
take check bounds i32 1024                                                1.00   1620.9±2.39ns        ? ?/sec    1.00   1621.8±2.39ns        ? ?/sec
take check bounds i32 512                                                 1.16   942.9±10.10ns        ? ?/sec    1.00    811.3±1.10ns        ? ?/sec
take i32 1024                                                             1.00    712.8±8.88ns        ? ?/sec    1.00    711.4±1.12ns        ? ?/sec
take i32 512                                                              1.14    441.2±0.77ns        ? ?/sec    1.00    388.6±0.75ns        ? ?/sec
take i32 null indices 1024                                                1.21    991.9±8.09ns        ? ?/sec    1.00    816.6±1.93ns        ? ?/sec
take i32 null values 1024                                                 1.00      2.0±0.08µs        ? ?/sec    1.02      2.1±0.00µs        ? ?/sec
take i32 null values null indices 1024                                    1.03      2.7±0.01µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.03      8.3±0.10µs        ? ?/sec    1.00      8.0±0.02µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      9.1±0.11µs        ? ?/sec    1.09      9.9±0.10µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.03     21.1±0.10µs        ? ?/sec    1.00     20.6±0.06µs        ? ?/sec
take str 1024                                                             1.01     11.3±0.05µs        ? ?/sec    1.00     11.2±0.04µs        ? ?/sec
take str 512                                                              1.00      5.4±0.02µs        ? ?/sec    1.00      5.4±0.02µs        ? ?/sec
take str null indices 1024                                                1.00      7.8±0.09µs        ? ?/sec    1.01      7.9±0.03µs        ? ?/sec
take str null indices 512                                                 1.02      3.9±0.04µs        ? ?/sec    1.00      3.8±0.02µs        ? ?/sec
take str null values 1024                                                 1.04      8.9±0.12µs        ? ?/sec    1.00      8.5±0.03µs        ? ?/sec
take str null values null indices 1024                                    1.00      7.4±0.10µs        ? ?/sec    1.04      7.7±0.04µs        ? ?/sec
take stringview 1024                                                      1.01   857.7±11.93ns        ? ?/sec    1.00    851.8±2.01ns        ? ?/sec
take stringview 512                                                       1.01    555.7±7.08ns        ? ?/sec    1.00    552.1±1.83ns        ? ?/sec
take stringview null indices 1024                                         1.00  1403.4±23.00ns        ? ?/sec    1.01   1414.0±2.00ns        ? ?/sec
take stringview null indices 512                                          1.00    687.7±8.99ns        ? ?/sec    1.01    692.0±1.12ns        ? ?/sec
take stringview null values 1024                                          1.02      2.1±0.00µs        ? ?/sec    1.00      2.0±0.00µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.9±0.02µs        ? ?/sec    1.05      3.0±0.02µs        ? ?/sec

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Dandandan -- I think this looks great to me. The benchmarks look good and I reviewed the code carefully.

I left some other suggestions, but I don't think they are required

🚢 🇮🇹

/// the option to efficiently negate the result
fn collect_bool(len: usize, neg: bool, f: impl Fn(usize) -> bool) -> BooleanBuffer {
let mut buffer = MutableBuffer::new(ceil(len, 64) * 8);
let mut buffer = Vec::with_capacity(ceil(len, 64));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could make the neg a generic argument on MutableBuffer::collect_bool and then avoid the duplication (as a follow on PR)

Or maybe make a collect_bool function in bit_util that returns a Vec and have the mutable buffer and this one call it 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I was thinking about it as well.

None => match n.is_null(idx) {
true => T::default(),
false => panic!("Out-of-bounds index {index:?}"),
// SAFETY: idx<indices.len()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to read this several times to convince myself it is correct -- namely that idx i doesn't come from indices (provided by the user) but instead comes from iterating indices

I found this whole method actually pretty confusing as there are multiple things called values and indices (and indices.values()..)

I also double checked that there is a test for out of bounds indexes here:
https://github.com/apache/arrow-rs/blob/f93da94e61e731344ce84146dee946a94fe36602/arrow-select/src/take.rs#L2084-L2083

false => panic!("Out-of-bounds index {index:?}"),
// SAFETY: idx<indices.len()
None => match unsafe { n.inner().value_unchecked(idx) } {
false => T::default(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan do you know why this even bothers to look at the null buffer again? I realize you didn't change this code, but it seems to me like checking n.inner() (the nulls) is unecessary - it was already implicitly checked by calling values.get() (which returns Some/None).

It seems like all this is doing is re-checking that value() and the nulls match up.

so, TLDR I suggest we remove this clause entirely (could be a follow on PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can't remove it - it checks the indices value is null as well to make sure out of bounds on a non-null value leads to a panic.

So currently:

  • out of bound with a null index value => default value (+ null in the output)
  • out of bounds with a non-null value => panic

We could consider out of bounds either always panics or always gives a default (0) value but the current API (and tests) requires it to be this way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks correct to me, but is not really intuitive. In our microbenchmarks the current code might faster because it avoids a branch on the validity bit in the happy case, but I'm not sure that will still be faster on larger inputs, or if a larger amount of indices is null.

I would find the following more intuitive, and hopefully not much slower (slice, range and zip iteration should all be TrustedLen):

    indices
            .values()
            .iter()
            .zip((0..n.len()).map(move |i| unsafe { n.inner().value_unchecked(i) }))
            .map(|(index, valid)| if valid {
                values[index.as_usize()]
            } else {
                T::default()
            })
            .collect()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a PR with this change:

Dandandan and others added 2 commits November 16, 2025 13:16
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
@Dandandan
Copy link
Contributor Author

take i32 null indices 1024                                                1.21    991.9±8.09ns        ? ?/sec    1.00    816.6±1.93ns        ? ?/sec

This is the only one I think that could be a regression I think but AFAIK the take kernel benches were a bit noisy on the instance before as well. On my (apple m2) they are consistently somewhat better.

@alamb
Copy link
Contributor

alamb commented Nov 16, 2025

take i32 null indices 1024 1.21 991.9±8.09ns ? ?/sec 1.00 816.6±1.93ns ? ?/sec

given the overall time is measured in ns (nanoseconds!) I agree this is likely noise

Maybe we should crank up the size of those batches to 8192 or something to better match typical usecases

@jhorstmann
Copy link
Contributor

Looks good to me, nice performance improvement!

@alamb alamb merged commit d13e46a into apache:main Nov 17, 2025
26 checks passed
alamb added a commit that referenced this pull request Nov 18, 2025
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Follow on to #8849

# Rationale for this change

While reviewing #8849 from
@Dandandan

I realized we don't have good examples for going from MutableBuffer to
Vec, so it may not be clear to users how to use it.

So let's add some examples in the docs for MutableBuffer

# What changes are included in this PR?

Add examples / update docs

# Are these changes tested?
By CI
# Are there any user-facing changes?

New docs, no functional changes

---------

Co-authored-by: Ed Seidl <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Speed up collect_bool and remove unsafe

3 participants