Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Dec 14, 2023

Which issue does this PR close?

Closes #5208

Rationale for this change

I would like to avoid copies when unecessary

What changes are included in this PR?

  1. Return Cow<NullBuffer> rather than NullBuffer
  2. Update existing code
  3. Add comments and examples to make what is happening clearer (as I think using Cows requires finagling that is sometimes obtuse)

Are there any user-facing changes?

Yes, this is an API change

I tried to make it as easy as possible with documentation

@alamb alamb added the api-change Changes to the arrow API label Dec 14, 2023
@alamb alamb marked this pull request as ready for review December 14, 2023 15:59
@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 14, 2023

pub use binary_array::*;

mod boolean_array;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this re-organisation, it makes it harder to notice if you've missed a re-export

Copy link
Contributor Author

@alamb alamb Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do it on purpose -- my editor must have done it. I will revert it if we proceed with this PR

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if this is showing up in benchmarks as an overhead, a dyn dispatch is already quite expensive. In general something is a bit off if you're calling logical_nulls in a hot loop, as you should probably extract the null mask outside of this loop...

@alamb
Copy link
Contributor Author

alamb commented Dec 14, 2023

🤔 @tustvold and @Dandandan hvae points out that the overhead of copying NullBuffers is probably small

https://github.com/apache/arrow-rs/blob/802ed428f87051fdca31180430ddb0ecb2f60e8b/arrow-buffer/src/buffer/null.rs#L30C15-L33

The size of a NullBuffer is 48 bytes.

    println!("sizeof null buffer: {}", std::mem::size_of::<NullBuffer>());
sizeof null buffer: 48

So this PR would save copying 48 bytes and 1 atomic increment for the common case where the null buffer is not computed

@alamb
Copy link
Contributor Author

alamb commented Dec 14, 2023

I'm curious if this is showing up in benchmarks as an overhead, a dyn dispatch is already quite expensive. In general something is a bit off if you're calling logical_nulls in a hot loop, as you should probably extract the null mask outside of this loop...

No, this is only a theoretical concern

@tustvold
Copy link
Contributor

It is hard for me to approve a breaking change without a strong performance justification. It is also worth noting that Cow adds a branch on access, which might actually be far worse than some atomic increments...

@alamb
Copy link
Contributor Author

alamb commented Dec 14, 2023

It is hard for me to approve a breaking change without a strong performance justification. It is also worth noting that Cow adds a branch on access, which might actually be far worse than some atomic increments...

I agree to merge this we need performance benchmarks. I will try and find time to see if I can find some data one way or the other

@alamb
Copy link
Contributor Author

alamb commented Apr 16, 2024

Sorry for not closing this sooner.

BTW @andygrove added an automatic workflow to close stale PRs (like this one) in DataFusion: apache/datafusion#10046

It may be worth considering something like that for this repo too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid forced copy in Array::logical_nulls

2 participants