Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 6 additions & 4 deletions datafusion/physical-expr/src/aggregate/count.rs
Original file line number Diff line number Diff line change
Expand Up @@ -123,7 +123,7 @@ impl GroupsAccumulator for CountGroupsAccumulator {
self.counts.resize(total_num_groups, 0);
accumulate_indices(
group_indices,
values.nulls(), // ignore values
values.logical_nulls().as_ref(),
opt_filter,
|group_index| {
self.counts[group_index] += 1;
Expand Down Expand Up @@ -198,16 +198,18 @@ fn null_count_for_multiple_cols(values: &[ArrayRef]) -> usize {
if values.len() > 1 {
let result_bool_buf: Option<BooleanBuffer> = values
.iter()
.map(|a| a.nulls())
.map(|a| a.logical_nulls())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slightly worried about the need to allocate a new null buffer each time, even for arrays when we could just use the exising one

This is particularly concerning given this is on the critical path in aggregates

I reviewed the logical_nulls method --
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls and I see the issue is that it returns an owned Option

What would you think about implemeting a method in DataFusion that avoids the copy if it is not necessary, like

fn logical_nulls(arr: &ArrayRef) -> Cow<'_, Option<BooleanBuffer>> {
  
}

That only creates the nulll buffer for NullArrays?

Then we can propose upstreaming that back to arrow-rs to avoid the potential performance issue

I know the Cow thing is not always the easiest to make happen -- if you need help I can try and find time to help code it up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how I would implement this outside of the Array trait while ensuring that all cases are covered. Originally I had some branching logic based on the datatype but removed it after the discussion here: #8511 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah @alamb in the end NullBuffer has Arc<Bytes> so it mostly clones this + a few usizes etc. While not ideal I don't think it will be very expensive?
https://arrow.apache.org/rust/arrow_buffer/buffer/immutable/struct.Buffer.html

But I like the suggestion of returning a reference or Cow in arrow-rs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not opposed to this PR, but I would prefer to have the Cow thing. Let me see if I can whip it up quickly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my proposal for improvement: coralogix#221

.fold(None, |acc, b| match (acc, b) {
(Some(acc), Some(b)) => Some(acc.bitand(b.inner())),
(Some(acc), None) => Some(acc),
(None, Some(b)) => Some(b.inner().clone()),
(None, Some(b)) => Some(b.into_inner()),
_ => None,
});
result_bool_buf.map_or(0, |b| values[0].len() - b.count_set_bits())
} else {
values[0].null_count()
values[0]
.logical_nulls()
.map_or(0, |nulls| nulls.null_count())
}
}

Expand Down
5 changes: 5 additions & 0 deletions datafusion/physical-expr/src/aggregate/count_distinct.rs
Original file line number Diff line number Diff line change
Expand Up @@ -152,7 +152,12 @@ impl Accumulator for DistinctCountAccumulator {
if values.is_empty() {
return Ok(());
}

let arr = &values[0];
if arr.data_type() == &DataType::Null {
return Ok(());
}

(0..arr.len()).try_for_each(|index| {
if !arr.is_null(index) {
let scalar = ScalarValue::try_from_array(arr, index)?;
Expand Down
17 changes: 16 additions & 1 deletion datafusion/sqllogictest/test_files/aggregate.slt
Original file line number Diff line number Diff line change
Expand Up @@ -1492,6 +1492,12 @@ SELECT count(c1, c2) FROM test
----
3

# count_null
query III
SELECT count(null), count(null, null), count(distinct null) FROM test
----
0 0 0

# count_multi_expr_group_by
query I
SELECT count(c1, c2) FROM test group by c1 order by c1
Expand All @@ -1501,6 +1507,15 @@ SELECT count(c1, c2) FROM test group by c1 order by c1
2
0

# count_null_group_by
query III
SELECT count(null), count(null, null), count(distinct null) FROM test group by c1 order by c1
----
0 0 0
0 0 0
0 0 0
0 0 0

# aggreggte_with_alias
query II
select c1, sum(c2) as `Total Salary` from test group by c1 order by c1
Expand Down Expand Up @@ -3241,4 +3256,4 @@ select count(*) from (select count(*) from (select 1));
query I
select count(*) from (select count(*) a, count(*) b from (select 1));
----
1
1