Skip to content

Conversation

@joroKr21
Copy link
Contributor

Use logical_nulls when the array data type is Null.

Which issue does this PR close?

Closes #8509.

Rationale for this change

The semantics of NullArray in Arrow are confusing: apache/arrow-rs#4838
So we have to handle the Null data type in a special way.

What changes are included in this PR?

Patches in count and count distinct accumulators to handle the Null data type or use logical_nulls when appropriate.

Are these changes tested?

Yes, added SQL tests.

Are there any user-facing changes?

Yes, count and count distinct now behave consistently wrt null regardless of the data type.

@github-actions github-actions bot added physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) labels Dec 12, 2023
} else {
accumulate_indices(
group_indices,
values.nulls(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use logical_nulls for all cases.

Copy link
Contributor Author

@joroKr21 joroKr21 Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logical_nulls returns an owned value, so in most cases it will clone, that's why I added a branch

Copy link
Contributor

@Dandandan Dandandan Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm that's a bit unfortunate 🤔
The clone is relatively cheap though, as the buffer holding the bitmap is wrapped in Arc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok, then we can use it I guess. I assumed that's why they don't want to change it in Arrow.

if values.data_type() == &DataType::Null {
values.len()
} else {
values.null_count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be all based on logical_nulls as well

@joroKr21 joroKr21 force-pushed the bugfix/count-null branch 2 times, most recently from e7fadf6 to 23097f5 Compare December 12, 2023 10:12
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this contribution @joroKr21 -- the basic idea is great. I am just worried about the use of the logical_nulls method as it copies things under the covers that may cause performance slowdowns.

I offered an alternate suggestion -- let me know what you think

let result_bool_buf: Option<BooleanBuffer> = values
.iter()
.map(|a| a.nulls())
.map(|a| a.logical_nulls())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slightly worried about the need to allocate a new null buffer each time, even for arrays when we could just use the exising one

This is particularly concerning given this is on the critical path in aggregates

I reviewed the logical_nulls method --
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls and I see the issue is that it returns an owned Option

What would you think about implemeting a method in DataFusion that avoids the copy if it is not necessary, like

fn logical_nulls(arr: &ArrayRef) -> Cow<'_, Option<BooleanBuffer>> {
  
}

That only creates the nulll buffer for NullArrays?

Then we can propose upstreaming that back to arrow-rs to avoid the potential performance issue

I know the Cow thing is not always the easiest to make happen -- if you need help I can try and find time to help code it up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how I would implement this outside of the Array trait while ensuring that all cases are covered. Originally I had some branching logic based on the datatype but removed it after the discussion here: #8511 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah @alamb in the end NullBuffer has Arc<Bytes> so it mostly clones this + a few usizes etc. While not ideal I don't think it will be very expensive?
https://arrow.apache.org/rust/arrow_buffer/buffer/immutable/struct.Buffer.html

But I like the suggestion of returning a reference or Cow in arrow-rs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not opposed to this PR, but I would prefer to have the Cow thing. Let me see if I can whip it up quickly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my proposal for improvement: coralogix#221

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @joroKr21 and @Dandandan

It would be great to do some benchmark runs to show this doesn't impact performance, but I think I am over worrying this extra copy of the NullBuffer -- it is 48 bytes and one increment

@alamb
Copy link
Contributor

alamb commented Dec 14, 2023

FWIW I want to be clear I think we can revert cecc493 unless it shows a performance improvement. I am sorry for all the noise

@joroKr21
Copy link
Contributor Author

Oh sorry, I saw your comment too late. I will force push in that case.

Use `logical_nulls` when the array data type is `Null`.
@joroKr21
Copy link
Contributor Author

I don't know how long it takes to run the benchmarks. I could probably do a run during the weekend.

@alamb
Copy link
Contributor

alamb commented Dec 14, 2023

I don't know how long it takes to run the benchmarks. I could probably do a run during the weekend.

I'll run some now as I have it all setup. I'll post them here when ready

@alamb
Copy link
Contributor

alamb commented Dec 14, 2023

I ran benchmarks and my conclusion is that this branch doesn't change the performance and any changes are within the level of noise

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ bugfix_count-null ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     3.64ms │            3.74ms │     no change │
│ QQuery 1     │   135.02ms │          138.12ms │     no change │
│ QQuery 2     │   299.58ms │          312.29ms │     no change │
│ QQuery 3     │   280.64ms │          293.30ms │     no change │
│ QQuery 4     │  2783.49ms │         2773.07ms │     no change │
│ QQuery 5     │  4050.48ms │         3897.47ms │     no change │
│ QQuery 6     │   125.78ms │          123.34ms │     no change │
│ QQuery 7     │   140.97ms │          141.36ms │     no change │
│ QQuery 8     │  4433.68ms │         4286.59ms │     no change │
│ QQuery 9     │  7768.12ms │         7501.32ms │     no change │
│ QQuery 10    │  1120.87ms │         1073.25ms │     no change │
│ QQuery 11    │  1206.52ms │         1178.95ms │     no change │
│ QQuery 12    │  3528.50ms │         3434.86ms │     no change │
│ QQuery 13    │  6754.81ms │         6778.36ms │     no change │
│ QQuery 14    │  3859.19ms │         3913.48ms │     no change │
│ QQuery 15    │  3078.75ms │         3047.18ms │     no change │
│ QQuery 16    │  8303.34ms │         8266.43ms │     no change │
│ QQuery 17    │  7910.97ms │         7943.00ms │     no change │
│ QQuery 18    │ 16263.82ms │        15770.30ms │     no change │
│ QQuery 19    │   235.28ms │          228.10ms │     no change │
│ QQuery 20    │  3989.39ms │         3658.45ms │ +1.09x faster │
│ QQuery 21    │  4972.73ms │         4778.85ms │     no change │
│ QQuery 22    │ 13516.67ms │        13320.12ms │     no change │
│ QQuery 23    │ 35451.65ms │        33649.33ms │ +1.05x faster │
│ QQuery 24    │  1885.83ms │         1919.50ms │     no change │
│ QQuery 25    │  1630.22ms │         1604.27ms │     no change │
│ QQuery 26    │  2034.69ms │         2042.98ms │     no change │
│ QQuery 27    │  5236.29ms │         5144.58ms │     no change │
│ QQuery 28    │ 42455.40ms │        41939.68ms │     no change │
│ QQuery 29    │  1493.39ms │         1500.64ms │     no change │
│ QQuery 30    │  3408.16ms │         3666.76ms │  1.08x slower │
│ QQuery 31    │  4357.88ms │         4430.19ms │     no change │
│ QQuery 32    │ 23382.55ms │        22710.19ms │     no change │
│ QQuery 33    │ 18094.43ms │        17377.25ms │     no change │
│ QQuery 34    │ 19085.58ms │        18796.15ms │     no change │
│ QQuery 35    │  5181.19ms │         5264.16ms │     no change │
│ QQuery 36    │   697.38ms │          730.89ms │     no change │
│ QQuery 37    │   368.22ms │          392.02ms │  1.06x slower │
│ QQuery 38    │   324.24ms │          325.51ms │     no change │
│ QQuery 39    │  1643.98ms │         1595.20ms │     no change │
│ QQuery 40    │   186.05ms │          176.14ms │ +1.06x faster │
│ QQuery 41    │   165.92ms │          146.29ms │ +1.13x faster │
│ QQuery 42    │   174.23ms │          171.96ms │     no change │
└──────────────┴────────────┴───────────────────┴───────────────┘
--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ bugfix_count-null ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  345.57ms │          337.26ms │     no change │
│ QQuery 2     │   85.16ms │           80.29ms │ +1.06x faster │
│ QQuery 3     │  193.96ms │          190.57ms │     no change │
│ QQuery 4     │  197.21ms │          201.03ms │     no change │
│ QQuery 5     │  333.08ms │          339.23ms │     no change │
│ QQuery 6     │   37.63ms │           35.96ms │     no change │
│ QQuery 7     │  890.08ms │          881.96ms │     no change │
│ QQuery 8     │  172.86ms │          167.49ms │     no change │
│ QQuery 9     │  192.89ms │          200.49ms │     no change │
│ QQuery 10    │  393.93ms │          403.95ms │     no change │
│ QQuery 11    │   66.91ms │           65.34ms │     no change │
│ QQuery 12    │  231.26ms │          239.59ms │     no change │
│ QQuery 13    │  173.97ms │          172.41ms │     no change │
│ QQuery 14    │   71.21ms │           71.20ms │     no change │
│ QQuery 15    │  224.79ms │          225.91ms │     no change │
│ QQuery 16    │   79.26ms │           76.84ms │     no change │
│ QQuery 17    │  228.44ms │          238.25ms │     no change │
│ QQuery 18    │  762.62ms │          806.16ms │  1.06x slower │
│ QQuery 19    │  114.11ms │          122.09ms │  1.07x slower │
│ QQuery 20    │  267.40ms │          271.21ms │     no change │
│ QQuery 21    │ 1017.97ms │         1032.86ms │     no change │
│ QQuery 22    │   49.09ms │           56.83ms │  1.16x slower │
└──────────────┴───────────┴───────────────────┴───────────────┘
alamb@aal-dev:~/datafusion-benchmarking$

@alamb alamb merged commit 06d3bcc into apache:main Dec 14, 2023
@alamb
Copy link
Contributor

alamb commented Dec 14, 2023

Thanks @joroKr21

@joroKr21 joroKr21 deleted the bugfix/count-null branch December 14, 2023 21:35
@joroKr21 joroKr21 restored the bugfix/count-null branch December 14, 2023 21:35
@joroKr21 joroKr21 deleted the bugfix/count-null branch February 1, 2024 06:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

COUNT(NULL) should always return 0

3 participants