-
Notifications
You must be signed in to change notification settings - Fork 1k
Perf: Add prefix compare for inlined compare and change use of inline_value to inline it to a u128 #7748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Peformance result: 2.53 faster for lt StringViewArray StringViewArray inlined bytes critcmp main issue_7743 --filter "iew"
group issue_7743 main
----- ---------- ----
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar complex 1.00 1250.8±16.53µs ? ?/sec 1.02 1270.1±34.61µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar contains 1.01 1224.8±12.97µs ? ?/sec 1.00 1218.6±18.65µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar ends with 1.00 673.7±3.52µs ? ?/sec 1.00 676.2±6.47µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar starts with 1.00 615.0±6.85µs ? ?/sec 1.00 614.5±6.27µs ? ?/sec
eq StringViewArray StringViewArray 1.04 6.0±0.05ms ? ?/sec 1.00 5.8±0.06ms ? ?/sec
eq StringViewArray StringViewArray inlined bytes 1.00 2.7±0.03ms ? ?/sec 1.01 2.7±0.05ms ? ?/sec
eq long same prefix strings StringViewArray 1.02 512.6±8.67µs ? ?/sec 1.00 503.9±9.59µs ? ?/sec
eq scalar StringViewArray 13 bytes 1.00 3.6±0.04ms ? ?/sec 1.02 3.7±0.03ms ? ?/sec
eq scalar StringViewArray 4 bytes 1.02 3.9±0.03ms ? ?/sec 1.00 3.8±0.04ms ? ?/sec
eq scalar StringViewArray 6 bytes 1.02 3.9±0.03ms ? ?/sec 1.00 3.8±0.04ms ? ?/sec
like_utf8view scalar complex 1.01 81.0±0.86ms ? ?/sec 1.00 80.3±0.62ms ? ?/sec
like_utf8view scalar contains 1.00 76.5±0.59ms ? ?/sec 1.00 76.3±0.95ms ? ?/sec
like_utf8view scalar ends with 13 bytes 1.01 19.0±0.25ms ? ?/sec 1.00 18.9±0.26ms ? ?/sec
like_utf8view scalar ends with 4 bytes 1.00 19.0±0.22ms ? ?/sec 1.00 19.0±0.23ms ? ?/sec
like_utf8view scalar ends with 6 bytes 1.00 18.9±0.18ms ? ?/sec 1.00 19.0±0.24ms ? ?/sec
like_utf8view scalar equals 1.01 14.0±0.15ms ? ?/sec 1.00 13.9±0.16ms ? ?/sec
like_utf8view scalar starts with 13 bytes 1.01 18.3±0.24ms ? ?/sec 1.00 18.2±0.27ms ? ?/sec
like_utf8view scalar starts with 4 bytes 1.00 10.7±0.18ms ? ?/sec 1.00 10.7±0.23ms ? ?/sec
like_utf8view scalar starts with 6 bytes 1.00 18.5±0.22ms ? ?/sec 1.00 18.4±0.24ms ? ?/sec
long same prefix strings like_utf8view scalar complex 1.01 617.5±11.35µs ? ?/sec 1.00 610.9±7.55µs ? ?/sec
long same prefix strings like_utf8view scalar contains 1.01 1834.8±21.26µs ? ?/sec 1.00 1810.7±21.24µs ? ?/sec
long same prefix strings like_utf8view scalar ends with 1.00 605.1±9.37µs ? ?/sec 1.00 603.2±6.33µs ? ?/sec
long same prefix strings like_utf8view scalar equals 1.01 196.9±5.64µs ? ?/sec 1.00 194.2±4.78µs ? ?/sec
long same prefix strings like_utf8view scalar starts with 1.00 669.1±5.56µs ? ?/sec 1.00 666.8±5.22µs ? ?/sec
lt StringViewArray StringViewArray inlined bytes 1.00 19.0±0.39ms ? ?/sec 2.53 48.1±1.10ms ? ?/sec
lt long same prefix strings StringViewArray 1.00 485.9±8.31µs ? ?/sec 1.04 504.3±8.39µs ? ?/sec
lt scalar StringViewArray 1.00 22.3±1.39ms ? ?/sec 1.16 25.8±1.28ms ? ?/sec
neq long same prefix strings StringViewArray 1.00 506.5±7.39µs ? ?/sec 1.00 506.4±10.21µs ? ?/sec |
Note: I still don't change the 4 bytes inline compare, because this case we still need to get the prefix for both, it will not benefit a lot since we will at most has 4 bytes to compare for this inline compare. I will also try to optimize the 4 bytes inline compare later. |
const DATA_MASK: u128 = !0u128 << 32; | ||
|
||
// Remove the length bits, leaving only the data | ||
let l_data = (l_bits & DATA_MASK) >> 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can avoid some of the bit masking by using ByteView
here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Dandandan , reduce the bit masking in latest PR, i think ByteView from will use all convert to the struct field, we only use part of them, so i just use part of logic in latest PR.
impl From<u128> for ByteView {
#[inline]
fn from(value: u128) -> Self {
Self {
length: value as u32,
prefix: (value >> 32) as u32,
buffer_index: (value >> 64) as u32,
offset: (value >> 96) as u32,
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the compiler might be smart enough to see it is actually the same (I changed it somewhere else, but couldn't detect performance difference).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting @Dandandan , let me try using ByteView and compare performance!
arrow-ord/src/cmp.rs
Outdated
let l_data = (l_bits & DATA_MASK) >> 32; | ||
let r_data = (r_bits & DATA_MASK) >> 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here :)
Amazing 😎 !! Perhaps we can use some less manual bit masking/shifting while still producing roughly the same code? |
Thank you @Dandandan for review, i polished the code, now the performance is even better: lt StringViewArray StringViewArray inlined bytes 1.00 16.4±0.50ms ? ?/sec 2.98 48.7±0.71ms ? ?/sec About 3 faster comparing to main branch. |
let min_len = l_len.min(r_len); | ||
// We have all 12 bytes in the high bits, but only want the top min_len | ||
let shift = (12 - min_len) * 8; | ||
let l_partial = l_be >> shift; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be possible to OR the length back into the lower bits, which would then allow getting a result with a single u128 comparison. I think it would also be beneficial to extract this code block into a shared helper function, and add some unit tests for it. The generic code here might not be well convered by tests because of the fast path for inline buffers elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jhorstmann for review, good suggestion, i will address it!
arrow-ord/src/cmp.rs
Outdated
let shift = (12 - min_len) * 8; | ||
let l_partial = l_be >> shift; | ||
let r_partial = r_be >> shift; | ||
if l_partial < r_partial { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be written as l_partial.cmp(r_partial).then_with(|| { l_len.cmp(&r_len) })
return l_len.cmp(&r_len); | ||
} | ||
|
||
// one of the string is larger than 12 bytes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can change this code below to use (l_view >> 32) as u32
as well (or ByteView if it generates the same code). It seems that is a bit faster for the prefix comparison:
lt scalar StringViewArray
time: [34.533 ms 34.567 ms 34.601 ms]
change: [−11.030% −10.827% −10.620%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @Dandandan , i will change to ByteView prefix.
Addressed comments in latest PR, thanks! Updated @Dandandan @jhorstmann @alamb , amazing result for latest PR: 6.x faster for lt inlined bytes and 1.3x faster for lt scalar StringViewArray lt StringViewArray StringViewArray inlined bytes 1.00 7.8±0.29ms ? ?/sec 6.21 48.7±0.71ms ? ?/sec
lt scalar StringViewArray 1.00 19.5±1.97ms ? ?/sec 1.32 25.8±1.28ms ? ?/sec critcmp main issue_7743 --filter "iew"
group issue_7743 main
----- ---------- ----
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar complex 1.00 1208.3±22.24µs ? ?/sec 1.05 1270.1±34.61µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar contains 1.00 1221.9±32.61µs ? ?/sec 1.00 1218.6±18.65µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar ends with 1.00 677.7±9.00µs ? ?/sec 1.00 676.2±6.47µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar starts with 1.03 629.9±17.64µs ? ?/sec 1.00 614.5±6.27µs ? ?/sec
eq StringViewArray StringViewArray 1.00 5.7±0.06ms ? ?/sec 1.02 5.8±0.06ms ? ?/sec
eq StringViewArray StringViewArray inlined bytes 1.00 2.7±0.01ms ? ?/sec 1.00 2.7±0.03ms ? ?/sec
eq long same prefix strings StringViewArray 1.00 503.7±9.42µs ? ?/sec 1.00 503.9±9.59µs ? ?/sec
eq scalar StringViewArray 13 bytes 1.02 3.8±0.03ms ? ?/sec 1.00 3.7±0.03ms ? ?/sec
eq scalar StringViewArray 4 bytes 1.01 3.9±0.03ms ? ?/sec 1.00 3.8±0.04ms ? ?/sec
eq scalar StringViewArray 6 bytes 1.00 3.8±0.03ms ? ?/sec 1.00 3.8±0.04ms ? ?/sec
like_utf8view scalar complex 1.01 81.2±1.62ms ? ?/sec 1.00 80.3±0.62ms ? ?/sec
like_utf8view scalar contains 1.00 75.7±0.91ms ? ?/sec 1.01 76.3±0.95ms ? ?/sec
like_utf8view scalar ends with 13 bytes 1.00 18.8±0.23ms ? ?/sec 1.00 18.9±0.26ms ? ?/sec
like_utf8view scalar ends with 4 bytes 1.00 18.7±0.24ms ? ?/sec 1.01 19.0±0.23ms ? ?/sec
like_utf8view scalar ends with 6 bytes 1.00 18.8±0.24ms ? ?/sec 1.01 19.0±0.24ms ? ?/sec
like_utf8view scalar equals 1.00 13.9±0.11ms ? ?/sec 1.01 13.9±0.16ms ? ?/sec
like_utf8view scalar starts with 13 bytes 1.02 18.5±0.37ms ? ?/sec 1.00 18.2±0.27ms ? ?/sec
like_utf8view scalar starts with 4 bytes 1.00 10.5±0.16ms ? ?/sec 1.02 10.7±0.23ms ? ?/sec
like_utf8view scalar starts with 6 bytes 1.00 18.3±0.17ms ? ?/sec 1.01 18.4±0.24ms ? ?/sec
long same prefix strings like_utf8view scalar complex 1.00 601.3±2.97µs ? ?/sec 1.02 610.9±7.55µs ? ?/sec
long same prefix strings like_utf8view scalar contains 1.00 1786.0±21.54µs ? ?/sec 1.01 1810.7±21.24µs ? ?/sec
long same prefix strings like_utf8view scalar ends with 1.00 591.8±4.87µs ? ?/sec 1.02 603.2±6.33µs ? ?/sec
long same prefix strings like_utf8view scalar equals 1.00 193.7±4.41µs ? ?/sec 1.00 194.2±4.78µs ? ?/sec
long same prefix strings like_utf8view scalar starts with 1.00 667.8±5.49µs ? ?/sec 1.00 666.8±5.22µs ? ?/sec
lt StringViewArray StringViewArray inlined bytes 1.00 7.8±0.29ms ? ?/sec 6.21 48.7±0.71ms ? ?/sec
lt long same prefix strings StringViewArray 1.00 486.2±12.05µs ? ?/sec 1.04 504.3±8.39µs ? ?/sec
lt scalar StringViewArray 1.00 19.5±1.97ms ? ?/sec 1.32 25.8±1.28ms ? ?/sec
neq long same prefix strings StringViewArray 1.00 502.9±12.25µs ? ?/sec 1.01 506.4±10.21µs ? ?/sec |
🤖 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zhuqi-lucas and @Dandandan - this is really exciting to see.
I am not sure about the length handling, but I think concern can be rectified with a few more tests
l_full_data.cmp(r_full_data) | ||
} | ||
|
||
/// Builds a 128-bit composite key for an inline value: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry -- I left comments for this function on the wrong PR: #7792 (comment)
Basically I think it would be very helpful to explain what properties the resulting u128 has
let raw = make_raw_inline(input.len() as u32, input); | ||
let key = GenericByteViewArray::<BinaryViewType>::inline_key_fast(raw); | ||
|
||
// Validate that keys are monotonically increasing in lexicographic+length order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend updating this test with:
- Strings that are the same length but compare lexically different (
aaa
vsaab
for example) - That the comparison is the same as using the GenericBinaryArray accessors
So for example like
let array = GenericBinaryArray::from(test_inputs);
...
// compare using &str semantics
assert!(array.value(i) < array.value(i+1))
// and then compare using the fast key comparison
assert!(make_raw_inline(array.views()[i])< make_raw_inline(arrays.views()[i+1]));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb for good suggestion, i will try to add more testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest PR, thank you @alamb !
BTW I think this is a really nice PR @zhuqi-lucas -- it is am amzing result and quite clever and is a nice illustration of the level of attention of detail required for quality high performance engineering |
🤖: Benchmark completed Details
|
Thank you @alamb, it seems the benchmark has less improvement than my local mac result, but it still good to see %66 and %30 improvement. lt StringViewArray StringViewArray inlined bytes 1.00 29.2±0.14ms ? ?/sec 1.66 48.4±0.12ms ? ?/sec
lt scalar StringViewArray 1.00 49.8±0.12ms ? ?/sec 1.30 64.8±0.14ms ? ?/sec |
Indeed those are pretty amazing results for something like comparisons |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @zhuqi-lucas -- thank you
previous_key = Some(key); | ||
} | ||
|
||
// 2) Cross-check against GenericBinaryArray comparison |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think technically speaking this second loop does everything the first loop does and thus the first loop is redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @alamb , thank you, addressed it in latest PR!
The integration CI seems flaky recently, i am not sure why. |
Yeah, @etseidl and I have seen that too -- I filed a ticket |
Thanks again @zhuqi-lucas and @Dandandan |
# Which issue does this PR close? This is a follow-up for #7748 In theory we can custom string view compare, and make it crazy faster. - Closes [#7790](#7790) # Rationale for this change In theory we can custom string view compare, and make it crazy faster. # What changes are included in this PR? In theory we can custom string view compare, and make it crazy faster. # Are these changes tested? Yes # Are there any user-facing changes? No
I believe this PR introduced a bug in comparisons, see |
Which issue does this PR close?
Closes #7743
Rationale for this change
Change the fast path to use u128 to compare for lt case, also for inline <12 case to use u128 to compare.
Also when we have > 12 data buffer case, we change 4 bytes compare from each byte compare to u32 compare.
What changes are included in this PR?
Change the fast path to use u128 to compare for lt case, also for inline <12 case to use u128 to compare.
Also when we have > 12 data buffer case, we change 4 bytes compare from each byte compare to u32 compare.
Are there any user-facing changes?
No