-
Notifications
You must be signed in to change notification settings - Fork 1.1k
perf: only encode actual list values in RowConverter (16-26 times faster for small sliced list)
#7996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: only encode actual list values in RowConverter (16-26 times faster for small sliced list)
#7996
Conversation
Waiting for: - apache#7994 to be merged first Closes apache#7993
| let first_offset = list_array.offsets()[0] as usize; | ||
| let last_offset = | ||
| list_array.offsets()[list_array.offsets().len() - 1] as usize; | ||
|
|
||
| list_array | ||
| .values() | ||
| .slice(first_offset, last_offset - first_offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be changed to get_values_sliced() or something after we resolve:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add some comments explaining the context of this code (that as you have said, is non obvious) for future readers
For example, something like
// values can include more data than referenced in the ListArray, only encode
// the referenced values. There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rluvaton -- this code looks good to me
I think we need to add some tests as well prior to merging this PR
For example:
- A round trip encoded sliced list (if it doesn't already exist)
- A test specifically for encoding a smaller part of the list -- for example a test that created a ListArray with 2 rows: a single element list
[1], and a list with 1000 values,[1,2,3,4...1000], sliced to only include the single element list. Then encode it and ensure the encodde size is reasonable
| let first_offset = list_array.offsets()[0] as usize; | ||
| let last_offset = | ||
| list_array.offsets()[list_array.offsets().len() - 1] as usize; | ||
|
|
||
| list_array | ||
| .values() | ||
| .slice(first_offset, last_offset - first_offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add some comments explaining the context of this code (that as you have said, is non obvious) for future readers
For example, something like
// values can include more data than referenced in the ListArray, only encode
// the referenced values. |
Ooops, sorry -- I didn't see #7994 -- looking at that one |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rluvaton -- I still worry that this PR has no tests and thus we run the risk of potentially breaking / reverting it in some future PR.
Would it be possible to add some coverage, either in the form of the "verify the size of the output rows" or a performance benchmark?
|
Pr: Added the tests. Round trip is included in the fuzz test. Number of output rows: already covered in the existing tests The output rows size test will pass on both main and this branch as list encode the entire children as intermediate data, it would later only pick visible list values I can add a performance benchmark that would show how bad for performance it was before |
|
Run the benchmark against this and: these are the results locally: it shows 16 to 26 times faster |
RowConverterRowConverter (16-26 times faster for small sliced list)
#8008) # Which issue does this PR close? N/A # Rationale for this change #7996 (review) # What changes are included in this PR? added to the row format conversion list/large list and sliced list/large list cases # Are these changes tested? Not needed # Are there any user-facing changes? Nope ---- Related to: - #7996
|
🤖 |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rluvaton
|
🤖: Benchmark completed Details
|
|
@alamb the benchmark results does not include the new benchmarks |
Yeah I probably need to merge this PR up from main (so |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Thank you @rluvaton |
Which issue does this PR close?
RowConverteron list should only encode the sliced list values and not the entire data #7993.Rationale for this change
See issue
What changes are included in this PR?
only encode sliced list values and change shift the offset in encoding
Are these changes tested?
Yes in:
Waiting for it to be merged first
Are there any user-facing changes?
only perf