- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.7k
Improve StringView support for SUBSTR #12044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| The following is my latest bench report of   | 
| let mut group = c.benchmark_group("SHORTER THAN 12"); | ||
| group.sampling_mode(SamplingMode::Flat); | ||
| group.sample_size(10); | ||
| group.measurement_time(Duration::from_secs(10)); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we tune the sampling parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually simply copied these settings from benches/repeat. I'm not very sure about this.
| if length == 0 { | ||
| builder.append_null(); | ||
| } else if length > 12 { | ||
| let buffer_index = (*raw >> 64) as u32; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use ByteView from arrow-rs:
https://github.com/apache/arrow-rs/blob/27789d7c9abb50796a4042e7e193703efe3c95b3/arrow-data/src/byte_view.rs#L44-L54
But ByteView is behind arrow-data, which is not explicitly depended by DataFusion, what's your opinion? @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add an explclit dependence on arrow-data similar to 
Line 86 in 950dc73
| arrow-ord = { version = "52.2.0", default-features = false } | 
In general I think it would be better if the arrow crate re-exported these various structures (so DataFusion could depend only on arrow rather than all the sub crates)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made apache/arrow-rs#6275 to export this structure publically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also kind of annoying that the use of inline_value always comes with an immediate call to from_utf8_unchecked
                                let bytes =
                                    StringViewArray::inline_value(raw, length as usize);
                                let str = std::str::from_utf8_unchecked(
                                    &bytes[..length as usize],
                                );I'll try and make a PR in arrow-rs to do that too
| I think the PR is looking good, left some comments | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Kev1n8 and @XiangpengHao -- I agree this PR is very close
This is a really neat PR
The only thing I think is needed is using ByteView, as suggested by @XiangpengHao  in https://github.com/apache/datafusion/pull/12044/files#r1720861699
I left some smaller suggestions, that is the big one
| /// substr('alphabet', 3) = 'phabet' | ||
| /// substr('alphabet', 3, 2) = 'ph' | ||
| /// The implementation uses UTF-8 code points as characters | ||
| fn calculate_substr<'a, V, T>(string_array: V, args: &[ArrayRef]) -> Result<ArrayRef> | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function adds an unecessary level of indirection
Rather than havingsubstr dispatch to calculate_substr which then dispatches to calculate_string or calculate_string_view I suspect the code would be easier to follow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. I didn't realize it earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Kev1n8 and @XiangpengHao -- this is quite cool and I think looking close
As @XiangpengHao  mentions in https://github.com/apache/datafusion/pull/12044/files#r1720861699 , I think we should be using ByteView instead of inlined comparison. If the stuff about using arrow-data isn't clear, I would be happy to work on it
I'll make a PR with some proposed changes to ths one shortly
        
          
                datafusion/functions/src/utils.rs
              
                Outdated
          
        
      | pub(crate) fn optimized_utf8_to_str_type( | ||
| arg_type: &DataType, | ||
| name: &str, | ||
| ) -> Result<DataType> { | ||
| let support_list = ["substr"]; | ||
| if support_list.contains(&name) { | ||
| Ok(DataType::Utf8View) | ||
| } else { | ||
| utf8_to_str_type(arg_type, name) | ||
| } | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔  since this function simply returns Utf8View regardless of the input types, maybe we should just change substr() to directly return Ok(DataType::Utf8View) -- that would probably be the simplest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I think so too.
| 
 I can open a PR for that  Refactor the code again into the following behavior: | 
| Thanks @Kev1n8 ad @XiangpengHao -- I am running some benchmarks on this now. BTW I was thinking that once we have completed  #12119 we could potentially make  | 
| 
 Sounds great. And I think we can also improve the  | 
| Sorry -- here are the results of my benchmark run -- TLDR is this looks great to me  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Kev1n8 and @XiangpengHao - I am sorry for the delay in review / approval / testing of this PR
| I also merged up from main for this PR to get the arrow 53 release (and thus the changes to add arrow-data as a dependency were removed) | 
| Filed #12338 to track the idea of using StringViewArray always | 
| 🚀 | 
… This reverts commit 7ead2ad
… This reverts commit 7ead2ad
… This reverts commit 7ead2ad
Which issue does this PR close?
Closes #12031
Closes #12033
Rationale for this change
What changes are included in this PR?
StringVIewinSUBSTR, operate directly on the views instead of generating newStrings.Utf8Viewcolumn asText.One is both the input and output of
SUBSTRis larger or smaller than 12B, for another the input is larger while the output is smaller than 12B. Also, compareUtf8View,Utf8, andLargeUtf8with each other.Are these changes tested?
yes
Are there any user-facing changes?
no