- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1k
          Implement native support StringViewArray for regexp_is_match and regexp_is_match_scalar function, deprecate regexp_is_match_utf8 and regexp_is_match_utf8_scalar
          #6376
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Thanks @tlm365 ❤️ I am running the benchmarks on this PR now and will report back when they are complete | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tlm365 -- this is looking really nice
I wonder if you might also be willing to add StringView to the benchmarks as well, specifically
arrow-rs/arrow/benches/comparison_kernels.rs
Lines 56 to 62 in 704f90b
| fn bench_regexp_is_match_utf8_scalar(arr_a: &StringArray, value_b: &str) { | |
| regexp_is_match_utf8_scalar( | |
| criterion::black_box(arr_a), | |
| criterion::black_box(value_b), | |
| None, | |
| ) | |
| .unwrap(); | 
So that if this code is changed in the future we can ensure it doesn't regress in performance
        
          
                arrow-string/src/regexp.rs
              
                Outdated
          
        
      | ); | ||
| test_flag_utf8!( | ||
| test_utf8_array_regexp_is_match_insensitive_2, | ||
| StringViewArray::from(vec!["arrow", "arrow", "arrow", "arrow", "arrow", "arrow"]), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StringViewArray has special case handling for strings that are more than 12 bytes long (the string data is stored out of band in those cases)
Can you please add tests that have some strings that are longer than 12 bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add tests that have some strings that are longer than 12 bytes?
Yes, noted. I will review and update test cases for this scenario.
        
          
                arrow-string/src/regexp.rs
              
                Outdated
          
        
      | /// See the documentation on [`regexp_is_match_utf8`] for more details. | ||
| pub fn regexp_is_match_utf8_scalar<OffsetSize: OffsetSizeTrait>( | ||
| array: &GenericStringArray<OffsetSize>, | ||
| pub fn regexp_is_match_utf8_scalar<'a, S>( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I think this is a API change (as is the above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an idea of how to update this PR to avoid an API change -- the reason this is important is that a breaking API change would need to wait until the next major release (Dec 2024) per the release schedule: https://github.com/apache/arrow-rs?tab=readme-ov-file#release-versioning-and-schedule
TLDR is I think if we introduced a new function like the following:
fn regexp_is_match(
    array: &dyn Array, 
    regex_array: &dyn Array, 
    flags_array: Option<&dyn Array, >,
) -> Result<BooleanArray, ArrowError> {
..
}
``
We could then support StringView and StringArray and LargeStringArray There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR is I think if we introduced a new function like the following:
@alamb Sounds good 👍 But why do we use &dyn Array for the new regex_is_match function instead of keeping the current implementation?
Or am I misunderstanding you? I understand that we will provide a new regex_is_match function, and mark the current regex_is_match_utf8 function as:
#[deprecated(since="54.0.0", note="please use `regex_is_match` instead")]
pub fn regexp_is_match_utf8(...) { ... }Is that right? 🤔
| 
 @alamb Thanks for reviewing, willing to add benchmark for this one. I will update it soon. | 
Signed-off-by: Tai Le Manh <[email protected]>
514847f    to
    e80deea      
    Compare
  
    | Here are the benchmark results (aka this PR doesn't slow down the existing implementation)  | 
c4763a9    to
    65e6839      
    Compare
  
    regexp_is_match_utf8 and regexp_is_match_utf8_scalar functionregexp_is_match and regexp_is_match_scalar function
      Signed-off-by: Tai Le Manh <[email protected]>
65e6839    to
    4fed56b      
    Compare
  
    | I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @tlm365 -- this looks great.
I was reviewing this PR and I had the code checked out locally, so I took the liberty of making a few changes:
- I fixed clippy (was failing due to using deprecated functions)
- I updated the comments / added an example to ease the transition (passing Noneas the flags argument results in type inference errors without some type help)
- Improved the comments in general making it clearer what regexp_is_match does and how it is related to regexp_match
| /// * [`regexp_is_match_scalar`] for matching a single regular expression against an array of strings | ||
| /// * [`regexp_match`] for extracting groups from a string array based on a regular expression | ||
| /// | ||
| /// # Example | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| regex_array: &GenericStringArray<OffsetSize>, | ||
| flags_array: Option<&GenericStringArray<OffsetSize>>, | ||
| ) -> Result<BooleanArray, ArrowError> { | ||
| regexp_is_match(array, regex_array, flags_array) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched the implementation to just call the new function to avoid duplication
regexp_is_match and regexp_is_match_scalar functionregexp_is_match and regexp_is_match_scalar function, deprecate regexp_is_match_utf8 and regexp_is_match_utf8_scalar
      | @tlm365 I wonder if you have a few minutes to review the changes I pushed to this PR. I again I am sorry about the review delays | 
| 
 @alamb it looks very nice 👍 thank you so much for this update! ❤️ | 
| We updated this PR so it was not an API change so removing the label | 

Which issue does this PR close?
Closes #6370.
Rationale for this change
What changes are included in this PR?
Introduce
regexp_is_matchandregexp_is_match_scalar(which can replaceregexp_is_match_utf8andregexp_is_match_utf8_scalar) can perform onStringArray/LargeStringArray/StringViewArrayarguments.Are there any user-facing changes?
No.