-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeRegex method #16893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeRegex method #16893
Conversation
|
Test build #72733 has started for PR 16893 at commit |
|
jenkins retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR is about improving performance, would recommend trying out with while loop and a counter instead of for(). I ran your test snippet on my box:
with for():
function took 5287 ms
function took 5234 ms
function took 5149 ms
with while loop and counter:
function took 4762 ms
function took 4216 ms
function took 4212 ms
Here is my version with while loop and counter:
def escapeLikeRegex(input: String): String = {
val builder = new StringBuilder("(?s)")
val length = input.length
var i = 0
var previousChar = ' '
while (i < length) {
val currentChar = input.charAt(i)
if (currentChar != '\\') {
val out = if (previousChar == '\\') {
currentChar match {
case '_' => "_"
case '%' => "%"
case _ => Pattern.quote("\\" + currentChar)
}
} else {
currentChar match {
case '_' => "."
case '%' => ".*"
case _ => Pattern.quote(Character.toString(currentChar))
}
}
builder.append(out)
}
previousChar = currentChar
i += 1
}
builder.toString()
}
|
|
||
| // replace the _ with .{1} exactly match 1 time of any character | ||
| // replace the % with .*, match 0 or more times with any character | ||
| def escapeLikeRegex(v: String): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: while you are at it, can you also make the var names better ?
v -> input
c -> currentChar
prev -> previousChar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could further simplify this by replacing previousChar with a boolean, nextCharacterIsEscaped (or inEscape).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm wrong: we can't quite do that simplification because the old code has a subtle bug related to backslash-escaping. Due to the complexity and terseness the old implementation, it's a little non-obvious to spot that the case (prev, '\\') => "" has the effect of always ignoring backslash characters, so this method is incapable of producing a backslash in its output. This is a problem if the user wants to write a LIKE pattern to match backslashes then this is impossible with the current code.
It turns out that this is covered by #15398, which also implements performance improvements for this code, so I guess this PR and JIRA is redundant :(
I thought #15398 had been merged / fixed by now, but I guess not.
|
Test build #72788 has finished for PR 16893 at commit
|
|
Hi all, where are we on this? Is this still active? |
|
@HyukjinKwon, we can close this because its optimizations were incorporated into SPARK-17647 (I ran the benchmarks to verify this, too). I'm going to resolve this JIRA as fixed by that one as well. |
What changes were proposed in this pull request?
Copied from SPARK-19555 JIRA
Spark's
StringUtils.escapeLikeRegex()method is written inefficiently, performing tons of object allocations due to the usezip(),flatMap(), andmkString. Instead, I think method should be rewritten in an imperative style using a Java string builder.This method can become a performance bottleneck in cases where regex expressions are used with non-constant-foldable expressions (e.g. the regex expression comes from the data rather than being part of the query).
How was this patch tested?
Existing tests.
Performance Comparison
Perf testing code:
./bin/spark-shell --master "local-cluster[2,1,10240]"