[SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeRegex method #16893

lins05 · 2017-02-11T07:49:54Z

What changes were proposed in this pull request?

Copied from SPARK-19555 JIRA

Spark's StringUtils.escapeLikeRegex() method is written inefficiently, performing tons of object allocations due to the use zip(), flatMap() , and mkString. Instead, I think method should be rewritten in an imperative style using a Java string builder.

This method can become a performance bottleneck in cases where regex expressions are used with non-constant-foldable expressions (e.g. the regex expression comes from the data rather than being part of the query).

How was this patch tested?

Existing tests.

Performance Comparison

number of rows	before	after
10M	12s	4.5s
100M	120s	45s

Perf testing code:

./bin/spark-shell --master "local-cluster[2,1,10240]"

    def perf(f: => Unit, n: Int = 3): Unit = {
      for (i <- 0 to n - 1) {
        val before = System.currentTimeMillis
        f
        val after = System.currentTimeMillis
        println(s"function took ${after - before} ms")
      }
    }

    val n = 10000000
    val df = sc.parallelize(0 to n, 2).map(i => (i.toString, "%9999")).toDF("a", "b").cache
    df.count()
    df.createOrReplaceTempView("df")

    perf {
      sql("select count(*) from df where a like b").show()
    }

…egex

SparkQA · 2017-02-11T07:52:34Z

Test build #72733 has started for PR 16893 at commit e68eab0.

JoshRosen · 2017-02-12T20:04:48Z

jenkins retest this please

tejasapatil

Since this PR is about improving performance, would recommend trying out with while loop and a counter instead of for(). I ran your test snippet on my box:

with for():

function took 5287 ms
function took 5234 ms
function took 5149 ms

with while loop and counter:

function took 4762 ms
function took 4216 ms
function took 4212 ms

Here is my version with while loop and counter:

  def escapeLikeRegex(input: String): String = {
    val builder = new StringBuilder("(?s)")
    val length = input.length

    var i = 0
    var previousChar = ' '

    while (i < length) {
      val currentChar = input.charAt(i)

      if (currentChar != '\\') {
        val out = if (previousChar == '\\') {
          currentChar match {
            case '_' => "_"
            case '%' => "%"
            case _ => Pattern.quote("\\" + currentChar)
          }
        } else {
          currentChar match {
            case '_' => "."
            case '%' => ".*"
            case _ => Pattern.quote(Character.toString(currentChar))
          }
        }
        builder.append(out)
      }
      previousChar = currentChar
      i += 1
    }
    builder.toString()
  }

tejasapatil · 2017-02-12T22:05:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala


  // replace the _ with .{1} exactly match 1 time of any character
  // replace the % with .*, match 0 or more times with any character
  def escapeLikeRegex(v: String): String = {


nit: while you are at it, can you also make the var names better ?

v -> input
c -> currentChar
prev -> previousChar

I think we could further simplify this by replacing previousChar with a boolean, nextCharacterIsEscaped (or inEscape).

Actually, I'm wrong: we can't quite do that simplification because the old code has a subtle bug related to backslash-escaping. Due to the complexity and terseness the old implementation, it's a little non-obvious to spot that the case (prev, '\\') => "" has the effect of always ignoring backslash characters, so this method is incapable of producing a backslash in its output. This is a problem if the user wants to write a LIKE pattern to match backslashes then this is impossible with the current code.

It turns out that this is covered by #15398, which also implements performance improvements for this code, so I guess this PR and JIRA is redundant :(

I thought #15398 had been merged / fixed by now, but I guess not.

SparkQA · 2017-02-12T22:41:25Z

Test build #72788 has finished for PR 16893 at commit e68eab0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-05-11T14:28:58Z

Hi all, where are we on this? Is this still active?

JoshRosen · 2017-05-17T20:10:51Z

@HyukjinKwon, we can close this because its optimizations were incorporated into SPARK-17647 (I ran the benchmarks to verify this, too). I'm going to resolve this JIRA as fixed by that one as well.

lins05 added 2 commits February 11, 2017 15:46

[SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeR…

bbcdeda

…egex

Add back the comments.

e68eab0

tejasapatil reviewed Feb 12, 2017

View reviewed changes

JoshRosen mentioned this pull request Feb 12, 2017

[SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns. #15398

Closed

3 tasks

HyukjinKwon mentioned this pull request May 17, 2017

[INFRA] Close stale PRs #18017

Closed

asfgit closed this in 5d2750a May 18, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeRegex method #16893

[SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeRegex method #16893

Uh oh!

lins05 commented Feb 11, 2017 •

edited

Loading

Uh oh!

SparkQA commented Feb 11, 2017

Uh oh!

JoshRosen commented Feb 12, 2017

Uh oh!

tejasapatil left a comment •

edited

Loading

Uh oh!

tejasapatil Feb 12, 2017

Uh oh!

JoshRosen Feb 12, 2017

Uh oh!

JoshRosen Feb 12, 2017

Uh oh!

SparkQA commented Feb 12, 2017

Uh oh!

HyukjinKwon commented May 11, 2017

Uh oh!

JoshRosen commented May 17, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeRegex method #16893

[SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeRegex method #16893

Uh oh!

Conversation

lins05 commented Feb 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Performance Comparison

Uh oh!

SparkQA commented Feb 11, 2017

Uh oh!

JoshRosen commented Feb 12, 2017

Uh oh!

tejasapatil left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tejasapatil Feb 12, 2017

Choose a reason for hiding this comment

Uh oh!

JoshRosen Feb 12, 2017

Choose a reason for hiding this comment

Uh oh!

JoshRosen Feb 12, 2017

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Feb 12, 2017

Uh oh!

HyukjinKwon commented May 11, 2017

Uh oh!

JoshRosen commented May 17, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

lins05 commented Feb 11, 2017 •

edited

Loading

tejasapatil left a comment •

edited

Loading