Skip to content

Conversation

@lins05
Copy link
Contributor

@lins05 lins05 commented Feb 11, 2017

What changes were proposed in this pull request?

Copied from SPARK-19555 JIRA

Spark's StringUtils.escapeLikeRegex() method is written inefficiently, performing tons of object allocations due to the use zip(), flatMap() , and mkString. Instead, I think method should be rewritten in an imperative style using a Java string builder.

This method can become a performance bottleneck in cases where regex expressions are used with non-constant-foldable expressions (e.g. the regex expression comes from the data rather than being part of the query).

How was this patch tested?

Existing tests.

Performance Comparison

number of rows before after
10M 12s 4.5s
100M 120s 45s

Perf testing code:

./bin/spark-shell --master "local-cluster[2,1,10240]"
    def perf(f: => Unit, n: Int = 3): Unit = {
      for (i <- 0 to n - 1) {
        val before = System.currentTimeMillis
        f
        val after = System.currentTimeMillis
        println(s"function took ${after - before} ms")
      }
    }

    val n = 10000000
    val df = sc.parallelize(0 to n, 2).map(i => (i.toString, "%9999")).toDF("a", "b").cache
    df.count()
    df.createOrReplaceTempView("df")

    perf {
      sql("select count(*) from df where a like b").show()
    }

@SparkQA
Copy link

SparkQA commented Feb 11, 2017

Test build #72733 has started for PR 16893 at commit e68eab0.

@JoshRosen
Copy link
Contributor

jenkins retest this please

Copy link
Contributor

@tejasapatil tejasapatil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR is about improving performance, would recommend trying out with while loop and a counter instead of for(). I ran your test snippet on my box:

with for():

function took 5287 ms
function took 5234 ms
function took 5149 ms

with while loop and counter:

function took 4762 ms
function took 4216 ms
function took 4212 ms

Here is my version with while loop and counter:

  def escapeLikeRegex(input: String): String = {
    val builder = new StringBuilder("(?s)")
    val length = input.length

    var i = 0
    var previousChar = ' '

    while (i < length) {
      val currentChar = input.charAt(i)

      if (currentChar != '\\') {
        val out = if (previousChar == '\\') {
          currentChar match {
            case '_' => "_"
            case '%' => "%"
            case _ => Pattern.quote("\\" + currentChar)
          }
        } else {
          currentChar match {
            case '_' => "."
            case '%' => ".*"
            case _ => Pattern.quote(Character.toString(currentChar))
          }
        }
        builder.append(out)
      }
      previousChar = currentChar
      i += 1
    }
    builder.toString()
  }


// replace the _ with .{1} exactly match 1 time of any character
// replace the % with .*, match 0 or more times with any character
def escapeLikeRegex(v: String): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: while you are at it, can you also make the var names better ?

v -> input
c -> currentChar
prev -> previousChar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could further simplify this by replacing previousChar with a boolean, nextCharacterIsEscaped (or inEscape).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm wrong: we can't quite do that simplification because the old code has a subtle bug related to backslash-escaping. Due to the complexity and terseness the old implementation, it's a little non-obvious to spot that the case (prev, '\\') => "" has the effect of always ignoring backslash characters, so this method is incapable of producing a backslash in its output. This is a problem if the user wants to write a LIKE pattern to match backslashes then this is impossible with the current code.

It turns out that this is covered by #15398, which also implements performance improvements for this code, so I guess this PR and JIRA is redundant :(

I thought #15398 had been merged / fixed by now, but I guess not.

@SparkQA
Copy link

SparkQA commented Feb 12, 2017

Test build #72788 has finished for PR 16893 at commit e68eab0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Hi all, where are we on this? Is this still active?

@JoshRosen
Copy link
Contributor

@HyukjinKwon, we can close this because its optimizations were incorporated into SPARK-17647 (I ran the benchmarks to verify this, too). I'm going to resolve this JIRA as fixed by that one as well.

@asfgit asfgit closed this in 5d2750a May 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants