[SPARK-6483][SQL]Improve ScalaUdf called performance. #5154

zzcclp · 2015-03-24T05:47:12Z

As described in issue SPARK-6483, ScalaUdf performs poorly because it calls asInstanceOf to convert the function for every record.
With this change, ScalaUdf performs the same as the other cases.
Thanks to @lianhuiwang for telling me how to resolve this problem.

AmplabJenkins · 2015-03-24T05:52:10Z

Can one of the admins verify this patch?

chenghao-intel · 2015-03-24T05:54:31Z

Hmm, have you measured the performance gain from this change? From my understanding, the bottleneck is in the function call ScalaReflection.convertToScala.

zzcclp · 2015-03-24T05:57:16Z

Before this change it took 17 minutes; now it takes 5 minutes, which is the same as HiveContext + udf floor and the non-udf case.

chenghao-intel · 2015-03-24T05:59:43Z

OK, we can probably also move the children.size match {..} out of eval.

chenghao-intel · 2015-03-24T06:21:42Z

I mean we can do something like

// Resolve the arity, the cast, and the child lookups once, outside eval.
val f = children.size match {
  case 1 =>
    val func = function.asInstanceOf[(Any) => Any]
    val child0 = children(0)
    (input: Row) => {
      func(ScalaReflection.convertToScala(child0.eval(input), child0.dataType))
    }

  case 2 =>
    // A two-argument UDF must be cast to a two-argument function type.
    val func = function.asInstanceOf[(Any, Any) => Any]
    val child0 = children(0)
    val child1 = children(1)
    (input: Row) => {
      func(
        ScalaReflection.convertToScala(child0.eval(input), child0.dataType),
        ScalaReflection.convertToScala(child1.eval(input), child1.dataType))
    }
}

// eval only invokes the prebuilt closure.
def eval(input: Row) = f(input)
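
For readers who want to try the idea outside Spark, here is a minimal, self-contained sketch of the same hoisting. Row, Expression, Column, PerRowUdf and HoistedUdf are hypothetical stand-ins invented for this illustration, not Catalyst's real classes, and the type conversion step is left out.

```scala
// Hypothetical stand-ins for Catalyst's Row and Expression; not Spark's real classes.
object HoistedUdfSketch {
  type Row = Array[Any]

  trait Expression { def eval(input: Row): Any }

  // Reads one column of the row by ordinal.
  final case class Column(ordinal: Int) extends Expression {
    def eval(input: Row): Any = input(ordinal)
  }

  // Slow shape: arity match, cast, and child lookup are repeated for every row.
  final case class PerRowUdf(function: AnyRef, children: Seq[Expression]) extends Expression {
    def eval(input: Row): Any = children.size match {
      case 1 =>
        val func = function.asInstanceOf[Any => Any]
        func(children(0).eval(input))
      case 2 =>
        val func = function.asInstanceOf[(Any, Any) => Any]
        func(children(0).eval(input), children(1).eval(input))
      case n =>
        throw new IllegalArgumentException(s"unsupported arity: $n")
    }
  }

  // Fast shape: match, cast, and child lookup happen once, when the expression is built.
  final case class HoistedUdf(function: AnyRef, children: Seq[Expression]) extends Expression {
    private val f: Row => Any = children.size match {
      case 1 =>
        val func = function.asInstanceOf[Any => Any]
        val child0 = children(0)
        (input: Row) => func(child0.eval(input))
      case 2 =>
        val func = function.asInstanceOf[(Any, Any) => Any]
        val child0 = children(0)
        val child1 = children(1)
        (input: Row) => func(child0.eval(input), child1.eval(input))
      case n =>
        throw new IllegalArgumentException(s"unsupported arity: $n")
    }

    // eval only invokes the prebuilt closure.
    def eval(input: Row): Any = f(input)
  }

  def main(args: Array[String]): Unit = {
    val floor300: Int => Int = ts => ts - ts % 300
    val add: (Int, Int) => Int = _ + _
    val row: Row = Array[Any](1234, 21)
    println(PerRowUdf(floor300, Seq(Column(0))).eval(row))
    println(HoistedUdf(floor300, Seq(Column(0))).eval(row))
    println(HoistedUdf(add, Seq(Column(0), Column(1))).eval(row))
  }
}
```

Both shapes return the same values; the only difference is how much work is repeated on each eval call.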

zzcclp · 2015-03-24T07:24:02Z

OK, I will modify the code and test again.

zzcclp · 2015-03-24T08:05:09Z

@chenghao-intel, I changed the code and tested it; the result is the same as with the last commit: 5 minutes.
Please help review the code.

chenghao-intel · 2015-03-24T08:35:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUdf.scala

children is of type Seq[Expression] (essentially a List[Expression]); accessing its elements by index adds runtime overhead, so we'd better move that out of the anonymous functions. See:
http://docs.scala-lang.org/overviews/collections/performance-characteristics.html
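
A small sketch of what "move that out of the anonymous functions" means in practice. The names indexEveryCall and bindOnce are made up for this example, and plain Int => Int functions stand in for child expressions.

```scala
object BindChildrenSketch {
  // Indexes into the Seq on every call; on a List, apply(i) walks i cells each time.
  def indexEveryCall(children: Seq[Int => Int]): Int => Int =
    (x: Int) => children(0)(x) + children(1)(x)

  // Looks the elements up once; the returned closure captures the two vals directly.
  def bindOnce(children: Seq[Int => Int]): Int => Int = {
    val child0 = children(0)
    val child1 = children(1)
    (x: Int) => child0(x) + child1(x)
  }

  def main(args: Array[String]): Unit = {
    val children: Seq[Int => Int] = List((x: Int) => x + 1, (x: Int) => x * 2)
    println(indexEveryCall(children)(3))
    println(bindOnce(children)(3))
  }
}
```

With only one or two children the per-call cost is small, but it is paid once per record, which is why binding the elements up front is the safer default here.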

chenghao-intel · 2015-03-24T08:39:11Z

@zzcclp I will run the benchmark on my local machine and get back to you soon.

@liancheng, can you trigger the unit tests?

chenghao-intel · 2015-03-24T15:48:25Z

Verified the code change with the following micro-benchmark:

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types._

case class Floor(child: Expression) extends UnaryExpression with Predicate {
  override def foldable = child.foldable
  def nullable = child.nullable
  override def toString = s"Floor $child"

  override def eval(input: Row): Any = {
    child.eval(input) match {
      case null => null
      case ts: Int => ts - ts % 300
    }
  }
}

object T {
  def benchmark(count: Int, expr: Expression): Unit = {
    var i = 0
    val row = new GenericRow(Array[Any](123, 21, 42))
    val s = System.currentTimeMillis()
    while (i < count) {
      expr.eval(row)
      i += 1
    }
    val e = System.currentTimeMillis()

    println (s"${expr.getClass.getSimpleName}  -- ${e - s} ms")
  }
  def main(args: Array[String]) {
    def func(ts: Int) = ts - ts % 300
    val udf0 = ScalaUdf(func _, IntegerType, BoundReference(0, IntegerType, true) :: Nil)
    val udf1 = Floor(BoundReference(0, IntegerType, true))

    benchmark(1000000, udf0)
    benchmark(1000000, udf0)
    benchmark(1000000, udf0)

    benchmark(1000000, udf1)
    benchmark(1000000, udf1)
    benchmark(1000000, udf1)
  }
}

Without the code change it outputs
ScalaUdf -- 1183 ms
ScalaUdf -- 887 ms
ScalaUdf -- 929 ms

Floor -- 49 ms
Floor -- 15 ms
Floor -- 21 ms

With the code change, it outputs
ScalaUdf -- 382 ms
ScalaUdf -- 255 ms
ScalaUdf -- 247 ms

Floor -- 27 ms
Floor -- 6 ms
Floor -- 8 ms

Conclusions:

The code change improves Scala UDF performance by roughly 3x to 4x in this micro-benchmark (for example, 929 ms down to 247 ms).
Scala UDFs still perform very poorly compared to built-in Expression types.

We probably need to provide a more efficient UDF extension interface.
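
To make the comparison concrete, the ratios can be computed directly from the timings reported above. This is only arithmetic on the posted numbers; the object and method names are invented for the example.

```scala
object SpeedupSketch {
  // Timings in milliseconds, copied from the benchmark output in this comment.
  val scalaUdfBefore: Seq[Double] = Seq(1183.0, 887.0, 929.0)
  val scalaUdfAfter: Seq[Double] = Seq(382.0, 255.0, 247.0)
  val floorAfter: Seq[Double] = Seq(27.0, 6.0, 8.0)

  // Element-wise ratio between two series of runs.
  def ratios(slow: Seq[Double], fast: Seq[Double]): Seq[Double] =
    slow.zip(fast).map { case (s, f) => s / f }

  def main(args: Array[String]): Unit = {
    // How much faster ScalaUdf became with the change, run by run.
    ratios(scalaUdfBefore, scalaUdfAfter).foreach(r => println(f"ScalaUdf speedup: $r%.2fx"))
    // How much slower ScalaUdf still is than the built-in Floor expression.
    ratios(scalaUdfAfter, floorAfter).foreach(r => println(f"ScalaUdf vs Floor: $r%.1fx slower"))
  }
}
```

The per-run ScalaUdf speedups come out at about 3.1x, 3.5x and 3.8x, and ScalaUdf remains more than an order of magnitude slower than Floor in every run. The first run of each series includes JIT warm-up, so the later runs are the more representative ones.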

liancheng · 2015-03-24T16:41:33Z

ok to test

SparkQA · 2015-03-24T16:43:23Z

Test build #29093 has started for PR 5154 at commit 2b7afc0.

This patch merges cleanly.

SparkQA · 2015-03-24T18:05:07Z

Test build #29093 has finished for PR 5154 at commit 2b7afc0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-24T18:05:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29093/
Test PASSed.

SparkQA · 2015-03-25T01:07:41Z

Test build #29134 has started for PR 5154 at commit c899e6d.

This patch does not merge cleanly.

zzcclp · 2015-03-25T01:10:03Z

@chenghao-intel, I have updated the code. Can you take a look again? Thanks.

chenghao-intel · 2015-03-25T01:28:38Z

You need to fetch the latest code and resolve the conflicts.

chenghao-intel · 2015-03-25T01:31:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUdf.scala

Nit: revert the change here?

chenghao-intel · 2015-03-25T01:35:44Z

@zzcclp LGTM, except for some small issues.

SparkQA · 2015-03-25T02:28:22Z

Test build #29134 has finished for PR 5154 at commit c899e6d.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-25T02:28:26Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29134/
Test FAILed.

SparkQA · 2015-03-25T02:48:15Z

Test build #29141 has started for PR 5154 at commit b73836a.

This patch merges cleanly.

zzcclp · 2015-03-25T02:48:50Z

@SparkQA , merge again.

AmplabJenkins · 2015-03-25T03:08:18Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29141/
Test FAILed.

SparkQA · 2015-03-25T03:10:25Z

Test build #29142 has finished for PR 5154 at commit 0a8cdc3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-25T03:10:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29142/
Test FAILed.

chenghao-intel · 2015-03-25T04:35:51Z

@zzcclp
children: Seq[Expression] can essentially be either a List[Expression] or an ArraySeq[Expression] (maybe more); however, the latter doesn't support pattern matching with ::.

You probably have to use code like:

// Match on the size and index into children, which works for any Seq implementation.
val f = children.size match {
  case 1 =>
    val func = function.asInstanceOf[(Any) => Any]
    val child0 = children(0)
    (input: Row) => {
      func(ScalaReflection.convertToScala(child0.eval(input), child0.dataType))
    }

  case 2 =>
    // A two-argument UDF must be cast to a two-argument function type.
    val func = function.asInstanceOf[(Any, Any) => Any]
    val child0 = children(0)
    val child1 = children(1)
    (input: Row) => {
      func(
        ScalaReflection.convertToScala(child0.eval(input), child0.dataType),
        ScalaReflection.convertToScala(child1.eval(input), child1.dataType))
    }
}
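
The difference between the two matching styles can be seen in a few lines. This is an illustrative sketch with made-up method names, using String elements instead of Expression; it is presumably, though not confirmed in this thread, the reason the :: version failed the unit tests.

```scala
object SeqPatternSketch {
  // :: is List's cons cell, so this pattern only matches when the Seq is really a List.
  def viaCons(children: Seq[String]): String = children match {
    case a :: Nil => s"one child: $a"
    case _ => "no match"
  }

  // Seq(...) patterns go through unapplySeq and work for any Seq implementation.
  def viaSeqPattern(children: Seq[String]): String = children match {
    case Seq(a) => s"one child: $a"
    case _ => "no match"
  }

  def main(args: Array[String]): Unit = {
    println(viaCons(List("x")))         // List: the :: pattern matches
    println(viaCons(Vector("x")))       // Vector: the :: pattern does not match
    println(viaSeqPattern(Vector("x"))) // Vector: the Seq pattern matches
  }
}
```

Matching on children.size and indexing, as in the code above, avoids the problem as well and keeps the element lookups outside the per-row closure.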

zzcclp · 2015-03-25T06:43:13Z

@chenghao-intel, can you review again? Thanks.

SparkQA · 2015-03-25T06:48:14Z

Test build #29152 has started for PR 5154 at commit cc6868e.

This patch merges cleanly.

zzcclp · 2015-03-25T06:53:08Z

@AmplabJenkins, please retest.

chenghao-intel · 2015-03-25T07:49:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUdf.scala

A newline should be added at the end of source file.

SparkQA · 2015-03-25T08:09:51Z

Test build #29152 has finished for PR 5154 at commit cc6868e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-25T08:09:54Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29152/
Test PASSed.

zzcclp · 2015-03-25T08:15:54Z

I added a newline at the end of the source file. Does it need to be retested?

SparkQA · 2015-03-25T08:18:15Z

Test build #29157 has started for PR 5154 at commit 5ac6e09.

This patch merges cleanly.

zzcclp · 2015-03-25T08:37:32Z

@AmplabJenkins, please retest, thanks.

chenghao-intel · 2015-03-25T08:56:29Z

The unit tests are triggered automatically whenever the code changes; you needn't say anything to @AmplabJenkins.

zzcclp · 2015-03-25T08:58:20Z

OK, I am a new Sparker 😄

SparkQA · 2015-03-25T09:39:17Z

Test build #29157 has finished for PR 5154 at commit 5ac6e09.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-25T09:39:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29157/
Test PASSed.

liancheng · 2015-03-25T11:12:09Z

@zzcclp @chenghao-intel Thanks for working on this and the review comments! Merged to master.

liancheng · 2015-03-25T11:17:44Z

@zzcclp Would you please set your real name on both GitHub and JIRA so that our script can put your name on the credit list of the next release? Also, it would be good if you can set your name in git config:

$ git config --global user.name "Your Name"

zzcclp · 2015-03-26T02:38:43Z

@liancheng, I have already set my real name as my GitHub Name and JIRA Full Name. 😃

It's a follow-up of #5154: we can speed up Scala UDF evaluation by creating the type converter in advance.

Author: Wenchen Fan

Closes #6182 from cloud-fan/tmp and squashes the following commits:

241cfe9 [Wenchen Fan] use converter in ScalaUdf

(cherry picked from commit 2f22424)
Signed-off-by: Yin Huai
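
The idea of creating the converter in advance can be sketched as follows. Everything here is hypothetical: the tiny DataType hierarchy, the assumption that strings are stored internally as UTF-8 byte arrays, and the convert and createConverter functions are invented for illustration and are not Spark's actual API.

```scala
import java.nio.charset.StandardCharsets

object ConverterSketch {
  // Hypothetical, much simplified type descriptions.
  sealed trait DataType
  case object IntegerType extends DataType
  case object StringType extends DataType
  final case class ArrayType(elementType: DataType) extends DataType

  // Per-value conversion: inspects the data type again for every single value.
  def convert(value: Any, dataType: DataType): Any = (value, dataType) match {
    case (null, _) => null
    case (bytes: Array[Byte], StringType) => new String(bytes, StandardCharsets.UTF_8)
    case (xs: Seq[_], ArrayType(elem)) => xs.map(convert(_, elem))
    case (other, _) => other
  }

  // Converter built in advance: the data type is inspected once, and the
  // returned function only does the per-value work.
  def createConverter(dataType: DataType): Any => Any = dataType match {
    case IntegerType =>
      (v: Any) => v
    case StringType =>
      (v: Any) => v match {
        case null => null
        case bytes: Array[Byte] => new String(bytes, StandardCharsets.UTF_8)
      }
    case ArrayType(elem) =>
      val elementConverter = createConverter(elem) // resolved once, reused per element
      (v: Any) => v match {
        case null => null
        case xs: Seq[_] => xs.map(elementConverter)
      }
  }

  def main(args: Array[String]): Unit = {
    val toScala = createConverter(ArrayType(StringType))
    val internal = Seq("ab".getBytes(StandardCharsets.UTF_8), "cd".getBytes(StandardCharsets.UTF_8))
    println(toScala(internal))
  }
}
```

In a UDF expression the converter for each child would be created once, next to the casted function, and then applied inside the per-row closure.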


zzcclp changed the title ~~Improve ScalaUdf called performance.~~ [SPARK-6483][SQL]Improve ScalaUdf called performance. Mar 24, 2015

zzcclp added 2 commits March 25, 2015 10:39

rebase from master

7763848

Access Seq[Expression] element by :: operator, and update the code ge…

b73836a

…n script. 1. access Seq[Expression] element by :: operator 2. update the code gen script

zzcclp force-pushed the SPARK-6483 branch from c899e6d to b73836a Compare March 25, 2015 02:43

indention issue

0a8cdc3

Fix for fail on unit test.

cc6868e

Add a newline at the end of source file

5ac6e09

asfgit closed this in 64262ed Mar 25, 2015

cloud-fan mentioned this pull request May 15, 2015

[SQL][minor] use catalyst type converter in ScalaUdf #6182

Closed
