Skip to content

Conversation

@zzcclp
Copy link
Contributor

@zzcclp zzcclp commented Mar 24, 2015

As issue SPARK-6483 description, ScalaUdf is low performance because of calling asInstanceOf to convert per record.
With this, the performance of ScalaUdf is the same as other case.
thank @lianhuiwang for telling me how to resolve this problem.

@zzcclp zzcclp changed the title Improve ScalaUdf called performance. [SPARK-6483][SQL]Improve ScalaUdf called performance. Mar 24, 2015
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@chenghao-intel
Copy link
Contributor

Hmm, have you try what the performance gain by this change? From my understanding the bottleneck is in the function call ScalaReflection.convertToScala

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 24, 2015

Before this change, it takes 17 minutes, and now takes 5 minutes, which is the same as HiveContext + udf floor and non-udf

@chenghao-intel
Copy link
Contributor

OK, probably we can also move the children.size match {..} out of the eval.

@chenghao-intel
Copy link
Contributor

I mean we can do something like

val f = children.size match {
  case 1 =>
   val func = function.asInstanceOf[(Any) => Any]
   val child0 = children(0)
    (input: Row) => {
      func(ScalaReflection.convertToScala(child0.eval(input), child0.dataType)))
    }

  case 2 =>
   val func = function.asInstanceOf[(Any) => Any]
   val child0 = children(0)
   val child1 = children(1)

   (input: Row) => {
    func(ScalaReflection.convertToScala(child0.eval(input), child0.dataType))
       ScalaReflection.convertToScala(child1.eval(input), child1.dataType)))
    }
}

def eval(input: Row) = f(input)

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 24, 2015

OK, I will modify code and test again.

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 24, 2015

@chenghao-intel , I change code and test it, the result is the same as last commit , is 5 minutes.
Please help me for reviewing code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

children is the type of Seq[Expression] (essentially the List[Expression]), access its element by index will cause performance overhead in runtime, we'd better move that out of the anonymous functions. See:
http://docs.scala-lang.org/overviews/collections/performance-characteristics.html

@chenghao-intel
Copy link
Contributor

@zzcclp I will run the benchmark in my local machine, will get back soon.

@liancheng , can you trigger the unit test?

@chenghao-intel
Copy link
Contributor

Verified the code change by the following micro-benchmark

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types._

case class Floor(child: Expression) extends UnaryExpression with Predicate {
  override def foldable = child.foldable
  def nullable = child.nullable
  override def toString = s"Floor $child"

  override def eval(input: Row): Any = {
    child.eval(input) match {
      case null => null
      case ts: Int => ts - ts % 300
    }
  }
}

object T {
  def benchmark(count: Int, expr: Expression): Unit = {
    var i = 0
    val row = new GenericRow(Array[Any](123, 21, 42))
    val s = System.currentTimeMillis()
    while (i < count) {
      expr.eval(row)
      i += 1
    }
    val e = System.currentTimeMillis()

    println (s"${expr.getClass.getSimpleName}  -- ${e - s} ms")
  }
  def main(args: Array[String]) {
    def func(ts: Int) = ts - ts % 300
    val udf0 = ScalaUdf(func _, IntegerType, BoundReference(0, IntegerType, true) :: Nil)
    val udf1 = Floor(BoundReference(0, IntegerType, true))

    benchmark(1000000, udf0)
    benchmark(1000000, udf0)
    benchmark(1000000, udf0)

    benchmark(1000000, udf1)
    benchmark(1000000, udf1)
    benchmark(1000000, udf1)
  }
}

Without the code change it outputs
ScalaUdf -- 1183 ms
ScalaUdf -- 887 ms
ScalaUdf -- 929 ms

Floor -- 49 ms
Floor -- 15 ms
Floor -- 21 ms

With the code change, it outputs
ScalaUdf -- 382 ms
ScalaUdf -- 255 ms
ScalaUdf -- 247 ms

Floor -- 27 ms
Floor -- 6 ms
Floor -- 8 ms

Conclusions:

  • The code change will improve the performance of scala udf by 2-3x
  • Scala UDF is in very low performance compare to the built-in type of Expression.

We probably need to provide more efficient way of UDF extension interface.

@liancheng
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Mar 24, 2015

Test build #29093 has started for PR 5154 at commit 2b7afc0.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 24, 2015

Test build #29093 has finished for PR 5154 at commit 2b7afc0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29093/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29134 has started for PR 5154 at commit c899e6d.

  • This patch does not merge cleanly.

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 25, 2015

@chenghao-intel ,I have update the code. can you take a look again. thanks.

@chenghao-intel
Copy link
Contributor

You need to fetch the latest code and resolve the conflicts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: revert the change here?

@chenghao-intel
Copy link
Contributor

@zzcclp LGTM, except some small issues.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29134 has finished for PR 5154 at commit c899e6d.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29134/
Test FAILed.

zzcclp added 2 commits March 25, 2015 10:39
…n script.

1. access Seq[Expression] element by :: operator
2. update the code gen script
@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29141 has started for PR 5154 at commit b73836a.

  • This patch merges cleanly.

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 25, 2015

@SparkQA , merge again.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29141/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29142 has finished for PR 5154 at commit 0a8cdc3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29142/
Test FAILed.

@chenghao-intel
Copy link
Contributor

@zzcclp
children:Seq[Expression] essentially can be either List[Expression] or ArraySeq[Expression] (maybe more), however the later doesn't support the pattern matching.

Probably you have to use the code like:

val f = children.size match {
  case 1 =>
   val func = function.asInstanceOf[(Any) => Any]
   val child0 = children(0)
    (input: Row) => {
      func(ScalaReflection.convertToScala(child0.eval(input), child0.dataType)))
    }

  case 2 =>
   val func = function.asInstanceOf[(Any) => Any]
   val child0 = children(0)
   val child1 = children(1)

   (input: Row) => {
    func(ScalaReflection.convertToScala(child0.eval(input), child0.dataType))
       ScalaReflection.convertToScala(child1.eval(input), child1.dataType)))
    }
}

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 25, 2015

@chenghao-intel , can you review again, thanks.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29152 has started for PR 5154 at commit cc6868e.

  • This patch merges cleanly.

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 25, 2015

@AmplabJenkins , please re test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A newline should be added at the end of source file.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29152 has finished for PR 5154 at commit cc6868e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29152/
Test PASSed.

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 25, 2015

Add a newline at the end of source file, does it need to re test?

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29157 has started for PR 5154 at commit 5ac6e09.

  • This patch merges cleanly.

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 25, 2015

@AmplabJenkins , please re test, thanks.

@chenghao-intel
Copy link
Contributor

The unit test will be auto-triggered, once the code changed, you needn't say anything to @AmplabJenkins .

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 25, 2015

OK, I am new sparker, 😄

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29157 has finished for PR 5154 at commit 5ac6e09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29157/
Test PASSed.

@asfgit asfgit closed this in 64262ed Mar 25, 2015
@liancheng
Copy link
Contributor

@zzcclp @chenghao-intel Thanks for working on this and the review comments! Merged to master.

@liancheng
Copy link
Contributor

@zzcclp Would you please set your real name on both GitHub and JIRA so that our script can put your name on the credit list of the next release? Also, it would be good if you can set your name in git config:

$ git config --global user.name "Your Name"

@zzcclp
Copy link
Contributor Author

zzcclp commented Mar 26, 2015

@liancheng , I already set my real name on my GitHub Name and JIRA Full Name. 😃

asfgit pushed a commit that referenced this pull request May 17, 2015
It's a follow-up of #5154, we can speed up scala udf evaluation by create type converter in advance.

Author: Wenchen Fan <[email protected]>

Closes #6182 from cloud-fan/tmp and squashes the following commits:

241cfe9 [Wenchen Fan] use converter in ScalaUdf

(cherry picked from commit 2f22424)
Signed-off-by: Yin Huai <[email protected]>
asfgit pushed a commit that referenced this pull request May 17, 2015
It's a follow-up of #5154, we can speed up scala udf evaluation by create type converter in advance.

Author: Wenchen Fan <[email protected]>

Closes #6182 from cloud-fan/tmp and squashes the following commits:

241cfe9 [Wenchen Fan] use converter in ScalaUdf
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
It's a follow-up of apache#5154, we can speed up scala udf evaluation by create type converter in advance.

Author: Wenchen Fan <[email protected]>

Closes apache#6182 from cloud-fan/tmp and squashes the following commits:

241cfe9 [Wenchen Fan] use converter in ScalaUdf
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
It's a follow-up of apache#5154, we can speed up scala udf evaluation by create type converter in advance.

Author: Wenchen Fan <[email protected]>

Closes apache#6182 from cloud-fan/tmp and squashes the following commits:

241cfe9 [Wenchen Fan] use converter in ScalaUdf
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
It's a follow-up of apache#5154, we can speed up scala udf evaluation by create type converter in advance.

Author: Wenchen Fan <[email protected]>

Closes apache#6182 from cloud-fan/tmp and squashes the following commits:

241cfe9 [Wenchen Fan] use converter in ScalaUdf
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants