[SPARK-26147][SQL] only pull out unevaluable python udf from join condition #23153

cloud-fan · 2018-11-27T11:22:59Z

What changes were proposed in this pull request?

#22326 mistakenly treated every Python UDF in a join condition as unevaluable. Only Python UDFs that refer to attributes from both join sides are unevaluable.

This PR fixes this mistake.

How was this patch tested?

a new test

cloud-fan · 2018-11-27T11:24:48Z

@xuanyuanking @HyukjinKwon @mgaido91 @gatorsmile

SparkQA · 2018-11-27T11:27:51Z

Test build #99324 has finished for PR 23153 at commit cb195cf.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

xuanyuanking · 2018-11-27T12:46:44Z

python/pyspark/sql/tests/test_udf.py

+        f = udf(lambda a: str(a), StringType())
+        # The join condition can't be pushed down, as it refers to attributes from both sides.
+        # The Python UDF only refer to attributes from one side, so it's evaluable.
+        df = left.join(right, f("a") == col("b").cast("string"), how = "left_outer")


style nit: how="left_outer"

xuanyuanking · 2018-11-27T16:28:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala

  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case j @ Join(_, _, joinType, condition)
-        if condition.isDefined && hasPythonUDF(condition.get) =>
+    case j @ Join(_, _, joinType, Some(cond)) if hasUnevaluablePythonUDF(cond, j) =>


Following the rule change, we need to modify the tests in PullOutPythonUDFInJoinConditionSuite: they should also construct the dummy Python UDF with attributes from both sides.

xuanyuanking · 2018-11-27T16:33:47Z

Sorry for the mistake, and thanks to Wenchen for the fix.

Regarding "the suites should also construct the dummy python udf from both side": I did this locally, and the suites can be fixed simply by:

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PullOutPythonUDFInJoinConditionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PullOutPythonUDFInJoinConditionSuite.scala
index d3867f2b6b..a0f8ae2fc7 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PullOutPythonUDFInJoinConditionSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PullOutPythonUDFInJoinConditionSuite.scala
@@ -40,13 +40,18 @@ class PullOutPythonUDFInJoinConditionSuite extends PlanTest {
         CheckCartesianProducts) :: Nil
   }

-  val testRelationLeft = LocalRelation('a.int, 'b.int)
-  val testRelationRight = LocalRelation('c.int, 'd.int)
+  val attrA = 'a.int
+  val attrB = 'b.int
+  val attrC = 'c.int
+  val attrD = 'd.int
+
+  val testRelationLeft = LocalRelation(attrA, attrB)
+  val testRelationRight = LocalRelation(attrC, attrD)

   // Dummy python UDF for testing. Unable to execute.
   val pythonUDF = PythonUDF("pythonUDF", null,
     BooleanType,
-    Seq.empty,
+    Seq(attrA, attrC),
     PythonEvalType.SQL_BATCHED_UDF,
     udfDeterministic = true)

@@ -118,7 +123,7 @@ class PullOutPythonUDFInJoinConditionSuite extends PlanTest {
   test("pull out whole complex condition with multiple python udf") {
     val pythonUDF1 = PythonUDF("pythonUDF1", null,
       BooleanType,
-      Seq.empty,
+      Seq(attrA, attrC),
       PythonEvalType.SQL_BATCHED_UDF,
       udfDeterministic = true)
     val condition = (pythonUDF || 'a.attr === 'c.attr) && pythonUDF1
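
One way to see why the dummy UDF's children have to change is the toy Python sketch below. It is not code from the patch: it assumes that evaluability is a subset check of the referenced attributes against one side's output, and the helper name `needs_pull_out` is hypothetical. Under that assumption, a UDF with no children references nothing, so it counts as evaluable on either side and the new rule would no longer pull it out.

```python
def needs_pull_out(udf_children, left_output, right_output):
    """Toy check: pull the UDF out only if neither side alone covers its references.

    Assumption: "evaluable on a side" means every referenced attribute
    is in that side's output.
    """
    refs = set(udf_children)
    return (not refs <= set(left_output)) and (not refs <= set(right_output))


left_output = ["a", "b"]   # mirrors testRelationLeft
right_output = ["c", "d"]  # mirrors testRelationRight

# Old dummy UDF: no children, so the empty reference set fits either side.
empty_children = needs_pull_out([], left_output, right_output)

# Updated dummy UDF: children from both sides, so neither side covers it.
both_sides = needs_pull_out(["a", "c"], left_output, right_output)
```

With `Seq.empty` the existing tests would stop exercising the rule, which is why the suggested diff passes `Seq(attrA, attrC)` instead.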

mgaido91 · 2018-11-27T16:44:21Z

The change itself seems fine to me. As @xuanyuanking mentioned, though, we should update the existing tests. What about adding a test in the new suite that checks the plans instead of an end-to-end test?

SparkQA · 2018-11-28T08:05:02Z

Test build #99356 has finished for PR 23153 at commit 7b985d8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-11-28T08:08:41Z

retest this please

SparkQA · 2018-11-28T11:47:22Z

Test build #99358 has finished for PR 23153 at commit 7b985d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

xuanyuanking

LGTM

cloud-fan · 2018-11-28T12:39:16Z

thanks, merging to master/2.4!

mgaido91 · 2018-11-28T12:47:50Z

a late LGTM as well, thanks @cloud-fan for the patch and thanks @xuanyuanking for the review.

…dition

#22326 made a mistake that, not all python UDFs are unevaluable in join condition. Only python UDFs that refer to attributes from both join side are unevaluable. This PR fixes this mistake.

a new test

Closes #23153 from cloud-fan/join.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit affe809)
Signed-off-by: Wenchen Fan <[email protected]>

gatorsmile · 2018-11-30T17:42:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala

+
+  private def hasUnevaluablePythonUDF(expr: Expression, j: Join): Boolean = {
+    expr.find { e =>
+      PythonUDF.isScalarPythonUDF(e) && !canEvaluate(e, j.left) && !canEvaluate(e, j.right)


We might need a comment to explain why we only pull out the Scalar PythonUDF.

cloud-fan · 2018-12-02

It's only possible to have a scalar UDF in a join condition, so changing it to e.isInstanceOf[PythonUDF] would be equivalent.
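
The check in `hasUnevaluablePythonUDF` above can be sketched with a toy expression tree in Python. This is an illustration under stated assumptions, not Spark's implementation: the `Expr` class, its `find` method, and `has_unevaluable_python_udf` are hypothetical stand-ins, and "can evaluate" is assumed to mean that an expression's referenced attributes are a subset of one side's output.

```python
from dataclasses import dataclass


@dataclass
class Expr:
    """Toy expression node; `attrs` are attributes this node references directly."""
    name: str
    attrs: frozenset = frozenset()
    children: tuple = ()
    is_python_udf: bool = False

    def references(self):
        # Collect attributes referenced anywhere in this subtree.
        refs = set(self.attrs)
        for child in self.children:
            refs |= child.references()
        return frozenset(refs)

    def find(self, pred):
        # Pre-order search for the first node satisfying `pred`.
        if pred(self):
            return self
        for child in self.children:
            hit = child.find(pred)
            if hit is not None:
                return hit
        return None


def has_unevaluable_python_udf(cond, left_output, right_output):
    # A UDF node is unevaluable only if neither side alone covers its references.
    def unevaluable(e):
        refs = e.references()
        return (e.is_python_udf
                and not refs <= left_output
                and not refs <= right_output)
    return cond.find(unevaluable) is not None


left_output = frozenset({"a"})
right_output = frozenset({"b"})
attr_a = Expr("a", attrs=frozenset({"a"}))
attr_b = Expr("b", attrs=frozenset({"b"}))

# f(a) == cast(b): the condition spans both sides, but the UDF only reads the left.
one_sided = Expr("==", children=(
    Expr("f", children=(attr_a,), is_python_udf=True),
    Expr("cast", children=(attr_b,)),
))
# g(a, b): the UDF itself reads attributes from both sides.
two_sided = Expr("g", children=(attr_a, attr_b), is_python_udf=True)

one_sided_result = has_unevaluable_python_udf(one_sided, left_output, right_output)
two_sided_result = has_unevaluable_python_udf(two_sided, left_output, right_output)
```

The first condition mirrors the new PySpark test: the comparison as a whole refers to both sides, yet the UDF node inside it does not, so nothing needs to be pulled out.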

HyukjinKwon

late LGTM as well

only pull out unevaluable python udf from join condition

cb195cf

xuanyuanking reviewed Nov 27, 2018

fix tests

7b985d8

xuanyuanking approved these changes Nov 28, 2018

asfgit closed this in affe809 Nov 28, 2018

gatorsmile reviewed Nov 30, 2018

HyukjinKwon reviewed Dec 1, 2018

cloud-fan Dec 2, 2018
