-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-26147][SQL] only pull out unevaluable python udf from join condition #23153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #99324 has finished for PR 23153 at commit
|
python/pyspark/sql/tests/test_udf.py
Outdated
| f = udf(lambda a: str(a), StringType()) | ||
| # The join condition can't be pushed down, as it refers to attributes from both sides. | ||
| # The Python UDF only refer to attributes from one side, so it's evaluable. | ||
| df = left.join(right, f("a") == col("b").cast("string"), how = "left_outer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nit: how="left_outer"
| override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { | ||
| case j @ Join(_, _, joinType, condition) | ||
| if condition.isDefined && hasPythonUDF(condition.get) => | ||
| case j @ Join(_, _, joinType, Some(cond)) if hasUnevaluablePythonUDF(cond, j) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followed by the rule changes, we need modify the suites in PullOutPythonUDFInJoinConditionSuite, the suites should also construct the dummy python udf from both side.
|
Sorry for the mistake and thanks for the fix from Wenchen, I did this locally, the suites can be simply fixed by: |
|
the change itself seems fine to me, as @xuanyuanking mentioned, though, we should update the existing tests. What about adding a test in the new suite checking the plans instead of a end-to-end test? |
|
Test build #99356 has finished for PR 23153 at commit
|
|
retest this please |
|
Test build #99358 has finished for PR 23153 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
thanks, merging to master/2.4! |
|
a late LGTM as well, thanks @cloud-fan for the patch and thanks @xuanyuanking for the review. |
…dition #22326 made a mistake that, not all python UDFs are unevaluable in join condition. Only python UDFs that refer to attributes from both join side are unevaluable. This PR fixes this mistake. a new test Closes #23153 from cloud-fan/join. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit affe809) Signed-off-by: Wenchen Fan <[email protected]>
|
|
||
| private def hasUnevaluablePythonUDF(expr: Expression, j: Join): Boolean = { | ||
| expr.find { e => | ||
| PythonUDF.isScalarPythonUDF(e) && !canEvaluate(e, j.left) && !canEvaluate(e, j.right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need a comment to explain why we only pull out the Scalar PythonUDF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only possible to have scalar UDF in join condition, so changing it to e.isInstanceOf[PythonUDF] is same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
late LGTM as well
…dition ## What changes were proposed in this pull request? apache#22326 made a mistake that, not all python UDFs are unevaluable in join condition. Only python UDFs that refer to attributes from both join side are unevaluable. This PR fixes this mistake. ## How was this patch tested? a new test Closes apache#23153 from cloud-fan/join. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…dition apache#22326 made a mistake that, not all python UDFs are unevaluable in join condition. Only python UDFs that refer to attributes from both join side are unevaluable. This PR fixes this mistake. a new test Closes apache#23153 from cloud-fan/join. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit affe809) Signed-off-by: Wenchen Fan <[email protected]>
…dition apache#22326 made a mistake that, not all python UDFs are unevaluable in join condition. Only python UDFs that refer to attributes from both join side are unevaluable. This PR fixes this mistake. a new test Closes apache#23153 from cloud-fan/join. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit affe809) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
#22326 made a mistake that, not all python UDFs are unevaluable in join condition. Only python UDFs that refer to attributes from both join side are unevaluable.
This PR fixes this mistake.
How was this patch tested?
a new test