[SPARKR] [SPARK-10981] SparkR Join improvements #9029
Conversation
|
Thanks for the PR @mfliu - Could you format the PR title as
|
Jenkins, ok to test
|
cc @sun-rui
|
The errors in the log file don't seem to be related to the changes I made. They are primarily in PythonRDD.scala: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed), Caused by: java.io.EOFException. I made no changes to that file, and I'm not sure how changing R code would affect that interface?
|
@mfliu I think that's just a flaky test, but irrespective of that your changes should be against the current
R/pkg/R/DataFrame.R
Outdated
It looks like this is undoing a recent PR; could you check?
Merge remote-tracking branch 'upstream/master'
|
@felixcheung Yes, you are correct. The arrange function was different. I pulled again and changed those files and it is working on my machine. Can you test again?
|
Test build #43468 has finished for PR 9029 at commit
|
expect_equal(count(joined7, 3)) was changed to expect_equal(count(joined7), 3)
|
Sorry, I had a typo in one of my unit tests. It now passes run-tests.sh on my machine. Can you test again?
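The typo in question was a misplaced parenthesis: the expected value 3 ended up inside the count() call rather than as the second argument to expect_equal(). A minimal sketch of the before and after (assuming joined7 is a SparkR DataFrame with three rows, as in the test suite):

```r
library(testthat)

# Buggy form: 3 is passed to count() as a second argument, and
# expect_equal() receives only one argument.
# expect_equal(count(joined7, 3))

# Corrected form: count() computes the row count first, and
# expect_equal() compares that result against the expected value 3.
expect_equal(count(joined7), 3)
```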
|
Test build #43469 has finished for PR 9029 at commit
|
|
Test build #43473 has finished for PR 9029 at commit
|
R/pkg/R/DataFrame.R
Outdated
Any reason we should change right_outer to rightouter? It'll break code that used to work with previous versions?
That change was made based on the comment on the JIRA report:
https://issues.apache.org/jira/browse/SPARK-10981
In the PR, please:
- Support all join types defined in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala (You can remove the "_" char from the currently supported join types in SparkR)
- Add test cases for missing join types including "leftsemi"
Perhaps I misunderstood?
+1 on breaking API changes; IMO we really need to come up with some policy on that
Would it make sense to add "right_outer" and "left_outer" along with "rightouter" and "leftouter"?
Yeah, API compatibility is a concern. So we can make the R code consistent with the Scala version at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala#L21
That is, replace the "_" char in the join type string with the empty string.
Yeah, so let's support both right_outer and rightouter. That way we don't break backwards compatibility. One simple way to do this, as @sun-rui said, is to replace all underscores in the join string with "" using gsub or something like that.
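The normalization suggested above can be sketched with base R's gsub; the helper name below is hypothetical, not the exact code from the PR:

```r
# Hypothetical helper: normalize a user-supplied join type so that both
# the underscored spelling ("right_outer") and the compact spelling
# ("rightouter") map to the compact form that Catalyst's joinTypes.scala
# expects. gsub replaces every "_" with the empty string.
normalizeJoinType <- function(joinType) {
  gsub("_", "", joinType)
}

normalizeJoinType("right_outer")  # "rightouter"
normalizeJoinType("left_semi")    # "leftsemi"
normalizeJoinType("inner")        # unchanged: "inner"
```

Because gsub leaves strings without underscores untouched, the previously working compact spellings keep working, so no backwards compatibility is lost.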
|
Test build #43664 has finished for PR 9029 at commit
|
|
Test build #43666 has finished for PR 9029 at commit
|
|
@mfliu Just FYI, you can check the lint-r tests locally by running the script
|
Test build #43668 has finished for PR 9029 at commit
|
|
Jenkins, retest this please
Merge remote-tracking branch 'upstream/master'
|
Test build #43671 has finished for PR 9029 at commit
|
|
Test build #43673 has finished for PR 9029 at commit
|
|
looks good
|
LGTM
|
Thanks @mfliu - LGTM. Merging this
I was having issues with collect() and orderBy() in Spark 1.5.0, so I used the DataFrame.R and test_sparkSQL.R files from the Spark 1.5.1 download. I only modified the join() function in DataFrame.R to include "full", "fullouter", "left", "right", and "leftsemi", and added corresponding test cases for join() and merge() in test_sparkSQL.R.

Pull request opened because I filed this JIRA bug report:
https://issues.apache.org/jira/browse/SPARK-10981

Author: Monica Liu <[email protected]>

Closes #9029 from mfliu/master.

(cherry picked from commit 8b32885)
Signed-off-by: Shivaram Venkataraman <[email protected]>
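As a hedged sketch of how the extended API is used (the DataFrame names df1 and df2 and the column name "key" are assumptions for illustration; the join type strings come from the PR description):

```r
library(SparkR)

# Assuming df1 and df2 are SparkR DataFrames that share a "key" column,
# join() now also accepts the type strings added by this PR:
# "full", "fullouter", "left", "right", and "leftsemi".
joined_full <- join(df1, df2, df1$key == df2$key, "fullouter")
joined_left <- join(df1, df2, df1$key == df2$key, "left")
joined_semi <- join(df1, df2, df1$key == df2$key, "leftsemi")
```

A left-semi join returns only the columns of df1, keeping the rows that have a match in df2, which is why it needed its own test case.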