[SPARKR] [SPARK-10981] SparkR Join improvements #9029
Conversation
|
Thanks for the PR @mfliu - Could you format the PR title as
|
Jenkins, ok to test
|
cc @sun-rui
|
The errors in the log file don't seem to be related to the changes I made. They are primarily in PythonRDD.scala: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed), Caused by: java.io.EOFException. I made no changes to that file, and I'm not sure how changing R code would affect that interface?
|
@mfliu I think that's just a flaky test, but irrespective of that your changes should be against the current
R/pkg/R/DataFrame.R
Outdated
It looks like this is undoing a recent PR; could you check?
Merge remote-tracking branch 'upstream/master'
|
@felixcheung Yes, you are correct. The arrange function was different. I pulled again and changed those files and it is working on my machine. Can you test again?
|
Test build #43468 has finished for PR 9029 at commit
|
expect_equal(count(joined7, 3)) was changed to expect_equal(count(joined7), 3)
|
Sorry, I had a typo in one of my unit tests. It now passes run-tests.sh on my machine. Can you test again?
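The typo in question was a misplaced parenthesis: the expected value 3 ended up inside the count() call rather than as the second argument to expect_equal(). A minimal sketch of the before and after (assuming joined7 is a SparkR DataFrame with three rows, as in the test suite):

```r
library(testthat)

# Buggy form: 3 is passed to count() as a second argument, and
# expect_equal() receives only one argument.
# expect_equal(count(joined7, 3))

# Corrected form: count() computes the row count first, and
# expect_equal() compares that result against the expected value 3.
expect_equal(count(joined7), 3)
```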
|
Test build #43469 has finished for PR 9029 at commit
|
|
Test build #43473 has finished for PR 9029 at commit
|
R/pkg/R/DataFrame.R
Outdated
Any reason we should change right_outer to rightouter? It'll break code that used to work with previous versions?
That change was made based on the comment on the JIRA report:
https://issues.apache.org/jira/browse/SPARK-10981
In the PR, please:
- Support all join types defined in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala (You can remove the "_" char from the currently supported join types in SparkR)
- Add test cases for missing join types including "leftsemi"
Perhaps I misunderstood?
+1 on breaking API changes; IMO we really need to come up with some policy on that
Would it make sense to add "right_outer" and "left_outer" along with "rightouter" and "leftouter"?
Yeah, API compatibility is a concern. So we can make the R code consistent with the Scala version at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala#L21
That is, replace the "_" char in the join type string with the empty string.
Yeah, so let's support both right_outer and rightouter. That way we don't break backwards compatibility. One simple way to do this, as @sun-rui said, is to replace all underscores in the join string with "" using gsub or something like that.
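The normalization suggested above can be sketched with base R's gsub; the helper name below is hypothetical, not the exact code from the PR:

```r
# Hypothetical helper: normalize a user-supplied join type so that both
# the underscored spelling ("right_outer") and the compact spelling
# ("rightouter") map to the compact form that Catalyst's joinTypes.scala
# expects. gsub replaces every "_" with the empty string.
normalizeJoinType <- function(joinType) {
  gsub("_", "", joinType)
}

normalizeJoinType("right_outer")  # "rightouter"
normalizeJoinType("left_semi")    # "leftsemi"
normalizeJoinType("inner")        # unchanged: "inner"
```

Because gsub leaves strings without underscores untouched, the previously working compact spellings keep working, so no backwards compatibility is lost.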
|
Test build #43664 has finished for PR 9029 at commit
|
|
Test build #43666 has finished for PR 9029 at commit
|
|
@mfliu Just FYI, you can check the lint-r tests locally by running the script
|
Test build #43668 has finished for PR 9029 at commit
|
|
Jenkins, retest this please
Merge remote-tracking branch 'upstream/master'
|
Test build #43671 has finished for PR 9029 at commit
|
|
Test build #43673 has finished for PR 9029 at commit
|
|
looks good
|
LGTM
|
Thanks @mfliu - LGTM. Merging this
I was having issues with collect() and orderBy() in Spark 1.5.0, so I used the DataFrame.R and test_sparkSQL.R files from the Spark 1.5.1 download. I only modified the join() function in DataFrame.R to include "full", "fullouter", "left", "right", and "leftsemi", and added corresponding test cases for join() and merge() in test_sparkSQL.R.

Pull request opened because I filed this JIRA bug report:
https://issues.apache.org/jira/browse/SPARK-10981

Author: Monica Liu <[email protected]>

Closes #9029 from mfliu/master.

(cherry picked from commit 8b32885)
Signed-off-by: Shivaram Venkataraman <[email protected]>
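As a hedged sketch of how the extended API is used (the DataFrame names df1 and df2 and the column name "key" are assumptions for illustration; the join type strings come from the PR description):

```r
library(SparkR)

# Assuming df1 and df2 are SparkR DataFrames that share a "key" column,
# join() now also accepts the type strings added by this PR:
# "full", "fullouter", "left", "right", and "leftsemi".
joined_full <- join(df1, df2, df1$key == df2$key, "fullouter")
joined_left <- join(df1, df2, df1$key == df2$key, "left")
joined_semi <- join(df1, df2, df1$key == df2$key, "leftsemi")
```

A left-semi join returns only the columns of df1, keeping the rows that have a match in df2, which is why it needed its own test case.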