-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20585][SPARKR] R generic hint support #17851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #76435 has finished for PR 17851 at commit
|
|
Test build #76436 has finished for PR 17851 at commit
|
R/pkg/R/DataFrame.R
Outdated
| #' | ||
| #' @param x a SparkDataFrame. | ||
| #' @param name a name of the hint. | ||
| #' @param ... additional argument(s) passed to the method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case the ... is actually meaningful, so I'd suggest documenting it, eg. similar to scala, "(optional) properties"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scala is even more cryptic here. I adjusted it a bit, but I think we can revisit this once we have some practical examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, yes ;)
R/pkg/R/DataFrame.R
Outdated
| #' @param x a SparkDataFrame. | ||
| #' @param name a name of the hint. | ||
| #' @param ... additional argument(s) passed to the method. | ||
| #' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove empty line
| #' @examples | ||
| #' \dontrun{ | ||
| #' df <- createDataFrame(mtcars) | ||
| #' avg_mpg <- mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you recreated createDataFrame(mtcars) here, do you mean to use df from the line before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or you want to show these are two different dataset? maybe it's worthwhile to comment that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to use different datasets to avoid aliasing and trivially equal warning. It works but it is confusing (note: equi-join syntax same as in Scala or Python would be great, and it shouldn't be that hard to add). Once we merge alias this shouldn't been an issue. Since we don't run it, I can add aliases now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alias is going to 2.3, this is going to 2.2 - I think we leave this for now and can improve this in master later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also with alias it will be quite dense:
#' @examples
#' \dontrun{
#' # Set aliases to avoid ambiguity
#' df <- alias(createDataFrame(mtcars), "cars")
#' avg_mpg <- alias(mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg"), "avg_mpg")
#'
#' head(join(
#' df, hint(avg_mpg, "broadcast"),
#' column("cars.cyl") == column("avg_mpg.cyl")
#' ))
#' }There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right - I think the example makes sense now but it might not be very obvious - for example,
createDataFrame(mtcars)
createDataFrame(mtcars)
vs
df <- createDataFrame(mtcars)
df
is not very subtle unless you know what Spark is doing differently here. This is why I suggested pointing out the need to have distinct "copies" of data
R/pkg/R/DataFrame.R
Outdated
|
|
||
| #' hint | ||
| #' | ||
| #' Specifies execution plan hint on the current SparkDataFrame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the R programming model is a bit different - I think it is better to point out the original SparkDataFrame is not actually changed - instead say .... hint and return a new SparkDataFrame is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ouch sorry I didn't mean to put ".... hint and return a new SparkDataFrame"
but "Specifies execution plan hint and return a new SparkDataFrame"
| execution_plan_hint <- capture.output( | ||
| explain(join(df1, hint(df2, "broadcast"), df1$id == df2$id)) | ||
| ) | ||
| expect_true(any(grepl("BroadcastHashJoin", execution_plan_hint))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome!
R/pkg/R/generics.R
Outdated
| #' @rdname hint | ||
| #' @export | ||
| setGeneric("hint", function(x, name, ...) { standardGeneric("hint") }) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should move after groupBy, I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
@felixcheung Do you think this makes |
|
Test build #76442 has finished for PR 17851 at commit
|
R/pkg/R/DataFrame.R
Outdated
|
|
||
| #' hint | ||
| #' | ||
| #' Specifies execution plan hint on the current SparkDataFrame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ouch sorry I didn't mean to put ".... hint and return a new SparkDataFrame"
but "Specifies execution plan hint and return a new SparkDataFrame"
|
do you mean this? we can talk about what to do a bit later. |
felixcheung
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Waiting on Jenkins.
|
it looks like AppVeyor is stuck since about 22 hrs ago.. |
|
Test build #76449 has finished for PR 17851 at commit
|
Adds support for generic hints on `SparkDataFrame` Unit tests, `check-cran.sh` Author: zero323 <[email protected]> Closes #17851 from zero323/SPARK-20585. (cherry picked from commit 9c36aa2) Signed-off-by: Felix Cheung <[email protected]>
|
merged to master/2.2 |
|
Thanks. |
|
@felixcheung was this merged only in master but not branch-2.2? |
|
This is branch-2.2?
3f5c548
It missed the rc2 by a few hours though
|
What changes were proposed in this pull request?
Adds support for generic hints on
SparkDataFrameHow was this patch tested?
Unit tests,
check-cran.sh