Skip to content

Conversation

@zero323
Copy link
Member

@zero323 zero323 commented May 4, 2017

What changes were proposed in this pull request?

Adds support for generic hints on SparkDataFrame

How was this patch tested?

Unit tests, check-cran.sh

@SparkQA
Copy link

SparkQA commented May 4, 2017

Test build #76435 has finished for PR 17851 at commit 261e5a6.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 4, 2017

Test build #76436 has finished for PR 17851 at commit ee52b53.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#'
#' @param x a SparkDataFrame.
#' @param name a name of the hint.
#' @param ... additional argument(s) passed to the method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case the ... is actually meaningful, so I'd suggest documenting it, eg. similar to scala, "(optional) properties"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scala is even more cryptic here. I adjusted it a bit, but I think we can revisit this once we have some practical examples.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yes ;)

#' @param x a SparkDataFrame.
#' @param name a name of the hint.
#' @param ... additional argument(s) passed to the method.
#'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove empty line

#' @examples
#' \dontrun{
#' df <- createDataFrame(mtcars)
#' avg_mpg <- mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you recreated createDataFrame(mtcars) here, do you mean to use df from the line before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or you want to show these are two different dataset? maybe it's worthwhile to comment that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to use different datasets to avoid aliasing and trivially equal warning. It works but it is confusing (note: equi-join syntax same as in Scala or Python would be great, and it shouldn't be that hard to add). Once we merge alias this shouldn't been an issue. Since we don't run it, I can add aliases now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alias is going to 2.3, this is going to 2.2 - I think we leave this for now and can improve this in master later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also with alias it will be quite dense:

#' @examples
#' \dontrun{
#' # Set aliases to avoid ambiguity
#' df <- alias(createDataFrame(mtcars), "cars")
#' avg_mpg <- alias(mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg"), "avg_mpg")
#'
#' head(join(
#'   df, hint(avg_mpg, "broadcast"), 
#'   column("cars.cyl") == column("avg_mpg.cyl")
#' ))
#' }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right - I think the example makes sense now but it might not be very obvious - for example,

createDataFrame(mtcars)
createDataFrame(mtcars)

vs

df <- createDataFrame(mtcars)
df

is not very subtle unless you know what Spark is doing differently here. This is why I suggested pointing out the need to have distinct "copies" of data


#' hint
#'
#' Specifies execution plan hint on the current SparkDataFrame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the R programming model is a bit different - I think it is better to point out the original SparkDataFrame is not actually changed - instead say .... hint and return a new SparkDataFrame is better

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch sorry I didn't mean to put ".... hint and return a new SparkDataFrame"
but "Specifies execution plan hint and return a new SparkDataFrame"

execution_plan_hint <- capture.output(
explain(join(df1, hint(df2, "broadcast"), df1$id == df2$id))
)
expect_true(any(grepl("BroadcastHashJoin", execution_plan_hint)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

#' @rdname hint
#' @export
setGeneric("hint", function(x, name, ...) { standardGeneric("hint") })

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should move after groupBy, I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@zero323
Copy link
Member Author

zero323 commented May 4, 2017

@felixcheung Do you think this makes o.a.s.sql.functions.broadcast obsolete? I a have WIP on this but it is a tricky one. There is an internal, non-generic broadcast, with different signature so we'd have to either adjust it. or use different name (broadcast_df, broadcast_table`?).

@SparkQA
Copy link

SparkQA commented May 4, 2017

Test build #76442 has finished for PR 17851 at commit 1183441.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


#' hint
#'
#' Specifies execution plan hint on the current SparkDataFrame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch sorry I didn't mean to put ".... hint and return a new SparkDataFrame"
but "Specifies execution plan hint and return a new SparkDataFrame"

@felixcheung
Copy link
Member

felixcheung commented May 4, 2017

do you mean this?

we can talk about what to do a bit later.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Waiting on Jenkins.

@felixcheung
Copy link
Member

it looks like AppVeyor is stuck since about 22 hrs ago..

@SparkQA
Copy link

SparkQA commented May 4, 2017

Test build #76449 has finished for PR 17851 at commit e6c6d82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 4, 2017
Adds support for generic hints on `SparkDataFrame`

Unit tests, `check-cran.sh`

Author: zero323 <[email protected]>

Closes #17851 from zero323/SPARK-20585.

(cherry picked from commit 9c36aa2)
Signed-off-by: Felix Cheung <[email protected]>
@felixcheung
Copy link
Member

merged to master/2.2

@asfgit asfgit closed this in 9c36aa2 May 4, 2017
@zero323
Copy link
Member Author

zero323 commented May 4, 2017

Thanks.

@rxin
Copy link
Contributor

rxin commented May 5, 2017

@felixcheung was this merged only in master but not branch-2.2?

@felixcheung
Copy link
Member

felixcheung commented May 6, 2017 via email

@zero323 zero323 deleted the SPARK-20585 branch May 8, 2017 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants