
Conversation

@NarineK
Contributor

@NarineK NarineK commented Aug 1, 2016

What changes were proposed in this pull request?

The following pull request addresses the new feature request described in SPARK-16258.
It automatically (by default) appends the grouping keys to the output DataFrame.

I also tried to solve the problem by adding an optional flag to gapply that states whether the key is required. However, the flag would need to be passed as an argument through a number of methods, which is not particularly elegant and leads to issues such as "The number of parameters should not exceed 10" in '..../logical/object.scala:290'.

Since this pull request already appends the grouping key automatically, I was wondering whether we really need to pass 'key' as an input argument of the R function - function(key, x) {....}. Isn't it superfluous?
I'd be happy to hear your thoughts on that.
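
To make the difference concrete, here is a rough sketch based on the faithful example from docs/sparkr.md (illustrative only; the exact semantics of the appended key column are what this PR proposes):

df <- createDataFrame(faithful)
schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))

# Today the UDF has to attach the grouping key to its output itself:
result <- gapply(df, "waiting", function(key, x) {
  data.frame(key, max(x$eruptions))
}, schema)

# With this change the key column would be appended to the output
# automatically, so the UDF could return only the aggregated column
# (hypothetical behavior under this proposal):
result <- gapply(df, "waiting", function(key, x) {
  data.frame(max_eruption = max(x$eruptions))
}, schema)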

Thanks!

How was this patch tested?

Test cases in R.

@NarineK changed the title from "[SPARK-16258][SparkR][WIP] Gapply add key attach option" to "[SPARK-16258][SparkR][WIP] Automatically append the grouping keys in SparkR's gapply" on Aug 1, 2016
@NarineK changed the title from "[SPARK-16258][SparkR][WIP] Automatically append the grouping keys in SparkR's gapply" to "[SPARK-16258][SparkR] Automatically append the grouping keys in SparkR's gapply" on Aug 1, 2016
docs/sparkr.md Outdated
head(collect(arrange(result, "max_eruption", decreasing = TRUE)))

## waiting max_eruption
##1 64 5.100
Contributor Author

@NarineK NarineK Aug 1, 2016

Previously, there was a typo in this example.
It is easy to see by running:

> result <- data.frame(aggregate(faithful$eruptions, by = list(faithful$waiting), FUN = max))
> result <-  head(result[order(result$x, decreasing = TRUE), ])
> result

@SparkQA

SparkQA commented Aug 1, 2016

Test build #63059 has finished for PR 14431 at commit 575fcf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2016

Test build #63064 has finished for PR 14431 at commit f235227.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2016

Test build #63065 has finished for PR 14431 at commit 44ee864.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2016

Test build #63061 has finished for PR 14431 at commit 8db1d08.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

shivaram commented Aug 1, 2016

@NarineK Thanks for the PR. The thing I worry about is that this will break any code users wrote against the 2.0 release, and they'll need to change their code if we ship this in 2.1 -- other than passing the option around, do you know if there is any way to maintain backwards compatibility?

@NarineK
Contributor Author

NarineK commented Aug 2, 2016

That's a good point, @shivaram.
worker.R is the component that has the keys and appends them to the output.
I don't see any elegant way of doing it in worker.R yet.

However, I was thinking about the following option:
we can still have an optional flag in gapply that states whether the key is required or not, but we will not pass it over to the Scala side.
By default we always prepend the keys in worker.R, and in group.R we can have a check such as:

if (!prependKey) {
  # de-attach/remove the appended key columns
}
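
For concreteness, a rough sketch of that stripping step (illustrative only; the helper name and the assumption that the keys sit in the first columns are mine):

# Assumes worker.R has already placed the grouping key columns as the first
# `nkeys` columns of the UDF result `res`; drop them again if not requested.
stripKeysIfNotRequested <- function(res, nkeys, prependKey) {
  if (!prependKey && nkeys > 0) {
    res <- res[, -seq_len(nkeys), drop = FALSE]
  }
  res
}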

Does this sound reasonable, or is it still hackish?

@shivaram
Contributor

shivaram commented Aug 2, 2016

Yeah, I think something like that is fine. Basically, doing some pre-processing or post-processing before or after the UDF has run, using our own R code, is a good way to add new features.

@NarineK
Contributor Author

NarineK commented Aug 2, 2016

Cool! Let me give that option a try.

@NarineK
Contributor Author

NarineK commented Aug 7, 2016

It seems that, currently, SparkR's GroupedData, which represents Scala's grouped-data object, doesn't have any information about the grouping keys. RelationalGroupedDataset has a private attribute groupingExprs which contains information about the grouping columns, but it is not accessible from the R side. I was thinking that maybe we could pass the grouping columns to group.R, e.g. groupedData(sgd, cols).
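
A loose sketch of what that could look like on the R side (illustrative only; the cols slot is an assumption and this is not the actual SparkR source):

# Assumes SparkR's internal "jobj" S4 class is available, as in the package.
setClass("GroupedData", representation(sgd = "jobj", cols = "character"))

groupedData <- function(sgd, cols) {
  new("GroupedData", sgd = sgd, cols = cols)
}
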
Any thoughts, @shivaram?
Thanks!

@shivaram
Contributor

shivaram commented Aug 9, 2016

Sure - appending more information to the R object is fine. Also, it looks like we actually have a handle to the RelationalGroupedDataset when we call groupBy on the Scala side:

def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {

@NarineK
Contributor Author

NarineK commented Aug 10, 2016

Thanks, @shivaram! Yes, we have a handle to the RelationalGroupedDataset, but I couldn't access the column fields of a RelationalGroupedDataset instance. Is there a way to access the columns?

@shivaram
Contributor

I'm not sure I understand the question. Also, some of the SQL committers like @liancheng might be able to answer this better.

@NarineK
Contributor Author

NarineK commented Aug 10, 2016

My point is the following. Say we have:

val relationalGroupedDataset = df.groupBy("col1", "col2")

Now, given relationalGroupedDataset, how can I find out the grouping columns? There is nothing like relationalGroupedDataset.columns or relationalGroupedDataset.groupExpression, is there?

@shivaram
Contributor

groupingExprs is a member of the class, as I can see in [1]. Also, we convert these grouping expressions to columns in the flatMapGroupsInR function [2] -- so we could add a new function that does a similar mapping but just returns the column names?

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L48
[2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L413
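
A rough sketch of such a helper, written as a member of RelationalGroupedDataset (illustrative only; groupingColumnNames is a made-up name and this is not existing Spark code):

import org.apache.spark.sql.catalyst.expressions.NamedExpression

// Map each grouping expression to a column name, mirroring the aliasing done
// in flatMapGroupsInR; fall back to the expression's SQL text otherwise.
def groupingColumnNames: Seq[String] = groupingExprs.map {
  case ne: NamedExpression => ne.name
  case other => other.sql
}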

@NarineK
Contributor Author

NarineK commented Aug 12, 2016

Yes, @shivaram, that would be one way to do it: basically, add a new public function to RelationalGroupedDataset that returns the column names.
If that is fine from the SQL perspective, maybe I can create a separate pull request for it?
cc: @liancheng

@NarineK
Contributor Author

NarineK commented Aug 22, 2016

Made a pull request for grouping columns: #14742

@sameeragarwal
Member

@NarineK @shivaram any updates here? Also cc @felixcheung

@felixcheung
Member

felixcheung commented Jun 19, 2017 via email

@shivaram
Contributor

AFAIK this was dependent on #14742, but @NarineK may know better

@NarineK
Contributor Author

NarineK commented Jun 19, 2017

Hi everyone, yes, it depends on #14742. I've been asked to close #14742.
For this PR I need to access the grouping columns. If you think there is an alternative way of accessing that information, I'd be happy to make the changes in this PR.
Thanks!

@gatorsmile
Member

This will introduce an external behavior change, right?

@NarineK
Contributor Author

NarineK commented Jun 19, 2017

Yes, but we only need read access.

#' 1 0.699883 0.3303370 0.9455356 -0.1697527
#' 2 1.895540 0.3868576 0.9083370 -0.6792238
#' 3 2.351890 0.6548350 0.2375602 0.2521257
#' Model Species (Intercept) Sepal_Width Petal_Length Petal_Width
Member

This is an external change. I think such an external change is not acceptable after we have already introduced the existing behavior, right?

Member

It's going to be a breaking change, yes.

Contributor Author

In the past we had a discussion with @shivaram about backward compatibility:
#14431 (comment)

I think I didn't push the R changes because I wanted to be able to access the grouping columns on the SQL side first. Without access to the grouping columns I couldn't find a way to keep backward compatibility without breaking anything.

Member

Yes, I'd think it's reasonable if it's under a switch.

@gatorsmile
Member

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L51

You just need to change the above line so that groupingExprs becomes a val:

val groupingExprs: Seq[Expression],

You can then access groupingExprs from SQLUtils.scala.
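
Assuming that change, a minimal sketch of what a helper in SQLUtils.scala could look like (getGroupingExprStrings is a made-up name, not existing Spark code; the R side could then reach it via callJStatic):

import org.apache.spark.sql.RelationalGroupedDataset

// Hypothetical helper: return the SQL text of each grouping expression,
// once groupingExprs is exposed as a val on RelationalGroupedDataset.
def getGroupingExprStrings(gd: RelationalGroupedDataset): Array[String] = {
  gd.groupingExprs.map(_.sql).toArray
}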

@falaki
Contributor

falaki commented Jun 30, 2017

@NarineK how about adding this as a new API, e.g. gapplyWithKeys()? I am extremely worried about the semantic change: it can break existing SparkR applications and will be confusing for users.

@NarineK
Contributor Author

NarineK commented Jun 30, 2017

@falaki, I'd be fine with a separate gapplyWithKeys() method too.
@shivaram, @felixcheung, what do you think? Should we add a new gapplyWithKeys() method?

@NarineK
Contributor Author

NarineK commented Jun 30, 2017

Thank you, @gatorsmile! I'll give it a try.

@felixcheung
Member

felixcheung commented Jun 30, 2017 via email

@falaki
Contributor

falaki commented Jun 30, 2017

If we want to avoid yet another method, we could add this functionality as a non-default behavior. E.g.,

gapply(df, "key", function(key, x) { x }, schema(df), appendKeys = F)

@shivaram
Contributor

Compared to introducing a new API, I think @falaki's idea of adding a non-default option is better.

@NarineK
Contributor Author

NarineK commented Jun 30, 2017

I think @falaki's approach is good; it's just that I find the key that is passed together with x as an input to the function a little superfluous.

@felixcheung
Member

btw, if the key is the very first column, that sounds like a prefix and not an append?
Perhaps return.data.frame.key.column = FALSE?

And about your comment, do you mean the key in function(key, x) { x }?
IMO it's quite helpful to know which group (i.e. key) the UDF is processing.

@NarineK
Contributor Author

NarineK commented Jul 1, 2017

I think prepend sounds better. What do you think?
Yes, the key in function(key, x) { x } can be useful for some use cases, but I also think the user could easily prepend it to the data frame themselves if needed, since the key is already there.

@felixcheung
Member

I'm not too worried about the exact words, but prepend keys doesn't make it obvious what it means.
Also, please use something.something as the parameter name and not camel casing - we should try to do that unless the name is in Spark Scala.

@NarineK
Contributor Author

NarineK commented Jul 5, 2017

Alright, give me a couple of days to address those cases.

@NarineK
Contributor Author

NarineK commented Jul 10, 2017

@gatorsmile, I'm able to access groupingExprs from SQLUtils.scala through val groupingExprs: Seq[Expression]; however, it seems challenging to get the column name from a plain Expression. In RelationalGroupedDataset an alias is used to create a named expression: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L83

What would you suggest as the best way of accessing a column name from an Expression?

Thank you,
Narine

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 18, 2020
@github-actions github-actions bot closed this Jan 20, 2020