[SPARK-16258][SparkR] Automatically append the grouping keys in SparkR's gapply #14431
Conversation
docs/sparkr.md (outdated)
```r
head(collect(arrange(result, "max_eruption", decreasing = TRUE)))

##   waiting max_eruption
## 1      64        5.100
```
Previously, there was a typo in the examples. It is easy to see by running:

```r
> result <- data.frame(aggregate(faithful$eruptions, by = list(faithful$waiting), FUN = max))
> result <- head(result[order(result$x, decreasing = TRUE), ])
> result
```
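For context, the corrected line is the tail of the gapply example on the faithful dataset in docs/sparkr.md; a sketch of that surrounding example, reconstructed here from the docs (details may differ slightly from the exact file text):

```r
df <- createDataFrame(faithful)

# Determine six waiting times with the largest eruption time in minutes.
schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))
result <- gapply(
    df,
    "waiting",
    function(key, x) {
        y <- data.frame(key, max(x$eruptions))
        colnames(y) <- c("waiting", "max_eruption")
        y
    },
    schema)
head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
```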
Test build #63059 has finished for PR 14431 at commit
Test build #63064 has finished for PR 14431 at commit
Test build #63065 has finished for PR 14431 at commit
Test build #63061 has finished for PR 14431 at commit
@NarineK Thanks for the PR. The thing I worry about is that this will break any code users write with the 2.0 release, and they'll need to change their code if we ship this in 2.1. Other than passing the option around, do you know if there is any way to maintain backwards compatibility?
That's a good point, @shivaram. However, I was thinking about the following option. Does this sound reasonable, or is it still hackish?
Yeah, I think something like that is fine. Basically, doing some pre-processing or post-processing after the UDF has run, using our own R code, is a good way to add new features.
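A minimal sketch of that post-processing idea, assuming a wrapper applied in R around the user's UDF (the helper name and exact mechanics are invented here for illustration, not taken from the PR):

```r
# Hypothetical wrapper: run the user's UDF, then bind the grouping key
# columns onto its output in plain R, so existing UDFs keep working and
# key appending can stay behind an option.
wrapWithKeys <- function(func) {
  function(key, x) {
    out <- func(key, x)
    cbind(as.data.frame(key), out)  # prepend the key columns to the result
  }
}
```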
Cool! Let me give that option a try.
It seems that, currently, in SparkR the
Sure, appending more information to the R object is fine. Also, it looks like we actually have a handle to the RelationalGroupedDataset when we call groupBy on the Scala side.
Thanks, @shivaram! Yes, we have a handle to RelationalGroupedDataset, but I couldn't access the column fields of the RelationalGroupedDataset instance. Is there a way to access the columns?
I'm not sure I understand the question. Also, some of the SQL committers, like @liancheng, might be able to answer this better.
My point is the following. Let's say we have:
[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L48 |
Yes, @shivaram, that would be one way to do it.
Made a pull request for grouping columns: #14742 |
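For reference, a hedged sketch of how the R side might read the grouping expressions through SparkR's JVM bridge once #14742 exposes them (the accessor name is assumed from the linked PR, not verified):

```r
# GroupedData in SparkR holds the Java-side RelationalGroupedDataset in
# its `sgd` slot; callJMethod invokes a JVM method on that handle.
gd <- groupBy(df, "waiting")
exprs <- SparkR:::callJMethod(gd@sgd, "groupingExprs")  # assumed accessor
```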
@NarineK @shivaram Any updates here? Also cc @felixcheung
I think this is still useful to have, and perhaps important for staying consistent with what might be planned for Python?
This will introduce an external behavior change, right? |
Yes, but we only need read access.
```r
#' Model Species (Intercept) Sepal_Width Petal_Length Petal_Width
#' 1             0.699883    0.3303370   0.9455356   -0.1697527
#' 2             1.895540    0.3868576   0.9083370   -0.6792238
#' 3             2.351890    0.6548350   0.2375602    0.2521257
```
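The snippet above is expected output from the roxygen example for gapply, which fits a linear model per Species on iris; a reconstructed sketch of that example follows (under this PR the Species key column would be appended automatically, hence the new header, and the schema would additionally carry the appended Species column):

```r
df <- createDataFrame(iris)  # SparkR renames Sepal.Length to Sepal_Length, etc.
schema <- structType(structField("(Intercept)", "double"),
                     structField("Sepal_Width", "double"),
                     structField("Petal_Length", "double"),
                     structField("Petal_Width", "double"))
result <- gapply(df, "Species", function(key, x) {
  m <- suppressWarnings(lm(Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width,
                           data = x))
  data.frame(t(coef(m)))
}, schema)
```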
This is an external change. I think such an external change is not acceptable after we have already introduced it. Right?
It's going to be a breaking change, yes.
In the past we had a discussion about backward compatibility with @shivaram: #14431 (comment)
I think I didn't push the R changes because I wanted to be able to access the grouping columns on the SQL side first. Without access to the grouping columns, I couldn't find a way to add this while preserving backward compatibility.
Yes, I'd think it's reasonable if it's behind a switch.
You just need to change the above line. You can access
@NarineK How about adding this as a new API, e.g.,
@falaki, I'd be fine with a separate
Thank you, @gatorsmile! I'll give it a try.
I haven't looked at this closely.
Just off the top of my head: if two methods are semantically very similar, I'd prefer to have one method with a parameter. We already have all sorts of problems with the number of methods we export.
Obviously a breaking change should be avoided.
Perhaps, as suggested here, we can maintain backward compatibility:
#14431 (comment)
and here: #14431 (comment)
If we want to avoid yet another method, we could add this functionality as a non-default behavior. E.g.,
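A hypothetical illustration of such a non-default option (the flag name `appendKeys` is invented here, since the actual suggestion was elided above):

```r
# Default behavior unchanged; opting in appends the grouping key columns
# to the UDF output automatically.
result <- gapply(df, "waiting",
                 function(key, x) data.frame(max_eruption = max(x$eruptions)),
                 schema,
                 appendKeys = TRUE)  # hypothetical flag, FALSE by default
```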
Compared to introducing a new API, I think @falaki's idea of adding a non-default option is better.
I think @falaki's approach is good; my only concern is that the key, which is passed as an argument together with x as input to the function, is a little superfluous.
Btw, if the key is the very first column, that sounds like prefix and not append? And about your comment, do you mean
I think
I'm not too worried about the exact words, but
Alright, give me a couple of days to address those cases.
@gatorsmile, I'm able to access
What would you suggest as the best option for accessing the column name from
Thank you,
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
The following pull request addresses the new feature request described in SPARK-16258.
It automatically (by default) appends the grouping keys to the output DataFrame.
I've also tried to solve the problem by adding an optional flag in gapply that states whether the key is required or not. However, the optional flag needs to be passed as an argument through a number of methods, which is not necessarily elegant and leads to issues such as "The number of parameters should not exceed 10" in '..../logical/object.scala:290'.
Since this pull request already appends the grouping key automatically, I was wondering whether we really need to pass 'key' as the R function's input argument, i.e., function(key, x) {....}. Isn't it superfluous?
I'd be happy to hear your thoughts on that.
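For illustration, a sketch of the two UDF shapes under discussion (the keyless form is hypothetical and not part of the current API):

```r
# Today's contract: the UDF receives the grouping key and must add it to
# its output by hand if the key should appear in the result.
func_with_key <- function(key, x) {
  y <- data.frame(key, max(x$eruptions))
  colnames(y) <- c("waiting", "max_eruption")
  y
}

# If the key were appended automatically, a keyless UDF would suffice:
func_without_key <- function(x) {
  data.frame(max_eruption = max(x$eruptions))
}
```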
Thanks!
How was this patch tested?
Test cases in R.
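A hedged sketch of what such a test might look like in SparkR's testthat suite (test name, schema ordering, and expectation are assumptions, not the PR's actual tests):

```r
test_that("gapply appends the grouping key columns to the output", {
  df <- createDataFrame(faithful)
  schema <- structType(structField("waiting", "double"),
                       structField("max_eruption", "double"))
  result <- gapply(df, "waiting", function(key, x) {
    data.frame(max_eruption = max(x$eruptions))  # key assumed appended automatically
  }, schema)
  expect_true("waiting" %in% colnames(collect(result)))
})
```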