-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-23254][ML] Add user guide entry and example for DataFrame multivariate summary #20446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #86856 has finished for PR 20446 at commit
|
MLnick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor comments
docs/ml-statistics.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps "contain" -> "are" or "include"?
docs/ml-statistics.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps "The following example demonstrates using Summarizer(...) to compute the mean and variance for the input dataframe, with and without a weight column"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just df.select(...).show()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because spark user will usually want to get the summary result (multiple vectors), I want to show the simple way to extract these results from the returned dataframe which contains only one row. I think some user is possible to get stuck here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok fair enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just df.select(...).show()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same applies here, why not just df.select(...).show()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same applies here, why not just df.select(...).show()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, but Tuple1 not required here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean us .as[((Vector, Vector))] ? It compile fails..
or Do you mean change to
val (meanVal, varianceVal) = df.select(metrics("mean", "variance")
.summary($"features", $"weight"))
.as[(Vector, Vector)].first()
? Seems also do not work because it is a "struct type" value in the returned row. So the first row format should match Row(Row(mean, variance))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh ok - perhaps select("summary.mean", "summary.variance") would work to extract into two columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. This way make code easier to read.
Done.
|
Test build #86968 has finished for PR 20446 at commit
|
docs/ml-statistics.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, one more comment here
I think perhaps "... to compute the mean and variance for a vector column of the input dataframe ..."
(and same below)
MLnick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple minor nit-picking comments, otherwise LGTM
|
Test build #86975 has finished for PR 20446 at commit
|
|
Test build #86980 has finished for PR 20446 at commit
|
f02172f to
f9eb02a
Compare
|
Test build #89569 has finished for PR 20446 at commit
|
|
Test build #89570 has finished for PR 20446 at commit
|
|
Test build #92539 has finished for PR 20446 at commit
|
|
@WeichenXu123 looks like there was one more outstanding comment, about using |
|
@srowen The reason I do not use |
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, you replied on the first instance of that comment.
|
Merged to master |
What changes were proposed in this pull request?
Add user guide and scala/java/python examples for
ml.stat.SummarizerHow was this patch tested?
Doc generated snapshot: