[SPARK-23254][ML] Add user guide entry and example for DataFrame multivariate summary #20446

WeichenXu123 · 2018-01-31T01:45:26Z

What changes were proposed in this pull request?

Add a user guide entry and Scala/Java/Python examples for ml.stat.Summarizer

How was this patch tested?

Doc generated snapshot:

WeichenXu123 · 2018-01-31T01:46:24Z

@MLnick @MrBago Thanks!

SparkQA · 2018-01-31T02:02:52Z

Test build #86856 has finished for PR 20446 at commit 307f75f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class JavaSummarizerExample

MLnick

Few minor comments

MLnick · 2018-02-01T13:52:49Z

docs/ml-statistics.md

Perhaps "contain" -> "are" or "include"?

MLnick · 2018-02-01T13:58:34Z

docs/ml-statistics.md

Perhaps "The following example demonstrates using Summarizer(...) to compute the mean and variance for the input dataframe, with and without a weight column"?

MLnick · 2018-02-01T13:59:14Z

examples/src/main/java/org/apache/spark/examples/ml/JavaSummarizerExample.java

Why not just df.select(...).show()?

Because Spark users will usually want to get the summary result (multiple vectors), I want to show a simple way to extract these results from the returned DataFrame, which contains only one row. I think some users could get stuck here.

ok fair enough

MLnick · 2018-02-01T13:59:26Z

examples/src/main/java/org/apache/spark/examples/ml/JavaSummarizerExample.java

Why not just df.select(...).show()?

MLnick · 2018-02-01T13:59:46Z

examples/src/main/scala/org/apache/spark/examples/ml/SummarizerExample.scala

Same applies here, why not just df.select(...).show()?

MLnick · 2018-02-01T13:59:54Z

examples/src/main/scala/org/apache/spark/examples/ml/SummarizerExample.scala

Same applies here, why not just df.select(...).show()?

MLnick · 2018-02-02T06:34:14Z

examples/src/main/scala/org/apache/spark/examples/ml/SummarizerExample.scala

nit, but Tuple1 not required here?

Do you mean using .as[((Vector, Vector))]? That fails to compile.
Or do you mean changing it to

val (meanVal, varianceVal) = df.select(metrics("mean", "variance").summary($"features", $"weight")).as[(Vector, Vector)].first()

? That does not seem to work either, because the returned row holds a struct-type value, so the first row's format has to match Row(Row(mean, variance)).

oh ok - perhaps select("summary.mean", "summary.variance") would work to extract into two columns?

Good idea. This makes the code easier to read.
Done.
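
For readers following this thread, here is a minimal sketch of the approach agreed on above: alias the struct column returned by summary(...), select its two fields as separate columns, and read the single result row as a tuple. The object name SummarizerSketch, the helper weightedSummary, the toy data, and the local master setting are assumptions made for illustration; this is not necessarily the exact code merged in the PR.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only for this sketch.
object SummarizerSketch {

  // Returns the weighted (mean, variance) vectors for a small toy DataFrame.
  def weightedSummary(spark: SparkSession): (Vector, Vector) = {
    import spark.implicits._

    // Toy data (an assumption): a vector column plus a weight column.
    val df = Seq(
      (Vectors.dense(2.0, 3.0, 5.0), 1.0),
      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    ).toDF("features", "weight")

    // summary(...) yields one struct column; the alias lets the
    // fields be addressed as summary.mean and summary.variance.
    val summaryDf = df.select(
      Summarizer.metrics("mean", "variance")
        .summary($"features", $"weight").as("summary"))

    // Flatten the struct into two vector columns, then read the
    // single result row as a (Vector, Vector) tuple.
    summaryDf
      .select("summary.mean", "summary.variance")
      .as[(Vector, Vector)]
      .first()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("SummarizerSketch")
      .master("local[*]")
      .getOrCreate()

    val (meanVal, varianceVal) = weightedSummary(spark)
    println(s"with weight: mean = $meanVal, variance = $varianceVal")

    spark.stop()
  }
}
```

Without the intermediate select on the struct fields, the row has the nested shape Row(Row(mean, variance)) described above, which is why .as[(Vector, Vector)] cannot be applied directly to the output of summary(...).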

SparkQA · 2018-02-02T06:37:34Z

Test build #86968 has finished for PR 20446 at commit 2592bb9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2018-02-02T06:41:49Z

docs/ml-statistics.md

sorry, one more comment here

I think perhaps "... to compute the mean and variance for a vector column of the input dataframe ..."

(and same below)

MLnick

A couple minor nit-picking comments, otherwise LGTM

SparkQA · 2018-02-02T07:46:34Z

Test build #86975 has finished for PR 20446 at commit fc9622b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-02T08:16:32Z

Test build #86980 has finished for PR 20446 at commit f02172f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-04-19T10:09:08Z

Test build #89569 has finished for PR 20446 at commit f9eb02a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-04-19T10:30:35Z

Test build #89570 has finished for PR 20446 at commit ee9d368.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2018-04-19T10:42:38Z

@MLnick @srowen

SparkQA · 2018-07-02T20:35:28Z

Test build #92539 has finished for PR 20446 at commit ee9d368.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2018-07-03T13:12:44Z

@WeichenXu123 looks like there was one more outstanding comment, about using .show()?

WeichenXu123 · 2018-07-09T18:53:05Z

@srowen I have already explained why I do not use .show() here: #20446 (comment)
Thanks!

srowen

I see, you replied on the first instance of that comment.

srowen · 2018-07-11T18:56:30Z

Merged to master

MLnick reviewed Feb 1, 2018

MLnick reviewed Feb 2, 2018

WeichenXu123 added 6 commits April 19, 2018 17:35

init pr

a13eec3

address nick's comments

0935fd1

update doc format

a152f7b

update doc

4cbad95

extract struct type column

60286a8

add python example and guide entry

f9eb02a

WeichenXu123 force-pushed the summ_guide branch from f02172f to f9eb02a Compare April 19, 2018 09:51

WeichenXu123 changed the title ~~[SPARK-23254][ML] Add user guide entry for DataFrame multivariate summary~~ [SPARK-23254][ML] Add user guide entry and example for DataFrame multivariate summary Apr 19, 2018

fix py

ee9d368

srowen approved these changes Jul 10, 2018

asfgit closed this in 59c3c23 Jul 11, 2018

WeichenXu123 deleted the summ_guide branch April 24, 2019 21:18
