Skip to content

Conversation

@WeichenXu123
Copy link
Contributor

@WeichenXu123 WeichenXu123 commented Jan 31, 2018

What changes were proposed in this pull request?

Add user guide and scala/java/python examples for ml.stat.Summarizer

How was this patch tested?

Doc generated snapshot:

image
image
image
image

@WeichenXu123
Copy link
Contributor Author

@MLnick @MrBago Thanks!

@SparkQA
Copy link

SparkQA commented Jan 31, 2018

Test build #86856 has finished for PR 20446 at commit 307f75f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaSummarizerExample

Copy link
Contributor

@MLnick MLnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor comments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps "contain" -> "are" or "include"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps "The following example demonstrates using Summarizer(...) to compute the mean and variance for the input dataframe, with and without a weight column"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just df.select(...).show()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because spark user will usually want to get the summary result (multiple vectors), I want to show the simple way to extract these results from the returned dataframe which contains only one row. I think some user is possible to get stuck here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok fair enough

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just df.select(...).show()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same applies here, why not just df.select(...).show()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same applies here, why not just df.select(...).show()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, but Tuple1 not required here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean us .as[((Vector, Vector))] ? It compile fails..
or Do you mean change to

val (meanVal, varianceVal) = df.select(metrics("mean", "variance")
      .summary($"features", $"weight"))
      .as[(Vector, Vector)].first()

? Seems also do not work because it is a "struct type" value in the returned row. So the first row format should match Row(Row(mean, variance))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ok - perhaps select("summary.mean", "summary.variance") would work to extract into two columns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. This way make code easier to read.
Done.

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86968 has finished for PR 20446 at commit 2592bb9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, one more comment here

I think perhaps "... to compute the mean and variance for a vector column of the input dataframe ..."

(and same below)

Copy link
Contributor

@MLnick MLnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple minor nit-picking comments, otherwise LGTM

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86975 has finished for PR 20446 at commit fc9622b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86980 has finished for PR 20446 at commit f02172f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 changed the title [SPARK-23254][ML] Add user guide entry for DataFrame multivariate summary [SPARK-23254][ML] Add user guide entry and example for DataFrame multivariate summary Apr 19, 2018
@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89569 has finished for PR 20446 at commit f9eb02a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89570 has finished for PR 20446 at commit ee9d368.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Copy link
Contributor Author

@MLnick @srowen

@SparkQA
Copy link

SparkQA commented Jul 2, 2018

Test build #92539 has finished for PR 20446 at commit ee9d368.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jul 3, 2018

@WeichenXu123 looks like there was one more outstanding comment, about using .show()?

@WeichenXu123
Copy link
Contributor Author

@srowen The reason I do not use .show I have already reply here #20446 (comment)
thanks!

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you replied on the first instance of that comment.

@srowen
Copy link
Member

srowen commented Jul 11, 2018

Merged to master

@asfgit asfgit closed this in 59c3c23 Jul 11, 2018
@WeichenXu123 WeichenXu123 deleted the summ_guide branch April 24, 2019 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants