[SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of dataframe vectorized summarizer #19156

WeichenXu123 · 2017-09-07T13:13:17Z

What changes were proposed in this pull request?

Make several improvements to the DataFrame vectorized summarizer.

Make the summarizer return Vector type for all metrics (except "count").
Previously it returned "WrappedArray", which was not very convenient.
Make MetricsAggregate inherit the ImplicitCastInputTypes trait, so it can check and implicitly cast input values.
Add a "weight" parameter to all single-metric methods.
Update the doc and improve the example code in the doc.
Simplify test cases.

How was this patch tested?

Tests added and simplified.

WeichenXu123 · 2017-09-07T13:15:42Z

cc @yanboliang @thunterdb Thanks!

SparkQA · 2017-09-07T14:24:43Z

Test build #81517 has finished for PR 19156 at commit 7b9fbdc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

thunterdb · 2017-09-07T17:27:23Z

mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala

I am not a fan of default parameters; they tend to cause issues with binary compatibility. Unless you have some good reasons, you should have two different functions:

    def mean(col: Column): Column = mean(col, lit(1.0))
    def mean(col: Column, weightCol: Column): Column = ...
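The overload pattern suggested here can be sketched without Spark. This is a minimal illustration, not the PR's code: `WeightedMean` and its `Seq[Double]` signatures are hypothetical stand-ins for the `Column`-based methods, and the unit weights play the role of `lit(1.0)`.

```scala
object WeightedMean {
  // Two-argument form: does the actual work.
  def mean(values: Seq[Double], weights: Seq[Double]): Double = {
    require(values.length == weights.length, "values and weights must have the same length")
    val totalWeight = weights.sum
    require(totalWeight > 0.0, "total weight must be positive")
    values.zip(weights).map { case (v, w) => v * w }.sum / totalWeight
  }

  // One-argument overload: delegates with a weight of 1.0 per value,
  // instead of declaring `weights` with a default argument.
  def mean(values: Seq[Double]): Double =
    mean(values, Seq.fill(values.length)(1.0))
}

object WeightedMeanDemo {
  def main(args: Array[String]): Unit = {
    println(WeightedMean.mean(Seq(1.0, 2.0, 3.0)))
    println(WeightedMean.mean(Seq(1.0, 3.0), Seq(1.0, 3.0)))
  }
}
```

With overloads, each arity is an ordinary method in the bytecode, so already-compiled callers of the one-argument form keep linking if a second overload is added later. Turning an existing method into one with an extra defaulted parameter replaces its signature, which is the kind of binary compatibility problem the comment is concerned about.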

WeichenXu123 · 2017-09-08T08:49:52Z

Thanks @thunterdb, code updated.

WeichenXu123 · 2017-09-08T08:52:18Z

mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala

Use def metrics(metrics: String*) instead of def metrics(firstMetric: String, metrics: String*).
This makes it easier for PySpark to call this interface. (I will add the Python API later.)

Have you tried this from Java? IIRC this style is for Java compatibility.

+1 @cloud-fan

I haven't tested this from Java. But I can find other places that use a similar style, such as Dataset.toDF and Dataset.drop. Does that mean they also have Java compatibility issues?

@cloud-fan Are you referring to this bug? https://issues.apache.org/jira/browse/SPARK-5904
But that is only related to abstract methods.
I have now added a Java test suite to make sure it works.
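The trade-off between the two signatures discussed in this thread can be sketched in plain Scala. This is an illustration under stated assumptions, not the code in this PR: `MetricNames` is a hypothetical stand-in that only returns the requested names, and the thread does not say whether the PR relies on `scala.annotation.varargs`, which is one standard way to expose a Java-style `String...` method from Scala.

```scala
import scala.annotation.varargs

object MetricNames {
  // Hypothetical stand-in for a builder method such as Summarizer.metrics;
  // it only returns the requested names so the call shapes can be compared.
  @varargs
  def metrics(names: String*): Seq[String] = {
    // A pure varargs signature accepts zero arguments at compile time,
    // so the "at least one metric" rule becomes a runtime check.
    require(names.nonEmpty, "at least one metric is required")
    names.toList
  }
}

object MetricNamesDemo {
  def main(args: Array[String]): Unit = {
    // Direct call with literal arguments.
    println(MetricNames.metrics("mean", "count"))
    // Forwarding an existing collection in one step, with no need to
    // split off a first element as metrics(firstMetric, rest: _*) would.
    val requested = Seq("mean", "variance")
    println(MetricNames.metrics(requested: _*))
  }
}
```

The `(firstMetric: String, metrics: String*)` form enforces a non-empty list at compile time, while the single varargs form is simpler to call from wrappers that already hold a list of names.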

SparkQA · 2017-09-08T10:00:41Z

Test build #81554 has finished for PR 19156 at commit f5b0b11.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-09-13T11:17:31Z

ping @yanboliang Any other comments?
We need to merge this before the 2.3 release.

yanboliang · 2017-09-19T14:40:57Z

@WeichenXu123 Sorry for the late response; I have been really busy these days. I will take a look in a few days. Thanks for your patience.

WeichenXu123 · 2017-09-22T15:22:36Z

@cloud-fan Can you help review the part of the code related to the SQL interface?

yanboliang · 2017-11-07T05:06:51Z

I'd like to make a pass soon.

cloud-fan · 2017-11-07T11:27:44Z

the SQL part LGTM

SparkQA · 2017-11-08T02:44:54Z

Test build #83574 has finished for PR 19156 at commit 480e80d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-11-08T11:20:34Z

mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala

nit: indent

cloud-fan · 2017-11-08T11:23:57Z

mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala

why change the return type?

Both of them work, but other similar aggregate functions also use Any. Will it cause any issues?

yanboliang · 2017-11-08T23:02:52Z

mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala

Could you let me know why you made this change? I think we should use a long array rather than a double array to store numNonZeros.

org.apache.spark.mllib.stat.MultivariateOnlineSummarizer also returns Vector for numNonZeros, so I prefer to stay consistent with it.

In the old mllib.stat.MultivariateOnlineSummarizer, the internal variable is of type Array[Long], but the return type is Vector. Do you know the impact of using Vector internally? Thanks.

Internally it still uses Array[Long] for the computation. It is converted to a Vector only when returning the result.
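The approach described in this reply, counting in `Long` and converting only at the boundary, can be sketched as follows. This is a simplified illustration, not the summarizer's actual buffer: `NonZeroCounter` is a hypothetical class, rows are plain `Array[Double]`, and the result is returned as an `Array[Double]` that could then be wrapped in a dense vector (for example with Spark's `Vectors.dense`).

```scala
final class NonZeroCounter(numFeatures: Int) {
  // Internal state stays integral, so counts are never subject to
  // floating-point rounding while rows are being accumulated.
  private val nnz = new Array[Long](numFeatures)

  def add(row: Array[Double]): this.type = {
    require(row.length == numFeatures, s"expected $numFeatures features, got ${row.length}")
    var i = 0
    while (i < numFeatures) {
      if (row(i) != 0.0) nnz(i) += 1L
      i += 1
    }
    this
  }

  // Conversion to Double happens only when the result is requested.
  def numNonZeros: Array[Double] = nnz.map(_.toDouble)
}

object NonZeroCounterDemo {
  def main(args: Array[String]): Unit = {
    val counter = new NonZeroCounter(3)
    counter.add(Array(1.0, 0.0, 2.0)).add(Array(0.0, 0.0, 5.0))
    println(counter.numNonZeros.mkString(", "))
  }
}
```

Keeping the accumulator as `Array[Long]` leaves the computation unchanged; only the public return type changes, which matches the behaviour the reply attributes to the old `MultivariateOnlineSummarizer`.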

SparkQA · 2017-11-09T09:46:56Z

Test build #83639 has finished for PR 19156 at commit 525692e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-09T09:53:01Z

Test build #83640 has finished for PR 19156 at commit 2e4b232.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-11-09T10:53:06Z

mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala

How about binary compatibility? E.g. can Spark jobs built with old Spark versions run on new Spark without recompiling?

This class was added after 2.2. Does it matter?

ah then it doesn't matter

yanboliang · 2017-12-13T04:53:32Z

mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala

Why did you remove the test against the ground truth value?

yanboliang · 2017-12-13T04:55:19Z

mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala

nit: weight can be abbreviated to w.

SparkQA · 2017-12-13T11:42:35Z

Test build #84845 has finished for PR 19156 at commit 5647a49.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-15T11:01:12Z

Test build #84954 has finished for PR 19156 at commit 4d6617e.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-12-15T11:39:19Z

Jenkins retest this please.

SparkQA · 2017-12-15T12:42:40Z

Test build #84958 has finished for PR 19156 at commit 4d6617e.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-15T14:13:38Z

Test build #84960 has finished for PR 19156 at commit f34da1f.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-12-19T14:18:45Z

Jenkins retest this please.

SparkQA · 2017-12-19T15:25:38Z

Test build #85109 has finished for PR 19156 at commit f34da1f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang

LGTM except the last comment. Thanks.

yanboliang · 2017-12-20T21:19:18Z

mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala

    }
  }

-  test("basic error handling") {


Why did you remove these two tests?

SparkQA · 2017-12-21T03:36:18Z

Test build #85229 has finished for PR 19156 at commit 24697f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2017-12-21T03:53:04Z

Merged into master, thanks.

thunterdb reviewed Sep 7, 2017

WeichenXu123 commented Sep 8, 2017

WeichenXu123 changed the title ~~[SPARK-19634][FOLLOW-UP][ML] Improve interface of dataframe vectorized summarizer~~ [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of dataframe vectorized summarizer Sep 18, 2017

cloud-fan reviewed Nov 8, 2017

cloud-fan reviewed Nov 8, 2017

yanboliang reviewed Nov 8, 2017

WeichenXu123 force-pushed the improve_vec_summarizer branch 2 times, most recently from 525692e to 2e4b232 Compare November 9, 2017 08:38

cloud-fan reviewed Nov 9, 2017

yanboliang reviewed Dec 13, 2017

WeichenXu123 force-pushed the improve_vec_summarizer branch from 2e4b232 to 5647a49 Compare December 13, 2017 10:40

WeichenXu123 added 3 commits December 15, 2017 21:09
    init pr (60394d9)
    update (6adcfa7)
    update (5b0baf5)

WeichenXu123 added 4 commits December 15, 2017 21:09
    update (2742cd3)
    address comments (6800218)
    address comments (647dbbe)
    improve testcode (f34da1f)

WeichenXu123 force-pushed the improve_vec_summarizer branch from 4d6617e to f34da1f Compare December 15, 2017 13:10

yanboliang reviewed Dec 20, 2017

add no element test (24697f3)

asfgit closed this in d3ae3e1 Dec 21, 2017

WeichenXu123 deleted the improve_vec_summarizer branch April 24, 2019 21:18
