Skip to content

Conversation

@WeichenXu123
Copy link
Contributor

@WeichenXu123 WeichenXu123 commented Sep 7, 2017

What changes were proposed in this pull request?

Make several improvements in dataframe vectorized summarizer.

  1. Make the summarizer return Vector type for all metrics (except "count").
    It will return "WrappedArray" type before which won't be very convenient.

  2. Make MetricsAggregate inherit ImplicitCastInputTypes trait. So it can check and implicitly cast input values.

  3. Add "weight" parameter for all single metric method.

  4. Update doc and improve the example code in doc.

  5. Simplified test cases.

How was this patch tested?

Test added and simplified.

@WeichenXu123
Copy link
Contributor Author

cc @yanboliang @thunterdb Thanks!

@SparkQA
Copy link

SparkQA commented Sep 7, 2017

Test build #81517 has finished for PR 19156 at commit 7b9fbdc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of default parameters, it tends to cause issues with binary compatibility. Unless you have some good reasons, you should have two different functions:

def mean(col: Column): Column = mean(col, lit(1.0))
def mean(col: Column, weightCol: Column): Column = ...

@WeichenXu123
Copy link
Contributor Author

Thanks @thunterdb code updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use def metrics(metrics: String*) instead of def metrics(firstMetric: String, metrics: String*).
It will make pyspark call this interface more easier. (Later I will add python API)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you tried about java? IIRC this style is for java compatibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't test this on Java. But, I can find some other places use similar style, such as Dataset.toDF, Dataset.drop, Does it mean they also have java compatibility issue ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Do you say about this bug ? https://issues.apache.org/jira/browse/SPARK-5904
But it is only related to abstract method.
Now I add java testsuite to make sure it works fine.

@SparkQA
Copy link

SparkQA commented Sep 8, 2017

Test build #81554 has finished for PR 19156 at commit f5b0b11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Copy link
Contributor Author

ping @yanboliang Any other comments ?
We need merge this before 2.3 release.

@WeichenXu123 WeichenXu123 changed the title [SPARK-19634][FOLLOW-UP][ML] Improve interface of dataframe vectorized summarizer [SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of dataframe vectorized summarizer Sep 18, 2017
@yanboliang
Copy link
Contributor

@WeichenXu123 Sorry for late response, really busy in these days. I will take a look in a few days. Thanks for your patience.

@WeichenXu123
Copy link
Contributor Author

@cloud-fan Can you help review the part of code which related to SQL interface ?

@yanboliang
Copy link
Contributor

I'd like to make a pass soon.

@cloud-fan
Copy link
Contributor

the SQL part LGTM

@SparkQA
Copy link

SparkQA commented Nov 8, 2017

Test build #83574 has finished for PR 19156 at commit 480e80d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change the return type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of them works, but other similar aggregate function also use Any. Will it cause some issues ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you let me know why did you make this change? I think we should use long array rather than double array to store numNonZeros.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

org.apache.spark.mllib.stat.MultivariateOnlineSummarizer also return Vector for numNonZeros. So I prefer keep consistent with it.

Copy link
Contributor

@yanboliang yanboliang Dec 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old mllib.stat.MultivariateOnlineSummarizer, the internal variable is type of Array[Long], but the return type is Vector. Do you know the impact of using Vector internally? Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internally still use Array[Long] to do the computation. Only when returning result, convert it to vector.

@WeichenXu123 WeichenXu123 force-pushed the improve_vec_summarizer branch 2 times, most recently from 525692e to 2e4b232 Compare November 9, 2017 08:38
@SparkQA
Copy link

SparkQA commented Nov 9, 2017

Test build #83639 has finished for PR 19156 at commit 525692e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 9, 2017

Test build #83640 has finished for PR 19156 at commit 2e4b232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about binary compatibility? e.g. spark jobs built with old spark versions, can they run on new Spark without re-compile?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class was added after 2.2, does it matters ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah then it doesn't matter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you remove the test against ground true value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: weight can be abbreviated to w.

@WeichenXu123 WeichenXu123 force-pushed the improve_vec_summarizer branch from 2e4b232 to 5647a49 Compare December 13, 2017 10:40
@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84845 has finished for PR 19156 at commit 5647a49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84954 has finished for PR 19156 at commit 4d6617e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Copy link
Contributor Author

Jenkins retest this please.

@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84958 has finished for PR 19156 at commit 4d6617e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 force-pushed the improve_vec_summarizer branch from 4d6617e to f34da1f Compare December 15, 2017 13:10
@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84960 has finished for PR 19156 at commit f34da1f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Copy link
Contributor Author

Jenkins retest this please.

@SparkQA
Copy link

SparkQA commented Dec 19, 2017

Test build #85109 has finished for PR 19156 at commit f34da1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@yanboliang yanboliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except the last comment. Thanks.

}
}

test("basic error handling") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you remove these two tests?

@SparkQA
Copy link

SparkQA commented Dec 21, 2017

Test build #85229 has finished for PR 19156 at commit 24697f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

Merged into master, thanks.

@asfgit asfgit closed this in d3ae3e1 Dec 21, 2017
@WeichenXu123 WeichenXu123 deleted the improve_vec_summarizer branch April 24, 2019 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants