-
Couldn't load subscription status.
- Fork 28.9k
[SPARK-19634][SQL][ML][FOLLOW-UP] Improve interface of dataframe vectorized summarizer #19156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @yanboliang @thunterdb Thanks! |
|
Test build #81517 has finished for PR 19156 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a fan of default parameters, it tends to cause issues with binary compatibility. Unless you have some good reasons, you should have two different functions:
def mean(col: Column): Column = mean(col, lit(1.0))
def mean(col: Column, weightCol: Column): Column = ...|
Thanks @thunterdb code updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use def metrics(metrics: String*) instead of def metrics(firstMetric: String, metrics: String*).
It will make pyspark call this interface more easier. (Later I will add python API)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you tried about java? IIRC this style is for java compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't test this on Java. But, I can find some other places use similar style, such as Dataset.toDF, Dataset.drop, Does it mean they also have java compatibility issue ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Do you say about this bug ? https://issues.apache.org/jira/browse/SPARK-5904
But it is only related to abstract method.
Now I add java testsuite to make sure it works fine.
|
Test build #81554 has finished for PR 19156 at commit
|
|
ping @yanboliang Any other comments ? |
|
@WeichenXu123 Sorry for late response, really busy in these days. I will take a look in a few days. Thanks for your patience. |
|
@cloud-fan Can you help review the part of code which related to SQL interface ? |
|
I'd like to make a pass soon. |
|
the SQL part LGTM |
|
Test build #83574 has finished for PR 19156 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change the return type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both of them works, but other similar aggregate function also use Any. Will it cause some issues ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you let me know why did you make this change? I think we should use long array rather than double array to store numNonZeros.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
org.apache.spark.mllib.stat.MultivariateOnlineSummarizer also return Vector for numNonZeros. So I prefer keep consistent with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the old mllib.stat.MultivariateOnlineSummarizer, the internal variable is type of Array[Long], but the return type is Vector. Do you know the impact of using Vector internally? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Internally still use Array[Long] to do the computation. Only when returning result, convert it to vector.
525692e to
2e4b232
Compare
|
Test build #83639 has finished for PR 19156 at commit
|
|
Test build #83640 has finished for PR 19156 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about binary compatibility? e.g. spark jobs built with old spark versions, can they run on new Spark without re-compile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class was added after 2.2, does it matters ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah then it doesn't matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you remove the test against ground true value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: weight can be abbreviated to w.
2e4b232 to
5647a49
Compare
|
Test build #84845 has finished for PR 19156 at commit
|
|
Test build #84954 has finished for PR 19156 at commit
|
|
Jenkins retest this please. |
|
Test build #84958 has finished for PR 19156 at commit
|
4d6617e to
f34da1f
Compare
|
Test build #84960 has finished for PR 19156 at commit
|
|
Jenkins retest this please. |
|
Test build #85109 has finished for PR 19156 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except the last comment. Thanks.
| } | ||
| } | ||
|
|
||
| test("basic error handling") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you remove these two tests?
|
Test build #85229 has finished for PR 19156 at commit
|
|
Merged into master, thanks. |
What changes were proposed in this pull request?
Make several improvements in dataframe vectorized summarizer.
Make the summarizer return
Vectortype for all metrics (except "count").It will return "WrappedArray" type before which won't be very convenient.
Make
MetricsAggregateinheritImplicitCastInputTypestrait. So it can check and implicitly cast input values.Add "weight" parameter for all single metric method.
Update doc and improve the example code in doc.
Simplified test cases.
How was this patch tested?
Test added and simplified.