-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.variance generate negative result #19029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| realVariance(i) = (currM2n(i) + deltaMean(i) * deltaMean(i) * weightSum(i) * | ||
| (totalWeightSum - weightSum(i)) / totalWeightSum) / denominator | ||
| // Because of numerical error, it is possible to get negative real variance | ||
| if (realVariance(i) < 0.0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use math.max(0.0 ...) in the line above? no need to assign it twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The computation of variance may be touch this numerical error, it seems WeightedLeastSquares also use the same method to compute variance , does it will have similar issue? @WeichenXu123
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. WeightedLeastSquares use another way to compute variance Var(X) = E(X^2) - E(X)^2. But it seems also possible to have this problem.
|
Test build #81032 has finished for PR 19029 at commit
|
|
Test build #81118 has finished for PR 19029 at commit
|
yanboliang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comments, otherwise, LGTM.
| * Weighted population standard deviation of labels. | ||
| */ | ||
| def bStd: Double = math.sqrt(bbSum / wSum - bBar * bBar) | ||
| def bStd: Double = math.sqrt(math.max(bbSum / wSum - bBar * bBar, 0.0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comment here and bellow to clarify that we are preventing from negative value caused by numerical error.
|
Test build #81127 has finished for PR 19029 at commit
|
|
Test build #81129 has finished for PR 19029 at commit
|
| * We prevent variance from negative value caused by numerical error. | ||
| */ | ||
| def bStd: Double = math.sqrt(bbSum / wSum - bBar * bBar) | ||
| def bStd: Double = math.sqrt(math.max(bbSum / wSum - bBar * bBar, 0.0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple more places where variance is computed in this file -- I think they need this too?
|
Test build #81167 has finished for PR 19029 at commit
|
|
|
||
| /** | ||
| * Weighted population standard deviation of labels. | ||
| * We prevent variance from negative value caused by numerical error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so against this, but this is really an implementation detail and not relevant to the caller. It's a value that is by definition nonnegative.
|
Test build #81171 has finished for PR 19029 at commit
|
|
Merged to master/2.2 |
…ance generate negative result
Because of numerical error, MultivariateOnlineSummarizer.variance is possible to generate negative variance.
**This is a serious bug because many algos in MLLib**
**use stddev computed from** `sqrt(variance)`
**it will generate NaN and crash the whole algorithm.**
we can reproduce this bug use the following code:
```
val summarizer1 = (new MultivariateOnlineSummarizer)
.add(Vectors.dense(3.0), 0.7)
val summarizer2 = (new MultivariateOnlineSummarizer)
.add(Vectors.dense(3.0), 0.4)
val summarizer3 = (new MultivariateOnlineSummarizer)
.add(Vectors.dense(3.0), 0.5)
val summarizer4 = (new MultivariateOnlineSummarizer)
.add(Vectors.dense(3.0), 0.4)
val summarizer = summarizer1
.merge(summarizer2)
.merge(summarizer3)
.merge(summarizer4)
println(summarizer.variance(0))
```
This PR fix the bugs in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and several places in `WeightedLeastSquares`
test cases added.
Author: WeichenXu <[email protected]>
Closes #19029 from WeichenXu123/fix_summarizer_var_bug.
(cherry picked from commit 0456b40)
Signed-off-by: Sean Owen <[email protected]>
…ance generate negative result
Because of numerical error, MultivariateOnlineSummarizer.variance is possible to generate negative variance.
**This is a serious bug because many algos in MLLib**
**use stddev computed from** `sqrt(variance)`
**it will generate NaN and crash the whole algorithm.**
we can reproduce this bug use the following code:
```
val summarizer1 = (new MultivariateOnlineSummarizer)
.add(Vectors.dense(3.0), 0.7)
val summarizer2 = (new MultivariateOnlineSummarizer)
.add(Vectors.dense(3.0), 0.4)
val summarizer3 = (new MultivariateOnlineSummarizer)
.add(Vectors.dense(3.0), 0.5)
val summarizer4 = (new MultivariateOnlineSummarizer)
.add(Vectors.dense(3.0), 0.4)
val summarizer = summarizer1
.merge(summarizer2)
.merge(summarizer3)
.merge(summarizer4)
println(summarizer.variance(0))
```
This PR fix the bugs in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and several places in `WeightedLeastSquares`
test cases added.
Author: WeichenXu <[email protected]>
Closes apache#19029 from WeichenXu123/fix_summarizer_var_bug.
(cherry picked from commit 0456b40)
Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
Because of numerical error, MultivariateOnlineSummarizer.variance is possible to generate negative variance.
This is a serious bug because many algos in MLLib
use stddev computed from
sqrt(variance)it will generate NaN and crash the whole algorithm.
we can reproduce this bug use the following code:
This PR fix the bugs in
mllib.stat.MultivariateOnlineSummarizer.varianceandml.stat.SummarizerBuffer.variance, and several places inWeightedLeastSquaresHow was this patch tested?
test cases added.