Skip to content

Conversation

@WeichenXu123
Copy link
Contributor

@WeichenXu123 WeichenXu123 commented Aug 23, 2017

What changes were proposed in this pull request?

Because of numerical error, MultivariateOnlineSummarizer.variance is possible to generate negative variance.

This is a serious bug because many algos in MLLib
use stddev computed from sqrt(variance)
it will generate NaN and crash the whole algorithm.

we can reproduce this bug use the following code:

    val summarizer1 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.7)
    val summarizer2 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)
    val summarizer3 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.5)
    val summarizer4 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)

    val summarizer = summarizer1
      .merge(summarizer2)
      .merge(summarizer3)
      .merge(summarizer4)

    println(summarizer.variance(0))

This PR fix the bugs in mllib.stat.MultivariateOnlineSummarizer.variance and ml.stat.SummarizerBuffer.variance, and several places in WeightedLeastSquares

How was this patch tested?

test cases added.

realVariance(i) = (currM2n(i) + deltaMean(i) * deltaMean(i) * weightSum(i) *
(totalWeightSum - weightSum(i)) / totalWeightSum) / denominator
// Because of numerical error, it is possible to get negative real variance
if (realVariance(i) < 0.0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use math.max(0.0 ...) in the line above? no need to assign it twice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The computation of variance may be touch this numerical error, it seems WeightedLeastSquares also use the same method to compute variance , does it will have similar issue? @WeichenXu123

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. WeightedLeastSquares use another way to compute variance Var(X) = E(X^2) - E(X)^2. But it seems also possible to have this problem.

@SparkQA
Copy link

SparkQA commented Aug 23, 2017

Test build #81032 has finished for PR 19029 at commit 9c92730.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 25, 2017

Test build #81118 has finished for PR 19029 at commit c24292c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@yanboliang yanboliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comments, otherwise, LGTM.

* Weighted population standard deviation of labels.
*/
def bStd: Double = math.sqrt(bbSum / wSum - bBar * bBar)
def bStd: Double = math.sqrt(math.max(bbSum / wSum - bBar * bBar, 0.0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment here and bellow to clarify that we are preventing from negative value caused by numerical error.

@SparkQA
Copy link

SparkQA commented Aug 25, 2017

Test build #81127 has finished for PR 19029 at commit 9a47579.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 25, 2017

Test build #81129 has finished for PR 19029 at commit 56c0d41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* We prevent variance from negative value caused by numerical error.
*/
def bStd: Double = math.sqrt(bbSum / wSum - bBar * bBar)
def bStd: Double = math.sqrt(math.max(bbSum / wSum - bBar * bBar, 0.0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple more places where variance is computed in this file -- I think they need this too?

@SparkQA
Copy link

SparkQA commented Aug 27, 2017

Test build #81167 has finished for PR 19029 at commit 21e7ff7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* Weighted population standard deviation of labels.
* We prevent variance from negative value caused by numerical error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so against this, but this is really an implementation detail and not relevant to the caller. It's a value that is by definition nonnegative.

@SparkQA
Copy link

SparkQA commented Aug 28, 2017

Test build #81171 has finished for PR 19029 at commit c40eba3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 28, 2017

Merged to master/2.2

@asfgit asfgit closed this in 0456b40 Aug 28, 2017
asfgit pushed a commit that referenced this pull request Aug 28, 2017
…ance generate negative result

Because of numerical error, MultivariateOnlineSummarizer.variance is possible to generate negative variance.

**This is a serious bug because many algos in MLLib**
**use stddev computed from** `sqrt(variance)`
**it will generate NaN and crash the whole algorithm.**

we can reproduce this bug use the following code:
```
    val summarizer1 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.7)
    val summarizer2 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)
    val summarizer3 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.5)
    val summarizer4 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)

    val summarizer = summarizer1
      .merge(summarizer2)
      .merge(summarizer3)
      .merge(summarizer4)

    println(summarizer.variance(0))
```
This PR fix the bugs in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and several places in `WeightedLeastSquares`

test cases added.

Author: WeichenXu <[email protected]>

Closes #19029 from WeichenXu123/fix_summarizer_var_bug.

(cherry picked from commit 0456b40)
Signed-off-by: Sean Owen <[email protected]>
@WeichenXu123 WeichenXu123 deleted the fix_summarizer_var_bug branch August 28, 2017 10:28
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…ance generate negative result

Because of numerical error, MultivariateOnlineSummarizer.variance is possible to generate negative variance.

**This is a serious bug because many algos in MLLib**
**use stddev computed from** `sqrt(variance)`
**it will generate NaN and crash the whole algorithm.**

we can reproduce this bug use the following code:
```
    val summarizer1 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.7)
    val summarizer2 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)
    val summarizer3 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.5)
    val summarizer4 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)

    val summarizer = summarizer1
      .merge(summarizer2)
      .merge(summarizer3)
      .merge(summarizer4)

    println(summarizer.variance(0))
```
This PR fix the bugs in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and several places in `WeightedLeastSquares`

test cases added.

Author: WeichenXu <[email protected]>

Closes apache#19029 from WeichenXu123/fix_summarizer_var_bug.

(cherry picked from commit 0456b40)
Signed-off-by: Sean Owen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants