Skip to content

Conversation

@WeichenXu123
Copy link
Contributor

@WeichenXu123 WeichenXu123 commented Apr 20, 2017

What changes were proposed in this pull request?

When reg == 0, MLOR has multiple solutions and we need to centralize the coeffs to get identical result.
BUT current implementation centralize the coefficientMatrix by the global coeffs means.

In fact the coefficientMatrix should be centralized on each feature index itself.
Because, according to the MLOR probability distribution function, it can be proven easily that:
suppose { w0, w1, .. w(K-1) } make up the coefficientMatrix,
then { w0 + c, w1 + c, ... w(K - 1) + c} will also be the equivalent solution.
c is an arbitrary vector of numFeatures dimension.
reference
https://core.ac.uk/download/pdf/6287975.pdf

So that we need to centralize the coefficientMatrix on each feature dimension separately.

We can also confirm this through R library glmnet, that MLOR in glmnet always generate coefficients result that the sum of each dimension is all zero, when reg == 0.

How was this patch tested?

Tests added.

@WeichenXu123 WeichenXu123 changed the title fix MLOR coeffs centering when reg == 0 [ML] fix MLOR coeffs centering when reg == 0 Apr 20, 2017
@SparkQA
Copy link

SparkQA commented Apr 20, 2017

Test build #75990 has finished for PR 17706 at commit 1d0fb87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor

sethah commented Apr 20, 2017

@WeichenXu123 Thanks for the pr. Is there a JIRA? Why is testing "not applicable"? Seems you are correct on this, but could you please provide a good reference?

@WeichenXu123 WeichenXu123 changed the title [ML] fix MLOR coeffs centering when reg == 0 [SPARK-20423][ML] fix MLOR coeffs centering when reg == 0 Apr 21, 2017
@SparkQA
Copy link

SparkQA commented Apr 21, 2017

Test build #76018 has started for PR 17706 at commit a694d13.

@WeichenXu123
Copy link
Contributor Author

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Apr 21, 2017

Test build #76021 has finished for PR 17706 at commit a694d13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

LGTM, thanks for catching this. cc @dbtsai

@dbtsai
Copy link
Member

dbtsai commented Apr 21, 2017

LGTM. Thanks!

asfgit pushed a commit that referenced this pull request Apr 21, 2017
## What changes were proposed in this pull request?

When reg == 0, MLOR has multiple solutions and we need to centralize the coeffs to get identical result.
BUT current implementation centralize the `coefficientMatrix` by the global coeffs means.

In fact the `coefficientMatrix` should be centralized on each feature index itself.
Because, according to the MLOR probability distribution function, it can be proven easily that:
suppose `{ w0, w1, .. w(K-1) }` make up the `coefficientMatrix`,
then `{ w0 + c, w1 + c, ... w(K - 1) + c}` will also be the equivalent solution.
`c` is an arbitrary vector of `numFeatures` dimension.
reference
https://core.ac.uk/download/pdf/6287975.pdf

So that we need to centralize the `coefficientMatrix` on each feature dimension separately.

**We can also confirm this through R library `glmnet`, that MLOR in `glmnet` always generate coefficients result that the sum of each dimension is all `zero`, when reg == 0.**

## How was this patch tested?

Tests added.

Author: WeichenXu <[email protected]>

Closes #17706 from WeichenXu123/mlor_center.

(cherry picked from commit eb00378)
Signed-off-by: DB Tsai <[email protected]>
@asfgit asfgit closed this in eb00378 Apr 21, 2017
-0.3180040, 0.9679074, -0.2252219, -0.4319914,
0.2452411, -0.6046524, 0.1050710, 0.1180180), isTransposed = true)

model1.coefficientMatrix.colIter.foreach(v => assert(v.toArray.sum ~== 0.0 absTol eps))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we tested that the coefficients have zero mean using:

assert(model1.coefficientMatrix.toArray.sum ~== 0.0 absTol eps)

We should replace every instance of that test with this new one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks right to me. Let's add a test that failing the original implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LBFGS seems to automatically find a solution where the coefficients for each feature index sum to zero, so I'm not sure of a way to find a case where this does not happen, TBH.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. In theory, LBFGS can not see this since all LBFGS knows is the objective function. I think if we feed it with different solutions by justing a constant vector, LBFGS will stop as well. I think maybe it's related to how we setup the initial condition leading to the solution we get.

@dbtsai
Copy link
Member

dbtsai commented Apr 21, 2017

Merged into master and 2.2 branch. Thanks.

@WeichenXu123 WeichenXu123 deleted the mlor_center branch April 22, 2017 01:44
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
## What changes were proposed in this pull request?

When reg == 0, MLOR has multiple solutions and we need to centralize the coeffs to get identical result.
BUT current implementation centralize the `coefficientMatrix` by the global coeffs means.

In fact the `coefficientMatrix` should be centralized on each feature index itself.
Because, according to the MLOR probability distribution function, it can be proven easily that:
suppose `{ w0, w1, .. w(K-1) }` make up the `coefficientMatrix`,
then `{ w0 + c, w1 + c, ... w(K - 1) + c}` will also be the equivalent solution.
`c` is an arbitrary vector of `numFeatures` dimension.
reference
https://core.ac.uk/download/pdf/6287975.pdf

So that we need to centralize the `coefficientMatrix` on each feature dimension separately.

**We can also confirm this through R library `glmnet`, that MLOR in `glmnet` always generate coefficients result that the sum of each dimension is all `zero`, when reg == 0.**

## How was this patch tested?

Tests added.

Author: WeichenXu <[email protected]>

Closes apache#17706 from WeichenXu123/mlor_center.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants