[SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch) #11402

oliverpierson · 2016-02-26T21:17:35Z

What changes were proposed in this pull request?

Fixes a bug in QuantileDiscretizer that results in the wrong number of bins for datasets larger than 10K rows and adds regression test. This PR corrects an issue with PR #11319.

How was this patch tested?

Manual tests and test-only QuantileDiscretizerSuite

srowen · 2016-02-26T21:38:39Z

It looks like this only contains the test?

oliverpierson · 2016-02-26T22:29:09Z

Yeah, was about to add the other stuff and got pulled away on something
else. I'll push the other code this evening when I back to my computer.

On Friday, February 26, 2016, Sean Owen [email protected] wrote:

It looks like this only contains the test?

—
Reply to this email directly or view it on GitHub
#11402 (comment).

fixed bad split bug in QuantileDiscretizer

oliverpierson · 2016-02-27T04:16:02Z

Ok, this one should be good to go.

srowen · 2016-02-27T13:56:18Z

Jenkins test this please

SparkQA · 2016-02-27T14:05:15Z

Test build #52121 has finished for PR 11402 at commit db2594d.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-02-27T14:07:04Z

Jenkins retest this please

SparkQA · 2016-02-27T14:43:01Z

Test build #52123 has finished for PR 11402 at commit db2594d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-27T14:45:08Z

Test build #52124 has finished for PR 11402 at commit db2594d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

oliverpierson · 2016-02-27T14:52:00Z

The new test is failing for Jenkins, however it's passing on my machine. I'll try to take a look at it this weekend but may be Monday before I can get to it.

oliverpierson · 2016-02-29T20:30:24Z

After running the test on my machine again, I discovered that it randomly passes/fails. It appears that the problem is in findSplitsCandidate. This method will give n+1 buckets under certain circumstances when only n buckets are desired. The reason that the new test randomly passes/fails is because it involves random sampling of the data in order to estimate the quantiles.

However, the method can still fail deterministically. For example, consider the following:

val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
discretizer.fit(df).getSplits

This gives the following splits:

Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)

which corresponds to six buckets.

There are a few ways to fix findSplitCandidates. The most straightforward (albeit, less elegant) way is to track the number of splits discovered so far while iterating the while loop and terminate the loop when (index < valueCounts.length && splitsSoFar < numSplits). I believe this is probably the best option for the bug in branch-1.6. If there's no objections I can put a commit together.

As for the master branch, I'm considering rewriting the findSplitCandidates method using the usual method for finding quantiles. It's done this way in Numpy/Scipy and I believe it would be at least as fast as the current routine. I'm curious if anybody has any objections or concerns when it comes to rewrite?

mengxr · 2016-03-01T06:35:56Z

@oliverpierson I haven't seen this test fails in the master build. If I'm correct, we control the random seed in the master branch resulting deterministic behavior. But we don't have it in branch-1.6. If that is the case, we can either backport the commit that implements setSeed (574571c) or backport it but hide the public APIs and fix the seed on branch-1.6 (so we don't expose new APIs). @srowen Which one do you prefer?

oliverpierson · 2016-03-01T14:25:12Z

@mengxr You're correct that backporting setSeed will make the new test pass, however we'd still be left with issue above, namely that QuantileDiscretizer can return the wrong number of buckets. I think that the wrong bucket number issue is minor since (I believe) it only affects the min/max values in your data. In that case, it may be fine to leave the findSplitCandidates as is and just backport setSeed.

mengxr · 2016-03-01T17:35:04Z

@oliverpierson That would be a separate issue. Could you create a JIRA so we can move our discussion there? cc: @yinxusen

For this PR, let's backport the setSeed commit to branch-1.6 as a bug fix.

oliverpierson · 2016-03-01T18:48:07Z

Sounds good. You can find the new JIRA here SPARK-13600.

srowen · 2016-03-02T16:03:14Z

I think it's simplest to back-port the setSeed commit, and then re-add the call to this PR. OK by everyone?

oliverpierson · 2016-03-02T21:58:26Z

Sounds good. Curious how we should go about re-adding the call to setSeed? Should I wait until after it's backported and then rebase my SPARK-13444 branch of the last commit to branch-1.6?

srowen · 2016-03-03T22:47:14Z

OK @mengxr I'm going to back-port 574571c for 1.6.2. Then we basically try the original approach again here.

srowen · 2016-03-04T14:17:43Z

OK, it's back-ported now. I think this PR can again use setSeed. If we can make it pass then we deal with SPARK-13600 separately

mengxr · 2016-03-04T16:33:49Z

Let me try to backport the original commit from master to branch-1.6 then.

mengxr · 2016-03-04T20:53:54Z

Merged #11319 back to branch-1.6. Ran the test couple times on my local machine and all passed. @oliverpierson Could you please close this PR? Thanks!

oliverpierson · 2016-03-04T20:57:08Z

No problem. Thanks!

oliverpierson changed the title ~~[SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch)~~ [SPARK-13444] [MLlib] [WIP] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch) Feb 26, 2016

fixed bad split bug in QuantileDiscretizer

db2594d

fixed bad split bug in QuantileDiscretizer

oliverpierson force-pushed the SPARK-13444 branch from 9a34d9a to db2594d Compare February 27, 2016 04:11

oliverpierson changed the title ~~[SPARK-13444] [MLlib] [WIP] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch)~~ [SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch) Feb 27, 2016

oliverpierson closed this Mar 4, 2016

[SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch) #11402

[SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch) #11402

Uh oh!

Conversation

oliverpierson commented Feb 26, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

srowen commented Feb 26, 2016

Uh oh!

oliverpierson commented Feb 26, 2016

Uh oh!

oliverpierson commented Feb 27, 2016

Uh oh!

srowen commented Feb 27, 2016

Uh oh!

SparkQA commented Feb 27, 2016

Uh oh!

srowen commented Feb 27, 2016

Uh oh!

SparkQA commented Feb 27, 2016

Uh oh!

SparkQA commented Feb 27, 2016

Uh oh!

oliverpierson commented Feb 27, 2016

Uh oh!

oliverpierson commented Feb 29, 2016

Uh oh!

mengxr commented Mar 1, 2016

Uh oh!

oliverpierson commented Mar 1, 2016

Uh oh!

mengxr commented Mar 1, 2016

Uh oh!

oliverpierson commented Mar 1, 2016

Uh oh!

srowen commented Mar 2, 2016

Uh oh!

oliverpierson commented Mar 2, 2016

Uh oh!

srowen commented Mar 3, 2016

Uh oh!

srowen commented Mar 4, 2016

Uh oh!

mengxr commented Mar 4, 2016

Uh oh!

mengxr commented Mar 4, 2016

Uh oh!

oliverpierson commented Mar 4, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants