Skip to content

Conversation

oliverpierson
Copy link

What changes were proposed in this pull request?

Fixes a bug in QuantileDiscretizer that results in the wrong number of bins for datasets larger than 10K rows and adds regression test. This PR corrects an issue with PR #11319.

How was this patch tested?

Manual tests and test-only QuantileDiscretizerSuite

@oliverpierson oliverpierson changed the title [SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch) [SPARK-13444] [MLlib] [WIP] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch) Feb 26, 2016
@srowen
Copy link
Member

srowen commented Feb 26, 2016

It looks like this only contains the test?

@oliverpierson
Copy link
Author

Yeah, was about to add the other stuff and got pulled away on something
else. I'll push the other code this evening when I back to my computer.

On Friday, February 26, 2016, Sean Owen [email protected] wrote:

It looks like this only contains the test?


Reply to this email directly or view it on GitHub
#11402 (comment).

fixed bad split bug in QuantileDiscretizer
@oliverpierson
Copy link
Author

Ok, this one should be good to go.

@oliverpierson oliverpierson changed the title [SPARK-13444] [MLlib] [WIP] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch) [SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch) Feb 27, 2016
@srowen
Copy link
Member

srowen commented Feb 27, 2016

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Feb 27, 2016

Test build #52121 has finished for PR 11402 at commit db2594d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Feb 27, 2016

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Feb 27, 2016

Test build #52123 has finished for PR 11402 at commit db2594d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 27, 2016

Test build #52124 has finished for PR 11402 at commit db2594d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@oliverpierson
Copy link
Author

The new test is failing for Jenkins, however it's passing on my machine. I'll try to take a look at it this weekend but may be Monday before I can get to it.

@oliverpierson
Copy link
Author

After running the test on my machine again, I discovered that it randomly passes/fails. It appears that the problem is in findSplitsCandidate. This method will give n+1 buckets under certain circumstances when only n buckets are desired. The reason that the new test randomly passes/fails is because it involves random sampling of the data in order to estimate the quantiles.

However, the method can still fail deterministically. For example, consider the following:

val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
discretizer.fit(df).getSplits

This gives the following splits:

Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)

which corresponds to six buckets.

There are a few ways to fix findSplitCandidates. The most straightforward (albeit, less elegant) way is to track the number of splits discovered so far while iterating the while loop and terminate the loop when (index < valueCounts.length && splitsSoFar < numSplits). I believe this is probably the best option for the bug in branch-1.6. If there's no objections I can put a commit together.

As for the master branch, I'm considering rewriting the findSplitCandidates method using the usual method for finding quantiles. It's done this way in Numpy/Scipy and I believe it would be at least as fast as the current routine. I'm curious if anybody has any objections or concerns when it comes to rewrite?

@mengxr
Copy link
Contributor

mengxr commented Mar 1, 2016

@oliverpierson I haven't seen this test fails in the master build. If I'm correct, we control the random seed in the master branch resulting deterministic behavior. But we don't have it in branch-1.6. If that is the case, we can either backport the commit that implements setSeed (574571c) or backport it but hide the public APIs and fix the seed on branch-1.6 (so we don't expose new APIs). @srowen Which one do you prefer?

@oliverpierson
Copy link
Author

@mengxr You're correct that backporting setSeed will make the new test pass, however we'd still be left with issue above, namely that QuantileDiscretizer can return the wrong number of buckets. I think that the wrong bucket number issue is minor since (I believe) it only affects the min/max values in your data. In that case, it may be fine to leave the findSplitCandidates as is and just backport setSeed.

@mengxr
Copy link
Contributor

mengxr commented Mar 1, 2016

@oliverpierson That would be a separate issue. Could you create a JIRA so we can move our discussion there? cc: @yinxusen

For this PR, let's backport the setSeed commit to branch-1.6 as a bug fix.

@oliverpierson
Copy link
Author

Sounds good. You can find the new JIRA here SPARK-13600.

@srowen
Copy link
Member

srowen commented Mar 2, 2016

I think it's simplest to back-port the setSeed commit, and then re-add the call to this PR. OK by everyone?

@oliverpierson
Copy link
Author

Sounds good. Curious how we should go about re-adding the call to setSeed? Should I wait until after it's backported and then rebase my SPARK-13444 branch of the last commit to branch-1.6?

@srowen
Copy link
Member

srowen commented Mar 3, 2016

OK @mengxr I'm going to back-port 574571c for 1.6.2. Then we basically try the original approach again here.

@srowen
Copy link
Member

srowen commented Mar 4, 2016

OK, it's back-ported now. I think this PR can again use setSeed. If we can make it pass then we deal with SPARK-13600 separately

@mengxr
Copy link
Contributor

mengxr commented Mar 4, 2016

Let me try to backport the original commit from master to branch-1.6 then.

@mengxr
Copy link
Contributor

mengxr commented Mar 4, 2016

Merged #11319 back to branch-1.6. Ran the test couple times on my local machine and all passed. @oliverpierson Could you please close this PR? Thanks!

@oliverpierson
Copy link
Author

No problem. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants