-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames (1.6 patch) #11402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
It looks like this only contains the test? |
Yeah, was about to add the other stuff and got pulled away on something On Friday, February 26, 2016, Sean Owen [email protected] wrote:
|
fixed bad split bug in QuantileDiscretizer
9a34d9a
to
db2594d
Compare
Ok, this one should be good to go. |
Jenkins test this please |
Test build #52121 has finished for PR 11402 at commit
|
Jenkins retest this please |
Test build #52123 has finished for PR 11402 at commit
|
Test build #52124 has finished for PR 11402 at commit
|
The new test is failing for Jenkins, however it's passing on my machine. I'll try to take a look at it this weekend but may be Monday before I can get to it. |
After running the test on my machine again, I discovered that it randomly passes/fails. It appears that the problem is in However, the method can still fail deterministically. For example, consider the following:
This gives the following splits:
which corresponds to six buckets. There are a few ways to fix As for the |
@oliverpierson I haven't seen this test fails in the master build. If I'm correct, we control the random seed in the master branch resulting deterministic behavior. But we don't have it in branch-1.6. If that is the case, we can either backport the commit that implements |
@mengxr You're correct that backporting |
@oliverpierson That would be a separate issue. Could you create a JIRA so we can move our discussion there? cc: @yinxusen For this PR, let's backport the |
Sounds good. You can find the new JIRA here SPARK-13600. |
I think it's simplest to back-port the |
Sounds good. Curious how we should go about re-adding the call to |
OK, it's back-ported now. I think this PR can again use |
Let me try to backport the original commit from master to branch-1.6 then. |
Merged #11319 back to branch-1.6. Ran the test couple times on my local machine and all passed. @oliverpierson Could you please close this PR? Thanks! |
No problem. Thanks! |
What changes were proposed in this pull request?
Fixes a bug in QuantileDiscretizer that results in the wrong number of bins for datasets larger than 10K rows and adds regression test. This PR corrects an issue with PR #11319.
How was this patch tested?
Manual tests and test-only QuantileDiscretizerSuite