[SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames #11319
Conversation
  require(totalSamples > 0,
    "QuantileDiscretizer requires non-empty input dataset but was given an empty input.")
- val requiredSamples = math.max(numBins * numBins, 10000)
+ val requiredSamples = math.max(numBins * numBins, 10000.0)
I'd very slightly prefer to cast this to a double in the following line. This makes `requiredSamples` a double itself, which is slightly wrong.
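The truncating division at the heart of this bug, and both candidate fixes discussed above, can be reproduced in plain Scala. This is a standalone sketch: `numBins` and `datasetCount` are made-up values, and the variable names only mirror those in QuantileDiscretizer.scala.

```scala
// Hypothetical reproduction of SPARK-13444; values are illustrative.
val numBins = 10
val datasetCount = 1000000L

// Buggy: both operands are integral, so the division truncates to zero
// whenever requiredSamples < datasetCount.
val requiredSamplesInt = math.max(numBins * numBins, 10000)
val buggyFraction = requiredSamplesInt / datasetCount      // truncates to 0

// Fix A (this PR): make requiredSamples a Double via the 10000.0 literal.
val requiredSamplesDouble = math.max(numBins * numBins, 10000.0)
val fractionA = requiredSamplesDouble / datasetCount       // nonzero fraction

// Fix B (reviewer's preference): keep requiredSamples an Int and cast
// to Double only at the division site.
val fractionB = requiredSamplesInt.toDouble / datasetCount // same result
```

Either fix yields a nonzero sampling fraction; the reviewer's preference (Fix B) keeps `requiredSamples` an integer count and confines the floating-point conversion to the division site.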
Jenkins, test this please
Test build #51764 has finished for PR 11319 at commit
I think the explicit cast is better as well; it makes it clear that
Could you add a regression test for this?
The point is that
Yeah, I think I can put a test together. Also, the value is hard-coded. Should this be changed to a class attribute? I ask because a test would require a dataset with dataset.count > 10000.
I don't think making a 10K element test set is too big. I wouldn't expose it as a configurable setter just for this, no.
 * Minimum number of samples required for finding splits, regardless of number of bins. If
 * the dataset has less rows than this value, the entire dataset column will be used.
 */
val minSamplesRequired: Int = 10000
The test will require creating a dataset at least as large as `minSamplesRequired`, so making its value excessively large could slow down testing.
less rows -> fewer rows
entire dataset column -> entire dataset?
10000 just isn't that big. A dummy data set in a test would be, what, a megabyte in memory? Am I missing a bigger problem there?
I don't think you're missing anything. It's my first time contributing and I just want to be explicit for reviewers of the patch. I agree that 10K isn't that big, especially out "in the wild". However, I wasn't sure if there were standards for time/memory consumption for tests so I added the line note so that reviewers with more experience than me would be aware in case there are established/de facto testing standards.
I'll make the "->" changes you've indicated and push a new commit sometime today. Thanks.
Also, my original reason for asking about removing the hard-coded value of 10K was that the value was partly the cause of the bug, and so a regression test would need to know it.
I could have hard-coded 10K into my test. However, if a developer increased its value later, say to 100K, without increasing the hard-coded test value as well, they would render the test useless since it would always pass.
Yeah we have muuuch bigger tests that turn up whole small clusters. Making a little data set is fine. You can expose the 10000 figure as a `private[spark] val` in the object so you can reference it from test code.
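A minimal sketch of the suggested pattern. Names are illustrative, and in Spark itself the constant would be declared `private[spark]`; the access modifier is omitted here so the snippet compiles standalone.

```scala
// Companion-object constant standing in for the hard-coded 10000, so tests
// can derive their dataset size from it instead of duplicating the number.
object QuantileDiscretizer {
  // Minimum number of samples required for finding splits; in the real
  // patch this would be `private[spark] val`.
  val minSamplesRequired: Int = 10000
}

// A regression test can now size its data relative to the constant: if the
// threshold is ever raised, the test still exercises the sampling path
// instead of silently passing with a now-too-small dataset.
val testDatasetSize = 2 * QuantileDiscretizer.minSamplesRequired
val samplingFraction =
  QuantileDiscretizer.minSamplesRequired.toDouble / testDatasetSize
```

Deriving `testDatasetSize` from the constant is exactly the coupling the discussion above asks for: the test and the implementation can no longer drift apart.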
Jenkins, test this please
Test build #51967 has finished for PR 11319 at commit
@srowen Let me know if you'd like any further changes. Also, thanks for the review and your help.
…DataFrames Change line 113 of QuantileDiscretizer.scala to `val requiredSamples = math.max(numBins * numBins, 10000.0)` so that `requiredSamples` is a `Double`. This will fix the division in line 114, which currently results in zero if `requiredSamples < dataset.count`. Manual tests. I was having problems using QuantileDiscretizer with a dataset, and after making this change QuantileDiscretizer behaves as expected. Author: Oliver Pierson <[email protected]> Author: Oliver Pierson <[email protected]> Closes #11319 from oliverpierson/SPARK-13444. (cherry picked from commit 6f8e835) Signed-off-by: Sean Owen <[email protected]>
Merged to master/1.6
This broke compilation in branch 1.6:
Ah... I am 0 for 2 today. I am away from keyboard, so feel free to revert it if it is a Blocker. This really doesn't have to go in this branch.
Hmmm.... any idea why it's failing? This line is no different than line 115 of the same file, except for the line breaks.
.setInputCol("input")
.setOutputCol("result")
.setNumBuckets(numBuckets)
.setSeed(1)
This is the offending line
@oliverpierson I reverted the change in branch-1.6 so we don't block other builds. Could you remove the
@mengxr No problem. I'll do it later today when I get to my laptop. I thought since random sampling of data was used to find splits, then
@mengxr Sorry, was in a rush and missed the last part of your comment. I've opened a new PR, but against
I didn't revert the changes in master. I have to revert the changes in 1.6 because
What changes were proposed in this pull request?

Change line 113 of QuantileDiscretizer.scala to `val requiredSamples = math.max(numBins * numBins, 10000.0)` so that `requiredSamples` is a `Double`. This will fix the division in line 114, which currently results in zero if `requiredSamples < dataset.count`.

How was this patch tested?

Manual tests. I was having problems using QuantileDiscretizer with a dataset, and after making this change QuantileDiscretizer behaves as expected.