[SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames #11319
Conversation
  require(totalSamples > 0,
    "QuantileDiscretizer requires non-empty input dataset but was given an empty input.")
- val requiredSamples = math.max(numBins * numBins, 10000)
+ val requiredSamples = math.max(numBins * numBins, 10000.0)
I'd very slightly prefer to cast this to a double in the following line. This makes `requiredSamples` a double itself, which is slightly wrong.
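The truncating division at the heart of this bug, and both candidate fixes discussed above, can be reproduced in plain Scala. This is a standalone sketch: `numBins` and `datasetCount` are made-up values, and the variable names only mirror those in QuantileDiscretizer.scala.

```scala
// Hypothetical reproduction of SPARK-13444; values are illustrative.
val numBins = 10
val datasetCount = 1000000L

// Buggy: both operands are integral, so the division truncates to zero
// whenever requiredSamples < datasetCount.
val requiredSamplesInt = math.max(numBins * numBins, 10000)
val buggyFraction = requiredSamplesInt / datasetCount      // truncates to 0

// Fix A (this PR): make requiredSamples a Double via the 10000.0 literal.
val requiredSamplesDouble = math.max(numBins * numBins, 10000.0)
val fractionA = requiredSamplesDouble / datasetCount       // nonzero fraction

// Fix B (reviewer's preference): keep requiredSamples an Int and cast
// to Double only at the division site.
val fractionB = requiredSamplesInt.toDouble / datasetCount // same result
```

Either fix yields a nonzero sampling fraction; the reviewer's preference (Fix B) keeps `requiredSamples` an integer count and confines the floating-point conversion to the division site.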
Jenkins, test this please
Test build #51764 has finished for PR 11319 at commit
I think the explicit cast is better as well; it makes it clear that
Could you add a regression test for this?
The point is that
Yeah, I think I can put a test together. Also, the value is hard-coded. Should this be changed to a class attribute? I ask because a test would require a dataset with dataset.count > 10000.
I don't think making a 10K element test set is too big. I wouldn't expose it as a configurable setter just for this, no.
 * Minimum number of samples required for finding splits, regardless of number of bins. If
 * the dataset has less rows than this value, the entire dataset column will be used.
 */
val minSamplesRequired: Int = 10000
The test will require creating a dataset at least as large as `minSamplesRequired`, so making its value excessively large could slow down testing.
less rows -> fewer rows
entire dataset column -> entire dataset?
10000 just isn't that big. A dummy data set in a test would be, what, a megabyte in memory? Am I missing a bigger problem there?
I don't think you're missing anything. It's my first time contributing and I just want to be explicit for reviewers of the patch. I agree that 10K isn't that big, especially out "in the wild". However, I wasn't sure if there were standards for time/memory consumption for tests so I added the line note so that reviewers with more experience than me would be aware in case there are established/de facto testing standards.
I'll make the "->" changes you've indicated and push a new commit sometime today. Thanks.
Also, my original reason for asking about removing the hard-coded value of 10K was that the value was partly the cause of the bug, and so a regression test would need to know it.
I could have hard-coded 10K into my test. However, if a developer increased its value later, say to 100K, without increasing the hard-coded test value as well, they would render the test useless since it would always pass.
Yeah we have muuuch bigger tests that turn up whole small clusters. Making a little data set is fine. You can expose the 10000 figure as a `private[spark] val` in the object so you can reference it from test code.
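A minimal sketch of the suggested pattern. Names are illustrative, and in Spark itself the constant would be declared `private[spark]`; the access modifier is omitted here so the snippet compiles standalone.

```scala
// Companion-object constant standing in for the hard-coded 10000, so tests
// can derive their dataset size from it instead of duplicating the number.
object QuantileDiscretizer {
  // Minimum number of samples required for finding splits; in the real
  // patch this would be `private[spark] val`.
  val minSamplesRequired: Int = 10000
}

// A regression test can now size its data relative to the constant: if the
// threshold is ever raised, the test still exercises the sampling path
// instead of silently passing with a now-too-small dataset.
val testDatasetSize = 2 * QuantileDiscretizer.minSamplesRequired
val samplingFraction =
  QuantileDiscretizer.minSamplesRequired.toDouble / testDatasetSize
```

Deriving `testDatasetSize` from the constant is exactly the coupling the discussion above asks for: the test and the implementation can no longer drift apart.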
Jenkins, test this please
Test build #51967 has finished for PR 11319 at commit
@srowen Let me know if you'd like any further changes. Also, thanks for the review and your help.
…DataFrames Change line 113 of QuantileDiscretizer.scala to `val requiredSamples = math.max(numBins * numBins, 10000.0)` so that `requiredSamples` is a `Double`. This will fix the division in line 114, which currently results in zero if `requiredSamples < dataset.count`. Manual tests. I was having problems using QuantileDiscretizer with a dataset, and after making this change QuantileDiscretizer behaves as expected. Author: Oliver Pierson <[email protected]> Author: Oliver Pierson <[email protected]> Closes #11319 from oliverpierson/SPARK-13444. (cherry picked from commit 6f8e835) Signed-off-by: Sean Owen <[email protected]>
Merged to master/1.6
This broke compilation in branch 1.6:
Ah... I am 0 for 2 today. I am away from keyboard, so feel free to revert it if it is a Blocker. This really doesn't have to go in this branch.
Hmmm.... any idea why it's failing? This line is no different than line 115 of the same file, except for the line breaks.
.setInputCol("input")
.setOutputCol("result")
.setNumBuckets(numBuckets)
.setSeed(1)
This is the offending line
@oliverpierson I reverted the change in branch-1.6 so we don't block other builds. Could you remove the
@mengxr No problem. I'll do it later today when I get to my laptop. I thought since random sampling of data was used to find splits, then
@mengxr Sorry, was in a rush and missed the last part of your comment. I've opened a new PR, but against
I didn't revert the changes in master. I have to revert the changes in 1.6 because
What changes were proposed in this pull request?

Change line 113 of QuantileDiscretizer.scala to `val requiredSamples = math.max(numBins * numBins, 10000.0)` so that `requiredSamples` is a `Double`. This will fix the division in line 114, which currently results in zero if `requiredSamples < dataset.count`.

How was this patch tested?

Manual tests. I was having problems using QuantileDiscretizer with a dataset, and after making this change QuantileDiscretizer behaves as expected.