Skip to content

Conversation

oliverpierson
Copy link

What changes were proposed in this pull request?

Change line 113 of QuantileDiscretizer.scala to

val requiredSamples = math.max(numBins * numBins, 10000.0)

so that requiredSamples is a Double. This will fix the division in line 114 which currently results in zero if requiredSamples < dataset.count

How was the this patch tested?

Manual tests. I was having a problems using QuantileDiscretizer with my a dataset and after making this change QuantileDiscretizer behaves as expected.

require(totalSamples > 0,
"QuantileDiscretizer requires non-empty input dataset but was given an empty input.")
val requiredSamples = math.max(numBins * numBins, 10000)
val requiredSamples = math.max(numBins * numBins, 10000.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd very slightly prefer to cast this to a double in the following line. This makes requiredSamples a double itself, which is slightly wrong.

@srowen
Copy link
Member

srowen commented Feb 23, 2016

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Feb 23, 2016

Test build #51764 has finished for PR 11319 at commit 635fb4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@oliverpierson
Copy link
Author

I think the explicit cast is better as well; it makes it clear that requiredSamples is a Double. I'm a bit new to this whole PR process so I have to ask: can I just push the changes to the branch on my fork and they will automatically get pulled into this PR?

@hvanhovell
Copy link
Contributor

Could you add a regression test for this?

@srowen
Copy link
Member

srowen commented Feb 23, 2016

The point is that requiredSamples isn't a double, but needs to be construed as one to get a calculation right. Yes like any github PR you just push more commits to your branch.

@oliverpierson
Copy link
Author

Yeah, I think I can put a test together. Also, the value 10000 in line 113:

val requiredSamples = math.max(numBins * numBins, 10000)

is hard coded. Should this be changed to a class attribute? I ask because I test would required a dataset with dataset.count > 10000.

@srowen
Copy link
Member

srowen commented Feb 23, 2016

I don't think making a 10K element test set is too big. I wouldn't expose it as a configurable setter just for this, no.

* Minimum number of samples required for finding splits, regardless of number of bins. If
* the dataset has less rows than this value, the entire dataset column will be used.
*/
val minSamplesRequired: Int = 10000
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test will require creating a dataset at least as large as minSamplesRequired, so making its value excessively large could slow down testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

less rows -> fewer rows
entire dataset column -> entire dataset?

10000 just isn't that big. A dummy data set in a test would be, what, a megabyte in memory? Am I missing a bigger problem there?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you're missing anything. It's my first time contributing and I just want to be explicit for reviewers of the patch. I agree that 10K isn't that big, especially out "in the wild". However, I wasn't sure if there were standards for time/memory consumption for tests so I added the line note so that reviewers with more experience than me would be aware in case there are established/de facto testing standards.

I'll make the "->" changes you've indicated and push a new commit sometime today. Thanks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, my original reason for asking about removing the hard coded value of 10K was because that value was partly the cause of the bug and so a regression test would need to know the value.

I could have hard coded 10k in to my test. However if a developer increased its value later, say to 100K, without increasing the hard coded test value as well, they would render the test useless since it would always pass.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we have muuuch bigger tests that turn up whole small clusters. Making a little data set is fine. You can expose the 10000 figure as a private[spark] val in the object so you can reference it from test code.

@srowen
Copy link
Member

srowen commented Feb 25, 2016

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Feb 25, 2016

Test build #51967 has finished for PR 11319 at commit abea876.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@oliverpierson
Copy link
Author

@srowen Let me know if you'd like any further changes. Also, thanks for the review and your help.

@asfgit asfgit closed this in 6f8e835 Feb 25, 2016
asfgit pushed a commit that referenced this pull request Feb 25, 2016
…DataFrames

Change line 113 of QuantileDiscretizer.scala to

`val requiredSamples = math.max(numBins * numBins, 10000.0)`

so that `requiredSamples` is a `Double`.  This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count`

Manual tests.  I was having a problems using QuantileDiscretizer with my a dataset and after making this change QuantileDiscretizer behaves as expected.

Author: Oliver Pierson <[email protected]>
Author: Oliver Pierson <[email protected]>

Closes #11319 from oliverpierson/SPARK-13444.

(cherry picked from commit 6f8e835)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Copy link
Member

srowen commented Feb 25, 2016

Merged to master/1.6

@oliverpierson oliverpierson deleted the SPARK-13444 branch February 25, 2016 13:35
@JoshRosen
Copy link
Contributor

This broke compilation in branch 1.6:

[error] /home/jenkins/workspace/spark-branch-1.6-compile-maven-pre-yarn-1.2.1/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala:85: value setSeed is not a member of org.apache.spark.ml.feature.QuantileDiscretizer
[error] possible cause: maybe a semicolon is missing before `value setSeed'?
[error]       .setSeed(1)
[error]        ^
[error] one error found
[error] Compile failed at Feb 25, 2016 8:49:20 AM [18.290s]

@srowen
Copy link
Member

srowen commented Feb 25, 2016

Ah... I am 0 for 2 today. I am away from keyboard so feel free to revert it if it is a Blocker. This really doesn't have to go in this branch.

@oliverpierson
Copy link
Author

Hmmm.... any idea why it's failing? This line is no different than line 115 of the same value, except for the line breaks.

@oliverpierson oliverpierson restored the SPARK-13444 branch February 25, 2016 19:48
.setInputCol("input")
.setOutputCol("result")
.setNumBuckets(numBuckets)
.setSeed(1)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the offending line

@mengxr
Copy link
Contributor

mengxr commented Feb 25, 2016

@oliverpierson I reverted the change in branch-1.6 so we don't block other builds. Could you remove the setSeed line and submit another PR to branch-1.6? Thanks!

@oliverpierson
Copy link
Author

@mengxr No problem. I'll do it later today when I get to my laptop.

I thought since random sampling of data was used to find splits, then setSeed would be good for ensuring consistent tests. However, it's not necessary so happy to remove it.

@oliverpierson
Copy link
Author

@mengxr Sorry, was in a rush and missed the last part of your comment. I've opened a new PR, but against apache:master. However, you'd like a PR against apache:branch-1.6? If that's correct, just let me know and I'll close the other PR and open a new PR against 1.6. Sorry for the mixup.

@mengxr
Copy link
Contributor

mengxr commented Feb 26, 2016

I didn't revert the changes in master. I have to revert the changes in 1.6 because setSeed is not available in branch-1.6. Usually we only backport bug fixes instead of features. So you should open a PR against 1.6.

asfgit pushed a commit that referenced this pull request Mar 4, 2016
…DataFrames

## What changes were proposed in this pull request?

Change line 113 of QuantileDiscretizer.scala to

`val requiredSamples = math.max(numBins * numBins, 10000.0)`

so that `requiredSamples` is a `Double`.  This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count`

## How was the this patch tested?
Manual tests.  I was having a problems using QuantileDiscretizer with my a dataset and after making this change QuantileDiscretizer behaves as expected.

Author: Oliver Pierson <[email protected]>
Author: Oliver Pierson <[email protected]>

Closes #11319 from oliverpierson/SPARK-13444.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants