[SPARK-14610][ML] Remove superfluous split for continuous features in decision tree training #12374

sethah · 2016-04-13T23:54:25Z

What changes were proposed in this pull request?

A nonsensical split is produced from method findSplitsForContinuousFeature for decision trees. This PR removes the superfluous split and updates unit tests accordingly. Additionally, an assertion to check that the number of found splits is > 0 is removed, and instead features with zero possible splits are ignored.

How was this patch tested?

A unit test was added to check that finding splits for a constant feature produces an empty array.

sethah · 2016-04-13T23:58:33Z

cc @jkbradley

This is a small PR that generally makes things more correct. But I realize that this did not really have any adverse effects before, so I'll understand if this does not get merged. It is likely to be more of a problem when working on very small datasets. That said, I did include one small fix for something I do think is incorrect: how possibleSplits is defined in findSplitsForContinuousFeature.

sethah · 2016-04-14T00:00:08Z

mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

The number of possible bins should be valueCounts.length, and the number of possible splits should therefore be valueCounts.length - 1.
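
The counting rule in this comment can be illustrated with a minimal, self-contained sketch. It is not Spark's findSplitsForContinuousFeature: the object and method names are hypothetical, it ignores the case where a feature has more distinct values than the maximum number of bins, and placing thresholds at midpoints between adjacent distinct values is an assumption made only for illustration.

```scala
// Hypothetical sketch, not Spark's implementation.
object SplitCountSketch {

  /** Candidate thresholds for one continuous feature.
    * Assumption: each threshold is the midpoint of two adjacent distinct values.
    */
  def candidateThresholds(values: Seq[Double]): Array[Double] = {
    val distinct = values.distinct.sorted.toArray
    // k distinct values give k possible bins and therefore k - 1 possible splits.
    val possibleSplits = distinct.length - 1
    if (possibleSplits <= 0) {
      // A constant (or empty) feature has nothing to split on.
      Array.empty[Double]
    } else {
      distinct.sliding(2).map(pair => (pair(0) + pair(1)) / 2.0).toArray
    }
  }
}
```

Under these assumptions, a constant feature such as Seq(1.0, 1.0, 1.0) yields an empty array, and Seq(1.0, 2.0, 2.0, 3.0) yields two thresholds for three distinct values.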

SparkQA · 2016-04-14T00:22:38Z

Test build #55765 has finished for PR 12374 at commit 98c31e9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-14T01:13:40Z

Test build #55768 has finished for PR 12374 at commit 0a26a1f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-23T18:48:31Z

Test build #59138 has finished for PR 12374 at commit 0a26a1f.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

sethah · 2016-06-21T22:58:06Z

cc @MechCoder @MLnick Could you take a look?

sethah · 2016-06-21T22:59:38Z

mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala

This test would have failed before due to the assertion that splits.length > 0.

"train with constant features" -> "train with constant continuous features"?

This test is not specific to continuous features.

SparkQA · 2016-06-21T23:11:15Z

Test build #60975 has finished for PR 12374 at commit a42c126.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-22T00:18:13Z

Test build #60979 has finished for PR 12374 at commit 1b7f826.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MechCoder · 2016-07-06T22:36:54Z

mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

This seems slightly hacky to me. What do you think about filtering out the feature indices that have zero splits (something similar to this)?

val validFeaturesSplits = Range(0, binAggregates.metadata.numFeaturesPerNode).filter { featureIndexIdx =>
  val featureIndex = if (featuresForNode.nonEmpty) {
    featuresForNode.get.apply(featureIndexIdx)
  } else {
    featureIndexIdx
  }
  binAggregates.metadata.numSplits(featureIndex) != 0
}

That would avoid having to rewrite code for this corner case in PRs such as #13959 and #8540.

I agree. I modified your suggestion to work with a view, so we don't allocate unnecessary memory.
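
The view-based variant can be sketched in isolation as follows. This is an illustration under stated assumptions, not the code in the PR: `ValidFeaturesSketch`, `validFeatures`, and the `numSplits` function parameter are hypothetical stand-ins for the `binAggregates.metadata` lookups, and the final `toList` is there only so the result can be inspected.

```scala
// Hypothetical sketch of filtering out features with zero splits through a view.
object ValidFeaturesSketch {

  /** Returns (featureIndexIdx, featureIndex) pairs for features that have splits.
    * numSplits is an assumed stand-in for the per-feature metadata lookup.
    */
  def validFeatures(
      numFeaturesPerNode: Int,
      featuresForNode: Option[Array[Int]],
      numSplits: Int => Int): List[(Int, Int)] = {
    Range(0, numFeaturesPerNode).view
      .map { featureIndexIdx =>
        // With feature subsampling, the position maps to a real feature index.
        val featureIndex =
          featuresForNode.map(_.apply(featureIndexIdx)).getOrElse(featureIndexIdx)
        (featureIndexIdx, featureIndex)
      }
      // The view keeps map and filter lazy, so no intermediate collection is built.
      .filter { case (_, featureIndex) => numSplits(featureIndex) != 0 }
      .toList // materialized here only for inspection
  }
}
```

Because the pipeline is lazy until `toList`, the mapping and filtering happen in a single pass over the range, which is the memory saving the reply refers to.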

MechCoder · 2016-07-06T23:00:20Z

@sethah Nice catch! This superfluous split seems to occur only for continuous features in which the number of unique values minus one is less than or equal to the number of splits. Can you update the PR title or description to reflect this change? Thanks!

MechCoder · 2016-07-06T23:03:02Z

mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala

Why is the impurity of the rootNode "-1"? Since there is only one class and no splits, shouldn't it just be zero?

No, since the node found no valid split we flag the impurity as invalid. See here

MechCoder · 2016-07-06T23:08:10Z

Outside of this PR, I would like to either:

1. Update the documentation of findSplitsForContinuousFeature to reflect that the return type is an array of thresholds, rather than an array of Splits.
2. Change the return type of findSplitsForContinuousFeature to return an array of splits directly.

(The second option is preferable.)

sethah · 2016-07-11T18:07:09Z

@MechCoder I addressed your comments. I updated the Scaladoc for findSplitsForContinuousFeature to reflect the return type. I think it's fine to simply fix the doc for now. Let me know if you see anything else.

SparkQA · 2016-07-11T18:59:27Z

Test build #62106 has finished for PR 12374 at commit 3bb28fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-20T00:29:26Z

Test build #62562 has finished for PR 12374 at commit eddac63.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MechCoder · 2016-07-20T22:03:39Z

mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

+    val validFeatureSplits =
+      Range(0, binAggregates.metadata.numFeaturesPerNode).view.map { featureIndexIdx =>
+        if (featuresForNode.nonEmpty) {
+          (featureIndexIdx, featuresForNode.get.apply(featureIndexIdx))


Is the apply here redundant?

I don't think so. The alternative is featuresForNode.get(featureIndexIdx) which is misleading even though it does work. It looks like you are calling a function get and passing featureIndexIdx as an argument. Explicit apply seems clearer.
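
The equivalence discussed here can be checked with a small standalone sketch. The values are made up for illustration; only the `featuresForNode` and `featureIndexIdx` names come from the diff above.

```scala
// Illustrative values only; featuresForNode mirrors the Option[Array[Int]] in the diff.
object OptionApplySketch {
  val featuresForNode: Option[Array[Int]] = Some(Array(4, 7, 9))
  val featureIndexIdx: Int = 1

  // Option.get takes no arguments, so the parentheses here are sugar for
  // calling apply on the Array that get returns.
  val implicitApply: Int = featuresForNode.get(featureIndexIdx)

  // The explicit form makes it visible that the index goes to the Array.
  val explicitApply: Int = featuresForNode.get.apply(featureIndexIdx)
}
```

Both expressions read the same array element, so the choice between them is purely about readability.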

MechCoder · 2016-07-20T22:03:55Z

LGTM

SparkQA · 2016-07-25T18:33:26Z

Test build #62830 has finished for PR 12374 at commit 3c73726.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-25T13:22:35Z

Test build #64417 has finished for PR 12374 at commit 3c73726.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-10-10T20:11:11Z

ping @jkbradley or @yanboliang

SparkQA · 2016-10-10T21:10:23Z

Test build #66676 has finished for PR 12374 at commit 928a834.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-10-11T00:03:46Z

LGTM
Merging with master
Thanks!

sethah · 2016-10-11T00:09:32Z

Thanks @jkbradley!

… decision tree training

## What changes were proposed in this pull request?

A nonsensical split is produced from method `findSplitsForContinuousFeature` for decision trees. This PR removes the superfluous split and updates unit tests accordingly. Additionally, an assertion to check that the number of found splits is `> 0` is removed, and instead features with zero possible splits are ignored.

## How was this patch tested?

A unit test was added to check that finding splits for a constant feature produces an empty array.

Author: sethah

Closes apache#12374 from sethah/SPARK-14610.

sethah reviewed Apr 14, 2016

sethah mentioned this pull request Apr 14, 2016

[SPARK-9478] [ml] Add class weights to Random Forest #9008

Closed

sethah mentioned this pull request Apr 28, 2016

[SPARK-14599][ML] BaggedPoint should support sample weights. #12370

Closed

sethah force-pushed the SPARK-14610 branch from 0a26a1f to a42c126 Compare June 21, 2016 22:57

sethah reviewed Jun 21, 2016

MechCoder reviewed Jul 6, 2016

sethah added 6 commits July 10, 2016 11:06

remove extra split for continuous features

bbdbf20

cleanup

2da8474

unit test failure

ab5694a

handle empty case

8835f64

add instrumentation variable to test

c707b25

address some review comments

3bb28fe

sethah force-pushed the SPARK-14610 branch from 1b7f826 to 3bb28fe Compare July 11, 2016 18:00

cleanup

eddac63

MechCoder reviewed Jul 20, 2016

update numClasses in test

3c73726

small cleanups

928a834

asfgit closed this in 03c4020 Oct 11, 2016


4 participants
