Skip to content

Conversation

sethah
Copy link
Contributor

@sethah sethah commented Apr 13, 2016

What changes were proposed in this pull request?

A nonsensical split is produced from method findSplitsForContinuousFeature for decision trees. This PR removes the superfluous split and updates unit tests accordingly. Additionally, an assertion to check that the number of found splits is > 0 is removed, and instead features with zero possible splits are ignored.

How was this patch tested?

A unit test was added to check that finding splits for a constant feature produces an empty array.

@sethah
Copy link
Contributor Author

sethah commented Apr 13, 2016

cc @jkbradley

This is a small PR that generally makes things more correct. But, I realize that this did not really have any adverse effects before, so I'll understand if this does not get merged. It is likely to be more of a problem when working on micro datasets. Although, there is a small change I included which I do think is incorrect and should be fixed, regarding how possibleSplits is defined in findSplitsForContinuousFeatures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of possible bins should be valueCounts.length, and the number of possible splits should therefore be valueCounts.length - 1.

@SparkQA
Copy link

SparkQA commented Apr 14, 2016

Test build #55765 has finished for PR 12374 at commit 98c31e9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 14, 2016

Test build #55768 has finished for PR 12374 at commit 0a26a1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 23, 2016

Test build #59138 has finished for PR 12374 at commit 0a26a1f.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor Author

sethah commented Jun 21, 2016

cc @MechCoder @MLnick Could you take a look?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test would have failed before due to the assertion that splits.length > 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"train with constant features" -> "train with constant continuous features"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is not specific to continuous features.

@SparkQA
Copy link

SparkQA commented Jun 21, 2016

Test build #60975 has finished for PR 12374 at commit a42c126.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #60979 has finished for PR 12374 at commit 1b7f826.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems slightly hacky to me. What is your opinion about doing filtering out the feature indices that have zero splits (something similar to this)?

val validFeaturesSplits = Range(0, binAggregates.metadata.numFeaturesPerNode).filter { featureIndexIdx =>
  val featureIndex = if (featuresForNode.nonEmpty) {
    featuresForNode.get.apply(featureIndexIdx)
  } else {
    featureIndexIdx
  }
  binAggregates.metadata.numSplits(featureIndex) != 0
}

That will prevent code-rewrite for this corner-case in PR's such as #13959 and #8540

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I modified your suggestion to work with a view, so we don't allocate unnecessary memory.

@MechCoder
Copy link
Contributor

@sethah Nice catch! This superfluous split seems to be only for continuous features in which the number of unique values - 1 is lesser than or equal to the number of splits. Can you update the PR title or description to reflect this change? Thanks!

Copy link
Contributor

@MechCoder MechCoder Jul 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the impurity of the rootNode "-1"? Since there is only one class and no splits should it not be just zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, since the node found no valid split we flag the impurity as invalid. See here

@MechCoder
Copy link
Contributor

Outside of this PR, I would like to either:

  1. Update the documentation of findSplitsForContinuousFeature to reflect that the return type is an array of thresholds, rather than an array of Splits.
  2. Change the return type of findSplitsForContinuousFeature to return an array of splits directly.

(The 2nd one more preferable)

@sethah
Copy link
Contributor Author

sethah commented Jul 11, 2016

@MechCoder I addressed your comments. I updated the scala doc for findSplitsForContinuousFeature to reflect the return type. I think it's fine to simply fix the doc for now. Let me know if you see anything else.

@SparkQA
Copy link

SparkQA commented Jul 11, 2016

Test build #62106 has finished for PR 12374 at commit 3bb28fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62562 has finished for PR 12374 at commit eddac63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val validFeatureSplits =
Range(0, binAggregates.metadata.numFeaturesPerNode).view.map { featureIndexIdx =>
if (featuresForNode.nonEmpty) {
(featureIndexIdx, featuresForNode.get.apply(featureIndexIdx))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the apply here redundant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. The alternative is featuresForNode.get(featureIndexIdx) which is misleading even though it does work. It looks like you are calling a function get and passing featureIndexIdx as an argument. Explicit apply seems clearer.

@MechCoder
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented Jul 25, 2016

Test build #62830 has finished for PR 12374 at commit 3c73726.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 25, 2016

Test build #64417 has finished for PR 12374 at commit 3c73726.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor Author

sethah commented Oct 10, 2016

ping @jkbradley or @yanboliang

@SparkQA
Copy link

SparkQA commented Oct 10, 2016

Test build #66676 has finished for PR 12374 at commit 928a834.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
Merging with master
Thanks!

@asfgit asfgit closed this in 03c4020 Oct 11, 2016
@sethah
Copy link
Contributor Author

sethah commented Oct 11, 2016

Thanks @jkbradley!

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
… decision tree training

## What changes were proposed in this pull request?

A nonsensical split is produced from method `findSplitsForContinuousFeature` for decision trees. This PR removes the superfluous split and updates unit tests accordingly. Additionally, an assertion to check that the number of found splits is `> 0` is removed, and instead features with zero possible splits are ignored.

## How was this patch tested?

A unit test was added to check that finding splits for a constant feature produces an empty array.

Author: sethah <[email protected]>

Closes apache#12374 from sethah/SPARK-14610.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants