-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-14610][ML] Remove superfluous split for continuous features in decision tree training #12374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @jkbradley This is a small PR that generally makes things more correct. But, I realize that this did not really have any adverse effects before, so I'll understand if this does not get merged. It is likely to be more of a problem when working on micro datasets. Although, there is a small change I included which I do think is incorrect and should be fixed, regarding how |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number of possible bins should be valueCounts.length
, and the number of possible splits should therefore be valueCounts.length - 1
.
Test build #55765 has finished for PR 12374 at commit
|
Test build #55768 has finished for PR 12374 at commit
|
Test build #59138 has finished for PR 12374 at commit
|
cc @MechCoder @MLnick Could you take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test would have failed before due to the assertion that splits.length > 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"train with constant features" -> "train with constant continuous features"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is not specific to continuous features.
Test build #60975 has finished for PR 12374 at commit
|
Test build #60979 has finished for PR 12374 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems slightly hacky to me. What is your opinion about doing filtering out the feature indices that have zero splits (something similar to this)?
val validFeaturesSplits = Range(0, binAggregates.metadata.numFeaturesPerNode).filter { featureIndexIdx =>
val featureIndex = if (featuresForNode.nonEmpty) {
featuresForNode.get.apply(featureIndexIdx)
} else {
featureIndexIdx
}
binAggregates.metadata.numSplits(featureIndex) != 0
}
That will prevent code-rewrite for this corner-case in PR's such as #13959 and #8540
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I modified your suggestion to work with a view, so we don't allocate unnecessary memory.
@sethah Nice catch! This superfluous split seems to be only for continuous features in which the number of unique values - 1 is lesser than or equal to the number of splits. Can you update the PR title or description to reflect this change? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the impurity of the rootNode "-1"? Since there is only one class and no splits should it not be just zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, since the node found no valid split we flag the impurity as invalid. See here
Outside of this PR, I would like to either:
(The 2nd one more preferable) |
@MechCoder I addressed your comments. I updated the scala doc for |
Test build #62106 has finished for PR 12374 at commit
|
Test build #62562 has finished for PR 12374 at commit
|
val validFeatureSplits = | ||
Range(0, binAggregates.metadata.numFeaturesPerNode).view.map { featureIndexIdx => | ||
if (featuresForNode.nonEmpty) { | ||
(featureIndexIdx, featuresForNode.get.apply(featureIndexIdx)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the apply
here redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. The alternative is featuresForNode.get(featureIndexIdx)
which is misleading even though it does work. It looks like you are calling a function get
and passing featureIndexIdx
as an argument. Explicit apply seems clearer.
LGTM |
Test build #62830 has finished for PR 12374 at commit
|
Test build #64417 has finished for PR 12374 at commit
|
ping @jkbradley or @yanboliang |
Test build #66676 has finished for PR 12374 at commit
|
LGTM |
Thanks @jkbradley! |
… decision tree training ## What changes were proposed in this pull request? A nonsensical split is produced from method `findSplitsForContinuousFeature` for decision trees. This PR removes the superfluous split and updates unit tests accordingly. Additionally, an assertion to check that the number of found splits is `> 0` is removed, and instead features with zero possible splits are ignored. ## How was this patch tested? A unit test was added to check that finding splits for a constant feature produces an empty array. Author: sethah <[email protected]> Closes apache#12374 from sethah/SPARK-14610.
What changes were proposed in this pull request?
A nonsensical split is produced from method
findSplitsForContinuousFeature
for decision trees. This PR removes the superfluous split and updates unit tests accordingly. Additionally, an assertion to check that the number of found splits is> 0
is removed, and instead features with zero possible splits are ignored.How was this patch tested?
A unit test was added to check that finding splits for a constant feature produces an empty array.