-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-2692] [mllib] Decision Tree API update and multiclass bug fix #1582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…isionTreeRegressor class,object, update docs, tests and examples.
…ll need to update documentation. Split classes: * DecisionTree --> DecisionTreeClassifier and DecisionTreeRegressor * DecisionTreeModel --> DecisionTreeClassifierModel, DecisionTreeRegressorModel * Super-classes DecisionTree, DecisionTreeModel are private to mllib. Included print() function for human-readable model descriptions * For: DecisionTreeClassifierModel, DecisionTreeRegressorModel, Node parameters (used to be named Strategy) * Split into: DTParams, DTClassifierParams, DTRegressorParams. * Added defaultParams() method to DecisionTreeClassifier/Regressor. * impurity ** Made private to mllib package. ** Split Impurity into ClassifierImpurity, RegressorImpurity ** Added factories: ClassifierImpurities, RegressorImpurities * QuantileStrategy: Added factory QuantileStrategies * maxDepth: Changed meaning by 1. Previously, depth = 1 meant 1 leaf node; now it means 1 internal and 2 leaf nodes. This matches scikit-learn and rpart. train() functions: * Changed to use DatasetInfo class for metadata. * Eliminated many of the static train() functions to prevent users from needing to remember the order of long lists of parameters. DecisionTree internals: * renamed numSplits to numBins (since it was a duplicate name)
…d not. This one implements this change: maxDepth: Changed meaning by 1. Previously, depth = 1 meant 1 leaf node; now it means 1 internal and 2 leaf nodes. This matches scikit-learn and rpart. Internally, this meant replacing: maxDepth <— maxDepth+1. In tests, decremented maxDepth by 1.
Changed Params classes to take strings instead of special types. Made impurity names lists publicly accessible via Params classes. Simplified impurity factories.
…reeSuite.java since it fails currently
…reeSuite.java since it fails currently Comments which should have been added to previous commit: Fixed one test in DecisionTreeSuite to undo a change in previous commit (“stump with categorical variables for multiclass classification”). Reverted impurity from Entropy back to Gini. Java compatibility: * Changed non-static train() methods’ names to run() to avoid conflicts with static train() methods in Java. * Added setter functions to *Params classes.
…rdered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). * Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern. Unit tests in DecisionTreeSuite * Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above. * Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes. Bug fix: calculateGainForSplit (for classification): * It used to return dummy prediction values when either the right or left children had 0 weight. These were incorrect for multiclass classification. It has been corrected. Updated impurities to allow for count = 0. This was related to the above bug fix for calculateGainForSplit (for classification). Small updates to documentation and coding style.
QA tests have started for PR 1582. This patch merges cleanly. |
QA results for PR 1582: |
QA tests have started for PR 1582. This patch merges cleanly. |
QA results for PR 1582: |
@jkbradley Could you add the JIRA number to the PR title like |
@mengxr This does not have a specific JIRA. (It does not solve the Python API JIRA [SPARK-2478] yet.) |
@jkbradley Awesome! A couple of quick thoughts:
|
@jkbradley You might want to create a JIRA for this one and ask Matei to assign to you. It's a big enough change to require one. :-) |
QA tests have started for PR 1582. This patch merges cleanly. |
@manishamde I do realize it's a big change, and I hope it does not cause too much trouble for the other methods! The functionality should be the same, and the internals are almost identical (mostly moving around code, with no major duplication), so performance should not change much. (I do have some ideas for future optimizations, but we will push the API update through first.) I appreciate your thoughts on the update! |
It looks like the Jenkins failures are MIMA issues; I'll work on fixing them. |
QA tests have started for PR 1582. This patch merges cleanly. |
QA results for PR 1582: |
QA tests have started for PR 1582. This patch merges cleanly. |
QA results for PR 1582: |
QA tests have started for PR 1582. This patch merges cleanly. |
QA tests have started for PR 1582. This patch merges cleanly. |
QA results for PR 1582: |
QA results for PR 1582: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mentioning supported impurities might help since such errors are generally typos.
* Eliminated model type parameter for DecisionTree abstract class. * In DecisionTree, renamed trainSub() to runSub(). * Updated DT*Params to print list of supported parameter options when an invalid one is given. * Made DTParams private[mllib].
I just submitted an update based on the comments from @manishamde (Thank you for them!) |
QA tests have started for PR 1582. This patch merges cleanly. |
QA results for PR 1582: |
…hanged prefix String parameter to indentFactor (Int), following JSON.
Added numNodes, depth methods to DecisionTreeModel, plus test of those in DecisionTreeSuite.
Just updated with a few improvements. Main change was for print() methods, to use toString() instead. |
QA tests have started for PR 1582. This patch merges cleanly. |
QA results for PR 1582: |
…ewlines in model toString functions.
QA tests have started for PR 1582. This patch merges cleanly. |
QA results for PR 1582: |
Closing this PR: Due to time constraints, this PR may not fit into the v1.1 release window. We will revisit it after v1.1. We will submit bug fixes as another PR. |
…oupBy by default (apache#1582)
Summary:
(1) Split DecisionTree API into separate Classifier and Regressor classes. (https://issues.apache.org/jira/browse/SPARK-2692)
(2) Bug fixes for recent multiclass PR (#886)
This is in preparation for a Python API (https://issues.apache.org/jira/browse/SPARK-2478)
Details on (1) API:
(1a) Split classes: E.g.: DecisionTree --> DecisionTreeClassifier and DecisionTreeRegressor
(1b) Included print() function for human-readable model descriptions
(1c) Renamed Strategy to *Params. Changed to take strings instead of special types.
(1d) Made configuration classes (Impurity, QuantileStrategy) private to mllib.
(1e) Changed meaning of maxDepth by 1 to match scikit-learn and rpart.
(1f) Removed static train() functions in favor of using Params classes.
(1g) Introduced DatasetInfo class for metadata.
Details on (2) bug fixes:
(2a) Inconsistent aggregate (agg) indexing for unordered features.
(2b) Fixed gain calculations for edge cases.
CC: @mengxr @manishamde