Skip to content

Conversation

jkbradley
Copy link
Member

Summary:
(1) Split DecisionTree API into separate Classifier and Regressor classes. (https://issues.apache.org/jira/browse/SPARK-2692)
(2) Bug fixes for recent multiclass PR (#886)
This is in preparation for a Python API (https://issues.apache.org/jira/browse/SPARK-2478)

Details on (1) API:

(1a) Split classes: E.g.: DecisionTree --> DecisionTreeClassifier and DecisionTreeRegressor
(1b) Included print() function for human-readable model descriptions
(1c) Renamed Strategy to *Params. Changed to take strings instead of special types.
(1d) Made configuration classes (Impurity, QuantileStrategy) private to mllib.
(1e) Changed meaning of maxDepth by 1 to match scikit-learn and rpart.
(1f) Removed static train() functions in favor of using Params classes.
(1g) Introduced DatasetInfo class for metadata.

Details on (2) bug fixes:

(2a) Inconsistent aggregate (agg) indexing for unordered features.
(2b) Fixed gain calculations for edge cases.

CC: @mengxr @manishamde

jkbradley added 12 commits July 18, 2014 14:34
…isionTreeRegressor class,object, update docs, tests and examples.
…ll need to update documentation.

Split classes:
* DecisionTree --> DecisionTreeClassifier and DecisionTreeRegressor
* DecisionTreeModel --> DecisionTreeClassifierModel, DecisionTreeRegressorModel
* Super-classes DecisionTree, DecisionTreeModel are private to mllib.

Included print() function for human-readable model descriptions
* For: DecisionTreeClassifierModel, DecisionTreeRegressorModel, Node

parameters (used to be named Strategy)
* Split into: DTParams, DTClassifierParams, DTRegressorParams.
* Added defaultParams() method to DecisionTreeClassifier/Regressor.
* impurity
** Made private to mllib package.
** Split Impurity into ClassifierImpurity, RegressorImpurity
** Added factories: ClassifierImpurities, RegressorImpurities
* QuantileStrategy: Added factory QuantileStrategies
* maxDepth: Changed meaning by 1.  Previously, depth = 1 meant 1 leaf node; now it means 1 internal and 2 leaf nodes.  This matches scikit-learn and rpart.

train() functions:
* Changed to use DatasetInfo class for metadata.
* Eliminated many of the static train() functions to prevent users from needing to remember the order of long lists of parameters.

DecisionTree internals:
* renamed numSplits to numBins (since it was a duplicate name)
…d not.

This one implements this change:

maxDepth: Changed meaning by 1.  Previously, depth = 1 meant 1 leaf node; now it means 1 internal and 2 leaf nodes.  This matches scikit-learn and rpart.
Internally, this meant replacing: maxDepth <— maxDepth+1.
In tests, decremented maxDepth by 1.
Changed Params classes to take strings instead of special types.
Made impurity names lists publicly accessible via Params classes.
Simplified impurity factories.
…reeSuite.java since it fails currently

Comments which should have been added to previous commit:

Fixed one test in DecisionTreeSuite to undo a change in previous commit (“stump with categorical variables for multiclass classification”).  Reverted impurity from Entropy back to Gini.

Java compatibility:
* Changed non-static train() methods’ names to run() to avoid conflicts with static train() methods in Java.
* Added setter functions to *Params classes.
…rdered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).

* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).
* Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.

Unit tests in DecisionTreeSuite
* Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
* Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.

Bug fix: calculateGainForSplit (for classification):
* It used to return dummy prediction values when either the right or left children had 0 weight.  These were incorrect for multiclass classification.  It has been corrected.

Updated impurities to allow for count = 0.  This was related to the above bug fix for calculateGainForSplit (for classification).

Small updates to documentation and coding style.
@SparkQA
Copy link

SparkQA commented Jul 25, 2014

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17148/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 25, 2014

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17148/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 25, 2014

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17151/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 25, 2014

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17151/consoleFull

@mengxr
Copy link
Contributor

mengxr commented Jul 25, 2014

@jkbradley Could you add the JIRA number to the PR title like [SPARK-####][MLLIB]?

@jkbradley
Copy link
Member Author

@mengxr This does not have a specific JIRA. (It does not solve the Python API JIRA [SPARK-2478] yet.)

@manishamde
Copy link
Contributor

@jkbradley Awesome!

A couple of quick thoughts:

  • I am not completely convinced about the strategy for1a (I was expecting thin wrappers for regression and classification tree) but I guess that was expected considering I am very familiar with the existing code. I will sleep over it and get back. :-) To give a historical perspective, we had a similar split implementations for regression and classification in the beginning that we decided to combine into one. Perhaps, it's the right time to split them again. @etrain was also hinting at that in the multiclass review.
  • I have ensemble RF and Boosting implementations close-to-ready which will need major refactoring or rewriting from scratch considering the magnitude of this PR. That's fine but we should try and get it accepted ASAP. I promise prompt piecemeal reviews.
  • We should perform regression testing and compare with the 1.0 release.

@manishamde
Copy link
Contributor

@jkbradley You might want to create a JIRA for this one and ask Matei to assign to you. It's a big enough change to require one. :-)

@SparkQA
Copy link

SparkQA commented Jul 25, 2014

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17165/consoleFull

@jkbradley
Copy link
Member Author

@manishamde I do realize it's a big change, and I hope it does not cause too much trouble for the other methods! The functionality should be the same, and the internals are almost identical (mostly moving around code, with no major duplication), so performance should not change much. (I do have some ideas for future optimizations, but we will push the API update through first.) I appreciate your thoughts on the update!

@jkbradley
Copy link
Member Author

It looks like the Jenkins failures are MIMA issues; I'll work on fixing them.

@SparkQA
Copy link

SparkQA commented Jul 25, 2014

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17167/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 25, 2014

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17165/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 25, 2014

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17170/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17305/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17306/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17309/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17306/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA results for PR 1582:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17309/consoleFull

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioning supported impurities might help since such errors are generally typos.

* Eliminated model type parameter for DecisionTree abstract class.
* In DecisionTree, renamed trainSub() to runSub().
* Updated DT*Params to print list of supported parameter options when an invalid one is given.
* Made DTParams private[mllib].
@jkbradley
Copy link
Member Author

I just submitted an update based on the comments from @manishamde (Thank you for them!)

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17366/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA results for PR 1582:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17366/consoleFull

…hanged prefix String parameter to indentFactor (Int), following JSON.
Added numNodes, depth methods to DecisionTreeModel, plus test of those in DecisionTreeSuite.
@jkbradley
Copy link
Member Author

Just updated with a few improvements. Main change was for print() methods, to use toString() instead.

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17373/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17373/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17378/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17378/consoleFull

@jkbradley
Copy link
Member Author

Closing this PR: Due to time constraints, this PR may not fit into the v1.1 release window. We will revisit it after v1.1. We will submit bug fixes as another PR.

@jkbradley jkbradley closed this Jul 30, 2014
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants