[SPARK-2692] [mllib] Decision Tree API update and multiclass bug fix #1582

jkbradley · 2014-07-25T01:00:59Z

Summary:
(1) Split DecisionTree API into separate Classifier and Regressor classes. (https://issues.apache.org/jira/browse/SPARK-2692)
(2) Bug fixes for recent multiclass PR (#886)
This is in preparation for a Python API (https://issues.apache.org/jira/browse/SPARK-2478)

Details on (1) API:

(1a) Split classes: E.g.: DecisionTree --> DecisionTreeClassifier and DecisionTreeRegressor
(1b) Included print() function for human-readable model descriptions
(1c) Renamed Strategy to *Params. Changed to take strings instead of special types.
(1d) Made configuration classes (Impurity, QuantileStrategy) private to mllib.
(1e) Changed meaning of maxDepth by 1 to match scikit-learn and rpart.
(1f) Removed static train() functions in favor of using Params classes.
(1g) Introduced DatasetInfo class for metadata.

Details on (2) bug fixes:

(2a) Inconsistent aggregate (agg) indexing for unordered features.
(2b) Fixed gain calculations for edge cases.
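
For illustration of the edge case in (2b), here is a minimal sketch, not the MLlib code. The names `gini` and `gainForSplit` are hypothetical, and returning an impurity of 0 for an empty count vector is an assumed convention chosen for this example. It shows one way to keep the gain well defined when the left or right child has zero weight.

```scala
object GainSketch {
  /** Gini impurity of per-class counts; defined here as 0.0 when the node is empty (assumption). */
  def gini(counts: Array[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else 1.0 - counts.map { c => val p = c / total; p * p }.sum
  }

  /** Impurity decrease of a split, weighting each child by its share of the instances. */
  def gainForSplit(left: Array[Double], right: Array[Double]): Double = {
    // Parent counts are the element-wise sum of the two children.
    val parent = left.zip(right).map { case (l, r) => l + r }
    val total = parent.sum
    if (total == 0.0) 0.0
    else {
      val leftWeight = left.sum / total
      val rightWeight = right.sum / total
      gini(parent) - leftWeight * gini(left) - rightWeight * gini(right)
    }
  }
}
```

With this convention, a split that sends every instance to one child has a gain of 0, so it is never preferred over a split that actually separates the classes.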

CC: @mengxr @manishamde

… API.

…isionTreeRegressor class,object, update docs, tests and examples.

…ll need to update documentation.

Split classes:
* DecisionTree --> DecisionTreeClassifier and DecisionTreeRegressor
* DecisionTreeModel --> DecisionTreeClassifierModel, DecisionTreeRegressorModel
* Super-classes DecisionTree, DecisionTreeModel are private to mllib.

Included print() function for human-readable model descriptions
* For: DecisionTreeClassifierModel, DecisionTreeRegressorModel, Node

parameters (used to be named Strategy)
* Split into: DTParams, DTClassifierParams, DTRegressorParams.
* Added defaultParams() method to DecisionTreeClassifier/Regressor.
* impurity
** Made private to mllib package.
** Split Impurity into ClassifierImpurity, RegressorImpurity
** Added factories: ClassifierImpurities, RegressorImpurities
* QuantileStrategy: Added factory QuantileStrategies
* maxDepth: Changed meaning by 1. Previously, depth = 1 meant 1 leaf node; now it means 1 internal and 2 leaf nodes. This matches scikit-learn and rpart.

train() functions:
* Changed to use DatasetInfo class for metadata.
* Eliminated many of the static train() functions to prevent users from needing to remember the order of long lists of parameters.

DecisionTree internals:
* renamed numSplits to numBins (since it was a duplicate name)

…d not. This one implements this change: maxDepth: Changed meaning by 1. Previously, depth = 1 meant 1 leaf node; now it means 1 internal and 2 leaf nodes. This matches scikit-learn and rpart. Internally, this meant replacing: maxDepth <— maxDepth+1. In tests, decremented maxDepth by 1.
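
The maxDepth change described above can be sketched as simple arithmetic on a full binary tree. This is an illustrative sketch only; the object and method names are hypothetical and do not appear in the PR.

```scala
object DepthSketch {
  /** Maximum number of leaf nodes under the new convention (maxDepth = 0 is a single leaf). */
  def maxLeaves(maxDepth: Int): Int = 1 << maxDepth

  /** Maximum number of internal (splitting) nodes under the new convention. */
  def maxInternalNodes(maxDepth: Int): Int = (1 << maxDepth) - 1

  /** Maximum total number of nodes under the new convention. */
  def maxNodes(maxDepth: Int): Int = (1 << (maxDepth + 1)) - 1

  /** Translate a value from the old convention (1 = single leaf) to the new one. */
  def oldToNew(oldMaxDepth: Int): Int = {
    require(oldMaxDepth >= 1, "old convention starts at 1")
    oldMaxDepth - 1
  }
}
```

Under the new convention, `maxDepth = 1` allows 1 internal node and 2 leaves, which is why existing tests had to decrement their maxDepth values by 1.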

Changed Params classes to take strings instead of special types. Made impurity names lists publicly accessible via Params classes. Simplified impurity factories.

…reeSuite.java since it fails currently

…reeSuite.java since it fails currently

Comments which should have been added to previous commit:
Fixed one test in DecisionTreeSuite to undo a change in previous commit (“stump with categorical variables for multiclass classification”). Reverted impurity from Entropy back to Gini.

Java compatibility:
* Changed non-static train() methods’ names to run() to avoid conflicts with static train() methods in Java.
* Added setter functions to *Params classes.

…cisiontree-api

…rdered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).
* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).
* Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.

Unit tests in DecisionTreeSuite
* Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
* Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.

Bug fix: calculateGainForSplit (for classification):
* It used to return dummy prediction values when either the right or left children had 0 weight. These were incorrect for multiclass classification. It has been corrected.

Updated impurities to allow for count = 0. This was related to the above bug fix for calculateGainForSplit (for classification).

Small updates to documentation and coding style.
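
To make the indexing fix concrete, here is a minimal sketch of a flat aggregate array addressed as (node, feature, binIndex, label). The `AggLayout` class, its row-major stride order, and the `update` helper are assumptions made for this example; the actual strides in DecisionTree may differ. The point is that writers and readers must go through the same index computation.

```scala
object AggIndexSketch {
  /** Hypothetical layout of a flat aggregate array, row-major over (node, feature, bin, label). */
  final case class AggLayout(numNodes: Int, numFeatures: Int, numBins: Int, numClasses: Int) {
    val size: Int = numNodes * numFeatures * numBins * numClasses

    /** Single source of truth for the flat index of one (node, feature, bin, label) cell. */
    def index(node: Int, feature: Int, bin: Int, label: Int): Int = {
      require(node >= 0 && node < numNodes, s"node out of range: $node")
      require(feature >= 0 && feature < numFeatures, s"feature out of range: $feature")
      require(bin >= 0 && bin < numBins, s"bin out of range: $bin")
      require(label >= 0 && label < numClasses, s"label out of range: $label")
      ((node * numFeatures + feature) * numBins + bin) * numClasses + label
    }
  }

  /** Count one training instance in the cell for (node, feature, bin, label). */
  def update(agg: Array[Double], layout: AggLayout,
             node: Int, feature: Int, bin: Int, label: Int): Unit = {
    agg(layout.index(node, feature, bin, label)) += 1.0
  }
}
```

If one code path instead treated the third coordinate as a feature value and the fourth as a bin index, it would write into cells that the rest of the code reads as (bin, label) counts, which is the kind of mismatch described above.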

SparkQA · 2014-07-25T01:03:31Z

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17148/consoleFull

SparkQA · 2014-07-25T01:04:12Z

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17148/consoleFull

SparkQA · 2014-07-25T01:23:38Z

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17151/consoleFull

SparkQA · 2014-07-25T02:07:36Z

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17151/consoleFull

mengxr · 2014-07-25T04:32:37Z

@jkbradley Could you add the JIRA number to the PR title like [SPARK-####][MLLIB]?

jkbradley · 2014-07-25T04:41:11Z

@mengxr This does not have a specific JIRA. (It does not solve the Python API JIRA [SPARK-2478] yet.)

manishamde · 2014-07-25T04:54:25Z

@jkbradley Awesome!

A couple of quick thoughts:

I am not completely convinced about the strategy for 1a (I was expecting thin wrappers for regression and classification trees), but I guess that was expected considering I am very familiar with the existing code. I will sleep on it and get back. :-) To give a historical perspective, we had similar split implementations for regression and classification in the beginning that we decided to combine into one. Perhaps it's the right time to split them again. @etrain was also hinting at that in the multiclass review.
I have ensemble RF and Boosting implementations close-to-ready which will need major refactoring or rewriting from scratch considering the magnitude of this PR. That's fine but we should try and get it accepted ASAP. I promise prompt piecemeal reviews.
We should perform regression testing and compare with the 1.0 release.

manishamde · 2014-07-25T04:56:25Z

@jkbradley You might want to create a JIRA for this one and ask Matei to assign to you. It's a big enough change to require one. :-)

SparkQA · 2014-07-25T05:13:37Z

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17165/consoleFull

jkbradley · 2014-07-25T05:29:48Z

@manishamde I do realize it's a big change, and I hope it does not cause too much trouble for the other methods! The functionality should be the same, and the internals are almost identical (mostly moving around code, with no major duplication), so performance should not change much. (I do have some ideas for future optimizations, but we will push the API update through first.) I appreciate your thoughts on the update!

jkbradley · 2014-07-25T05:30:34Z

It looks like the Jenkins failures are MIMA issues; I'll work on fixing them.

SparkQA · 2014-07-25T05:38:42Z

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17167/consoleFull

SparkQA · 2014-07-25T06:00:29Z

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17165/consoleFull

SparkQA · 2014-07-25T06:08:27Z

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17170/consoleFull

SparkQA · 2014-07-28T21:44:42Z

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17305/consoleFull

SparkQA · 2014-07-28T21:53:48Z

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17306/consoleFull

SparkQA · 2014-07-28T22:23:55Z

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17309/consoleFull

SparkQA · 2014-07-28T22:38:48Z

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17306/consoleFull

SparkQA · 2014-07-28T23:13:51Z

QA results for PR 1582:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17309/consoleFull

manishamde · 2014-07-28T23:20:52Z

mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/DTClassifierParams.scala

Mentioning supported impurities might help since such errors are generally typos.

* Eliminated model type parameter for DecisionTree abstract class.
* In DecisionTree, renamed trainSub() to runSub().
* Updated DT*Params to print list of supported parameter options when an invalid one is given.
* Made DTParams private[mllib].
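
The string-based parameters and the "list the supported options on error" behavior can be sketched as follows. This is only an illustration: `DTClassifierParamsSketch`, `supportedImpurities`, and the string names "gini" and "entropy" are assumptions for this example, not the names used in the PR.

```scala
object ParamsSketch {
  /** Hypothetical list of impurity names accepted for classification. */
  val supportedImpurities: Seq[String] = Seq("gini", "entropy")

  /** Hypothetical params class with a chainable, validating setter. */
  class DTClassifierParamsSketch {
    private var impurity: String = "gini"

    def getImpurity: String = impurity

    /** Reject unknown names and tell the user which ones are valid. */
    def setImpurity(name: String): this.type = {
      require(
        supportedImpurities.contains(name),
        s"Unsupported impurity '$name'. Supported values: ${supportedImpurities.mkString(", ")}")
      impurity = name
      this
    }
  }
}
```

Since such errors are usually typos, a message that names the valid options lets the user correct the call without looking up the documentation.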

jkbradley · 2014-07-29T17:56:41Z

I just submitted an update based on the comments from @manishamde (Thank you for them!)

SparkQA · 2014-07-29T17:58:52Z

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17366/consoleFull

SparkQA · 2014-07-29T18:44:42Z

QA results for PR 1582:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17366/consoleFull

…hanged prefix String parameter to indentFactor (Int), following JSON.

Added numNodes, depth methods to DecisionTreeModel, plus test of those in DecisionTreeSuite.

jkbradley · 2014-07-29T20:21:44Z

Just updated with a few improvements. Main change was for print() methods, to use toString() instead.
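
As a rough sketch of what an indent-based description plus numNodes and depth might look like, assuming a hypothetical two-case node type (the `Leaf`, `Internal`, and `render` names and the exact output text are made up for this example and are not the model classes in the PR):

```scala
object TreeSketch {
  sealed trait Node {
    def numNodes: Int
    def depth: Int
    /** Render this subtree, indenting each level by indentFactor spaces. */
    def render(indentFactor: Int, level: Int): String
  }

  final case class Leaf(prediction: Double) extends Node {
    def numNodes: Int = 1
    def depth: Int = 0 // a single leaf has depth 0 in this sketch
    def render(indentFactor: Int, level: Int): String =
      " " * (indentFactor * level) + s"Predict: $prediction\n"
  }

  final case class Internal(feature: Int, threshold: Double, left: Node, right: Node)
      extends Node {
    def numNodes: Int = 1 + left.numNodes + right.numNodes
    def depth: Int = 1 + math.max(left.depth, right.depth)
    def render(indentFactor: Int, level: Int): String = {
      val pad = " " * (indentFactor * level)
      s"${pad}If (feature $feature <= $threshold)\n" +
        left.render(indentFactor, level + 1) +
        s"${pad}Else\n" +
        right.render(indentFactor, level + 1)
    }
  }
}
```

Taking an integer indentFactor rather than a prefix string keeps the caller from having to build whitespace by hand, in the same spirit as JSON pretty-printers.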

SparkQA · 2014-07-29T20:23:52Z

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17373/consoleFull

SparkQA · 2014-07-29T20:27:00Z

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17373/consoleFull

…ewlines in model toString functions.

SparkQA · 2014-07-29T20:48:54Z

QA tests have started for PR 1582. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17378/consoleFull

SparkQA · 2014-07-29T21:33:05Z

QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17378/consoleFull

jkbradley · 2014-07-30T21:53:35Z

Closing this PR: Due to time constraints, this PR may not fit into the v1.1 release window. We will revisit it after v1.1. We will submit bug fixes as another PR.

…oupBy by default (apache#1582)

jkbradley added 12 commits July 18, 2014 14:34

updating DT APIf

929f0e6

Merging multiclass DT PR, plus others, into branch with updates to DT…

29e29b8

… API.

Mostly done with DecisionTree API re-config. Still need to update Dec…

20fc805

…isionTreeRegressor class,object, update docs, tests and examples.

Merge remote-tracking branch 'upstream/master' into decisiontree-api

4ba347f

Changed all config/impurity classes/objects to be private[mllib].

4506844

Changed Params classes to take strings instead of special types. Made impurity names lists publicly accessible via Params classes. Simplified impurity factories.

removed mllib/src/test/java/org/apache/spark/mllib/tree/JavaDecisionT…

b6b0809

…reeSuite.java since it fails currently

Merge branch 'decisiontree-api' of github.com:jkbradley/spark into de…

0cb9866

…cisiontree-api

Merge remote-tracking branch 'upstream/master' into decisiontree-api

3ba5b4c

Fixed scala style issues reported by Jenkins

e1243a5

Merge remote-tracking branch 'upstream/master' into decisiontree-api

62c2fbc

Added Algo exception to MimaExcludes.scala

3eea304

Added more exceptions to MimaExcludes.scala

cda2a80

added newline character for Scala style

c0a46be

Updated documentation for Decision Trees based on new API

4bea4bd

manishamde reviewed Jul 28, 2014
jkbradley added 2 commits July 29, 2014 10:37

Merge remote-tracking branch 'upstream/master' into decisiontree-api

40c81e3

jkbradley added 3 commits July 29, 2014 12:16

Merge remote-tracking branch 'upstream/master' into decisiontree-api

f543f94

Changed DecisionTree*Model print() methods to be called toString(). C…

bdc2aa7

…hanged prefix String parameter to indentFactor (Int), following JSON.

Added @experimental tags to some Decision Tree objects.

17dcc09

Added numNodes, depth methods to DecisionTreeModel, plus test of those in DecisionTreeSuite.

Fixed bug in DecisionTreeRunner with old print function name. Added n…

d2c1dad

…ewlines in model toString functions.

jkbradley closed this Jul 30, 2014

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023

rdar://100721323 Enable spark.sql.legacy.groupingIdWithAppendedUserGr…

3431d61

…oupBy by default (apache#1582)
