Conversation

@zhengruifeng
Contributor

What changes were proposed in this pull request?

1. Add an evaluation metric to the summary.
2. Add an evaluate() method which returns a summary.
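
A minimal usage sketch of what the proposed API could look like; the field and method names below are illustrative assumptions, not taken from the PR diff:

```scala
import org.apache.spark.ml.clustering.KMeans

// Assuming DataFrames `training` and `test` with a "features" column.
val model = new KMeans().setK(3).setFeaturesCol("features").fit(training)

// (1) the training summary carries an evaluation metric
val trainingCost = model.summary.trainingCost   // hypothetical summary field

// (2) evaluate() builds the same kind of summary for another dataset
val testSummary = model.evaluate(test)          // hypothetical method
```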

How was this patch tested?

added tests

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71698 has finished for PR 16654 at commit bb01219.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71707 has started for PR 16654 at commit 29bda3f.

@zhengruifeng
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71710 has finished for PR 16654 at commit 29bda3f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jan 20, 2017

General question: isn't this what Evaluators are for?
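
For context, the existing spark.ml Evaluator pattern referenced here: an Evaluator consumes a DataFrame of predictions and returns a single metric, independent of which model produced the predictions. A minimal sketch, assuming a `predictions` DataFrame with "label" and "prediction" columns:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

// One number out, regardless of which regressor produced `predictions`.
val rmse = evaluator.evaluate(predictions)
```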

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71717 has finished for PR 16654 at commit 1d89914.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imatiach-msft
Contributor

+1 with @srowen; this should be limited to the evaluator/metrics classes. If we have an evaluator for clustering, will we be able to use it with the hyperparameter tuner (cross-validation)?

@zhengruifeng
Contributor Author

I think clustering metrics are currently not as general as classification/regression metrics:
WSSSE applies only to KMeans and BisectingKMeans
Log-likelihood applies only to GMM

I had opened a JIRA about a ClusteringEvaluator (https://issues.apache.org/jira/browse/SPARK-14516), which may add the metrics included in scikit-learn: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.cluster
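
For reference, the algorithm-specific form these metrics took in spark.ml at the time; a sketch assuming a `dataset` with a "features" column:

```scala
import org.apache.spark.ml.clustering.{BisectingKMeans, KMeans}

// WSSSE is exposed per-model via computeCost, not via a shared evaluator.
val kmCost  = new KMeans().setK(3).fit(dataset).computeCost(dataset)
val bkmCost = new BisectingKMeans().setK(3).fit(dataset).computeCost(dataset)

// GaussianMixture has no comparable method here: its log-likelihood depends
// on the fitted Gaussians themselves, which is the point being made above.
```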

@yanboliang @jkbradley What's your opinion?

@zhengruifeng force-pushed the clustering_model_evaluate branch from 1d89914 to 5937ce7 on January 22, 2017 06:22
@SparkQA

SparkQA commented Jan 22, 2017

Test build #71799 has started for PR 16654 at commit 5937ce7.

@zhengruifeng
Contributor Author

Jenkins, retest this please

@srowen
Member

srowen commented Jan 22, 2017

Yes, I think this is at best a duplicate of SPARK-14516. You don't want to add ad-hoc methods for this.

@zhengruifeng
Contributor Author

@srowen I think I had not clarified my thoughts. WSSSE and log-likelihood are algorithm-specific metrics.
For example:
WSSSE doesn't make sense for clustering algorithms like DBSCAN.
GMM's log-likelihood even differs from that of mixture models over other distributions: given an RDD[Int] representing cluster IDs or an RDD[Array[Double]] representing cluster probability distributions, we cannot design a general method to compute the log-likelihood.

Some general clustering metrics are listed at http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.cluster, but WSSSE and log-likelihood are not among them.
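
For concreteness, the standard definitions of the two metrics under discussion (not taken from the PR): both require model internals (the centers, or the mixture weights and Gaussian parameters) that cannot be recovered from a bare RDD of cluster IDs or membership probabilities.

```latex
\mathrm{WSSSE} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2
\qquad
\log L = \sum_{i=1}^{n} \log \sum_{k=1}^{K}
         \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
```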

@srowen
Member

srowen commented Jan 22, 2017

I agree that clustering metrics are different from classification metrics, but that doesn't mean they can't have some common abstraction -- they're applied to a model and data set and produce a number. It's true that not every evaluation metric makes sense for every model, but that's not a problem per se.

Why wouldn't WSSSE make sense for DBSCAN?

@SparkQA

SparkQA commented Jan 22, 2017

Test build #71805 has finished for PR 16654 at commit 5937ce7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

@srowen The concept of a center doesn't exist in DBSCAN.

@srowen
Member

srowen commented Jan 23, 2017

Metrics evaluate the clustering though; the details of the algorithm are irrelevant. This still clusters points in a continuous space so you can measure WSSSE.
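
A sketch of that argument in code: WSSSE can be computed from nothing more than the feature vectors and the predicted cluster IDs, by deriving each cluster's center as the mean of its members, regardless of which algorithm produced the assignment. The column names ("features", "prediction") are assumptions here:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame

def wssse(predictions: DataFrame): Double = {
  // (clusterId, featureVector) pairs, independent of any model state.
  val assigned = predictions.select("prediction", "features").rdd
    .map(row => (row.getInt(0), row.getAs[Vector](1)))

  // Derive each cluster's center as the mean of its assigned points.
  val centers = assigned
    .mapValues(v => (v.toArray, 1L))
    .reduceByKey { case ((a, na), (b, nb)) =>
      (a.zip(b).map { case (x, y) => x + y }, na + nb)
    }
    .mapValues { case (sum, n) => Vectors.dense(sum.map(_ / n)) }
    .collectAsMap()

  // Sum of squared distances from each point to its derived center.
  assigned.map { case (k, v) => Vectors.sqdist(v, centers(k)) }.sum()
}
```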

@imatiach-msft
Contributor

Wouldn't we eventually want to add a lot more clustering metrics like Dunn, Davies-Bouldin, Simplified Silhouette, etc.? There are a lot of clustering metrics, and it seems like a good idea to have separate metrics and evaluator classes for them so that they are easy to use and extend. It would also be nice to be able to sweep over the hyperparameters of the clustering algorithms and use the evaluators as part of cross-validation (or am I misunderstanding, and is that already possible with the changes in this review?).
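
A sketch of how that sweep could look, plugging a ClusteringEvaluator into the existing CrossValidator API. ClusteringEvaluator is hypothetical relative to this discussion (it did not exist in spark.ml at the time; SPARK-14516 later led to one), and `dataset` is assumed to have a "features" column:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator  // added to spark.ml after this thread
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val kmeans = new KMeans().setFeaturesCol("features")

// Sweep over the number of clusters.
val paramGrid = new ParamGridBuilder()
  .addGrid(kmeans.k, Array(2, 4, 8))
  .build()

val cv = new CrossValidator()
  .setEstimator(kmeans)
  .setEvaluator(new ClusteringEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val bestModel = cv.fit(dataset).bestModel
```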

@imatiach-msft
Contributor

imatiach-msft commented Jan 23, 2017

Also, if some metrics are only applicable to some models, as @srowen noted, we can either make separate evaluator classes or put all metrics in one and throw if the model does not support a given metric. Either solution would work and would be much better than putting the metric calculation on the clustering model itself.

@zhengruifeng
Contributor Author

The existing metrics (WSSSE, log-likelihood) are tied to the details of the algorithm. The WSSSE computation for KMeans/BisectingKMeans uses the mean vectors as the centers, but for KMedoids the medoids, rather than the means, should be used. If we reused the KMeans logic to compute WSSSE for KMedoids, I think it would be a mistake.
I also found that some supervised algorithms already support an evaluate method on their models: LinearRegression, LogisticRegression, and GeneralizedLinearRegression.
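
For reference, that evaluate pattern as it exists for supervised models in spark.ml; a sketch assuming `training` and `test` DataFrames with "label" and "features" columns:

```scala
import org.apache.spark.ml.regression.LinearRegression

val lrModel = new LinearRegression().fit(training)

// evaluate() returns a summary computed on the supplied dataset.
val testSummary = lrModel.evaluate(test)
println(testSummary.rootMeanSquaredError)
println(testSummary.r2)
```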

@srowen
Member

srowen commented Jan 24, 2017

Sure, and classification metrics like AUC only make sense for classifiers that output more than just a label -- they have to output a probability or score of some kind. Not every metric necessarily makes sense for every model, and we can use class hierarchy or just argument checking to avoid applying metrics where nonsensical. WSSSE can't be used for k-medoids, yes. k-medoids is also not in Spark, AFAIK. It's still not an argument to not abstract this at all.

@zhengruifeng
Contributor Author

@srowen I agree that a metric should be independent of the details of the algorithm. AUC is independent of the algorithm; it depends only on the dataset: in spark-ml, scikit-learn, or any other package, the input dataset contains labels and decision values (or probabilities), and AUC can be computed if and only if two labels exist in the dataset, no matter which classifier is used.

I also agree that some general metrics should be abstracted into an Evaluator.

I just disagree with treating WSSSE as a general metric:
There have been some attempts to add K-Medoids to Spark; although their PRs were not accepted, there are still third-party sources implementing K-Medoids on Spark.
More realistically, Spark is used together with other ML packages in many cases: suppose another package is used to generate the model locally and the result is then evaluated in Spark.
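
For comparison, the algorithm-agnostic AUC computation described above already exists for binary classifiers; a sketch assuming a `predictions` DataFrame with "label" and "rawPrediction" columns:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
  .evaluate(predictions)   // works for any binary classifier's output
```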

@imatiach-msft
Contributor

imatiach-msft commented Jan 26, 2017

@zhengruifeng don't most ML libraries have separate clustering evaluators? For example, WEKA has a ClusterEvaluation class. Scikit-learn just has a metrics module with functions you can call, but I don't really like that option, and in any case it is separate from the estimator/model. H2O has a MetricBuilder that all ML learners (supervised/unsupervised) generate. I think creating a separate evaluation class that fits in with the other evaluators in Spark would be ideal: it would conform to the structure of the current codebase and possibly limit any confusion users might have.

@HyukjinKwon
Member

gentle ping @zhengruifeng

@zhengruifeng deleted the clustering_model_evaluate branch on May 18, 2017 03:08