[WIP][MLLIB][SPARK-4675][SPARK-4823] RowSimilarity #6213
Conversation
…rs and similarProducts
…eatures and ALS factors
Test build #32916 has finished for PR 6213 at commit
@mengxr the failures are in the YARN suite, which does not look related to my changes. The tests I added ran fine.
For CosineKernel and ProductKernel, we should be able to have a separate code path with BLAS-2 once SparseMatrix x SparseVector merges, and BLAS-3 once SparseMatrix x SparseMatrix merges — basically refactor blockify from MatrixFactorizationModel to IndexedRowMatrix. Right now the sparse features are not in master yet. For Euclidean, RBF, and Pearson, even with these changes merged, I think we still have to stay in BLAS-1.
SparseMatrix x SparseVector got merged to master today (#6209). I will update the PR and separate the code paths for CosineKernel/ProductKernel and EuclideanKernel/RBFKernel to see the runtime improvements.
Thinking about it more: maybe EuclideanKernel can be decomposed using matrix x vector as well.
Actually, for both Euclidean and RBF it is possible, since ||x - y||^2 can be decomposed as ||x||^2 + ||y||^2 - 2*dot(x, y), where dot(x, y) can be computed through dgemv. We can't use dgemm yet since BLAS does not have SparseMatrix x SparseMatrix. Is there an open PR for it?
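As a quick sanity check of the decomposition above, here is a small sketch in plain Scala (no Spark dependency; the example vectors are arbitrary). In practice the dot products would be batched through gemv/gemm, but the identity is the same:

```scala
object EuclideanDecomposition {
  // dot(x, y) = sum of element-wise products
  def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  // Direct squared Euclidean distance: sum of (x_i - y_i)^2
  def sqDistDirect(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum

  // Decomposed form: ||x||^2 + ||y||^2 - 2 * dot(x, y).
  // The dot(x, y) terms are what a BLAS-2/BLAS-3 call can batch.
  def sqDistDecomposed(x: Array[Double], y: Array[Double]): Double =
    dot(x, x) + dot(y, y) - 2.0 * dot(x, y)

  def main(args: Array[String]): Unit = {
    val x = Array(1.0, 2.0, 3.0)
    val y = Array(4.0, 0.0, -1.0)
    println(sqDistDirect(x, y))      // 29.0
    println(sqDistDecomposed(x, y))  // 29.0
  }
}
```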
For gemv it is not clear how to reuse the scratch space for the result vector; if we can't reuse the result vector over multiple calls to kernel.compute, we won't get much runtime benefit. I am considering that for Vector-based IndexedRowMatrix, we define the kernel as the traditional (vector, vector) compute and use level-1 BLAS as done in this PR. The big runtime benefit will come from approximate KNN, which I will open up next, but we still need brute-force KNN for cross validation. For (Long, Array[Double]) from MatrixFactorizationModel (similarUsers and similarProducts), we can use dgemm specifically for DenseMatrix x DenseMatrix. @mengxr what do you think? That way we can use dgemm when the features are dense. Also, the (Long, Array[Double]) data structure can be defined in the recommendation/linalg package and reused by dense kernel computation. Or perhaps for similarity/KNN computation it is fine to stay in vector space and not do the gemv/gemm optimization?
…and item->item (cosine kernel)
@mengxr I generalized MatrixFactorizationModel.recommendAll, used it for similarUsers and similarProducts, and used dgemm. In IndexedRowMatrix I only exposed rowSimilarity as the public API, and it uses blocked BLAS level-1 computation. It is easy to use gemv in IndexedRowMatrix.rowSimilarity for CosineKernel, but for RBFKernel things get tricky: for sparse vectors, I don't think we can write the squared Euclidean distance as ||x||^2 + ||y||^2 - 2*dot(x, y) without letting go of some accuracy, which might be acceptable given the runtime benefits. I am looking further into RBF computation using dgemv.
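For reference, a minimal sketch of the (vector, vector) kernel abstraction discussed in this thread. CosineKernel and RBFKernel are names from the conversation, but the trait shape, method signature, and the use of dense arrays here are my assumptions, not the PR's actual API:

```scala
// Hypothetical sketch of a (vector, vector) kernel; level-1 (dot-product
// style) computation as discussed above. Not the PR's actual trait.
trait VectorKernel {
  def compute(x: Array[Double], y: Array[Double]): Double
}

object CosineKernel extends VectorKernel {
  private def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  // cosine similarity = dot(x, y) / (||x|| * ||y||)
  def compute(x: Array[Double], y: Array[Double]): Double =
    dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
}

class RBFKernel(gamma: Double) extends VectorKernel {
  // exp(-gamma * ||x - y||^2); the squared distance is the quantity that
  // could alternatively come from the gemv decomposition, at some accuracy
  // cost for sparse vectors as noted above.
  def compute(x: Array[Double], y: Array[Double]): Double = {
    val sqDist = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    math.exp(-gamma * sqDist)
  }
}
```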
Test build #33417 has finished for PR 6213 at commit
Internally, the vector flow in IndexedRowMatrix has helped us do additional optimization through user-defined kernels and cut computation, which won't happen if we go to dgemv, since the matrix compute would be done before norm-based filters can be applied. I think we should keep the vector-based kernel compute and get user feedback first.
Test build #33418 has finished for PR 6213 at commit
Test build #33419 has finished for PR 6213 at commit
Test build #33421 has finished for PR 6213 at commit
…LS for low rank item similarity calculation
Runtime comparisons are posted on https://issues.apache.org/jira/browse/SPARK-4823 for the MovieLens-1M dataset: 8 cores, 1 GB driver memory, 4 GB executor memory, on my laptop. Stages 24-35 are the row similarity flow; total runtime is ~20 s. I have not yet gone to gemv, which will decrease the runtime further but will add some approximation in RBFKernel. I think we should give users both the vector-based flow and the gemv-based flow and let them choose.

I updated the example code that shows the similarity flows in MLlib through examples.mllib.MovieLensSimilarity. @MLnick @srowen could you please take a look at examples.mllib.MovieLensSimilarity? I am running ALS in implicit mode with no regularization (basically full RMSE optimization) and comparing similarities generated from raw features and from MatrixFactorizationModel.productFactors. I take topK=50 from raw features as golden labels and compute MAP on the top-50 predictions from the MatrixFactorizationModel.similarItems() that this PR adds. I will add a test case for RBFKernel and a PowerIterationClustering driver that uses the IndexedRowMatrix.rowSimilarities code before removing the WIP label from the PR.
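The MAP evaluation described above — golden top-50 labels from raw features, scored against the model's top-50 predictions — reduces to average precision per item, averaged over items. A hypothetical helper (not the PR's actual evaluation code) makes the metric concrete:

```scala
object RankingEval {
  // Average precision of a ranked prediction list against a golden label set:
  // at each rank where a prediction hits a label, take precision at that rank,
  // then average over the number of golden labels. MAP is the mean of this
  // value over all query items.
  def averagePrecision(predicted: Seq[Int], labels: Set[Int]): Double = {
    var hits = 0
    var sumPrec = 0.0
    predicted.zipWithIndex.foreach { case (p, i) =>
      if (labels.contains(p)) {
        hits += 1
        sumPrec += hits.toDouble / (i + 1)
      }
    }
    if (labels.isEmpty) 0.0 else sumPrec / labels.size
  }

  def main(args: Array[String]): Unit = {
    // Hits at ranks 1 and 3 out of 2 labels: (1/1 + 2/3) / 2
    println(averagePrecision(Seq(1, 2, 3), Set(1, 3)))
  }
}
```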
Refactoring MatrixFactorizationModel.recommendForAll to a common place like Vectors/Matrices will help users who have dense data with a modest number of columns (~1,000-10,000; most IoT data falls in this category) reuse dgemm-based kernel computation. I am not sure what a good place for this code is.
Test build #33423 has finished for PR 6213 at commit
Hi @debasish83, thank you for this PR. As it stands, it has too many components, which makes it hard to review individual contributions. @mengxr and I spoke about this, and we are wondering if you'd like to split it up into smaller PRs. In order, the PRs would be the following:
Could you please close this PR and submit the above in order, one at a time? We should work on each in order, i.e. wait for one to be merged before reviewing the next. The relevant JIRAs are 1) SPARK-4823, 2) SPARK-4675, and 3) is new.
@rezazadeh sure, I will do that. Could you add a JIRA for 3 (kernel clustering / PIC) so that we can add the RBFKernel flow and implement PIC with vector-matrix multiply for comparisons? Also, in general, topK can decrease the kernel size and is a cross-validation parameter for measuring the degradation of the clustering relative to the full kernel, which is always difficult to keep as the rows grow. No such experiments have been done for PIC. I am experimenting with gemv-based optimization for SparseVector x SparseMatrix, and if I get further speedup compared to the level-1 flow, most likely we will provide both options to users in SPARK-4823.
@debasish83 It's not clear whether we need (3) yet; let's focus on the performance of (1) and (2) first. It's probably overkill to include all kinds of different similarity scores here. You should probably close this PR and make a new one for (1). Please don't start working on (2) until (1) is merged, because they depend on each other heavily.
Internally we are using this code with euclidean/rbf driving PIC, for example... but sure, we can focus on cosine first.
Test build #38333 has finished for PR 6213 at commit
Test build #38347 has finished for PR 6213 at commit
Any progress on this, @debasish83?
@rezazadeh I got busy with a Spark Streaming version of KNN :-) I will open up 2 PRs over the weekend as we discussed.
@debasish83 How is this looking? I want to try it soon :)
@debasish83 any status on this PR?
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
@mengxr @srowen @rezazadeh
For RowMatrix with up to ~100K columns, colSimilarity with brute-force/DIMSUM sampling is used. This PR adds rowSimilarity to IndexedRowMatrix, which outputs a CoordinateMatrix. For matrices with more than ~1M columns, the row similarity flow scales better than the column similarity flow.
For most applications, the number of similar items required (topK) is much smaller than the number of available items, so the rowSimilarity API takes topK and a threshold as input; both help reduce shuffle space.
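The pruning effect of topK and threshold can be illustrated with a small in-memory sketch. This is an illustrative brute-force version with a cosine kernel and hypothetical names; the actual API runs distributed over an IndexedRowMatrix and emits a CoordinateMatrix:

```scala
// Illustrative, single-machine sketch of topK/threshold pruning in a
// brute-force row similarity computation. Hypothetical names, not the PR API.
object RowSimilaritySketch {
  private def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  private def cosine(x: Array[Double], y: Array[Double]): Double =
    dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

  // For each row, keep only the topK most similar other rows whose similarity
  // clears the threshold. This pruning shrinks the (i, j, sim) output and, in
  // the distributed flow, the shuffle space.
  def rowSimilarities(rows: Array[Array[Double]], topK: Int, threshold: Double)
      : Map[Int, Seq[(Int, Double)]] =
    rows.indices.map { i =>
      val sims = rows.indices
        .filter(_ != i)
        .map(j => (j, cosine(rows(i), rows(j))))
        .filter(_._2 >= threshold)
        .sortBy(-_._2)
        .take(topK)
      i -> sims
    }.toMap
}
```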
For a MatrixFactorizationModel, the number of columns in both the user and product factors is generally ~50-200, so the column similarity flow does not work for such cases. This PR also adds batch similarUsers and similarProducts (SPARK-4675).
The following ideas are added:
Next steps:
From internal experiments, we have run 6M users, 1.6M items, and 351M ratings through the row similarity flow with topK=200 in 1.1 hr with 240 cores over 30 nodes. We had a difficult time scaling the column similarity flow, since the topK optimization can't be applied until the reduce phase is done in that flow.
On the MovieLens-1M and Netflix datasets I will report row and column similarity runtime comparisons.