
Conversation

@debasish83

@mengxr @srowen @rezazadeh

For RowMatrix with 100K columns, column similarity with brute-force/DIMSUM sampling is used. This PR adds rowSimilarity to IndexedRowMatrix, which outputs a CoordinateMatrix. For matrices with more than 1M columns, the row similarity flow scales better than the column similarity flow.

For most applications, the number of similar items needed (topK) is much smaller than the number of available items, so the rowSimilarity API takes topK and a threshold as input; both help reduce the shuffle space.
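
For reference, a minimal usage sketch of the proposed API (the method name follows the rowSimilarities() naming used later in this thread; the exact signature is an assumption based on the description above, and `sc` is an existing SparkContext):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// A tall-and-skinny matrix keyed by row id.
val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 0.0, 2.0)),
  IndexedRow(1L, Vectors.dense(0.0, 1.0, 1.0)),
  IndexedRow(2L, Vectors.dense(2.0, 0.0, 4.0))))
val mat = new IndexedRowMatrix(rows)

// Proposed in this PR (not in Spark master): keep at most topK neighbors per
// row and drop pairs whose similarity falls below the threshold. The result
// is a CoordinateMatrix of (rowId, rowId, similarity) entries.
val sims = mat.rowSimilarities(topK = 2, threshold = 0.5)
```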

For a MatrixFactorizationModel, the user and product factors generally have ~50-200 columns, so the column similarity flow does not work for such cases. This PR also adds batch similarUsers and similarProducts (SPARK-4675).

The following ideas are added:

  1. Similarity computation is abstracted as a Kernel (see the sketch after this list)
  2. Kernel implementations for Cosine, RBF, Euclidean, and Product (for distributed matrix multiply) are added
  3. Tests cover the Cosine kernel; more tests will be added for the Euclidean, RBF, and Product kernels
  4. The IndexedRowMatrix object adds a kernelized distributed matrix multiply, which is used by the similarity computation
  5. In examples, MovieLensSimilarity is added, which runs the column- and row-based flows on MovieLens as a runtime experiment
  6. Level-1 BLAS is used so that the kernel abstraction can be applied. We could either design the Kernel abstraction with Level-3 BLAS (might be difficult) or use BlockMatrix for distributed matrix multiply
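
To make the abstraction concrete, here is a hedged sketch of what the Kernel trait and the cosine implementation could look like; the compute signature is an assumption inferred from the public-class listing reported by SparkQA below, not the PR's exact code:

```scala
import org.apache.spark.mllib.linalg.Vector

// One (row, row) similarity at a time, in the Level-1 BLAS style this PR uses.
trait Kernel extends Serializable {
  def compute(i: Long, x: Vector, j: Long, y: Vector): Double
}

// Cosine similarity with precomputed row norms; pairs below `threshold`
// would be filtered out by the caller. Assumes x and y have the same dimension.
case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel {
  override def compute(i: Long, x: Vector, j: Long, y: Vector): Double = {
    val xs = x.toArray
    val ys = y.toArray
    var dot = 0.0
    var k = 0
    while (k < xs.length) { dot += xs(k) * ys(k); k += 1 }
    dot / (rowNorms(i) * rowNorms(j))
  }
}
```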

Next steps:

  1. In MovieLensSimilarity add ALS + similarItems example
  2. Use RBF similarity in power iteration clustering flow

From internal experiments: we ran 6M users, 1.6M items, and 351M ratings through the row similarity flow with topK=200 in 1.1 hours, with 240 cores running over 30 nodes. We had a difficult time scaling the column similarity flow, since in that flow the topK optimization cannot be applied until the reduce phase is done.

I will report row and column similarity runtime comparisons on the MovieLens-1M and Netflix datasets.

@SparkQA

SparkQA commented May 17, 2015

Test build #32916 has finished for PR 6213 at commit cc4e104.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • trait Kernel
    • case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel
    • case class ProductKernel() extends Kernel
    • case class EuclideanKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel
    • case class RBFKernel(rowNorms: Map[Long, Double], sigma: Double) extends Kernel

@debasish83
Author

@mengxr the failures are in the YARN suite, which does not look related to my changes... the tests I added ran fine...
[info] *** 1 TEST FAILED ***
[error] Failed: Total 39, Failed 1, Errors 0, Passed 38
[error] Failed tests:
[error] org.apache.spark.deploy.yarn.YarnClusterSuite

@debasish83
Author

For CosineKernel and ProductKernel, we should be able to have a separate code path using BLAS-2 once SparseMatrix x SparseVector merges, and BLAS-3 once SparseMatrix x SparseMatrix merges... basically, refactor blockify from MatrixFactorizationModel into IndexedRowMatrix... Right now the sparse features are not in master yet. For Euclidean, RBF, and Pearson, I think we still have to stay with BLAS-1 even after these changes merge.

@debasish83
Author

SparseMatrix x SparseVector got merged to master today (#6209).

I will update the PR and separate the code paths for CosineKernel/ProductKernel and EuclideanKernel/RBFKernel to see the runtime improvements.

@debasish83
Author

Thinking about it more: maybe EuclideanKernel can be decomposed using matrix x vector as well.

@debasish83
Author

Actually, it is possible for both Euclidean and RBF, since ||x - y||^2 can be decomposed as ||x||^2 + ||y||^2 - 2*dot(x, y), where dot(x, y) can be computed through dgemv... We can't use dgemm yet, since BLAS does not have SparseMatrix x SparseMatrix... Is there an open PR for it?
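
A quick check of the decomposition with plain arrays (illustration only; no Spark types are needed for the identity itself):

```scala
val x = Array(1.0, 2.0, 3.0)
val y = Array(4.0, 0.0, 1.0)

val dot    = x.zip(y).map { case (a, b) => a * b }.sum
val normX2 = x.map(v => v * v).sum
val normY2 = y.map(v => v * v).sum

// ||x - y||^2 computed directly and via the decomposition.
val direct     = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
val decomposed = normX2 + normY2 - 2 * dot
assert(math.abs(direct - decomposed) < 1e-12)

// The RBF kernel then follows from the same quantity.
val sigma = 1.0
val rbf = math.exp(-decomposed / (2 * sigma * sigma))
```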

@debasish83
Author

For gemv, it is not clear how to reuse the scratch space for the result vector... If we can't reuse the result vector over multiple calls to kernel.compute, we won't get much runtime benefit... I am considering that, for a Vector-based IndexedRowMatrix, we define the kernel as the traditional (vector, vector) compute and use Level-1 BLAS, as done in this PR. The big runtime benefit will come from the approximate KNN that I will open up next, but we still need brute-force KNN for cross-validation.

For the (Long, Array[Double]) rows from a matrix factorization model (similarUsers and similarProducts), we can use dgemm, specifically for DenseMatrix x DenseMatrix... @mengxr what do you think? That way we can use dgemm when the features are dense. Also, the (Long, Array[Double]) data structure could be defined in the recommendation/linalg package and reused by the dense kernel computation. Or perhaps, for similarity/KNN computation, it is fine to stay in vector space and not do the gemv/gemm optimization?
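
As an illustration of the blocked dense computation (my own sketch, not this PR's code): for two blocks of (id, factor) pairs, compute all pairwise dot products; a real implementation would hand the two blocks to dgemm, and the loop below just spells out what that gemm computes, assuming all factor arrays share the same rank:

```scala
def blockDots(
    users: Array[(Long, Array[Double])],
    items: Array[(Long, Array[Double])]): Array[(Long, Long, Double)] = {
  for {
    (uid, uf) <- users
    (iid, pf) <- items
  } yield {
    // Dot product of one user factor with one item factor.
    var dot = 0.0
    var k = 0
    while (k < uf.length) { dot += uf(k) * pf(k); k += 1 }
    (uid, iid, dot)
  }
}
```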

@debasish83 debasish83 changed the title [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity [MLLIB][WIP][SPARK-4675][SPARK-4823]RowSimilarity May 20, 2015
@debasish83 debasish83 changed the title [MLLIB][WIP][SPARK-4675][SPARK-4823]RowSimilarity [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilarity May 20, 2015
@debasish83
Author

@mengxr I generalized MatrixFactorizationModel.recommendAll, used it for similarUsers and similarProducts, and switched it to dgemm... In IndexedRowMatrix I only exposed rowSimilarity as the public API, and it uses blocked BLAS Level-1 computation... It is easy to use gemv in IndexedRowMatrix.rowSimilarity for CosineKernel, but for RBFKernel things get tricky: for sparse vectors, I don't think we can write the Euclidean distance as ||x||^2 + ||y||^2 - 2*dot(x, y) without giving up some accuracy, which might be OK given the runtime benefits... I am looking further into the RBF computation using dgemv...

@SparkQA

SparkQA commented May 23, 2015

Test build #33417 has finished for PR 6213 at commit 00cabd0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • trait Kernel
    • case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel
    • case class ProductKernel() extends Kernel
    • case class RBFKernel(rowNorms: Map[Long, Double], sigma: Double, threshold: Double) extends Kernel

@debasish83
Author

Internally, the vector flow in IndexedRowMatrix has helped us do additional optimization through user-defined kernels and cut down the computation; that won't happen if we go to dgemv, since the matrix compute would be done first, before norm-based filters can be applied... I think we should keep the vector-based kernel compute and get user feedback first...
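
To illustrate the kind of cut the vector flow allows (my example, not this PR's exact filter): by the reverse triangle inequality, ||x - y|| >= | ||x|| - ||y|| |, so for an RBF threshold a pair can be skipped using the precomputed row norms alone, before any dot product is computed:

```scala
// Returns true when exp(-||x - y||^2 / (2 * sigma^2)) is guaranteed to fall
// below threshold t, using only the row norms.
def canSkipRbf(normX: Double, normY: Double, sigma: Double, t: Double): Boolean = {
  require(t > 0.0 && t < 1.0)
  val maxDistSq  = -2.0 * sigma * sigma * math.log(t) // similarity >= t iff d^2 <= this
  val lowerBound = normX - normY                      // |lowerBound| <= ||x - y||
  lowerBound * lowerBound > maxDistSq
}
```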

@SparkQA

SparkQA commented May 23, 2015

Test build #33418 has finished for PR 6213 at commit aa21954.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • trait Kernel
    • case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel
    • case class ProductKernel() extends Kernel
    • case class RBFKernel(rowNorms: Map[Long, Double], sigma: Double, threshold: Double) extends Kernel

@SparkQA

SparkQA commented May 23, 2015

Test build #33419 has finished for PR 6213 at commit 156ceca.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • trait Kernel
    • case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel
    • case class ProductKernel() extends Kernel
    • case class RBFKernel(rowNorms: Map[Long, Double], sigma: Double, threshold: Double) extends Kernel

@SparkQA

SparkQA commented May 23, 2015

Test build #33421 has finished for PR 6213 at commit efe68f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • trait Kernel
    • case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel
    • case class ProductKernel() extends Kernel
    • case class RBFKernel(rowNorms: Map[Long, Double], sigma: Double, threshold: Double) extends Kernel

…LS for low rank item similarity calculation
@debasish83
Author

Runtime comparisons are posted on https://issues.apache.org/jira/browse/SPARK-4823 for the MovieLens-1M dataset, with 8 cores, 1 GB driver memory, and 4 GB executor memory on my laptop.

Stages 24-35 are the row similarity flow. Total runtime: ~20 s.
Stage 64 is the column similarity mapPartitions. Total runtime: ~4.6 min.

I have not yet moved to gemv, which will decrease the runtime further but will add some approximation to RBFKernel. I think we should give users both the vector-based flow and the gemv-based flow and let them choose.

I updated the example code that demonstrates the similarity flows in MLlib through examples.mllib.MovieLensSimilarity.

@MLnick @srowen could you please take a look at examples.mllib.MovieLensSimilarity? I am running ALS in implicit mode with no regularization (basically full RMSE optimization) and comparing the similarities generated from the raw features with those from MatrixFactorizationModel.productFeatures.

I take the topK=50 neighbors from the raw features as golden labels and compute MAP on the top-50 predictions from the MatrixFactorizationModel.similarItems() that this PR adds.
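
For reference, a hedged sketch of that evaluation (my helpers, not code from this PR): average precision of a ranked top-50 prediction list against the golden set, then averaged over all items:

```scala
def averagePrecision(predicted: Seq[Long], golden: Set[Long]): Double = {
  var hits = 0
  var sumPrecision = 0.0
  predicted.zipWithIndex.foreach { case (id, idx) =>
    if (golden.contains(id)) {
      hits += 1
      sumPrecision += hits.toDouble / (idx + 1)
    }
  }
  if (golden.isEmpty) 0.0
  else sumPrecision / math.min(golden.size, predicted.size)
}

def meanAveragePrecision(
    predictions: Map[Long, Seq[Long]],
    goldenLabels: Map[Long, Set[Long]]): Double = {
  val aps = predictions.map { case (item, preds) =>
    averagePrecision(preds, goldenLabels.getOrElse(item, Set.empty))
  }
  if (aps.isEmpty) 0.0 else aps.sum / aps.size
}
```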

I will add a test case for RBFKernel and a PowerIterationClustering driver that uses the IndexedRowMatrix.rowSimilarities code before taking the WIP label off the PR.

@debasish83
Author

Refactoring MatrixFactorizationModel.recommendForAll to a common place like Vectors/Matrices will help users who have dense data with a modest number of columns (~1000-10K; most IoT data falls in this category) reuse the dgemm-based kernel computation. What would be a good place for this code?

@SparkQA

SparkQA commented May 24, 2015

Test build #33423 has finished for PR 6213 at commit af70583.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • trait Kernel extends Serializable
    • case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel
    • case class ProductKernel() extends Kernel
    • case class RBFKernel(rowNorms: Map[Long, Double], sigma: Double, threshold: Double) extends Kernel

@rezazadeh
Contributor

Hi @debasish83, thank you for this PR. As it stands, it has too many components, which makes it hard to review the individual contributions. @mengxr and I spoke about this, and we are wondering if you'd like to split it up into smaller PRs. In order, the PRs would be the following:

  1. Adding rowSimilarities() for just cosine similarity (more similarity types add extra reviewing, so please leave those out). Once this is done, then:
  2. Adding similarProducts and similarUsers to MatrixFactorizationModel. Once this is done, then:
  3. Adding different similarity kernels

Could you please close this PR and submit the above in order, one at a time? We should work on each in order, i.e. wait for one to be merged before review of the next one starts. The relevant JIRAs are (1) SPARK-4823 and (2) SPARK-4675; (3) is new.

@debasish83
Author

@rezazadeh sure, I will do that... Could you add a JIRA for (3) (kernel clustering / PIC) so that we can add the RBFKernel flow and implement PIC with vector-matrix multiply for comparison? Also, in general, topK can decrease the kernel size and serves as a cross-validation parameter for measuring how much the clustering degrades compared to the full kernel, which is always difficult to keep as the number of rows grows... No such experiments have been done for PIC.

I am experimenting with a gemv-based optimization for SparseVector x SparseMatrix, and if I get further speedup compared to the Level-1 flow, we will most likely provide both options to users in SPARK-4823.

@rezazadeh
Contributor

@debasish83 It's not clear whether we need (3) yet, and it's probably overkill to include all kinds of different similarity scores here; let's focus on the performance of (1) and (2) first. You should probably close this PR and make a new one for (1). Please don't start working on (2) until (1) is merged, because they depend on each other heavily.

@debasish83
Author

Internally we are using this code with Euclidean/RBF driving PIC, for example... but sure, we can focus on cosine first...

@SparkQA

SparkQA commented Jul 24, 2015

Test build #38333 has finished for PR 6213 at commit 3c49d98.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • trait Kernel extends Serializable
    • case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel
    • case class ProductKernel() extends Kernel
    • case class RBFKernel(rowNorms: Map[Long, Double], sigma: Double, threshold: Double) extends Kernel

@SparkQA

SparkQA commented Jul 24, 2015

Test build #38347 has finished for PR 6213 at commit 4e95ea6.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • trait Kernel extends Serializable
    • case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel
    • case class ProductKernel() extends Kernel
    • case class RBFKernel(rowNorms: Map[Long, Double], sigma: Double, threshold: Double) extends Kernel

@rezazadeh
Contributor

Any progress on this @debasish83 ?

@debasish83
Author

@rezazadeh I got busy with a Spark Streaming version of KNN :-) I will open up the 2 PRs over the weekend, as we discussed.

@jlamcanopy

@debasish83 How is this looking? I want to try it soon :)

@tn531

tn531 commented Dec 9, 2015

@debasish83 any status on this PR?

@rxin
Contributor

rxin commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

@asfgit asfgit closed this in 7b4452b Dec 31, 2015