[WIP][MLLIB][SPARK-4675][SPARK-4823] RowSimilarity #6213
Conversation
…rs and similarProducts
…eatures and ALS factors
Test build #32916 has finished for PR 6213 at commit
@mengxr the failures are in the YARN suite, which does not look related to my changes. The tests I added ran fine.
For CosineKernel and ProductKernel, we should be able to have a separate code path with BLAS-2 once SparseMatrix x SparseVector merges, and BLAS-3 once SparseMatrix x SparseMatrix merges — basically refactor blockify from MatrixFactorizationModel to IndexedRowMatrix. Right now the sparse features are not in master yet. For Euclidean, RBF, and Pearson, even with these changes merged, I think we still have to stay in BLAS-1.
SparseMatrix x SparseVector got merged to master today (#6209). I will update the PR and separate the code paths for CosineKernel/ProductKernel and EuclideanKernel/RBFKernel to see the runtime improvements.
Thinking about it more: maybe EuclideanKernel can be decomposed using matrix x vector as well.
Actually, for both Euclidean and RBF it is possible, since ||x - y||^2 can be decomposed as ||x||^2 + ||y||^2 - 2*dot(x, y), where dot(x, y) can be computed through dgemv. We can't use dgemm yet since BLAS does not have SparseMatrix x SparseMatrix. Is there an open PR for it?
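As a quick sanity check of the decomposition above, here is a small sketch in plain Scala (no Spark dependency; the example vectors are arbitrary). In practice the dot products would be batched through gemv/gemm, but the identity is the same:

```scala
object EuclideanDecomposition {
  // dot(x, y) = sum of element-wise products
  def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  // Direct squared Euclidean distance: sum of (x_i - y_i)^2
  def sqDistDirect(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum

  // Decomposed form: ||x||^2 + ||y||^2 - 2 * dot(x, y).
  // The dot(x, y) terms are what a BLAS-2/BLAS-3 call can batch.
  def sqDistDecomposed(x: Array[Double], y: Array[Double]): Double =
    dot(x, x) + dot(y, y) - 2.0 * dot(x, y)

  def main(args: Array[String]): Unit = {
    val x = Array(1.0, 2.0, 3.0)
    val y = Array(4.0, 0.0, -1.0)
    println(sqDistDirect(x, y))      // 29.0
    println(sqDistDecomposed(x, y))  // 29.0
  }
}
```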
For gemv it is not clear how to reuse the scratch space for the result vector; if we can't reuse the result vector over multiple calls to kernel.compute, we won't get much runtime benefit. I am considering that for Vector-based IndexedRowMatrix, we define the kernel as the traditional (vector, vector) compute and use level-1 BLAS as done in this PR. The big runtime benefit will come from approximate KNN, which I will open up next, but we still need brute-force KNN for cross validation. For (Long, Array[Double]) from MatrixFactorizationModel (similarUsers and similarProducts), we can use dgemm specifically for DenseMatrix x DenseMatrix. @mengxr what do you think? That way we can use dgemm when the features are dense. Also, the (Long, Array[Double]) data structure can be defined in the recommendation/linalg package and reused by dense kernel computation. Or perhaps for similarity/KNN computation it is fine to stay in vector space and not do the gemv/gemm optimization?
…and item->item (cosine kernel)
@mengxr I generalized MatrixFactorizationModel.recommendAll, used it for similarUsers and similarProducts, and used dgemm. In IndexedRowMatrix I only exposed rowSimilarity as the public API, and it uses blocked BLAS level-1 computation. It is easy to use gemv in IndexedRowMatrix.rowSimilarity for CosineKernel, but for RBFKernel things get tricky: for sparse vectors, I don't think we can write the squared Euclidean distance as ||x||^2 + ||y||^2 - 2*dot(x, y) without letting go of some accuracy, which might be acceptable given the runtime benefits. I am looking further into RBF computation using dgemv.
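For reference, a minimal sketch of the (vector, vector) kernel abstraction discussed in this thread. CosineKernel and RBFKernel are names from the conversation, but the trait shape, method signature, and the use of dense arrays here are my assumptions, not the PR's actual API:

```scala
// Hypothetical sketch of a (vector, vector) kernel; level-1 (dot-product
// style) computation as discussed above. Not the PR's actual trait.
trait VectorKernel {
  def compute(x: Array[Double], y: Array[Double]): Double
}

object CosineKernel extends VectorKernel {
  private def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  // cosine similarity = dot(x, y) / (||x|| * ||y||)
  def compute(x: Array[Double], y: Array[Double]): Double =
    dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
}

class RBFKernel(gamma: Double) extends VectorKernel {
  // exp(-gamma * ||x - y||^2); the squared distance is the quantity that
  // could alternatively come from the gemv decomposition, at some accuracy
  // cost for sparse vectors as noted above.
  def compute(x: Array[Double], y: Array[Double]): Double = {
    val sqDist = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    math.exp(-gamma * sqDist)
  }
}
```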
Test build #33417 has finished for PR 6213 at commit
Internally, the vector flow in IndexedRowMatrix has helped us do additional optimization through user-defined kernels and cut computation, which won't happen if we go to dgemv, since the matrix compute would be done before norm-based filters can be applied. I think we should keep the vector-based kernel compute and get user feedback first.
Test build #33418 has finished for PR 6213 at commit
Test build #33419 has finished for PR 6213 at commit
Test build #33421 has finished for PR 6213 at commit
…LS for low rank item similarity calculation
Runtime comparisons are posted on https://issues.apache.org/jira/browse/SPARK-4823 for the MovieLens-1M dataset: 8 cores, 1 GB driver memory, 4 GB executor memory, on my laptop. Stages 24-35 are the row similarity flow; total runtime is ~20 s. I have not yet gone to gemv, which will decrease the runtime further but will add some approximation in RBFKernel. I think we should give users both the vector-based flow and the gemv-based flow and let them choose.

I updated the example code that shows the similarity flows in MLlib through examples.mllib.MovieLensSimilarity. @MLnick @srowen could you please take a look at examples.mllib.MovieLensSimilarity? I am running ALS in implicit mode with no regularization (basically full RMSE optimization) and comparing similarities generated from raw features and from MatrixFactorizationModel.productFactors. I take topK=50 from raw features as golden labels and compute MAP on the top-50 predictions from the MatrixFactorizationModel.similarItems() that this PR adds. I will add a test case for RBFKernel and a PowerIterationClustering driver that uses the IndexedRowMatrix.rowSimilarities code before removing the WIP label from the PR.
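The MAP evaluation described above — golden top-50 labels from raw features, scored against the model's top-50 predictions — reduces to average precision per item, averaged over items. A hypothetical helper (not the PR's actual evaluation code) makes the metric concrete:

```scala
object RankingEval {
  // Average precision of a ranked prediction list against a golden label set:
  // at each rank where a prediction hits a label, take precision at that rank,
  // then average over the number of golden labels. MAP is the mean of this
  // value over all query items.
  def averagePrecision(predicted: Seq[Int], labels: Set[Int]): Double = {
    var hits = 0
    var sumPrec = 0.0
    predicted.zipWithIndex.foreach { case (p, i) =>
      if (labels.contains(p)) {
        hits += 1
        sumPrec += hits.toDouble / (i + 1)
      }
    }
    if (labels.isEmpty) 0.0 else sumPrec / labels.size
  }

  def main(args: Array[String]): Unit = {
    // Hits at ranks 1 and 3 out of 2 labels: (1/1 + 2/3) / 2
    println(averagePrecision(Seq(1, 2, 3), Set(1, 3)))
  }
}
```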
Refactoring MatrixFactorizationModel.recommendForAll to a common place like Vectors/Matrices will help users who have dense data with a modest number of columns (~1,000-10,000; most IoT data falls in this category) reuse dgemm-based kernel computation. I am not sure what a good place for this code is.
Test build #33423 has finished for PR 6213 at commit
Hi @debasish83, thank you for this PR. As it stands, it has too many components, which makes it hard to review individual contributions. @mengxr and I spoke about this, and we are wondering if you'd like to split it up into smaller PRs. In order, the PRs would be the following:
Could you please close this PR and submit the above in order, one at a time? We should work on each in order, i.e. wait for one to be merged before reviewing the next. The relevant JIRAs are 1) SPARK-4823, 2) SPARK-4675, and 3) is new.
@rezazadeh sure, I will do that. Could you add a JIRA for 3 (kernel clustering / PIC) so that we can add the RBFKernel flow and implement PIC with vector-matrix multiply for comparisons? Also, in general, topK can decrease the kernel size and is a cross-validation parameter for measuring the degradation of the clustering relative to the full kernel, which is always difficult to keep as the rows grow. No such experiments have been done for PIC. I am experimenting with gemv-based optimization for SparseVector x SparseMatrix, and if I get further speedup compared to the level-1 flow, most likely we will provide both options to users in SPARK-4823.
@debasish83 It's not clear whether we need (3) yet; let's focus on the performance of (1) and (2) first. It's probably overkill to include all kinds of different similarity scores here. You should probably close this PR and make a new one for (1). Please don't start working on (2) until (1) is merged, because they depend on each other heavily.
Internally we are using this code with euclidean/rbf driving PIC, for example... but sure, we can focus on cosine first.
Test build #38333 has finished for PR 6213 at commit
Test build #38347 has finished for PR 6213 at commit
Any progress on this, @debasish83?
@rezazadeh I got busy with a Spark Streaming version of KNN :-) I will open up 2 PRs over the weekend as we discussed.
@debasish83 How is this looking? I want to try it soon :)
@debasish83 any status on this PR?
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
@mengxr @srowen @rezazadeh
For RowMatrix with up to ~100K columns, colSimilarity with brute-force/DIMSUM sampling is used. This PR adds rowSimilarity to IndexedRowMatrix, which outputs a CoordinateMatrix. For matrices with more than ~1M columns, the row similarity flow scales better than the column similarity flow.
For most applications, the number of similar items required (topK) is much smaller than the number of available items, so the rowSimilarity API takes topK and a threshold as input; both help reduce shuffle space.
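The pruning effect of topK and threshold can be illustrated with a small in-memory sketch. This is an illustrative brute-force version with a cosine kernel and hypothetical names; the actual API runs distributed over an IndexedRowMatrix and emits a CoordinateMatrix:

```scala
// Illustrative, single-machine sketch of topK/threshold pruning in a
// brute-force row similarity computation. Hypothetical names, not the PR API.
object RowSimilaritySketch {
  private def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  private def cosine(x: Array[Double], y: Array[Double]): Double =
    dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

  // For each row, keep only the topK most similar other rows whose similarity
  // clears the threshold. This pruning shrinks the (i, j, sim) output and, in
  // the distributed flow, the shuffle space.
  def rowSimilarities(rows: Array[Array[Double]], topK: Int, threshold: Double)
      : Map[Int, Seq[(Int, Double)]] =
    rows.indices.map { i =>
      val sims = rows.indices
        .filter(_ != i)
        .map(j => (j, cosine(rows(i), rows(j))))
        .filter(_._2 >= threshold)
        .sortBy(-_._2)
        .take(topK)
      i -> sims
    }.toMap
}
```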
For a MatrixFactorizationModel, the number of columns in both the user and product factors is generally ~50-200, so the column similarity flow does not work for such cases. This PR also adds batch similarUsers and similarProducts (SPARK-4675).
The following ideas are added:
Next steps:
From internal experiments, we have run 6M users, 1.6M items, and 351M ratings through the row similarity flow with topK=200 in 1.1 hr with 240 cores over 30 nodes. We had a difficult time scaling the column similarity flow, since the topK optimization can't be applied until the reduce phase is done in that flow.
On the MovieLens-1M and Netflix datasets I will report row and column similarity runtime comparisons.