[MLLIB][SPARK-4675] Find similar products and similar users in MatrixFactorizationModel #3536
Conversation
Typo: Similari, also below.
|
I think it's essential to explain (even in internal comments, or this PR) what the similarity metric is. It's just ranking by dot product, which makes it something like cosine similarity. The differences are that it isn't in [-1,1], and the result doesn't normalize away the length of the feature vectors. This tends to favor popular items, or mean that somewhat less similar items may rank higher because they're popular. I had traditionally viewed that as a negative, and preferred the more standard cosine similarity, but it's certainly up for debate. |
|
Re: explaining the similarity metric 👍 I'll do that. Re: cosine - no biggie to add. I used the dot product because 1) it follows the logic that CF finds "similar" items based on the latent space for a user when recommending products, and 2) using the dot product reduces the amount of new code added to MatrixFactorizationModel (I don't want to create clutter :)). So 👍 I will change to cosine. Re: popularity, I'll look into that as well. |
|
I wasn't necessarily suggesting changing the similarity metric, although I suppose my point is that the recommendation computation does not use a normalized metric like cosine similarity.
|
|
I'd agree that cosine similarity is preferred. I can't really think of a case where I've not used cosine sim for a similar-items or similar-users computation. Of course, it could potentially be added as an option (whether to use cosine sim - default - or plain dot product). |
|
Can we discuss it more on the JIRA? I updated it with my comments... I think we should add an API called rowSimilarities() in RowMatrix.scala and use that to compute similar rows... We can send both ProductFactor and UserFactor to this API to compute the top-K row similarities through map-reduce... In RowMatrix right now cosine is used, but Reza has plans to add Jaccard and other popular similarity measures... We should be able to re-use the code to generate kernel matrices as well... We can assume that tall-skinny matrices are being sent to rowSimilarities (we have a similar assumption for colSimilarities as well)... product x rank and user x rank are both tall-skinny matrices. |
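As a rough illustration of the rowSimilarities() idea (the method does not exist in RowMatrix at this point, so the signature and the brute-force cartesian implementation below are assumptions for discussion only, and would not scale to large factor matrices):

```scala
import org.apache.spark.SparkContext._   // pair-RDD implicits for older Spark versions
import org.apache.spark.rdd.RDD

// Hypothetical sketch: top-K cosine-similar rows of a tall-skinny (id -> factor vector) RDD.
def topKRowSimilarities(
    rows: RDD[(Int, Array[Double])],
    k: Int): RDD[(Int, Seq[(Int, Double)])] = {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na = math.sqrt(a.map(x => x * x).sum)
    val nb = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }
  rows.cartesian(rows)                                   // brute force: all row pairs
    .filter { case ((i, _), (j, _)) => i != j }
    .map { case ((i, a), (j, b)) => (i, (j, cosine(a, b))) }
    .groupByKey()
    .mapValues(sims => sims.toSeq.sortBy { case (_, sim) => -sim }.take(k))
}
```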
|
I kind of like this functionality; @sbourke are you in a position to continue with this? I think it needs a few typo fixes and commentary about its function. This is really about computing similar users and items rather than recommending items to users or users to items. I also think it has to include dividing by the norms to be cosine similarity. |
|
@srowen we have to be a bit careful since dense BLAS has to be used... I have an internal version with dot and it needs to be faster... also one at a time is not a good idea... there has to be a block dense matrix * block dense matrix operation... that way we can reuse native dgemm... |
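To illustrate the blocking point, here is a small sketch using Breeze (which Spark already depends on); a DenseMatrix-by-DenseMatrix multiply dispatches to native dgemm when netlib-java native libraries are available. The helper name and block layout are illustrative assumptions, not code from this PR:

```scala
import breeze.linalg.DenseMatrix

// Multiplying one block of factors (items x rank) by the transpose of another block
// computes all pairwise dot products between the two blocks in a single gemm call,
// instead of one dot product at a time.
def blockDotProducts(
    blockA: DenseMatrix[Double],   // itemsInBlockA x rank
    blockB: DenseMatrix[Double]    // itemsInBlockB x rank
  ): DenseMatrix[Double] =
  blockA * blockB.t                // itemsInBlockA x itemsInBlockB
```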
|
Agree, although this is no worse than the existing implementation for recommendation (it reuses it even). |
|
agreed... dense BLAS will be a common optimization for the item->item, user->user and user->item APIs... |
|
@MLnick @srowen I did an experiment where I computed brute-force top-K similar items using cosine distance and compared the intersection with item-factor-based brute-force top-K similar items using cosine distance after running implicit factorization... the intersection is only 42%... this is in line with the Google Correlate paper, where they have to do an additional reorder step in the real feature space to increase the recall (intersection)... did you guys also see similar results for item->item validation? |
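For reference, a tiny sketch of the overlap metric described here (names are illustrative, not taken from the experiment code): the fraction of a top-K neighbor list from the raw feature space that also appears in the factor-space top-K list.

```scala
// Fraction of overlap between two top-K neighbor lists for the same query item.
def topKIntersection(rawTopK: Seq[Int], factorTopK: Seq[Int]): Double = {
  require(rawTopK.nonEmpty, "need a non-empty top-K list")
  rawTopK.toSet.intersect(factorTopK.toSet).size.toDouble / rawTopK.size
}

// e.g. topKIntersection(Seq(1, 2, 3, 4, 5), Seq(2, 3, 9, 1, 7)) == 0.6
```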
|
Not sure I follow completely - do you mean you compared cosine sim between raw (ie "rating") item vectors, and cosine sim computed from item factor vectors? I would imagine they would be quite different... I always just use factor vectors |
|
I have not benchmarked these since neither is a "correct" answer to benchmark against the other. The cosine similarity isn't really that valid in the original feature space. It might still be interesting to know how different the answers are but they're probably going to be fairly different on purpose. |
|
@MLnick yes, that's what I did... I have to convince users why to use factor vectors :-) For user->item recommendation, convincing is easy by showing the ranking improvement through ALS. @srowen without a validation strategy, someone might propose running a different algorithm (KMeans on the raw feature space followed by an (item->cluster) join (cluster->items)) and claim that his item->item results are better... how do we know whether the ALS-based flow or the KMeans-based flow produces better results? NNALS can be thought of as soft-KMeans as well, so these flows are very similar. I am focused on implicit feedback here because only then can we run either KMeans or similarity on the raw feature space... With explicit feedback, I agree that cosine similarity is not valid in the original feature space, but in most practical datasets we are dealing with implicit feedback. |
|
Let's continue the validation discussion on #6213. That PR introduces batch gemm-based similarity computation in MatrixFactorizationModel using a kernel abstraction. Do we also need the online version that Steven added, or can it be extracted out of the batch results? My focus was more on speeding up batch computation... |
|
Can one of the admins verify this patch? |
|
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks! |
Using the latent feature space that is learnt in MatrixFactorizationModel, I have added two new functions to find similar products and similar users. A user of the API can, for example, pass a product ID and get the closest products based on the feature space.
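For illustration only, here is a minimal sketch of what such a similarProducts helper could look like if built solely on the public productFeatures RDD, using cosine similarity as discussed in the conversation; this is not the PR's actual implementation:

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Hypothetical sketch: return the `num` products whose factor vectors are most
// cosine-similar to the factor vector of `productId`.
def similarProducts(
    model: MatrixFactorizationModel,
    productId: Int,
    num: Int): Array[(Int, Double)] = {
  val query = model.productFeatures.lookup(productId).head
  val queryNorm = math.sqrt(query.map(x => x * x).sum)
  model.productFeatures
    .filter { case (id, _) => id != productId }
    .map { case (id, features) =>
      val dot = query.zip(features).map { case (x, y) => x * y }.sum
      val norm = math.sqrt(features.map(x => x * x).sum)
      (id, dot / (queryNorm * norm))
    }
    .top(num)(Ordering.by[(Int, Double), Double](_._2))
}
```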