
@sbourke

@sbourke sbourke commented Dec 1, 2014

Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products based on the feature space.

Member

Typo: Similari, also below.

@srowen
Member

srowen commented Dec 1, 2014

I think it's essential to explain (even in internal comments, or this PR) what the similarity metric is. It's just ranking by dot product, which makes it something like cosine similarity. The differences are that it isn't in [-1,1], and the result doesn't normalize away the length of the feature vectors. This tends to favor popular items, or mean that somewhat less similar items may rank higher because they're popular. I had traditionally viewed that as a negative, and preferred the more standard cosine similarity, but it's certainly up for debate.
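
As an illustration of the two metrics under discussion, here is a minimal sketch (made-up helper names, not code from this PR) contrasting a plain dot-product ranking with cosine similarity over the learnt item factor vectors:

```scala
object SimilaritySketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // Rank candidate items against one query item's factor vector.
  // With cosine = true the score is divided by both norms, landing in [-1, 1]
  // and removing the bias towards large-norm (often popular) item vectors;
  // with cosine = false it is the plain dot-product ranking.
  def topK(query: Array[Double],
           items: Seq[(Int, Array[Double])],
           k: Int,
           cosine: Boolean): Seq[(Int, Double)] = {
    val qNorm = norm(query)
    items.map { case (id, v) =>
      val score = if (cosine) dot(query, v) / (qNorm * norm(v)) else dot(query, v)
      (id, score)
    }.sortBy(-_._2).take(k)
  }
}
```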

@sbourke
Author

sbourke commented Dec 3, 2014

Re: Explaining similarity metric 👍 I'll do that.

Re: Cosine - no biggie to add. I used the dot product because 1) it follows the same logic CF uses when recommending products, i.e. finding "similar" items in the latent space for a user, and 2) it keeps the new code added to MatrixFactorizationModel small (I don't want to create clutter :)). So 👍 will change to cosine.

Re: Popularity, I'll look into that as well then.

@srowen
Member

srowen commented Dec 3, 2014

I wasn't necessarily suggesting changing the similarity metric, although I ended up using cosine too. Note you can skip normalizing by the target item's norm.

I suppose my point is that the recommendation computation does not use a dot product because it is performing a similarity computation. Those vectors are not even in the same space. So I wouldn't reuse that logic on the grounds that it is reusing a similarity computation.
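
A quick sketch of the note about skipping normalization (illustrative code, not from the PR): for a single fixed query vector, dividing by the query's own norm rescales every candidate's score by the same constant, so only the candidate norms affect the ordering.

```scala
// Proportional to the full cosine similarity for a fixed query vector,
// because the query's norm is the same constant for every candidate.
def scoreForRanking(query: Array[Double], candidate: Array[Double]): Double = {
  val dotProduct = query.zip(candidate).map { case (x, y) => x * y }.sum
  val candidateNorm = math.sqrt(candidate.map(x => x * x).sum)
  dotProduct / candidateNorm
}
```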

@MLnick
Contributor

MLnick commented Dec 9, 2014

I'd agree that cosine similarity is preferred. Can't really think of a case where I've not used cosine sim for a similar items or similar users computation. Of course, it could be added as an option potentially (whether to use cosine sim - default - or plain dot product).

@debasish83

Can we discuss this more on the JIRA? I updated it with my comments.

I think we should add an API called rowSimilarities() in RowMatrix.scala and use that to compute similar rows. We can send both ProductFactor and UserFactor to this API to compute the topK row similarities through map-reduce.

In RowMatrix, cosine is used right now, but Reza has plans to add Jaccard and other popular similarity measures.

We should be able to re-use the code to generate kernel matrices as well.

We can assume that tall-skinny matrices are being sent to rowSimilarities (we have a similar assumption for colSimilarities as well); product x rank and user x rank are both tall-skinny matrices.
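
As a rough sketch of what a rowSimilarities-style computation over a tall-skinny factor matrix could look like (hypothetical: RowMatrix has no such API, and a real implementation would block the matrix and use dense BLAS per block pair rather than a brute-force cartesian join):

```scala
import org.apache.spark.rdd.RDD

def topKRowSimilarities(rows: RDD[(Int, Array[Double])],
                        k: Int): RDD[(Int, Seq[(Int, Double)])] = {
  // Normalize each row to unit length so a dot product equals cosine similarity.
  val normalized = rows.mapValues { v =>
    val n = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / n)
  }.cache()

  // Brute-force all-pairs pass, kept only to show the idea.
  normalized.cartesian(normalized)
    .filter { case ((i, _), (j, _)) => i != j }
    .map { case ((i, vi), (j, vj)) =>
      (i, (j, vi.zip(vj).map { case (x, y) => x * y }.sum))
    }
    .groupByKey()
    .mapValues(_.toSeq.sortBy(-_._2).take(k))
}
```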

@srowen
Member

srowen commented Mar 19, 2015

I kind of like this functionality; @sbourke are you in a position to continue with this? I think it needs a few typo fixes and commentary about its function. This is really about computing similar users and items rather than recommending items to users or users to items. I also think it has to include dividing by the norms to be cosine similarity.

@debasish83

@srowen we have to be a bit careful, since dense BLAS has to be used. I have an internal version using dot products and it needs to be much faster; computing one similarity at a time is not a good idea. There has to be a block dense matrix * block dense matrix operation, so that we can reuse native dgemm.
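
A small sketch of the blocked dense-BLAS idea (assuming Breeze, which MLlib already depends on; names are illustrative): score a block of query factor vectors against a block of item factor vectors with a single matrix-matrix multiply instead of one dot product at a time.

```scala
import breeze.linalg.DenseMatrix

// queries: q x rank, items: n x rank; rows are (unit-normalized) factor vectors.
def blockScores(queries: DenseMatrix[Double],
                items: DenseMatrix[Double]): DenseMatrix[Double] = {
  // One dense matrix-matrix multiply (typically dispatched to native gemm)
  // produces the whole q x n block of similarity scores at once.
  queries * items.t
}
```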

@srowen
Member

srowen commented Mar 19, 2015

Agree, although this is no worse than the existing implementation for recommendation (it even reuses it).

@debasish83

Agreed. Dense BLAS will be a common optimization for the item->item, user->user, and user->item APIs.

@debasish83

@MLnick @srowen I did an experiment where I computed brute-force topK similar items using cosine distance on the raw feature space, and compared the intersection with brute-force topK similar items computed from the item factors (also using cosine distance) after running implicit factorization. The intersection is only 42%. This is in line with the Google Correlate paper, where they have to do an additional reordering step in the real feature space to increase the recall (the intersection). Did you also see similar results for item->item validation?
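
For concreteness, one way the reported intersection could be computed (an assumption about the exact metric; the helper is hypothetical): for each item, take the fraction of its top-k neighbours under one method that also appear in its top-k neighbours under the other, and average over items.

```scala
def averageIntersection(topKRaw: Map[Int, Set[Int]],
                        topKFactor: Map[Int, Set[Int]],
                        k: Int): Double = {
  val perItem = topKRaw.map { case (item, rawNeighbours) =>
    val factorNeighbours = topKFactor.getOrElse(item, Set.empty[Int])
    rawNeighbours.intersect(factorNeighbours).size.toDouble / k
  }
  perItem.sum / perItem.size
}
```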

@MLnick
Contributor

MLnick commented May 5, 2015

Not sure I follow completely - do you mean you compared cosine sim between raw (ie "rating") item vectors, and cosine sim computed from item factor vectors? I would imagine they would be quite different...

I always just use factor vectors

@srowen
Member

srowen commented May 5, 2015

I have not benchmarked these since neither is a "correct" answer to benchmark against the other. The cosine similarity isn't really that valid in the original feature space. It might still be interesting to know how different the answers are but they're probably going to be fairly different on purpose.

@debasish83

@MLnick yes, that's what I did. I have to convince users why they should use factor vectors :-) For user->item recommendation, convincing is easy by showing the ranking improvement through ALS.

@srowen without a validation strategy, someone might propose running a different algorithm (KMeans on the raw feature space followed by an (item->cluster) join (cluster->items)) and claim their item->item results are better. How do we know whether the ALS-based flow or the KMeans-based flow is producing better results? NNALS can be thought of as a soft KMeans as well, so these flows are very similar.

I am focused on implicit feedback here, because only then can we run either KMeans or similarity on the raw feature space. With explicit feedback, I agree that cosine similarity is not valid in the original feature space. But in most practical datasets we are dealing with implicit feedback.

@debasish83

Let's continue the validation discussion on #6213. That PR introduces batch gemm-based similarity computation in MatrixFactorizationModel using a kernel abstraction. Do we also need the online version that Steven added, or can it be extracted from the batch results? My focus was more on speeding up the batch computation.

@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Contributor

rxin commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

@asfgit asfgit closed this in 7b4452b Dec 31, 2015