[MLLIB][SPARK-4675] Find similar products and similar users in MatrixFactorizationModel #3536
Conversation
Typo: Similari, also below.
|
I think it's essential to explain (even in internal comments, or this PR) what the similarity metric is. It's just ranking by dot product, which makes it something like cosine similarity. The differences are that it isn't in [-1,1], and the result doesn't normalize away the length of the feature vectors. This tends to favor popular items, or mean that somewhat less similar items may rank higher because they're popular. I had traditionally viewed that as a negative, and preferred the more standard cosine similarity, but it's certainly up for debate. |
|
Re: explaining the similarity metric 👍 I'll do that. Re: cosine - no biggie to add. I used the dot product because 1) it follows the logic that CF finds "similar" items based on the latent space for a user when recommending products, and 2) using the dot product reduces the amount of new code added to MatrixFactorizationModel (I don't want to create clutter :)). So 👍 I will change to cosine. Re: popularity, I'll look into that as well. |
|
I wasn't necessarily suggesting changing the similarity metric, although I suppose my point is that the recommendation computation does not use a normalized metric like cosine similarity.
|
|
I'd agree that cosine similarity is preferred. I can't really think of a case where I've not used cosine sim for a similar-items or similar-users computation. Of course, it could potentially be added as an option (whether to use cosine sim - default - or plain dot product). |
|
Can we discuss it more on the JIRA? I updated it with my comments... I think we should add an API called rowSimilarities() in RowMatrix.scala and use that to compute similar rows... We can send both ProductFactor and UserFactor to this API to compute the top-K row similarities through map-reduce... In RowMatrix right now cosine is used, but Reza has plans to add Jaccard and other popular similarity measures... We should be able to re-use the code to generate kernel matrices as well... We can assume that tall-skinny matrices are being sent to rowSimilarities (we have a similar assumption for colSimilarities as well)... product x rank and user x rank are both tall-skinny matrices. |
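As a rough illustration of the rowSimilarities() idea (the method does not exist in RowMatrix at this point, so the signature and the brute-force cartesian implementation below are assumptions for discussion only, and would not scale to large factor matrices):

```scala
import org.apache.spark.SparkContext._   // pair-RDD implicits for older Spark versions
import org.apache.spark.rdd.RDD

// Hypothetical sketch: top-K cosine-similar rows of a tall-skinny (id -> factor vector) RDD.
def topKRowSimilarities(
    rows: RDD[(Int, Array[Double])],
    k: Int): RDD[(Int, Seq[(Int, Double)])] = {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na = math.sqrt(a.map(x => x * x).sum)
    val nb = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }
  rows.cartesian(rows)                                   // brute force: all row pairs
    .filter { case ((i, _), (j, _)) => i != j }
    .map { case ((i, a), (j, b)) => (i, (j, cosine(a, b))) }
    .groupByKey()
    .mapValues(sims => sims.toSeq.sortBy { case (_, sim) => -sim }.take(k))
}
```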
|
I kind of like this functionality; @sbourke are you in a position to continue with this? I think it needs a few typo fixes and commentary about its function. This is really about computing similar users and items rather than recommending items to users or users to items. I also think it has to include dividing by the norms to be cosine similarity. |
|
@srowen we have to be a bit careful since dense BLAS has to be used... I have an internal version with dot and it needs to be faster... also one at a time is not a good idea... there has to be a block dense matrix * block dense matrix operation... that way we can reuse native dgemm... |
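To illustrate the blocking point, here is a small sketch using Breeze (which Spark already depends on); a DenseMatrix-by-DenseMatrix multiply dispatches to native dgemm when netlib-java native libraries are available. The helper name and block layout are illustrative assumptions, not code from this PR:

```scala
import breeze.linalg.DenseMatrix

// Multiplying one block of factors (items x rank) by the transpose of another block
// computes all pairwise dot products between the two blocks in a single gemm call,
// instead of one dot product at a time.
def blockDotProducts(
    blockA: DenseMatrix[Double],   // itemsInBlockA x rank
    blockB: DenseMatrix[Double]    // itemsInBlockB x rank
  ): DenseMatrix[Double] =
  blockA * blockB.t                // itemsInBlockA x itemsInBlockB
```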
|
Agree, although this is no worse than the existing implementation for recommendation (it reuses it even). |
|
agreed... dense BLAS will be a common optimization for the item->item, user->user and user->item APIs... |
|
@MLnick @srowen I did an experiment where I computed brute-force top-K similar items using cosine distance and compared the intersection with item-factor-based brute-force top-K similar items using cosine distance after running implicit factorization... the intersection is only 42%... this is in line with the Google Correlate paper, where they have to do an additional reorder step in the real feature space to increase the recall (intersection)... did you guys also see similar results for item->item validation? |
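For reference, a tiny sketch of the overlap metric described here (names are illustrative, not taken from the experiment code): the fraction of a top-K neighbor list from the raw feature space that also appears in the factor-space top-K list.

```scala
// Fraction of overlap between two top-K neighbor lists for the same query item.
def topKIntersection(rawTopK: Seq[Int], factorTopK: Seq[Int]): Double = {
  require(rawTopK.nonEmpty, "need a non-empty top-K list")
  rawTopK.toSet.intersect(factorTopK.toSet).size.toDouble / rawTopK.size
}

// e.g. topKIntersection(Seq(1, 2, 3, 4, 5), Seq(2, 3, 9, 1, 7)) == 0.6
```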
|
Not sure I follow completely - do you mean you compared cosine sim between raw (ie "rating") item vectors, and cosine sim computed from item factor vectors? I would imagine they would be quite different... I always just use factor vectors |
|
I have not benchmarked these since neither is a "correct" answer to benchmark against the other. The cosine similarity isn't really that valid in the original feature space. It might still be interesting to know how different the answers are but they're probably going to be fairly different on purpose. |
|
@MLnick yes, that's what I did... I have to convince users why to use factor vectors :-) For user->item recommendation, convincing is easy by showing the ranking improvement through ALS. @srowen without a validation strategy, someone might propose running a different algorithm (KMeans on the raw feature space followed by an (item->cluster) join (cluster->items)) and claim that his item->item results are better... how do we know whether the ALS-based flow or the KMeans-based flow produces better results? NNALS can be thought of as soft-KMeans as well, so these flows are very similar. I am focused on implicit feedback here because only then can we run either KMeans or similarity on the raw feature space... With explicit feedback, I agree that cosine similarity is not valid in the original feature space, but in most practical datasets we are dealing with implicit feedback. |
|
Let's continue the validation discussion on #6213. That PR introduces batch gemm-based similarity computation in MatrixFactorizationModel using a kernel abstraction. Do we also need the online version that Steven added, or can it be extracted out of the batch results? My focus was more on speeding up batch computation... |
|
Can one of the admins verify this patch? |
|
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks! |
Using the latent feature space that is learnt in MatrixFactorizationModel, I have added two new functions to find similar products and similar users. A user of the API can, for example, pass a product ID and get the closest products based on the feature space.
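For illustration only, here is a minimal sketch of what such a similarProducts helper could look like if built solely on the public productFeatures RDD, using cosine similarity as discussed in the conversation; this is not the PR's actual implementation:

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Hypothetical sketch: return the `num` products whose factor vectors are most
// cosine-similar to the factor vector of `productId`.
def similarProducts(
    model: MatrixFactorizationModel,
    productId: Int,
    num: Int): Array[(Int, Double)] = {
  val query = model.productFeatures.lookup(productId).head
  val queryNorm = math.sqrt(query.map(x => x * x).sum)
  model.productFeatures
    .filter { case (id, _) => id != productId }
    .map { case (id, features) =>
      val dot = query.zip(features).map { case (x, y) => x * y }.sum
      val norm = math.sqrt(features.map(x => x * x).sum)
      (id, dot / (queryNorm * norm))
    }
    .top(num)(Ordering.by[(Int, Double), Double](_._2))
}
```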