-
Notifications
You must be signed in to change notification settings - Fork 138
Fix recall on score ties #520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 4.0.0-beta.3
Are you sure you want to change the base?
Conversation
pkolaczk
commented
Sep 16, 2025
26c0c8b
to
b62c809
Compare
It is possible to insert non-equal vectors which score the same by the provided similarity measure. E.g. vectors (0.1, 0.1), (0.2, 0.2), (0.3, 0.3) all are really the same point under cosine metric and any pair of those would score 1.0 similarity. This edge case caused some serious issues with graph connectivity and the queries returned at most 33 nodes, even if for large graphs. This PR fixes it by improving fairness of node selection when nodes score the same. This is achieved by a small modification to how node ids are encoded in the NodeQueue. When nodes score the same, they were compared by node ids, which always preferred the nodes added earlier. That created a huge bias. If we reverse the bits of node ids, now this shuffles their order and breaks the systematic bias towards the older nodes. Another part of the fix is making sure nodes with the same score don't block backlinks to be formed. By placing new neighbours before the neighbours with the same score, we're giving them a chance to be linked to. It will drop some other neighbor from the list, but considering it was present in the graph for some time already, it is much more likely to be already well connected.
b62c809
to
aaf4136
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a critical issue where vectors with identical similarity scores could cause poor recall in graph connectivity. The fix involves improving fairness in node selection when scores are tied by reversing node ID bits instead of using raw node IDs, and modifying insertion order to place new neighbors before existing ones with the same score.
Key changes:
- Modified node ID encoding in NodeQueue to use bit reversal instead of complement for fairer tie-breaking
- Changed NodeArray insertion to use leftmost insertion point instead of rightmost for same-score nodes
- Added comprehensive test coverage for the cosine similarity edge case
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
File | Description |
---|---|
NodeQueue.java | Updates node encoding to use Integer.reverse() instead of bitwise complement for fairer ordering |
NodeArray.java | Changes insertion logic to place new nodes before existing ones with same scores |
TestVectorGraph.java | Adds test case for vectors with identical cosine similarity scores |
TestNodeArray.java | Updates existing test expectations to match new tie-breaking behavior |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Even after those changes the number of results is sometimes less than 50% of the graph and the test is a bit flaky. I'm kinda reluctant to decrease the threshold in a test further before understanding fully why the graph still falls apart. |
There is something fishy going on here. The tests are sometimes failing with returning even fewer vectors than 33; which is unexpected to me considering each vector has space to connect to 32 neighbors. E.g. some return just 1 or 2 vectors. This happens only 1-10 / 1000 times. So far I debugged those failed runs to have some issue when constructing the graph.
|
@pkolaczk we are investigating a potentially related issue. Let's wait on this for a minute. |