Ensure that no node duplicates exist in the adjacency list of any node. #522
base: main
Conversation
LGTM. Would like to see the results of the perf benchmark GHA.
Really good catch. I left some minor nits inline. I agree with Ted regarding measuring the perf, but I'm woefully behind the times on the work you've done there, so I'll hold off on approving since I don't know the standards.
```java
// If elements remain in a2, add them
if (j < a2.size()) {
    // avoid duplicates while adding nodes with the same score
```
stale comment describing old strategy -- can clean up on commit to make this clearer
```java
// If elements remain in a1, add them
if (i < a1.size()) {
    // avoid duplicates while adding nodes with the same score
```
stale comment describing old strategy -- can clean up on commit
```java
private boolean duplicateExists(int insertionPoint, int newNode) {
    for (int i = insertionPoint - 1; i >= 0 && scores[i] == newScore; i--) {
        if (nodes[i] == newNode) {
            return true;
```
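The fragment above can be fleshed out into a runnable, self-contained sketch. The class below and the `newScore` parameter are assumptions for illustration (in JVector these arrays live inside NodeArray); it also demonstrates why a score-bounded scan misses a duplicate whose two scores differ, which is exactly the bug this PR addresses:

```java
// Hypothetical standalone sketch of the score-bounded duplicate check.
// In NodeArray the parallel nodes/scores arrays are kept sorted by score,
// so entries with equal scores are adjacent to the insertion point.
public class DuplicateCheckSketch {
    final int[] nodes;
    final float[] scores;
    final int size;

    DuplicateCheckSketch(int[] nodes, float[] scores) {
        this.nodes = nodes;
        this.scores = scores;
        this.size = nodes.length;
    }

    // Scan outward from the insertion point only while scores equal newScore.
    boolean duplicateExists(int insertionPoint, int newNode, float newScore) {
        for (int i = insertionPoint - 1; i >= 0 && scores[i] == newScore; i--) {
            if (nodes[i] == newNode) return true;
        }
        for (int i = insertionPoint; i < size && scores[i] == newScore; i++) {
            if (nodes[i] == newNode) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        DuplicateCheckSketch a = new DuplicateCheckSketch(
                new int[]{7, 3, 9, 4}, new float[]{0.9f, 0.8f, 0.8f, 0.5f});
        // Same node, same score: the duplicate is found.
        if (!a.duplicateExists(1, 9, 0.8f)) throw new AssertionError();
        // Same node (9) arriving with a different score (0.7, insertion point 3):
        // the score-bounded scan stops early and MISSES the duplicate.
        if (a.duplicateExists(3, 9, 0.7f)) throw new AssertionError();
    }
}
```

The second check in `main` is the failure mode described in the PR description: lossy quantization can yield two different scores for the same node, so bounding the scan by score no longer guarantees uniqueness.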
I think this could be marginally cleaner -- we check nodes[insertionPoint] in each conditional, and the loop bounds are unnecessarily pessimistic (i.e., if the insertion point is right in the middle, we'll waste a bunch of loop iterations in both directions, when we could just go to the max radius around the insertion point needed to hit one end of the array). No idea if this will matter in practice, but it might be worth measuring.
e.g., something like

```java
int n = this.size;
int[] a = this.nodes;
int left = insertionPoint - 1;
int right = insertionPoint;
// Exact hit fast path
if (right < n && a[right] == newNode) return true;
// Expand outward
while (left >= 0 || right < n) {
    if (left >= 0 && a[left] == newNode) return true;
    if (right < n && a[right] == newNode) return true;
    left--;
    right++;
}
return false;
```
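For anyone who wants to try the suggestion, here is the same outward-expanding scan wrapped in a minimal runnable harness. The class and field names are illustrative, not JVector's actual API:

```java
// Standalone version of the reviewer's outward-expanding duplicate scan.
public class OutwardScanSketch {
    final int[] nodes;
    final int size;

    OutwardScanSketch(int[] nodes) {
        this.nodes = nodes;
        this.size = nodes.length;
    }

    boolean duplicateExists(int insertionPoint, int newNode) {
        int n = this.size;
        int[] a = this.nodes;
        int left = insertionPoint - 1;
        int right = insertionPoint;
        // Exact hit fast path
        if (right < n && a[right] == newNode) return true;
        // Expand outward until both ends of the array are exhausted
        while (left >= 0 || right < n) {
            if (left >= 0 && a[left] == newNode) return true;
            if (right < n && a[right] == newNode) return true;
            left--;
            right++;
        }
        return false;
    }

    public static void main(String[] args) {
        OutwardScanSketch s = new OutwardScanSketch(new int[]{7, 3, 9, 4});
        if (!s.duplicateExists(2, 9)) throw new AssertionError(); // fast-path hit
        if (!s.duplicateExists(3, 7)) throw new AssertionError(); // found to the left
        if (s.duplicateExists(2, 5)) throw new AssertionError();  // absent node
    }
}
```

Note that, unlike the score-bounded scan, this version always terminates only after both directions are exhausted (it trades the early score-based cutoff for correctness when scores are unreliable), so entries close to the insertion point are found quickly.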
Initial benchmarking shows that index construction got a bit slower (~10%), which is not surprising. Even if this PR moves things in the right direction conceptually, in practice it does not offer a net benefit. I think we need a better solution, not just a patch. I will convert the PR to a draft until that superior solution is in place.
The most recent commits overhaul the strategy used to ensure that edges in the graph are unique. NodeArray has a method
…st the expected size down by 4 in GraphIndexBuilderTest.testEstimatedBytes
Most nodes in a graph have MANY duplicated entries in their adjacency list. This is undesirable as it reduces the effective degree of the graph.
This was occurring because scores were the primary mechanism for checking whether a node had been inserted twice into an adjacency list. However, when we build a graph with PQ we use a mix of `sim(x1, quant(x2))` and `sim(quant(x1), quant(x2))` similarities. Because quantization is lossy, `sim(x1, quant(x2)) != sim(quant(x1), quant(x2))`. Thus, we can end up with two different scores associated with a given node, depending on which quantized similarity we use.

This PR changes the way we compute duplicates in NodeArray to prevent the emergence of these duplicates: we now use only the node ordinals for these checks.
One potential future improvement is to order every adjacency list by node ordinal, so that duplicate checks are faster. There is a fine tradeoff between accelerating these checks and slowing down the diversity computation, which needs to be analyzed carefully.
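The improvement sketched above would reduce the duplicate check to a binary search. A hedged illustration, assuming a hypothetical ordinal-sorted adjacency array (not JVector's actual layout, which is score-sorted):

```java
import java.util.Arrays;

// Sketch: if adjacency lists were kept sorted by node ordinal, the duplicate
// check becomes O(log n) via binary search instead of a linear scan.
public class OrdinalSortedSketch {
    // Returns true if `ordinal` already appears in the ordinal-sorted adjacency.
    static boolean containsOrdinal(int[] sortedAdjacency, int ordinal) {
        // Arrays.binarySearch returns a non-negative index on an exact match.
        return Arrays.binarySearch(sortedAdjacency, ordinal) >= 0;
    }

    public static void main(String[] args) {
        int[] adjacency = {2, 5, 9, 14}; // sorted by ordinal
        if (!containsOrdinal(adjacency, 9)) throw new AssertionError();
        if (containsOrdinal(adjacency, 7)) throw new AssertionError();
    }
}
```

The tradeoff mentioned in the text is that the diversity computation consumes neighbors in score order, so an ordinal-sorted layout would either require a second index or a re-sort per pruning pass.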