@sam-herman sam-herman commented Sep 12, 2025

Description

Multiple database projects, such as C* and OpenSearch/Solr/Lucene, use an LSM mechanism with frequent merges.
Today every merge forces us to reconstruct the entire graph from scratch, which creates high overhead.
A more economical approach would be to pick a leading graph that was previously persisted to disk and incrementally add the nodes of the smaller graphs to it.
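
To make the saving concrete, here is a minimal sketch that counts node insertions for a merge under both strategies. It assumes the leading graph is simply the largest segment, which the description does not specify; `MergePlanSketch` and its methods are hypothetical names for illustration and are not part of this PR.

```java
// Hypothetical illustration only; not code from this PR.
class MergePlanSketch {
    /** Index of the segment with the most nodes (assumed here to be the leader). */
    static int pickLeader(int[] segmentSizes) {
        if (segmentSizes.length == 0) {
            throw new IllegalArgumentException("no segments to merge");
        }
        int leader = 0;
        for (int i = 1; i < segmentSizes.length; i++) {
            if (segmentSizes[i] > segmentSizes[leader]) {
                leader = i;
            }
        }
        return leader;
    }

    /** Insertions needed when the leader is reused and only the other segments are added. */
    static int incrementalInsertions(int[] segmentSizes) {
        int leader = pickLeader(segmentSizes);
        int total = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            if (i != leader) {
                total += segmentSizes[i];
            }
        }
        return total;
    }

    /** Insertions needed when the merged graph is rebuilt from scratch. */
    static int fullRebuildInsertions(int[] segmentSizes) {
        int total = 0;
        for (int size : segmentSizes) {
            total += size;
        }
        return total;
    }
}
```

For segments of 1000, 50, and 20 nodes, the incremental plan performs 70 insertions where a full rebuild performs 1070.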

This PR makes some of the required changes to support that behavior in upstream systems such as C* and OpenSearch.

Changes

  1. Serializable Neighbor Distance Cache - Add a serializable cache that stores the node distances alongside the OnDiskGraphIndex, so the OnHeapGraphIndex can be re-created faster when the graph is read back from disk. The cache is separate and optional, so we can choose whether to apply it without any breaking changes to the current OnDiskGraphIndex. It can later be folded into the on-disk format if we choose to.
  2. Helper Methods and Constructors - Add helper methods and constructors to make the library easier to use from other projects with graph merge use cases.
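
The idea behind item 1 can be sketched as a small side structure that maps each node to its neighbor ids and precomputed distances and round-trips through a byte stream. This is only an illustration under assumptions: the class name `NeighborDistanceCacheSketch`, its methods, and the byte layout are all hypothetical and do not reflect the classes or the file format added by this PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the cache class from this PR.
class NeighborDistanceCacheSketch {
    /** Neighbor ids and the matching distances for one node. */
    static final class NodeDistances {
        final int[] neighbors;
        final float[] distances;

        NodeDistances(int[] neighbors, float[] distances) {
            if (neighbors.length != distances.length) {
                throw new IllegalArgumentException("neighbors and distances must have equal length");
            }
            this.neighbors = neighbors.clone();
            this.distances = distances.clone();
        }
    }

    private final Map<Integer, NodeDistances> entries = new LinkedHashMap<>();

    void put(int node, int[] neighbors, float[] distances) {
        entries.put(node, new NodeDistances(neighbors, distances));
    }

    NodeDistances get(int node) {
        return entries.get(node);
    }

    int size() {
        return entries.size();
    }

    /** Assumed layout: entry count, then per node: id, degree, (neighbor id, distance) pairs. */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(entries.size());
            for (Map.Entry<Integer, NodeDistances> e : entries.entrySet()) {
                NodeDistances nd = e.getValue();
                out.writeInt(e.getKey());
                out.writeInt(nd.neighbors.length);
                for (int i = 0; i < nd.neighbors.length; i++) {
                    out.writeInt(nd.neighbors[i]);
                    out.writeFloat(nd.distances[i]);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    /** Reads back the layout written by toBytes(). */
    static NeighborDistanceCacheSketch fromBytes(byte[] bytes) {
        NeighborDistanceCacheSketch cache = new NeighborDistanceCacheSketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int count = in.readInt();
            for (int n = 0; n < count; n++) {
                int node = in.readInt();
                int degree = in.readInt();
                int[] neighbors = new int[degree];
                float[] distances = new float[degree];
                for (int i = 0; i < degree; i++) {
                    neighbors[i] = in.readInt();
                    distances[i] = in.readFloat();
                }
                cache.put(node, neighbors, distances);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return cache;
    }
}
```

Because the sketch keeps the distances in a separate byte array, a reader that ignores the cache is unaffected, which mirrors the optional, non-breaking design described in item 1.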

Testing

Added tests for graph overlap and recall for reconstructed graphs.

@sam-herman sam-herman force-pushed the reconstruct-heap-graph-from-disk-graph branch from e30816a to 349bd07 Compare September 25, 2025 16:54
Signed-off-by: Samuel Herman <[email protected]>