Proposal: A Hybrid Graph & Vector Architecture for Code Intelligence #257

afidegnum · 2025-07-27T05:55:39Z

First, congratulations on VectorCode; its semantic capabilities are top-notch.

I'm writing to propose an architectural enhancement based on my own research into code intelligence. Like many, I started by exploring document and vector databases to understand code. While powerful for semantic meaning, I found a crucial missing piece. The breakthrough came when I introduced a graph database, which proved to be the perfect ally for modeling the explicit structure of a codebase.

I believe VectorCode could achieve a new level of code intelligence by adopting this hybrid approach. This architecture not only enhances its standalone capabilities but also provides a direct solution to some of the most requested—and difficult—features in today's AI coding assistants.

The Architecture: A Three-Step Pipeline

This model uses each tool for what it does best, creating a comprehensive understanding of code from text to execution flow.

1. Parse with Tree-sitter 🌳

As the foundational step, Tree-sitter reads raw source code and parses it into a detailed syntax tree. This provides the clean, structured data needed for all further analysis.

2. Build the Structure with a Graph Database 🔗

This is where my key finding comes into play. By traversing the Tree-sitter output, we can populate a graph database with the code's architecture:

Nodes: Functions, classes, files, variables.
Edges: CALLS, IMPORTS, INHERITS_FROM.

The graph database becomes the definitive source for structural truth.
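A minimal sketch of steps 1 and 2, to make the idea concrete. To keep it self-contained it swaps in Python's standard-library ast module for Tree-sitter and plain Python sets for the graph database; the sample source and the build_call_graph helper are hypothetical. It only resolves direct calls by bare name, which is far less than a real extractor would need.

```python
import ast

# Hypothetical sample source to analyse.
SOURCE = '''
def load(path):
    return open(path).read()

def process_user_data(path):
    raw = load(path)
    return raw.strip()

def main():
    process_user_data("users.csv")
'''


def build_call_graph(source: str):
    """Return (nodes, edges) for the functions defined in `source`.

    nodes: set of function names
    edges: set of (caller, "CALLS", callee) triples
    """
    tree = ast.parse(source)
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    nodes = {fn.name for fn in functions}
    edges = set()
    for fn in functions:
        for sub in ast.walk(fn):
            # Only direct calls by bare name to functions defined in this
            # source; method calls and imported names are ignored here.
            if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                if sub.func.id in nodes:
                    edges.add((fn.name, "CALLS", sub.func.id))
    return nodes, edges


nodes, edges = build_call_graph(SOURCE)
for edge in sorted(edges):
    print(edge)
```

In a real pipeline the same triples would be written to the graph store instead of a Python set, and Tree-sitter would supply the syntax tree for every supported language.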

3. Understand Meaning with a Vector Database 🧠

The vector database continues its vital role in storing embeddings for semantic search. Each node in our graph (e.g., the process_user_data function node) would be linked to its corresponding vector embedding, fusing the structural and semantic models.
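One way to picture the link between the two stores is a shared node ID. The sketch below is an illustration only: the token-count "embedding" is a toy stand-in for a real embedding model, the dictionaries stand in for the graph and vector databases, and all names and snippets are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical graph nodes: node id -> text that would be embedded.
NODES = {
    "process_user_data": "def process_user_data(path): parse user records from csv file",
    "send_email": "def send_email(to, body): deliver email message via smtp",
    "load": "def load(path): read file contents from disk",
}


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (not a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The "vector store" is keyed by the same id as the graph node, so a
# semantic hit can be followed straight into the structural graph.
VECTORS = {node_id: embed(text) for node_id, text in NODES.items()}


def semantic_search(query: str, k: int = 1):
    """Return the ids of the k nodes most similar to the query."""
    q = embed(query)
    ranked = sorted(VECTORS, key=lambda n: cosine(q, VECTORS[n]), reverse=True)
    return ranked[:k]


print(semantic_search("user records csv"))
```

The returned IDs are graph node IDs, so the next step could be a structural query such as "who calls this function?".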

Solving Unanswered Feature Needs (e.g., in Aider)

AI coding tools like Aider often receive feature requests for better repository-wide context. Users want the AI to understand the entire codebase, not just the files they've manually added to the chat. The graph architecture solves this elegantly.

The "Repo Map" is the Graph Itself

The most common request is for a "repository map." With this architecture, the graph database is the repo map—a live, queryable model of the entire project.

Problem: An LLM is asked to modify a function but doesn't know what other functions call it, leading to unsafe changes.
Solution: Before generating code, a quick graph query with the target function bound by name (MATCH (caller)-[:CALLS]->(f {name: $target}) RETURN caller) gives the LLM the list of statically known callers. This is the context it needs to make informed decisions.
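The same lookup, sketched in plain Python over a list of edge triples so it runs without a graph database. The edge data and the helper names are hypothetical; a real store would answer this with an index on the CALLS relationship.

```python
from collections import defaultdict

# Hypothetical edges as (source, relationship, target) triples.
EDGES = [
    ("main", "CALLS", "process_user_data"),
    ("batch_job", "CALLS", "process_user_data"),
    ("process_user_data", "CALLS", "load"),
    ("main", "IMPORTS", "config"),
]


def build_reverse_index(edges, rel="CALLS"):
    """Map each target to the set of sources that point at it via `rel`."""
    index = defaultdict(set)
    for src, kind, dst in edges:
        if kind == rel:
            index[dst].add(src)
    return index


CALLERS = build_reverse_index(EDGES)


def callers_of(name: str):
    """Return the known callers of `name`, sorted for stable output."""
    return sorted(CALLERS.get(name, set()))


print(callers_of("process_user_data"))
```

The result could be added to the prompt before the LLM is asked to change the function. Calls made through dynamic dispatch or reflection would not show up unless the extractor records them.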

Answering Complex, Multi-File Questions

This architecture allows the AI to answer questions and perform tasks that require understanding relationships across many files.

User Request: "Refactor the User class and update all of its usages throughout the project."
How it Works:
1. The system locates the User class node in the graph.
2. It performs a graph traversal to find every node that has an IMPORTS, CALLS, or INHERITS_FROM relationship with the User class.
3. This collection of files and functions forms the complete scope for the refactoring task, which can be automatically provided to the LLM.
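The three steps above can be sketched as follows. The node names, file paths, and the refactor_scope helper are hypothetical, and including INHERITS_FROM alongside IMPORTS and CALLS is an assumption on my part so that subclasses are not missed.

```python
# Hypothetical mapping from graph node to the file that defines it.
NODE_FILES = {
    "User": "models/user.py",
    "AdminUser": "models/admin.py",
    "create_user": "services/signup.py",
    "render_profile": "views/profile.py",
    "send_email": "services/mail.py",
}

# Hypothetical edges as (source, relationship, target) triples.
EDGES = [
    ("AdminUser", "INHERITS_FROM", "User"),
    ("create_user", "CALLS", "User"),
    ("render_profile", "IMPORTS", "User"),
    ("create_user", "CALLS", "send_email"),
]


def refactor_scope(target, edges, node_files,
                   rels=("IMPORTS", "CALLS", "INHERITS_FROM")):
    """Return (symbols, files) that directly depend on `target`.

    The target itself is included, since it is the thing being refactored.
    """
    dependents = {src for src, kind, dst in edges
                  if dst == target and kind in rels}
    symbols = sorted(dependents | {target})
    files = sorted({node_files[s] for s in symbols})
    return symbols, files


symbols, files = refactor_scope("User", EDGES, NODE_FILES)
print(symbols)
print(files)
```

This only follows one hop. Whether to also pull in indirect dependents is a design choice that trades context size against completeness.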

This proactive, graph-based context gathering is far more robust and scalable than manually listing files or relying on simple text searches. It directly addresses the core need for an AI assistant that truly understands the architecture of the project it's working on.

Davidyz (Maintainer) · 2025-07-27T08:52:37Z

I've thought about this before, and I feel like, in coding tasks (especially considering the fact that VectorCode is primarily used with neovim), a dedicated "graph database" may be overkill, because LSPs (those which support definitions/declarations, etc.) are essentially doing what graph databases are supposed to do. They build graphs of symbols in the repository and provide linked information such as call hierarchies, symbol definitions and references, and more. Codecompanion.nvim recently added a new tool that allows the LLM to retrieve information from LSP, and I have a personal toy project that gives the LLM access to a DAP session. Both of them utilise existing tool chains to provide context in a graph-like manner. Maybe a semantic search tool (like VectorCode) can supplement this on closed-source libraries where only the documentation is available, but that's a whole different level because it'll involve the cooperation of multiple tools, which many LLMs still struggle with.

That said, I'm slowly refactoring VectorCode to support multiple database backends (currently there's only chromadb <1.0.0). When that's done, I'd be interested to look more closely at graph databases.

2 replies

afidegnum Jul 27, 2025
Author

Thanks for the thoughtful and detailed reply. You've hit on a fantastic point, and I completely agree with your core observation: LSPs are, in essence, building and serving a real-time graph of the codebase. For interactive, in-the-moment coding tasks like "find references," they are absolutely the right tool, and leveraging them as Codecompanion.nvim does is a very smart approach.

My suggestion for a graph database wasn't intended as a replacement for the LSP's real-time capabilities, but as a complementary tool for a different purpose: deep, whole-repo analytical queries.

Here’s how I see the distinction:

LSP Graph vs. A Persistent Analytical Graph

The LSP Graph is for the "Now": It's a live, often in-memory model designed for low-latency editor feedback. It excels at answering immediate questions like "What calls this specific function?" or "Where is this variable defined?"
A Persistent Graph DB is for the "Why" and "What If": This is a database that you build and enrich over time. Because it's persistent and has a powerful query language (like Cypher or Gremlin), it can answer much more complex, analytical questions that an LSP isn't designed for, such as:
- Architectural Analysis: "Show me all API endpoints that have a dependency path to a function marked with // DEPRECATED."
- Security Auditing: "Trace all possible execution paths from any function that handles user input to this vulnerable library function."
- Data Fusion: "Find all functions that were modified by 'John Doe' in the last month (from Git history), have high cyclomatic complexity (from a static analysis tool), and also touch our payment processing logic (from the code structure)."
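As a rough illustration of the first kind of query, here is a breadth-first search for a dependency path from each API endpoint to any deprecated function. The call graph, the endpoint names, and the DEPRECATED set are all hypothetical; in a graph database this would be a variable-length path query rather than hand-written traversal code.

```python
from collections import deque

# Hypothetical call graph: node -> nodes it calls.
CALLS = {
    "GET /users": ["list_users"],
    "POST /pay": ["charge_card"],
    "list_users": ["query_db"],
    "charge_card": ["legacy_gateway", "query_db"],
    "legacy_gateway": [],
    "query_db": [],
}
DEPRECATED = {"legacy_gateway"}  # e.g. functions marked // DEPRECATED
ENDPOINTS = ["GET /users", "POST /pay"]


def path_to_any(start, targets, graph):
    """Breadth-first search; return a shortest path from start to any
    node in `targets`, or None if no target is reachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in targets:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def endpoints_reaching_deprecated():
    """Map each endpoint that can reach deprecated code to one such path."""
    result = {}
    for endpoint in ENDPOINTS:
        path = path_to_any(endpoint, DEPRECATED, CALLS)
        if path is not None:
            result[endpoint] = path
    return result


print(endpoints_reaching_deprecated())
```

The security-audit example has the same shape, with user-input handlers as the starting points and the vulnerable function as the target.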

Why This Matters for LLMs

This is where the real power for an AI assistant comes in. While an LSP can give an LLM context about a function's immediate callers, a query against the analytical graph can give it true architectural context. It can help the LLM reason more like a senior developer, understanding the ripple effects and design patterns of the entire system, not just the local file.

That's fantastic news that you're refactoring VectorCode to support multiple database backends. When that's done, a graph database could be a powerful new option—not to replace the LSP's role, but to enable this deeper, analytical layer of code intelligence.

I would be glad to contribute.

afidegnum Jul 27, 2025
Author

Since you will be looking into a graph approach in the future, there are many embedded graph databases you could look into; several of them combine vector and graph features.
