I've thought about this before. For coding tasks, and especially given that VectorCode is primarily used with Neovim, a dedicated "graph database" may be overkill, because LSPs (those that support definitions, declarations, and so on) already do much of what a graph database is supposed to do. They build graphs of the symbols in the repository and provide linked information such as call hierarchies, symbol definitions, and references. Codecompanion.nvim recently added a new tool that allows the LLM to retrieve information from the LSP, and I have a personal toy project that gives the LLM access to a DAP session. Both use existing toolchains to provide context in a graph-like manner.
A semantic search tool like VectorCode could supplement this for closed-source libraries where only the documentation is available, but that is a much harder problem because it requires several tools to cooperate, which many LLMs still struggle with.
That said, I'm slowly refactoring VectorCode to support multiple database backends (currently only chromadb <1.0.0 is supported). When that's done, I'd be interested in looking more closely at graph databases.
First, congratulations on VectorCode; its semantic capabilities are top-notch.
I believe VectorCode could achieve a new level of code intelligence by adopting this hybrid approach. This architecture not only enhances its standalone capabilities but also provides a direct solution to some of the most requested (and most difficult) features in today's AI coding assistants.
The Architecture: A Three-Step Pipeline
This model uses each tool for what it does best, creating a comprehensive understanding of code from text to execution flow.
1. Parse with Tree-sitter 🌳
As the foundational step, Tree-sitter reads raw source code and parses it into a detailed syntax tree. This provides the clean, structured data needed for all further analysis.
2. Build the Structure with a Graph Database 🔗
This is where my key finding comes into play. By traversing the Tree-sitter output, we can populate a graph database with the code's architecture, including relationships such as CALLS, IMPORTS, and INHERITS_FROM.
The graph database becomes the definitive source for structural truth.
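As a rough illustration of step 2, here is a minimal sketch that walks a syntax tree and emits CALLS, IMPORTS, and INHERITS_FROM edges. It makes several assumptions that are not part of the proposal: Python's standard-library `ast` module stands in for Tree-sitter, a plain set of tuples stands in for the graph database, and the names `extract_edges` and `SAMPLE` are hypothetical.

```python
import ast


def extract_edges(module_name: str, source: str) -> set[tuple[str, str, str]]:
    """Return (source_node, relationship, target_node) triples for one module.

    Assumption: only direct name references are resolved. Method calls such as
    obj.method() and dotted base classes are skipped in this sketch.
    """
    edges = set()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                edges.add((module_name, "IMPORTS", alias.name))
        elif isinstance(node, ast.ImportFrom) and node.module:
            edges.add((module_name, "IMPORTS", node.module))
        elif isinstance(node, ast.ClassDef):
            for base in node.bases:
                if isinstance(base, ast.Name):
                    edges.add((node.name, "INHERITS_FROM", base.id))
        elif isinstance(node, ast.FunctionDef):
            # Calls inside nested functions are also attributed to the outer one.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    edges.add((node.name, "CALLS", inner.func.id))
    return edges


# Hypothetical sample module used only to exercise the sketch.
SAMPLE = '''
import json
from models import User

class Admin(User):
    pass

def load(path):
    return json.loads(read_file(path))

def main():
    load("users.json")
'''

edges = extract_edges("app", SAMPLE)
for edge in sorted(edges):
    print(edge)
```

In a real pipeline each triple would be written to the graph database as a relationship between two nodes, and name resolution would need to be considerably more careful than this.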
3. Understand Meaning with a Vector Database 🧠
The vector database continues its vital role in storing embeddings for semantic search. Each node in our graph (e.g., the process_user_data function node) would be linked to its corresponding vector embedding, fusing the structural and semantic models.
Solving Unanswered Feature Needs (e.g., in Aider)
AI coding tools like Aider often receive feature requests for better repository-wide context. Users want the AI to understand the entire codebase, not just the files they've manually added to the chat. The graph architecture solves this elegantly.
The "Repo Map" is the Graph Itself
The most common request is for a "repository map." With this architecture, the graph database is the repo map—a live, queryable model of the entire project.
A simple Cypher query (MATCH (caller)-[:CALLS]->(this_function) RETURN caller) instantly provides the LLM with a complete list of callers. This is the exact context it needs to make informed decisions.
Answering Complex, Multi-File Questions
This architecture allows the AI to answer questions and perform tasks that require understanding relationships across many files.
For example, take a request such as "… the User class and update all of its usages throughout the project." The assistant can find the User class node in the graph, then find every node that has an IMPORTS or CALLS relationship with the User class.
This proactive, graph-based context gathering is far more robust and scalable than manually listing files or relying on simple text searches. It directly addresses the core need for an AI assistant that truly understands the architecture of the project it's working on.
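The User lookup above, combined with the node-to-embedding link from step 3, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the edge list, the node descriptions, and the function names are hypothetical, and the bag-of-words `embed` function is only a toy stand-in for a real embedding model and vector database.

```python
import math
from collections import Counter

# Hypothetical graph contents: (source, relationship, target) triples.
EDGES = [
    ("auth.py", "IMPORTS", "User"),
    ("auth.login", "CALLS", "User"),
    ("report.py", "IMPORTS", "User"),
    ("report.render", "CALLS", "format_date"),
    ("Admin", "INHERITS_FROM", "User"),
]

# Hypothetical per-node text that would normally be embedded and stored.
NODE_TEXT = {
    "User": "class representing an account holder with name and email",
    "format_date": "function that formats a date as a string",
}


def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_node(query):
    """Semantic step: pick the graph node whose text best matches the query."""
    q = embed(query)
    return max(NODE_TEXT, key=lambda name: cosine(q, embed(NODE_TEXT[name])))


def usages(target, relationships=("IMPORTS", "CALLS")):
    """Structural step: nodes with one of the given relationships to target."""
    return sorted({src for src, rel, dst in EDGES
                   if dst == target and rel in relationships})


entry = find_node("the class for an account holder")
affected = usages(entry)
print(entry, affected)
```

The semantic step picks the entry node and the structural step expands it into the set of places that would need attention. INHERITS_FROM edges are left out of the default only because the post names IMPORTS and CALLS for this lookup.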