Replies: 2 comments
-
Hi, that's a fair question for both potential contributors and people who want to know this for privacy reasons. I'll begin by admitting that the documentation isn't the best, partially because I didn't expect so many people to be interested in this. I'll explain the whole process (vectorise and query) here so that you (and other people interested in this) can have a better idea of what happens behind the scenes and of any privacy concerns you may have.

This project uses ChromaDB as the backend. It's a vector database that stores the embedding vectors and handles the storage and retrieval of the data. It contains some telemetry, which I tried my best, from my end, to disable. For each project, VectorCode creates a new collection and puts all the embeddings for that project in it. This isolates projects from each other so that your queries are project-specific.

When you vectorise files, they first go through a chunker that splits large documents into smaller pieces so that embedding models can handle them properly without losing too much information. There are two different chunking strategies for files; the simpler one is a naive chunker that just splits text into fixed-size pieces (it's also what gets used for query messages, as described below).
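To make the vectorise step a bit more concrete, here's a minimal sketch of what it roughly boils down to. The paths, IDs and chunking parameters are made up for illustration; this is not the actual VectorCode implementation.

```python
import chromadb
from chromadb.config import Settings


def naive_chunk(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Naive chunking: split text into fixed-size, slightly overlapping pieces."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


# Local, persistent ChromaDB client with its telemetry switched off.
client = chromadb.PersistentClient(
    path="/path/to/local/chromadb",  # hypothetical path; the data stays on your machine
    settings=Settings(anonymized_telemetry=False),
)

# One collection per project keeps queries project-specific.
collection = client.get_or_create_collection(name="my_project")

with open("src/main.py") as f:
    chunks = naive_chunk(f.read())

# ChromaDB embeds the documents with the collection's embedding function
# (more on embedding models below) and stores the vectors.
collection.add(
    ids=[f"src/main.py:{i}" for i in range(len(chunks))],
    documents=chunks,
    metadatas=[{"path": "src/main.py"} for _ in chunks],
)
```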
You can preview the chunks by running the corresponding CLI command.

As for embedding, Sentence Transformers is chosen mostly because it's easy to use. It's not only for natural languages; there are models supported by Sentence Transformers that specialise in code embeddings. Also, using an embedding function that works for natural language means that you can vectorise documentation and docstrings, which makes the embedding model more generalisable than one that only uses the AST.

[Screenshot: using codecompanion.nvim + VectorCode to talk to an LLM about the ArchLinux Wiki (installed locally from this package).]
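In case it helps, here's a minimal sketch of the embedding step on its own, assuming a Sentence Transformers model; the model name is just an example, not necessarily what VectorCode uses by default:

```python
from sentence_transformers import SentenceTransformer

# Minimal sketch of the embedding step. The model name is only an example;
# any model supported by ChromaDB (Sentence Transformers, Ollama, ...) can be used.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "def parse_config(path): ...",               # a code chunk
    "Parse the configuration file at `path`.",   # a docstring / documentation chunk
]

# Each chunk becomes a fixed-length vector; similar text ends up close together.
embeddings = model.encode(chunks)
print(embeddings.shape)  # (2, 384) for this particular model
```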
Sentence Transformers is also not your only option. Any embedding model supported by ChromaDB should work (although I've only tried Sentence Transformers and Ollama), so you have the option to switch to your own choice. I was quite limited in terms of compute power (2c4t with no discrete GPU), so I didn't have the opportunity to experiment with more embedding models, and I can't comment on the effectiveness of alternative embedding models for now, but I will do that once I can run some evaluations on better hardware.

When making queries, the query messages will be chunked too (if they're too long), using the naive chunking method. Each query keyword (or chunk) will get its own best-matched chunks, and those get merged into the final result.

I hope this answers most of your questions.
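P.S. If it helps to see the query side as code too, here's a rough sketch of what a query roughly does, reusing the hypothetical `naive_chunk` and `collection` from the sketch above (again illustrative, not the exact implementation):

```python
# Rough sketch of the query step; illustrative, not the actual implementation.
query = "where is the config file parsed?"

# Long query messages are split with the same naive chunker shown above.
query_chunks = naive_chunk(query) if len(query) > 400 else [query]

# Each query chunk is embedded and matched against the project's collection.
results = collection.query(query_texts=query_chunks, n_results=5)

# `results` holds one list of matches per query chunk; keep the closest match
# per file and map the chunks back to the files they came from.
best = {}
for ids, dists, metas in zip(results["ids"], results["distances"], results["metadatas"]):
    for chunk_id, dist, meta in zip(ids, dists, metas):
        path = meta["path"]
        if path not in best or dist < best[path]:
            best[path] = dist

for path, dist in sorted(best.items(), key=lambda kv: kv[1]):
    print(f"{dist:.3f}  {path}")
```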
-
This plugin makes some sort of embedding of the source files into a vector space, so that a nearest-distance metric can more easily find files related to a particular query.
I think that it's important to make clear what is happening here.
As far as I understand, SentenceTransformer is for natural-language embeddings.
What consequence does this have for the quality of our embeddings? You mention some basic chunking procedure, but don't go into detail on what is actually being done.
I'd like to understand, as a lay person, what effect the selected embedding model has on the effectiveness of the vectorization for search. Could you please make the technical details more accessible?