Replies: 2 comments
-
Hi, that's a fair question for both potential contributors and people who want to know this for privacy reasons. I'll begin by admitting that the documentation isn't the best, partially because I didn't expect so many people to be interested in this. I'll explain the whole process (vectorise and query) here so that you (and other people interested in this) can have a better idea of what happens behind the scenes and of any privacy concerns you may have.

This project uses ChromaDB as the backend. It's a vector database that stores the embedding vectors and handles the storage and retrieval of the data. It contains some telemetry, which I tried my best, from my end, to disable. For each project, VectorCode creates a new collection and puts all the embeddings for that project in it. This isolates projects from each other so that your queries are project-specific.

When you vectorise files, they first go through a chunker that splits large documents into smaller pieces so that embedding models can handle them properly without losing too much information. There are two different chunking strategies for files; the simpler one is a naive chunker that just splits text into fixed-size pieces (it's also what gets used for query messages, as described below).
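To make the vectorise step a bit more concrete, here's a minimal sketch of what it roughly boils down to. The paths, IDs and chunking parameters are made up for illustration; this is not the actual VectorCode implementation.

```python
import chromadb
from chromadb.config import Settings


def naive_chunk(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Naive chunking: split text into fixed-size, slightly overlapping pieces."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


# Local, persistent ChromaDB client with its telemetry switched off.
client = chromadb.PersistentClient(
    path="/path/to/local/chromadb",  # hypothetical path; the data stays on your machine
    settings=Settings(anonymized_telemetry=False),
)

# One collection per project keeps queries project-specific.
collection = client.get_or_create_collection(name="my_project")

with open("src/main.py") as f:
    chunks = naive_chunk(f.read())

# ChromaDB embeds the documents with the collection's embedding function
# (more on embedding models below) and stores the vectors.
collection.add(
    ids=[f"src/main.py:{i}" for i in range(len(chunks))],
    documents=chunks,
    metadatas=[{"path": "src/main.py"} for _ in chunks],
)
```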
You can preview the chunks by running the corresponding CLI command.

As for embedding, Sentence Transformers is chosen mostly because it's easy to use. It's not only for natural languages; there are models supported by Sentence Transformers that specialise in code embeddings. Also, using an embedding function that works for natural language means that you can vectorise documentation and docstrings, which makes the embedding model more generalisable than one that only uses the AST.

[Screenshot: using codecompanion.nvim + VectorCode to talk to an LLM about the ArchLinux Wiki (installed locally from this package).]
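In case it helps, here's a minimal sketch of the embedding step on its own, assuming a Sentence Transformers model; the model name is just an example, not necessarily what VectorCode uses by default:

```python
from sentence_transformers import SentenceTransformer

# Minimal sketch of the embedding step. The model name is only an example;
# any model supported by ChromaDB (Sentence Transformers, Ollama, ...) can be used.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "def parse_config(path): ...",               # a code chunk
    "Parse the configuration file at `path`.",   # a docstring / documentation chunk
]

# Each chunk becomes a fixed-length vector; similar text ends up close together.
embeddings = model.encode(chunks)
print(embeddings.shape)  # (2, 384) for this particular model
```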
Sentence Transformers is also not your only option. Any embedding model supported by ChromaDB should work (although I've only tried Sentence Transformers and Ollama), so you have the option to switch to your own choice. I was quite limited in terms of compute power (2c4t with no discrete GPU), so I didn't have the opportunity to experiment with more embedding models, and I can't comment on the effectiveness of alternative embedding models for now, but I will do that once I can run some evaluations on better hardware.

When making queries, the query messages will be chunked too (if they're too long), using the naive chunking method. Each query keyword (or chunk) will get its own best-matched chunks, and those get merged into the final result.

I hope this answers most of your questions.
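P.S. If it helps to see the query side as code too, here's a rough sketch of what a query roughly does, reusing the hypothetical `naive_chunk` and `collection` from the sketch above (again illustrative, not the exact implementation):

```python
# Rough sketch of the query step; illustrative, not the actual implementation.
query = "where is the config file parsed?"

# Long query messages are split with the same naive chunker shown above.
query_chunks = naive_chunk(query) if len(query) > 400 else [query]

# Each query chunk is embedded and matched against the project's collection.
results = collection.query(query_texts=query_chunks, n_results=5)

# `results` holds one list of matches per query chunk; keep the closest match
# per file and map the chunks back to the files they came from.
best = {}
for ids, dists, metas in zip(results["ids"], results["distances"], results["metadatas"]):
    for chunk_id, dist, meta in zip(ids, dists, metas):
        path = meta["path"]
        if path not in best or dist < best[path]:
            best[path] = dist

for path, dist in sorted(best.items(), key=lambda kv: kv[1]):
    print(f"{dist:.3f}  {path}")
```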
-
This plugin makes some sort of embedding of the source files into a vector space, so that a nearest-distance metric can more easily find files related to a particular query.
I think that it's important to make clear what is happening here.
As far as I understand, SentenceTransformer is for natural-language embeddings.
What consequence does this have for the quality of our embeddings? You mention some basic chunking procedure, but don't go into detail on what is actually being done.
I'd like to understand, as a lay person, what effect the selected embedding model has on the effectiveness of the vectorization for search. Could you please make the technical details more accessible?