Description
What specific problem does this solve?
Problem: Users are limited to Qdrant as the only vector database option for codebase indexing, which creates the setup, resource, and infrastructure issues described below.
Who is affected: All users wanting to use codebase indexing, especially:
- Individual developers who want a zero-configuration solution
- Teams with existing vector database infrastructure (ChromaDB, Pinecone, etc.)
- Users on resource-constrained systems who can't run additional services
- Enterprise users with specific compliance/infrastructure requirements
When this happens:
- During initial setup of codebase indexing
- When trying to integrate with existing ML/AI pipelines
- When deploying on resource-limited environments
- When corporate policies restrict certain database choices
Current behavior:
- Users MUST set up and maintain a Qdrant instance (Docker or cloud)
- No way to reuse existing vector DB infrastructure
- Requires additional resources and configuration
- Some users reported issues with Qdrant setup (Issue #4441, "Codebase indexing qdrant docker compose can not start indexing")
Expected behavior:
- Users can choose from multiple vector database options
- Support for embedded databases that require no separate service
- Ability to use existing infrastructure
Impact:
- Setup time: 30-60 minutes for Qdrant vs 5 minutes for embedded solutions
- Resource usage: Additional 500MB+ RAM for Qdrant service
- Barrier to adoption: Many users skip codebase indexing due to setup complexity
- Infrastructure costs: Unnecessary duplication for teams with existing vector DBs
Additional context (optional)
- Discussion #411 ("Indexing Code Base") shows strong community interest in alternatives, specifically LanceDB
- Continue.dev successfully uses LanceDB for the same use case
- A community member created a workaround (https://github.com/OJamals/Modal), which shows demand
- Blog post demonstrating LanceDB for code RAG: https://blog.lancedb.com/rag-codebase-1/
Roo Code Task Links (Optional)
N/A
Request checklist
- I've searched existing Issues and Discussions for duplicates
- This describes a specific problem with clear impact and context
Interested in implementing this?
- Yes, I'd like to help implement this feature
Implementation requirements
- I understand this needs approval before implementation begins
How should this be solved? (REQUIRED if contributing, optional otherwise)
Solution: Implement a vector database adapter pattern
Create abstract interface:
- Define `VectorDBAdapter` base class with standard methods
- Methods: `create_index`, `add_embeddings`, `search`, `update`, `delete`
- Consistent error handling and response formats
Implement adapters for priority databases:
- LanceDB: Embedded, no server needed, proven in Continue.dev
- ChromaDB: Popular choice, good Python integration
- SQLite+Vector: Minimal dependencies using sqlite-vss
- Qdrant: Keep existing implementation as one option
Configuration approach:
- Add `vector_db_provider` setting in config
- Provider-specific settings in nested config object
- Auto-detection of available providers on startup
User interaction:
- Dropdown in settings to select vector DB
- Provider-specific configuration fields appear dynamically
- Clear setup instructions for each provider
- Migration tool for switching between providers
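To make the proposed interface concrete, here is a minimal sketch of what a `VectorDBAdapter` base class could look like. Only the class name and the five method names come from this proposal; the signatures, the `SearchHit` type, and the `InMemoryAdapter` (a brute-force toy, not a real backend) are assumptions for illustration.

```python
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SearchHit:
    """One search result. Field names are illustrative assumptions."""
    id: str
    score: float
    payload: dict


class VectorDBAdapter(ABC):
    """Hypothetical base class exposing the five methods named above."""

    @abstractmethod
    async def create_index(self, name, dimension): ...

    @abstractmethod
    async def add_embeddings(self, ids, vectors, payloads): ...

    @abstractmethod
    async def search(self, query, limit=5): ...

    @abstractmethod
    async def update(self, id, vector, payload): ...

    @abstractmethod
    async def delete(self, ids): ...


class InMemoryAdapter(VectorDBAdapter):
    """Toy adapter using brute-force cosine similarity; for tests only."""

    def __init__(self):
        self.dimension = None
        self.rows = {}  # id -> (vector, payload)

    async def create_index(self, name, dimension):
        self.dimension = dimension
        self.rows = {}

    async def add_embeddings(self, ids, vectors, payloads):
        for id_, vec, payload in zip(ids, vectors, payloads):
            if len(vec) != self.dimension:
                raise ValueError(
                    f"expected dimension {self.dimension}, got {len(vec)}"
                )
            self.rows[id_] = (list(vec), payload)

    async def search(self, query, limit=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(
                sum(y * y for y in b)
            )
            return dot / norm if norm else 0.0

        hits = [
            SearchHit(id_, cosine(query, vec), payload)
            for id_, (vec, payload) in self.rows.items()
        ]
        return sorted(hits, key=lambda h: h.score, reverse=True)[:limit]

    async def update(self, id, vector, payload):
        # Overwrites the row; add_embeddings performs the dimension check.
        await self.add_embeddings([id], [vector], [payload])

    async def delete(self, ids):
        for id_ in ids:
            self.rows.pop(id_, None)
```

Real adapters for LanceDB, ChromaDB, sqlite-vss, and Qdrant would implement the same five methods on top of their own client libraries, so the indexing code would depend only on the base class.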
How will we know it works? (Acceptance Criteria - REQUIRED if contributing, optional otherwise)
Given I have codebase indexing enabled
When I select "LanceDB" as my vector database
Then indexing works without requiring external services
And search results are comparable to Qdrant implementation
And switching between providers preserves my indexed data
But performance doesn't degrade significantly
Given I have an existing ChromaDB instance
When I configure codebase indexing to use it
Then it connects to my existing database
And creates collections without affecting other data
And respects my existing authentication settings
Given I'm using SQLite vector extension
When I index a large codebase (10k+ files)
Then indexing completes successfully
And search queries return in under 2 seconds
But I get a warning if codebase size might impact performance
Technical considerations (REQUIRED if contributing, optional otherwise)
Implementation approach:
- Factory pattern for creating appropriate adapter instances
- Async/await support for all database operations
- Consistent embedding dimension handling across providers
- Batch processing for efficient indexing
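The factory and batching points above could be sketched roughly as follows. The registry, the `register_adapter` decorator, the `batched` helper, and the config layout (provider settings nested under a key equal to the provider name) are all assumptions made for this example, not an existing API.

```python
# Hypothetical registry mapping provider names to adapter factories.
_REGISTRY = {}


def register_adapter(name):
    """Class decorator that registers an adapter under a provider name."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator


def create_adapter(config):
    """Build the adapter selected by config['vector_db_provider'].

    Assumes provider-specific settings live under a key named after
    the provider, e.g. config['lancedb'].
    """
    provider = config["vector_db_provider"]
    try:
        factory = _REGISTRY[provider]
    except KeyError:
        raise ValueError(
            f"unknown vector DB provider {provider!r}; "
            f"available: {sorted(_REGISTRY)}"
        ) from None
    return factory(config.get(provider, {}))


def batched(items, size):
    """Yield lists of at most `size` items, for batched inserts."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


@register_adapter("lancedb")
class PlaceholderLanceAdapter:
    """Placeholder standing in for a real LanceDB adapter."""

    def __init__(self, settings):
        self.settings = settings
```

With this shape, adding a provider means registering one more class, and the indexer can feed embeddings to any adapter in fixed-size batches without knowing which backend is in use.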
Architecture changes:
- New `vector_db/` module with adapter implementations
- Modify `CodebaseIndex` class to use adapters
- Update configuration schema and validation
Dependencies:
- LanceDB: `pip install lancedb` (lightweight)
- ChromaDB: `pip install chromadb` (includes dependencies)
- SQLite: `pip install sqlite-vss` (minimal)
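Because these dependencies would be optional, the auto-detection of available providers mentioned in the configuration approach could be done by checking which packages are importable. This is a sketch under that assumption: the provider names and the `available_providers` helper are hypothetical, and the module names are the import names I believe these packages use.

```python
from importlib.util import find_spec

# Provider name -> top-level module to look for (provider names are
# hypothetical labels chosen for this example).
PROVIDER_MODULES = {
    "lancedb": "lancedb",
    "chromadb": "chromadb",
    "sqlite-vss": "sqlite_vss",
    "qdrant": "qdrant_client",
}


def available_providers(modules=PROVIDER_MODULES):
    """Return provider names whose Python package is importable.

    find_spec only locates the module; it does not import it, so the
    check stays cheap at startup.
    """
    return [
        name for name, module in modules.items()
        if find_spec(module) is not None
    ]
```

The settings dropdown could then list only the detected providers, or show the missing ones with their install command.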
Testing strategy:
- Unit tests for each adapter with mocked databases
- Integration tests with real databases in CI
- Performance benchmarks comparing providers
Trade-offs and risks (REQUIRED if contributing, optional otherwise)
Alternatives considered:
- MCP server approach - Too complex for users, requires additional setup
- External indexing service - Loses tight integration benefits
- Supporting only one alternative - Doesn't solve the flexibility problem
Risks:
- Maintenance burden: Each adapter needs updates when APIs change
  - Mitigation: Start with 2-3 most requested options
- Performance variations: Different DBs have different performance characteristics
  - Mitigation: Clear documentation on use cases for each
- Migration complexity: Moving between providers could be challenging
  - Mitigation: Build migration tool from the start
- Testing complexity: Need to test multiple database backends
  - Mitigation: Shared test suite with provider-specific fixtures
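One way to structure the shared test suite with provider-specific fixtures is a contract mixin: the tests are written once, and each provider contributes only a subclass that builds its adapter. The sketch below uses `unittest` and a dict-backed stand-in; `DictStore`, `AdapterContract`, and the method names `add`, `delete`, and `count` are simplifications invented for this example.

```python
import unittest


class DictStore:
    """Stand-in backend so the sketch runs without a real database."""

    def __init__(self):
        self.rows = {}

    def add(self, id_, vector):
        self.rows[id_] = vector

    def delete(self, id_):
        self.rows.pop(id_, None)

    def count(self):
        return len(self.rows)


class AdapterContract:
    """Shared tests; subclasses supply the provider-specific fixture."""

    def make_adapter(self):
        raise NotImplementedError

    def setUp(self):
        self.adapter = self.make_adapter()

    def test_add_then_count(self):
        self.adapter.add("a", [0.1, 0.2])
        self.assertEqual(self.adapter.count(), 1)

    def test_delete_is_idempotent(self):
        self.adapter.add("a", [0.1, 0.2])
        self.adapter.delete("a")
        self.adapter.delete("a")  # second delete must not raise
        self.assertEqual(self.adapter.count(), 0)


class DictStoreContractTest(AdapterContract, unittest.TestCase):
    """One subclass like this per provider."""

    def make_adapter(self):
        return DictStore()
```

A LanceDB or ChromaDB test class would override only `make_adapter`, for example to point the adapter at a temporary directory or a CI service.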
Breaking changes:
- Configuration format will change (but can auto-migrate)
- Existing Qdrant indexes remain compatible
Edge cases:
- Very large codebases might not work well with SQLite
- Embedding dimension mismatches between providers
- Network issues with cloud providers (Pinecone)
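The embedding dimension edge case can be caught early by recording the dimension with the index and validating vectors before they are written or migrated. This is a minimal sketch; `IndexMeta`, `DimensionMismatchError`, and `check_dimension` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMeta:
    """Metadata stored alongside an index (fields are assumptions)."""
    provider: str
    dimension: int


class DimensionMismatchError(ValueError):
    """Raised when a vector does not match the index dimension."""


def check_dimension(meta, vectors):
    """Raise before writing if any vector has the wrong length."""
    for position, vector in enumerate(vectors):
        if len(vector) != meta.dimension:
            raise DimensionMismatchError(
                f"vector {position} has dimension {len(vector)}, but the "
                f"{meta.provider} index expects {meta.dimension}"
            )
```

Running this check at the start of a provider migration would surface a mismatch (for example after switching embedding models) before any data is copied.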