Support for Alternative Vector Databases in Codebase Indexing #6223

@Valgard

What specific problem does this solve?

Problem: Qdrant is currently the only vector database supported for codebase indexing, which creates several issues for the users and situations described below.

Who is affected: All users wanting to use codebase indexing, especially:

  • Individual developers who want a zero-configuration solution
  • Teams with existing vector database infrastructure (ChromaDB, Pinecone, etc.)
  • Users on resource-constrained systems who can't run additional services
  • Enterprise users with specific compliance/infrastructure requirements

When this happens:

  • During initial setup of codebase indexing
  • When trying to integrate with existing ML/AI pipelines
  • When deploying on resource-limited environments
  • When corporate policies restrict certain database choices

Current behavior:

Expected behavior:

  • Users can choose from multiple vector database options
  • Support for embedded databases that require no separate service
  • Ability to use existing infrastructure

Impact:

  • Setup time: 30-60 minutes for Qdrant vs 5 minutes for embedded solutions
  • Resource usage: Additional 500MB+ RAM for Qdrant service
  • Barrier to adoption: Many users skip codebase indexing due to setup complexity
  • Infrastructure costs: Unnecessary duplication for teams with existing vector DBs

Additional context (optional)

Roo Code Task Links (Optional)

N/A

Request checklist

  • I've searched existing Issues and Discussions for duplicates
  • This describes a specific problem with clear impact and context

Interested in implementing this?

  • Yes, I'd like to help implement this feature

Implementation requirements

  • I understand this needs approval before implementation begins

How should this be solved? (REQUIRED if contributing, optional otherwise)

Solution: Implement a vector database adapter pattern

  1. Create abstract interface:

    • Define VectorDBAdapter base class with standard methods
    • Methods: create_index, add_embeddings, search, update, delete
    • Consistent error handling and response formats
  2. Implement adapters for priority databases:

    • LanceDB: Embedded, no server needed, proven in Continue.dev
    • ChromaDB: Popular choice, good Python integration
    • SQLite+Vector: Minimal dependencies using sqlite-vss
    • Qdrant: Keep existing implementation as one option
  3. Configuration approach:

    • Add vector_db_provider setting in config
    • Provider-specific settings in nested config object
    • Auto-detection of available providers on startup
  4. User interaction:

    • Dropdown in settings to select vector DB
    • Provider-specific configuration fields appear dynamically
    • Clear setup instructions for each provider
    • Migration tool for switching between providers
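The adapter interface from step 1 could look roughly like the sketch below. This is a minimal illustration, not the project's actual code: the method names come from the list above, but the signatures, the `SearchHit` type, and the `InMemoryAdapter` reference implementation are assumptions made for this example. It is synchronous for brevity.

```python
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SearchHit:
    """Hypothetical uniform response format shared by all adapters."""
    id: str
    score: float
    payload: dict


class VectorDBAdapter(ABC):
    """Hypothetical base class; method names follow the proposal."""

    @abstractmethod
    def create_index(self, name: str, dimension: int) -> None: ...

    @abstractmethod
    def add_embeddings(self, name: str, ids: Sequence[str],
                       vectors: Sequence[Sequence[float]],
                       payloads: Sequence[dict]) -> None: ...

    @abstractmethod
    def search(self, name: str, query: Sequence[float],
               limit: int = 5) -> list[SearchHit]: ...

    @abstractmethod
    def update(self, name: str, id: str, vector: Sequence[float],
               payload: dict) -> None: ...

    @abstractmethod
    def delete(self, name: str, ids: Sequence[str]) -> None: ...


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryAdapter(VectorDBAdapter):
    """Reference adapter for tests; brute-force cosine similarity."""

    def __init__(self) -> None:
        # index name -> (dimension, {id: (vector, payload)})
        self._indexes: dict = {}

    def _rows(self, name: str, vector: Sequence[float]) -> dict:
        dimension, rows = self._indexes[name]
        if len(vector) != dimension:
            raise ValueError(f"expected {dimension} dimensions, got {len(vector)}")
        return rows

    def create_index(self, name, dimension):
        self._indexes.setdefault(name, (dimension, {}))

    def add_embeddings(self, name, ids, vectors, payloads):
        for id_, vector, payload in zip(ids, vectors, payloads):
            self._rows(name, vector)[id_] = (list(vector), payload)

    def search(self, name, query, limit=5):
        rows = self._rows(name, query)
        hits = [SearchHit(id_, _cosine(query, vector), payload)
                for id_, (vector, payload) in rows.items()]
        return sorted(hits, key=lambda hit: hit.score, reverse=True)[:limit]

    def update(self, name, id, vector, payload):
        rows = self._rows(name, vector)
        if id not in rows:
            raise KeyError(id)
        rows[id] = (list(vector), payload)

    def delete(self, name, ids):
        _, rows = self._indexes[name]
        for id_ in ids:
            rows.pop(id_, None)
```

An in-memory adapter like this would also give the shared test suite a dependency-free baseline before any real database is wired in.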

How will we know it works? (Acceptance Criteria - REQUIRED if contributing, optional otherwise)

Given I have codebase indexing enabled
When I select "LanceDB" as my vector database
Then indexing works without requiring external services
And search results are comparable to the Qdrant implementation
And switching between providers preserves my indexed data
But performance doesn't degrade significantly

Given I have an existing ChromaDB instance
When I configure codebase indexing to use it
Then it connects to my existing database
And creates collections without affecting other data
And respects my existing authentication settings

Given I'm using the SQLite vector extension
When I index a large codebase (10k+ files)
Then indexing completes successfully
And search queries return in under 2 seconds
But I get a warning if codebase size might impact performance
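"Comparable to the Qdrant implementation" needs a measurable definition to be testable. One possible option, offered here only as an assumption, is overlap@k: the fraction of the baseline provider's top-k result IDs that also appear in the candidate provider's top-k. The function name and any pass threshold are hypothetical.

```python
from typing import Sequence


def overlap_at_k(baseline_ids: Sequence[str],
                 candidate_ids: Sequence[str], k: int) -> float:
    """Fraction of the baseline top-k IDs also present in the candidate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    baseline = set(baseline_ids[:k])
    if not baseline:
        # Nothing to compare against; treat two empty result sets as identical.
        return 1.0 if not candidate_ids[:k] else 0.0
    return len(baseline & set(candidate_ids[:k])) / len(baseline)
```

A benchmark could run a fixed set of queries against both providers and require the mean overlap to stay above an agreed value.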

Technical considerations (REQUIRED if contributing, optional otherwise)

Implementation approach:

  • Factory pattern for creating appropriate adapter instances
  • Async/await support for all database operations
  • Consistent embedding dimension handling across providers
  • Batch processing for efficient indexing
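As a rough sketch of how the factory, async, and batching points could fit together: a registry maps provider names to constructors, and indexing awaits one batch at a time. Everything here is illustrative. The `vector_db_provider` key comes from the proposal, while the `providers` nesting, `register`, `create_adapter`, `index_all`, and the stand-in `RecordingAdapter` are assumptions.

```python
import asyncio
from typing import Callable, Dict, Iterator, List, Sequence


class RecordingAdapter:
    """Stand-in adapter that records batches instead of calling a database."""

    def __init__(self, **settings) -> None:
        self.settings = settings
        self.batches: List[List[str]] = []

    async def add_embeddings(self, ids: Sequence[str],
                             vectors: Sequence[Sequence[float]]) -> None:
        await asyncio.sleep(0)  # yield control, as a real I/O call would
        self.batches.append(list(ids))


_REGISTRY: Dict[str, Callable[..., RecordingAdapter]] = {}


def register(provider: str):
    """Decorator that adds a constructor to the provider registry."""
    def decorator(factory):
        _REGISTRY[provider] = factory
        return factory
    return decorator


@register("memory")
def _make_memory(**settings) -> RecordingAdapter:
    return RecordingAdapter(**settings)


def create_adapter(config: dict) -> RecordingAdapter:
    """Build the adapter named by config['vector_db_provider']."""
    provider = config["vector_db_provider"]
    try:
        factory = _REGISTRY[provider]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown provider {provider!r}; available: {known}") from None
    # Provider-specific settings live in a nested object (assumed layout).
    return factory(**config.get("providers", {}).get(provider, {}))


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def index_all(adapter, ids, vectors, batch_size: int = 2) -> None:
    """Send embeddings to the adapter one batch at a time."""
    for id_batch, vector_batch in zip(batched(ids, batch_size),
                                      batched(vectors, batch_size)):
        await adapter.add_embeddings(id_batch, vector_batch)
```

Registering constructors by name keeps the indexing code unaware of which database sits behind the adapter, so adding a provider would mean adding one registered function.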

Architecture changes:

  • New vector_db/ module with adapter implementations
  • Modify CodebaseIndex class to use adapters
  • Update configuration schema and validation

Dependencies:

  • LanceDB: pip install lancedb (lightweight)
  • ChromaDB: pip install chromadb (includes dependencies)
  • SQLite: pip install sqlite-vss (minimal)
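Since each provider is an optional dependency, the auto-detection on startup mentioned earlier could check which packages are importable without actually importing them. The sketch below uses the standard library's `importlib.util.find_spec`. The provider-to-module mapping is an assumption based on the packages' usual import names, and `detect_providers` is a hypothetical helper.

```python
from importlib.util import find_spec
from typing import Dict, List, Optional

# Assumed mapping from provider name to top-level import name.
PROVIDER_MODULES: Dict[str, str] = {
    "lancedb": "lancedb",
    "chromadb": "chromadb",
    "sqlite": "sqlite_vss",
    "qdrant": "qdrant_client",
}


def detect_providers(modules: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the providers whose Python package is installed."""
    modules = PROVIDER_MODULES if modules is None else modules
    # find_spec returns None for a missing top-level module.
    return [name for name, module in modules.items()
            if find_spec(module) is not None]
```

The settings dropdown could then list only detected providers, or show the install command for the rest.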

Testing strategy:

  • Unit tests for each adapter with mocked databases
  • Integration tests with real databases in CI
  • Performance benchmarks comparing providers
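One way to keep per-adapter tests consistent is a contract mixin: the test methods are written once, and each provider supplies only a fixture method. The sketch below uses the standard `unittest` module. `AdapterContract`, `DictAdapter`, and the simplified method signatures are hypothetical and exist only to make the example runnable.

```python
import unittest


class DictAdapter:
    """Tiny fake adapter used to exercise the contract."""

    def __init__(self) -> None:
        self._rows = {}

    def add_embeddings(self, ids, vectors) -> None:
        self._rows.update(zip(ids, vectors))

    def delete(self, ids) -> None:
        for id_ in ids:
            self._rows.pop(id_, None)

    def search(self, query, limit=5):
        def squared_distance(vector):
            return sum((a - b) ** 2 for a, b in zip(query, vector))
        return sorted(self._rows,
                      key=lambda id_: squared_distance(self._rows[id_]))[:limit]


class AdapterContract:
    """Mixin with shared tests; not a TestCase, so it is never run directly."""

    def make_adapter(self):
        raise NotImplementedError

    def test_nearest_first(self):
        adapter = self.make_adapter()
        adapter.add_embeddings(["a", "b"], [[0.0, 0.0], [5.0, 5.0]])
        self.assertEqual(adapter.search([1.0, 1.0]), ["a", "b"])

    def test_delete_removes_record(self):
        adapter = self.make_adapter()
        adapter.add_embeddings(["a", "b"], [[0.0, 0.0], [5.0, 5.0]])
        adapter.delete(["a"])
        self.assertEqual(adapter.search([1.0, 1.0]), ["b"])


class DictAdapterTest(AdapterContract, unittest.TestCase):
    """Provider-specific fixture: only the constructor differs."""

    def make_adapter(self):
        return DictAdapter()


def run_contract(case) -> unittest.TestResult:
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A real LanceDB or ChromaDB test class would override `make_adapter` to return an adapter pointed at a temporary database.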

Trade-offs and risks (REQUIRED if contributing, optional otherwise)

Alternatives considered:

  1. MCP server approach - Too complex for users, requires additional setup
  2. External indexing service - Loses tight integration benefits
  3. Supporting only one alternative - Doesn't solve the flexibility problem

Risks:

  • Maintenance burden: Each adapter needs updates when APIs change
    • Mitigation: Start with 2-3 most requested options
  • Performance variations: Different DBs have different performance characteristics
    • Mitigation: Clear documentation on use cases for each
  • Migration complexity: Moving between providers could be challenging
    • Mitigation: Build migration tool from the start
  • Testing complexity: Need to test multiple database backends
    • Mitigation: Shared test suite with provider-specific fixtures
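For the migration tool, the core loop could be as small as streaming records from one adapter to another in batches. This sketch assumes every adapter can enumerate its stored vectors through a hypothetical `iter_records` method, which may not hold for all backends. `ListStore` is a stand-in for a real adapter.

```python
from itertools import islice
from typing import Dict, Iterator, List, Optional, Tuple

Record = Tuple[str, List[float]]


class ListStore:
    """Stand-in for an adapter that can export and import raw vectors."""

    def __init__(self, dimension: int,
                 records: Optional[Dict[str, List[float]]] = None) -> None:
        self.dimension = dimension
        self.records = dict(records or {})

    def iter_records(self) -> Iterator[Record]:
        yield from self.records.items()

    def add_embeddings(self, batch: List[Record]) -> None:
        self.records.update(batch)


def migrate(source: ListStore, target: ListStore, batch_size: int = 100) -> int:
    """Copy all records from source to target; return the number copied."""
    if source.dimension != target.dimension:
        raise ValueError(
            f"dimension mismatch: {source.dimension} vs {target.dimension}")
    records = iter(source.iter_records())
    copied = 0
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            break
        target.add_embeddings(batch)
        copied += len(batch)
    return copied
```

If a backend cannot export vectors, the fallback would be re-embedding the codebase, which is slower but needs no export support.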

Breaking changes:

  • Configuration format will change (but can auto-migrate)
  • Existing Qdrant indexes remain compatible
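The configuration auto-migration could be a pure function that maps the old flat settings to the new nested layout. The legacy key names below (`qdrant_url`, `qdrant_api_key`) and the `providers` nesting are invented for illustration and are not taken from the actual configuration schema.

```python
import copy

LEGACY_PREFIX = "qdrant_"  # assumed prefix of the old flat settings


def migrate_config(config: dict) -> dict:
    """Return a config in the new format without modifying the input."""
    if "vector_db_provider" in config:
        return copy.deepcopy(config)  # already migrated
    migrated = {key: copy.deepcopy(value) for key, value in config.items()
                if not key.startswith(LEGACY_PREFIX)}
    # Existing installs keep using Qdrant, so their indexes stay valid.
    migrated["vector_db_provider"] = "qdrant"
    migrated["providers"] = {
        "qdrant": {key[len(LEGACY_PREFIX):]: copy.deepcopy(value)
                   for key, value in config.items()
                   if key.startswith(LEGACY_PREFIX)}
    }
    return migrated
```

Because the function is idempotent, it could run on every startup without tracking a schema version.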

Edge cases:

  • Very large codebases might not work well with SQLite
  • Embedding dimension mismatches between providers
  • Network issues with cloud providers (Pinecone)
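The large-codebase case for SQLite could be surfaced with a simple pre-indexing check, matching the warning described in the acceptance criteria. The threshold reuses the "10k+ files" figure from above as an assumption, and `warn_if_large` and the provider name `"sqlite"` are hypothetical.

```python
import warnings

LARGE_CODEBASE_FILES = 10_000  # assumption, taken from the "10k+ files" criterion


def warn_if_large(file_count: int, provider: str,
                  threshold: int = LARGE_CODEBASE_FILES) -> bool:
    """Warn when an embedded SQLite index may struggle; return True if warned."""
    if provider == "sqlite" and file_count >= threshold:
        warnings.warn(
            f"{file_count} files may slow down indexing and search with the "
            "SQLite provider; consider another vector database.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```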

Metadata

Assignees

Labels

  • Issue - In Progress: Someone is actively working on this. Should link to a PR soon.
  • enhancement: New feature or request
  • feature request: Feature request, not a bug
  • proposal

Type

No type

Projects

Status

Issue [In Progress]

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
