SmartSearch is an advanced tool designed for efficient indexing and searching of large codebases. It combines semantic and keyword-based search with Large Language Models (LLMs) such as CodeBERT, enabling developers to extract relevant code snippets quickly and effectively.
- Local Deployment (Secure): Deploy locally to maintain full control over sensitive data and ensure a secure environment.
- Model Transparency: Provides more insight into model decisions, enhancing trust and interpretability.
- Multi-Language Support: Initially designed for JavaScript, Python, PHP, and Java, with the flexibility to easily extend support for additional languages.
- File Chunking: Processes large files into smaller, logical chunks for improved search accuracy and performance.
- Sophisticated Scoring Method: Ranks search results based on relevance, ensuring precise and useful outcomes.
- Start your IDE and open a terminal.
- Build and start Docker:

```
docker-compose up --build
```

(Use `docker-compose build --no-cache` to force a full rebuild without cache.) This may take some time, depending on your machine and internet connection.
- Open a second terminal window and run the following command to test whether everything was installed correctly:

```
docker-compose run --rm app python3 tests.py
```
- Add Source Code: Before using the tool, make sure to add a folder named `sourcecode` into the root (`/sourcecode/`) and add your codebase files into this folder.
Once the tool is built, you can start indexing and searching your codebase with a few simple commands. Make sure to specify the directory to index (`sourcecode`, where the source code files are located).

```
docker-compose run --rm app python3 codebase_search.py --index sourcecode
docker-compose run --rm app python3 codebase_search.py --search "query"
```

Examples:

- "Where is the servername of the database?"
- "Where can I find the Coupon class?"
- "Where is the Login happening?"
- "Where are POST and GET requests handled?"
To train your own model using your custom training data:

- Update the `train_data.csv` file in the `train` folder with your training dataset, or specify your own file location.
- Adjust parameters such as learning rate, batch size, and epochs in the `train_model.py` file.
```
docker-compose run --rm app python3 train/train_model.py
```
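For orientation, the adjustable values might look like the following; these names and defaults are placeholders, not the actual contents of `train_model.py`:

```python
# Hypothetical training parameters -- the real names and defaults in
# train/train_model.py may differ.
LEARNING_RATE = 2e-5   # optimizer step size; smaller is slower but more stable
BATCH_SIZE = 16        # examples per gradient update; lower this if the GPU runs out of memory
EPOCHS = 3             # full passes over the training dataset
TRAIN_DATA = "train/train_data.csv"  # or point this at your own file
```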
Before using the tool, make sure your system meets the following requirements:

- An NVIDIA GPU that supports CUDA.
- Windows 10/11 with WSL 2 enabled.
- NVIDIA Game Ready Driver version 465.89 or higher (this is the version that supports WSL 2).
Download and install Docker Desktop for Windows.
- Ensure you enable the WSL 2 feature in Docker Desktop settings:
- Right-click the Docker icon in the system tray and select Settings.
- Go to General and enable the Use the WSL 2 based engine option.
- In the Resources section, ensure that the integration with your installed WSL distributions is enabled.
- Download and install the latest NVIDIA driver (Game Ready Driver) that supports CUDA from NVIDIA GeForce Experience or the NVIDIA website.
- Make sure that you have a compatible GPU.
Currently, NVIDIA CUDA with WSL 2 + Docker has a bug which blocks GPU access. To fix this, use the most recent Docker Desktop version and make the following change:

- Inside the `docker-desktop` folder, find the file `/etc/nvidia-container-runtime/config.toml` and change `no-cgroups` from `true` to `false`.
Hyper-training sweeps several values for each of three parameters and evaluates the combinations to find the best model. (This takes a long time to run!)

```
docker-compose run --rm app python3 train/hyper_train_model.py
```

This feature still needs to be fine-tuned and is a work in progress.
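For illustration, a sweep of this kind can be as simple as iterating over the cartesian product of the candidate values. The parameter names, grids, and training stub below are hypothetical, not the ones hard-coded in `hyper_train_model.py`:

```python
from itertools import product

# Hypothetical parameter grids -- hyper_train_model.py may sweep different values.
learning_rates = [1e-5, 2e-5, 5e-5]
batch_sizes = [8, 16, 32]
epoch_counts = [2, 3, 4]

def train_and_evaluate(lr, batch_size, epochs):
    """Placeholder: the real routine would fine-tune the model and return a validation score."""
    return -abs(lr - 2e-5) * 1e4 - abs(batch_size - 16) * 0.01 - abs(epochs - 3) * 0.1

best_score, best_params = float("-inf"), None
for lr, bs, epochs in product(learning_rates, batch_sizes, epoch_counts):
    score = train_and_evaluate(lr, bs, epochs)
    if score > best_score:
        best_score, best_params = score, (lr, bs, epochs)

print(f"Best parameters: {best_params} (score={best_score:.4f})")
```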
This tool has been designed to handle JavaScript, Python, PHP, and Java codebases. As such, several constraints are specifically tailored to these programming languages.
- All documents are indexed in an Elasticsearch backend, which must be running and accessible.
- Proper indexing of documents ensures compatibility with semantic retrieval and question-answering pipelines.
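For orientation, writing chunked documents into Elasticsearch with Haystack (1.x) looks roughly like this; the index name matches the `codebase_index` used in the Postman checks below, but the exact wiring in `codebase_search.py` may differ:

```python
from haystack.document_stores import ElasticsearchDocumentStore

# Connect to the running Elasticsearch backend (see docker-compose).
document_store = ElasticsearchDocumentStore(
    host="localhost", port=9200, index="codebase_index"
)

# Each chunk becomes one document; metadata keeps the file name for scoring.
document_store.write_documents([
    {"content": "function login(user, pass) { ... }",
     "meta": {"name": "auth/login.js", "language": "javascript"}},
])
```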
- Uses `CodeBERT` or `RoBERTa` for embedding retrieval and question answering.
- Requires pre-trained or fine-tuned models, which should be available locally or downloaded online.
- Content is split into logical chunks with a limit of 512 tokens to comply with model constraints (e.g., `CodeBERT` or `RoBERTa`).
- Skips unreadable or inaccessible files and logs errors.
- Combines:
  - Semantic retrieval scores.
  - Exact match scores.
  - File name relevance.
- A `synonyms` dictionary is required for query expansion (see the sketch below).
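As an illustration of how these signals can work together, the snippet below blends the three scores with fixed weights and expands a query via a synonyms dictionary. The weights, dictionary contents, and helper names are assumptions, not the exact logic in `codebase_search.py`:

```python
# Hypothetical weights and synonyms -- the real values live in codebase_search.py.
SYNONYMS = {"login": ["signin", "authenticate"], "db": ["database"]}

def expand_query(query: str) -> list[str]:
    """Return the query terms plus any synonyms, for broader keyword matching."""
    terms = query.lower().split()
    return terms + [syn for t in terms for syn in SYNONYMS.get(t, [])]

def combined_score(semantic: float, exact: float, filename: float) -> float:
    """Blend semantic retrieval, exact-match, and file-name relevance scores."""
    return 0.6 * semantic + 0.3 * exact + 0.1 * filename

print(expand_query("db login"))        # ['db', 'login', 'database', 'signin', 'authenticate']
print(combined_score(0.82, 0.5, 1.0))  # 0.742
```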
The following file types are not supported and are skipped during indexing:

- `.DS_Store`, `.bin`, `.exe`, `.dll`, `.so` (system and binary files)
- `.jpg`, `.png`, `.gif` (image files)
- `.zip` (compressed archives)
- `.html` (markup files)
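A simple extension-based filter is enough to implement this skip list; the set below mirrors the list above, though the actual check in the indexer may be structured differently:

```python
from pathlib import Path

# Extensions skipped during indexing (see the list above).
SKIPPED_EXTENSIONS = {".bin", ".exe", ".dll", ".so",
                      ".jpg", ".png", ".gif", ".zip", ".html"}

def should_index(path: str) -> bool:
    """Return True if the file type is supported for indexing."""
    p = Path(path)
    # .DS_Store is a dotfile, so it has no suffix and needs its own name check.
    return p.name.lower() != ".ds_store" and p.suffix.lower() not in SKIPPED_EXTENSIONS

print(should_index("src/dbaccess.php"))  # True
print(should_index("logo.png"))          # False
```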
The tool processes source code for:
- JavaScript
- Python
- PHP
- Java
Adjustments are to be made within the `split_into_chunks` function.
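To make the 512-token constraint concrete, a chunker along these lines splits a file at line boundaries and uses the CodeBERT tokenizer to respect the model limit. This is a minimal sketch, not the actual `split_into_chunks` implementation:

```python
from transformers import AutoTokenizer

# CodeBERT's tokenizer enforces the same 512-token limit as the model itself.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

def split_into_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Greedily pack whole lines into chunks of at most max_tokens tokens."""
    chunks, current, current_len = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(tokenizer.tokenize(line))
        if current and current_len + n > max_tokens:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += n
    if current:
        chunks.append("".join(current))
    return chunks
```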
- A GPU is recommended for optimal performance with `torch`-based models.
- Support More Languages: Add support for C++, Ruby, Go, and other languages
- Better Chunking: Improve handling of large or mixed-language files
- Faster Indexing: Enable parallel or distributed indexing for large repositories
- Additional Backends: Support other storage systems like FAISS or Milvus
- Smarter Query Expansion: Use advanced models for more accurate search
- Improved Scoring: Enhance result ranking with neural ranking models
- Custom Synonyms/Stopwords: Extend the existing synonyms and stopword lists.
- Real-Time Updates: Enable live indexing of new or updated files
Our goal is to have every commit to the sourcecode/ directory automatically indexed by Elasticsearch, and then make the updated search index available as a Docker image. This way, anyone can run a pre-indexed Elasticsearch container without manually re-running the indexing step.
Here’s how the pipeline works, step by step:
1. The CI/CD is defined in `.github/workflows/index-on-push.yml`.
2. It triggers on push to the `sourcecode/` folder.
3. The workflow spins up Elasticsearch as a service (version 7.17.9 in our example).
4. We wait for it to become healthy.
5. We run `python codebase_search.py --index ./sourcecode` to index the newly added or changed source files.
6. That indexing logic uses Haystack to store documents in Elasticsearch.
7. After indexing, the ES container holds a fresh index in `/usr/share/elasticsearch/data`.
8. We copy that directory onto the GitHub runner using a `docker cp` command.
9. We have a `Dockerfile.preindexed` based on `docker.elastic.co/elasticsearch/elasticsearch`.
10. In the build step, we `COPY` the `es_data` folder (which contains the newly indexed documents) into the image's `/usr/share/elasticsearch/data` (see the sketch after this list).
11. This means the resulting Docker image has the entire Elasticsearch index baked in.
12. We log in to GHCR using a GitHub secret (a Personal Access Token with `write:packages` scope).
13. We push the newly built image to `ghcr.io/<owner>/<repo>/smartsearch-es-preindexed:latest`.
14. If a developer wants to run a code search locally, they just `docker pull` that image and start it, and Elasticsearch is up with the latest docs indexed—no manual steps.
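Based on that description, `Dockerfile.preindexed` can be as small as the following; the exact base-image tag and folder name are assumptions drawn from the steps above:

```dockerfile
# Hypothetical sketch of Dockerfile.preindexed -- tag and paths per the pipeline above.
FROM docker.elastic.co/elasticsearch/elasticsearch:7.17.9

# Bake the freshly indexed data (copied off the CI service container) into the image.
COPY --chown=elasticsearch:elasticsearch es_data /usr/share/elasticsearch/data
```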
Install NVIDIA CUDA Toolkit:

- Download link for the CUDA Toolkit 12.4 Update 1
- For Ubuntu:

```
sudo apt install nvidia-cuda-toolkit
```
If the toolkit has been installed under both Windows and Ubuntu:

- Check if it is installed correctly:

```
nvcc --version
```

- Currently, NVIDIA CUDA with WSL 2 + Docker has a bug which blocks GPU access. To fix this, use the most recent Docker Desktop version and make the following change:
  - Inside the file `/etc/nvidia-container-runtime/config.toml`, change `no-cgroups` from `true` to `false`.
- Check your CUDA version with:

```
nvidia-smi
```
- Check NVIDIA Docker Hub and use the matching CUDA version in the `Dockerfile`. In this project, CUDA 12.4.1 is used:

```
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
```
- Validate that Docker has GPU access:

```
docker exec -it app ls /dev/nvidia*
```

If such device files are listed, the NVIDIA toolkit has been installed correctly.

```
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
```

If the NVIDIA toolkit has been installed correctly, you should see information about your GPU.
If you are not using Docker Desktop, use this guide by NVIDIA: Configuring Docker
To use Python commands in the terminal:

```
docker-compose run app /bin/bash
```

Postman Elasticsearch checks:

```
http://localhost:9200/codebase_index/_count
http://localhost:9200/codebase_index/_search?q=*&size=10
```
Search a specific file:

- Send a raw query to `http://localhost:9200/codebase_index/_search`:

```
{
  "query": {
    "wildcard": { "name": "*dbaccess.php*" }
  },
  "size": 10
}
```

- Example command:

```
docker-compose run --rm app python3 codebase_search.py --search "How are new users created?"
```
Cleanup: If there are any orphan containers or services, run:

```
docker-compose down --remove-orphans
```