ragBIS is a standalone Python package that scrapes openBIS documentation, processes the content, and generates embeddings for use in RAG (Retrieval Augmented Generation) applications.
- Web Scraping: Automatically scrapes openBIS documentation from Read the Docs
- Content Processing: Intelligently chunks content while preserving document structure
- Embedding Generation: Creates embeddings using Ollama's `nomic-embed-text` model
- Data Export: Saves processed data in JSON and CSV formats for easy consumption
- Python 3.8 or higher
- Ollama installed and running
- The `nomic-embed-text` model installed in Ollama
- Install Ollama from https://ollama.ai/
- Pull the required embedding model:
ollama pull nomic-embed-text
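If you want to verify this prerequisite programmatically, you can query Ollama's `/api/tags` endpoint, which lists locally installed models. The helper below is an illustrative sketch, not part of ragBIS:

```python
import json
import urllib.error
import urllib.request
from typing import Optional

def ollama_has_model(name: str, host: str = "localhost", port: int = 11434) -> Optional[bool]:
    """Return True/False for model presence, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/api/tags", timeout=5) as resp:
            tags = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None  # Ollama not running, unreachable, or returned unexpected data
    # Installed model names may carry a tag suffix such as ":latest"
    return any(m.get("name", "").split(":")[0] == name for m in tags.get("models", []))

status = ollama_has_model("nomic-embed-text")
if status is None:
    print("Ollama is not reachable; start it with `ollama serve`.")
elif not status:
    print("Model missing; run `ollama pull nomic-embed-text`.")
```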
- Clone or download this project
- Navigate to the ragBIS_project directory
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Alternatively, install ragBIS from PyPI:
pip install ragbis
Run ragBIS with default settings to scrape and process openBIS documentation:
python -m ragbis
This will:
- Scrape the openBIS documentation from the default URL
- Save raw content to `./data/raw/`
- Process and generate embeddings
- Save processed data to `./data/processed/`
python -m ragbis --help
Available options:
- `--url URL`: Base URL to scrape (default: https://openbis.readthedocs.io/en/latest/)
- `--output-dir DIR`: Output directory for data (default: ./data)
- `--max-pages N`: Maximum number of pages to scrape (default: 100)
- `--delay SECONDS`: Delay between requests in seconds (default: 0.5)
- `--force-rebuild`: Force rebuild even if processed data exists
- `--min-chunk-size N`: Minimum chunk size in characters (default: 100)
- `--max-chunk-size N`: Maximum chunk size in characters (default: 1000)
- `--chunk-overlap N`: Chunk overlap in characters (default: 50)
- `--verbose`: Enable verbose logging
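As a sketch of how these flags and defaults might map onto a parser, the snippet below mirrors the documented options with `argparse`. The actual ragbis entry point may be implemented differently:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the documented CLI flags and defaults; illustrative only,
    # not the real ragbis argument parser.
    p = argparse.ArgumentParser(prog="ragbis")
    p.add_argument("--url", default="https://openbis.readthedocs.io/en/latest/")
    p.add_argument("--output-dir", default="./data")
    p.add_argument("--max-pages", type=int, default=100)
    p.add_argument("--delay", type=float, default=0.5)
    p.add_argument("--force-rebuild", action="store_true")
    p.add_argument("--min-chunk-size", type=int, default=100)
    p.add_argument("--max-chunk-size", type=int, default=1000)
    p.add_argument("--chunk-overlap", type=int, default=50)
    p.add_argument("--verbose", action="store_true")
    return p

args = build_parser().parse_args(["--max-pages", "200", "--verbose"])
print(args.max_pages, args.verbose)  # → 200 True
```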
Scrape with custom settings:
python -m ragbis --max-pages 200 --output-dir ./my_data --verbose
Force rebuild existing data:
python -m ragbis --force-rebuild
Custom chunking parameters:
python -m ragbis --min-chunk-size 200 --max-chunk-size 1500 --chunk-overlap 100
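The three chunking flags can be understood through a simplified sliding-window sketch. The real ragBIS chunker also preserves document structure, which this character-window version does not attempt:

```python
def chunk_text(text, min_size=100, max_size=1000, overlap=50):
    """Split text into overlapping character windows.

    Illustrative only: mirrors the --min-chunk-size, --max-chunk-size,
    and --chunk-overlap defaults, not the actual ragBIS algorithm.
    """
    chunks = []
    step = max_size - overlap  # each window starts `overlap` chars before the previous one ends
    for start in range(0, len(text), step):
        piece = text[start:start + max_size]
        if len(piece) >= min_size:
            chunks.append(piece)
        elif chunks:
            chunks[-1] += piece  # fold a too-short tail into the previous chunk
        else:
            chunks.append(piece)  # whole text is shorter than min_size
        if start + max_size >= len(text):
            break
    return chunks
```

With the defaults, consecutive chunks share their last and first 50 characters, which helps retrieval when an answer straddles a chunk boundary.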
ragBIS creates the following directory structure:
data/
├── raw/ # Raw scraped content
│ ├── index.txt
│ ├── installation.txt
│ └── ...
└── processed/ # Processed data for RAG
├── chunks.json # Main data file with embeddings
└── chunks.csv # Metadata without embeddings
- chunks.json: Contains all processed chunks with embeddings, titles, URLs, and content
- chunks.csv: Contains chunk metadata without embeddings for easy inspection
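The JSON output can be consumed with the standard library alone. The field names below (`title`, `url`, `content`, `embedding`) are assumptions inferred from the description above; check the actual file for the exact keys:

```python
import json
import os
import tempfile

# Hypothetical record layout inferred from the file descriptions;
# the real chunks.json keys may differ.
chunk = {
    "title": "Installation",
    "url": "https://openbis.readthedocs.io/en/latest/installation.html",
    "content": "Install openBIS by ...",
    "embedding": [0.12, -0.03, 0.55],  # truncated for illustration
}

out = os.path.join(tempfile.mkdtemp(), "chunks.json")
with open(out, "w", encoding="utf-8") as f:
    json.dump([chunk], f)

with open(out, encoding="utf-8") as f:
    chunks = json.load(f)
print(len(chunks), chunks[0]["title"])  # → 1 Installation
```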
The processed data from ragBIS is designed to be used with chatBIS, the conversational interface; the chatBIS repository is accessible here. After running ragBIS, you can:
- Copy the `data` directory to your chatBIS project
- Or point chatBIS to the ragBIS output directory
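Once loaded, a RAG consumer such as chatBIS would typically embed the user's question and rank chunks by similarity to it. A minimal cosine-similarity sketch (not chatBIS's actual retrieval code):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_embedding, chunks, k=3):
    # chunks: list of dicts carrying an "embedding" key, as in chunks.json
    ranked = sorted(chunks, key=lambda c: cosine(query_embedding, c["embedding"]), reverse=True)
    return ranked[:k]
```

The top-ranked chunks would then be passed to the language model as context for answering the question.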
- `OLLAMA_HOST`: Ollama server host (default: localhost)
- `OLLAMA_PORT`: Ollama server port (default: 11434)
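A minimal sketch of how a client might honor these variables; the actual ragBIS configuration handling may differ:

```python
import os

# Read the documented environment variables, falling back to the defaults.
host = os.environ.get("OLLAMA_HOST", "localhost")
port = int(os.environ.get("OLLAMA_PORT", "11434"))
base_url = f"http://{host}:{port}"
print(base_url)
```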
You can modify the scraping behavior by editing the scraper configuration in the source code:
- Target different documentation versions
- Adjust content selectors for different site layouts
- Modify delay and retry settings
- Ollama Connection Error
  - Ensure Ollama is running: `ollama serve`
  - Check if the model is installed: `ollama list`
  - Install the model if missing: `ollama pull nomic-embed-text`
- Memory Issues
  - Reduce `--max-pages` for large documentation sites
  - Increase `--min-chunk-size` to create fewer chunks
  - Process in smaller batches
- Network Issues
  - Increase `--delay` between requests
  - Check your internet connection
  - Verify the documentation URL is accessible
Enable verbose logging to debug issues:
python -m ragbis --verbose
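Roughly, `--verbose` amounts to switching the log level from INFO to DEBUG. A sketch with Python's standard logging module (the real ragbis setup may differ):

```python
import logging

def configure_logging(verbose: bool) -> None:
    # Illustrative: DEBUG-level output when verbose, INFO otherwise.
    # force=True (Python 3.8+) replaces any existing handlers.
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s: %(message)s",
        force=True,
    )

configure_logging(verbose=True)
logging.getLogger("ragbis").debug("debug output enabled")
```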
- Run the tests: pytest
- Format the code: black src/
- Type-check: mypy src/
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
For issues and questions:
- Create an issue on GitHub
- Check the troubleshooting section above
- Ensure Ollama is properly configured