ragBIS - Data Acquisition and Processing for openBIS Documentation

ragBIS is a standalone Python package that scrapes openBIS documentation, processes the content, and generates embeddings for use in RAG (Retrieval Augmented Generation) applications.

Features

Web Scraping: Automatically scrapes openBIS documentation from Read the Docs
Content Processing: Intelligently chunks content while preserving document structure
Embedding Generation: Creates embeddings using Ollama's nomic-embed-text model
Data Export: Saves processed data in JSON and CSV formats for easy consumption

Prerequisites

Python 3.8 or higher
Ollama installed and running
The nomic-embed-text model installed in Ollama

Installing Ollama and Required Models

Install Ollama from https://ollama.ai/
Pull the required embedding model:
```
ollama pull nomic-embed-text
```
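To check from Python that the model is available, one option is to query the `/api/tags` endpoint of a running Ollama server, which lists locally installed models. The sketch below is not part of ragBIS: the helper names `model_installed` and `fetch_tags` are hypothetical, and the base URL assumes the default local server.

```python
import json
import urllib.request


def model_installed(tags_payload, model_name):
    """Return True if model_name appears in an /api/tags response payload."""
    for model in tags_payload.get("models", []):
        # Installed names usually carry a tag suffix, e.g. "nomic-embed-text:latest".
        installed = model.get("name", "")
        if installed == model_name or installed.split(":")[0] == model_name:
            return True
    return False


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the installed-model list from a running Ollama server (assumed URL)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)
```

With Ollama running, `model_installed(fetch_tags(), "nomic-embed-text")` should report whether the pull succeeded; a connection failure surfaces as an `OSError`.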

Installation

From Source

Clone or download this project
Navigate to the ragBIS_project directory

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```

Using pip (if published)

pip install ragbis

Usage

Basic Usage

Run ragBIS with default settings to scrape and process openBIS documentation:

python -m ragbis

This will:

Scrape the openBIS documentation from the default URL
Save raw content to ./data/raw/
Process and generate embeddings
Save processed data to ./data/processed/

Command Line Options

python -m ragbis --help

Available options:

--url URL: Base URL to scrape (default: https://openbis.readthedocs.io/en/latest/)
--output-dir DIR: Output directory for data (default: ./data)
--max-pages N: Maximum number of pages to scrape (default: 100)
--delay SECONDS: Delay between requests (default: 0.5)
--force-rebuild: Force rebuild even if processed data exists
--min-chunk-size N: Minimum chunk size in characters (default: 100)
--max-chunk-size N: Maximum chunk size in characters (default: 1000)
--chunk-overlap N: Chunk overlap in characters (default: 50)
--verbose: Enable verbose logging
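For readers who want to drive the same options from their own scripts, the list above maps naturally onto an `argparse` parser. This is a sketch reconstructed from the documented flags and defaults, not the parser ragBIS actually ships; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options and defaults (assumed layout)."""
    parser = argparse.ArgumentParser(prog="ragbis")
    parser.add_argument("--url", default="https://openbis.readthedocs.io/en/latest/",
                        help="Base URL to scrape")
    parser.add_argument("--output-dir", default="./data",
                        help="Output directory for data")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum number of pages to scrape")
    parser.add_argument("--delay", type=float, default=0.5,
                        help="Delay between requests in seconds")
    parser.add_argument("--force-rebuild", action="store_true",
                        help="Force rebuild even if processed data exists")
    parser.add_argument("--min-chunk-size", type=int, default=100,
                        help="Minimum chunk size in characters")
    parser.add_argument("--max-chunk-size", type=int, default=1000,
                        help="Maximum chunk size in characters")
    parser.add_argument("--chunk-overlap", type=int, default=50,
                        help="Chunk overlap in characters")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable verbose logging")
    return parser
```

Calling `build_parser().parse_args([])` yields the defaults listed above, with dashes in flag names converted to underscores (for example `args.max_pages`).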

Examples

Scrape with custom settings:

python -m ragbis --max-pages 200 --output-dir ./my_data --verbose

Force rebuild existing data:

python -m ragbis --force-rebuild

Custom chunking parameters:

python -m ragbis --min-chunk-size 200 --max-chunk-size 1500 --chunk-overlap 100
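To make the three chunking parameters concrete, here is a simplified character-window chunker. It is an illustration only: ragBIS is described as preserving document structure when chunking, which this sketch does not attempt, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, min_chunk_size=100, max_chunk_size=1000, chunk_overlap=50):
    """Split text into overlapping character windows (simplified illustration)."""
    if chunk_overlap >= max_chunk_size:
        raise ValueError("chunk_overlap must be smaller than max_chunk_size")
    chunks = []
    step = max_chunk_size - chunk_overlap  # how far each window advances
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chunk_size]
        if len(chunk) < min_chunk_size and chunks:
            # Merge a too-short tail into the previous chunk so no text is lost;
            # that chunk may then slightly exceed max_chunk_size.
            chunks[-1] = text[start - step:]
            break
        chunks.append(chunk)
        if start + max_chunk_size >= len(text):
            break
    return chunks
```

Each chunk repeats the last `chunk_overlap` characters of its predecessor, so a sentence cut at a window boundary still appears whole in at least one chunk when it is shorter than the overlap.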

Output Structure

ragBIS creates the following directory structure:

data/
├── raw/                    # Raw scraped content
│   ├── index.txt
│   ├── installation.txt
│   └── ...
└── processed/             # Processed data for RAG
    ├── chunks.json        # Main data file with embeddings
    └── chunks.csv         # Metadata without embeddings

Output Files

chunks.json: Contains all processed chunks with embeddings, titles, URLs, and content
chunks.csv: Contains chunk metadata without embeddings for easy inspection
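A downstream RAG application would typically load `chunks.json` and rank chunks by similarity to a query embedding. The sketch below assumes the file is a JSON list of objects with an `embedding` key holding a list of floats; that layout and key name are assumptions, so inspect the generated file before relying on them. The helper names are hypothetical.

```python
import json
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(chunks, query_embedding, k=3):
    """Return the k (score, chunk) pairs most similar to the query embedding."""
    scored = [
        (cosine_similarity(chunk["embedding"], query_embedding), chunk)  # assumed key
        for chunk in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def load_chunks(path="./data/processed/chunks.json"):
    """Load processed chunks; assumes a JSON list of chunk objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

The query embedding must come from the same model used to build the file (nomic-embed-text), otherwise the similarity scores are not meaningful.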

Integration with chatBIS

The processed data from ragBIS is designed to be used with chatBIS, the conversational interface. The chatBIS repo is accessible here. After running ragBIS, you can:

Copy the data directory to your chatBIS project
Or point chatBIS to the ragBIS output directory

Configuration

Environment Variables

OLLAMA_HOST: Ollama server host (default: localhost)
OLLAMA_PORT: Ollama server port (default: 11434)
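As an illustration of how these two variables might be combined, the sketch below builds a base URL from them and requests an embedding through Ollama's `/api/embeddings` endpoint. The `http://` scheme, the helper names `ollama_base_url` and `embed`, and the way ragBIS itself reads these variables are assumptions.

```python
import json
import os
import urllib.request


def ollama_base_url(env=None):
    """Build the Ollama base URL from OLLAMA_HOST and OLLAMA_PORT with the documented defaults."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "localhost")
    port = env.get("OLLAMA_PORT", "11434")
    return f"http://{host}:{port}"  # plain HTTP is an assumption


def embed(text, base_url, model="nomic-embed-text"):
    """Request an embedding vector for text from a running Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["embedding"]
```

With the server running, `embed("What is openBIS?", ollama_base_url())` should return a list of floats.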

Customizing the Scraper

You can modify the scraping behavior by editing the scraper configuration in the source code:

Target different documentation versions
Adjust content selectors for different site layouts
Modify delay and retry settings
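The interplay of base URL, page limit, and request delay can be sketched as a small breadth-first crawler. This is not the ragBIS scraper: the names `LinkParser` and `crawl` are hypothetical, and `fetch` is an injected function returning HTML text for a URL, so any HTTP client (or a test stub) can be plugged in.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(base_url, fetch, max_pages=100, delay=0.5):
    """Breadth-first crawl of pages under base_url; returns {url: html}."""
    queue = [base_url]
    seen = {base_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        pages[url] = fetch(url)
        parser = LinkParser()
        parser.feed(pages[url])
        for href in parser.links:
            # Resolve relative links and drop #fragments so each page is visited once.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith(base_url) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        if queue and len(pages) < max_pages:
            time.sleep(delay)  # be polite to the documentation server
    return pages
```

Pointing `base_url` at a different documentation version keeps the crawl inside that version, because links outside the prefix are skipped.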

Troubleshooting

Common Issues

Ollama Connection Error
- Ensure Ollama is running: ollama serve
- Check if the model is installed: ollama list
- Install the model if missing: ollama pull nomic-embed-text

Memory Issues
- Reduce --max-pages for large documentation sites
- Increase --min-chunk-size to create fewer chunks
- Process in smaller batches

Network Issues
- Increase --delay between requests
- Check your internet connection
- Verify the documentation URL is accessible

Logging

Enable verbose logging to debug issues:

python -m ragbis --verbose

Development

Running Tests

pytest

Code Formatting

black src/

Type Checking

mypy src/

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

Support

For issues and questions:

Create an issue on GitHub
Check the troubleshooting section above
Ensure Ollama is properly configured


carlosmada22/ragBIS
