A Python-based framework for evaluating Large Language Models (LLMs) based on Anthropic's research paper and using the DRACO AI dataset.
- Quick Start
- Features
- Installation
- Configuration
- Usage
- Project Structure
- Example Plots
- Contributing
- Troubleshooting
- License
- Clone the repo
- Set up environment variables
- Install dependencies
- Run `python main.py`
- Multiple Model Support: OpenAI, Anthropic, Together AI, Groq, OpenRouter, Gemini, HuggingFace
- Evaluation Metrics: Completeness, relevance, conciseness, confidence, factuality, judgement, and custom
- RAG Implementation: FAISS vectorstore with BGE embeddings and reranking (see the sketch after this list)
- Tool Usage: Code execution, simulation running, SmolAgents integration
- Multiple Judges: Support for secondary judge models
- Statistical Analysis: Comprehensive statistics and visualization
- Cross-Platform: Windows, macOS, and Linux support
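For context, below is a minimal, illustrative sketch of the retrieval pipeline described by the RAG feature (FAISS index over BGE embeddings, followed by reranking). It is not the framework's actual implementation (see `src/llm_eval/utils/rag.py` for that), and the model names and example documents are assumptions.

```python
# Illustrative only: minimal FAISS + BGE retrieval/rerank loop.
# The framework's real implementation lives in src/llm_eval/utils/rag.py;
# the model names and documents below are assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

documents = ["Pipe roughness affects head loss.", "EPANET simulates water networks."]
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")   # BGE embedding model (assumed variant)
reranker = CrossEncoder("BAAI/bge-reranker-base")          # BGE reranker (assumed variant)

# Build the FAISS index over normalized embeddings (inner product ~= cosine similarity).
doc_vecs = embedder.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(np.asarray(doc_vecs, dtype=np.float32))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieve top-k documents with FAISS, then reorder them with the reranker."""
    q_vec = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_vec, dtype=np.float32), k)
    candidates = [documents[i] for i in ids[0]]
    scores = reranker.predict([(query, doc) for doc in candidates])
    return [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]

print(retrieve("How is head loss modeled?"))
```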
- Clone the repository:

  ```bash
  git clone https://github.com/nsourlos/LLM_evaluation_framework.git
  cd LLM_evaluation_framework
  ```
- Create and activate a virtual environment:

  ```bash
  python -m venv DRACO
  source DRACO/bin/activate  # On Windows, use `DRACO\Scripts\activate`
  ```
- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  # Optionally: ipywidgets==8.1.7 for running in a Jupyter notebook
  # Optionally: flash-attn==2.6.3 for GPU support
  ```
- (Optional) Use the environment within a Jupyter notebook:

  ```bash
  pip install ipykernel
  python -m ipykernel install --user --name=DRACO --display-name "Python (DRACO)"
  ```
- (Optional) Set up the code execution environment:

  ```bash
  # When using code execution features, a separate environment is needed
  # to safely run generated code without conflicts
  conda create -n test_LLM python==3.10 -y
  conda activate test_LLM
  pip install -r data/requirements_code_execution.txt
  ```

  Note: If using venv instead of conda, the paths in `src/llm_eval/utils/paths.py` must be modified to point to the correct venv location.

  This creates an isolated environment for running generated code, preventing potential conflicts with the main evaluation environment (a conceptual sketch of such execution is shown below).
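Conceptually, running generated code inside that isolated interpreter could look like the following sketch. This is illustrative only: the framework's actual logic lives in `src/llm_eval/tools/code_execution.py`, and the interpreter path shown is an assumption for a typical conda install.

```python
# Illustrative sketch only: how generated code might be run in the isolated
# test_LLM environment. The framework's actual logic lives in
# src/llm_eval/tools/code_execution.py; the interpreter path is an assumption.
import os
import subprocess
import tempfile

def run_generated_code(code: str,
                       python_bin: str = "~/miniconda3/envs/test_LLM/bin/python",
                       timeout: int = 60) -> str:
    """Write the generated code to a temporary file and execute it with the isolated interpreter."""
    python_bin = os.path.expanduser(python_bin)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script_path = f.name
    try:
        result = subprocess.run([python_bin, script_path],
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout if result.returncode == 0 else result.stderr
    finally:
        os.remove(script_path)

print(run_generated_code("print(2 + 2)"))
```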
- Rename `env_example` to `env` and add your API keys:
  ```bash
  OPENAI_API_KEY="your_openai_api_key"
  GEMINI_API_KEY="your_gemini_api_key"
  TOGETHER_API_KEY="your_together_api_key"
  GROQ_API_KEY="your_groq_api_key"
  ANTHROPIC_API_KEY="your_anthropic_api_key"
  HF_TOKEN="your_huggingface_token"
  OPEN_ROUTER_API_KEY="your_openrouter_api_key"
  ```
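To verify the keys are picked up, a quick check like the one below can help. This assumes the keys are loaded with `python-dotenv` (a common pattern, not confirmed by the framework) and that the file is named `env`; adjust as needed.

```python
# Quick sanity check that the API keys are visible to Python.
# Assumes python-dotenv is installed and the keys live in a file named "env";
# adjust the filename if the framework expects a different one.
import os
from dotenv import load_dotenv

load_dotenv("env")  # loads key=value pairs into the process environment

for key in ["OPENAI_API_KEY", "GEMINI_API_KEY", "TOGETHER_API_KEY", "GROQ_API_KEY",
            "ANTHROPIC_API_KEY", "HF_TOKEN", "OPEN_ROUTER_API_KEY"]:
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```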
Edit `src/llm_eval/utils/paths.py` to set your system-specific paths:
- For the corresponding OS, set `base_path` and `venv_path` (an illustrative sketch follows)
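As an illustration, the OS-specific settings might look like the sketch below. The actual structure of `paths.py` may differ, and the paths shown are placeholders.

```python
# Illustrative sketch of OS-specific path settings; the actual structure of
# src/llm_eval/utils/paths.py may differ. All paths below are placeholders.
import platform

if platform.system() == "Windows":
    base_path = r"C:\Users\your_user\LLM_evaluation_framework"
    venv_path = r"C:\Users\your_user\envs\test_LLM"
elif platform.system() == "Darwin":  # macOS
    base_path = "/Users/your_user/LLM_evaluation_framework"
    venv_path = "/Users/your_user/miniconda3/envs/test_LLM"
else:  # Linux
    base_path = "/home/your_user/LLM_evaluation_framework"
    venv_path = "/home/your_user/miniconda3/envs/test_LLM"
```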
Edit `src/llm_eval/config.py` to configure the following (an illustrative sketch follows this list):
- `excel_file_name`: Your dataset Excel file (must be provided by the user)
- `embedding_model`: Model for RAG embeddings
- `reranker_model_name`: Model for reranking
- `models`: List of models to evaluate (e.g. OpenAI, Together, Gemini models)
- `judge_model`: Models used to judge the results
- `commercial_api_providers`: Used to distinguish commercial and HuggingFace models
- `max_output_tokens`: Maximum tokens in judge LLM output
- `generate_max_tokens`: Token limit for regular model responses
- `generation_max_tokens_thinking`: Token limit for reasoning model responses
- `domain`: Domain of evaluation (e.g. "Water" Engineering)
- `n_resamples`: Number of times to resample the dataset
- `continue_from_resample`: Which resample iteration to continue from
- `tool_usage`: Enable/disable tool usage for answering questions
- `use_RAG`: Enable/disable RAG (Retrieval Augmented Generation)
- `use_smolagents`: Enable/disable SmolAgents for code execution
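A hedged sketch of what these settings might look like; every value here is illustrative and not the framework's actual default.

```python
# Illustrative values only; the actual defaults in src/llm_eval/config.py may differ.
excel_file_name = "my_dataset.xlsx"               # provided by the user
embedding_model = "BAAI/bge-small-en-v1.5"        # RAG embedding model (assumed name)
reranker_model_name = "BAAI/bge-reranker-base"    # reranking model (assumed name)

models = ["gpt-4o-mini", "gemini-1.5-flash"]      # models to evaluate
judge_model = ["gpt-4o"]                          # judge model(s)
commercial_api_providers = ["openai", "anthropic", "together", "groq", "openrouter", "gemini"]

max_output_tokens = 1024                # judge LLM output limit
generate_max_tokens = 1024              # regular model responses
generation_max_tokens_thinking = 4096   # reasoning model responses

domain = "Water"                        # evaluation domain
n_resamples = 3                         # dataset resampling runs
continue_from_resample = 0              # resample iteration to resume from

tool_usage = False
use_RAG = True
use_smolagents = False
```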
The input Excel file must contain at least two columns (a minimal example of creating such a file is shown after this list):
- `input`: The questions or prompts to evaluate
- `output`: The expected answers or ground truth

Additional columns may be added:
- `id`: Column to uniquely identify questions
- `origin_file`: The JSON file from which the question-answer pair was extracted
- `topic/subtopic`: The topic/subtopic of the question
- `Reference`: Information from where the question-answer pair was obtained
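To illustrate the expected format, here is a minimal sketch that creates a compatible Excel file with pandas. The row contents are made up; only the `input` and `output` columns are required.

```python
# Minimal sketch of a compatible dataset file; the row below is a made-up example.
# Only "input" and "output" are required; the remaining columns are optional.
import pandas as pd

df = pd.DataFrame({
    "input": ["What does EPANET simulate?"],
    "output": ["Hydraulic and water-quality behavior of pressurized pipe networks."],
    "id": [1],                                   # optional: unique question id
    "origin_file": ["example_questions.json"],   # optional: source JSON file
    "topic/subtopic": ["Hydraulics/Simulation"], # optional
    "Reference": ["EPANET documentation"],       # optional
})
df.to_excel("my_dataset.xlsx", index=False)      # point excel_file_name in config.py to this file
```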
- Configure parameters:
  - Set up your environment variables in `env_example` and rename it to `env`
  - Configure paths in `src/llm_eval/utils/paths.py`
  - Modify prompts and the list of metrics in `src/llm_eval/evaluation/prompts.py`
  - Adjust parameters in `src/llm_eval/config.py`
- Run the evaluation:

  ```bash
  python main.py
  # Optionally: python main.py | tee data/log.txt   (saves terminal output to a txt file)
  ```
The script will:
- Load and process your Excel dataset
- Run evaluations on specified models
- Generate Excel results files
- Create JSON files for statistics
- Produce visualization plots
```
llm_evaluation_framework/
├── src/
│   └── llm_eval/
│       ├── config.py                 # All configuration parameters
│       ├── core/
│       │   ├── data_loader.py        # Functions for loading data and models
│       │   └── model_utils.py        # Model initialization and utilities
│       ├── evaluation/
│       │   ├── evaluator.py          # Evaluation functions
│       │   └── prompts.py            # All evaluation prompt strings
│       ├── providers/
│       │   └── api_handlers.py       # Helper functions for LLM APIs
│       ├── tools/
│       │   ├── code_execution.py     # Logic for tool handling
│       │   └── tool_usage.py         # Tool usage definition and decision logic
│       └── utils/
│           ├── paths.py              # OS-specific path configurations
│           ├── plotting.py           # Visualization functions
│           ├── processing.py         # Processing and Excel file creation
│           ├── rag.py                # RAG implementation
│           ├── scoring.py            # Scoring utilities
│           └── statistics.py         # Statistical calculations
├── notebooks/
│   └── convert_DRACO_to_excel.ipynb  # Create Excel file from JSON files with question-answer pairs
├── data/
│   ├── requirements_code_execution.txt   # Dependencies for code execution environment
│   ├── network_0.inp                 # Input file for network comparison
│   ├── network_test.inp              # Input file for network testing scenarios
│   ├── compare_networks_test.py      # Test script for network comparison functionality
│   └── compare_networks.py           # Main network comparison implementation
├── runpod/
│   ├── README_runpod.md              # RunPod instructions
│   └── runpod_initialize.ipynb       # Notebook that automatically initializes RunPod and copies files to it
├── example_imgs/
│   ├── metric_comparison_grid.png    # Example image of a comparison grid of models for different metrics
│   ├── model_performance_summary.png # Example image of metric comparisons between models
│   ├── model_statistical_comparisons.png  # Example image of statistical comparisons between models
│   └── spider_chart_judge_deepseek-ai_DeepSeek-V3.png  # Example image of spider graph comparisons between metrics for different models
├── main.py                           # Main script
├── env_example                       # Environment variables (to be renamed to env)
├── requirements.txt                  # Dependencies
└── README.md                         # This file
```
The framework generates various visualization plots to help analyze the evaluation results (see `example_imgs/`). Example plots comparing two models include:
- Overall performance summary of evaluated models
- Spider chart showing metric distribution
- Comparison of different metrics across models
- Statistical comparison between models with p-values
When making changes:
- Maintain backward compatibility
- Preserve original function signatures
- Keep all comments and logging
- Remove Langsmith
- Replace txt saves with logging
All operations are logged in txt files to track errors. To modify the list of metrics to be evaluated, change `list_of_metrics` in `src/llm_eval/evaluation/prompts.py`.
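For illustration, the metric list might look like the sketch below. The metric names are taken from the Features section; the actual contents and format of `list_of_metrics` may differ.

```python
# Illustrative only: the actual list_of_metrics in src/llm_eval/evaluation/prompts.py may differ.
list_of_metrics = [
    "completeness",
    "relevance",
    "conciseness",
    "confidence",
    "factuality",
    "judgement",
    # add custom metrics here, each paired with its own judge prompt
]
```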
To be decided ....