A Python-based framework for evaluating Large Language Models (LLMs) based on Anthropic's research paper and using the DRACO AI dataset.
- Quick Start
- Features
- Installation
- Configuration
- Usage
- Project Structure
- Example Plots
- Contributing
- Troubleshooting
- License
- Clone the repo
- Set up environment variables
- Install dependencies
- Run `python main.py`
- Multiple Model Support: OpenAI, Anthropic, Together AI, Groq, OpenRouter, Gemini, HuggingFace
- Evaluation Metrics: Completeness, relevance, conciseness, confidence, factuality, judgement, and custom
- RAG Implementation: FAISS vectorstore with BGE embeddings and reranking (see the sketch after this list)
- Tool Usage: Code execution, simulation running, SmolAgents integration
- Multiple Judges: Support for secondary judge models
- Statistical Analysis: Comprehensive statistics and visualization
- Cross-Platform: Windows, macOS, and Linux support
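For context, below is a minimal, illustrative sketch of the retrieval pipeline described by the RAG feature (FAISS index over BGE embeddings, followed by reranking). It is not the framework's actual implementation (see `src/llm_eval/utils/rag.py` for that), and the model names and example documents are assumptions.

```python
# Illustrative only: minimal FAISS + BGE retrieval/rerank loop.
# The framework's real implementation lives in src/llm_eval/utils/rag.py;
# the model names and documents below are assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

documents = ["Pipe roughness affects head loss.", "EPANET simulates water networks."]
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")   # BGE embedding model (assumed variant)
reranker = CrossEncoder("BAAI/bge-reranker-base")          # BGE reranker (assumed variant)

# Build the FAISS index over normalized embeddings (inner product ~= cosine similarity).
doc_vecs = embedder.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(np.asarray(doc_vecs, dtype=np.float32))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieve top-k documents with FAISS, then reorder them with the reranker."""
    q_vec = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_vec, dtype=np.float32), k)
    candidates = [documents[i] for i in ids[0]]
    scores = reranker.predict([(query, doc) for doc in candidates])
    return [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]

print(retrieve("How is head loss modeled?"))
```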
- Clone the repository:

  ```bash
  git clone https://github.com/nsourlos/LLM_evaluation_framework.git
  cd LLM_evaluation_framework
  ```
- Create and activate a virtual environment:

  ```bash
  python -m venv DRACO
  source DRACO/bin/activate  # On Windows, use `DRACO\Scripts\activate`
  ```
- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  # Optionally: ipywidgets==8.1.7 for running in a Jupyter notebook
  # Optionally: flash-attn==2.6.3 for GPU support
  ```
- (Optional) Use the environment within a Jupyter notebook:

  ```bash
  pip install ipykernel
  python -m ipykernel install --user --name=DRACO --display-name "Python (DRACO)"
  ```
- (Optional) Set up the code execution environment:

  ```bash
  # When using code execution features, a separate environment is needed
  # to safely run generated code without conflicts
  conda create -n test_LLM python==3.10 -y
  conda activate test_LLM
  pip install -r data/requirements_code_execution.txt
  ```

  Note: If using venv instead of conda, the paths in `src/llm_eval/utils/paths.py` must be modified to point to the correct venv location.

  This creates an isolated environment for running generated code, preventing potential conflicts with the main evaluation environment (a conceptual sketch of such execution is shown below).
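Conceptually, running generated code inside that isolated interpreter could look like the following sketch. This is illustrative only: the framework's actual logic lives in `src/llm_eval/tools/code_execution.py`, and the interpreter path shown is an assumption for a typical conda install.

```python
# Illustrative sketch only: how generated code might be run in the isolated
# test_LLM environment. The framework's actual logic lives in
# src/llm_eval/tools/code_execution.py; the interpreter path is an assumption.
import os
import subprocess
import tempfile

def run_generated_code(code: str,
                       python_bin: str = "~/miniconda3/envs/test_LLM/bin/python",
                       timeout: int = 60) -> str:
    """Write the generated code to a temporary file and execute it with the isolated interpreter."""
    python_bin = os.path.expanduser(python_bin)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script_path = f.name
    try:
        result = subprocess.run([python_bin, script_path],
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout if result.returncode == 0 else result.stderr
    finally:
        os.remove(script_path)

print(run_generated_code("print(2 + 2)"))
```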
- Rename `env_example` to `env` and add your API keys:
  ```bash
  OPENAI_API_KEY="your_openai_api_key"
  GEMINI_API_KEY="your_gemini_api_key"
  TOGETHER_API_KEY="your_together_api_key"
  GROQ_API_KEY="your_groq_api_key"
  ANTHROPIC_API_KEY="your_anthropic_api_key"
  HF_TOKEN="your_huggingface_token"
  OPEN_ROUTER_API_KEY="your_openrouter_api_key"
  ```
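To verify the keys are picked up, a quick check like the one below can help. This assumes the keys are loaded with `python-dotenv` (a common pattern, not confirmed by the framework) and that the file is named `env`; adjust as needed.

```python
# Quick sanity check that the API keys are visible to Python.
# Assumes python-dotenv is installed and the keys live in a file named "env";
# adjust the filename if the framework expects a different one.
import os
from dotenv import load_dotenv

load_dotenv("env")  # loads key=value pairs into the process environment

for key in ["OPENAI_API_KEY", "GEMINI_API_KEY", "TOGETHER_API_KEY", "GROQ_API_KEY",
            "ANTHROPIC_API_KEY", "HF_TOKEN", "OPEN_ROUTER_API_KEY"]:
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```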
Edit `src/llm_eval/utils/paths.py` to set your system-specific paths:
- For the corresponding OS, set `base_path` and `venv_path` (an illustrative sketch follows)
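As an illustration, the OS-specific settings might look like the sketch below. The actual structure of `paths.py` may differ, and the paths shown are placeholders.

```python
# Illustrative sketch of OS-specific path settings; the actual structure of
# src/llm_eval/utils/paths.py may differ. All paths below are placeholders.
import platform

if platform.system() == "Windows":
    base_path = r"C:\Users\your_user\LLM_evaluation_framework"
    venv_path = r"C:\Users\your_user\envs\test_LLM"
elif platform.system() == "Darwin":  # macOS
    base_path = "/Users/your_user/LLM_evaluation_framework"
    venv_path = "/Users/your_user/miniconda3/envs/test_LLM"
else:  # Linux
    base_path = "/home/your_user/LLM_evaluation_framework"
    venv_path = "/home/your_user/miniconda3/envs/test_LLM"
```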
Edit `src/llm_eval/config.py` to configure the following (an illustrative sketch follows this list):
- `excel_file_name`: Your dataset Excel file (must be provided by the user)
- `embedding_model`: Model for RAG embeddings
- `reranker_model_name`: Model for reranking
- `models`: List of models to evaluate (e.g. OpenAI, Together, Gemini models)
- `judge_model`: Models used to judge the results
- `commercial_api_providers`: Used to distinguish commercial and HuggingFace models
- `max_output_tokens`: Maximum tokens in judge LLM output
- `generate_max_tokens`: Token limit for regular model responses
- `generation_max_tokens_thinking`: Token limit for reasoning model responses
- `domain`: Domain of evaluation (e.g. "Water" Engineering)
- `n_resamples`: Number of times to resample the dataset
- `continue_from_resample`: Which resample iteration to continue from
- `tool_usage`: Enable/disable tool usage for answering questions
- `use_RAG`: Enable/disable RAG (Retrieval Augmented Generation)
- `use_smolagents`: Enable/disable SmolAgents for code execution
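A hedged sketch of what these settings might look like; every value here is illustrative and not the framework's actual default.

```python
# Illustrative values only; the actual defaults in src/llm_eval/config.py may differ.
excel_file_name = "my_dataset.xlsx"               # provided by the user
embedding_model = "BAAI/bge-small-en-v1.5"        # RAG embedding model (assumed name)
reranker_model_name = "BAAI/bge-reranker-base"    # reranking model (assumed name)

models = ["gpt-4o-mini", "gemini-1.5-flash"]      # models to evaluate
judge_model = ["gpt-4o"]                          # judge model(s)
commercial_api_providers = ["openai", "anthropic", "together", "groq", "openrouter", "gemini"]

max_output_tokens = 1024                # judge LLM output limit
generate_max_tokens = 1024              # regular model responses
generation_max_tokens_thinking = 4096   # reasoning model responses

domain = "Water"                        # evaluation domain
n_resamples = 3                         # dataset resampling runs
continue_from_resample = 0              # resample iteration to resume from

tool_usage = False
use_RAG = True
use_smolagents = False
```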
The input Excel file must contain at least two columns (a minimal example of creating such a file is shown after this list):
- `input`: The questions or prompts to evaluate
- `output`: The expected answers or ground truth

Additional columns may be added:
- `id`: Column to uniquely identify questions
- `origin_file`: The JSON file from which the question-answer pair was extracted
- `topic/subtopic`: The topic/subtopic of the question
- `Reference`: Information from where the question-answer pair was obtained
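To illustrate the expected format, here is a minimal sketch that creates a compatible Excel file with pandas. The row contents are made up; only the `input` and `output` columns are required.

```python
# Minimal sketch of a compatible dataset file; the row below is a made-up example.
# Only "input" and "output" are required; the remaining columns are optional.
import pandas as pd

df = pd.DataFrame({
    "input": ["What does EPANET simulate?"],
    "output": ["Hydraulic and water-quality behavior of pressurized pipe networks."],
    "id": [1],                                   # optional: unique question id
    "origin_file": ["example_questions.json"],   # optional: source JSON file
    "topic/subtopic": ["Hydraulics/Simulation"], # optional
    "Reference": ["EPANET documentation"],       # optional
})
df.to_excel("my_dataset.xlsx", index=False)      # point excel_file_name in config.py to this file
```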
- Configure parameters:
  - Set up your environment variables in `env_example` and rename it to `env`
  - Configure paths in `src/llm_eval/utils/paths.py`
  - Modify prompts and the list of metrics in `src/llm_eval/evaluation/prompts.py`
  - Adjust parameters in `src/llm_eval/config.py`
- Run the evaluation:

  ```bash
  python main.py
  # Optionally: python main.py | tee data/log.txt   (saves terminal output to a txt file)
  ```
The script will:
- Load and process your Excel dataset
- Run evaluations on specified models
- Generate Excel results files
- Create JSON files for statistics
- Produce visualization plots
```
llm_evaluation_framework/
├── src/
│   └── llm_eval/
│       ├── config.py                 # All configuration parameters
│       ├── core/
│       │   ├── data_loader.py        # Functions for loading data and models
│       │   └── model_utils.py        # Model initialization and utilities
│       ├── evaluation/
│       │   ├── evaluator.py          # Evaluation functions
│       │   └── prompts.py            # All evaluation prompt strings
│       ├── providers/
│       │   └── api_handlers.py       # Helper functions for LLM APIs
│       ├── tools/
│       │   ├── code_execution.py     # Logic for tool handling
│       │   └── tool_usage.py         # Tool usage definition and decision logic
│       └── utils/
│           ├── paths.py              # OS-specific path configurations
│           ├── plotting.py           # Visualization functions
│           ├── processing.py         # Processing and Excel file creation
│           ├── rag.py                # RAG implementation
│           ├── scoring.py            # Scoring utilities
│           └── statistics.py         # Statistical calculations
├── notebooks/
│   └── convert_DRACO_to_excel.ipynb  # Create Excel file from JSON files with question-answer pairs
├── data/
│   ├── requirements_code_execution.txt   # Dependencies for code execution environment
│   ├── network_0.inp                 # Input file for network comparison
│   ├── network_test.inp              # Input file for network testing scenarios
│   ├── compare_networks_test.py      # Test script for network comparison functionality
│   └── compare_networks.py           # Main network comparison implementation
├── runpod/
│   ├── README_runpod.md              # RunPod instructions
│   └── runpod_initialize.ipynb       # Notebook that automatically initializes RunPod and copies files to it
├── example_imgs/
│   ├── metric_comparison_grid.png    # Example image of a comparison grid of models for different metrics
│   ├── model_performance_summary.png # Example image of metric comparisons between models
│   ├── model_statistical_comparisons.png  # Example image of statistical comparisons between models
│   └── spider_chart_judge_deepseek-ai_DeepSeek-V3.png  # Example image of spider graph comparisons between metrics for different models
├── main.py                           # Main script
├── env_example                       # Environment variables (to be renamed to env)
├── requirements.txt                  # Dependencies
└── README.md                         # This file
```
The framework generates various visualization plots to help analyze the evaluation results (see `example_imgs/`). Example plots comparing two models include:
- Overall performance summary of evaluated models
- Spider chart showing metric distribution
- Comparison of different metrics across models
- Statistical comparison between models with p-values
When making changes:
- Maintain backward compatibility
- Preserve original function signatures
- Keep all comments and logging
- Remove Langsmith
- Replace txt saves with logging
All operations are logged in txt files to track errors. To modify the list of metrics to be evaluated, change `list_of_metrics` in `src/llm_eval/evaluation/prompts.py`.
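For illustration, the metric list might look like the sketch below. The metric names are taken from the Features section; the actual contents and format of `list_of_metrics` may differ.

```python
# Illustrative only: the actual list_of_metrics in src/llm_eval/evaluation/prompts.py may differ.
list_of_metrics = [
    "completeness",
    "relevance",
    "conciseness",
    "confidence",
    "factuality",
    "judgement",
    # add custom metrics here, each paired with its own judge prompt
]
```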
To be decided ....