LLM Evaluation Framework


A Python-based framework for evaluating Large Language Models (LLMs), based on Anthropic's research paper and using the DRACO AI dataset.

📋 Table of Contents

  • 🚀 Quick Start
  • ✨ Features
  • 📥 Installation
  • ⚙️ Configuration
  • 📝 Excel File Format
  • 🚀 Usage
  • 📁 Project Structure
  • 📊 Example Plots
  • 🤝 Contributing
  • ✅ To-Do
  • 🔧 Troubleshooting
  • 📄 License

🚀 Quick Start

  1. Clone the repo
  2. Set up environment variables
  3. Install dependencies
  4. Run `python main.py`

✨ Features

  • 🤖 Multiple Model Support: OpenAI, Anthropic, Together AI, Groq, OpenRouter, Gemini, HuggingFace
  • 📊 Evaluation Metrics: Completeness, relevance, conciseness, confidence, factuality, judgement, and custom
  • 🔍 RAG Implementation: FAISS vectorstore with BGE embeddings and reranking (see the sketch after this list)
  • 🛠️ Tool Usage: Code execution, simulation running, SmolAgents integration
  • ⚖️ Multiple Judges: Support for secondary judge models
  • 📈 Statistical Analysis: Comprehensive statistics and visualization
  • 🌐 Cross-Platform: Windows, macOS, and Linux support
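
The RAG feature above pairs a FAISS vectorstore with BGE embeddings and a reranker. As a rough, hedged illustration of the retrieval step only, not the framework's actual code in src/llm_eval/utils/rag.py, and with a placeholder model name and documents:

```python
# Minimal FAISS + BGE retrieval sketch using sentence-transformers and faiss.
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "Pipe roughness increases head loss in water distribution networks.",
    "EPANET simulates the hydraulic behaviour of distribution systems.",
]

# Embed the documents with a BGE model (model name is an example choice).
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

# An inner-product index over normalized vectors approximates cosine similarity.
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

# Retrieve the most relevant document for a question.
query_vector = embedder.encode(["How does pipe roughness influence head loss?"],
                               normalize_embeddings=True)
scores, ids = index.search(query_vector, 1)
print(documents[ids[0][0]], float(scores[0][0]))
```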

📥 Installation

  1. Clone the repository:

    git clone https://github.com/nsourlos/LLM_evaluation_framework.git
    cd LLM_evaluation_framework
  2. Create and activate a virtual environment:

    python -m venv DRACO
    source DRACO/bin/activate  # On Windows, use `DRACO\Scripts\activate`
  3. Install the dependencies:

    pip install -r requirements.txt
    # Optional: ipywidgets==8.1.7 for running in a Jupyter notebook
    # Optional: flash-attn==2.6.3 for GPU support
  4. (Optional) Register the environment as a Jupyter kernel:

    pip install ipykernel
    python -m ipykernel install --user --name=DRACO --display-name "Python (DRACO)"
  5. (Optional) Set up code execution environment:

    # When using code execution features, a separate environment is needed
    # to safely run generated code without conflicts
    conda create -n test_LLM python==3.10 -y
    conda activate test_LLM
    pip install -r data/requirements_code_execution.txt

    This creates an isolated environment for running generated code, preventing potential conflicts with the main evaluation environment.
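
For reference, generated code can be executed through the isolated environment's interpreter with a plain subprocess call. This is a minimal sketch assuming the test_LLM conda environment created above, not the framework's exact mechanism in tools/code_execution.py:

```python
# Hedged sketch: execute LLM-generated code inside the isolated "test_LLM" environment.
import subprocess
import tempfile

generated_code = "print(2 + 2)"  # placeholder for code produced by an LLM

# Write the generated code to a temporary script file.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(generated_code)
    script_path = f.name

# Run the script with the isolated environment's Python via `conda run`.
result = subprocess.run(
    ["conda", "run", "-n", "test_LLM", "python", script_path],
    capture_output=True, text=True, timeout=60,
)
print(result.stdout or result.stderr)
```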


βš™οΈ Configuration

πŸ”‘ Environment Variables

  1. Rename env_example to env and add your API keys:

    OPENAI_API_KEY="your_openai_api_key"
    GEMINI_API_KEY="your_gemini_api_key"
    TOGETHER_API_KEY="your_together_api_key"
    GROQ_API_KEY="your_groq_api_key"
    ANTHROPIC_API_KEY="your_anthropic_api_key"
    HF_TOKEN="your_huggingface_token"
    OPEN_ROUTER_API_KEY="your_openrouter_api_key"
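
At runtime the keys are typically read from this file with python-dotenv; the snippet below is a hedged sketch of that convention, not necessarily how the framework loads them:

```python
# Hedged sketch: load API keys from the renamed "env" file with python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv("env")  # the file is named "env" here rather than the usual ".env"
openai_key = os.getenv("OPENAI_API_KEY")
if not openai_key:
    raise RuntimeError("OPENAI_API_KEY is missing from the env file")
```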

📂 Path Configuration

Edit src/llm_eval/utils/paths.py to set your system-specific paths:

  • Set base_path and venv_path for your operating system (a hedged sketch is shown below)
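
A minimal sketch of what such OS-specific settings could look like; the actual structure of paths.py may differ and the example paths are placeholders:

```python
# Illustrative only: select base_path and venv_path per operating system.
import platform

system = platform.system()
if system == "Windows":
    base_path = r"C:\Users\your_user\LLM_evaluation_framework"
    venv_path = r"C:\Users\your_user\envs\DRACO"
elif system == "Darwin":  # macOS
    base_path = "/Users/your_user/LLM_evaluation_framework"
    venv_path = "/Users/your_user/envs/DRACO"
else:  # Linux
    base_path = "/home/your_user/LLM_evaluation_framework"
    venv_path = "/home/your_user/envs/DRACO"
```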

⚡ Parameters Configuration

Edit src/llm_eval/config.py to configure the following (a hedged example follows the list):

  • excel_file_name: Your dataset Excel file - must be provided by the user
  • embedding_model: Model for RAG embeddings
  • reranker_model_name: Model for reranking
  • models: List of models to evaluate (e.g. OpenAI, Together, Gemini models)
  • judge_model: Models used to judge the results
  • commercial_api_providers: Used to distinguish commercial API providers from HuggingFace models
  • max_output_tokens: Maximum tokens in judge LLM output
  • generate_max_tokens: Token limit for regular model responses
  • generation_max_tokens_thinking: Token limit for reasoning model responses
  • domain: Domain of evaluation (e.g. Water Engineering)
  • n_resamples: Number of times to resample the dataset
  • continue_from_resample: Which resample iteration to continue from
  • tool_usage: Enable/disable tool usage for answering questions
  • use_RAG: Enable/disable RAG (Retrieval Augmented Generation)
  • use_smolagents: Enable/disable SmolAgents for code execution
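
The hedged example below shows one plausible set of values; parameter names follow the list above, while the model names and numbers are placeholders rather than recommendations:

```python
# Hedged example of src/llm_eval/config.py values; all concrete values are placeholders.
excel_file_name = "my_dataset.xlsx"             # provided by the user
embedding_model = "BAAI/bge-small-en-v1.5"      # model for RAG embeddings
reranker_model_name = "BAAI/bge-reranker-base"  # model for reranking
models = ["gpt-4o-mini", "gemini-1.5-flash"]    # models to evaluate
judge_model = ["deepseek-ai/DeepSeek-V3"]       # judge model(s)
commercial_api_providers = ["openai", "anthropic", "together", "groq", "openrouter", "gemini"]
max_output_tokens = 1024                 # judge LLM output limit
generate_max_tokens = 1024               # regular model responses
generation_max_tokens_thinking = 4096    # reasoning model responses
domain = "Water Engineering"
n_resamples = 3
continue_from_resample = 0
tool_usage = False
use_RAG = True
use_smolagents = False
```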

πŸ“ Excel File Format

The input Excel file must contain at least two columns:

  • input: The questions or prompts to evaluate
  • output: The expected answers or ground truth

Additional columns may be added:

  • id: A unique identifier for each question
  • origin_file: The JSON file from which the question-answer pair was extracted
  • topic/subtopic: The topic/subtopic of the question
  • Reference: The source from which the question-answer pair was obtained
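
A minimal way to create such a file with pandas is shown below; the column names match the lists above, while the file name and cell contents are placeholders (writing .xlsx also requires openpyxl):

```python
# Illustrative: build a dataset Excel file with the required and optional columns.
import pandas as pd

rows = [
    {
        "id": 1,
        "input": "What does EPANET simulate?",
        "output": "The hydraulic and water-quality behaviour of distribution networks.",
        "origin_file": "example.json",              # optional
        "topic/subtopic": "Hydraulics/Simulation",  # optional
        "Reference": "Example source",              # optional
    },
]
pd.DataFrame(rows).to_excel("my_dataset.xlsx", index=False)
```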

🚀 Usage

  1. Configure the parameters in src/llm_eval/config.py (see Configuration above)

  2. Run the evaluation:

    python main.py
    # Optionally run `python main.py | tee data/log.txt` to save the terminal output to a txt file

The script will:

  • Load and process your Excel dataset
  • Run evaluations on specified models
  • Generate Excel results files
  • Create JSON files for statistics
  • Produce visualization plots
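
The generated artifacts can then be inspected with standard tooling; the snippet below is only a hedged sketch, and the output file names are placeholders rather than the framework's actual names:

```python
# Illustrative only: load a results Excel file and a statistics JSON produced by a run.
import json
import pandas as pd

results = pd.read_excel("data/results_example.xlsx")  # placeholder path
print(results.head())

with open("data/statistics_example.json") as f:       # placeholder path
    stats = json.load(f)
print(list(stats.keys()))
```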

πŸ“ Project Structure

llm_evaluation_framework/
├── src/
│   └── llm_eval/
│       ├── config.py               # All configuration parameters
│       ├── core/
│       │   ├── data_loader.py      # Functions for loading data and models
│       │   └── model_utils.py      # Model initialization and utilities
│       ├── evaluation/
│       │   ├── evaluator.py        # Evaluation functions
│       │   └── prompts.py          # All evaluation prompt strings
│       ├── providers/
│       │   └── api_handlers.py     # Helper functions for LLM APIs
│       ├── tools/
│       │   ├── code_execution.py   # Logic for tool handling
│       │   └── tool_usage.py       # Tool usage definition and decision logic
│       └── utils/
│           ├── paths.py            # OS-specific path configurations
│           ├── plotting.py         # Visualization functions
│           ├── processing.py       # Processing and Excel file creation
│           ├── rag.py              # RAG implementation
│           ├── scoring.py          # Scoring utilities
│           └── statistics.py       # Statistical calculations
├── notebooks/
│   └── convert_DRACO_to_excel.ipynb     # Create Excel file from JSON files with question-answer pairs
├── data/
│   ├── requirements_code_execution.txt  # Dependencies for code execution environment
│   ├── network_0.inp                    # Input file for network comparison
│   ├── network_test.inp                 # Input file for network testing scenarios
│   ├── compare_networks_test.py         # Test script for network comparison functionality
│   └── compare_networks.py              # Main network comparison implementation
├── runpod/
│   ├── README_runpod.md                 # RunPod instructions
│   └── runpod_initialize.ipynb          # Notebook that automatically initializes RunPod and copies files to it
├── example_imgs/
│   ├── metric_comparison_grid.png                      # Example image of a comparison grid of models for different metrics
│   ├── model_performance_summary.png                   # Example image of metric comparisons between models
│   ├── model_statistical_comparisons.png               # Example image of statistical comparisons between models
│   └── spider_chart_judge_deepseek-ai_DeepSeek-V3.png  # Example image of spider chart comparisons between metrics for different models
├── main.py                         # Main script
├── env_example                     # Environment variables (to be renamed to env)
├── requirements.txt                # Dependencies
└── README.md                       # This file

📊 Example Plots

The framework generates various visualization plots to help analyze the evaluation results. Here are some examples from a comparison of two models:

Model Performance Summary

Overall performance summary of the evaluated models (see example_imgs/model_performance_summary.png)

Spider Chart Analysis

Spider chart showing the metric distribution (see example_imgs/spider_chart_judge_deepseek-ai_DeepSeek-V3.png)

Metric Comparison Grid

Comparison of different metrics across models (see example_imgs/metric_comparison_grid.png)

Statistical Comparisons

Statistical comparison between models with p-values (see example_imgs/model_statistical_comparisons.png)


🤝 Contributing

When making changes:

  1. Maintain backward compatibility
  2. Preserve original function signatures
  3. Keep all comments and logging

✅ To-Do

  • Remove Langsmith
  • Replace txt saves with logging

🔧 Troubleshooting

All operations are logged to txt files to help track errors. To change which metrics are evaluated, modify `list_of_metrics` in `prompts.py`.
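
For example, narrowing the evaluated metrics might look like the hedged sketch below; the exact contents of the list in the repository may differ:

```python
# Hedged sketch: trimming list_of_metrics in src/llm_eval/evaluation/prompts.py.
list_of_metrics = [
    "completeness",
    "relevance",
    "conciseness",
    # "confidence",   # comment out metrics you do not want to evaluate
    # "factuality",
    # "judgement",
]
```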


📄 License


To be decided ....