LLM Evaluation Framework

Python React TypeScript FastAPI

A comprehensive, enterprise-ready framework for systematically evaluating Large Language Models (LLMs) with real-time analytics, cost optimization, and hallucination detection.

▶️ Watch the demo video on YouTube

Features

Advanced Evaluation Metrics

  • Multi-Model Support: Seamless integration with OpenAI, Anthropic, Google, and other major LLM providers (a minimal Python sketch follows this list)
  • Accuracy Benchmarking: Comprehensive testing on standardized datasets (GSM8K included)
  • Hallucination Detection: Automated factual consistency analysis using state-of-the-art techniques
  • Performance Analytics: Response time, token usage, and cost efficiency tracking
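
As a rough illustration of how multi-model support and accuracy benchmarking fit together, the sketch below defines a provider-agnostic client interface and a scoring loop. The names LLMClient and evaluate_accuracy are illustrative assumptions, not the framework's actual API.

    # Minimal sketch of provider-agnostic accuracy benchmarking.
    # LLMClient and evaluate_accuracy are illustrative names only.
    import time
    from abc import ABC, abstractmethod

    class LLMClient(ABC):
        """One interface so OpenAI, Anthropic, Google, etc. are interchangeable."""
        @abstractmethod
        def complete(self, prompt: str) -> str: ...

    def evaluate_accuracy(client: LLMClient, dataset: list[dict]) -> dict:
        correct, latencies = 0, []
        for item in dataset:
            start = time.perf_counter()
            prediction = client.complete(item["question"])
            latencies.append(time.perf_counter() - start)
            if item["answer"].strip() in prediction:  # placeholder scoring rule
                correct += 1
        return {
            "accuracy": correct / len(dataset),
            "avg_latency_s": sum(latencies) / len(latencies),
        }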

Interactive Dashboard & Analytics

  • Real-time Visualization: Dynamic charts and performance trends using Recharts
  • Comparative Analysis: Side-by-side model performance comparisons
  • Export Capabilities: JSON, CSV, and Excel export functionality
  • Advanced Filtering: Multi-dimensional data filtering and search

Production-Ready Architecture

  • FastAPI Backend: High-performance async API with automatic documentation (see the route sketch after this list)
  • React TypeScript Frontend: Modern, responsive UI with Material-UI components
  • Modular Design: Clean separation of concerns with service-layer architecture
  • Error Handling: Comprehensive error boundaries and retry mechanisms
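
A minimal sketch of the service-layer pattern described above, assuming a hypothetical /evaluations route; the real endpoints live under backend/app/api and may differ.

    # Sketch of a FastAPI route delegating to a service layer.
    # The route path and EvaluationService are assumptions for illustration.
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="LLM Evaluation Framework")

    class EvaluationRequest(BaseModel):
        model: str
        dataset: str = "gsm8k"

    class EvaluationService:
        def run(self, request: EvaluationRequest) -> dict:
            # Real logic would dispatch to the provider services and evaluators.
            return {"model": request.model, "dataset": request.dataset, "status": "queued"}

    service = EvaluationService()

    @app.post("/evaluations")
    async def create_evaluation(request: EvaluationRequest) -> dict:
        return service.run(request)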

Project Structure

LLM-Eval-Framework/
├── backend/                 # Python FastAPI backend
│   ├── app/
│   │   ├── models/         # Data models
│   │   ├── services/       # LLM integration services
│   │   ├── evaluators/     # Evaluation logic
│   │   └── api/           # API endpoints
│   ├── tests/             # Backend tests
│   └── requirements.txt   # Python dependencies
├── frontend/              # React TypeScript frontend
│   ├── src/
│   │   ├── components/    # React components
│   │   ├── pages/        # Page components
│   │   ├── services/     # API services
│   │   └── types/        # TypeScript types
│   ├── public/           # Static assets
│   └── package.json      # Frontend dependencies
├── data/                 # Benchmark datasets
│   └── test.jsonl       # GSM8K grade school math test dataset
├── results/              # Evaluation results storage
└── docs/                # Documentation

Quick Start

Backend Setup

  1. Navigate to the backend directory:

    cd backend
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables (a key-loading sketch follows these steps):

    cp .env.example .env
    # Edit .env with your API keys - NEVER commit .env files to version control
  5. Run the backend server:

    uvicorn app.main:app --reload
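
The provider key names in .env.example depend on which services you enable. Below is a hedged sketch of how the backend might read them; OPENAI_API_KEY, ANTHROPIC_API_KEY, and GOOGLE_API_KEY are assumptions here, and .env.example remains the authoritative reference.

    # Illustrative only: read provider keys after `cp .env.example .env`.
    # Assumes python-dotenv is available; the actual variable names may differ.
    import os
    from dotenv import load_dotenv

    load_dotenv()  # loads .env into the process environment

    API_KEYS = {
        "openai": os.getenv("OPENAI_API_KEY"),
        "anthropic": os.getenv("ANTHROPIC_API_KEY"),
        "google": os.getenv("GOOGLE_API_KEY"),
    }
    missing = [name for name, key in API_KEYS.items() if not key]
    if missing:
        raise RuntimeError(f"Missing API keys for: {', '.join(missing)}")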

Frontend Setup

  1. Navigate to the frontend directory:

    cd frontend
  2. Install dependencies:

    npm install
  3. Start the development server:

    npm start

Dataset

The framework includes the GSM8K (Grade School Math 8K) test dataset, which contains 1,319 grade school math word problems. This dataset is sourced from OpenAI's grade-school-math repository and replaces the previous sample_dataset.json.

Dataset Source: OpenAI Grade School Math - test.jsonl

Each problem in the dataset follows this format (a loading sketch follows the list):

  • question: A math word problem
  • answer: The step-by-step solution with the final numerical answer
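
GSM8K solutions conventionally end with a final line of the form "#### <number>". A minimal sketch for loading data/test.jsonl and pulling out that final value (the helper names are illustrative):

    # Load the GSM8K test set and extract the final numeric answer
    # that follows the "####" marker in each solution.
    import json

    def load_gsm8k(path: str = "data/test.jsonl") -> list[dict]:
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def final_answer(answer: str) -> str:
        return answer.split("####")[-1].strip()

    problems = load_gsm8k()
    print(len(problems))                        # 1319
    print(final_answer(problems[0]["answer"]))  # final numeric answer as a string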

Usage

  1. Access the web interface at http://localhost:3000
  2. Configure your LLM API keys in the settings
  3. Select benchmark datasets for evaluation (including the GSM8K dataset)
  4. Run evaluations and view results in the interactive dashboard
  5. Analyze trade-offs between accuracy, cost, and hallucination rates

API Documentation

Once the backend is running, visit http://localhost:8000/docs for interactive API documentation.
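
Besides the interactive docs page, FastAPI also serves the raw OpenAPI schema at /openapi.json by default, which is convenient for listing the available routes programmatically:

    # List the backend's routes from FastAPI's default OpenAPI schema endpoint.
    import json
    from urllib.request import urlopen

    with urlopen("http://localhost:8000/openapi.json") as response:
        schema = json.load(response)

    for path, methods in schema.get("paths", {}).items():
        print(path, sorted(methods))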

Security

This project handles sensitive API keys and processes user-uploaded data. Please review our Security Policy before contributing.

Security Best Practices

  • Never commit API keys or secrets to version control
  • Keep dependencies updated (pip audit, npm audit)
  • Validate all user inputs (see the validation sketch after this list)
  • Use HTTPS in production
  • Follow the principle of least privilege
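
Input validation in a FastAPI backend is typically expressed as Pydantic models. A small sketch, assuming Pydantic v2; the model and field names are illustrative, not the project's actual schema:

    # Validate user-supplied evaluation settings with Pydantic (v2 syntax).
    from pydantic import BaseModel, Field, field_validator

    class EvaluationConfig(BaseModel):
        model: str = Field(min_length=1, max_length=100)
        temperature: float = Field(ge=0.0, le=2.0)
        dataset: str = "gsm8k"

        @field_validator("dataset")
        @classmethod
        def known_dataset(cls, value: str) -> str:
            if value != "gsm8k":
                raise ValueError(f"unsupported dataset: {value}")
            return value

    config = EvaluationConfig(model="gpt-4", temperature=0.2)  # raises ValidationError on bad input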

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Run security checks (pip audit, npm audit)
  6. Submit a pull request

Please read our Security Policy before contributing.

License

MIT License - see LICENSE file for details.
