LLM Evaluation Framework

Python React TypeScript FastAPI

A comprehensive, enterprise-ready framework for systematically evaluating Large Language Models (LLMs) with real-time analytics, cost optimization, and hallucination detection.

▶️ Watch the demo video on YouTube

Features

Advanced Evaluation Metrics

  • Multi-Model Support: Seamless integration with OpenAI, Anthropic, Google, and other major LLM providers (a minimal Python sketch follows this list)
  • Accuracy Benchmarking: Comprehensive testing on standardized datasets (GSM8K included)
  • Hallucination Detection: Automated factual consistency analysis using state-of-the-art techniques
  • Performance Analytics: Response time, token usage, and cost efficiency tracking
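
As a rough illustration of how multi-model support and accuracy benchmarking fit together, the sketch below defines a provider-agnostic client interface and a scoring loop. The names LLMClient and evaluate_accuracy are illustrative assumptions, not the framework's actual API.

    # Minimal sketch of provider-agnostic accuracy benchmarking.
    # LLMClient and evaluate_accuracy are illustrative names only.
    import time
    from abc import ABC, abstractmethod

    class LLMClient(ABC):
        """One interface so OpenAI, Anthropic, Google, etc. are interchangeable."""
        @abstractmethod
        def complete(self, prompt: str) -> str: ...

    def evaluate_accuracy(client: LLMClient, dataset: list[dict]) -> dict:
        correct, latencies = 0, []
        for item in dataset:
            start = time.perf_counter()
            prediction = client.complete(item["question"])
            latencies.append(time.perf_counter() - start)
            if item["answer"].strip() in prediction:  # placeholder scoring rule
                correct += 1
        return {
            "accuracy": correct / len(dataset),
            "avg_latency_s": sum(latencies) / len(latencies),
        }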

Interactive Dashboard & Analytics

  • Real-time Visualization: Dynamic charts and performance trends using Recharts
  • Comparative Analysis: Side-by-side model performance comparisons
  • Export Capabilities: JSON, CSV, and Excel export functionality
  • Advanced Filtering: Multi-dimensional data filtering and search

Production-Ready Architecture

  • FastAPI Backend: High-performance async API with automatic documentation (see the route sketch after this list)
  • React TypeScript Frontend: Modern, responsive UI with Material-UI components
  • Modular Design: Clean separation of concerns with service-layer architecture
  • Error Handling: Comprehensive error boundaries and retry mechanisms
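
A minimal sketch of the service-layer pattern described above, assuming a hypothetical /evaluations route; the real endpoints live under backend/app/api and may differ.

    # Sketch of a FastAPI route delegating to a service layer.
    # The route path and EvaluationService are assumptions for illustration.
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="LLM Evaluation Framework")

    class EvaluationRequest(BaseModel):
        model: str
        dataset: str = "gsm8k"

    class EvaluationService:
        def run(self, request: EvaluationRequest) -> dict:
            # Real logic would dispatch to the provider services and evaluators.
            return {"model": request.model, "dataset": request.dataset, "status": "queued"}

    service = EvaluationService()

    @app.post("/evaluations")
    async def create_evaluation(request: EvaluationRequest) -> dict:
        return service.run(request)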

Project Structure

LLM-Eval-Framework/
├── backend/                 # Python FastAPI backend
│   ├── app/
│   │   ├── models/         # Data models
│   │   ├── services/       # LLM integration services
│   │   ├── evaluators/     # Evaluation logic
│   │   └── api/           # API endpoints
│   ├── tests/             # Backend tests
│   └── requirements.txt   # Python dependencies
├── frontend/              # React TypeScript frontend
│   ├── src/
│   │   ├── components/    # React components
│   │   ├── pages/        # Page components
│   │   ├── services/     # API services
│   │   └── types/        # TypeScript types
│   ├── public/           # Static assets
│   └── package.json      # Frontend dependencies
├── data/                 # Benchmark datasets
│   └── test.jsonl       # GSM8K grade school math test dataset
├── results/              # Evaluation results storage
└── docs/                # Documentation

Quick Start

Backend Setup

  1. Navigate to the backend directory:

    cd backend
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables (a key-loading sketch follows these steps):

    cp .env.example .env
    # Edit .env with your API keys - NEVER commit .env files to version control
  5. Run the backend server:

    uvicorn app.main:app --reload
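
The provider key names in .env.example depend on which services you enable. Below is a hedged sketch of how the backend might read them; OPENAI_API_KEY, ANTHROPIC_API_KEY, and GOOGLE_API_KEY are assumptions here, and .env.example remains the authoritative reference.

    # Illustrative only: read provider keys after `cp .env.example .env`.
    # Assumes python-dotenv is available; the actual variable names may differ.
    import os
    from dotenv import load_dotenv

    load_dotenv()  # loads .env into the process environment

    API_KEYS = {
        "openai": os.getenv("OPENAI_API_KEY"),
        "anthropic": os.getenv("ANTHROPIC_API_KEY"),
        "google": os.getenv("GOOGLE_API_KEY"),
    }
    missing = [name for name, key in API_KEYS.items() if not key]
    if missing:
        raise RuntimeError(f"Missing API keys for: {', '.join(missing)}")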

Frontend Setup

  1. Navigate to the frontend directory:

    cd frontend
  2. Install dependencies:

    npm install
  3. Start the development server:

    npm start

Dataset

The framework includes the GSM8K (Grade School Math 8K) test dataset, which contains 1,319 grade school math word problems. This dataset is sourced from OpenAI's grade-school-math repository and replaces the previous sample_dataset.json.

Dataset Source: OpenAI Grade School Math - test.jsonl

Each problem in the dataset follows this format (a loading sketch follows the list):

  • question: A math word problem
  • answer: The step-by-step solution with the final numerical answer
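
GSM8K solutions conventionally end with a final line of the form "#### <number>". A minimal sketch for loading data/test.jsonl and pulling out that final value (the helper names are illustrative):

    # Load the GSM8K test set and extract the final numeric answer
    # that follows the "####" marker in each solution.
    import json

    def load_gsm8k(path: str = "data/test.jsonl") -> list[dict]:
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def final_answer(answer: str) -> str:
        return answer.split("####")[-1].strip()

    problems = load_gsm8k()
    print(len(problems))                        # 1319
    print(final_answer(problems[0]["answer"]))  # final numeric answer as a string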

Usage

  1. Access the web interface at http://localhost:3000
  2. Configure your LLM API keys in the settings
  3. Select benchmark datasets for evaluation (including the GSM8K dataset)
  4. Run evaluations and view results in the interactive dashboard
  5. Analyze trade-offs between accuracy, cost, and hallucination rates

API Documentation

Once the backend is running, visit http://localhost:8000/docs for interactive API documentation.
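
Besides the interactive docs page, FastAPI also serves the raw OpenAPI schema at /openapi.json by default, which is convenient for listing the available routes programmatically:

    # List the backend's routes from FastAPI's default OpenAPI schema endpoint.
    import json
    from urllib.request import urlopen

    with urlopen("http://localhost:8000/openapi.json") as response:
        schema = json.load(response)

    for path, methods in schema.get("paths", {}).items():
        print(path, sorted(methods))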

Security

This project handles sensitive API keys and processes user-uploaded data. Please review our Security Policy before contributing.

Security Best Practices

  • Never commit API keys or secrets to version control
  • Keep dependencies updated (pip audit, npm audit)
  • Validate all user inputs (see the validation sketch after this list)
  • Use HTTPS in production
  • Follow the principle of least privilege
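
Input validation in a FastAPI backend is typically expressed as Pydantic models. A small sketch, assuming Pydantic v2; the model and field names are illustrative, not the project's actual schema:

    # Validate user-supplied evaluation settings with Pydantic (v2 syntax).
    from pydantic import BaseModel, Field, field_validator

    class EvaluationConfig(BaseModel):
        model: str = Field(min_length=1, max_length=100)
        temperature: float = Field(ge=0.0, le=2.0)
        dataset: str = "gsm8k"

        @field_validator("dataset")
        @classmethod
        def known_dataset(cls, value: str) -> str:
            if value != "gsm8k":
                raise ValueError(f"unsupported dataset: {value}")
            return value

    config = EvaluationConfig(model="gpt-4", temperature=0.2)  # raises ValidationError on bad input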

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Run security checks (pip audit, npm audit)
  6. Submit a pull request

Please read our Security Policy before contributing.

License

MIT License - see LICENSE file for details.
