A comprehensive, enterprise-ready framework for systematically evaluating Large Language Models (LLMs) with real-time analytics, cost optimization, and hallucination detection.
- Multi-Model Support: Seamless integration with OpenAI, Anthropic, Google, and other major LLM providers
- Accuracy Benchmarking: Comprehensive testing on standardized datasets (GSM8K included)
- Hallucination Detection: Automated factual consistency analysis using state-of-the-art techniques
- Performance Analytics: Response time, token usage, and cost efficiency tracking (see the sketch after this list)
- Real-time Visualization: Dynamic charts and performance trends using Recharts
- Comparative Analysis: Side-by-side model performance comparisons
- Export Capabilities: JSON, CSV, and Excel export functionality
- Advanced Filtering: Multi-dimensional data filtering and search
- FastAPI Backend: High-performance async API with automatic documentation
- React TypeScript Frontend: Modern, responsive UI with Material-UI components
- Modular Design: Clean separation of concerns with service-layer architecture
- Error Handling: Comprehensive error boundaries and retry mechanisms
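To make the Performance Analytics point concrete, here is a minimal sketch of per-call latency and cost tracking. The class name, fields, and the USD-per-1K-token prices are illustrative assumptions, not the framework's actual API:

```python
# Illustrative sketch only: per-call latency and cost tracking.
# CallMetrics and the price constants are hypothetical; real per-token
# prices vary by provider and model.
import time
from dataclasses import dataclass

@dataclass
class CallMetrics:
    latency_s: float
    prompt_tokens: int
    completion_tokens: int

    def cost_usd(self, price_in_per_1k: float, price_out_per_1k: float) -> float:
        # Cost = tokens / 1000 * price-per-1K-tokens, summed over input and output.
        return (self.prompt_tokens * price_in_per_1k
                + self.completion_tokens * price_out_per_1k) / 1000

start = time.perf_counter()
# ... LLM provider call would go here ...
metrics = CallMetrics(time.perf_counter() - start,
                      prompt_tokens=512, completion_tokens=128)
print(f"latency={metrics.latency_s:.3f}s cost=${metrics.cost_usd(0.005, 0.015):.4f}")
```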
LLM-Eval-Framework/
├── backend/ # Python FastAPI backend
│ ├── app/
│ │ ├── models/ # Data models
│ │ ├── services/ # LLM integration services
│ │ ├── evaluators/ # Evaluation logic
│ │ └── api/ # API endpoints
│ ├── tests/ # Backend tests
│ └── requirements.txt # Python dependencies
├── frontend/ # React TypeScript frontend
│ ├── src/
│ │ ├── components/ # React components
│ │ ├── pages/ # Page components
│ │ ├── services/ # API services
│ │ └── types/ # TypeScript types
│ ├── public/ # Static assets
│ └── package.json # Frontend dependencies
├── data/ # Benchmark datasets
│ └── test.jsonl # GSM8K grade school math test dataset
├── results/ # Evaluation results storage
└── docs/ # Documentation
- Navigate to the backend directory: `cd backend`
- Create and activate a virtual environment: `python -m venv venv && source venv/bin/activate` (on Windows: `venv\Scripts\activate`)
- Install dependencies: `pip install -r requirements.txt`
- Set up environment variables: `cp .env.example .env`, then edit `.env` with your API keys (see the loading sketch after these steps). Never commit `.env` files to version control.
- Run the backend server: `uvicorn app.main:app --reload` (a minimal `app/main.py` sketch also follows these steps)
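As referenced in the environment-variables step, here is a minimal sketch of loading keys from `.env` (assuming the `python-dotenv` package; the variable names are illustrative, check `.env.example` for the real ones):

```python
# Sketch, assuming python-dotenv is installed (pip install python-dotenv).
# Variable names are illustrative; check .env.example for the real ones.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
if not openai_key and not anthropic_key:
    raise RuntimeError("No provider API key configured; edit your .env file")
```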
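And for orientation, the `app.main:app` import path that uvicorn resolves corresponds to an `app` object in `backend/app/main.py`. A minimal sketch (the health route is illustrative; the real endpoints live under `app/api/`):

```python
# Minimal sketch of backend/app/main.py; the health route is illustrative.
from fastapi import FastAPI

app = FastAPI(title="LLM Eval Framework")  # `uvicorn app.main:app` imports this object

@app.get("/health")
def health() -> dict:
    """Simple liveness probe."""
    return {"status": "ok"}
```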
- Navigate to the frontend directory: `cd frontend`
- Install dependencies: `npm install`
- Start the development server: `npm start`
The framework includes the GSM8K (Grade School Math 8K) test dataset, which contains 1,319 grade school math word problems. This dataset is sourced from OpenAI's grade-school-math repository and replaces the previous `sample_dataset.json`.

Dataset Source: OpenAI Grade School Math - `test.jsonl`

Each problem in the dataset follows this format:

- `question`: A math word problem
- `answer`: The step-by-step solution with the final numerical answer
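GSM8K answer strings end with a `#### <number>` marker, so extracting the gold answer is straightforward. A sketch of loading `data/test.jsonl` (the helper name here is ours, not part of the framework):

```python
# Sketch: read data/test.jsonl and pull out the final numeric answer.
# GSM8K convention: the answer text ends with "#### <number>".
import json

def final_answer(answer: str) -> str:
    # The text after the "####" delimiter is the gold numeric answer.
    return answer.split("####")[-1].strip()

with open("data/test.jsonl", encoding="utf-8") as f:
    for line in f:
        problem = json.loads(line)
        gold = final_answer(problem["answer"])  # e.g. "18"
```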
- Access the web interface at http://localhost:3000
- Configure your LLM API keys in the settings
- Select benchmark datasets for evaluation (including the GSM8K dataset)
- Run evaluations and view results in the interactive dashboard
- Analyze trade-offs between accuracy, cost, and hallucination rates
Once the backend is running, visit http://localhost:8000/docs for interactive API documentation.
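For scripted use, something like the following should work once you confirm the actual route and payload schema in the docs above; the endpoint path and field names here are hypothetical:

```python
# Hypothetical sketch: the endpoint path and payload fields are NOT confirmed;
# consult http://localhost:8000/docs for the real schema.
import requests

resp = requests.post(
    "http://localhost:8000/api/evaluations",  # hypothetical route
    json={"model": "gpt-4o", "dataset": "gsm8k", "metrics": ["accuracy"]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```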
This project handles sensitive API keys and processes user-uploaded data. Please review our Security Policy before contributing.
- Never commit API keys or secrets to version control
- Keep dependencies updated (`pip audit`, `npm audit`)
- Validate all user inputs (see the sketch after this list)
- Use HTTPS in production
- Follow the principle of least privilege
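On input validation: since the backend is FastAPI, request bodies are naturally validated with Pydantic models. A sketch of the idea (assuming Pydantic v2; the field names and constraints are illustrative):

```python
# Sketch, assuming Pydantic v2; field names and constraints are illustrative.
from pydantic import BaseModel, Field

class EvaluationRequest(BaseModel):
    model: str = Field(min_length=1, max_length=100)
    # Restrict dataset names to a safe charset to rule out path traversal.
    dataset: str = Field(pattern=r"^[a-z0-9_\-]+$")
    max_samples: int = Field(default=100, ge=1, le=10_000)
```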
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Run security checks (`pip audit`, `npm audit`)
- Submit a pull request
MIT License - see LICENSE file for details.