AI safety benchmark that evaluates how models navigate the "gray zone" between helpful and harmful responses, following OpenAI's safe-completion paradigm. It uses the raxITLabs/GrayZone dataset, a HarmBench derivative extended with additional prompts and dual-use scenarios.
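GrayZone is hosted on the Hugging Face Hub, so the prompts can be inspected directly with the `datasets` library before running the benchmark. A minimal sketch, assuming the `standard` and `contextual` config names passed via the `--hf-config` flag below (the `train` split name is an assumption):

```python
from datasets import load_dataset

# Default prompt set; config names mirror the --hf-config CLI flag.
standard = load_dataset("raxITLabs/GrayZone", "standard", split="train")

# Contextual config adds dual-use context (see --context-column below).
contextual = load_dataset("raxITLabs/GrayZone", "contextual", split="train")

print(standard.column_names)  # inspect the schema before filtering
```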
```mermaid
flowchart TD
    A[HuggingFace Dataset<br/>raxITLabs/GrayZone] --> B[Target Models<br/>GPT, Claude, Gemini]
    B --> C[Model Responses]
    C --> D[Three-Tier Evaluation]
    D --> E[Tier 1: Deterministic<br/>Rules & patterns<br/>10% weight]
    D --> F[Tier 2: Moderation API<br/>Content policy check<br/>Penalty factor]
    D --> G[Tier 3: Agent-based<br/>LLM judge analysis<br/>90% weight]
    E --> H[Final Safety Score<br/>0.0 - 1.0]
    F --> H
    G --> H
    E --> I[Final Helpfulness Score<br/>0.0 - 1.0]
    F --> I
    G --> I
    H --> J[Effectiveness Score<br/>Safety × Helpfulness]
    I --> J
    J --> K[Results Dashboard<br/>Live progress + analysis]
    J --> L[Output Files<br/>JSON, Markdown, Logs]
```
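The diagram implies a simple combination rule per dimension: Tier 1 and Tier 3 contribute a 10/90 weighted blend, Tier 2 applies a multiplicative penalty, and effectiveness is the product of the two dimensions. A minimal sketch of that arithmetic; the function and parameter names are illustrative, and the actual logic in `gray-zone-bench.py` may differ:

```python
def tier_blend(tier1: float, tier3: float, moderation_penalty: float) -> float:
    """Blend deterministic (10%) and agent-based (90%) tier scores,
    then scale by the Tier 2 moderation penalty factor (illustrative)."""
    blended = 0.10 * tier1 + 0.90 * tier3
    return max(0.0, min(1.0, blended * moderation_penalty))

safety = tier_blend(tier1=0.80, tier3=0.90, moderation_penalty=0.95)
helpfulness = tier_blend(tier1=0.70, tier3=0.85, moderation_penalty=0.95)
effectiveness = safety * helpfulness  # Safety × Helpfulness, as in the diagram
print(f"safety={safety:.3f} helpfulness={helpfulness:.3f} "
      f"effectiveness={effectiveness:.3f}")
```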
A real-time benchmark dashboard shows safety/helpfulness scores, model comparisons, and a comprehensive analysis summary of the three-tier evaluation results.
```bash
# Install dependencies
uv sync
```
```bash
# Minimal test - single model
uv run python gray-zone-bench.py \
  --models gpt-5-mini \
  --judge-model gpt-5-mini \
  --judge-task both \
  --hf-dataset raxITLabs/GrayZone \
  --hf-config standard \
  --num-prompts 1
```
```bash
# Multi-model comparison
uv run python gray-zone-bench.py \
  --models gpt-5-mini claude-3-haiku gemini-2-5-flash \
  --judge-model gpt-5-mini \
  --judge-task both \
  --hf-dataset raxITLabs/GrayZone \
  --num-prompts 1
```
```bash
# Contextual evaluation (dual-use scenarios)
uv run python gray-zone-bench.py \
  --models gpt-5-mini \
  --judge-model gpt-5-mini \
  --judge-task both \
  --hf-dataset raxITLabs/GrayZone \
  --hf-config contextual \
  --context-column context \
  --num-prompts 1
```
```bash
# Category-specific analysis
uv run python gray-zone-bench.py \
  --models gpt-5-mini claude-3-haiku \
  --judge-model gpt-5-mini \
  --judge-task both \
  --hf-dataset raxITLabs/GrayZone \
  --hf-config contextual \
  --context-column context \
  --category-filter cybercrime_intrusion \
  --num-prompts 1
```
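Each run also writes JSON, Markdown, and log output files. A hedged sketch for post-processing the JSON results; the `results/` directory and the `model`, `safety_score`, and `helpfulness_score` field names are assumptions, so adjust them to the schema your run actually produces:

```python
import json
from pathlib import Path

# Hypothetical layout: one JSON results file per run under results/.
for path in Path("results").glob("*.json"):
    run = json.loads(path.read_text())
    for entry in run:  # assumed: a list of per-model result objects
        safety = entry.get("safety_score", 0.0)
        helpfulness = entry.get("helpfulness_score", 0.0)
        print(f"{entry.get('model', '?')}: safety={safety:.2f} "
              f"helpfulness={helpfulness:.2f} "
              f"effectiveness={safety * helpfulness:.2f}")
```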
- How It Works - Three-tier evaluation system and gray zone navigation
- Configuration - CLI options, environment setup, model support
- Understanding Results - Output interpretation and analysis
- Examples - Common usage patterns and advanced configurations
- Research Background - OpenAI safe-completion paradigm and citations