The LLM Distillation Tool
WhiteLightning distills massive, state-of-the-art language models into lightweight, hyper-efficient text classifiers. It's a command-line tool that lets you create specialized models that run anywhere—from the cloud to the edge—using the universal ONNX format for maximum compatibility.
We use large, powerful frontier models as "teachers" to train much smaller, task-specific "student" models. WhiteLightning automates this process for text classification, allowing you to create high-performance classifiers with a fraction of the computational footprint.
WhiteLightning exports every trained model to ONNX (Open Neural Network Exchange). This standard format makes your models instantly portable. Run them natively in Python, JavaScript, C++, Rust, Java, and more, ensuring total flexibility for any project. Learn more at onnx.ai.
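As a rough illustration of what "run them natively in Python" can look like, here is a minimal sketch using the `onnxruntime` package. It assumes the model takes a single float32 input of shape [1, n_features] and returns one score per class in its first output; check your own model's inputs before relying on this. The helper names `pick_label` and `run_onnx` are hypothetical and not part of WhiteLightning.

```python
from typing import List, Sequence


def pick_label(scores: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label whose score is highest."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]


def run_onnx(model_path: str, features: Sequence[float]) -> List[float]:
    """Run one feature vector through an ONNX model and return its scores.

    Assumption: one float32 input of shape [1, n_features] and a first
    output of shape [1, n_classes]. Adjust if your model differs.
    """
    # Imported lazily so pick_label works without these packages installed.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    batch = np.asarray([features], dtype=np.float32)
    outputs = session.run(None, {input_name: batch})
    return [float(x) for x in outputs[0][0]]


# Example with made-up scores; replace with run_onnx("model.onnx", features).
print(pick_label([0.1, 0.7, 0.2], ["negative", "neutral", "positive"]))
```

The same two steps (build a float32 batch, call the session) carry over to the ONNX Runtime bindings in other languages.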
WhiteLightning is designed as a "generic" Docker image that works seamlessly across macOS, Linux, and Windows with identical commands:
- Zero Configuration: No need for complex --user flags or platform-specific commands
- Automatic Permission Handling: Intelligently detects your system and sets correct file ownership
- Universal Commands: The same docker run command works everywhere
- Smart User Management: Internally manages user creation and permission mapping
- Secure by Default: Always runs as non-root user with proper privilege dropping
- Multiple Model Architectures: Generate models for binary and multiclass classification with different activation functions.
- Instant Cross-Platform Deployment: Export to ONNX for use in any environment or language.
- Lightweight & Incredibly Fast: Optimized for high-speed inference with minimal resource consumption.
- Framework Agnostic: The final ONNX model has zero dependencies on TensorFlow or PyTorch. It's pure, portable compute.
- Multilingual Support: Generate training data and classifiers in a wide variety of languages.
- Smart & Automatic: Intelligently generates and refines prompts based on your classification task.
- Clone the repository:
git clone https://github.com/Inoxoft/whitelightning.git
cd whitelightning
- Get an OpenRouter API key at openrouter.ai/settings/keys.
- Run the Docker image:
Mac:
docker run --rm \
-v "$(pwd)":/app/models \
-e OPEN_ROUTER_API_KEY="YOUR_OPEN_ROUTER_KEY_HERE" \
ghcr.io/inoxoft/whitelightning:latest \
python -m text_classifier.agent \
-p "Categorize customer reviews as positive, neutral, or negative"
Linux:
docker run --rm \
-v "$(pwd)":/app/models \
-e OPEN_ROUTER_API_KEY="YOUR_OPEN_ROUTER_KEY_HERE" \
ghcr.io/inoxoft/whitelightning:latest \
python -m text_classifier.agent \
-p "Categorize customer reviews as positive, neutral, or negative"
Windows (PowerShell; note the backtick line continuations):
docker run --rm `
-v "${PWD}:/app/models" `
-e OPEN_ROUTER_API_KEY="YOUR_OPEN_ROUTER_KEY_HERE" `
ghcr.io/inoxoft/whitelightning:latest `
python -m text_classifier.agent `
-p "Categorize customer reviews as positive, neutral, or negative"
- That's it! You'll see the generation process in your terminal. When it's finished, list the files in your directory (ls -l). You'll find all the assets for your new model, ready to go.
🎮 Try your trained model right here: WhiteLightning Playground
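If you prefer a scripted check over reading the ls output, the following sketch looks for the output files this README lists (config.json, model.onnx, and so on). It assumes the files land directly in the directory you pass in; if your run writes them to a subfolder, point the function there instead. The name `missing_outputs` is hypothetical.

```python
from pathlib import Path
from typing import List

# File names taken from the output listing in this README.
EXPECTED_FILES = [
    "config.json",
    "training_data.csv",
    "edge_case_data.csv",
    "model.onnx",
    "model_scaler.json",
    "model_vocab.json",
]


def missing_outputs(output_dir: str) -> List[str]:
    """Return the expected output files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_outputs(".")
print("All outputs present" if not missing else f"Missing: {missing}")
```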
NEW! Skip LLM data generation and train directly on your existing datasets. WhiteLightning automatically analyzes your data structure and creates optimized models from real domain data.
# Create folder for your data
mkdir own_data
cp your_dataset.csv own_data/
# Train on your data (faster, cheaper, more accurate!)
docker run --rm \
-v "$(pwd)":/app/models \
-e OPEN_ROUTER_API_KEY="YOUR_OPEN_ROUTER_KEY_HERE" \
ghcr.io/inoxoft/whitelightning:latest \
python -m text_classifier.agent \
-p "Categorize customer reviews as positive, neutral, or negative" \
--use-own-dataset="/app/models/own_data/your_dataset.csv"
Benefits:
- ⚡ 3-5x Faster: No data generation needed
- 💰 95% Cheaper: Only uses LLM for data analysis (~$0.01 vs $1-10)
- 🎯 Higher Accuracy: Real domain data vs synthetic
- 📁 Multiple Formats: Supports CSV, JSON, JSONL, and TXT files
- 🔍 Auto-Detection: Automatically identifies text/label columns and classification type
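Before spending a training run on your own data, it can help to sanity-check the file. The sketch below counts rows, empty texts, and label frequencies in a CSV. The column names `text` and `label` are assumptions for illustration; WhiteLightning detects columns on its own, so adjust the names to match your file. This is a local pre-check, not the tool's own analysis.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO


def summarize_dataset(handle: TextIO, text_col: str = "text",
                      label_col: str = "label") -> Dict[str, object]:
    """Count rows, empty texts, and label frequencies in a CSV file."""
    reader = csv.DictReader(handle)
    rows = 0
    empty_texts = 0
    labels: Counter = Counter()
    for row in reader:
        rows += 1
        if not (row.get(text_col) or "").strip():
            empty_texts += 1          # rows with no usable text
        label = (row.get(label_col) or "").strip()
        if label:
            labels[label] += 1
    return {"rows": rows, "empty_texts": empty_texts, "labels": dict(labels)}


# In-memory sample; for a real file use open("own_data/your_dataset.csv", newline="").
sample = io.StringIO(
    "text,label\n"
    "Great product,positive\n"
    "Arrived late,negative\n"
    ",neutral\n"
)
print(summarize_dataset(sample))
```

A heavily skewed label count or many empty texts is worth fixing before training.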
config.json # Configuration and analysis
training_data.csv # Generated training data
edge_case_data.csv # Challenging test cases
model.onnx # ONNX model file
model_scaler.json # StandardScaler parameters
model_vocab.json # TF-IDF vocabulary
See our Complete Documentation for guides on how to use these files in your language of choice (C++, Rust, iOS, Android, and more).
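To give a feel for how the vocabulary and scaler files fit together, here is a minimal pure-Python sketch of TF-IDF followed by StandardScaler-style scaling. Everything about the data layout is an assumption: the in-memory `vocab`, `idf`, `mean`, and `scale` values below are made up, the tokenizer is a simplification, and the L2 normalization mirrors scikit-learn's default rather than anything confirmed for WhiteLightning. Consult the Complete Documentation for the actual file structure.

```python
import math
import re
from typing import Dict, List


def tfidf_features(text: str, vocab: Dict[str, int], idf: List[float]) -> List[float]:
    """Turn text into an L2-normalized TF-IDF vector over a fixed vocabulary."""
    vector = [0.0] * len(idf)
    for token in re.findall(r"\w+", text.lower()):
        index = vocab.get(token)
        if index is not None:
            vector[index] += 1.0                    # raw term count
    vector = [tf * weight for tf, weight in zip(vector, idf)]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def standardize(vector: List[float], mean: List[float], scale: List[float]) -> List[float]:
    """Apply StandardScaler-style scaling: (x - mean) / scale."""
    return [(x - m) / s for x, m, s in zip(vector, mean, scale)]


# Tiny hypothetical vocabulary and scaler values, for illustration only.
vocab = {"great": 0, "bad": 1, "service": 2}
idf = [1.0, 1.0, 1.0]
mean = [0.0, 0.0, 0.0]
scale = [1.0, 1.0, 1.0]

features = standardize(tfidf_features("Great service", vocab, idf), mean, scale)
print(features)
```

The resulting list is the kind of fixed-length numeric vector a TF-IDF based ONNX classifier would expect as input.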
The power of WhiteLightning is the -p (prompt) argument. You can create a classifier for almost anything just by describing it. Here are some ideas to get you started:
- Spam Filter: -p "Classify emails as 'spam' or 'not_spam'"
- Topic Classifier: -p "Determine if a news headline is about 'tech', 'sports', 'world_news', or 'finance'"
- Toxicity Detector: -p "Detect whether a user comment is 'toxic' or 'safe'"
- Urgency Detection: -p "Categorize a support ticket's urgency as 'high', 'medium', or 'low'"
- Intent Recognition: -p "Classify the user's intent as 'book_flight', 'check_status', or 'customer_support'"
The possibilities are endless. For more inspiration and advanced prompt engineering techniques, check out our Complete Documentation.
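If you want to try several prompts in a row, you could script the Docker call. The sketch below builds the same command shown in this README as an argument list, using only the flags that appear here (-v, -e, -p, and --use-own-dataset). The function name `build_command` and the `/tmp/models` path are hypothetical.

```python
import os
import shlex
from typing import List, Optional

IMAGE = "ghcr.io/inoxoft/whitelightning:latest"


def build_command(prompt: str, api_key: str, workdir: Optional[str] = None,
                  dataset: Optional[str] = None) -> List[str]:
    """Build the docker argv for one WhiteLightning run."""
    workdir = workdir or os.getcwd()
    argv = [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/app/models",
        "-e", f"OPEN_ROUTER_API_KEY={api_key}",
        IMAGE,
        "python", "-m", "text_classifier.agent",
        "-p", prompt,
    ]
    if dataset:
        # Path as seen inside the container, e.g. /app/models/own_data/file.csv
        argv.append(f"--use-own-dataset={dataset}")
    return argv


cmd = build_command("Classify emails as 'spam' or 'not_spam'",
                    api_key="YOUR_OPEN_ROUTER_KEY_HERE", workdir="/tmp/models")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with prompts that contain quotes.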
Don't want to manually construct Docker commands? Use our Interactive Command Generator to build your personalized WhiteLightning commands with a user-friendly interface:
- 📝 Simple Configuration: Enter your API key and describe your classification task
- ⚙️ Advanced Options: Configure model type, activation functions, language settings, and more
- 🖥️ Platform Detection: Automatically generates the correct command format for macOS, Linux, or Windows
- 📋 One-Click Copy: Copy the generated command directly to your clipboard
- 💡 Smart Defaults: Intelligent parameter suggestions based on your task description
Features:
- Model Type Selection: Choose between TensorFlow, PyTorch, or Scikit-learn
- Activation Functions: Auto-detect or manually select sigmoid/softmax
- Custom Datasets: Easy configuration for using your own data files
- Language Support: Set primary language for multilingual classification
- Performance Tuning: Adjust batch size, refinement cycles, and feature limits
Perfect for:
- First-time users who want guided setup
- Complex configurations with multiple parameters
- Teams sharing standardized commands
- Quick experimentation with different settings
Want to test your ONNX models across multiple programming languages? Check out our WhiteLightning Test Framework - a comprehensive cross-language testing suite that validates your models in:
- 8 Programming Languages: Python, Java, C++, C, Node.js, Rust, Dart, and Swift
- Performance Benchmarking: Detailed timing, memory usage, and throughput analysis
- Automated Testing: GitHub Actions workflows for continuous validation
- Real-world Scenarios: Test with custom inputs and edge cases
Perfect for ensuring your WhiteLightning models work consistently across all target platforms and deployment environments.
Need comprehensive guides and documentation? Check out our WhiteLightning Site repository, which hosts the official website at https://whitelightning.ai with detailed documentation, tutorials, and implementation guides.
Looking for pre-trained models or want to share your own? Visit our WhiteLightning Model Library - a centralized repository for uploading, downloading, and managing trained machine learning models. Perfect for sharing community contributions and accessing ready-to-use classifiers.
Train your models directly in GitHub Actions! This repository includes a pre-configured workflow that lets you:
- 🤖 Train Models in the Cloud: No local setup required - train directly in GitHub's infrastructure
- ⚙️ Customizable Parameters: Set classification prompt, refinement cycles, language, and mock mode
- 🔧 Manual Triggers: Run training on-demand via GitHub's "Run workflow" button
- 📦 Automatic Artifacts: Generated models (ONNX, vocab, scaler) are automatically saved as downloadable artifacts
- ✅ Built-in Validation: ONNX model validation and inference testing included
To use:
2. Go to Actions → "Test Model Training" → "Run workflow"
3. Customize training parameters or use defaults
4. Download generated models from the workflow artifacts
Perfect for teams, CI/CD pipelines, or when you need cloud-based model training!
File Permissions: WhiteLightning automatically handles all file permission issues across platforms. Generated files will have correct ownership on your host system without any additional configuration.
Windows Path Issues: Use PowerShell and ${PWD} instead of $(pwd) in your commands, and end each continued line with a backtick (`) rather than a backslash (\).
Container Access Issues: If you encounter any Docker-related issues, ensure Docker is running and you have proper permissions to run Docker commands.
Want to build from source or customize the Docker image? Check out the Local Setup Guide.
We welcome all contributions! The best way to start is by joining our Discord Server and chatting with the team. We're happy to help you get started.
This project is licensed under the GPLv3 License - see the LICENSE file for details.
# Basic usage (automatic activation detection)
python text_classifier/agent.py -p "Classify movie reviews as positive, negative, or neutral"
# Using your own dataset (automatic detection)
python text_classifier/agent.py -p "Emotion classifier" --use-own-dataset=data/emotions.csv
# Override activation function (advanced users)
python text_classifier/agent.py -p "Emotion classifier" --use-own-dataset=data/emotions.csv --activation sigmoid
# Available activation options
--activation auto # Smart automatic detection (default)
--activation sigmoid # For multi-label classification
--activation softmax # For single-label classification
Sigmoid (--activation sigmoid):
- ✅ Multi-label: One sample can have multiple labels
- ✅ Independent classes: "action,comedy,drama"
- ✅ Tags, symptoms, characteristics
- ✅ Example: Movie genres, article tags, medical symptoms
Softmax (--activation softmax):
- ✅ Single-label: One sample has exactly one label
- ✅ Mutually exclusive: "positive" OR "negative" OR "neutral"
- ✅ Categories, emotions, languages
- ✅ Example: Sentiment analysis, document categories
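The difference between the two activations is easy to see numerically. This is a generic sketch of the math, not WhiteLightning's internal code; the logits and the 0.5 threshold are made-up values for illustration.

```python
import math
from typing import List


def sigmoid(logits: List[float]) -> List[float]:
    """Score each class independently; results need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits: List[float]) -> List[float]:
    """Turn logits into a single probability distribution that sums to 1."""
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, -1.0]                   # hypothetical raw scores for three classes

# Multi-label: every class above the threshold is selected.
multi = [p >= 0.5 for p in sigmoid(logits)]
# Single-label: only the most probable class is selected.
probs = softmax(logits)
single = probs.index(max(probs))

print(multi)   # [True, True, False]
print(single)  # 0
```

With sigmoid the first two classes are both selected, while softmax forces a single winner.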
Auto (--activation auto):
- 🤖 System analyzes your data structure
- 🔍 Detects comma-separated labels → sigmoid
- 📊 Detects single labels → softmax
- 💡 Shows reasoning and alternatives
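The detection rule described above can be sketched in a few lines. This is only an illustration of the stated heuristic (comma-separated labels suggest sigmoid, single labels suggest softmax); the actual analysis WhiteLightning performs may be more involved, and `guess_activation` is a hypothetical name.

```python
from typing import Iterable


def guess_activation(labels: Iterable[str]) -> str:
    """Return 'sigmoid' if any label cell holds several comma-separated labels."""
    for cell in labels:
        parts = [part.strip() for part in cell.split(",") if part.strip()]
        if len(parts) > 1:
            return "sigmoid"     # at least one sample carries multiple labels
    return "softmax"             # every sample has exactly one label


print(guess_activation(["action,comedy,drama", "drama"]))      # sigmoid
print(guess_activation(["positive", "negative", "neutral"]))   # softmax
```

If the guess is wrong for your data, you can still override it with --activation sigmoid or --activation softmax.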