The FIRST MLX implementation of GRPO - Train reasoning models like DeepSeek-R1
Build your own o1-style reasoning AI using the same technique that powers DeepSeek-R1
Quick Start • What is GRPO? • Performance • Examples
DeepSeek-R1 just shocked the AI world by matching o1 performance using GRPO (Group Relative Policy Optimization). Now you can:
- Train o1-style reasoning models - Same technique as DeepSeek-R1
- On your Mac - Native Apple Silicon optimization via MLX
- No human feedback needed - Programmable rewards instead of expensive RLHF
- Multi-step reasoning - Perfect for math, coding, and complex problems
- Production ready - Robust checkpointing and speculative decoding
"GRPO is the technique behind DeepSeek-R1's breakthrough performance" - Recent AI research shows GRPO enables direct optimization using programmable reward functions, making it more scalable than traditional RLHF approaches
Group Relative Policy Optimization is the secret sauce behind DeepSeek-R1's reasoning abilities:
- Compares multiple responses to the same question within each batch
- Learns from relative quality - promotes better answers, demotes worse ones
- Online learning - improves iteratively using the model's own generated data
- Programmable rewards - no need for expensive human preference data
- Perfect for reasoning - excels at multi-step problems like math and coding
The GRPO update compares multiple sampled answers to the same question within a batch and shifts the model toward the higher-scoring answers and away from the lower-scoring ones, as sketched below.
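In code, the group-relative idea looks roughly like this (a minimal sketch, not the trainer's exact implementation): sample several answers per question, score each one, and normalize every score against its own group.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Score each answer relative to the other answers in its group.

    `rewards` holds the reward for each of the G responses sampled for ONE
    question. Answers above the group mean get a positive advantage
    (reinforced); answers below it get a negative one (discouraged).
    """
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled answers to the same question: two wrong, one partial, one right.
print(group_relative_advantages([0.0, 0.0, 0.5, 1.0]))
```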
Get your GRPO reasoning model running in 3 minutes:
```bash
# 1. Clone and install
git clone https://github.com/adeelahmad/mlx-grpo.git
cd mlx-grpo
pip install mlx mlx-lm numpy rich datasets

# 2. Train a math reasoning model (like DeepSeek-R1)
python mlx_grpo_trainer_aligned.py \
    --model_path microsoft/DialoGPT-medium \
    --train_dataset_path ./data/math_problems.jsonl \
    --reward_content_type math_eval \
    --num_training_steps 5000

# 3. Test your reasoning model
python test_reasoning.py --model ./output_model
```
That's it! You now have a reasoning model trained with the same technique as DeepSeek-R1.
| Traditional Training | MLX-GRPO on Mac | Advantage |
|---|---|---|
| Requires expensive GPUs | Runs on any Mac with Apple Silicon | Cost savings |
| Complex CUDA setup | Zero configuration needed | Easy setup |
| High memory usage | MLX-optimized memory management | Efficient |
| Slow on consumer hardware | Native Apple Silicon acceleration | Fast training |
MLX is Apple's machine learning framework designed specifically for efficient training and inference on Apple Silicon.
```bash
python mlx_grpo_trainer_aligned.py \
    --model_path microsoft/DialoGPT-medium \
    --train_dataset_path ./data/math_qa.jsonl \
    --reward_content_type math_eval \
    --reward_format_weight 0.3 \
    --reward_content_weight 0.7 \
    --num_training_steps 8500
```
Trains a model to show step-by-step mathematical reasoning; how the format/content weights blend is sketched below.
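Based on the flag names above (and assuming a simple linear blend, which is the usual approach, rather than quoting the trainer's source), the two weights combine the format and content rewards into a single score per generated sample:

```python
def combined_reward(format_score: float, content_score: float,
                    format_weight: float = 0.3,
                    content_weight: float = 0.7) -> float:
    # Mirrors --reward_format_weight / --reward_content_weight above:
    # one weighted scalar per generated sample.
    return format_weight * format_score + content_weight * content_score

print(combined_reward(1.0, 0.0))  # well-formatted but wrong answer -> 0.3
print(combined_reward(0.0, 1.0))  # correct but unformatted answer  -> 0.7
```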
```bash
python mlx_grpo_trainer_aligned.py \
    --model_path microsoft/DialoGPT-large \
    --train_dataset_path ./data/reasoning.jsonl \
    --reward_content_type jaccard \
    --num_training_steps 10000
```
Optimizes for the `<think>...</think><answer>...</answer>` format used by o1 and R1.
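A format reward of this kind is typically just a structural pattern check; the snippet below is an illustrative sketch, not the repo's own implementation:

```python
import re

# 1.0 if the completion contains a <think> block followed by an <answer>
# block, 0.0 otherwise.
FORMAT_RE = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    return 1.0 if FORMAT_RE.search(completion) else 0.0

print(format_reward("<think>2 + 2 = 4</think><answer>4</answer>"))  # 1.0
print(format_reward("The answer is 4."))                            # 0.0
```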
```bash
python mlx_grpo_trainer_aligned.py \
    --dataset_name "your-mcq-dataset" \
    --reward_content_type choice_correctness \
    --num_training_steps 6000
```
Perfect for training on standardized tests and benchmarks
- Format Rewards: Ensures proper `<think>...</think><answer>...</answer>` structure
- Math Evaluation: Automatically checks mathematical correctness
- Jaccard Similarity: Measures word overlap with reference answers
- Choice Correctness: Perfect for multiple-choice problems
- Custom Rewards: Build your own reward functions (see the sketch after this list)
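As an example of how small these reward functions can be, here is a minimal sketch of the Jaccard idea (the trainer's own tokenization and normalization may differ):

```python
def jaccard_reward(completion: str, reference: str) -> float:
    """Word-overlap reward: |intersection| / |union| of the two word sets."""
    pred = set(completion.lower().split())
    ref = set(reference.lower().split())
    if not pred or not ref:
        return 0.0
    return len(pred & ref) / len(pred | ref)

print(jaccard_reward("the answer is 42", "42 is the answer"))  # 1.0
print(jaccard_reward("the answer is 7", "the answer is 42"))   # 0.6
```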
- Atomic Checkpointing: Never lose training progress (write-then-rename pattern sketched after this list)
- Speculative Decoding: 2x faster inference with draft models
- Rich CLI: Beautiful progress bars and logging
- Auto-Resume: Continues exactly where you left off
- Weights & Biases: Optional experiment tracking
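Atomic checkpointing usually means writing to a temporary file and then renaming it into place; here is a generic sketch of that pattern (the repo's actual checkpoint format and helpers will differ):

```python
import json
import os
import tempfile

def save_checkpoint_atomically(state: dict, path: str) -> None:
    """Write `state` to a temp file in the same directory, then rename it.

    os.replace is atomic, so a crash mid-write leaves either the old
    checkpoint or the new one at `path`, never a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise

save_checkpoint_atomically({"step": 1200, "best_reward": 0.81}, "checkpoint.json")
```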
```python
from dataclasses import dataclass

# All training parameters (abridged)
@dataclass
class TrainingArgs:
    model_path: str = "../Model"
    output_dir: str = "../OutputModel"
    num_training_steps: int = 8500
    reward_content_type: str = "jaccard"  # jaccard, math_eval, choice_correctness
    reward_format_weight: float = 0.5
    reward_content_weight: float = 0.5
    # ... and many more!
```
All Training Parameters
| Parameter | Description | Default |
|---|---|---|
| `--output_dir` | Directory for checkpoints and outputs | `../OutputModel` |
| `--model_path` | Path or ID of the base MLX model | `../Model` |
| `--train_dataset_path` | Local training JSONL file | `../dataset_512/train.jsonl` |
| `--val_dataset_path` | Local validation JSONL file | `../dataset_512/valid.jsonl` |
| `--num_training_steps` | Number of optimizer steps | `8500` |
| `--reward_content_type` | Content reward: `jaccard`, `math_eval`, `choice_correctness` | `jaccard` |
| `--reward_format_weight` | Weight for format reward (0.0 - 1.0) | `0.5` |
| `--reward_content_weight` | Weight for content reward (0.0 - 1.0) | `0.5` |
See the `TrainingArgs` dataclass in the code for the complete list.
- GRPO - Same technique as DeepSeek-R1 (trending #1 on Twitter)
- Chain-of-Thought - o1-style reasoning format
- Apple Silicon ML - Fastest growing ML platform
- RLHF-Free RL - No expensive human feedback needed
- DeepSeek-R1 just dominated benchmarks using GRPO
- Apple MLX adoption growing rapidly
- Reasoning models are the hottest topic in AI
- Cost-effective alternative to GPT-4/Claude for reasoning
- GRPO Explained - DeepLearning.AI Course
- DeepSeek-R1 Technical Report - How they used GRPO
- MLX Documentation - Apple's ML framework
- HuggingFace GRPO Guide - Alternative implementation
- Apple Silicon Mac (M1, M2, M3, M4) or any MLX-supported hardware
- Python ≥ 3.8
- Dependencies: `mlx`, `mlx-lm`, `numpy`, `rich`, `datasets`
- Optional: `psutil`, `wandb` for enhanced monitoring
We ❤️ contributions! This is a hot research area with lots of room for improvement:
- Fork the repo
- Create a feature branch (`git checkout -b amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin amazing-feature`)
- Open a Pull Request
- New reward functions for specific domains
- Performance optimizations for MLX
- Better evaluation metrics
- Enhanced CLI visualization
- More training examples and tutorials
MIT License - see LICENSE file for details.
- Apple for the incredible MLX framework
- HuggingFace for MLX-LM and datasets
- Textualize for the beautiful Rich library
- DeepSeek for pioneering GRPO in their R1 model
- The research community advancing reinforcement learning for LLMs
⭐ Star us if you're excited about training reasoning models on Mac! ⭐
Built for the future of AI reasoning
Trending: #GRPO #DeepSeekR1 #MLX #AppleSilicon #ReasoningAI #MachineLearning