[paper](https://arxiv.org/abs/2508.07101)
LessIsMore is a novel sparse attention mechanism that dramatically improves the efficiency of large reasoning models without sacrificing accuracy. The key insight is that existing sparse attention methods fail on reasoning tasks because they make localized token selection decisions for each attention head separately, leading to accumulated errors over long reasoning sequences. Instead, LessIsMore leverages two key observations: (1) attention heads in reasoning tasks show significant overlap in which tokens they find important (spatial locality), and (2) recently generated tokens consistently receive high attention across multiple future steps (recency locality).
By globally aggregating token selections across all attention heads and reserving a stable portion of the budget for recent tokens, LessIsMore achieves lossless accuracy on challenging reasoning benchmarks such as AIME-24 while attending to up to 87.5% fewer tokens and delivering a 1.1× average decoding speedup. Unlike other sparse attention methods, whose selection errors inflate generation length, LessIsMore achieves a 1.13× end-to-end generation speedup without extending generation, making it a practical solution for deploying large reasoning models with significantly reduced computational overhead.
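To make the selection step concrete, below is a minimal sketch of a unified cross-head selection, assuming per-step attention scores are available as a `[num_heads, seq_len]` tensor. The function name `select_tokens`, its arguments, and the score-summation aggregation rule are illustrative assumptions, not the repository's actual implementation.

```python
import torch

def select_tokens(attn_scores: torch.Tensor, K: int, r: float) -> torch.Tensor:
    """Pick one unified set of K token indices shared by all attention heads.

    attn_scores: [num_heads, seq_len] attention scores of the current query
                 against all cached tokens (hypothetical input format).
    K:           total token budget.
    r:           fraction of the budget reserved for the most recent tokens.
    """
    num_heads, seq_len = attn_scores.shape
    k_top = int(K * (1 - r))   # budget for globally aggregated tokens
    k_recent = K - k_top       # stable budget reserved for recent tokens

    # Cross-head aggregation (simplified here as a score sum): tokens that many
    # heads attend to rank highly, exploiting the observed spatial locality.
    global_scores = attn_scores.sum(dim=0).clone()

    # Recency locality: the most recent k_recent tokens are always kept.
    recent = torch.arange(seq_len - k_recent, seq_len)

    # Exclude the reserved recent tokens from the ranked pool, then take top-k.
    global_scores[recent] = float("-inf")
    top = torch.topk(global_scores, k_top).indices

    # Every head reads from this same unified index set.
    return torch.sort(torch.cat([top, recent])).values
```

Because every head attends over the same unified index set, per-head selection errors cannot accumulate independently over long reasoning traces, which is the failure mode of head-local selection described above.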
Figure 1: The selection process of LessIsMore is three-fold: (1) under token budget $K=4, r=0.25$, compute the attention score matrix $W$ and extract the top-$k$ ($k=K(1-r)$) tokens for each attention head; (2) aggregate the per-head selections into a single unified token set shared across all heads; (3) fill the remaining $Kr$ slots with the most recently generated tokens.
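For concreteness, the caption's parameters split the budget as follows (a small illustrative calculation, not code from the repository):

```python
K, r = 4, 0.25
k_top = int(K * (1 - r))   # 3 tokens chosen by cross-head aggregation
k_recent = K - k_top       # 1 slot reserved for the most recent token
```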
Figure 2: Reasoning accuracy of LessIsMore (ours), Quest, TidalDecode, SeerAttention-r, and Full Attention across multiple mainstream reasoning tasks. Across all evaluated tasks, LessIsMore consistently achieves lossless accuracy with small token budgets (1K or 2K), outperforming all other baselines.
Figure 3: Efficiency-accuracy tradeoff on AIME-24 using the GQA-based Llama-3.1-8B. Each point shows the end-to-end per-token decoding latency averaged over the corresponding generation length. LessIsMore (orange squares) consistently achieves higher accuracy than TidalDecode (blue circles) while maintaining lower latency across all token budgets (1K, 2K, 4K, 6K). The closer a point is to the top-left corner, the better the method performs. The Full Attention baseline (triangle) provides the accuracy upper bound but at higher computational cost.
Figure 4: AIME-24 accuracy and the corresponding average reasoning length (in K tokens) of different approaches on Qwen3-8B. The highest accuracy and the lowest generation length in each column are in bold, excluding the Full Attention row.
- Clone the repository and its submodules
git clone https://github.com/DerrickYLJ/LessIsMore.git
git submodule update --init --recursive
- Install dependency libraries
conda create -yn lessismore python=3.10
conda activate lessismore
pip install -e . && pip install flash-attn==2.3.0 --no-build-isolation
python setup.py develop
# Install CMake (with version >= 3.26.4)
conda install cmake
# build libraft
cd kernels/3rdparty/raft
./build.sh libraft
- Build end-to-end operators with PyBind
# This will automatically build and link the operators
cd tidal/ops
bash setup.sh
Run reasoning tasks by submitting a Slurm job:
sbatch experiment/reasoning/run_eval_slurm.sh
Kernels and end-to-end efficiency are evaluated on an A5000 GPU with CUDA 12.2.
To reproduce the end-to-end efficiency results, please execute:
cd scripts
bash bench_efficiency_e2e.sh
@misc{yang2025moretrainingfreesparseattention,
title={Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning},
author={Lijie Yang and Zhihao Zhang and Arti Jain and Shijie Cao and Baihong Yuan and Yiwei Chen and Zhihao Jia and Ravi Netravali},
year={2025},
eprint={2508.07101},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.07101},
}
LessIsMore adopts code snippets from TidalDecode and Quest, and uses scripts from LIMO and SGLang for reasoning evaluation. Our kernels and end-to-end system are implemented based on FlashInfer. Thanks to all the amazing work from the community!


