Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

[paper](https://arxiv.org/abs/2508.07101)

TL;DR

LessIsMore is a novel sparse attention mechanism that dramatically improves the efficiency of large reasoning models without sacrificing accuracy. The key insight is that existing sparse attention methods fail on reasoning tasks because they make localized token selection decisions for each attention head separately, leading to accumulated errors over long reasoning sequences. Instead, LessIsMore leverages two key observations: (1) attention heads in reasoning tasks show significant overlap in which tokens they find important (spatial locality), and (2) recently generated tokens consistently receive high attention across multiple future steps (recency locality).

By globally aggregating token selections across all attention heads and reserving a stable portion of the budget for recent tokens, LessIsMore achieves lossless accuracy on challenging reasoning benchmarks like AIME-24 while using up to 87.5% fewer tokens and delivering a 1.1× average decoding speedup. Unlike other sparse attention methods, which extend generation length due to selection errors, LessIsMore achieves a 1.13× end-to-end generation speedup without increasing generation length, making it a practical solution for deploying large reasoning models with significantly reduced computational overhead.
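To make the selection rule concrete, here is a minimal sketch of the unified top-k selection with a reserved recency window, written in plain PyTorch. The function name, tensor layout, rank-major flattening order, and the choice to exclude the recency window from score-based selection are illustrative assumptions, not the repository's actual implementation; Figure 1 below describes the same procedure with $K=4$, $r=0.25$.

import torch

def select_tokens(attn_scores, K=4, r=0.25):
    # attn_scores: [num_heads, seq_len] attention weights of the current query
    # over all cached tokens; returns the indices of the K tokens kept as rho.
    num_heads, seq_len = attn_scores.shape
    k = int(K * (1 - r))          # budget for globally unified tokens, k = K(1 - r)
    n_recent = K - k              # budget reserved for the most recent tokens

    # Exclude the recency window from score-based selection so the final set
    # has exactly K tokens (how overlaps are resolved is an assumption here).
    scores = attn_scores[:, : seq_len - n_recent]

    # (1) top-k token indices per attention head, ordered by descending score
    per_head = scores.topk(k, dim=-1).indices            # [num_heads, k]

    # (2) flatten rank-major (rank 0 of every head first, then rank 1, ...) and
    #     union with first-occurrence order, keeping only the first k indices
    seen, unified = set(), []
    for idx in per_head.t().reshape(-1).tolist():
        if idx not in seen:
            seen.add(idx)
            unified.append(idx)
            if len(unified) == k:
                break

    # (3) concatenate with the n_recent most recent positions -> final set rho
    recent = list(range(seq_len - n_recent, seq_len))
    return torch.tensor(unified + recent)

With the Figure 1 defaults ($K=4$, $r=0.25$), this keeps $k=3$ globally unified tokens plus 1 token from the sliding window.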

Figure 1: The selection process of LessIsMore is three-fold: (1) under token budget $K=4$, $r=0.25$, compute the attention score matrix $W$ and extract the top-$k$ ($k=K(1-r)$, where $r$ is the ratio of the most recent tokens reserved by default as a sliding window) token indices for each attention head as $\rho_{head}$; (2) flatten and union the selected indices across all heads, keeping the first $k$ indices; (3) concatenate them with the most recent tokens, yielding the final token set $\rho$. The sparse attention layers load only the tensors of tokens in $\rho$ from the KV cache for all attention heads until the next selection layer or the end of the decoding step.
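For the last step of the caption, here is a minimal sketch of how a subsequent sparse attention layer could consume $\rho$ during decoding, assuming a single-query, non-batched layout; the tensor shapes and function name are illustrative assumptions, and the repository's actual kernels are built on FlashInfer.

import torch

def sparse_attend(q, k_cache, v_cache, rho):
    # q: [num_heads, head_dim] query of the current decoding step
    # k_cache, v_cache: [seq_len, num_heads, head_dim] full KV cache
    # rho: [K] indices of the selected token set, shared by all heads
    k_sel = k_cache.index_select(0, rho)                   # load only K entries
    v_sel = v_cache.index_select(0, rho)
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("hd,khd->hk", q, k_sel) * scale  # attention over K tokens only
    return torch.einsum("hk,khd->hd", scores.softmax(-1), v_sel)

Because the same $\rho$ is shared by all attention heads and reused until the next selection layer, only $K$ KV-cache entries are loaded per head instead of the full sequence.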

Accuracy

Figure 2: Reasoning Evaluation

Figure 2: Reasoning accuracy of LessIsMore (ours), Quest, TidalDecode, SeerAttention-r, and Full Attention across multiple mainstream reasoning tasks. Across all evaluated tasks, LessIsMore consistently achieves lossless accuracy with small token budgets (1K or 2K), outperforming all other baselines.

Latency

Figure 3: Latency Evaluation

Figure 3: Efficiency-accuracy tradeoff comparison on AIME-24 using GQA-based Llama-3.1-8B. Each point shows the end-to-end average per-token decoding latency at the method's corresponding average generation length. LessIsMore (orange squares) consistently achieves higher accuracy than TidalDecode (blue circles) while maintaining lower latency across all token budgets (1K, 2K, 4K, 6K). The closer a point is to the top-left corner, the better the method performs. The Full Attention baseline (triangle) provides the accuracy upper bound, but at higher computational cost.

Generation of LessIsMore

Figure 4: Generation Length Comparison

Figure 4: AIME-24 accuracy and the corresponding average reasoning length (in K tokens) for different approaches on Qwen3-8B. The highest accuracy and the shortest generation length in each column are in bold, excluding the Full Attention row.

Installation

  1. Clone the repository and its submodules
git clone https://github.com/DerrickYLJ/LessIsMore.git
cd LessIsMore
git submodule update --init --recursive
  2. Install dependency libraries
conda create -yn lessismore python=3.10
conda activate lessismore
pip install -e . && pip install flash-attn==2.3.0 --no-build-isolation
python setup.py develop

# Install CMake (with version >= 3.26.4)
conda install cmake

# build libraft
cd kernels/3rdparty/raft
./build.sh libraft
  3. Build end-to-end operators with PyBind
# This will automatically build and link the operators
cd tidal/ops
bash setup.sh

Performance Evaluation

Run reasoning tasks by submitting a slurm job:

sbatch experiment/reasoning/run_eval_slurm.sh

Efficiency Evaluation

Kernels and end-to-end efficiency are evaluated on an A5000 GPU with CUDA 12.2.

End-to-end Efficiency

To reproduce the end-to-end efficiency results, please execute:

cd scripts
bash bench_efficiency_e2e.sh

Citation

@misc{yang2025moretrainingfreesparseattention,
      title={Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning}, 
      author={Lijie Yang and Zhihao Zhang and Arti Jain and Shijie Cao and Baihong Yuan and Yiwei Chen and Zhihao Jia and Ravi Netravali},
      year={2025},
      eprint={2508.07101},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.07101}, 
}

Related Projects

LessIsMore adopts code snippets from TidalDecode and Quest. It uses scripts from LIMO and SGLang for reasoning evaluation. Our kernels and end-to-end system are implemented on top of FlashInfer. Thanks to the community for all of these amazing works!
