
RouterLens: Eliciting and Leveraging your Specialized MoE Experts


Overall Pipeline

📍 TL;DR

Experts in Mixture-of-Experts (MoE) LLMs have been shown to specialize in different aspects (e.g., domains and tasks). However, these specializations are often suppressed by the load-balancing constraint. To better elicit specialized experts, we introduce RouterLens, a lightweight router-tuning tool that identifies specialized experts. We demonstrate its effectiveness by identifying experts specialized in leveraging context (i.e., context-faithful experts). Building on this, we propose Context-faithful Expert Fine-Tuning (CEFT), a parameter-efficient tuning approach that achieves performance comparable to full fine-tuning while significantly reducing the number of trainable parameters.

🗺️ Table of Contents

🎯 Quick Start
📋 Quantitative Results
⚙️ Internal Working of Context-faithful Experts
©️ License
🔖 Citation

🎯 Quick Start

Installation

Build RouterLens from source and install dependencies:

❯ git clone https://github.com/bigai-nlco/RouterLens.git
❯ cd RouterLens
❯ conda env create -f environment.yml
❯ conda activate routerlens

Eliciting Context-faithful Experts

Run the router training with:

❯ ./run_router_tuning.sh
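
Router tuning updates the per-layer MoE routers while keeping the rest of the model frozen. Below is a minimal sketch of that setup in PyTorch; the OLMoE checkpoint and the ".mlp.gate" parameter-name pattern are assumptions for illustration, and the script above drives the actual training.

# Sketch: freeze all weights except the MoE router (gate) projections.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMoE-1B-7B-0924", torch_dtype=torch.bfloat16  # assumed checkpoint
)

for name, param in model.named_parameters():
    # Only parameters under "<layer>.mlp.gate" (the router) remain trainable.
    param.requires_grad = ".mlp.gate." in name

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable (router) parameters: {n_trainable:,}")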

Count the activation frequency of experts and identify the top-activated ones as context-faithful experts with:

❯ ./run_exp_act_count.sh
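
The activation count is obtained by running the router-tuned model over a task's examples and tallying how often each expert is selected. A minimal sketch using forward hooks is shown below; the "mlp.gate" module name, the top_k value, and the variable names are illustrative assumptions rather than the repo's exact implementation.

# Sketch: count how often each expert is routed to, per layer.
from collections import Counter
import torch

def attach_router_counters(model, top_k=8):
    counts = {}  # layer name -> Counter over expert indices

    def make_hook(layer_name):
        def hook(module, inputs, output):
            # output: router logits with shape (num_tokens, num_experts)
            selected = torch.topk(output, top_k, dim=-1).indices.flatten()
            counts.setdefault(layer_name, Counter()).update(selected.tolist())
        return hook

    for name, module in model.named_modules():
        if name.endswith("mlp.gate"):  # assumed router module name
            module.register_forward_hook(make_hook(name))
    return counts

# Usage, assuming `model` and a tokenized `batch` already exist:
# counts = attach_router_counters(model)
# with torch.no_grad():
#     model(**batch)
# top_experts = {layer: c.most_common(8) for layer, c in counts.items()}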

Efficient Context-faithful Optimization

Run the context-faithful expert tuning with:

❯ ./run_ceft_tuning.sh
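
CEFT then fine-tunes only the experts identified in the previous step and keeps every other parameter frozen. The sketch below illustrates the selective unfreezing; the "model.layers.{i}.mlp.experts.{j}" naming and the example expert indices are assumptions, and run_ceft_tuning.sh performs the actual training.

# Sketch: unfreeze only the identified context-faithful experts.
import torch
from transformers import AutoModelForCausalLM

# Hypothetical output of the activation-counting step: layer index -> expert ids.
context_faithful = {6: [3, 17], 12: [5, 42]}

model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMoE-1B-7B-0924", torch_dtype=torch.bfloat16  # assumed checkpoint
)

for param in model.parameters():
    param.requires_grad = False

for name, param in model.named_parameters():
    for layer, experts in context_faithful.items():
        for idx in experts:
            if name.startswith(f"model.layers.{layer}.mlp.experts.{idx}."):
                param.requires_grad = True  # train this expert's FFN weights

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable (expert) parameters: {n_trainable:,}")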

📋 Quantitative Results


Figure 1: Router tuning can significantly improve the performance of MoE LLMs on context-dependent tasks, indicating the presence of experts specialized in context utilization.


Figure 2: Masking the top-activated experts from the router-tuned (RT) model (i.e., context-faithful experts, CE) significantly degrades performance on context-dependent tasks.


Figure 3: CEFT can achieve performance comparable to full fine-tuning (FFT) while requiring significantly fewer trainable parameters.

⚙️ Internal Working of Context-faithful Experts


Figure 1: Layer-wise attention gain on context and answer (CAG and AAG) for the router-tuned model over the untuned model on the NQ-Swap test set.
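
For reference, one way to compute such an attention gain is sketched below, assuming CAG/AAG measure the difference (router-tuned minus untuned) in attention mass that the final position places on the context or answer tokens; the exact definitions are in the paper, and the variable names here are illustrative.

# Sketch: per-layer attention mass on a token span, for computing a gain.
import torch

@torch.no_grad()
def attention_mass(model, batch, span):
    # `span` indexes the context (or answer) tokens; returns one value per layer.
    out = model(**batch, output_attentions=True)
    masses = []
    for attn in out.attentions:        # each: (batch, heads, q_len, k_len)
        last_q = attn[:, :, -1, :]     # attention from the final query position
        masses.append(last_q[..., span].sum(-1).mean().item())
    return masses

# cag = [t - b for t, b in zip(attention_mass(tuned_model, batch, ctx_span),
#                              attention_mass(base_model, batch, ctx_span))]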


Figure 2: Attention gain from context-faithful experts in OLMoE-1B-7B on an NQ-Swap example. At Layer 6 (left) and Layer 12 (right), i.e., a mid-level and a deeper layer, the router-tuned model progressively increases attention to the context and answer tokens (i.e., "1964"), illustrating a "think twice" mechanism. Notably, the base model fails on this example, while the router-tuned model produces the correct answer.


Figure 3: Answer Probability Gain (APG) of the router-tuned models over their untuned counterparts on the NQ-Swap test set.
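
Similarly, APG can be read off the output distributions of the two models. The sketch below assumes APG is the router-tuned model's probability of the gold answer tokens minus the base model's; answer_positions are the positions whose next-token distribution should produce each answer token, and all names are illustrative.

# Sketch: probability assigned to the gold answer tokens, for computing APG.
import torch

@torch.no_grad()
def answer_probability(model, batch, answer_ids, answer_positions):
    probs = model(**batch).logits.softmax(-1)[0, answer_positions]  # (n, vocab)
    return probs.gather(-1, answer_ids.unsqueeze(-1)).mean().item()

# apg = answer_probability(tuned_model, batch, answer_ids, answer_positions) \
#     - answer_probability(base_model, batch, answer_ids, answer_positions)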

©️ License

RouterLens is licensed under the MIT License. You are free to use, modify, and distribute this project under the terms of the MIT license.

🔖 Citation

@article{bai2025routerlens,
      title={Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs}, 
      author={Jun Bai and Minghao Tong and Yang Liu and Zixia Jia and Zilong Zheng},
      year={2025},
      eprint={2508.19594},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.19594}, 
}

About

EMNLP 2025 | RouterLens
