The code for the ACL 2025 conference paper "TokAlign: Efficient Vocabulary Adaptation via Token Alignment".
We propose TokAlign, an efficient method that replaces the vocabulary of an LLM from a token co-occurrence perspective and further transfers token-level knowledge between models. TokAlign first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix over token IDs. Model parameters, including the embeddings, are then rearranged according to this mapping and progressively fine-tuned for the new vocabulary. The following figure illustrates the TokAlign method:
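Concretely, the learned alignment boils down to an index map over token IDs: each new-vocabulary token is matched to its most similar old-vocabulary token, and the new embedding table is assembled by gathering the corresponding rows of the old one. A toy NumPy sketch of this idea (sizes and names are illustrative only, not the repository's code):

```python
import numpy as np

# Toy sizes: source (old) and target (new) vocabularies, hidden dimension.
V_src, V_tgt, d = 1000, 800, 64

# Token vectors learned from co-occurrence statistics (e.g., GloVe),
# assumed here to already live in a comparable space.
src_vecs = np.random.randn(V_src, 50)
tgt_vecs = np.random.randn(V_tgt, 50)

# One-to-one mapping: for every target token ID, pick the most similar source token ID.
sim = tgt_vecs @ src_vecs.T          # (V_tgt, V_src) similarity matrix
align = sim.argmax(axis=1)           # align[i] = source ID aligned to target ID i

# Initialize the new embedding table by rearranging rows of the old one;
# the rearranged parameters are then fine-tuned progressively.
old_embedding = np.random.randn(V_src, d)    # stands in for the pretrained embeddings
new_embedding = old_embedding[align]         # shape (V_tgt, d)
```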
conda create -n tokalign python=3.10
conda activate tokalign
pip install -r requirements.txt
- Download and merge multilingual, code, and math data, e.g., CulturaX, the-stack, and proof-pile-2 from HuggingFace; a hedged download-and-merge sketch follows below. We provide a small example corpus in the "./data/pretrain-corpus" directory.
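For illustration, downloading and merging such corpora with 🤗 `datasets` might look like the sketch below; the dataset IDs, subsets, and sample counts are assumptions, not the exact data mixture used in the paper:

```python
from datasets import load_dataset

# Dataset IDs, subsets, and sample counts here are illustrative assumptions;
# substitute the corpora (and language/domain subsets) you actually use.
multilingual = load_dataset("uonlp/CulturaX", "en", split="train", streaming=True)
code = load_dataset("bigcode/the-stack", data_dir="data/python", split="train", streaming=True)
math = load_dataset("EleutherAI/proof-pile-2", "arxiv", split="train", streaming=True)

# Concatenate a slice of each source into one plain-text file, one document per line.
with open("data/pretrain-corpus/merged.txt", "w", encoding="utf-8") as f:
    for dataset, field, n_docs in [(multilingual, "text", 50_000),
                                   (code, "content", 25_000),
                                   (math, "text", 25_000)]:
        for i, example in enumerate(dataset):
            f.write(example[field].replace("\n", " ") + "\n")
            if i + 1 >= n_docs:
                break
```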
- Tokenize the corpus and prepare the files for GloVe vector training and evaluation:
# Replace the paths below with your corpus and tokenizer paths
vim script/convert2glove_corpus.sh
bash script/convert2glove_corpus.sh
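Since GloVe expects whitespace-separated tokens, a natural preparation step is to re-emit each document as a sequence of token IDs under both tokenizers, so that every token ID becomes a GloVe "word". A hedged sketch of that conversion (tokenizer names and file paths are placeholders, not the script's actual arguments):

```python
from transformers import AutoTokenizer

# Placeholder tokenizers: a source model tokenizer and the target tokenizer to adopt.
src_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
tgt_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

def to_glove_corpus(docs, tokenizer, out_path):
    """Write every document as a whitespace-separated sequence of token IDs,
    so that GloVe treats each token ID as a 'word'."""
    with open(out_path, "w", encoding="utf-8") as out:
        for doc in docs:
            ids = tokenizer.encode(doc, add_special_tokens=False)
            out.write(" ".join(map(str, ids)) + "\n")

with open("data/pretrain-corpus/merged.txt", encoding="utf-8") as f:
    docs = [line.rstrip("\n") for line in f]

to_glove_corpus(docs, src_tok, "glove_corpus.src.txt")
to_glove_corpus(docs, tgt_tok, "glove_corpus.tgt.txt")
```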
git clone https://github.com/stanfordnlp/GloVe.git
# Train GloVe vectors for the source and target vocabularies and learn the token alignment
bash script/token_align.sh
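`token_align.sh` trains the GloVe vectors and produces the alignment matrix. The sketch below shows one plausible way a one-to-one ID mapping can be extracted from two independently trained GloVe spaces: seed pairs from tokens whose strings are identical in both vocabularies, an orthogonal Procrustes rotation, then nearest-neighbor matching. This is an illustration under those assumptions, not necessarily the paper's exact procedure, and every file and model name is a placeholder.

```python
import numpy as np
from transformers import AutoTokenizer

def load_glove(path):
    """Parse GloVe output: one line per token, '<token_id> <dim_1> ... <dim_n>'."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            vecs[int(parts[0])] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

src_vecs = load_glove("vectors.src.txt")                               # placeholder file names
tgt_vecs = load_glove("vectors.tgt.txt")
src_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")        # placeholder models
tgt_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Seed pairs: token IDs whose decoded strings are identical under both tokenizers.
src_by_str = {src_tok.decode([i]): i for i in src_vecs}
seed = [(src_by_str[tgt_tok.decode([j])], j) for j in tgt_vecs
        if tgt_tok.decode([j]) in src_by_str]

# Orthogonal Procrustes: rotate the target GloVe space onto the source space.
X = np.stack([src_vecs[s] for s, _ in seed])
Y = np.stack([tgt_vecs[t] for _, t in seed])
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt

# One-to-one mapping: nearest source token (by cosine) for every target token.
src_ids = np.array(sorted(src_vecs))
S = np.stack([src_vecs[i] for i in src_ids])
S /= np.linalg.norm(S, axis=1, keepdims=True)

# Target IDs never seen in the corpus stay mapped to source ID 0 in this sketch.
align = np.zeros(max(tgt_vecs) + 1, dtype=np.int64)
for j, v in tgt_vecs.items():
    q = v @ W
    align[j] = src_ids[(S @ (q / np.linalg.norm(q))).argmax()]
np.save("align_matrix.npy", align)                                     # align[target_id] = source_id
```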
# Set the path to the alignment matrix for evaluation, and choose an evaluation metric (BLEU-1 or BERTScore)
vim script/eval_align.sh
bash script/eval_align.sh
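`eval_align.sh` scores the learned mapping. One way to read the BLEU-1 option is a round-trip check: encode held-out text with the target tokenizer, map every ID through the alignment, decode with the source tokenizer, and measure unigram precision against the original text. The sketch below illustrates that reading only; the script's actual computation may differ, and all paths and models are placeholders.

```python
import numpy as np
from collections import Counter
from transformers import AutoTokenizer

align = np.load("align_matrix.npy")                                    # align[target_id] = source_id
src_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")        # placeholder models
tgt_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

def bleu1(hypothesis, reference):
    """Unigram precision (BLEU-1 without brevity penalty) over whitespace tokens."""
    hyp, ref = hypothesis.split(), Counter(reference.split())
    if not hyp:
        return 0.0
    matched = sum(min(count, ref[word]) for word, count in Counter(hyp).items())
    return matched / len(hyp)

text = "The quick brown fox jumps over the lazy dog."
tgt_ids = tgt_tok.encode(text, add_special_tokens=False)
mapped = [int(align[i]) for i in tgt_ids]        # target IDs -> aligned source IDs
round_trip = src_tok.decode(mapped)              # decode with the source tokenizer
print(f"BLEU-1: {bleu1(round_trip, text):.3f}")
```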
# Modify the path of the alignment matrix
vim script/init_model.sh
bash script/init_model.sh
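`init_model.sh` produces the re-initialized checkpoint. The gist, sketched below with 🤗 `transformers`, is to gather the source model's input-embedding and LM-head rows according to the alignment, so that every token of the new vocabulary starts from the parameters of its aligned old token. Model names and paths are placeholders, and the alignment is assumed to cover the whole new vocabulary.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: the source model, the target tokenizer, and the alignment file.
align = torch.from_numpy(np.load("align_matrix.npy")).long()          # align[new_id] = old_id
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")
new_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
assert len(align) == len(new_tok), "alignment must cover the whole new vocabulary"

with torch.no_grad():
    old_in = model.get_input_embeddings().weight.clone()
    old_out = model.get_output_embeddings().weight.clone()

    # Resize to the new vocabulary, then gather rows of the old tables via the alignment.
    model.resize_token_embeddings(len(new_tok))
    model.get_input_embeddings().weight.copy_(old_in[align])
    model.get_output_embeddings().weight.copy_(old_out[align])

model.save_pretrained("init-model")
new_tok.save_pretrained("init-model")
```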
# First, tokenize the training dataset used for vocabulary adaptation
vim script/tokenize_dataset.sh
bash script/tokenize_dataset.sh
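`tokenize_dataset.sh` prepares the training data in the new vocabulary. A minimal equivalent with 🤗 `datasets`, assuming a plain-text corpus and the re-initialized checkpoint from the previous step (paths and the sequence length are placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder paths: the merged text corpus and the new (target) tokenizer.
tokenizer = AutoTokenizer.from_pretrained("init-model")
dataset = load_dataset("text", data_files={"train": "data/pretrain-corpus/merged.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("data/tokenized-adaptation")
```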
# Replace the paths and hyper-parameters with yours, then start the vocabulary adaptation
vim script/vocab_adaptation.sh
bash script/vocab_adaptation.sh
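`vocab_adaptation.sh` runs the fine-tuning itself. As a rough, hedged reading of the progressive schedule described above: first train only the rearranged embedding/LM-head parameters, then unfreeze the rest of the model. The sketch below uses the 🤗 `Trainer` with placeholder hyper-parameters; it is not the repository's training configuration.

```python
from datasets import load_from_disk
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("init-model")                # placeholder paths
model = AutoModelForCausalLM.from_pretrained("init-model")
data = load_from_disk("data/tokenized-adaptation")["train"]
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Stage 1: train only the rearranged embedding and LM-head parameters.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True
for p in model.get_output_embeddings().parameters():
    p.requires_grad = True

stage1 = TrainingArguments("ckpt/stage1", per_device_train_batch_size=4,
                           max_steps=1000, learning_rate=5e-4, report_to=[])
Trainer(model=model, args=stage1, train_dataset=data, data_collator=collator).train()

# Stage 2: unfreeze everything and continue training on the same data.
for p in model.parameters():
    p.requires_grad = True
stage2 = TrainingArguments("ckpt/stage2", per_device_train_batch_size=4,
                           max_steps=4000, learning_rate=1e-4, report_to=[])
Trainer(model=model, args=stage2, train_dataset=data, data_collator=collator).train()
```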
We open-source the following models:
| Name | LLaMA3 Tokenizer | Qwen2 Tokenizer | Gemma Tokenizer |
| --- | --- | --- | --- |
| TokAlign | 🤗 | 🤗 | 🤗 |
| + Token-level Distill | 🤗 | 🤗 | 🤗 |

Table 1. Models from
| Name | LLaMA3 Tokenizer | Qwen2 Tokenizer | Gemma Tokenizer |
| --- | --- | --- | --- |
| TokAlign | 🤗 | 🤗 | 🤗 |
| + Token-level Distill | 🤗 | 🤗 | 🤗 |

Table 2. Models from
@inproceedings{li-etal-2025-TokAlign,
author = {Chong Li and
Jiajun Zhang and
Chengqing Zong},
title = "TokAlign: Efficient Vocabulary Adaptation via Token Alignment",
booktitle = "Proceedings of the 63nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
}