TokAlign: Efficient Vocabulary Adaptation via Token Alignment

Code for the ACL 2025 paper "TokAlign: Efficient Vocabulary Adaptation via Token Alignment".

Overview

We propose TokAlign, an efficient method that replaces the vocabulary of an LLM from the perspective of token co-occurrences and further transfers token-level knowledge between models. TokAlign first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix over token IDs. Model parameters, including embeddings, are then rearranged and progressively fine-tuned for the new vocabulary.

[Figure: overview of the TokAlign method]
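As a rough illustration of the idea (toy vocabulary sizes, random vectors, and variable names of our choosing; the real alignment is learned by the scripts below), the mapping can be viewed as a nearest-neighbor match between the two sets of GloVe token vectors, which then reindexes the embedding table:

import numpy as np

# Toy sizes; real vocabularies are ~50k (Pythia) and ~128k (LLaMA3) tokens.
rng = np.random.default_rng(0)
src_vec = rng.standard_normal((1000, 300))   # GloVe vectors, source vocab
tgt_vec = rng.standard_normal((1500, 300))   # GloVe vectors, target vocab

# Cosine similarity between every target token and every source token.
src_n = src_vec / np.linalg.norm(src_vec, axis=1, keepdims=True)
tgt_n = tgt_vec / np.linalg.norm(tgt_vec, axis=1, keepdims=True)
sim = tgt_n @ src_n.T                        # (|V_tgt|, |V_src|)

# Each target ID points at its closest source ID. A plain argmax is only
# many-to-one; the learned one-to-one matrix in the paper resolves collisions.
mapping = sim.argmax(axis=1)                 # shape (|V_tgt|,)

# Rearrange the source LM's embedding table for the new vocabulary.
src_embed = rng.standard_normal((1000, 2048))
new_embed = src_embed[mapping]               # (|V_tgt|, hidden_size)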

How to run?

Set up a virtual environment

conda create -n tokalign python=3.10
conda activate tokalign
pip install -r requirements.txt

Prepare tokenized data

  1. Download and merge multilingual, code, and math data, e.g., CulturaX, the-stack, and proof-pile-2 from HuggingFace. We provide a small example corpus in the "./data/pretrain-corpus" directory.

  2. Tokenize the corpus and prepare the files for GloVe vector training and evaluation

# Replace the paths below with your corpus and tokenizer paths
vim script/convert2glove_corpus.sh 
bash script/convert2glove_corpus.sh 
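For intuition, the conversion amounts to writing each document as a line of whitespace-separated token IDs, since GloVe treats whitespace-delimited strings as words. A minimal sketch, assuming a Pythia source tokenizer and illustrative file paths:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")  # source tokenizer

with open("data/pretrain-corpus/sample.txt") as fin, \
        open("glove_corpus.src.txt", "w") as fout:
    for line in fin:
        ids = tok(line.strip())["input_ids"]
        # Each token ID becomes one pseudo-word for GloVe training.
        fout.write(" ".join(map(str, ids)) + "\n")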

Train GloVe vectors and obtain the token alignment matrix

git clone https://github.com/stanfordnlp/GloVe.git
# Train GloVe vectors for source vocabulary and target vocabulary
bash script/token_align.sh
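The script trains one set of GloVe vectors per vocabulary on the token-ID corpora from the previous step. If you want to inspect the vectors yourself, GloVe's standard vectors.txt output ("<token_id> v1 ... v_d" per line) can be parsed as follows (file paths are illustrative):

import numpy as np

def load_glove_vectors(path):
    """Parse GloVe's vectors.txt into a (vocab_size, dim) matrix."""
    ids, vecs = [], []
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] == "<unk>":          # GloVe may append an <unk> row
                continue
            ids.append(int(parts[0]))
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    mat = np.zeros((max(ids) + 1, len(vecs[0])), dtype=np.float32)
    mat[ids] = np.stack(vecs)
    return mat

src_vec = load_glove_vectors("GloVe/src_vectors.txt")  # hypothetical paths
tgt_vec = load_glove_vectors("GloVe/tgt_vectors.txt")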

Evaluate the learned one-to-one token alignment matrix

# Point the script at the alignment matrix to evaluate, and choose an evaluation metric (BLEU-1 or BERTScore).
vim script/eval_align.sh
bash script/eval_align.sh
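For intuition, BLEU-1 here reduces to unigram overlap between the alignment-mapped target tokenization and the source tokenization of the same held-out text. A toy sketch (unigram precision only, brevity penalty omitted; the script implements the actual metrics):

from collections import Counter

def bleu1(candidate, reference):
    """Unigram precision between two token-ID sequences."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(n, ref[tok]) for tok, n in cand.items())
    return overlap / max(len(candidate), 1)

# Toy example: `mapping` stands in for the learned target->source ID table,
# `src_ids`/`tgt_ids` for the two tokenizations of the same held-out text.
mapping = {0: 17, 1: 4, 2: 9}
src_ids = [17, 4, 9, 4]
tgt_ids = [0, 1, 2, 1]
print(bleu1([mapping[t] for t in tgt_ids], src_ids))  # 1.0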

Initialize the model weights with the token alignment matrix

# Modify the path to the alignment matrix
vim script/init_model.sh 
bash script/init_model.sh 
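Conceptually, initialization reindexes the source model's input and output embedding rows with the learned target-to-source ID table. A minimal sketch with transformers, assuming a Pythia-1b source model and a hypothetical file name for the mapping:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")
mapping = torch.load("align/tgt2src_ids.pt")         # LongTensor, (|V_tgt|,)

src_in = model.get_input_embeddings().weight.data.clone()
src_out = model.get_output_embeddings().weight.data.clone()

# Resize to the new vocabulary, then copy the aligned source rows in.
model.resize_token_embeddings(mapping.numel())
model.get_input_embeddings().weight.data.copy_(src_in[mapping])
model.get_output_embeddings().weight.data.copy_(src_out[mapping])
model.save_pretrained("pythia-1b-tokalign-init")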

Vocabulary Adaptation

# First tokenize the training dataset used for vocabulary adaptation
vim script/tokenize_dataset.sh
bash script/tokenize_dataset.sh

# Replace the paths and hyper-parameters with your own, then start the vocabulary adaptation
vim script/vocab_adaptation.sh
bash script/vocab_adaptation.sh
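The adaptation is progressive: a natural reading of the schedule (an assumption about the script's internals; the exact stages and step counts are set in vocab_adaptation.sh) is to first train only the re-initialized embeddings and then unfreeze the full model. A minimal sketch:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("pythia-1b-tokalign-init")

# Stage 1: train only the re-initialized embedding matrices.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True
for p in model.get_output_embeddings().parameters():
    p.requires_grad = True
# ... run the first-stage training loop on the tokenized corpus ...

# Stage 2: unfreeze everything and continue fine-tuning.
for p in model.parameters():
    p.requires_grad = True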

📎 Models

We open-source the following models:

| Name                  | LLaMA3 Tokenizer | Qwen2 Tokenizer | Gemma Tokenizer |
|-----------------------|------------------|-----------------|-----------------|
| TokAlign              | 🤗               | 🤗              | 🤗              |
| + Token-level Distill | 🤗               | 🤗              | 🤗              |

Table 1. Models from $Pythia_{1b}$

| Name                  | LLaMA3 Tokenizer | Qwen2 Tokenizer | Gemma Tokenizer |
|-----------------------|------------------|-----------------|-----------------|
| TokAlign              | 🤗               | 🤗              | 🤗              |
| + Token-level Distill | 🤗               | 🤗              | 🤗              |

Table 2. Models from $Pythia_{6.9b}$

How to cite our paper?

@inproceedings{li-etal-2025-TokAlign,
  author    = {Chong Li and
               Jiajun Zhang and
               Chengqing Zong},
  title     = {TokAlign: Efficient Vocabulary Adaptation via Token Alignment},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year      = {2025},
  address   = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
}
