Basic Dependencies:
- Python == 3.9
- PyTorch == 2.1.0
- transformers == 4.37.2
- deepspeed == 0.12.6
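A quick sanity check that the pinned versions above are installed (a minimal sketch, assuming the packages were installed via pip or conda):

```python
# Verify the pinned dependency versions before launching any training job.
import sys
import torch
import transformers
import deepspeed

print("python      ", sys.version.split()[0])    # expected 3.9.x
print("torch       ", torch.__version__)         # expected 2.1.0
print("transformers", transformers.__version__)  # expected 4.37.2
print("deepspeed   ", deepspeed.__version__)     # expected 0.12.6
print("cuda available:", torch.cuda.is_available())
```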
Multi-modal Encoder Weights:
- Download the visual encoder: openai-clip-vit-large-patch14
- Download the audio encoder: Fine-tuned BEATs_iter3+ (AS2M)
LLM Weights:
- Download LLaMA-2-Chat-HF
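A sketch of fetching the Hub-hosted weights with `huggingface_hub` (the local directory names below are arbitrary choices, not repo requirements):

```python
# Sketch: download the CLIP vision encoder and LLaMA-2 chat weights from the
# Hugging Face Hub. Local directory names are placeholders.
from huggingface_hub import snapshot_download

snapshot_download("openai/clip-vit-large-patch14",
                  local_dir="checkpoints/clip-vit-large-patch14")
snapshot_download("meta-llama/Llama-2-7b-chat-hf",  # gated repo: requires an approved access token
                  local_dir="checkpoints/llama2-7b-chat-hf")

# The fine-tuned BEATs_iter3+ (AS2M) checkpoint is distributed as a standalone
# .pt file (BEATs release); download it manually and note its path.
```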
In this repo, we take the audio-visual-text case as an example. Pre-training is based on the llama2-7b-chat-hf model.
- Download the image and video pre-training datasets from Video-LLaVA;
- Download the audio pre-training dataset from AudioCaps;
- The fine-tuning dataset is MUSIC-AVQA. Prepare the corresponding data and annotations here.
Set the path of the pre-training dataset in:
dataset/pretrain_dataset.py
Set the path of the fine-tuning dataset in:
dataset/unified_dataset.py
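The exact variable names inside these two files may differ; the snippet below is only a hypothetical sketch of the kind of edit expected:

```python
# dataset/pretrain_dataset.py (hypothetical variable names -- adapt to the
# actual constants defined in the file)
IMAGE_DATA_ROOT = "/data/video-llava/llava_image"       # image pre-training data
VIDEO_DATA_ROOT = "/data/video-llava/valley"            # video pre-training data
AUDIO_DATA_ROOT = "/data/audiocaps/audio"               # audio pre-training data

# dataset/unified_dataset.py (hypothetical)
MUSIC_AVQA_ROOT = "/data/music-avqa"                    # fine-tuning data
MUSIC_AVQA_ANNO = "/data/music-avqa/annotations.json"   # fine-tuning annotations
```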
Replace the necessary paths of google-bert-base-uncased, clip-vit-large-patch14, and BEATs in:
models/multimodal_encoder.py
models/unified_arch.py
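The loading code in those files is not reproduced here; the following is a hypothetical sketch of the calls whose paths need to point at your local copies:

```python
# models/multimodal_encoder.py (hypothetical sketch of the paths to replace)
from transformers import BertTokenizer, CLIPImageProcessor, CLIPVisionModel

bert_tokenizer  = BertTokenizer.from_pretrained("/path/to/google-bert-base-uncased")
vision_tower    = CLIPVisionModel.from_pretrained("/path/to/clip-vit-large-patch14")
image_processor = CLIPImageProcessor.from_pretrained("/path/to/clip-vit-large-patch14")

# BEATs is typically loaded from a plain checkpoint file rather than the Hub:
BEATS_CKPT = "/path/to/BEATs_iter3_plus_AS2M_finetuned.pt"
```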
Pre-training the visual projector takes about 24 hours on 20 A100 40G GPUs:
sh scripts/pretrain/pretrain_visual.sh
Pre-training the audio projector takes about 1 hour on 16 A100 40G GPUs:
sh scripts/pretrain/pretrain_audio.sh
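For reference, a projector in LLaVA-style pipelines is usually a small MLP that maps encoder features (e.g. CLIP ViT-L/14 hidden size 1024) into the LLM embedding space (4096 for llama2-7b-chat-hf). A minimal sketch of such a module, not the repo's exact implementation:

```python
import torch.nn as nn

class ProjectorSketch(nn.Module):
    """Two-layer MLP projector mapping encoder features to LLM embeddings."""
    def __init__(self, encoder_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features):       # features: (batch, tokens, encoder_dim)
        return self.proj(features)     # -> (batch, tokens, llm_dim)
```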
We also release our pre-trained projectors for llama2-7b-chat-hf: download the audio projector checkpoint and the visual projector checkpoint.
Set the paths of the pre-trained projectors (lines 134-135) in:
scripts/finetune/finetune.py
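The argument names below are hypothetical; match whatever the script actually defines at those lines:

```python
# scripts/finetune/finetune.py, around lines 134-135 (hypothetical names)
pretrain_visual_projector = "/path/to/visual_projector/checkpoint.bin"
pretrain_audio_projector  = "/path/to/audio_projector/checkpoint.bin"
```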
Here we take MUSIC-AVQA as an example; fine-tuning takes about 5-6 hours on 16 A100 40G GPUs:
sh scripts/finetune/ft.sh
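For context, MokA fine-tunes the frozen LLM with multimodal low-rank adaptation. The snippet below is only a generic low-rank adapter sketch (standard LoRA form), not MokA's modality-aware design, which is described in the paper:

```python
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Generic low-rank adaptation of a frozen linear layer: W x + (alpha/r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # keep the pretrained weight frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)      # the adapter starts as a zero update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))
```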
Here we take MUSIC-AVQA as an example; for inference, run:
sh scripts/finetune/infer.sh
Here we take MUSIC-AVQA as an example; for evaluation, run:
python evaluation.py
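The repo's evaluation.py defines its own I/O; as a rough idea, MUSIC-AVQA evaluation amounts to answer accuracy over the test questions. A hypothetical sketch with placeholder file names:

```python
import json

# Hypothetical file names; evaluation.py in the repo defines its own paths.
predictions = json.load(open("results/music_avqa_predictions.json"))    # {question_id: answer}
references  = json.load(open("data/music_avqa_test_annotations.json"))  # {question_id: answer}

correct = sum(predictions[qid].strip().lower() == ref.strip().lower()
              for qid, ref in references.items() if qid in predictions)
print(f"accuracy: {correct / len(references):.4f}")
```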
Citation:
@article{wei2025moka,
  title={MokA: Multimodal Low-Rank Adaptation for MLLMs},
  author={Wei, Yake and Miao, Yu and Zhou, Dongzhan and Hu, Di},
  journal={arXiv preprint arXiv:2506.05191},
  year={2025}
}