Basic Dependencies:
- Python == 3.9
- PyTorch == 2.1.0
- transformers == 4.37.2
- deepspeed == 0.12.6
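A quick sanity check that the pinned versions above are installed (a minimal sketch, assuming the packages were installed via pip or conda):

```python
# Verify the pinned dependency versions before launching any training job.
import sys
import torch
import transformers
import deepspeed

print("python      ", sys.version.split()[0])    # expected 3.9.x
print("torch       ", torch.__version__)         # expected 2.1.0
print("transformers", transformers.__version__)  # expected 4.37.2
print("deepspeed   ", deepspeed.__version__)     # expected 0.12.6
print("cuda available:", torch.cuda.is_available())
```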
Multi-modal Encoder Weights:
- Download the visual encoder: openai-clip-vit-large-patch14
- Download the audio encoder: Fine-tuned BEATs_iter3+ (AS2M)
LLM Weights:
- Download LLaMA-2-Chat-HF
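A sketch of fetching the Hub-hosted weights with `huggingface_hub` (the local directory names below are arbitrary choices, not repo requirements):

```python
# Sketch: download the CLIP vision encoder and LLaMA-2 chat weights from the
# Hugging Face Hub. Local directory names are placeholders.
from huggingface_hub import snapshot_download

snapshot_download("openai/clip-vit-large-patch14",
                  local_dir="checkpoints/clip-vit-large-patch14")
snapshot_download("meta-llama/Llama-2-7b-chat-hf",  # gated repo: requires an approved access token
                  local_dir="checkpoints/llama2-7b-chat-hf")

# The fine-tuned BEATs_iter3+ (AS2M) checkpoint is distributed as a standalone
# .pt file (BEATs release); download it manually and note its path.
```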
In this repo, we take the audio-visual-text case as an example. Pre-training is based on the llama2-7b-chat-hf model.
- Download the image and video pre-training datasets from Video-LLaVA;
- Download the audio pre-training dataset from AudioCaps;
- The fine-tuning dataset is MUSIC-AVQA. Prepare the corresponding data and annotations here.
Set the path of the pre-training dataset in:
dataset/pretrain_dataset.py
Set the path of the fine-tuning dataset in:
dataset/unified_dataset.py
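The exact variable names inside these two files may differ; the snippet below is only a hypothetical sketch of the kind of edit expected:

```python
# dataset/pretrain_dataset.py (hypothetical variable names -- adapt to the
# actual constants defined in the file)
IMAGE_DATA_ROOT = "/data/video-llava/llava_image"       # image pre-training data
VIDEO_DATA_ROOT = "/data/video-llava/valley"            # video pre-training data
AUDIO_DATA_ROOT = "/data/audiocaps/audio"               # audio pre-training data

# dataset/unified_dataset.py (hypothetical)
MUSIC_AVQA_ROOT = "/data/music-avqa"                    # fine-tuning data
MUSIC_AVQA_ANNO = "/data/music-avqa/annotations.json"   # fine-tuning annotations
```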
Replace the necessary paths of google-bert-base-uncased, clip-vit-large-patch14, and BEATs in:
models/multimodal_encoder.py
models/unified_arch.py
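The loading code in those files is not reproduced here; the following is a hypothetical sketch of the calls whose paths need to point at your local copies:

```python
# models/multimodal_encoder.py (hypothetical sketch of the paths to replace)
from transformers import BertTokenizer, CLIPImageProcessor, CLIPVisionModel

bert_tokenizer  = BertTokenizer.from_pretrained("/path/to/google-bert-base-uncased")
vision_tower    = CLIPVisionModel.from_pretrained("/path/to/clip-vit-large-patch14")
image_processor = CLIPImageProcessor.from_pretrained("/path/to/clip-vit-large-patch14")

# BEATs is typically loaded from a plain checkpoint file rather than the Hub:
BEATS_CKPT = "/path/to/BEATs_iter3_plus_AS2M_finetuned.pt"
```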
Pre-training the visual projector takes about 24 hours on 20 A100 40G GPUs:
sh scripts/pretrain/pretrain_visual.sh
Pre-training the audio projector takes about 1 hour on 16 A100 40G GPUs:
sh scripts/pretrain/pretrain_audio.sh
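For reference, a projector in LLaVA-style pipelines is usually a small MLP that maps encoder features (e.g. CLIP ViT-L/14 hidden size 1024) into the LLM embedding space (4096 for llama2-7b-chat-hf). A minimal sketch of such a module, not the repo's exact implementation:

```python
import torch.nn as nn

class ProjectorSketch(nn.Module):
    """Two-layer MLP projector mapping encoder features to LLM embeddings."""
    def __init__(self, encoder_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features):       # features: (batch, tokens, encoder_dim)
        return self.proj(features)     # -> (batch, tokens, llm_dim)
```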
We also release our pre-trained projectors for llama2-7b-chat-hf: download the audio projector checkpoint and the visual projector checkpoint.
Set the paths of the pre-trained projectors (lines 134-135) in:
scripts/finetune/finetune.py
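The argument names below are hypothetical; match whatever the script actually defines at those lines:

```python
# scripts/finetune/finetune.py, around lines 134-135 (hypothetical names)
pretrain_visual_projector = "/path/to/visual_projector/checkpoint.bin"
pretrain_audio_projector  = "/path/to/audio_projector/checkpoint.bin"
```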
Here we take MUSIC-AVQA as an example; fine-tuning takes about 5-6 hours on 16 A100 40G GPUs:
sh scripts/finetune/ft.sh
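For context, MokA fine-tunes the frozen LLM with multimodal low-rank adaptation. The snippet below is only a generic low-rank adapter sketch (standard LoRA form), not MokA's modality-aware design, which is described in the paper:

```python
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Generic low-rank adaptation of a frozen linear layer: W x + (alpha/r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # keep the pretrained weight frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)      # the adapter starts as a zero update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))
```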
Here we take MUSIC-AVQA as an example; for inference, run:
sh scripts/finetune/infer.sh
Here we take MUSIC-AVQA as an example; for evaluation, run:
python evaluation.py
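The repo's evaluation.py defines its own I/O; as a rough idea, MUSIC-AVQA evaluation amounts to answer accuracy over the test questions. A hypothetical sketch with placeholder file names:

```python
import json

# Hypothetical file names; evaluation.py in the repo defines its own paths.
predictions = json.load(open("results/music_avqa_predictions.json"))    # {question_id: answer}
references  = json.load(open("data/music_avqa_test_annotations.json"))  # {question_id: answer}

correct = sum(predictions[qid].strip().lower() == ref.strip().lower()
              for qid, ref in references.items() if qid in predictions)
print(f"accuracy: {correct / len(references):.4f}")
```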
Citation:
@article{wei2025moka,
  title={MokA: Multimodal Low-Rank Adaptation for MLLMs},
  author={Wei, Yake and Miao, Yu and Zhou, Dongzhan and Hu, Di},
  journal={arXiv preprint arXiv:2506.05191},
  year={2025}
}