If you find this project useful, please give us a star 🌟.
🎉 Discrete Neural Codec with 24 Tokens per Second (24 kHz) for Spoken Language Modeling!
Differently colored lines indicate the data flow used during inference versus the flow used only for training. During inference, the audio is passed through the encoder and VQ1 to produce a discrete quantized representation, which is then refined by the MLP. The decoder and the fine-tuned BigVGAN subsequently reconstruct the Mel-spectrogram and the waveform.
To install HH-Codec, follow these steps:
conda create -n hhcodec python=3.10  # requires Python >= 3.10 (needed by BigVGAN)
conda activate hhcodec
git clone https://github.com/opendilab/HH-Codec.git
cd HH-Codec
pip install -e .
# Install Dependencies for UTMOS Evaluation
pip install fairseq
# If you encounter conflicts, try:
pip install pip==24.0
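After installation, a quick import check helps confirm the environment is consistent. This is a minimal sanity check; it assumes `torch` and `torchaudio` are pulled in by `pip install -e .`:

```bash
# Verify that the core dependencies import cleanly
python -c "import torch, torchaudio, fairseq; print('torch', torch.__version__)"
```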
Ensure your dataset is preprocessed by following the instructions in the `dataset` directory.
Before starting training, update the configuration settings:
# Open and modify configs/train.yaml
# Adjust parameters such as:
# - log settings
# - train_path
# - save_dir
# - device (e.g., CPU/GPU)
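For orientation, here is a minimal sketch of how such a configuration might be laid out. The key names below are illustrative assumptions only; the authoritative schema is `configs/train.yaml` itself:

```yaml
# Illustrative sketch only — key names are assumptions, not the actual schema
trainer:
  accelerator: gpu          # device: CPU/GPU
  devices: 1
  default_root_dir: ./logs  # log settings
data:
  train_path: /path/to/preprocessed/dataset
model:
  save_dir: ./checkpoints
```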
Once the dataset is prepared and the configuration is set, launch the training process:
cd HH-Codec
python train.py fit --config configs/train.yaml
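The `fit` subcommand suggests a PyTorch Lightning CLI entry point; if that holds, individual settings can typically be overridden on the command line without editing the YAML (verify against the repo before relying on it):

```bash
# Example override, assuming a Lightning CLI entry point
python train.py fit --config configs/train.yaml --trainer.devices 2
```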
To reproduce the results reported in the paper in a single run, use the dataset from step 1, the configuration from step 2, and the training script from step 3. Since we are still refining the algorithm, an updated set of optimal model weights will be released once the final version of the paper is accepted by the journal.
# Assumes `model`, `device`, and the `convert_audio` helper are already
# available (convert_audio resamples the waveform to 24 kHz mono).
import torchaudio

wav, sr = torchaudio.load(audio_path)
wav = wav.to(device)
wav = convert_audio(wav, sr, 24000, 1).unsqueeze(0).unsqueeze(0)
# Generate the discrete codes
_, _, _, _, quant, _, index = model.encode(wav)
# Recover quant from the indices alone
quant = model.quantize.indices_to_codes(index)
# Reconstruct the Mel-spectrogram and audio from the quantized representation
reconstructed_mel, reconstructed_audios = model.decode(quant)
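Putting the pieces together, the sketch below shows one way to run the full loop: load a waveform, encode it, sanity-check the advertised 24 tokens-per-second rate, and write the reconstruction to disk. The checkpoint-loading call (`HHCodec.load_from_checkpoint`) and the output tensor shapes are assumptions, not confirmed APIs; adapt them to the repository's actual entry points.

```python
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumption: a Lightning-style checkpoint loader; the real entry point may differ.
model = HHCodec.load_from_checkpoint("path/to/checkpoint.ckpt").to(device).eval()

wav, sr = torchaudio.load("sample.wav")
wav = convert_audio(wav.to(device), sr, 24000, 1).unsqueeze(0).unsqueeze(0)

with torch.no_grad():
    _, _, _, _, quant, _, index = model.encode(wav)
    reconstructed_mel, reconstructed_audios = model.decode(quant)

# Sanity check: roughly 24 discrete tokens per second of 24 kHz audio
duration_s = wav.shape[-1] / 24000
print(f"tokens/sec ≈ {index.numel() / duration_s:.1f}")

# Assumption on output shape: (batch, channels, samples); adjust if needed.
torchaudio.save("reconstructed.wav", reconstructed_audios.squeeze(0).cpu(), 24000)
```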
@article{xue2025hh,
title={HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling},
author={Xue, Rongkun and Niu, Yazhe and Hu, Shuai and Yin, Zixin and Yao, Yongqiang and Yang, Jing},
journal={arXiv preprint arXiv:2507.18897},
year={2025}
}
This project was developed in part on the basis of the following pioneering GitHub repositories. We express our profound gratitude for these foundational resources:
All code within this repository is licensed under the Apache License 2.0.