Description
Feature request
According to this notebook from the original bark repo ("Advanced Long-Form Generation"):

> Sometimes Bark will hallucinate a little extra audio at the end of the prompt. We can solve this issue by lowering the threshold for bark to stop generating text. We use the `min_eos_p` kwarg in `generate_text_semantic`.
This relies on an early-stopping strategy that has yet to be implemented in the `transformers` implementation of Bark. `min_eos_p` is used during generation with the first sub-model, i.e. `BarkSemanticModel.generate`. It would be great to add this feature to improve Bark generation.
Motivation
Sometimes the generated speech has weird artefacts at the end, due to this missing feature.
Your contribution
Some pointers:
- Where `min_eos_p` is used in the original code: here.
- In the HF implementation, the semantic model is called during generation here, which then calls `BarkSemanticModel.generate` here.
- Ideally, we'd add a custom stopping criteria, but as indicated in custom stopping_critriea function doesn't receive logits scores (receives None instead) #23674, it's not yet possible to use `scores` without setting `return_dict_in_generate=True, output_scores=True`.
- Instead, let's add here a logits processor which sets the probability of every token other than the EOS token to `-inf` whenever the probability of the EOS token exceeds `min_eos_p`.
`min_eos_p` should default to `None` (to keep backward compatibility) in `BarkSemanticGenerationConfig` here, and could then be passed through `BarkModel.generate` kwargs without causing issues!
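To make the proposed logits processor concrete, here is a minimal sketch of the idea. The class name and exact masking behavior are assumptions, not the final API; in `transformers` it would subclass `LogitsProcessor` and be wired into `BarkSemanticModel.generate` when `min_eos_p` is set:

```python
import torch


class EosPrioritizerLogitsProcessor:
    """Hypothetical sketch: once the EOS token's probability reaches
    min_eos_p, mask every other token to -inf so generation stops.
    In transformers this would subclass transformers.LogitsProcessor."""

    def __init__(self, eos_token_id: int, min_eos_p: float):
        self.eos_token_id = eos_token_id
        self.min_eos_p = min_eos_p

    def __call__(self, input_ids, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Probability of the EOS token for each batch item
        probs = torch.softmax(scores, dim=-1)
        force_eos = probs[:, self.eos_token_id] >= self.min_eos_p
        if force_eos.any():
            # Keep the EOS logit, set all other logits to -inf for the
            # batch items whose EOS probability crossed the threshold
            masked = torch.full_like(scores, float("-inf"))
            masked[:, self.eos_token_id] = scores[:, self.eos_token_id]
            scores = torch.where(force_eos.unsqueeze(-1), masked, scores)
        return scores
```

With `min_eos_p=None` the processor would simply not be instantiated, preserving the current behavior.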