
Fix: SpeechToTextTool returns incoherent transcripts #1480


Open
Martin-Labenne wants to merge 6 commits into main

Conversation

Martin-Labenne

This PR improves SpeechToTextTool to ensure it works out of the box as expected, addressing issue #1478.

At a high level:

  • Explicitly passes the correct Whisper sampling_rate to the pre-processor and ensures the attention_mask is properly generated and then passed through to the model’s forward method, avoiding unreliable behavior during inference.
  • Introduces explicit language control when instantiating the class (e.g. language='en'), while preserving auto-detection when language is omitted or None.
  • Ensures audio inputs are resampled to the expected model sampling rate to avoid silent transcription failures.
  • Adds support for longer audio files so that long recordings are transcribed in full (a minimal sketch of these changes follows this list).
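For reference, here is a minimal sketch of the behaviour described above (not the PR’s actual diff). It assumes a Whisper checkpoint such as openai/whisper-base, torchaudio for loading and resampling, and a recent transformers version; the `transcribe` helper name is illustrative only.

```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

TARGET_SR = 16_000  # Whisper models expect 16 kHz audio

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")


def transcribe(audio_path: str, language: str | None = None) -> str:
    waveform, sr = torchaudio.load(audio_path)
    waveform = waveform.mean(dim=0)  # collapse stereo to mono

    # Resample to the rate the feature extractor expects.
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(
            waveform, orig_freq=sr, new_freq=TARGET_SR
        )

    # Pass sampling_rate explicitly and request an attention mask;
    # truncation=False / padding="longest" keeps audio longer than 30 s intact.
    inputs = processor(
        waveform.numpy(),
        sampling_rate=TARGET_SR,
        return_tensors="pt",
        return_attention_mask=True,
        truncation=False,
        padding="longest",
    )

    generate_kwargs = {
        "attention_mask": inputs.attention_mask,
        # Required by recent transformers versions for long-form generation.
        "return_timestamps": True,
    }
    if language is not None:
        generate_kwargs["language"] = language  # e.g. "en"; omit for auto-detection

    predicted_ids = model.generate(inputs.input_features, **generate_kwargs)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

With this setup, transcribe("long_recording.wav", language="en") forces English decoding, while transcribe("clip.wav") falls back to Whisper’s language auto-detection.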

Note: This is my first contribution to an open-source project; I'm excited to be part of it and open to any feedback or suggestions. Thanks a lot!
