
Fix: SpeechToTextTool returns incoherent transcripts #1480


Open
Martin-Labenne wants to merge 6 commits into main

Conversation

Martin-Labenne

This PR improves SpeechToTextTool to ensure it works out of the box as expected, addressing issue #1478.

At a high level:

  • Explicitly passes the correct Whisper sampling_rate to the pre-processor and ensures the attention_mask is properly generated and then passed through to the model’s forward method, avoiding unreliable behavior during inference.
  • Introduces explicit language control when instantiating the class (e.g. language='en'), while preserving auto-detection when language is omitted or None.
  • Ensures audio inputs are resampled to the expected model sampling rate to avoid silent transcription failures.
  • Adds support for longer audio files so that long recordings are transcribed in full (a minimal sketch of these changes follows this list).
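For reference, here is a minimal sketch of the behaviour described above (not the PR’s actual diff). It assumes a Whisper checkpoint such as openai/whisper-base, torchaudio for loading and resampling, and a recent transformers version; the `transcribe` helper name is illustrative only.

```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

TARGET_SR = 16_000  # Whisper models expect 16 kHz audio

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")


def transcribe(audio_path: str, language: str | None = None) -> str:
    waveform, sr = torchaudio.load(audio_path)
    waveform = waveform.mean(dim=0)  # collapse stereo to mono

    # Resample to the rate the feature extractor expects.
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(
            waveform, orig_freq=sr, new_freq=TARGET_SR
        )

    # Pass sampling_rate explicitly and request an attention mask;
    # truncation=False / padding="longest" keeps audio longer than 30 s intact.
    inputs = processor(
        waveform.numpy(),
        sampling_rate=TARGET_SR,
        return_tensors="pt",
        return_attention_mask=True,
        truncation=False,
        padding="longest",
    )

    generate_kwargs = {
        "attention_mask": inputs.attention_mask,
        # Required by recent transformers versions for long-form generation.
        "return_timestamps": True,
    }
    if language is not None:
        generate_kwargs["language"] = language  # e.g. "en"; omit for auto-detection

    predicted_ids = model.generate(inputs.input_features, **generate_kwargs)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

With this setup, transcribe("long_recording.wav", language="en") forces English decoding, while transcribe("clip.wav") falls back to Whisper’s language auto-detection.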

Note: This is my first contribution to an open-source project; I'm excited to be part of it and open to any feedback or suggestions. Thanks a lot!
